Dynamic programming for fun and profit

Imagine that you decide to make an investment. You have 1720 currency units (CU) at your disposal and the investment options that are available are:

Option#  min amount  max amount  interest  sign up bonus
1        50          99          1%        0.5
2        100         299         1.1%      1
3        300         499         1.6%      2.2
4        500         999         1.8%      2.5
5        1000        1999        2%        4

You may use multiple investment options, each one possibly multiple times. Each option accepts a minimum and maximum amount of CUs, and provides an interest on that amount plus an additional bonus amount which does not depend on the invested amount. How will you invest your money? Take your time and continue reading when ready...

Let's see how you did! If you invested all 1720 CU in option 5 then... you chose poorly. Your profit is going to be 38.4 CU, which is not bad, but you can do better. If you invested 1020 in option 5 and 700 in option 4, that's a bit better at 39.5 CU, but still not optimal. What about 1020 in option 5, 500 in option 4 and 200 in option 2? That's actually a bit worse at 39.1 CU! I'll spare you the agony and reveal the best solution: 1020 in option 5, 300 in option 3 twice, and 100 in option 2, giving a profit of 40.5 CU. This is about 5.5% better than the straightforward solution of assigning everything to option 5.
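
If you want to check the arithmetic yourself, here is a minimal Python sketch of the profit calculation (the numbers come straight from the table above; the OPTIONS and profit names are illustrative only):

# Each option: (min_amount, max_amount, interest_rate, signup_bonus).
OPTIONS = {
    1: (50, 99, 0.010, 0.5),
    2: (100, 299, 0.011, 1.0),
    3: (300, 499, 0.016, 2.2),
    4: (500, 999, 0.018, 2.5),
    5: (1000, 1999, 0.020, 4.0),
}

def profit(portfolio):
    """Profit of a portfolio given as pairs of (option, invested amount)."""
    total = 0.0
    for option, amount in portfolio:
        lo, hi, rate, bonus = OPTIONS[option]
        assert lo <= amount <= hi, "amount outside the option's accepted range"
        total += amount * rate + bonus
    return total

print(profit([(5, 1720)]))                                # 38.4
print(profit([(5, 1020), (3, 300), (3, 300), (2, 100)]))  # 40.5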

As this short exercise has hopefully demonstrated, finding the optimal portfolio of options is far from trivial. Is there a way to programmatically find the optimal solution?

The road to dynamic programming

First, let's define what we are trying to find. A portfolio is a multiset of (o, n) pairs, each pair representing that we are assigning n CU to option o. For a portfolio to be acceptable, all of its pairs need to be valid and the total sum of all pairs must not exceed the total amount of money we want to use. We are trying to find the portfolio which provides the maximum profit.

Brute force

To understand the solution space a bit better let's devise a way to enumerate all the portfolios, which is in fact not trivial.

One way to think about this is that to produce a portfolio we first have to split our total amount in some way, and then assign an option to each element of the split. Each such split is formally called a partition of n. The number of partitions of a number n is given by the partition function p(n), which grows spectacularly quickly. For each such partition we also need to factor in all the different acceptable option assignments.

Another way to think about this, which turns out to be more helpful, is with a recursive, constructive view of the problem. Let P_n be the set of portfolios for n total CU. Within this set, the subset of portfolios that use option o with k CU consists of all portfolios for n − k total CU combined with the pair (o, k). Expressing this with set notation we have (⊎ is the multiset sum operator, which in our case is used to add items to the multiset):

$$P_n(o,k) = \{\, Q \uplus \{(o,k)\} \mid Q \in P_{n-k} \,\}$$

The whole set P_n is the union of all the P_n(o,k) sets for all valid combinations of o and k:

$$O_k = \{\, \text{options that accept } k \,\}$$
$$P_n = \bigcup_{1 \le k \le n} \, \bigcup_{o \in O_k} P_n(o,k)$$

Using this recursive set description of the solution space we can write a program to enumerate all the solutions for a particular amount and set of options. This description provides a straightforward, albeit rather inefficient, way to solve this problem: by brute force — evaluate all solutions and use the best. In pseudocode this would look like:

solve_brute_force(n):
    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                for s in (solve_brute_force(n-i) ∪ {∅}):
                    solutions = solutions ∪ {s ⊎ {(o,i)}}

    return solutions

solve_driver(n):
    return argmax(solve_brute_force(n), profit)
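
For reference, here is a rough Python rendering of the brute-force pseudocode above, reusing the OPTIONS table and profit function from the earlier sketch. It is only a sketch: the enumeration is exponential, so it is practical just for small amounts.

def accepts(option, amount):
    lo, hi, _, _ = OPTIONS[option]
    return lo <= amount <= hi

def solve_brute_force(n):
    """All portfolios for n CU, each represented as a sorted tuple of pairs."""
    solutions = set()
    for i in range(1, n + 1):
        for o in OPTIONS:
            if accepts(o, i):
                # Combine (o, i) with every portfolio for the remaining
                # n - i CU, including the empty portfolio.
                for s in solve_brute_force(n - i) | {()}:
                    solutions.add(tuple(sorted(s + ((o, i),))))
    return solutions

def solve_driver(n):
    return max(solve_brute_force(n), key=profit, default=())

print(solve_driver(170))  # feasible for small amounts only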

Optimal substructure

Let's see if this problem has any interesting properties we can take advantage of to find the optimal solution more efficiently.

If we look at P_n(o,k) we see that we are combining (o, k) with all the portfolios for n − k, and we ultimately check all these combinations to find the best solution. This is wasteful, however, since the choice of (o, k) doesn't affect the profit of any of the members of P_{n−k}. To put it more formally, the profit function is an additive multiset function:

$$\mathrm{Profit}(Q \uplus \{(o,k)\}) = \mathrm{Profit}(Q) + \mathrm{Profit}((o,k)), \quad Q \in P_{n-k}$$

We can thus get the best total value in P_n(o,k) by combining (o, k) with one of the members of P_{n−k} that has the highest profit. Any other choice from P_{n−k} would yield a lower total profit. We have thus shown that our problem exhibits the optimal substructure property: an optimal solution includes optimal solutions to subproblems.

Armed with this revelation we can now produce a much more efficient formulation:

$$P_n(o,k) = Q \uplus \{(o,k)\}, \quad Q \text{ is any one element of } \arg\max_{S \in P_{n-k}} \mathrm{Profit}(S)$$

This leads us to the following pseudocode:

solve_optimal_substructure(n):
    solutions = {∅}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪
                            {solve_optimal_substructure(n-i) ⊎
                             {(o,i)}}

    return argmax(solutions, profit)

solve_driver(n):
    return solve_optimal_substructure(n)

Overlapping subproblems

Another interesting thing to note is that to solve for n we need to solve for n − 1, ..., 1. Similarly, to solve for n − 1 we need to solve for n − 2, ..., 1, and so on. We are solving the same problems over and over again! Our problem thus exhibits the property of overlapping subproblems. We can take advantage of this property by storing results in a cache and reusing them instead of recalculating them.

Updating our original pseudocode to use a cache we have:

solve_overlapping_subproblems(n):
    if n in CACHE: return CACHE[n]

    solutions = {}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                for s in (solve_overlapping_subproblems(n-i) ∪ {∅}):
                    solutions = solutions ∪ {s ⊎ {(o,i)}}

    CACHE[n] = solutions
    return CACHE[n]

solve_driver(n):
    return argmax(solve_overlapping_subproblems(n), profit)

Dynamic programming

If we combine both optimizations, taking advantage of both the optimal substructure property and the overlapping subproblems property, we reach the dynamic programming solution:

solve_dp_recursive(n):
    if n in CACHE: return CACHE[n]
    solutions = {∅}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪
                            {solve_dp_recursive(n-i) ⊎ {(o,i)}}

    CACHE[n] = argmax(solutions, profit)
    return CACHE[n]

solve_driver(n):
    return solve_dp_recursive(n)
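
A possible Python rendering of the recursive dynamic programming solution, reusing OPTIONS, accepts and profit from the earlier sketches; functools.lru_cache plays the role of CACHE, and each sub-portfolio's profit is carried along to avoid recomputing it:

from functools import lru_cache

@lru_cache(maxsize=None)
def solve_dp_recursive(n):
    """Return (a best portfolio for n CU, its profit); results are cached per n."""
    best, best_profit = (), 0.0  # investing nothing is always acceptable
    for i in range(1, n + 1):
        for o in OPTIONS:
            if accepts(o, i):
                sub, sub_profit = solve_dp_recursive(n - i)
                _, _, rate, bonus = OPTIONS[o]
                cand_profit = sub_profit + i * rate + bonus
                if cand_profit > best_profit:
                    best, best_profit = sub + ((o, i),), cand_profit
    return best, best_profit

print(solve_dp_recursive(1720))  # the 40.5 CU optimum from the introduction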

Alternatively, we can express this without recursion:

solve_dp_iterative(n):
    solutions = {∅}

    for i in [1..n]:
        for o in OPTIONS:
            if o accepts i:
                solutions = solutions ∪ {CACHE[n-i] ⊎ {(o,i)}}

    return argmax(solutions, profit)

solve_driver(n):
    CACHE[0] = ∅
    for i in [1..n]:
        CACHE[i] = solve_dp_iterative(i)
    return CACHE[n]
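
The bottom-up variant looks very similar in Python (again a sketch reusing the earlier definitions); the cache is filled from 0 upwards, so the entry for m − i is always available when needed:

def solve_dp_iterative(n):
    """Bottom-up version: CACHE[m] holds (a best portfolio for m CU, its profit)."""
    CACHE = {0: ((), 0.0)}
    for m in range(1, n + 1):
        best, best_profit = (), 0.0
        for i in range(1, m + 1):
            for o in OPTIONS:
                if accepts(o, i):
                    sub, sub_profit = CACHE[m - i]
                    _, _, rate, bonus = OPTIONS[o]
                    cand_profit = sub_profit + i * rate + bonus
                    if cand_profit > best_profit:
                        best, best_profit = sub + ((o, i),), cand_profit
        CACHE[m] = (best, best_profit)
    return CACHE[n]

print(solve_dp_iterative(1720))  # the same 40.5 CU optimum, without recursion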

Scaling down

Instances of this problem with a large n can still be prohibitively expensive to solve, even when using the dynamic programming approach I just described. Not all is lost, however, if we are willing to relax our requirement for a guaranteed optimal solution.

One approach to cutting down our solution time, while still producing a possibly good solution, is to scale down our problem. We do this by dividing the min, max and bonus values of all options, as well as the target amount we are solving for, by a particular scale factor. Another way to view this is that scaling down by scale corresponds to running the algorithms with a minimum step of scale CU instead of 1 CU.

For example if we scaled down our original problem by 10 we would solve for 172 (= 1720 / 10) and use the following scaled down options:

Option#  min amount  max amount  interest  sign up bonus
1        5           9           1%        0.05
2        10          29          1.1%      0.1
3        30          49          1.6%      0.22
4        50          99          1.8%      0.25
5        100         199         2%        0.4

When we find the optimal solution we can scale it back up using the same scale factor.

The expectation is that the optimal solution for the scaled down version of our problem is going to be in the proximity of the optimal solution for the original version, and assuming that the Profit function is generally smooth, the profit is also going to be near the optimal one.
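
For illustration, here is a small helper that scales the OPTIONS table from the first sketch down by a given factor. Rounding minimum amounts up and maximum amounts down guarantees that every scaled solution maps back to a valid unscaled one, at the cost of possibly excluding a few borderline allocations:

def scale_down_options(options, scale):
    """Return a copy of the options table with amounts and bonuses divided by scale."""
    scaled = {}
    for o, (lo, hi, rate, bonus) in options.items():
        scaled[o] = (
            -(-lo // scale),  # ceiling division for the minimum amount
            hi // scale,      # floor division for the maximum amount
            rate,             # the interest rate is scale-free
            bonus / scale,    # bonuses scale like the amounts
        )
    return scaled

# Scaling by 10 reproduces the table above; solving for 1720 // 10 = 172 with the
# scaled options and multiplying the resulting amounts and profit by 10 gives an
# approximate solution to the original problem.
SCALED_OPTIONS = scale_down_options(OPTIONS, 10)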

Getting greedy

Another approach to solving this problem faster is to explore whether there is a greedy solution for it. Below I will describe an approach that works in many cases, but is not guaranteed to provide the optimal solution.

Taking inspiration from the greedy solution to the fractional knapsack problem, at each step we greedily select to put as many CUs as possible in the option with the best marginal profit. The key observation here is that in our case each option has two different marginal profits. The first one involves investing in a new instance of an option using its minimum acceptable amount. The second one involves investing more in an instance of an option we have already invested in. In the first case the sign up bonus kicks in and increases the marginal profit. In the second case the marginal profit is simply the interest.
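
To make these two marginal profits concrete, write m, r and b for an option's minimum amount, interest rate and sign up bonus. Opening a new instance at the minimum amount yields

$$\frac{m \cdot r + b}{m} = r + \frac{b}{m}$$

per invested CU, while adding to an existing instance yields just r. For option 3, for example, 0.016 + 2.2/300 ≈ 2.33%, which is the value listed in the table below.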

For our original scenario we have:

Option#  min amount  max amount  min marginal profit  add marginal profit
1        50          99          2%                   1%
2        100         299         2.1%                 1.1%
3        300         499         2.3333%              1.6%
4        500         999         2.3%                 1.8%
5        1000        1999        2.4%                 2%

For a total amount of 1720 this method works flawlessly. We first select to add (o₅, 1000) to our portfolio for a marginal profit of 2.4%. From the remaining 720 we add (o₃, 300) for a marginal profit of 2.333%. From the remaining 420 we again choose (o₃, 300). We now have 120 left, for which we choose (o₂, 100), and the final 20 we add to the (o₅, 1000) instance we already have.

Unfortunately, this method doesn't find the optimal solution in all cases. Take for example a total amount of 500. The greedy algorithm chooses (o₃, 300), (o₂, 100), (o₂, 100) for a total profit of 11.2. Alas, the optimal solution is simply (o₄, 500) for a total profit of 11.5.
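
Here is a rough Python sketch of this greedy procedure (a direct translation of the description above, reusing the OPTIONS table and the profit helper from the earlier sketches; not the code from the repository linked below). New instances are opened at their minimum amount, since beyond the minimum the marginal profit drops to the plain interest rate:

def solve_greedy(n):
    """Greedy heuristic: repeatedly take the best available marginal profit."""
    portfolio = []  # opened instances as mutable [option, amount] pairs
    remaining = n
    while remaining > 0:
        candidates = []
        # Opening a new instance of option o at its minimum amount lo:
        # the marginal profit is (lo * rate + bonus) / lo.
        for o, (lo, hi, rate, bonus) in OPTIONS.items():
            if lo <= remaining:
                candidates.append(((lo * rate + bonus) / lo, "new", o, lo))
        # Topping up an instance we have already opened: the marginal profit
        # is just the interest rate.
        for idx, (o, amount) in enumerate(portfolio):
            _, hi, rate, _ = OPTIONS[o]
            room = hi - amount
            if room > 0:
                candidates.append((rate, "add", idx, min(remaining, room)))
        if not candidates:
            break  # nothing can accept the leftover amount
        _, kind, target, amount = max(candidates)
        if kind == "new":
            portfolio.append([target, amount])
        else:
            portfolio[target][1] += amount
        remaining -= amount
    return [tuple(p) for p in portfolio]

# profit(solve_greedy(1720)) gives the optimal 40.5, while solve_greedy(500)
# reproduces the suboptimal 11.2 CU result discussed above.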

Perhaps there is some other greedy approach that provides optimal solutions in all cases. Still, the described greedy approach is useful if we want to quickly get a possibly good solution. We could also try both the greedy approach and the scaling approach to raise our confidence level in the generated solution.

Can we do better?

We have mostly focused on solving our investment problem using a dynamic programming approach. Taking a step back, we may ask if this is indeed the best way to get an optimal solution. Can we approach this problem from a completely different angle to improve performance?

It turns out that the answer is yes! With some creativity our problem can be expressed as an integer linear program (ILP), for which we have algorithms that work quite efficiently in practice. I plan to present this approach in more detail in a future post.

An ounce of action...

If you want to try out the ideas mentioned in this post, I have created a program implementing all the algorithms and variations. You can find it at: https://github.com/afrantzis/invest.

Enjoy!

Metrics for test suite comprehensiveness

In a previous post I discussed a few FOSS-specific mentalities and practices that I believe play a role in discouraging adoption of comprehensive automated testing in FOSS. One of the points that came up in discussions is whether the basic premise of the post, namely that FOSS projects don't typically employ comprehensive automated testing (often having no tests at all), is actually true. That's a valid concern, given that the post was motivated by my long-term observations working on and with FOSS, and didn't provide any further data. In this post I will try to address this concern.

The main question is how we can measure the comprehensiveness of a test suite. Code coverage is the standard metric used in the industry and makes intuitive sense. However, it presents some difficulties for large scale surveys, since it's not computationally cheap to produce and often requires per project changes or arrangements.

I would like to propose and explore two alternative metrics that are easier to produce, and are therefore better suited to large scale surveys.

The first metric is the test commit ratio of the codebase — the number of commits that affect test code as a percentage of all commits. Ideally, every change that adds a feature or fixes a bug in the production code should be accompanied by a corresponding change in the test code. The more we depart from this ideal, and, hence, the less often we update the test code, the less comprehensive our test suite tends to be. This metric is affected by the project's particular commit practices, so some amount of variance is expected even between projects considered to be well tested.
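
As a rough illustration of how this ratio can be computed for a local repository, here is a simplified sketch that uses the same "test in the file path" idea as the git-test-annotate tool described in the Tools and Data section below (it is not that tool, just an approximation):

import subprocess

def test_commit_ratio(repo_path):
    """Fraction of mainline commits that touch at least one file with "test" in its path."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--first-parent",
         "--name-only", "--format=%x00%H"],
        capture_output=True, text=True, check=True).stdout
    commits = [c for c in log.split("\x00") if c.strip()]
    touching_tests = sum(
        1 for c in commits
        if any("test" in path.lower() for path in c.splitlines()[1:]))
    return touching_tests / len(commits) if commits else 0.0

print(f"{test_commit_ratio('.'):.1%}")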

The second metric is the test code size ratio of the codebase — the size of the test code as a percentage of the size of all the code. It makes sense intuitively that, typically, more test code will be able to test more production code. That being said, the size itself does not provide the whole picture. Depending on the project, a compact test suite may be adequately comprehensive, or, conversely, large test data files may skew this metric.
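
And a similarly simplified sketch for the second metric, based on the sizes of tracked files whose path contains "test" (the actual tool additionally ignores files that are not testable sources, such as non-text files and translations):

import os
import subprocess

def test_code_size_ratio(repo_path):
    """Size of files with "test" in their path as a fraction of all tracked file sizes."""
    files = subprocess.run(
        ["git", "-C", repo_path, "ls-files"],
        capture_output=True, text=True, check=True).stdout.splitlines()
    sizes = {f: os.path.getsize(os.path.join(repo_path, f))
             for f in files if os.path.isfile(os.path.join(repo_path, f))}
    total = sum(sizes.values())
    tests = sum(size for f, size in sizes.items() if "test" in f.lower())
    return tests / total if total else 0.0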

Neither of these metrics is failproof, but my hypothesis is that when combined and aggregated over many projects they can provide a good indication about the overall trend of the comprehensiveness of test suites, which is the main goal of this post.

Let's see what these metrics give us for FOSS projects. I chose two software suites that are considered quite prominent in the FOSS world, namely GNOME and KDE, which together consist of over 1500 projects.

A small number of these projects are not good candidates for this survey, because, for example, they are empty, or are pure documentation. Although they are included in the survey, their count is low enough to not affect the overall trend reported below.

Here is the distribution of the percentages of commits affecting test code in GNOME and KDE projects:

[Graph: kde-gnome-commits, distribution of test commit ratios in GNOME and KDE projects]

Here is the distribution of the percentages of test code size in GNOME and KDE projects:

[Graph: kde-gnome-sizes, distribution of test code size ratios in GNOME and KDE projects]

The first thing to notice is the tall lines in the left part of both graphs. For the second graph this shows that a very large percentage of the projects, roughly 55%, have either no tests at all, or so few as to be practically non-existent. Another interesting observation is the position of the 80th percentile lines, which show that 80% of the projects have test commit ratios of less than 11.2%, and test code size ratios of less than 8.8%.

In other words, out of ten commits that change the code base, only about one (or fewer) touches the tests in the majority of the projects. Although this doesn't constitute indisputable proof that tests are not comprehensive, it is nevertheless a big red flag, especially when combined with low test code size percentages. Each project may have different needs and development patterns, and these numbers need to be interpreted with care, but as a general trend this is not encouraging.

On the bright side, there are some projects with higher values in this distribution. It's no surprise that this set consists mainly of core libraries from these software suites, but does not include many end-user applications.

Going off on a slight tangent, one may argue that the distribution is unfairly skewed since many of these projects are GUI applications which, according to conventional wisdom, are not easy to test. However, this argument fails on multiple fronts. First, it's not unfair to include these programs, because we expect no less of them in terms of quality compared to other types of programs. They don't get a free pass because they have a GUI. In addition, being a GUI program is not a valid excuse for inadequate testing, because although testing the UI itself, or the functionality through the UI, may not be easy, there is typically a lot more we can test. Programs provide some core domain functionality, which we should be able to test independently if we decouple our core domain logic from the UI, often by using a different architecture, for example, the Hexagonal Architecture.

After having seen some general trends, let's see some examples of individual codebases that do better in these metrics:

[Graph: misc-test-stats, test commit and test code size ratios for a selection of individual projects]

This graph displays quite a diverse collection of projects including a database, graphics libraries, a GUI toolkit, a display compositor, system tools and even a GUI application. These projects are considered to be relatively well tested, each in its own particular way, so these higher numbers provide some evidence that the metrics correlate with test comprehensiveness.

If we accept this correlation, this collection also shows that we can achieve more comprehensive testing in a wide variety of projects. Depending on project, the trade-offs we need to make will differ, but it is almost always possible to do well in this area.

The interpretation of individual results varies with the project, but, in general, I have found that the test commit ratio is typically a more reliable indicator of test comprehensiveness, since it's less sensitive to test specifics compared to test code size ratio.

Tools and Data

In order to produce the data, I developed the git-test-annotate program which produces a list of files and commits from a git repository and marks them as related to testing or not.

git-test-annotate decides whether a file is a test file by checking for the string "test" anywhere in the file's path within the repository. This is a very simple heuristic, but it works surprisingly well. In order to make the test code size calculation more meaningful, the tool ignores files that are not typically considered testable sources, for example, non-text files and translations, both in the production and the test code.

For commit annotations, only mainline commits are taken into account. To check whether a commit affects the tests, the tool checks if it changes at least one file with "test" in its path.

To get the stats for KDE and GNOME I downloaded all their projects from their github organizations/mirrors and ran the git-test-annotate tool on each project. All the annotated data and a python script to process them are available in the foss-test-annotations repository.

Epilogue

I hope this post has provided some useful information about the utility of the proposed metrics, and some evidence that there is ample room for improvement in automated testing of FOSS projects. It would certainly be interesting to perform a more rigorous investigation to evaluate how well these metrics correlate with code coverage.

Before closing, I would like to mention that there are cases where projects are primarily tested through external test suites. Examples in the FOSS world are the piglit suite for Mesa, and the various test suites for the Linux kernel. In such cases, project test comprehensiveness metrics don't provide the complete picture, since they don't take into account the external tests. These metrics are still useful though, because external suites typically perform functional or conformance testing only, and the metrics can provide information about internal testing, for example unit testing, done within the projects themselves.

On the low adoption of automated testing in FOSS

A few times in the recent past I've been in the unfortunate position of using a prominent Free and Open Source Software (FOSS) program or library, and running into issues of such fundamental nature that made me wonder how those issues even made it into a release.

In all cases, the answer came quickly when I realized that, invariably, the project involved either didn't have a test suite, or, if it did have one, it was not adequately comprehensive.

I am using the term comprehensive in a very practical, non-extreme way. I understand that it's often not feasible to test every possible scenario and interaction, but, at the very least, a decent test suite should ensure that under typical circumstances the code delivers all the functionality it promises to.

For projects of any value and significance, having such a comprehensive automated test suite is nowadays considered a standard software engineering practice. Why, then, don't we see more prominent FOSS projects employing this practice, or, when they do, why is it often employed poorly?

In this post I will highlight some of the reasons that I believe play a role in the low adoption of proper automated testing in FOSS projects, and argue why these reasons may be misguided. I will focus on topics that are especially relevant from a FOSS perspective, omitting considerations which, although important, are not particular to FOSS.

My hope is that by shedding some light on this topic, more FOSS projects will consider employing an automated test suite.

As you can imagine, I am a strong proponent of automating testing, but this doesn't mean I consider it a silver bullet. I do believe, however, that it is an indispensable tool in the software engineering toolbox, which should only be forsaken after careful consideration.

1. Underestimating the cost of bugs

Most FOSS projects, at least those not supported by some commercial entity, don't come with any warranty; it's even stated in the various licenses! The lack of any formal obligations makes it relatively inexpensive, both in terms of time and money, to have the occasional bug in the codebase. This means that there are fewer incentives for the developer to spend extra resources to try to safeguard against bugs. When bugs come up, the developers can decide at their own leisure if and when to fix them and when to release the fixed version. Easy!

At first sight, this may seem like a reasonably pragmatic attitude to have. After all, if fixing bugs is so cheap, is it worth spending extra resources trying to prevent them?

Unfortunately, bugs are only cheap for the developer, not for the users who may depend on the project for important tasks. Users expect the code to work properly and can get frustrated or disappointed if this is not the case, regardless of whether there is any formal warranty. This is even more pronounced when security concerns are involved, for which the cost to users can be devastating.

Of course, lack of formal obligations doesn't mean that there is no driver for quality in FOSS projects. On the contrary, there is an exceptionally strong driver: professional pride. In FOSS projects the developers are in the spotlight and no (decent) developer wants to be associated with a low quality, bug infested codebase. It's just that, due to the mentality stated above, in many FOSS projects the trade-offs developers make seem to favor a reactive rather than proactive attitude.

2. Overtrusting code reviews

One of the development practices FOSS projects employ ardently is code reviews. Code reviews happen naturally in FOSS projects, even in small ones, since most contributors don't have commit access to the code repository and the original author has to approve any contributions. In larger projects there are often more structured procedures which involve sending patches to a mailing list or to a dedicated reviewing platform. Unfortunately, in some projects the trust on code reviews is so great, that other practices, like automated testing, are forsaken.

There is no question that code reviews are one of the best ways to maintain and improve the quality of a codebase. They can help ensure that code is designed properly, it is aligned with the overall architecture and furthers the long term goals of the project. They also help catch bugs, but only some of them, some of the time!

The main problem with code reviews is that we, the reviewers, are only human. We humans are great at creative thought, but we are also great at overlooking things, occasionally filling in the gaps with our own unicorns-and-rainbows inspired reality. Another reason is that we tend to focus more on the code changes at a local level, and less on how the code changes affect the system as a whole. This is not an inherent problem with the process itself but rather a limitation of humans performing the process. When a codebase gets large enough, it's difficult for our brains to keep all the possible states and code paths in mind and check them mentally, even in a codebase that is properly designed.

In theory, the problem of human limitations is offset by the open nature of the code. We even have the so called Linus's law which states that "given enough eyeballs, all bugs are shallow". Note the clever use of the indeterminate term "enough". How many are enough? How about the qualitative aspects of the "eyeballs"?

The reality is that most contributions to big, successful FOSS projects are reviewed on average by a couple of people. Some projects are better, most are worse, but in no case does being FOSS magically lead to a large number of reviewers tirelessly checking code contributions. This limit in the number of reviewers also limits the extent to which code reviews can stand as the only process to ensure quality.

3. It's not in the culture

In order to try out a development process in a project, developers first need to learn about it and be convinced that it will be beneficial. Although there are many resources, like books and articles, arguing in favor of automated tests, the main driver for trying new processes is still learning about them from more experienced developers when working on a project. In the FOSS world this also takes the form of studying what other projects, especially the high-profile ones, are doing.

Since comprehensive automated testing is not the norm in FOSS, this creates a negative network effect. Why should you bother doing automated tests if the high profile projects, which you consider to be role models, don't do it properly or at all?

Thankfully, the culture is beginning to shift, especially in projects using technologies in which automated testing is part of the culture of the technologies themselves. Unfortunately, many of the system-level and middleware FOSS projects are still living in the non automated test world.

4. Tests as an afterthought

Tests as an afterthought is not a situation particular to FOSS projects, but it is especially relevant to them, since the way they spring up and grow can disincentivize the early writing of tests.

Some FOSS projects start as small projects to scratch an itch, without any plans for significant growth or adoption, so the incentives to have tests at this stage are limited.

In addition, many projects, even the ones that start with more lofty adoption goals, follow a "release early, release often" mentality. This mentality has some benefits, but at the early stages also carries the risk of placing the focus exclusively on making the project as relevant to the public as possible, as quickly as possible. From such a perspective, spending the probably limited resources on tests instead of features seems like a bad use of developer time.

As the project grows and becomes more complex, however, more and more opportunities for bugs arise. At this point, some projects realize that adding a test suite would be beneficial for maintaining quality in the long term. Unfortunately, for many projects, it's already too late. The code by now has become test-unfriendly and significant effort is needed to change it.

The final effect is that many projects remain without an automated test suite, or, in the best case, with a poor one.

5. Missing CI infrastructure

Automated testing delivers the most value if it is combined with a CI service that runs the tests automatically for each commit or merge proposal. Until recently, access to such services was difficult to get for a reasonably low effort and cost. Developers either had to set up and host CI themselves, or pay for a commercial service, thus requiring resources which unsponsored FOSS projects were unlikely to be able to afford.

Nowadays, it's far easier to find and use free CI services, with most major code hosting platforms supporting them. Hopefully, with time, this reason will completely cease being a factor in the lack of automated testing adoption.

6. Not the hacker way

The FOSS movement originated from the hacker culture and still has strong ties to it. In the minds of some, the processes around software testing are too enterprise-y, too 9-to-5, perceived as completely contrary to the creative and playful nature of hacking.

My argument against this line of thought is that the hacker ethos values technical excellence very highly, and automated testing, as a tool that helps achieve such excellence, cannot be inconsistent with the hacker way.

Some pseudo-hackers may also argue that their skills are so refined that their code doesn't require testing. When we are talking about a codebase of any significant size, I consider this attitude to be a sign of inexperience and immaturity rather than a testament of superior skills.

Epilogue

I hope this post will serve as a good starting point for a discussion about the reasons which discourage FOSS projects from adopting a comprehensive automated test suite. Identifying both valid concerns and misconceptions is the first step in convincing both fledgling and mature FOSS projects to embrace automated testing, which will hopefully lead to an improvement in the overall quality of FOSS.
