Metrics for test suite comprehensiveness

In a previous post I discussed a few FOSS-specific mentalities and practices that I believe play a role in discouraging adoption of comprehensive automated testing in FOSS. One of the points that came up in discussions is whether the basic premise of the post is actually true: that FOSS projects don't typically employ comprehensive automated testing, and often have no tests at all. That's a valid concern, given that the post was motivated by my long-term observations working on and with FOSS and didn't provide any further data. In this post I will try to address this concern.

The main question is how we can measure the comprehensiveness of a test suite. Code coverage is the standard metric used in the industry and makes intuitive sense. However, it presents some difficulties for large-scale surveys, since it's not computationally cheap to produce and often requires per-project changes or arrangements.

I would like to propose and explore two alternative metrics that are easier to produce, and are therefore better suited to large-scale surveys.

The first metric is the test commit ratio of the codebase — the number of commits that affect test code as a percentage of all commits. Ideally, every change that adds a feature or fixes a bug in the production code should be accompanied by a corresponding change in the test code. The more we depart from this ideal, and, hence, the less often we update the test code, the less comprehensive our test suite tends to be. This metric is affected by the project's particular commit practices, so some amount of variance is expected even between projects considered to be well tested.
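
As a rough illustration, here is a minimal Python sketch of how this ratio could be computed for a local checkout, using plain git and the simple "test"-in-path heuristic described later in the Tools and Data section. It is not the tool used for the survey, and it counts all commits rather than only mainline ones:

    import subprocess

    def test_commit_ratio(repo):
        log = subprocess.run(
            ["git", "-C", repo, "log", "--name-only", "--pretty=format:%x00%H"],
            capture_output=True, text=True, check=True).stdout
        # Each %x00 marker starts a new commit record: first the commit hash,
        # then the paths the commit touches (one per line).
        commits = [c for c in log.split("\x00") if c.strip()]
        touching_tests = 0
        for commit in commits:
            paths = [p for p in commit.splitlines()[1:] if p.strip()]
            if any("test" in p for p in paths):
                touching_tests += 1
        return touching_tests / len(commits) if commits else 0.0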

The second metric is the test code size ratio of the codebase — the size of the test code as a percentage of the size of all the code. It makes sense intuitively that, typically, more test code will be able to test more production code. That being said, the size itself does not provide the whole picture. Depending on the project, a compact test suite may be adequately comprehensive, or, conversely, large test data files may skew this metric.
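
Similarly, here is a minimal sketch of the test code size ratio over a checked-out working tree. The extension whitelist is only an illustrative guess at what counts as testable source, not the filter actually used for the survey:

    import os

    # Extensions treated as testable source; an assumption for illustration only.
    SOURCE_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".hpp", ".py", ".rs",
                         ".js", ".vala", ".cs", ".sh"}

    def test_code_size_ratio(repo):
        test_bytes = total_bytes = 0
        for root, dirs, files in os.walk(repo):
            dirs[:] = [d for d in dirs if d != ".git"]  # skip git metadata
            for name in files:
                if os.path.splitext(name)[1].lower() not in SOURCE_EXTENSIONS:
                    continue  # ignore non-source files (data, translations, ...)
                path = os.path.join(root, name)
                size = os.path.getsize(path)
                total_bytes += size
                if "test" in os.path.relpath(path, repo):
                    test_bytes += size
        return test_bytes / total_bytes if total_bytes else 0.0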

Neither of these metrics is foolproof, but my hypothesis is that, when combined and aggregated over many projects, they can provide a good indication of the overall trend of the comprehensiveness of test suites, which is the main goal of this post.

Let's see what these metrics give us for FOSS projects. I chose two software suites that are considered quite prominent in the FOSS world, namely GNOME and KDE, which together consist of over 1500 projects.

A small number of these projects are not good candidates for this survey, because, for example, they are empty, or are pure documentation. Although they are included in the survey, their count is low enough to not affect the overall trend reported below.

Here is the distribution of the percentages of commits affecting test code in GNOME and KDE projects:

[Figure: kde-gnome-commits — distribution of the percentages of commits affecting test code in GNOME and KDE projects]

Here is the distribution of the percentages of test code size in GNOME and KDE projects:

[Figure: kde-gnome-sizes — distribution of the percentages of test code size in GNOME and KDE projects]

The first thing to notice is the tall lines in the left part of both graphs. For the second graph this shows that a very large percentage of the projects, roughly 55%, have either no tests at all, or so few as to be practically non-existent. Another interesting observation is the position of the 80th percentile lines, which show that 80% of the projects have test commit ratios of less than 11.2%, and test code size ratios of less than 8.8%.

In other words, out of ten commits that change the code base, only about one (or fewer) touches the tests in the majority of the projects. Although this doesn't constitute indisputable proof that tests are not comprehensive, it is nevertheless a big red flag, especially when combined with low test code size percentages. Each project may have different needs and development patterns, and these numbers need to be interpreted with care, but as a general trend this is not encouraging.

On the bright side, there are some projects with higher values in this distribution. It's no surprise that this set consists mainly of core libraries from these software suites, but does not include many end-user applications.

Going off on a slight tangent, one may argue that the distribution is unfairly skewed since many of these projects are GUI applications which, according to conventional wisdom, are not easy to test. However, this argument fails on multiple fronts. First, it's not unfair to include these programs, because we expect no less of them in terms of quality compared to other types of programs. They don't get a free pass because they have a GUI. In addition, being a GUI program is not a valid excuse for inadequate testing, because although testing the UI itself, or the functionality through the UI, may not be easy, there is typically a lot more we can test. Programs provide some core domain functionality, which we should be able to test independently if we decouple our core domain logic from the UI, often with the help of a suitable architecture, for example the Hexagonal Architecture.
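
To illustrate the idea, here is a small, entirely hypothetical sketch of such decoupling: the core logic talks to the outside world only through a narrow port interface, so it can be tested with a trivial in-memory fake, without any GUI toolkit involved. All names are invented for this example:

    class FileSystemPort:
        """Port: the core's only view of the outside world."""
        def list_files(self):
            raise NotImplementedError
        def rename(self, old, new):
            raise NotImplementedError

    class BulkRenamer:
        """Core domain logic: no GUI toolkit, no direct filesystem access."""
        def __init__(self, fs):
            self.fs = fs
        def add_prefix(self, prefix):
            for name in self.fs.list_files():
                self.fs.rename(name, prefix + name)

    # The GUI would be a thin adapter driving BulkRenamer through a real
    # filesystem port; the tests can use a trivial in-memory fake instead.
    class FakeFileSystem(FileSystemPort):
        def __init__(self, files):
            self.files = list(files)
        def list_files(self):
            return list(self.files)
        def rename(self, old, new):
            self.files[self.files.index(old)] = new

    def test_add_prefix():
        fs = FakeFileSystem(["a.txt", "b.txt"])
        BulkRenamer(fs).add_prefix("new_")
        assert fs.files == ["new_a.txt", "new_b.txt"]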

Having seen some general trends, let's look at some examples of individual codebases that do better on these metrics:

[Figure: misc-test-stats — test commit and test code size ratios for a selection of individual projects]

This graph displays quite a diverse collection of projects including a database, graphics libraries, a GUI toolkit, a display compositor, system tools and even a GUI application. These projects are considered to be relatively well tested, each in its own particular way, so these higher numbers provide some evidence that the metrics correlate with test comprehensiveness.

If we accept this correlation, this collection also shows that we can achieve more comprehensive testing in a wide variety of projects. Depending on the project, the trade-offs we need to make will differ, but it is almost always possible to do well in this area.

The interpretation of individual results varies with the project, but, in general, I have found that the test commit ratio is typically a more reliable indicator of test comprehensiveness, since it's less sensitive to test specifics compared to test code size ratio.

Tools and Data

In order to produce the data, I developed the git-test-annotate program which produces a list of files and commits from a git repository and marks them as related to testing or not.

git-test-annotate decides whether a file is a test file by checking for the string "test" anywhere in the file's path within the repository. This is a very simple heuristic, but it works surprisingly well. In order to make the test code size calculation more meaningful, the tool ignores files that are not typically considered testable sources, for example, non-text files and translations, both in the production and the test code.

For commit annotation, only mainline commits are taken into account. To check whether a commit affects the tests, the tool checks whether it changes at least one file with "test" in its path.
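
As an illustration, this annotation could look roughly like the sketch below, which differs from the earlier ratio sketch only in passing --first-parent to git log to restrict the walk to mainline commits. It approximates the behaviour described above; it is not git-test-annotate itself:

    import subprocess

    def annotate_mainline_commits(repo):
        log = subprocess.run(
            ["git", "-C", repo, "log", "--first-parent", "--name-only",
             "--pretty=format:%x00%H"],
            capture_output=True, text=True, check=True).stdout
        annotated = []  # (commit hash, touches-test-code) pairs
        for chunk in log.split("\x00"):
            lines = [l for l in chunk.splitlines() if l.strip()]
            if not lines:
                continue
            sha, paths = lines[0], lines[1:]
            annotated.append((sha, any("test" in p for p in paths)))
        return annotated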

To get the stats for KDE and GNOME I downloaded all their projects from their GitHub organizations/mirrors and ran the git-test-annotate tool on each project. All the annotated data and a Python script to process them are available in the foss-test-annotations repository.

Epilogue

I hope this post has provided some useful information about the utility of the proposed metrics, and some evidence that there is ample room for improvement in automated testing of FOSS projects. It would certainly be interesting to perform a more rigorous investigation to evaluate how well these metrics correlate with code coverage.

Before closing, I would like to mention that there are cases where projects are primarily tested through external test suites. Examples in the FOSS world are the piglit suite for Mesa, and various test suites for the Linux kernel. In such cases, project test comprehensiveness metrics don't provide the complete picture, since they don't take into account the external tests. These metrics are still useful though, because external suites typically perform functional or conformance testing only, and the metrics can provide information about internal testing, for example unit testing, done within the projects themselves.

On the low adoption of automated testing in FOSS

A few times in the recent past I've been in the unfortunate position of using a prominent Free and Open Source Software (FOSS) program or library, and running into issues of such a fundamental nature that they made me wonder how they even made it into a release.

In all cases, the answer came quickly when I realized that, invariably, the project involved either didn't have a test suite, or, if it did have one, it was not adequately comprehensive.

I am using the term comprehensive in a very practical, non-extreme way. I understand that it's often not feasible to test every possible scenario and interaction, but, at the very least, a decent test suite should ensure that under typical circumstances the code delivers all the functionality it promises to.

For projects of any value and significance, having such a comprehensive automated test suite is nowadays considered a standard software engineering practice. Why, then, don't we see more prominent FOSS projects employing this practice, or, when they do, why is it often employed poorly?

In this post I will highlight some of the reasons that I believe play a role in the low adoption of proper automated testing in FOSS projects, and argue why these reasons may be misguided. I will focus on topics that are especially relevant from a FOSS perspective, omitting considerations which, although important, are not particular to FOSS.

My hope is that by shedding some light on this topic, more FOSS projects will consider employing an automated test suite.

As you can imagine, I am a strong proponent of automated testing, but this doesn't mean I consider it a silver bullet. I do believe, however, that it is an indispensable tool in the software engineering toolbox, which should only be forsaken after careful consideration.

1. Underestimating the cost of bugs

Most FOSS projects, at least those not supported by some commercial entity, don't come with any warranty; it's even stated in the various licenses! The lack of any formal obligations makes it relatively inexpensive, both in terms of time and money, to have the occasional bug in the codebase. This means that there are fewer incentives for the developer to spend extra resources to try to safeguard against bugs. When bugs come up, the developers can decide at their own leisure if and when to fix them and when to release the fixed version. Easy!

At first sight, this may seem like a reasonably pragmatic attitude to have. After all, if fixing bugs is so cheap, is it worth spending extra resources trying to prevent them?

Unfortunately, bugs are only cheap for the developer, not for the users who may depend on the project for important tasks. Users expect the code to work properly and can get frustrated or disappointed if this is not the case, regardless of whether there is any formal warranty. This is even more pronounced when security concerns are involved, for which the cost to users can be devastating.

Of course, lack of formal obligations doesn't mean that there is no driver for quality in FOSS projects. On the contrary, there is an exceptionally strong driver: professional pride. In FOSS projects the developers are in the spotlight and no (decent) developer wants to be associated with a low-quality, bug-infested codebase. It's just that, due to the mentality stated above, in many FOSS projects the trade-offs developers make seem to favor a reactive rather than a proactive attitude.

2. Overtrusting code reviews

One of the development practices FOSS projects employ ardently is code review. Code reviews happen naturally in FOSS projects, even in small ones, since most contributors don't have commit access to the code repository and the original author has to approve any contributions. In larger projects there are often more structured procedures which involve sending patches to a mailing list or to a dedicated reviewing platform. Unfortunately, in some projects the trust in code reviews is so great that other practices, like automated testing, are forsaken.

There is no question that code reviews are one of the best ways to maintain and improve the quality of a codebase. They can help ensure that code is designed properly, is aligned with the overall architecture, and furthers the long-term goals of the project. They also help catch bugs, but only some of them, some of the time!

The main problem with code reviews is that we, the reviewers, are only human. We humans are great at creative thought, but we are also great at overlooking things, occasionally filling in the gaps with our own unicorns-and-rainbows inspired reality. We also tend to focus more on the code changes at a local level, and less on how they affect the system as a whole. This is not an inherent problem with the process itself but rather a limitation of the humans performing it. When a codebase gets large enough, it's difficult for our brains to keep all the possible states and code paths in mind and check them mentally, even in a codebase that is properly designed.

In theory, the problem of human limitations is offset by the open nature of the code. We even have the so-called Linus's law, which states that "given enough eyeballs, all bugs are shallow". Note the clever use of the indeterminate term "enough". How many are enough? And what about the qualitative aspects of the "eyeballs"?

The reality is that most contributions to big, successful FOSS projects are reviewed on average by a couple of people. Some projects are better, most are worse, but in no case does being FOSS magically lead to a large number of reviewers tirelessly checking code contributions. This limit in the number of reviewers also limits the extent to which code reviews can stand as the only process to ensure quality.

3. It's not in the culture

In order to try out a development process in a project, developers first need to learn about it and be convinced that it will be beneficial. Although there are many resources, like books and articles, arguing in favor of automated tests, the main driver for trying new processes is still learning about them from more experienced developers when working on a project. In the FOSS world this also takes the form of studying what other projects, especially the high-profile ones, are doing.

Since comprehensive automated testing is not the norm in FOSS, this creates a negative network effect. Why should you bother writing automated tests if the high-profile projects, which you consider to be role models, don't do it properly or at all?

Thankfully, the culture is beginning to shift, especially in projects using technologies in which automated testing is part of the culture of the technologies themselves. Unfortunately, many of the system-level and middleware FOSS projects are still living in a world without automated tests.

4. Tests as an afterthought

Tests as an afterthought is not a situation particular to FOSS projects, but it is especially relevant to them, since the way they spring up and grow can disincentivize the early writing of tests.

Some FOSS projects start as small projects to scratch an itch, without any plans for significant growth or adoption, so the incentives to have tests at this stage are limited.

In addition, many projects, even the ones that start with more lofty adoption goals, follow a "release early, release often" mentality. This mentality has some benefits, but at the early stages also carries the risk of placing the focus exclusively on making the project as relevant to the public as possible, as quickly as possible. From such a perspective, spending the probably limited resources on tests instead of features seems like a bad use of developer time.

As the project grows and becomes more complex, however, more and more opportunities for bugs arise. At this point, some projects realize that adding a test suite would be beneficial for maintaining quality in the long term. Unfortunately, for many projects, it's already too late. The code by now has become test-unfriendly and significant effort is needed to change it.

The final effect is that many projects remain without an automated test suite, or, in the best case, with a poor one.

5. Missing CI infrastructure

Automated testing delivers the most value if it is combined with a CI service that runs the tests automatically for each commit or merge proposal. Until recently, access to such services was difficult to get at a reasonably low effort and cost. Developers either had to set up and host CI themselves, or pay for a commercial service, thus requiring resources which unsponsored FOSS projects were unlikely to be able to afford.

Nowadays, it's far easier to find and use free CI services, with most major code hosting platforms supporting them. Hopefully, with time, this reason will completely cease being a factor in the lack of automated testing adoption.

6. Not the hacker way

The FOSS movement originated from the hacker culture and still has strong ties to it. In the minds of some, the processes around software testing are too enterprise-y, too 9-to-5, perceived as completely contrary to the creative and playful nature of hacking.

My argument against this line of thought is that hackers value technical excellence very highly, and automated testing, as a tool that helps achieve such excellence, cannot be inconsistent with the hacker way.

Some pseudo-hackers may also argue that their skills are so refined that their code doesn't require testing. When we are talking about a codebase of any significant size, I consider this attitude to be a sign of inexperience and immaturity rather than a testament to superior skills.

Epilogue

I hope this post will serve as a good starting point for a discussion about the reasons which discourage FOSS projects from adopting a comprehensive automated test suite. Identifying both valid concerns and misconceptions is the first step in convincing both fledgling and mature FOSS projects to embrace automated testing, which will hopefully lead to an improvement in the overall quality of FOSS.

Bless Hex Editor 0.6.1

A long time ago, on a computer far, far away... well, actually, 14 years ago, on a computer that is still around somewhere in the basement, I wrote the first lines of source code for what would become the Bless hex editor.

For my initial experiments I used C++ with the gtkmm bindings, but C++ compilation times were so appallingly slow on my feeble computer that I decided to give the relatively young Mono framework a try. The development experience was much better, so I continued with Mono and Gtk#. For revision control, I started out with tla (remember that?), but eventually settled on bzr.

Development continued at a steady pace until 2009, when life's responsibilities got in the way, and left me with little time to work on the project. A few attempts were made by other people to revive Bless after that, but, unfortunately, they also seem to have stagnated. The project had been inactive for almost 8 years when the gna.org hosting site closed down in 2017 and pulled the official Bless page and bzr repository with it into the abyss.

Despite the lack of development and maintenance, Bless remained surprisingly functional through the years. I, and many others it seems, have kept using it, and, naturally, a few bugs have been uncovered during this time.

I recently found some time to bring the project back to life, although, I should warn, this does not imply any intention to resume feature development on it. My free time is still scarce, so the best I can do is try to maintain it and accept contributions. The project's new official home is at https://github.com/afrantzis/bless.

To mark the start of this new era, I have released Bless 0.6.1, containing fixes for many of the major issues I could find reports for. Enjoy!

Important Note: There seems to be a bug in some versions of Mono that manifests as a crash when selecting bytes. The backtrace looks like:

free(): invalid pointer
Stacktrace:

  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) GLib.SList.g_free (intptr) <0x0005f>
  at GLib.ListBase.Empty () <0x0013c>
  at GLib.ListBase.Dispose (bool) <0x0000f>
  at GLib.ListBase.Finalize () <0x0001d>
  at (wrapper runtime-invoke) object.runtime_invoke_virtual_void__this__ (object,intptr,intptr,intptr) <0x00068>

Searching for this backtrace you can find various reports of other Mono programs also affected by this bug. At the time of writing, the mono packages in Debian and Ubuntu (4.6.2) exhibit this problem. If you are affected, the solution is to update to a newer version of Mono, e.g., from https://www.mono-project.com/download/stable/.
