Metrics for test suite comprehensiveness
November 3, 2018

In a previous post I discussed a few FOSS-specific mentalities and practices that I believe play a role in discouraging adoption of comprehensive automated testing in FOSS. One of the points that came up in discussions is whether the basic premise of the post, that FOSS projects don't typically employ comprehensive automated testing, and in many cases have no tests at all, is actually true. That's a valid concern, given that the post was motivated by my long-term observations from working on and with FOSS and didn't provide any further data. In this post I will try to address this concern.
The main question is how we can measure the comprehensiveness of a test suite. Code coverage is the standard metric used in the industry and makes intuitive sense. However, it presents some difficulties for large-scale surveys, since it's not computationally cheap to produce and often requires per-project changes or arrangements.
I would like to propose and explore two alternative metrics that are easier to produce, and are therefore better suited to large-scale surveys.
The first metric is the test commit ratio of the codebase — the number of commits that affect test code as a percentage of all commits. Ideally, every change that adds a feature or fixes a bug in the production code should be accompanied by a corresponding change in the test code. The more we depart from this ideal, and, hence, the less often we update the test code, the less comprehensive our test suite tends to be. This metric is affected by the project's particular commit practices, so some amount of variance is expected even between projects considered to be well tested.
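To make the metric concrete, here is a minimal sketch of how such a ratio could be computed with git and Python. This is not the tool used for the survey (that is described below); the bare "test" substring check and the restriction to first-parent history are assumptions made for the example.

```python
import subprocess

def test_commit_ratio(repo_path, marker="test"):
    """Percentage of first-parent commits that touch at least one 'test' path."""
    # %x00 emits a NUL byte, giving an unambiguous separator between commits.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--first-parent", "--name-only",
         "--pretty=format:%x00%H"],
        capture_output=True, text=True, check=True).stdout
    commits = [c for c in log.split("\x00") if c.strip()]
    touching = sum(
        1 for c in commits
        # First line of each block is the commit hash, the rest are file paths.
        if any(marker in path.lower() for path in c.splitlines()[1:] if path))
    return 100.0 * touching / len(commits) if commits else 0.0

print(f"test commit ratio: {test_commit_ratio('.'):.1f}%")
```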
The second metric is the test code size ratio of the codebase — the size of the test code as a percentage of the size of all the code. It makes sense intuitively that, typically, more test code will be able to test more production code. That being said, the size itself does not provide the whole picture. Depending on the project, a compact test suite may be adequately comprehensive, or, conversely, large test data files may skew this metric.
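A similarly rough sketch for the size ratio could sum the blob sizes of tracked files at HEAD, again using a plain "test" substring match as a stand-in for a real classifier. Note that the actual survey tool also filters out non-text files and translations, which this sketch does not.

```python
import subprocess

def test_code_size_ratio(repo_path, marker="test"):
    """Size of 'test' files as a percentage of all tracked file sizes at HEAD."""
    out = subprocess.run(
        ["git", "-C", repo_path, "ls-tree", "-r", "-l", "HEAD"],
        capture_output=True, text=True, check=True).stdout
    total = tests = 0
    for line in out.splitlines():
        meta, path = line.split("\t", 1)
        size = meta.split()[-1]   # blob size in bytes; '-' for submodule entries
        if size == "-":
            continue
        total += int(size)
        if marker in path.lower():
            tests += int(size)
    return 100.0 * tests / total if total else 0.0
```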
Neither of these metrics is failproof, but my hypothesis is that when combined and aggregated over many projects they can provide a good indication about the overall trend of the comprehensiveness of test suites, which is the main goal of this post.
Let's see what these metrics give us for FOSS projects. I chose two software suites that are considered quite prominent in the FOSS world, namely GNOME and KDE, which together consist of over 1500 projects.
A small number of these projects are not good candidates for this survey, because, for example, they are empty, or are pure documentation. Although they are included in the survey, their count is low enough to not affect the overall trend reported below.
Here is the distribution of the percentages of commits affecting test code in GNOME and KDE projects:
Here is the distribution of the percentages of test code size in GNOME and KDE projects:
The first thing to notice is the tall lines in the left part of both graphs. For the second graph, this shows that a very large percentage of the projects, roughly 55%, have either no tests at all, or so few as to be practically non-existent. Another interesting observation is the position of the 80th percentile lines, which show that 80% of the projects have test commit ratios of less than 11.2%, and test code size ratios of less than 8.8%.
In other words, out of ten commits that change the code base, only about one (or fewer) touches the tests in the majority of the projects. Although this doesn't constitute indisputable proof that tests are not comprehensive, it is nevertheless a big red flag, especially when combined with low test code size percentages. Each project may have different needs and development patterns, and these numbers need to be interpreted with care, but as a general trend this is not encouraging.
On the bright side, there are some projects with higher values in this distribution. It's no surprise that this set consists mainly of core libraries from these software suites, but does not include many end-user applications.
Going off on a slight tangent, one may argue that the distribution is unfairly skewed, since many of these projects are GUI applications which, according to conventional wisdom, are not easy to test. However, this argument fails on multiple fronts. First, it's not unfair to include these programs, because we expect no less of them in terms of quality compared to other types of programs; they don't get a free pass because they have a GUI. In addition, being a GUI program is not a valid excuse for inadequate testing: although testing the UI itself, or the functionality through the UI, may not be easy, there is typically a lot more we can test. Programs provide some core domain functionality, which we should be able to test independently of the UI if we decouple the two, for example by adopting the Hexagonal Architecture.
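To make that last point concrete, here is a tiny, purely hypothetical sketch of the idea: the domain logic lives in a plain function with no knowledge of the UI, so it can be unit tested without any toolkit at all. The names and the UI layer here are invented for illustration.

```python
# Core domain logic: no GUI imports, trivially unit-testable.
def apply_discount(price, percent):
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

# Thin, hypothetical UI layer: only parses input and displays the result.
def on_apply_clicked(window, price_field, percent_field):
    result = apply_discount(float(price_field.text), float(percent_field.text))
    window.show_message(f"Discounted price: {result}")

# The test exercises the core logic directly, no GUI toolkit required.
def test_apply_discount():
    assert apply_discount(100.0, 25) == 75.0
```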
After having seen some general trends, let's see some examples of individual codebases that do better in these metrics:
This graph displays quite a diverse collection of projects including a database, graphics libraries, a GUI toolkit, a display compositor, system tools and even a GUI application. These projects are considered to be relatively well tested, each in its own particular way, so these higher numbers provide some evidence that the metrics correlate with test comprehensiveness.
If we accept this correlation, this collection also shows that we can achieve more comprehensive testing in a wide variety of projects. Depending on the project, the trade-offs we need to make will differ, but it is almost always possible to do well in this area.
The interpretation of individual results varies with the project, but, in general, I have found that the test commit ratio is typically a more reliable indicator of test comprehensiveness, since it's less sensitive to test specifics compared to test code size ratio.
Tools and Data
In order to produce the data, I developed the git-test-annotate program which produces a list of files and commits from a git repository and marks them as related to testing or not.
git-test-annotate decides whether a file is a test file by checking for the string "test" anywhere in the file's path within the repository. This is a very simple heuristic, but it works surprisingly well. In order to make the test code size calculation more meaningful, the tool ignores files that are not typically considered testable sources, for example, non-text files and translations, both in the production and the test code.
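In other words, the classification boils down to something along these lines (a sketch, not the tool's actual code; whether the match is case-insensitive is an assumption here):

```python
def is_test_file(path):
    # The heuristic described above: any path containing the string "test".
    return "test" in path.lower()

assert is_test_file("tests/unit/test-parser.c")
assert is_test_file("src/gtest/runner.cpp")
assert not is_test_file("src/parser.c")
```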
For commit annotations, only mainline commits are taken into account. To check if a commit affects the tests, the tool checks whether it changes at least one file with "test" in its path.
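A minimal way to perform that per-commit check with git, again as a sketch rather than the tool's actual implementation, could look like this:

```python
import subprocess

def commit_touches_tests(repo_path, sha):
    """Does the given commit change at least one file with 'test' in its path?"""
    files = subprocess.run(
        ["git", "-C", repo_path, "diff-tree", "--no-commit-id",
         "--name-only", "-r", sha],
        capture_output=True, text=True, check=True).stdout.splitlines()
    return any("test" in f.lower() for f in files)
```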
To get the stats for KDE and GNOME I downloaded all their projects from their GitHub organizations/mirrors and ran the git-test-annotate tool on each project. All the annotated data and a Python script to process them are available in the foss-test-annotations repository.
Epilogue
I hope this post has provided some useful information about the utility of the proposed metrics, and some evidence that there is ample room for improvement in automated testing of FOSS projects. It would certainly be interesting to perform a more rigorous investigation to evaluate how well these metrics correlate with code coverage.
Before closing, I would like to mention that there are cases where projects are primarily tested through external test suites. Examples in the FOSS world are the piglit suite for Mesa, and various test suites for the Linux kernel. In such cases, project test comprehensiveness metrics don't provide the complete picture, since they don't take into account the external tests. These metrics are still useful, though, because external suites typically perform functional or conformance testing only, and the metrics can provide information about internal testing, for example unit testing, done within the projects themselves.