Metrics for test suite comprehensiveness
November 3, 2018
In a previous
post I discussed
a few FOSS specific mentalities and practices that I believe play a role in
discouraging adoption of comprehensive automated testing in FOSS. One of the
points that came up in discussions is whether the basic premise of the post is
actually true: that FOSS projects don't typically employ comprehensive
automated testing, and often have no tests at all. That's a valid concern,
given that the post was motivated by my long-term observations working on and
with FOSS, and didn't provide any further data. In this post I will try to
address this concern.
The main question is how we can measure the comprehensiveness of a test suite.
Code coverage is the standard metric used in the industry and makes intuitive
sense. However, it presents some difficulties for large-scale surveys, since
it's not computationally cheap to produce and often requires per-project
changes or arrangements.
I would like to propose and explore two alternative metrics that are easier to
produce, and are therefore better suited to large-scale surveys.
The first metric is the test commit ratio of the codebase — the number of
commits that affect test code as a percentage of all commits. Ideally, every
change that adds a feature or fixes a bug in the production code should be
accompanied by a corresponding change in the test code. The more we depart from
this ideal, and, hence, the less often we update the test code, the less
comprehensive our test suite tends to be. This metric is affected by the
project's particular commit practices, so some amount of variance is expected
even between projects considered to be well tested.
The second metric is the test code size ratio of the codebase — the size of
the test code as a percentage of the size of all the code. It makes sense
intuitively that, typically, more test code will be able to test more
production code. That being said, the size itself does not provide the whole
picture. Depending on the project, a compact test suite may be adequately
comprehensive, or, conversely, large test data files may skew this metric.
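A similarly minimal sketch of this second metric, measuring size in lines of
tracked files and treating anything with "test" in its path as test code; the
file-type filtering performed by the actual survey tool is approximated here
by simply skipping files that don't decode as text:

import os
import subprocess

def test_code_size_ratio(repo):
    files = subprocess.run(
        ["git", "-C", repo, "ls-files"],
        capture_output=True, text=True, check=True).stdout.splitlines()
    test_lines = total_lines = 0
    for path in files:
        try:
            with open(os.path.join(repo, path), encoding="utf-8") as f:
                n = sum(1 for _ in f)  # size measured in lines of text
        except (OSError, UnicodeDecodeError):
            continue  # skip binary or otherwise unreadable files
        total_lines += n
        if "test" in path.lower():
            test_lines += n
    return test_lines / total_lines if total_lines else 0.0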
Neither of these metrics is foolproof, but my hypothesis is that when combined
and aggregated over many projects they can provide a good indication about the
overall trend of the comprehensiveness of test suites, which is the main goal
of this post.
Let's see what these metrics give us for FOSS projects. I chose two software
suites that are considered quite prominent in the FOSS world, namely GNOME and
KDE, which together consist of over 1500 projects.
A small number of these projects are not good candidates for this survey,
because, for example, they are empty, or are pure documentation. Although they
are included in the survey, their count is low enough to not affect the overall
trend reported below.
Here is the distribution of the percentages of commits affecting test code in
GNOME and KDE projects:

Here is the distribution of the percentages of test code size in GNOME and
KDE projects:

The first thing to notice is the tall lines in the left part of both graphs.
For the second graph, this shows that a very large percentage of the projects,
roughly 55%, have either no tests at all, or so few as to be practically
non-existent. Another interesting observation is the position of the 80th
percentile lines, which show that 80% of the projects have test commit ratios
of less than 11.2%, and test code size ratios of less than 8.8%.
In other words, out of ten commits that change the codebase, only about one
(or fewer) touches the tests in the majority of the projects. Although this
doesn't constitute indisputable proof that tests are not comprehensive, it is
nevertheless a big red flag, especially when combined with low test code size
percentages. Each project may have different needs and development patterns,
and these numbers need to be interpreted with care, but as a general trend this
is not encouraging.
On the bright side, there are some projects with higher values in this
distribution. It's no surprise that this set consists mainly of core libraries
from these software suites, but does not include many end-user applications.
Going off on a slight tangent, one may argue that the distribution is unfairly
skewed since many of these projects are GUI applications which, according to
conventional wisdom, are not easy to test. However, this argument fails on
multiple fronts. First, it's not unfair to include these programs, because we
expect no less of them in terms of quality compared to other types of programs.
They don't get a free pass because they have a GUI. In addition, being a GUI
program is not a valid excuse for inadequate testing, because although testing
the UI itself, or the functionality through the UI, may not be easy, there is
typically a lot more we can test. Programs provide some core domain
functionality, which we should be able to test independently if we decouple our
core domain logic from the UI, often by using a different architecture, for
example, the Hexagonal
Architecture.
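As a hypothetical illustration (not drawn from any of the surveyed projects),
the sketch below keeps the domain logic free of any UI dependency, so it can
be exercised directly by a test, while the GUI layer shrinks to a thin adapter
around it:

def apply_discount(price_cents, percent):
    # Core domain logic: pure, UI-free, and therefore trivially testable.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100

def test_apply_discount():
    assert apply_discount(1000, 25) == 750
    assert apply_discount(999, 0) == 999

# A GTK, Qt or web front end would only gather the input values and display
# the result of apply_discount(); it would add no logic of its own.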
After having seen some general trends, let's see some examples of individual
codebases that do better in these metrics:

This graph displays quite a diverse collection of projects including a
database, graphics libraries, a GUI toolkit, a display compositor, system tools
and even a GUI application. These projects are considered to be relatively well
tested, each in its own particular way, so these higher numbers provide some
evidence that the metrics correlate with test comprehensiveness.
If we accept this correlation, this collection also shows that we can achieve
more comprehensive testing in a wide variety of projects. Depending on the project,
the trade-offs we need to make will differ, but it is almost always possible to
do well in this area.
The interpretation of individual results varies with the project, but, in
general, I have found that the test commit ratio is typically a more reliable
indicator of test comprehensiveness, since it's less sensitive to test
specifics compared to test code size ratio.
In order to produce the data, I developed the
git-test-annotate program
which produces a list of files and commits from a git repository and marks them
as related to testing or not.
git-test-annotate decides whether a file is a test file by checking for the
string "test" anywhere in the file's path within the repository. This is a very
simple heuristic, but it works surprisingly well. In order to make the test code size
calculation more meaningful, the tool ignores files that are not typically
considered testable sources, for example, non-text files and translations, both
in the production and the test code.
For commit annotations, only mainline commits are taken into account. To
determine whether a commit affects the tests, the tool checks if it changes at
least one file with "test" in its path.
To get the stats for KDE and GNOME I downloaded all their projects from their
GitHub organizations/mirrors and ran the git-test-annotate tool on each
project. All the annotated data and a Python script to process them are
available in the
foss-test-annotations
repository.
Epilogue
I hope this post has provided some useful information about the utility of the
proposed metrics, and some evidence that there is ample room for improvement in
automated testing of FOSS projects. It would certainly be interesting to
perform a more rigorous investigation to evaluate how well these metrics
correlate with code coverage.
Before closing, I would like to mention that there are cases where projects are
primarily tested through external test suites. Examples in the FOSS world are
the piglit suite for Mesa, and various test suites for the Linux kernel. In
such cases, project test comprehensiveness metrics don't provide the complete
picture, since they don't take into account the external tests. These metrics
are still useful though, because external suites typically perform functional
or conformance testing only, and the metrics can provide information about
internal testing, for example unit testing, done within the projects
themselves.
On the low adoption of automated testing in FOSS
October 15, 2018
A few times in the recent past I've been in the unfortunate position of using a
prominent Free and Open Source Software (FOSS) program or library, and running
into issues of such fundamental nature that made me wonder how those issues
even made it into a release.
In all cases, the answer came quickly when I realized that, invariably, the
project involved either didn't have a test suite, or, if it did have one, it
was not adequately comprehensive.
I am using the term comprehensive in a very practical, non-extreme way. I
understand that it's often not feasible to test every possible scenario and
interaction, but, at the very least, a decent test suite should ensure that
under typical circumstances the code delivers all the functionality it promises
to.
For projects of any value and significance, having such a comprehensive
automated test suite is nowadays considered a standard software engineering
practice. Why, then, don't we see more prominent FOSS projects employing this
practice, or, when they do, why is it often employed poorly?
In this post I will highlight some of the reasons that I believe play a role in
the low adoption of proper automated testing in FOSS projects, and argue why
these reasons may be misguided. I will focus on topics that are especially
relevant from a FOSS perspective, omitting considerations which, although
important, are not particular to FOSS.
My hope is that by shedding some light on this topic, more FOSS projects will
consider employing an automated test suite.
As you can imagine, I am a strong proponent of automated testing, but this
doesn't mean I consider it a silver bullet. I do believe, however, that it is
an indispensable tool in the software engineering toolbox, which should only be
forsaken after careful consideration.
1. Underestimating the cost of bugs
Most FOSS projects, at least those not supported by some commercial entity,
don't come with any warranty; it's even stated in the various licenses! The
lack of any formal obligations makes it relatively inexpensive, both in terms
of time and money, to have the occasional bug in the codebase. This means that
there are fewer incentives for the developer to spend extra resources to try to
safeguard against bugs. When bugs come up, the developers can decide at their
own leisure if and when to fix them and when to release the fixed version.
Easy!
At first sight, this may seem like a reasonably pragmatic attitude to have.
After all, if fixing bugs is so cheap, is it worth spending extra resources
trying to prevent them?
Unfortunately, bugs are only cheap for the developer, not for the users who may
depend on the project for important tasks. Users expect the code to work
properly and can get frustrated or disappointed if this is not the case,
regardless of whether there is any formal warranty. This is even more
pronounced when security concerns are involved, for which the cost to users can
be devastating.
Of course, lack of formal obligations doesn't mean that there is no driver for
quality in FOSS projects. On the contrary, there is an exceptionally strong
driver: professional pride. In FOSS projects the developers are in the
spotlight and no (decent) developer wants to be associated with a low-quality,
bug-infested codebase. It's just that, due to the mentality stated above, in
many FOSS projects the trade-offs developers make seem to favor a reactive
rather than proactive attitude.
2. Overtrusting code reviews
One of the development practices FOSS projects employ ardently is code reviews.
Code reviews happen naturally in FOSS projects, even in small ones, since most
contributors don't have commit access to the code repository and the original
author has to approve any contributions. In larger projects there are often
more structured procedures which involve sending patches to a mailing list or
to a dedicated reviewing platform. Unfortunately, in some projects the trust in
code reviews is so great that other practices, like automated testing, are
forsaken.
There is no question that code reviews are one of the best ways to maintain and
improve the quality of a codebase. They can help ensure that code is designed
properly, is aligned with the overall architecture, and furthers the long-term
goals of the project. They also help catch bugs, but only some of them,
some of the time!
The main problem with code reviews is that we, the reviewers, are only human.
We humans are great at creative thought, but we are also great at overlooking
things, occasionally filling in the gaps with our own unicorns-and-rainbows
inspired reality. Another reason is that we tend to focus more on the code
changes at a local level, and less on how the code changes affect the system as
a whole. This is not an inherent problem with the process itself but rather a
limitation of humans performing the process. When a codebase gets large enough,
it's difficult for our brains to keep all the possible states and code paths in
mind and check them mentally, even in a codebase that is properly designed.
In theory, the problem of human limitations is offset by the open nature of the
code. We even have the so-called Linus's law, which states that "given enough
eyeballs, all bugs are shallow". Note the clever use of the indeterminate term
"enough". How many are enough? How about the qualitative aspects of the
"eyeballs"?
The reality is that most contributions to big, successful FOSS projects are
reviewed on average by a couple of people. Some projects are better, most are
worse, but in no case does being FOSS magically lead to a large number of
reviewers tirelessly checking code contributions. This limit in the number of
reviewers also limits the extent to which code reviews can stand as the only
process to ensure quality.
3. It's not in the culture
In order to try out a development process in a project, developers first need
to learn about it and be convinced that it will be beneficial. Although there
are many resources, like books and articles, arguing in favor of automated
tests, the main driver for trying new processes is still learning about them
from more experienced developers when working on a project. In the FOSS world
this also takes the form of studying what other projects, especially the
high-profile ones, are doing.
Since comprehensive automated testing is not the norm in FOSS, this creates a
negative network effect. Why should you bother doing automated tests if the
high profile projects, which you consider to be role models, don't do it
properly or at all?
Thankfully, the culture is beginning to shift, especially in projects using
technologies in which automated testing is part of the culture of the
technologies themselves. Unfortunately, many of the system-level and middleware
FOSS projects are still living in the non-automated-test world.
4. Tests as an afterthought
Tests as an afterthought is not a situation particular to FOSS projects, but it
is especially relevant to them since the way they spring up and grow can
disincentivize the early writing of tests.
Some FOSS projects start as small projects to scratch an itch, without any
plans for significant growth or adoption, so the incentives to have tests at
this stage are limited.
In addition, many projects, even the ones that start with more lofty adoption
goals, follow a "release early, release often" mentality. This mentality has
some benefits, but at the early stages also carries the risk of placing the
focus exclusively on making the project as relevant to the public as possible,
as quickly as possible. From such a perspective, spending the probably limited
resources on tests instead of features seems like a bad use of developer time.
As the project grows and becomes more complex, however, more and more
opportunities for bugs arise. At this point, some projects realize that adding
a test suite would be beneficial for maintaining quality in the long term.
Unfortunately, for many projects, it's already too late. The code by now has
become test-unfriendly and significant effort is needed to change it.
The final effect is that many projects remain without an automated test suite,
or, in the best case, with a poor one.
5. Missing CI infrastructure
Automated testing delivers the most value if it is combined with a CI service
that runs the tests automatically for each commit or merge proposal. Until
recently, access to such services was difficult to get at a reasonably low
effort and cost. Developers either had to set up and host CI themselves, or pay
for a commercial service, thus requiring resources which unsponsored FOSS
projects were unlikely to be able to afford.
Nowadays, it's far easier to find and use free CI services, with most major
code hosting platforms supporting them. Hopefully, with time, this reason will
completely cease being a factor in the lack of automated testing adoption.
6. Not the hacker way
The FOSS movement originated from the hacker culture and still has strong ties
to it. In the minds of some, the processes around software testing are too
enterprise-y, too 9-to-5, perceived as completely contrary to the creative and
playful nature of hacking.
My argument against this line of thought is that hackers value technical
excellence very highly, and automated testing, as a tool that helps achieve
such excellence, cannot be inconsistent with the hacker way.
Some pseudo-hackers may also argue that their skills are so refined that their
code doesn't require testing. When we are talking about a codebase of any
significant size, I consider this attitude to be a sign of inexperience and
immaturity rather than a testament to superior skills.
Epilogue
I hope this post will serve as a good starting point for a discussion about the
reasons which discourage FOSS projects from adopting a comprehensive automated
test suite. Identifying both valid concerns and misconceptions is the first
step in convincing both fledgling and mature FOSS projects to embrace automated
testing, which will hopefully lead to an improvement in the overall quality of
FOSS.
Bless Hex Editor 0.6.1
September 19, 2018
A long time ago, on a computer far, far away... well, actually, 14 years ago,
on a computer that is still around somewhere in the basement, I wrote the first
lines of source code for what would become the Bless hex editor.
For my initial experiments I used C++ with the gtkmm bindings, but C++
compilation times were so appallingly slow on my feeble computer, that I
decided to give the relatively young Mono
framework a try. The development experience was much better, so I continued
with Mono and Gtk#. For revision control, I started out with
tla (remember that?), but eventually
settled on bzr.
Development continued at a steady pace until 2009, when life's responsibilities
got in the way, and left me with little time to work on the project. A few
attempts were made by other people to revive Bless after that, but,
unfortunately, they also seem to have stagnated. The project had been inactive
for almost 8 years when the gna.org hosting site closed down in 2017 and pulled
the official Bless page and bzr repository with it into the abyss.
Despite the lack of development and maintenance, Bless remained surprisingly
functional through the years. I, and many others it seems, have kept using it,
and, naturally, a few bugs have been uncovered during this time.
I recently found some time to bring the project back to life, although, I
should warn, this does not imply any intention to resume feature development on
it. My free time is still scarce, so the best I can do is try to maintain it
and accept contributions. The project's new official home is at
https://github.com/afrantzis/bless.
To mark the start of this new era, I have released Bless
0.6.1, containing fixes for many
of the major issues I could find reports for. Enjoy!
Important Note: There seems to be a bug in some versions of Mono that
manifests as a crash when selecting bytes. The backtrace looks like:
free(): invalid pointer
Stacktrace:
at <unknown> <0xffffffff>
at (wrapper managed-to-native) GLib.SList.g_free (intptr) <0x0005f>
at GLib.ListBase.Empty () <0x0013c>
at GLib.ListBase.Dispose (bool) <0x0000f>
at GLib.ListBase.Finalize () <0x0001d>
at (wrapper runtime-invoke) object.runtime_invoke_virtual_void__this__ (object,intptr,intptr,intptr) <0x00068>
Searching for this backtrace you can find various reports of other Mono
programs also affected by this bug. At the time of writing, the mono packages
in Debian and Ubuntu (4.6.2) exhibit this problem. If you are affected, the
solution is to update to a newer version of Mono, e.g., from
https://www.mono-project.com/download/stable/.