Code Coverage Targets - Recipe for Disaster
Many companies introduce Code Coverage Targets as a diet pill to bring Unit Testing into the organization. The problem is that Code Coverage can easily be "gamed", producing zero-value tests and wasted time.
Previously, I wrote about an engineering manager who introduced code coverage metrics into a team that had no prior experience or skillset in unit testing, and how it ended in disaster - even worse than having no unit testing at all! You can read the short story here: Don't chase Code Coverage goals - and several follow-up articles.
Due to the high level of interest from readers in this topic, I've now decided to write a longer follow-up with the broader story behind the disaster…
The broader story about the “problems“ with Code Coverage metrics
Story #1 by Vladimir Khorikov (Unit Testing Principles, Practices, and Patterns)
Vladimir Khorikov, in the book Unit Testing Principles, Practices, and Patterns (see 1.3.3 Problems with coverage metrics), tells the story of an organization that had no unit testing practices - very few people practiced it at all. Management decided to introduce a company-wide rule regarding minimum code coverage. The developers (who had no previous skillset, yet were under pressure to meet the targets) figured out how to game the system, and the initiative went downhill.
Years ago, I worked on a project where management imposed a strict requirement of having 100% code coverage for every project under development. This initiative had noble intentions. It was during the time when unit testing wasn’t as prevalent as it is today. Few people in the organization practiced it, and even fewer did unit testing consistently.
A group of developers had gone to a conference where many talks were devoted to unit testing. After returning, they decided to put their new knowledge into practice. Upper management supported them, and the great conversion to better programming techniques began. Internal presentations were given. New tools were installed. And, more importantly, a new company-wide rule was imposed: all development teams had to focus on writing tests exclusively until they reached the 100% code coverage mark. After they reached this goal, any code check-in that lowered the metric had to be rejected by the build systems.
As you might guess, this didn't play out well. Crushed by this severe limitation, developers started to seek ways to game the system. Naturally, many of them came to the same realization: if you wrap all tests with try/catch blocks and don't introduce any assertions in them, those tests are guaranteed to pass. People started to mindlessly create tests for the sake of meeting the mandatory 100% coverage requirement. Needless to say, those tests didn't add any value to the projects. Moreover, they damaged the projects because of all the effort and time they steered away from productive activities, and because of the upkeep costs required to maintain the tests moving forward. Eventually, the requirement was lowered to 90% and then to 80%; after some period of time, it was retracted altogether (for the better!).
Story #2 by Dave Farley (Modern Software Engineering)
Dave Farley in Modern Software Engineering:
I can think of lots of examples of measuring the wrong things. At one of my clients, they decided that they could improve the quality of their code by increasing the level of test coverage. So, they began a project to institute the measurement, collected the data, and adopted a policy to encourage improved test coverage. They set a target of “80 percent test coverage“. Then they used that measurement to incentivize their development teams, bonuses were tied to hitting targets in test coverage.
Guess what? They achieved their goal!
Some time later, they analyzed the tests that they had and found more than 25 percent of their tests had no assertions in them at all. So they had paid people on development teams, via bonuses, to write tests that tested nothing at all.
In this case, a much better measure would have been stability. What the organization really wanted was not more tests but better quality code, so measuring that more directly worked better.
So let’s go through the stories. Any quotes below are copied from the above two stories.
Code Coverage targets start with “noble intentions”
Let’s start with the story by Vladimir Khorikov, in the book Unit Testing Principles, Practices, and Patterns:
Years ago, I worked on a project where management imposed a strict requirement of having 100% code coverage for every project under development. This initiative had noble intentions.
This is an example of the very common scenario of imposing code coverage requirements top-down. In some cases it might be 100%, in other cases 90% or 80%… but as you read further, you'll realize the consequences are the same regardless of the "high" number chosen…
Introducing Code Coverage target WITHOUT having any Unit Testing practices
Few people in the organization practiced it [unit testing], and even fewer did unit testing consistently.
This is important to note! The organization did not have any practices regarding unit testing at all (let alone doing it effectively). So what happens in this case - when no one (or hardly anyone) is doing unit testing and then all of a sudden we introduce code coverage targets? Let’s see…
Let’s just attend a conference to learn Unit Testing in a day!
A group of developers had gone to a conference where many talks were devoted to unit testing. After returning, they decided to put their new knowledge into practice. Upper management supported them, and the great conversion to better programming techniques began. Internal presentations were given. New tools were installed.
The most "common" solution to skillset challenges is conferences, training, courses, books… So we go off and listen to speakers talk about the benefits of unit testing, perhaps showing us some simple examples (far simpler than our real projects). We all get motivated: we want to do it, we want to apply the new knowledge for the first time.
Now we see some great signs above:
Developers were motivated to apply unit testing based on the new knowledge
Management is supporting the developers in this initiative
Developers organize internal presentations to share knowledge with colleagues
New tools are set up to support unit testing
Company-wide Code Coverage targets seem like a good idea (at the beginning)
And, more importantly, a new company-wide rule was imposed: all development teams had to focus on writing tests exclusively until they reached the 100% code coverage mark. After they reached this goal, any code check-in that lowered the metric had to be rejected by the build systems.
So after the conference and the internal presentations, management introduced a large-scale, company-wide rule mandating code coverage targets. Again, some companies set a target of 100%, some 90%, and some 80%… but the essence remains the same.
Furthermore, the code coverage target is formally enforced via the build system - your check-in is rejected if you don't meet it.
All the above sounds great; what could possibly go wrong?!
Mandated Code Coverage targets can be easily gamed by developers
As you might guess, this didn't play out well. Crushed by this severe limitation, developers started to seek ways to game the system. Naturally, many of them came to the same realization: if you wrap all tests with try/catch blocks and don't introduce any assertions in them, those tests are guaranteed to pass. People started to mindlessly create tests for the sake of meeting the mandatory 100% coverage requirement.
Here we see the classic recipe for gaming coverage metrics (a sketch follows the list below):
Write assertion-free tests. By writing tests that merely execute code without asserting anything, you're not testing anything at all! But code coverage can't detect this, because coverage metrics are blind to whether any assertions exist.
Wrap all tests with try/catch blocks. This guarantees your tests always pass, because any exception is silently swallowed inside the test.
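To make this concrete, here is a minimal sketch of such a "test" (class names are hypothetical, JUnit 5 assumed). It earns full coverage credit for the code it executes, yet it can never fail:

```java
import org.junit.jupiter.api.Test;

// Hypothetical production class, included only to make the sketch self-contained.
class OrderService {
    void processOrder(String orderId, int quantity) {
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        // ... pricing, persistence, etc.
    }
}

class OrderServiceGamedTest {

    // Assertion-free and wrapped in try/catch: the production code runs
    // (so coverage goes up), but nothing is verified and every exception
    // is swallowed - a regression here would pass completely unnoticed.
    @Test
    void processOrder_executedButNeverVerified() {
        try {
            new OrderService().processOrder("id-1", 3); // executed, never checked
        } catch (Exception ignored) {
            // swallow everything so the test is guaranteed to pass
        }
    }
}
```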
Needless to say, those tests didn’t add any value to the projects.
The primary value of a test automation suite is protection against regression bugs. If our test suite does not provide us with protection against regression bugs, then it is WORTHLESS!
That was the problem with these assertion-free tests (wrapped in try/catch blocks): they provide ZERO protection against regression bugs! We spend time writing them, but they give us no value at all!
Moreover, they damaged the projects because of all the effort and time they steered away from productive activities, and because of the upkeep costs required to maintain the tests moving forward.
Going back to the start of the initiative, we know that management had good intentions. But due to the focus on the Code Coverage metric - and the absence of skillset and experience in effective Unit Testing across the organization (remember, hardly anyone practiced Unit Testing) - the results were as follows:
VALUE: ZERO. The tests didn’t add any value because they only satisfied Code Coverage metrics but provided ZERO protection against regression bugs.
COST: HIGH. Writing those tests required effort and time. Maintaining those tests needed even more effort and time. This time was now “wasted“ because it was used on an unproductive activity.
So mathematically: high cost, zero value - a negative ROI.
How about some bonuses? Tying bonuses to code coverage goes wrong
Dave Farley in Modern Software Engineering:
I can think of lots of examples of measuring the wrong things. At one of my clients, they decided that they could improve the quality of their code by increasing the level of test coverage. So, they began a project to institute the measurement, collected the data, and adopted a policy to encourage improved test coverage. They set a target of “80 percent test coverage“.
Very often there is a mistaken belief that coverage is a measure of quality. All these initiatives start with good intentions… and those intentions result in policies that set coverage targets.
Then they used that measurement to incentivize their development teams, bonuses were tied to hitting targets in test coverage.
Often, to motivate people towards reaching the coverage targets, either some form of reward (or punishment) is used. In this case, bonuses are used to incentivize coverage targets.
Guess what? They achieved their goal!
We get what we incentivize. And reaching high coverage is easy; it’s easy to game the metric, as we’ll see below…
Some time later, they analyzed the tests that they had and found more than 25 percent of their tests had no assertions in them at all. So they had paid people on development teams, via bonuses, to write tests that tested nothing at all.
So here’s the reality. The developers wrote meaningless assertion-free tests!
Lesson #1 High Coverage is meaningless
Khorikov writes in Unit Testing Principles, Practices, and Patterns:
Let me repeat myself: coverage metrics are a good negative indicator, but a bad positive indicator. Low coverage numbers - say, below 60% - are a certain sign of trouble. They mean there’s a lot of untested code in your code base. But high numbers don’t mean anything.
Martin Fowler writes about the same topic regarding Code Coverage:
From time to time I hear people asking what value of test coverage (also called code coverage) they should aim for, or stating their coverage levels with pride. Such statements miss the point. Test coverage is a useful tool for finding untested parts of a codebase. Test coverage is of little use as a numeric statement of how good your tests are.
Ok so let’s summarize what BOTH of them are saying:
LOW code coverage is a useful indicator - it tells us we have a problem! It means that parts of the codebase are not covered by tests, so it helps us find parts of the codebase that are untested.
HIGH code coverage is NOT a useful indicator - it does NOT tell us anything, it is not a measure that we have good tests, it is not a measure of quality, it doesn’t measure anything at all.
Lesson #2 High Coverage can be easily gamed
Most of the techniques below are well known and collected from various sources; here is a summary:
Technique #1: Write assertion-free tests wrapped in try/catch blocks. This is the most straightforward strategy - and the "default" one developers use to game coverage metrics: write tests for classes and methods that merely execute code, assert nothing, and can never fail.
Technique #2: Bypass assertion-checking rules. What if the company has SonarQube installed, with the rule that a test must contain at least one assertion? That prevents plain assertion-free testing, but we can bypass it through annotations in our tests (to suppress the warnings), by changing the SonarQube rule configuration… or, if we really can't get around it, by writing "pro forma" assertions that satisfy SonarQube while testing nothing at all.
Technique #3: Use reflection. If we want the "lazy way" out, there is yet another solution. Instead of writing out tests for every class and method manually, we can save even more time by looping through the classes and methods, inspecting each method's parameter types, and executing the methods.
These techniques can be executed very quickly, even on a large code base!
Techniques #1 and #2 are purely mechanical activities: just copy-paste class and method names and pass in parameters. Even on a large code base, it could be done in a few hours or a day. It is a mindless activity that requires no thinking or skillset whatsoever. (A sketch of technique #2's "pro forma" assertion follows.)
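For illustration, here is a minimal sketch of technique #2 (class names are hypothetical; JUnit 5 and a SonarQube-style "tests must contain an assertion" rule are assumed):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Hypothetical production class, for self-containedness.
class ReportGenerator {
    String generate(String quarter) {
        return "report for " + quarter;
    }
}

class ReportGeneratorGamedTest {

    // Satisfies a "tests must contain at least one assertion" rule,
    // yet the assertion is vacuous and the try/catch still guarantees a pass.
    @Test
    void generate_satisfiesTheLinterButTestsNothing() {
        try {
            new ReportGenerator().generate("2024-Q1"); // executed for coverage only
        } catch (Exception ignored) {
            // swallowed - the test cannot fail
        }
        assertTrue(true); // "pro forma" assertion: always true, verifies nothing
    }
}
```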
Technique #3 requires some up-front development time and knowledge of reflection, but it can give you future unit tests "for free": you no longer have to write out class and method names manually, because the reflection mechanism handles it for you, as the sketch below shows.
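And a minimal sketch of technique #3 (purely illustrative; the class and approach are hypothetical): a reflection loop that "covers" an entire class without a single meaningful check:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// A hypothetical "coverage bot": it instantiates a class, invokes every public
// method with default-valued arguments, and swallows all failures. Called from
// a test, it inflates coverage while verifying nothing.
public class CoverageBot {

    public static void exercise(Class<?> target) {
        try {
            Object instance = target.getDeclaredConstructor().newInstance();
            for (Method method : target.getDeclaredMethods()) {
                if (!Modifier.isPublic(method.getModifiers())) {
                    continue;
                }
                Object[] args = new Object[method.getParameterCount()];
                Class<?>[] types = method.getParameterTypes();
                for (int i = 0; i < types.length; i++) {
                    args[i] = defaultValue(types[i]);
                }
                try {
                    method.invoke(instance, args); // executed for coverage, result ignored
                } catch (Exception ignored) {
                    // any failure is swallowed - nothing can ever go "red"
                }
            }
        } catch (ReflectiveOperationException ignored) {
            // classes without a no-arg constructor are silently skipped
        }
    }

    private static Object defaultValue(Class<?> type) {
        if (type == int.class) return 0;
        if (type == long.class) return 0L;
        if (type == double.class) return 0.0;
        if (type == boolean.class) return false;
        // objects get null; unhandled primitives just make invoke() throw,
        // which is swallowed anyway
        return null;
    }
}
```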
Please note, the techniques above are just the known examples; there could be even more. Essentially, they all exploit the same key weakness of code coverage: it is sensitive only to execution, not to whether outcomes are actually verified or asserted.
For further reading about gaming code coverage see:
https://riseandfallofdevops.com/5-minute-devops-the-most-dangerous-metric-92548b2bc871
https://blog.gfader.com/2013/02/how-to-increase-code-coverage-in-2-easy.html
https://blog.truesalessoft.com/how-to-increase-code-coverage-quickly-in-salesforce/
Feel free to do further Google searches - you'll find plenty of examples. Unfortunately, these practices are prevalent in teams where Coverage Targets are set without any adequate Unit Testing practices.
Lesson #3 High Coverage doesn’t require skillsets
As we can see above, coverage metrics can be EASILY gamed. Many developers have already figured this out, and the recipes are readily available on the internet. This means that developers with LOW skillsets in Unit Testing can easily reach high coverage metrics… and collect a bonus on top, too? Unfortunately, yes.
To summarize:
You do NOT need a skillset to achieve high Code Coverage per se. It's very easy to game. In fact, you don't even need to be a good developer to achieve it.
You DO need a skillset to write effective Unit Tests. There are no quick solutions. Conferences and internal presentations can transfer knowledge, but not the ability to apply it effectively.
Lesson #4 Targets? You’ll get what you measure
There’s a vast difference between setting the following goals:
Your target is High Code Coverage; versus,
Your target is High Quality
If we incentivize specific Coverage Targets (either by mandating them or by rewarding their achievement through bonuses), then the ONLY thing we'll get is the coverage metric, and nothing else. Indeed, the tests we get might not test anything at all (e.g. assertion-free tests)! We have NOT improved quality one bit.
If we incentivize High Quality, that's a different thing. As a foundation, we'd need a reliable test suite - one that provides protection against regression bugs at low maintenance cost. In that case, high coverage happens to be just a side effect, not a goal in itself.
Lesson #5 Coverage Targets can be destructive if you don’t have proper practices
You can read more about the impacts of (blindly) enforcing Code Coverage targets without adequate unit testing practices in Unit Testing Principles, Practices, and Patterns (Vladimir Khorikov); see 1.3.3 Problems with coverage metrics.
Here I’ll keep it short.
Many bugs! The problem is that when we "game" code coverage metrics, we can write assertion-free unit tests with 100% Code Coverage that do NOT test anything at all! These unit tests provide NO protection against regression bugs; they have ZERO value.
Slow delivery! Developers are now WASTING time writing zero-value tests. These tests are pure waste, and they slow down delivery because developers also have to maintain them.
Wrong incentives! Developers are incentivized towards meeting target metrics, but not towards quality at all. They feel like it’s a waste of time, they game the system due to pressure, and there is ZERO movement towards quality.
The Analysis
Let’s summarize what’s happening:
Effective Tests versus Ineffective Tests
Tests should be coupled to the behavior of code and decoupled from the structure of code. - Kent Beck
✅Effective Tests are executable specifications of functional requirements. They test functionality rather than implementation details (coupled to behavior rather than structure). These tests give us assurance that the code is functioning as expected. Writing effective tests requires a high level of skillset. To learn the foundations, I recommend starting with Unit Testing Principles, Practices, and Patterns as well as Software Engineering at Google.
⚠️Ineffective Tests are the inverse: tests which are NOT executable specifications of functional requirements. They either do not test functionality at all (as with assertion-free tests) or they test implementation details rather than functionality (coupled to structure rather than behavior). Writing ineffective tests, on the other hand, requires no skillset at all. Anyone can do it! The contrast is sketched below.
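Here is a minimal sketch of the contrast (hypothetical class, JUnit 5 assumed): the first test is coupled to behavior, the second to structure:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical production class, for self-containedness.
class PriceCalculator {
    double total(double unitPrice, int quantity) {
        return applyDiscount(unitPrice * quantity);
    }

    // Implementation detail: callers only ever see total().
    double applyDiscount(double amount) {
        return amount >= 100 ? amount * 0.9 : amount;
    }
}

class PriceCalculatorTest {

    // ✅ Coupled to behavior: states a functional requirement
    // ("orders of 100 or more get a 10% discount") through the public API.
    // It survives any refactoring that preserves that requirement.
    @Test
    void ordersOfOneHundredOrMoreGetTenPercentDiscount() {
        assertEquals(90.0, new PriceCalculator().total(10.0, 10), 0.001);
    }

    // ⚠️ Coupled to structure: pins an internal helper. Inline or rename
    // applyDiscount() and this test breaks, even though observable
    // behavior is completely unchanged.
    @Test
    void applyDiscount_pinsAnImplementationDetail() {
        assertEquals(90.0, new PriceCalculator().applyDiscount(100.0), 0.001);
    }
}
```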
What’s the Value? Regression Bug Protection.
In manual (regression) testing, a Manual Tester performs a scripted set of test procedures and then marks the tests as passing or as failing. A passing regression test indicates that the functionality works as before. A failing regression test indicates that the functionality is broken, i.e. that a regression bug has been introduced.
In automated (regression) testing, an automated system performs all of the above without the need for a Manual Tester; automation is more reliable and repeatable, more time- and cost-efficient, and can be executed far more frequently.
Whether done manually or automated, the essence remains the same: the primary value of tests is to protect us against (functional) regression bugs.
Note: Here, we are focused on regression testing because the biggest problem (and cost driver) in software development is regression bugs. There are other types of testing - exploratory testing, performance testing, security testing, etc. which are beyond the scope of this article.
So, what is the relationship between Coverage & Regression Bug Protection?
High code coverage per se does NOT provide regression bug protection.
Effective tests provide protection. Ineffective tests don’t.
✅Effective Tests with high code coverage provide high regression bug protection. Effective Tests are executable behavioral specifications: they capture the functionality of the system. When they pass, we have assurance that the functionality works as expected; when one fails, we know a regression bug has been introduced. If we cover a large amount of functionality with such tests, we have high assurance that those functionalities are protected against regression bugs. Side effect: code coverage will also be high.
⚠️Ineffective Tests with high code coverage provide low regression bug protection. Ineffective Tests are NOT executable behavioral specifications. The most severe case (illustrated in the diagram) is assertion-free tests wrapped in try/catch blocks: even at 100% coverage, they provide ZERO regression bug protection! Other ineffective tests have only partial assertions, or adequate assertions coupled to structure rather than behavior; their ability to detect regression bugs is limited - better than zero, but still low. Here we see that high coverage is meaningless!
Notice that Ineffective Tests (in the most severe case) are just as bad as No Tests - both are providing ZERO regression bug protection.
What’s the Cost? Total Maintenance Cost
Total Maintenance Cost is the cost of changes (whether changes in functionality or bug fixes) over the course of the software lifetime.
Cost of Behavioral Change is the cost of Functional Change (e.g. new functionalities, changing existing functionalities)
Cost of Structural Change is the cost of Refactoring (e.g. refactoring code to make it cleaner, restructuring classes or methods without changing observable behavior)
So, what is the relationship between Coverage & Total Maintenance Cost?
High code coverage per se does NOT reduce total maintenance costs.
Effective tests help reduce maintenance costs. Ineffective tests increase maintenance costs.
✅Effective Tests with high code coverage come with low maintenance costs. Effective Tests are executable behavioral specifications, capturing system functionality. Thus, when we make changes, they provide a safety net:
Cost of Behavioral Change (Functional Change) = Cost of Codebase Change + (low) Cost of Testsuite Change. The Cost of Testsuite Change is relatively low because effective tests are understandable, easy to read, write and maintain.
Cost of Structural Change (Refactoring) = Cost of Codebase Change. There is no test suite modification cost because effective tests are stable, they don’t break during refactoring.
⚠️Ineffective Tests with high code coverage come with high maintenance costs. Ineffective Tests are useless (they can't protect us against regression bugs)… but we still have to maintain them!
Cost of Behavioral Change (Functional Change) = Cost of Codebase Change + (high) Cost of Testsuite Change + (high) Cost of Manual Testing. The Cost of Testsuite Change is high because ineffective tests are often coupled to implementation details rather than behavior, which means writing a lot more test code and being affected by every change in implementation details. Furthermore, since ineffective tests give poor protection against regression bugs, we need manual testing for assurance that we are protected against regressions.
Cost of Structural Change (Refactoring) = Cost of Codebase Change + (high) Cost of Testsuite Change + (high) Cost of Manual Testing. Ineffective tests are fragile: they break during refactoring, so the tests themselves must be changed even though no behavior has (unlike effective tests, which remain stable during refactoring). And again, poor protection against regression bugs means manual testing is needed for assurance.
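To make the asymmetry concrete, here's a toy comparison (the numbers are invented for illustration; only the ratios matter): suppose a refactoring requires 5 units of effort in the codebase. With effective tests, the total cost is those 5 units - the suite stays green and vouches for the change. With ineffective tests, the same refactoring might cost 5 units of code change + 4 units of repairing broken tests + 3 units of manual regression testing = 12 units, well over double, for a change that altered no observable behavior at all.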
What’s the ROI?
Bad Tests are WORSE than No Tests:
Value: INEFFECTIVE tests with 100% Coverage are WORTHLESS for regression bug protection. The reason is that we can achieve 100% code coverage without any assertions - that is, without actually testing anything - or with meaningless assertions. It doesn't matter whether coverage is 0%, 50%, 80%, or 100%; the value remains ZERO.
Cost: INEFFECTIVE tests with 100% Coverage are a maintenance WASTE. Ineffective tests, even though they cannot protect us against regression bugs, are still a maintenance burden. Even worse, such tests - aside from testing nothing - tend to be coupled to implementation details, requiring even more maintenance. With ineffective tests, the higher the coverage, the higher the WASTE!
To conclude, it's better to have no tests than poor tests:
BEST ROI: Effective Tests with High Coverage
BASE ROI: No Tests with Zero Coverage
WORST ROI: Ineffective Tests with High Coverage
Epilogue - Part 1: What about Mutation Testing?
This whole article has focused on stories about how Code Coverage Targets are a recipe for disaster! Many companies introduce Code Coverage Targets as a diet pill to bring Unit Testing into the organization, believing that Code Coverage is a measure of quality.
The problem is that Code Coverage can be easily "gamed" - zero-value tests and wasted time. We've described techniques for gaming code coverage (see Lesson #2 High Coverage can be easily gamed). These techniques rely on purely executing code without any assertions at all, whether done manually or through reflection.
Ben Morris explained it well in Don’t use test coverage as a target:
Setting code coverage targets does not enforce any good programmer behaviours. In fact, it can encourage some bad ones. If you want to nurture good engineering practices then you’re better off using a combination of coaching, training and techniques that enhance discipline, such as pair programming.
The question is - is the problem just with Code Coverage per se?
Let’s look at “better“ metrics, such as Mutation Testing (see Code Coverage vs Mutation Testing). Mutation Testing is “better“ than Code Coverage because Code Coverage only measures execution coverage, whereas Mutation Testing measures both execution coverage and assertion coverage.
But, Mutation Testing can be gamed too - by gaming both code execution AND assertions. This could be done manually (by going through mutation testing reports and adding pro-forma meaningless assertions) or it could be done automatically (using unit testing generator tools).
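Here's a minimal sketch of what such gaming can look like (hypothetical code; the operator change shown is the classic conditionals-boundary mutation performed by tools such as PIT). The assertion "kills" the mutant, but its expected value was copied from the current implementation's output rather than derived from a requirement, so it merely pins whatever the code happens to do - bugs included:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical production code.
class Discounts {
    static double apply(double amount) {
        // a mutation tool might mutate ">=" into ">" here
        return amount >= 100 ? amount * 0.9 : amount;
    }
}

class DiscountsPinningTest {

    // Kills the ">= to >" mutant at the boundary (100.0 would no longer be
    // discounted), so the mutation score looks great. But the expected value
    // 90.0 was obtained by running the code, not by reading a specification:
    // if the boundary was wrong to begin with, this test cements the bug.
    @Test
    void pinsCurrentOutputAtTheBoundary() {
        assertEquals(90.0, Discounts.apply(100.0), 0.001);
    }
}
```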
The problems associated with unit testing tools (auto-generated tests):
The tooling can’t understand what constitutes observable behavior for your production code. It also can’t give your tests meaningful names. All it can do is pick up implementation details and test those.
And that leads to horrible tests…
… the test’s accuracy goes to zero when it generates a lot of noise [bold added], even if it’s capable of finding all the bugs in code.
The issue with auto-generated tests is that they generate a huge amount of noise. Almost any refactoring of the underlying code makes such tests fail, even if the refactoring didn’t introduce any bugs. You will be drowned in all the false positives such tests produce.
So, don’t cut corners when writing tests. Treat the test code as a first-class citizen. Don’t outsource unit testing to scaffolding or code generation tooling.
The generated tests mirror implementation details - they test implementation rather than behavior - resulting in a huge amount of noise and tests with ZERO accuracy. Such tests have ZERO value; we cannot trust them. Furthermore, being coupled to implementation details, they incur an even greater maintenance cost.
Epilogue - Part 2: What should be the target?
To summarize, both metrics (Code Coverage and Mutation Testing) can be "cheated" with unit testing tools. You don't need any skillset. You can get 100% Code Coverage. You can get 100% Mutation Scores. But you get zero value at a high cost - worse than having no tests at all.
So what should be measured instead? Ben Morris explained it well in Don’t use test coverage as a target:
Rather than measuring test coverage, it makes more sense to measure the outcomes that improved coverage is supposed to influence. You’ve probably got enough tests when defects tend not to escape into production and the development team are confident about making changes without regression problems.
A more meaningful measure of quality might be found in the tracking of defects that escape into production. The rate of deployment can provide a practical view of the stability of the system, especially if regular hot-fixes or patches are being released. The overall time it takes features to make it into production can help to indicate whether development is being undermined by regression.
To summarize, if you want high-quality outcomes, focus on Effective Tests rather than chasing metrics:
With Effective Tests, you get high-quality outcomes, high ROI … and high metric scores as a side effect.
With Ineffective Tests, you can achieve high metric scores but not quality! You have low-quality outcomes and low ROI.
Target quality outcomes, not metrics. This requires effective test practices.
Epilogue - Part 3: So what are “good“ tests?
Since you've now reached the end, you've probably also reached the conclusion that writing effective ("good") tests is what matters, not the metrics themselves.
So what is a "good test"? To learn the foundations of good unit tests, there's a whole book written about them - Unit Testing Principles, Practices, and Patterns (Vladimir Khorikov) - and the following extract summarizes the answer:
A good unit test has the following four attributes:
Protection against regressions
Resistance to refactoring
Fast feedback
Maintainability
Here is a short summary of these:
Protection against regressions is the ability of the test to detect regression bugs. This can be achieved by maximizing the code executed (and outcomes verified) - the higher the complexity and domain significance, the more value we get from this protection against regressions.
Resistance to refactoring is the ability of the test to remain unchanged during refactoring (i.e. the test doesn’t turn red when we refactor). This can be achieved by writing tests coupled to behavioral outcomes rather than implementation details.
Fast feedback means that the tests execute fast, so that we can run many tests in a short time. This encourages us to run the tests frequently, alerts us if we introduced bugs, and thus reduces the cost of bug fixing.
Maintainability of tests refers to how easy (or hard) it is to read, understand, and adapt the test. By treating tests as first-class citizens, we apply the same quality standards to test code as we do to production code.
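To close, here is a small sketch (hypothetical domain, JUnit 5 assumed) of a test that scores well on all four attributes:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical domain code, for self-containedness.
class Wallet {
    private double balance;

    Wallet(double openingBalance) {
        this.balance = openingBalance;
    }

    void deposit(double amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        balance += amount;
    }

    double balance() {
        return balance;
    }
}

class WalletTest {

    // 1. Protection against regressions: verifies a real outcome (the balance).
    // 2. Resistance to refactoring: touches only the public API, so internal
    //    restructuring cannot break it.
    // 3. Fast feedback: pure in-memory code, runs in microseconds.
    // 4. Maintainability: a few readable lines, named after the requirement.
    @Test
    void depositIncreasesBalanceByTheDepositedAmount() {
        Wallet wallet = new Wallet(100.0);

        wallet.deposit(25.0);

        assertEquals(125.0, wallet.balance(), 0.001);
    }
}
```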
But what about Code Coverage & other metrics? A careful reader will notice that the attributes of good unit tests do not mention any metric scores as criteria. Instead, by following the attributes above, we will naturally get high metric scores too - without ever chasing them.