Code Coverage vs Mutation Testing
Code Coverage can only measure the extent to which code is executed by the tests; it cannot measure whether our tests are asserting code behavior - welcome to Mutation Testing.
You're walking through a dark corridor.
… But you can't see anything.
There are HOLES almost EVERYWHERE.
… The holes exist, but you can NOT see them.
You turn on your torchlight.
… Now you CAN see the holes!
This is the difference between Code Coverage and Mutation Testing.
With Code Coverage:
You're walking in the dark
You can NOT see the holes in your test suite
With Mutation Testing:
You have a torchlight
You CAN see the holes in your test suite
Code Coverage is not enough
100% Code Coverage does NOT mean high test suite quality! Setting coverage goals for a team will backfire if the team does not have an adequate mindset (and skillset) in test automation.
Code Coverage can be dangerous
In my previous article Don't chase Code Coverage goals, I wrote about why chasing Code Coverage can be not just meaningless but dangerous.
A few days before that, I came across Bryan Finster’s post, which illustrated that theme even better - I was merely referencing an anecdotal story, whereas he referenced NISTIR 8397. I highly recommend that you read Bryan’s article 5 Minute DevOps: The Most Dangerous Metric, where he wrote:
The most dangerous metric goal I’ve found so far in software engineering is code coverage.
Now, I’m going to quote a whole chunk of text because I think the expression here is worth its weight in gold:
What’s code coverage? Simply, it’s the measure of the amount of code that is executed by test code. How could this be dangerous at all? Well, things like this happen
“All applications must have a minimum of 80% code coverage!”
A mandate like this is depressingly common in both the public and private sectors. If you ask why in the context of the Department of Defense, you’ll be pointed to NISTIR 8397 which has the following to say about code coverage in the Summary in section 2.0:
“Create code-based (structural) test cases. Add cases as necessary to reach at least 80 % coverage (2.7).”
That’s pretty specific. It’s also very curious. Code coverage is a side effect of good testing, not a driver of good tests. In fact, it’s been repeatedly demonstrated that imposing this metric as a measure of quality results in dangerously poor “tests”. So, why would NIST recommend this? I can only assume the authors believe people will read past the summary. Spoilers: they don’t.
Metrics are just signals, not goals
I wanted to share some insightful comments from my previous post on Code Coverage.
The problem of “focus” is well-summarized in this comment from Bryan Finster; indeed, the problem is not with metrics but rather with the focus:
There’s a difference between a team that’s focused on improving testing and management demanding a coverage goal. The former is good. The latter is destructive to quality.
Thanks also to Ariel Pérez for his comment, which explains the need to focus on outcomes and the use of metrics as signals to help us narrow our focus:
Management should be focusing on outcomes, not outputs.
The DORA Metrics are a good place to start when looking at how an engineering team is doing: Deployment Frequency, Lead Time, Mean Time to Recovery, and Change Failure Rate.
Customer-facing operational metrics, in particular those that result in FCIs (Failed Customer Interactions), are also very valuable: Latency, Availability, Error Rates, Mean Time Between Failures, Mean Time to Detect, etc.
I'd also look at Defect Escape Rates to let us know how good we're getting at catching bugs further left.
Lastly, these metrics don't exist in a vacuum. If your team's Lead Time is long, these can't tell you WHY. Often you'll find that it's not anything the engineering team is actually doing or can even control but rather, the system, processes, and environment around them that hampers them.
At the end of the day, they're just signals to help you narrow your focus and find where the bottlenecks are.
Test Suite & Maintenance Cost
Now, there isn’t anything wrong with the metric per se. The problem is that if management simply imposes Code Coverage, without any test mindset and without supporting the team in building a test skillset, then I can assure you that Code Coverage will lead to a WORSE state: the team will produce an unmaintainable test suite, which can be worse than having no test suite at all.
We’ve got two problems:
If we don’t have an automated test suite, then we’ll have a high maintenance cost (due to the cost of debugging and manual regression testing)
But if we have a BAD automated test suite, then we’ll also have a high maintenance cost (due to expensive test maintenance)
But there’s also good news: if we have a GOOD automated test suite, then we reduce our software maintenance cost.
I didn’t go into more detail regarding test maintenance (I’ll leave that for a future article, perhaps), but here’s one insightful comment from Kyle Griffin Aretae which summarizes how I see it too:
I lean towards even worse.
Tests require maintenance. More tests, more maintenance.… we're going to make unforeseen changes to the application over the next decade +/-. When we make changes we will then have to make changes to whatever automated tests we have.
… test maintainability is enormously important to TCO of the application, as well as cost of change, and please-to-thank-you lead time.
The problem with Code Coverage
100% Code Coverage does NOT mean high test suite quality.
Many teams treat Code Coverage as a goal, a measure of test suite quality. Managers use Code Coverage to track whether developers are writing tests.
This is a big mistake.
A team with no prior knowledge of unit testing was "forced" into writing unit tests with 100% code coverage - because a manager heard somewhere that code coverage is good and that 100% code coverage equals quality. So the team - who didn't have the skillset to write unit tests - simply went through the Code Coverage report and added code to execute every branch in every method in every class. (For the full version of the story, you can read my previous article Don't chase Code Coverage goals.)
So that team achieved 100% Code Coverage, because, yes, the code was getting executed. But there were zero assertions!
This is known as ASSERTION FREE TESTING.
It means writing a test that executes code BUT has ZERO assertions. Additionally, the code is wrapped in a try-catch block, so if an exception is thrown, it's swallowed by the test.
The "beauty" of this approach? Firstly, the test covers the code execution pathways - so by metrics such as line coverage and branch coverage, the developers can achieve 100% coverage. Secondly, the tests can't fail, for two reasons: there are no assertions AND no exceptions, thanks to the silent error handling in the tests.
Unfortunately, this has already happened to numerous teams. This is what happens when you enforce a metric without first changing the mindset and teaching the practices.
Often it's not intentional - no one intentionally writes bad tests. But faced with time pressure, deadlines, and an arbitrary measure set by management, this is where teams end up.
BUT is there any hope?
Mutation Testing comes to the rescue
If teams can essentially "cheat" with assertion-free testing, is there any way we can somehow measure the quality of our test suite?
Fortunately, there is a solution, and it's Mutation Testing. We need Mutation Testing because it covers both execution AND assertions.
With Mutation Testing, a mutator makes small changes to the code (each changed version is a "mutant") and checks whether any test registers the change and fails accordingly. If behavioral assertions exist, a test will register the change, and the mutant will be killed.
Zero surviving mutants is a good indicator of a quality test suite.
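Here's a sketch of how this plays out on the hypothetical PriceCalculator from above (the mutation shown is in the spirit of a math mutator, such as those used by PIT):

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

class PriceCalculatorBehaviorTest {

    // Original line in PriceCalculator:
    //     return price * (1 - discountRate);
    // A math mutator could produce the mutant:
    //     return price / (1 - discountRate);

    @Test
    void appliesTwentyPercentDiscount() {
        PriceCalculator calculator = new PriceCalculator();
        // Against the original code this passes (100 * 0.8 = 80.0).
        // Against the "/" mutant it fails (100 / 0.8 = 125.0),
        // so the mutant is killed.
        assertEquals(80.0, calculator.discountedPrice(100.0, 0.2), 0.001);
    }
}
```

Note that the assertion-free test from earlier also executes the mutated line - but since it asserts nothing, the mutant survives, and the Mutation Score exposes the hole.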
So if you're going to use a metric - use Mutation Testing.
If we’ve achieved a 100% Mutation Score, then there are no holes in the test suite - the code does not exhibit any behavior that isn’t covered by the tests. This means we have a 1-to-1 correspondence between the behaviors asserted through tests and the behaviors exhibited by the code.
On the other hand, if we have a lower Mutation Score, it means the code exhibits behaviors that are not asserted in any test. These “holes” mean that if a regression bug occurs in those parts of the code, the tests do not offer us any protection.
It should be noted that Mutation Testing has a much longer execution time than Code Coverage, so it makes sense to run Mutation Testing once we already have high Code Coverage. You may look at PIT (Pitest) for Java and Stryker.NET for .NET.
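For example (a sketch; the exact setup depends on your build configuration), PIT can be run through its Maven plugin, and Stryker.NET as a dotnet tool:

```
# Java: run PIT via its Maven plugin
mvn org.pitest:pitest-maven:mutationCoverage

# .NET: install and run Stryker.NET
dotnet tool install -g dotnet-stryker
dotnet stryker
```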
For more practical examples, Bryan shared with me this page, which shows how to “game” Code Coverage (Dojo Consortium). Also, a while ago, I wrote a quick-start article that shows the problems of Code Coverage, how they can be overcome with Mutation Testing, and how to run Mutation Testing in Java and .NET: Mutation Testing (Xtrem TDD).
Code Coverage vs Mutation Testing
Code Coverage is a useful negative indicator. This means that if you have a low coverage score, it indicates that you have a low-quality test suite.
However, Code Coverage is NOT a useful positive indicator. This means that even if you have a high coverage score, it does NOT indicate a high-quality test suite. The reason is that Code Coverage only evaluates whether code was executed, but not whether there were any assertions (or adequate assertions).
Mutation Testing helps overcome Code Coverage problems by covering both execution and assertions. This means it can detect holes in your test suite - the missing tests.
Mutation Score is a better metric than Code Coverage - because Code Coverage covers only execution, whereas Mutation Testing covers both execution and assertions. Mutation Testing is currently the “best” metric for measuring test suite quality.
We saw that 100% Code Coverage didn’t mean we had a good test suite - indeed, we can get 100% Code Coverage (because all the method branches are being executed) but 0% Mutation Score (because there are zero assertions).
Code Coverage detects only execution gaps. If some behavior was implemented in code but is not executed by the test suite, then we can detect this problem through Code Coverage. It can only tell us what percentage of the code was executed. It doesn’t cover assertions! This means we can write tests with zero assertions (meaningless tests!), but Code Coverage can’t detect this - we can still get a 100% coverage score!
Mutation Testing detects both execution and assertion gaps. If some behavior was implemented in code and executed by the test suite, but there were no assertions or the assertions were inadequate, then we can detect this problem through Mutation Testing. So, we can’t “cheat” by writing zero-assertion tests. Indeed, Mutation Testing detects gaps between our test suite and the code - if our code has a certain behavior that is not covered by the test suite, then we will get a lower Mutation Score - which is great; it tells us we have a gap in the test suite.
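The "inadequate assertions" case is worth a quick illustration (again a hypothetical sketch against the PriceCalculator from earlier): the test below executes the code and even asserts something, yet the assertion is too weak to kill the mutant:

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

class PriceCalculatorWeakTest {
    @Test
    void inadequateAssertion() {
        PriceCalculator calculator = new PriceCalculator();
        double result = calculator.discountedPrice(100.0, 0.2);
        // Too weak: passes for 80.0 (the correct result) and
        // also for 125.0 (the "/" mutant) - the mutant survives.
        assertTrue(result > 0);
    }
}
```

Code Coverage is satisfied, and the test even has an assertion, but the Mutation Score correctly reports a hole.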
So, should we chase a 100% Mutation Score?
Previously, we were chasing 100% Code Coverage, and we could see that it was the wrong goal.
So then, should we chase a 100% Mutation Score because it’s a better metric than Code Coverage? After all, Code Coverage can only evaluate execution, whereas Mutation Score includes both execution and assertions.
But no, once again, we should NOT chase metrics.
Mutation Score is a great metric; it is a much better measure than Code Coverage. But Mutation Testing isn’t the end of the story either.
The next question is: does it matter whether we work in a Test First or a Test Last approach? Does it affect the Mutation Score? Read A Model for Test Suite Quality to find out how TDD vs TLD affects the types of holes in our test suite.