A Model for Test Suite Quality
Test Suite Quality is critical in providing us protection against regression bugs. But is our Test Suite good enough? What are the types of holes we can get in Test Suites, and how do they arise?
In the previous posts, we discovered that Mutation Testing can detect “holes“ in our test suite. Unlike Code Coverage, which only looks at holes in execution, Mutation Testing can identify both holes in execution and assertion. Make sure you’ve read Code Coverage vs Mutation Testing before proceeding further.
I now want to present a model for classifying holes and the root causes of holes in our test suites. This model is based on the types of errors I’ve seen in practice and in the literature.
Test Suite Quality - Can we trust our tests?
Can we trust our Test Suite? One important criterion for Test Suite Quality is whether our test suite can detect errors in our code. If we introduce an error into our codebase (i.e., we introduce a mutant), will the test suite register the error (some test(s) start failing, and thus the mutant is successfully detected)?
A high mutation score assures us that we can trust our test suite; it’s protecting us against regression bugs.
A low mutation score indicates that we’ve got holes in our test suite, which means when we introduce errors into some parts of our codebase, our test suite will not register the errors!
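To make this concrete, here is a minimal sketch (Python, chosen just for illustration; the function and tests are hypothetical) of how a single mutant separates a trustworthy test from a weak one:

```python
# Production code: decide whether an order qualifies for free shipping.
def qualifies_for_free_shipping(total: float) -> bool:
    return total >= 100.0

# A mutation testing tool would, for example, flip ">=" into ">":
#     return total > 100.0      # <- the mutant

# This test executes the code but misses the boundary, so the mutant SURVIVES:
# a hole in the test suite.
def test_large_order_ships_free():
    assert qualifies_for_free_shipping(250.0)

# This test pins the boundary value, so the mutant is KILLED (the test fails
# against the mutated code). Tests like this one raise the mutation score.
def test_order_of_exactly_100_ships_free():
    assert qualifies_for_free_shipping(100.0)
```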
But what about Codebase Structural Quality? That’s out of scope; it’s an orthogonal dimension. A high Mutation Score means our Test Suite can be trusted, but it does NOT measure the structural quality of our source code - the code could be well-written or poorly written; that’s beyond the topic of Test Suite Quality.
Test First vs Test Last? Incremental or Batch?
The question now is, does it really matter if we write our tests before or after code? Does the increment size matter - incremental versus batch-based?
We compare based on two dimensions:
Test First vs Test Last
Test First means writing test(s) before code
Test Last means writing test(s) after code
Incremental vs Batch
Incremental means we’re working one test at a time
Batch means we’re working with multiple tests at a time
So we end up with these combinations:
Test First (Incremental)
Test First (Batch)
Test Last (Incremental)
Test Last (Batch)
But where’s TDD? Where’s Refactoring?
You’ve probably noticed that I didn’t mention TDD anywhere above.
The closest you’ll see is “Test First (Incremental)” above, which could be seen as corresponding to just Red-Green steps, but there’s no Refactor step at all.
In all the above, regardless of Test First or Last, I didn’t specify whether we’re writing just any working code to make the test pass (and then refactoring later) or trying to make the code clean right away.
The purpose of this article is only to cover Test Suite Quality - can we trust our tests to protect us against erroneous changes in our codebase? We are NOT covering Codebase Quality - at this point, here we don’t care if the codebase is “clean” or “ugly”.
What are the root causes of holes in Test Suites?
The “happy” case (no hole): our test is actually asserting the expected behavior, and the code is not exhibiting any additional behaviors beyond the expected ones. So when that test passes, the actual behavior matches the expected behavior. We’re in a balanced state, an equilibrium. All mutants would be killed; we’d get a 100% Mutation Score.
However, what if we have holes in our test suite? Well, then we’d get a < 100% Mutation Score because of them. A hole means there’s a discrepancy between the behaviors exhibited by the codebase and the behaviors asserted by the test suite. Let’s look at the two types of holes; I’ll call them Type-U and Type-O:
TYPE-U Hole: The expected behavior is exhibited by the codebase but is not asserted in the test. The test passes, but it isn’t asserting that expected behavior - it may have zero assertions, may not be fully asserting the behavior, or may be asserting something else. I’ll say that the test is “under”-representing the expected behavior. This introduces a hole.
TYPE-O Hole: The expected behavior is implemented in the codebase and is asserted in the test. However, we accidentally implemented some more logic in our codebase beyond the expected logic, so this additional logic is only in the codebase and, of course, not in the test (because it was not expected!). I’ll say that we “over”-implemented in our codebase; we introduced additional behaviors beyond the expected ones.
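Here is a tiny sketch of both hole types side by side (hypothetical Python; the names are mine, not from any real codebase):

```python
# Production code. The EXPECTED behavior is only: reject empty usernames.
def validate_username(name: str) -> bool:
    if name.strip() == "":
        return False
    if len(name) > 20:   # extra rule nobody asked for and no test asserts:
        return False     # a TYPE-O hole (over-implementation)
    return True

# TYPE-U hole: the test executes the code but never asserts the expected
# behavior, so it stays green no matter what validate_username returns.
def test_empty_username_is_rejected():
    validate_username("")   # zero assertions
```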
So how do we get to these holes?
Model Summary
Test First (Incremental)
Write a failing test to specify the new behavior. To write a failing test means we are “forced“ to write a test with assertions (this prevents Assertion Free Testing!). Furthermore, our assertion should fail, proving that the source code does not currently satisfy the expected behavior. If we see that our test is (unexpectedly) passing, we have to re-check and fix our test until we get a failing test:
Write a test for one expected behavior
Run the test to verify that it fails
Write just enough code to satisfy the expected behavior
Run the test to see if it passes
✅ We implemented just enough behavior and NOT more. We wrote enough to pass the test without introducing any new behaviors beyond the test. This means actual behavior and expected behavior are equivalent. No gaps in the test suite.
⚠️We implemented MORE behavior in our code than was needed to pass the test. This means that we’ve created a hole in our test suite because now our code exhibits behavior not covered by the test suite.
❓We don’t get feedback about whether or not we were “minimalist“ enough during our implementation. It requires discipline. But we’ll find out during Mutation Testing.
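For example, one pass through this cycle might look roughly like this (a hypothetical Python sketch; assume an earlier test already pins the standard shipping rate):

```python
# Steps 1-2: one failing test for one new expected behavior (VIP orders ship free).
def test_vip_orders_ship_for_free():
    assert shipping_cost(customer_type="vip", weight_kg=3) == 0

# Step 3: just enough code to make that test pass - and nothing more.
# (An earlier test, not shown, is assumed to already cover the 5-per-kg standard rate.)
def shipping_cost(customer_type: str, weight_kg: int) -> int:
    if customer_type == "vip":
        return 0
    return 5 * weight_kg

# ⚠️ Had we "helpfully" also added an express-shipping surcharge nobody asked for,
# that branch would exist only in the code, not in any test: a Type-O hole that
# the passing test above cannot reveal, but Mutation Testing later would.
```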
Test First (Batch)
We write all the failing tests first; these tests specify several new behaviors - we write a specification of new behaviors in bulk. As in the case above, the failing tests assure us that the tests themselves have assertions. (However, we could have a problem with the tests: there could be unnecessary, duplicated tests among the new ones, and we can’t spot this right now because we see multiple failures at once.)
Write tests for expected behaviors
Run the tests to verify that they fail
Write some code to satisfy the expected behaviors
Run the tests to see them pass
✅ We implemented just enough code to pass the new tests and not more; we didn’t introduce any more behaviors. This means actual behaviors and expected behaviors are equivalent. No gaps in the test suite.
⚠️We implemented MORE behaviors in our code than was needed to pass the tests. This means that we’ve created holes in our test suite. This is the same problem as in the Incremental version, except here it’s much more likely!
❓We don’t get feedback about whether or not we were “minimalist” enough during our implementation. But we’ll find out during Mutation Testing. Quite likely, passing multiple tests requires multiple code changes, so there is a higher probability here (compared to the incremental version) that we have accidentally introduced new behaviors into the codebase.
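A sketch of how the batch variant can go wrong (again hypothetical Python; the “cap” behavior is invented purely to illustrate a Type-O hole slipping into a bulk implementation):

```python
# Steps 1-2: several failing tests written up front, specifying behaviors in bulk.
def test_empty_cart_totals_zero():
    assert cart_total([]) == 0

def test_items_are_summed():
    assert cart_total([3, 7]) == 10

def test_totals_are_never_negative():
    assert cart_total([5, -20]) == 0

# Step 3: a bulk implementation that makes all three pass. While writing it we
# also silently capped totals at 100 ("sounds sensible") - behavior no test
# asked for, so every mutant inside that branch will survive.
def cart_total(items):
    total = max(0, sum(items))
    if total > 100:      # Type-O hole: unexpected, uncovered behavior
        total = 100
    return total
```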
Test Last (Incremental)
Implement some new behavior based on the expected behavior we have in mind. Then write a test that retroactively specifies those expectations.
Write some code to satisfy the expected behavior
Write a test that expresses the expected behavior
Run the test to see if it passes
✅ We wrote a test that exactly asserts the new behavior we implemented, and we didn’t make any “mistakes” whilst writing the test. This means actual behavior and expected behavior are equivalent.
⚠️We could have accidentally implemented additional behaviors in our codebase beyond the expected behaviors, so we have holes in our test suite now. This is the same problem type that can occur in Test First (Incremental).
⚠️ We could have written a test that isn’t asserting the expected behavior. For example, we wrote a test that doesn’t execute the code behavior. Or, the test is executing the code, but it has zero assertions. Or we have assertions that do not verify the new behavior. This means we’ve created holes in our test suite.
❓We don’t get feedback on whether or not our test is asserting the new behavior. We could get feedback by commenting out the modified code, watching our test fail (which proves the test is indeed asserting the new behavior, now absent from the codebase), then uncommenting the modified code and watching the test pass. This is tiresome in real life; no one would actually do it. So we’ll just wait until we run Mutation Testing (but then again, Mutation Testing is expensive time-wise, so we wouldn’t run it frequently; any feedback here is therefore delayed).
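Here is a sketch of how such a retroactive test can quietly under-assert (hypothetical Python; `normalize_email` is an invented example):

```python
# New behavior, implemented first: emails are normalized to lower case and trimmed.
def normalize_email(address: str) -> str:
    return address.strip().lower()

# Retroactive test, written in a hurry. It runs green - but it would also run
# green if normalize_email returned the input unchanged, because the input is
# already trimmed and lower case. A Type-U hole.
def test_normalize_email():
    assert normalize_email("alice@example.com") == "alice@example.com"

# A mutant that deletes ".lower()" survives; only a test such as
#     assert normalize_email("  Alice@Example.COM ") == "alice@example.com"
# would kill it.
```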
Test Last (Batch)
Implement multiple new behaviors based on expectations of behavior that are in our minds. Write tests that retroactively specify the expectations.
Write some code to satisfy the expected behaviors
Write tests that express the expected behaviors
Run the tests to see them pass
✅ We wrote tests that asserted the new behaviors we implemented and didn’t make any mistakes while writing the tests. This means actual behavior and expected behavior are equivalent. No gaps in the test suite.
⚠️We could have accidentally implemented additional behaviors in our codebase beyond the expected behaviors, so we have holes in our test suite now. This is the same problem as described above in the Test Last (Incremental) version, except now there is a higher likelihood due to the larger affected area of the codebase.
⚠️ We could have made a mistake when writing our tests. For example, we wrote a test that doesn’t execute the code behavior. Or the test is executing the code, but it has zero assertions. Or we have assertions that do not verify the new behavior. This is the same problem as described above in the Test Last (Incremental) version, and it means we’ve created holes in our test suite.
❓We don’t get feedback on whether or not our tests are asserting the new behaviors. The only way we could get that feedback is as follows: comment out all the new code, watch all the tests fail, then uncomment a part of the code, see a test pass, then uncomment another part of the code, and so on, until we’ve seen the tests go green one by one. This would be tiresome, so no one would do it - but I wrote it out to illustrate what it takes to be “sure”.
The synthesis
In Test First approaches, holes occur because we don’t follow minimalism whilst implementing code; we’ve implemented more behavior in our code than was needed to pass the test(s). Therefore, we’ve introduced new behaviors in our code that aren’t covered by the tests; therefore, we have holes in our test suite.
In Test Last approaches, holes occur because the test(s) we wrote after the code either didn’t execute the new code behavior, or they executed it but didn’t assert it at all, or the tests do have assertions but are under-asserting. Therefore, there exist behaviors in our code that are not covered by the tests; therefore, we have holes in our test suite.
In the incremental approach, there is a very small delta between the expected and actual behavioral states. Since we’re moving in smaller change deltas, there is a higher probability that we’ll keep the two states in sync.
In the batch approach, there is a very large delta between the expected and actual behavioral states. Since we’ve moved in larger change deltas, there is a much lower probability that we’ll be able to keep these two states in sync; thus, we’re highly likely to have a much larger gap.
Epilogue
In the post above, I mentioned that all approaches carry a risk of holes. Type-U holes can be avoided by adopting Test First approaches. But how do we avoid Type-O holes? Thanks to Ariel Pérez for the comment:
Michaël Azerhad's Tech Excellence talk on Refactoring vs Transformation goes a long way in helping engineers avoid Type-O holes.
In short, this is where TPP (https://blog.cleancoder.com/uncle-bob/2013/05/27/TheTransformationPriorityPremise.html) helps you write just enough code without inadvertently introducing new behaviors.
As you and I have discussed, another practice that can help you fill in gaps in behavioral coverage (once identified) is Property Testing (https://techbeacon.com/app-dev-testing/how-make-your-code-bulletproof-property-testing).
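To make the Property Testing suggestion concrete, here is a minimal sketch using the `hypothesis` library (the `cart_total` function is a hypothetical example; this is just one way to pin a behavior over many generated inputs, not a prescription):

```python
from hypothesis import given, strategies as st

def cart_total(items):
    return max(0, sum(items))

# Instead of a few hand-picked examples, assert a rule over many generated inputs.
@given(st.lists(st.integers(min_value=-1000, max_value=1000)))
def test_cart_total_is_never_negative(items):
    assert cart_total(items) >= 0

@given(st.lists(st.integers(min_value=0, max_value=1000)))
def test_total_of_non_negative_items_is_their_sum(items):
    assert cart_total(items) == sum(items)
```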
Ok, we’ve seen that the foundational characteristic of a quality test suite is that we can trust the test suite; it protects against regression bugs. We’ve also seen the root causes of holes in our test suite and that we can detect holes through mutation testing.
Why are TDD, 100% Code Coverage & a 100% Mutation Score NOT enough?
Let’s say we’ve achieved all this:
We’re practicing TDD
We have 100% code coverage
We have 100% mutation score
Suppose, additionally, that we’re also following the Test Pyramid - the majority are Unit Tests, followed by some Integration Tests and Contract Tests, and a very small number of e2e Tests.
Have we reached the mountain top?
No. Even if we’ve achieved all of the above, there is one big problem that can’t be “detected” by any of it…
The problem is the coupling between the test suite and the codebase.
Many teams face this problem: they may consider themselves successful TDD practitioners with great metric scores, yet they suffer high test suite maintenance costs due to high test suite coupling! Often they might not even be aware they have a problem, but it manifests as follows:
The test suite is expensive to maintain - economically wasteful!
The test suite hinders refactoring - because the tests break during refactoring!
The test suite freezes architecture - preventing design improvements!
The difficulty is that neither TDD, nor Code Coverage, nor Mutation Testing can protect you from the above, because the root cause of the problem is the level of coupling between the test suite and the codebase. We discuss it in Critique #4: Is Unit Testing Harmful?
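To hint at what that coupling looks like, here is a small sketch (hypothetical Python; the `Invoice` class is invented) contrasting a test coupled to structure with one coupled only to behavior:

```python
from unittest.mock import patch

class Invoice:
    def __init__(self, lines):
        self.lines = lines

    def _sum_lines(self):
        return sum(self.lines)

    def total(self):
        return self._sum_lines()

# Structure-coupled test: it pins HOW the result is produced. Rename or inline
# the private helper during a refactoring and this test breaks, even though the
# observable behavior is unchanged.
def test_total_calls_the_internal_helper():
    with patch.object(Invoice, "_sum_lines", return_value=42) as helper:
        assert Invoice([40, 2]).total() == 42
        helper.assert_called_once()

# Behavior-coupled test: it only pins WHAT the result is, so the internals are
# free to change shape underneath it.
def test_total_is_the_sum_of_line_amounts():
    assert Invoice([40, 2]).total() == 42
```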