Flaws in coverage measurement

Tuesday 30 October 2007

This is more than 17 years old. Be careful.

Coverage testing is a great way to find out what parts of your code are not tested by your test suite. You turn on coverage.py, then run your tests. At the end, coverage can show you which lines were never executed, either by line number or visually in an annotated source file.

When your test coverage is less than 100%, coverage testing works well: it points you to the lines in your code that are never run, showing the way to new tests to write. The ultimate goal, of course, is to get your test coverage to 100%.

But then you have problems, because 100% test coverage doesn’t really mean much. There are dozens of ways your code or your tests could still be broken, but now you aren’t getting any directions. The measurement coverage.py provides is more accurately called statement coverage, because it tells you which statements were executed. Statement coverage testing has taken you to the end of its road, and the bad news is, you aren’t at your destination, but you’ve run out of road.
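To make "statement coverage" concrete, here is a minimal sketch of a line tracer built on the standard library's sys.settrace. It is only an illustration of the idea, not a description of how coverage.py is implemented; measure_lines and sign are hypothetical names made up for this example.

```python
import sys

def measure_lines(func, *args):
    """Run func(*args) and return the set of line offsets executed inside it.

    Offsets are relative to the 'def' line (offset 0), so the result does not
    depend on where the function sits in the file.
    """
    code = func.__code__
    executed = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function being measured.
        if frame.f_code is not code:
            return None
        if event == "line":
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        # Restore whatever tracer (if any) was active before.
        sys.settrace(previous)
    return executed


def sign(n):
    if n < 0:
        return "negative"
    return "non-negative"


print(sorted(measure_lines(sign, 5)))    # [1, 3]: offset 2 never ran
print(sorted(measure_lines(sign, -5)))   # [1, 2]: offset 3 never ran
```

All the tracer records is "this line ran at least once." It knows nothing about which combinations of lines ran together, or whether anyone checked the result, which is exactly the limitation the examples below exploit.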

By way of illustration, here are a few examples of 100% statement coverage of buggy code.

Combinations of paths

With multiple branches in a function, there may be combinations that aren’t tested, even though each individual line is covered by a test:

def two_branches(a, b):
    if a:
        d = 0
    else:
        d = 2
        
    if b:
        x = 2/d
    else:
        x = d/2
        
    return x

# These tests give 100% coverage:
two_branches(False, False) == 1
two_branches(True, False) == 0
two_branches(False, True) == 1

# This test fails with a ZeroDivisionError:
two_branches(True, True)
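When the inputs are just a few flags, one way to close the gap is to try every combination rather than hand-picking cases. The sketch below assumes Python 3 (true division) and uses a hypothetical helper, failing_combinations, that is not part of coverage.py or any test framework.

```python
from itertools import product

def two_branches(a, b):
    if a:
        d = 0
    else:
        d = 2

    if b:
        x = 2 / d
    else:
        x = d / 2

    return x

def failing_combinations(func, arity):
    """Call func with every True/False combination; collect those that raise."""
    failures = []
    for args in product([False, True], repeat=arity):
        try:
            func(*args)
        except Exception as exc:
            failures.append((args, type(exc).__name__))
    return failures

print(failing_combinations(two_branches, 2))
# [((True, True), 'ZeroDivisionError')]
```

This only scales to a handful of boolean inputs, since the number of combinations doubles with each one, but for small functions it finds the path that line-by-line coverage missed.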

Loops can have similar issues:

def loop_paths(a):
    while a:
        x = 1
        a -= 1
    return x

# This test gives 100% coverage:
loop_paths(1) == 1

# This test fails with a NameError (specifically, an UnboundLocalError):
loop_paths(0)
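A common habit that catches this kind of bug is to run every loop zero times, once, and more than once. Here is a small sketch of that idea; check_iteration_counts is a hypothetical helper written for this example.

```python
def loop_paths(a):
    while a:
        x = 1
        a -= 1
    return x

def check_iteration_counts(func, counts=(0, 1, 2, 5)):
    """Run func for each iteration count; record the result or the error name."""
    outcomes = {}
    for n in counts:
        try:
            outcomes[n] = func(n)
        except NameError as exc:  # UnboundLocalError is a subclass of NameError
            outcomes[n] = type(exc).__name__
    return outcomes

print(check_iteration_counts(loop_paths))
# {0: 'UnboundLocalError', 1: 1, 2: 1, 5: 1}
```

The zero-iteration case is the one statement coverage cannot point you to: every line in the function has already run, just never in that order.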

Data-driven code

You can often simplify a function by putting complexity into data tables, but there’s no way to measure which parts of a data structure were used:

divisors = {
    'x': 1,
    'y': 0,
}

def data_driven(thing):
    return 2/divisors.get(thing)
    
# This test gives 100% coverage:
data_driven('x') == 2

# This test fails with a ZeroDivisionError:
data_driven('y')
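Since the coverage tool can't see into the table, one workaround is to let the table drive the tests too: call the function once for every key. This is a minimal sketch assuming Python 3; bad_entries is a hypothetical helper, not an existing API.

```python
divisors = {
    'x': 1,
    'y': 0,
}

def data_driven(thing):
    return 2 / divisors.get(thing)

def bad_entries(table, func):
    """Call func once per key in table and return the keys that raise."""
    bad = []
    for key in table:
        try:
            func(key)
        except Exception as exc:
            bad.append((key, type(exc).__name__))
    return bad

print(bad_entries(divisors, data_driven))
# [('y', 'ZeroDivisionError')]
```

This exercises every entry that exists, but it still says nothing about keys that are missing from the table, for which divisors.get returns None.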

Hidden conditionals

Real code often contains implied conditionals that don’t live on a separate line to be measured:

def implied_conditional(a):
    if (a % 2 == 0) or (a % 0 == 0):
        print("Special case")
    return a+2

# 100% coverage:
implied_conditional(0) == 2
implied_conditional(2) == 4

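Although we have 100% coverage, we never found out that, due to a typo, the second condition in the if statement (a % 0 == 0) will divide by zero. Both test inputs are even, so the or short-circuits and the second condition is never evaluated.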
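One way to make this kind of hidden branch visible to a statement-coverage tool is to unroll the or so that each condition sits on its own line. The rewrite below is only a sketch; explicit_conditional is a hypothetical name, and the typo is kept on purpose.

```python
def explicit_conditional(a):
    # Same logic as implied_conditional, one condition per line.
    special = False
    if a % 2 == 0:
        special = True
    elif a % 0 == 0:      # the typo, kept on purpose
        special = True
    if special:
        print("Special case")
    return a + 2

# With only even inputs, the elif line is never reached, and a
# statement-coverage report would now show it as unexecuted.
explicit_conditional(0)
explicit_conditional(2)

# An odd input reaches the elif and exposes the bug.
try:
    explicit_conditional(1)
except ZeroDivisionError as exc:
    print("odd input failed:", type(exc).__name__)
```

Nobody wants to write all their code this way, of course. The point is only that the branch was always there; the compact form just kept it off the coverage report.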

Conditionals can also be hidden inside functions that aren’t being measured in the first place.

def fix_url(u):
    # If we're an https url, make it http.
    return u.replace('https://', 'xyzzyWRONG:')
    
# 100% coverage:
fix_url('http://foo.com') == 'http://foo.com'

The replace method here is essentially a big if statement on the condition that the string contains the substring being replaced. Our test never takes that path, but the if is hidden from us, so our coverage testing doesn’t help us find the missed coverage.
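The practical defense is to think about both sides of the hidden if and write a case for each. The sketch below uses a small table of inputs and expected outputs; the cases dictionary and its expected values are assumptions based on the comment in fix_url.

```python
def fix_url(u):
    # If we're an https url, make it http.
    return u.replace('https://', 'xyzzyWRONG:')

# Expected results, assuming the comment in fix_url describes the intent.
cases = {
    'http://foo.com': 'http://foo.com',    # substring absent: replace() changes nothing
    'https://foo.com': 'http://foo.com',   # substring present: the hidden branch
}

failures = {u: fix_url(u) for u, want in cases.items() if fix_url(u) != want}
print(failures)
# {'https://foo.com': 'xyzzyWRONG:foo.com'}
```

Coverage is 100% with or without the second case, so the tool gives no hint that it matters.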

Incomplete tests

Just because your tests execute the code doesn’t mean they properly test the results.

def my_awesome_sort(l):
    # Magic mumbo-jumbo that will sort the list (NOT!)
    l.reverse()
    return l
    
# 100% code coverage!
l = [4,2,5,3,1]
type(my_awesome_sort(l)) == list
len(my_awesome_sort(l)) == 5
my_awesome_sort(l)[0] == 1

Here our “sort” routine passes all the tests, and the coverage is 100%. But, oops, we forgot to check that the list returned is really sorted.
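A stronger test checks the properties that actually define a sort: the output is in ascending order, and it contains the same items as the input. Here is a minimal sketch; is_correct_sort is a hypothetical helper written for this example.

```python
from collections import Counter

def my_awesome_sort(l):
    # Magic mumbo-jumbo that will sort the list (NOT!)
    l.reverse()
    return l

def is_correct_sort(original, result):
    """True if result is an ascending rearrangement of original."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    same_items = Counter(original) == Counter(result)
    return ordered and same_items

data = [4, 2, 5, 3, 1]
print(is_correct_sort(data, my_awesome_sort(list(data))))  # False
print(is_correct_sort(data, sorted(data)))                 # True
```

The coverage number is identical for the weak tests and the strong one. Only the assertions changed.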

Real world

Of course, these examples are absurd. It’s easy to see where we went wrong in each of them. Most likely, though, your tests have the same underlying problems, but in ways that are much more difficult to find.

Improved tools could help some of these cases, but not all. Some C-based tools provide branch analysis that could help with the path problems above. But no tool can guarantee there aren’t path problems (what if a loop works incorrectly if executed a prime number of times?), and no tool will point out that your tests aren’t checking the important things about results.

For more on the problems of coverage testing, the Wikipedia article on Code Coverage has a number of fine jumping-off points. Cem Kaner has a depressingly exhaustive overview of the Measurement of the Extent of Testing. After perusing it, you may wonder why you bother with puny statement coverage testing at all!

Statement coverage testing is a good measure of what isn’t being tested in your code. It’s a good start for understanding the completeness of your tests. Brian Marick’s How to Misuse Code Coverage sums it up best: “Coverage tools are only helpful if they’re used to enhance thought, not replace it.”

Comments

[gravatar]
And that's why MC/DC is recommended in DO-178B as a measure of test effectiveness for safety-critical aerospace software systems.

It's also why unit testing has always had an aura of menace and horror about it for me - exhaustive testing can be a bit tedious.
[gravatar]
Some really nice, well-thought-out examples. Kudos.
[gravatar]
Is this where I report bugs?

On my project, I put the test cases for foo.py in foo.test.py. I don't know why I do this; it is just what my predecessor did.

In this case, in the HTML index page, clicking on the foo.py line takes you to the foo.test.py coverage instead.

If I rename the test cases to foo_test.py, it functions perfectly, and is quite an eye-opener to the amount of code never executed.

(This bug-report can be filed under "If it hurts when you do that, then don't do that," but I thought I would let you know.)

Thanks for a very useful utility.
[gravatar]
@Julian: much better is to report them on bitbucket or the Testing in Python mailing list, but I've created a bug report for this one: http://bitbucket.org/ned/coveragepy/issue/46/footestpy-confuses-html-reporting
[gravatar]
The "Measurement of the Extent of Testing" appears to have moved to a PDF only form on http://www.kaner.com/pdfs/pnsqc00.pdf ...
[gravatar]
"there's no way to measure which parts of a data structure were used"

True. Moreover, there's no way to measure which variables are used at all. I was concerned that I was spending effort creating and updating a number of unneeded self. variables. Coverage.py told me that I am most certainly executing all code that does so. :-) But it would be nice to eliminate that code if the variables are not used.

No disparagement implied. Great program and great examples of "things that can go wrong." But, in case someone knows, is there a Py tool that will flag unused variables -- and perhaps even unused portions of data structures?
[gravatar]
Hopfrog,

Running a syntax checker (such as Pylint or Flake8) can help you find unused variables.

I use an Emacs plugin, flymake, that highlights lines with syntax errors, including unused variables/imports, bad indentation, and really anything that goes against PEP8 standards.
[gravatar]
There is a tool called instrumental that can help with the problems described in this article: http://instrumental.readthedocs.org/en/latest/
[gravatar]
I believe another example of a limitation of coverage testing is lambdas. The body of a lambda is not treated as a "line" for coverage purposes. Coverage is noted when the lambda is defined, but the body of the lambda may never be executed, so an error embedded in a lambda may not be found despite "100%" (in lines) test coverage.
