Who Tests What

Saturday 10 December 2016

The next big feature for coverage.py is what I informally call “Who Tests What.” People want to know not just which lines were covered by the tests, but also which tests covered which lines.

This idea/request is not new: it was first suggested over four years ago as issue 170, and two other issues (#185 and #311) have been closed as duplicates. It’s a big job, but people keep asking for it, so maybe it’s time.

There are a number of challenges. I’ll explain them here, and lay out some options and questions. If you have opinions, answers, or energy to help, get in touch.

First, it’s important to understand that coverage.py works in two main phases, with an optional phase in the middle:

  • The first phase is measurement, where your test suite runs. Coverage.py notes which code was executed, and collects that information in memory. At the end of the run, that data is written to a file.
  • If you are combining data from a number of test runs, perhaps for multiple versions of Python, then there’s an optional combination phase. Multiple coverage data files are combined into one data file.
  • The reporting phase is where your project is analyzed to understand what code could have run, and the data files are read to understand what code was run. The difference between the two is the code that did not run. That information is reported in some useful way: HTML, XML, or textually.
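The core of the reporting phase is a set difference. Here is a minimal sketch with made-up data; the dictionaries and the `missing_lines` helper are hypothetical illustrations, not coverage.py's real data structures or API.

```python
# Hypothetical data for illustration only.
# Lines that could have run, as found by analyzing the source.
executable = {"app.py": {1, 2, 3, 5, 8, 9}}
# Lines that did run, as read from the data file.
executed = {"app.py": {1, 2, 3, 8}}


def missing_lines(executable, executed):
    """Return, per file, the sorted lines that could have run but did not."""
    return {
        filename: sorted(lines - executed.get(filename, set()))
        for filename, lines in executable.items()
    }


report = missing_lines(executable, executed)
print(report)
```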

OK, let’s talk about what has to be done...


The measurement phase has to collect and record the data about what ran.

What is Who?

At the heart of “Who Tests What” is the Who. Usually people want to know what tests run each line of code, so during measurement we need to figure out what test is being run.

I can see two ways to identify the test being run: either coverage.py figures it out by examining function names being run for “test_*” patterns, or the test runner tells coverage.py when each test starts.
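As a rough illustration of the first option, here is a sketch that walks up the call stack looking for a function whose name starts with “test_”. The `guess_context` name is hypothetical, and `sys._getframe` is a CPython implementation detail, so treat this as an assumption-laden sketch rather than how coverage.py would actually do it.

```python
import sys


def guess_context(prefix="test_"):
    """Return the name of the nearest calling function that looks like a test.

    Returns None if no function on the stack matches the prefix.
    """
    frame = sys._getframe(1)  # start at our caller (CPython-specific)
    while frame is not None:
        name = frame.f_code.co_name
        if name.startswith(prefix):
            return name
        frame = frame.f_back
    return None


def helper():
    # Product code, called from a test.
    return guess_context()


def test_addition():
    return helper()


print(test_addition())
```

Doing this walk for every executed line would be expensive, which is one reason the second option (the test runner announcing each test) is attractive.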

But I think the fully general way to approach Who Tests What is to not assume that Who means “which test.” There are other uses for this feature, so instead of hard-coding it to “test”, I’m thinking in terms of the more general concept of “context.” Often, the context would be “the current test,” but maybe you’re only interested in “Python version”, or “subsystem,” or “unit/integration/load.”

So the question is, how to know when contexts begin and end? Clearly with this general an idea, coverage.py can’t know. Coverage.py already has a plugin mechanism, so it seems like we should allow a plugin to determine the boundaries of contexts. Coverage.py can provide a plugin implementation that suffices for most people.

A context will be a string, and each different context will have its own collected coverage data. In the discussion on issue 170, you can see people suggesting that we collect an entire stack trace for each line executed. This seems to me to be enormously more bulky to collect, more difficult to make use of, and ultimately not as flexible as simply noting a string context.

There might be interesting things you can glean from that compendium of stack traces. I’d like to hear from you if you have ideas of things to do with stack traces that you can’t do with contexts.

Another minor point: what should be done with code executed before any context is established? I guess a None context would be good enough.
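To make the idea concrete, here is a minimal sketch of per-context collection, with None as the context before any other is established. The `ContextRecorder` class and its method names are hypothetical, not part of coverage.py.

```python
from collections import defaultdict


class ContextRecorder:
    """Collect executed lines separately for each string context."""

    def __init__(self):
        self.context = None  # nothing established yet
        # context -> filename -> set of executed line numbers
        self.data = defaultdict(lambda: defaultdict(set))

    def switch_context(self, context):
        self.context = context

    def record(self, filename, lineno):
        self.data[self.context][filename].add(lineno)


recorder = ContextRecorder()
recorder.record("app.py", 1)           # import-time code, before any test
recorder.switch_context("test_one")
recorder.record("app.py", 10)
recorder.switch_context("test_two")
recorder.record("app.py", 10)
recorder.record("app.py", 11)

for context, files in recorder.data.items():
    print(context, {name: sorted(lines) for name, lines in files.items()})
```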

Storing data

Having multiple contexts will multiply the amount of data to be stored. It’s hard to guess how much more, since that will depend on how overlapping your contexts are. My crude first guess is that large projects would have roughly C/4 times more data, where C is the number of contexts. If you have 500 tests in your test suite, you might need to manage 100 to 200 times more data, which could be a real problem.

Recording the data on disk isn’t a show-stopper, but keeping the data in memory might be. Today coverage.py keeps everything in memory until the end of the process, then writes it all to disk. Q: Will we need something more sophisticated? Can we punt on that problem until later?

The data in memory is something like a dictionary of ints. There are much more compact ways to record line numbers. Is it worth it? Recording pairs of line numbers (for branch coverage) is more complicated to compact (see Speeding up coverage data storage for one experiment on this). Eventually, we might get to counting the number of times a line is executed, rather than just a yes/no, which again would complicate things. Q: Is it important to try to conserve memory?

Today, the .coverage data files are basically JSON. This much data might need a different format. Q: Is it time for a SQLite data file?


The combine command won’t change much, other than properly dealing with the context information that will now be in the data files.

But thinking about combining adds another need for the measurement phase: when running tests, you should be able to specify a context that applies to the entire run. For example, you run your test suite twice, once on Python 2, and again on Python 3. The first run should record that it was a “python2” context, and the second, “python3”. Then when the files are combined, they will have the correct context recorded.

This also points up the need for context labels that can indicate nesting, so that we can record that lines were run under Python 2 and also note the test names that ran them. Contexts might look like “python2.test_function_one”, for example.
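A sketch of how combining might build nested labels, assuming each run carries one run-wide label and a dotted join is used for nesting. The `combine` function and the data layout are assumptions for illustration, not the real combine command.

```python
def combine(runs):
    """Merge several runs into one mapping of nested context labels.

    `runs` maps a run-wide label (e.g. "python2") to that run's data,
    which is {context: {filename: set_of_lines}}. A context of None
    (code run before any test) keeps just the run-wide label.
    """
    combined = {}
    for label, data in runs.items():
        for context, files in data.items():
            full = label if context is None else f"{label}.{context}"
            dest = combined.setdefault(full, {})
            for filename, lines in files.items():
                dest.setdefault(filename, set()).update(lines)
    return combined


runs = {
    "python2": {None: {"app.py": {1}}, "test_function_one": {"app.py": {10, 11}}},
    "python3": {None: {"app.py": {1}}, "test_function_one": {"app.py": {10, 12}}},
}
combined = combine(runs)
print(sorted(combined))
```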


Reporting is where things get really murky. If I have a test suite with 500 tests, how do I display all the information about those 500 tests? I can’t create an HTML report where each line of code is annotated with the names of all the tests that ran it. It’s too bulky to write, and far too cluttered to read.

Partly the problem here is that I don’t know how people will want to use the data. When someone says, “I want to know which tests covered which lines,” are they going to start from a line of code, and want to see which tests ran it? Or will they start from a test, and want to see what lines it ran? Q: How would you use the data?

One possibility is a new command, the opposite of “coverage combine”: it would take a large data file, and subset it to write a smaller data file. You could specify a pattern of contexts to include in the output. This would let you slice and dice arbitrarily, and then you can report as you want from the resulting segmented data file. Q: Would this be too clumsy?
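Such a subsetting step could be as simple as a glob match on context names. This sketch assumes contexts are strings and uses shell-style patterns; the `subset` function is hypothetical, and no such command exists yet.

```python
from fnmatch import fnmatchcase


def subset(data, pattern):
    """Keep only the contexts whose name matches a shell-style pattern.

    `data` is {context: {filename: set_of_lines}} with string contexts.
    """
    return {
        context: files
        for context, files in data.items()
        if fnmatchcase(context, pattern)
    }


data = {
    "python2.test_parser": {"parser.py": {1, 2}},
    "python3.test_parser": {"parser.py": {1, 3}},
    "python3.test_report": {"report.py": {5}},
}
only_py3 = subset(data, "python3.*")
only_parser = subset(data, "*.test_parser")
print(sorted(only_py3))
print(sorted(only_parser))
```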

Perhaps the thing to do is to provide a SQLite interface. A new “report” would produce a SQLite database with a specified schema. You can then write queries against that database to your heart’s content. Q: Is that too low-level? Will it be possible to write a useful report from it?
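To give a feel for what querying such a database might look like, here is a sketch using Python's sqlite3 module. The schema, table names, and sample rows are all invented for illustration; no schema has been decided.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE context (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
    CREATE TABLE line (
        context_id INTEGER REFERENCES context (id),
        file_id INTEGER REFERENCES file (id),
        lineno INTEGER,
        PRIMARY KEY (context_id, file_id, lineno)
    );
""")
conn.executemany(
    "INSERT INTO context (id, name) VALUES (?, ?)",
    [(1, "test_parse"), (2, "test_report")],
)
conn.executemany(
    "INSERT INTO file (id, path) VALUES (?, ?)",
    [(1, "parser.py"), (2, "report.py")],
)
conn.executemany(
    "INSERT INTO line (context_id, file_id, lineno) VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 1, 11), (2, 1, 10), (2, 2, 5)],
)


def who_tests(path, lineno):
    """Start from a line of code: which contexts ran it?"""
    rows = conn.execute(
        """
        SELECT context.name FROM line
        JOIN context ON context.id = line.context_id
        JOIN file ON file.id = line.file_id
        WHERE file.path = ? AND line.lineno = ?
        ORDER BY context.name
        """,
        (path, lineno),
    )
    return [name for (name,) in rows]


def what_does_it_run(context_name):
    """Start from a context: which lines did it run?"""
    rows = conn.execute(
        """
        SELECT file.path, line.lineno FROM line
        JOIN context ON context.id = line.context_id
        JOIN file ON file.id = line.file_id
        WHERE context.name = ?
        ORDER BY file.path, line.lineno
        """,
        (context_name,),
    )
    return list(rows)


print(who_tests("parser.py", 10))
print(what_does_it_run("test_report"))
```

The two functions correspond to the two directions asked about above: starting from a line, or starting from a test.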

What’s already been done

I started hacking on this context idea a year ago. Coverage.py currently has some support for it. The measurement support is there, and data is collected in memory. I did it to test whether the plugin idea would be fast enough, and it seems to be. If you are interested to see it, search for “wtw” in the code.

The data is not written out to a .coverage data file, and there is zero support for combining, segmenting, or reporting on context data.

How you can help

I’m interested to hear about how you would use this feature. I’m interested to hear ideas for implementation. If you want to help, let me know.


Pamela McANulty 3:18 AM on 11 Dec 2016
- I think that memory will definitely become a problem when the number of contexts grows, but can be punted until the underlying approach is completed.

- I think manipulating .coverage files and then extracting sub-sets would, in fact, be clumsy - thus I favor the SQLite approach

- I think my most-common use case will be "Ah, I see that this code that I want to fix is tested, what tests it?"

- I like the idea of using SQLite for storage. It addresses two main problems: storage of large quantities of data, and lets one defer the "how would you use it?" issue a bit by allowing users to write their own queries. I could envision splitting coverage.py a bit into a "gather coverage data" system and a "query the data" API. The lowest level of the API would simply be SQLite SQL, but there would probably be a need to provide some higher-level wrapper/plugin API, and maybe some useful Jupyter notebook examples, using its visualization tools, etc.
@Pam, thanks for the answers. I'm having a hard time picturing how people would make use of a SQLite output file. Maybe I need to give people a way to get one of those files, and see what people do with it.
Michal Bultrowicz 3:14 PM on 11 Dec 2016
As for the ideas of usage: it would be useful for refactoring test suites of large legacy or not-so-well maintained projects. The ability to see which "unit tests" actually touch big chunks of code could show where test quality is subpar. It could also be used to detect redundant tests checking the same path.
I don't know if this is the same kind of thing, but the feature I've wanted, and clumsily tried to emulate, is the ability to, in the terms above, define a context for each source and test file, and only collect coverage data when the contexts match.

I currently emulate this, when I really want it, with multiple tiny runs that all get stitched together.

The goal here is to get unit-y tests without worrying about mock behavior.
@Michal: thanks, but I could use more detail. Exactly what data would you look at to find redundant tests?
@mwchase: it sounds like if you had a context per test file, then you could look at the coverage in foo.py of just the test_foo.py context. Is that right?
Michal Bultrowicz 8:10 PM on 11 Dec 2016
@Ned: My ideas aren't really well thought through, but I'll try to be a bit more specific.

First, what I understand from the article:
- a context can contain the name of a test function
- each context will have its own coverage data. It's basically JSON, right? So this data will be a set (list) of files that the context touches, mapped to the specific lines or statements touched. Right? So it's a kind of directed graph "context -> file -> statements" (-> meaning a one-to-many relation).

Now, analyzing this data I can:
- Find tests that map to the same set of file-line pairs (cover the same code). These could mean redundant tests. Not all tests covering exactly the same statements in your code are bad, though (e.g. property-based tests).
- Find tests which have supersets of file-line pairs of other tests. E.g. an integration test checking the same stuff as a unit test plus some extra. Maybe I can stop maintaining (delete) the unit test then?
- Check if the integration/functional tests are actually touching as many files (as many layers of a software stack) as we want them to.
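The first two analyses above can be sketched with plain set comparisons, assuming the per-test data is available as a set of (file, line) pairs. The `find_overlaps` function and the test names are hypothetical.

```python
from itertools import combinations


def find_overlaps(coverage):
    """Compare every pair of tests by the (file, line) pairs they ran.

    Returns two lists: pairs of tests with identical coverage, and
    (smaller, larger) pairs where one test's coverage is a strict
    subset of the other's.
    """
    identical, subsumed = [], []
    for a, b in combinations(sorted(coverage), 2):
        if coverage[a] == coverage[b]:
            identical.append((a, b))
        elif coverage[a] < coverage[b]:
            subsumed.append((a, b))
        elif coverage[b] < coverage[a]:
            subsumed.append((b, a))
    return identical, subsumed


coverage = {
    "test_unit_add": {("calc.py", 1), ("calc.py", 2)},
    "test_unit_add_again": {("calc.py", 1), ("calc.py", 2)},
    "test_integration": {("calc.py", 1), ("calc.py", 2), ("api.py", 7)},
}
identical, subsumed = find_overlaps(coverage)
print(identical)
print(subsumed)
```

Identical or subsumed coverage is only a hint of redundancy: two tests can run the same lines while checking different values.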

I think this kind of data would be useful for an ambitious QA engineer who would have to work with some of the projects I maintained.

What about using contexts like tags, and tagging each executed line with the context that is active when that line is executed? A line could then be tagged by multiple contexts if it is run more than once. Map the context strings to unique ints (or some other low-memory unique identifier) to save memory.

- Every time the plugin is called to set a context, e.g. `coverage.set_context('app.module.class.test_name', 'python2.7', 'etc.')`, add those strings to a dict (if they aren't already there) and give each a unique int as a value (CTXINT).
- keep track of the currently active contexts.
- For every line that is executed, tag that line with all currently active contexts by mapping the line to the CTXINT for each active context.

At the end, for every line executed, you should be able to see what CTXINTs are associated with it, and then map those CTXINTs back to the string name of the context.
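A minimal sketch of this tagging scheme follows. The `ContextTagger` class is hypothetical, and `set_context` here only mirrors the call imagined above; it is not an existing coverage.py API.

```python
class ContextTagger:
    """Tag executed lines with small ints standing in for context strings."""

    def __init__(self):
        self.ids = {}     # context string -> CTXINT
        self.names = []   # CTXINT -> context string
        self.active = ()  # CTXINTs of the currently active contexts
        self.tags = {}    # (filename, lineno) -> set of CTXINTs

    def set_context(self, *names):
        """Make the given context strings the active ones, interning new ones."""
        ids = []
        for name in names:
            if name not in self.ids:
                self.ids[name] = len(self.names)
                self.names.append(name)
            ids.append(self.ids[name])
        self.active = tuple(ids)

    def record(self, filename, lineno):
        """Tag a line with every currently active context."""
        self.tags.setdefault((filename, lineno), set()).update(self.active)

    def contexts_for(self, filename, lineno):
        """Map a line's CTXINTs back to their context strings."""
        ctxints = self.tags.get((filename, lineno), ())
        return sorted(self.names[i] for i in ctxints)


tagger = ContextTagger()
tagger.set_context("app.module.class.test_name", "python2.7")
tagger.record("app.py", 10)
tagger.set_context("app.module.class.test_other", "python2.7")
tagger.record("app.py", 10)
tagger.record("app.py", 11)
print(tagger.contexts_for("app.py", 10))
```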

Benefits? of this approach:

1) The use of the int (or some other minimal memory identifier) reduces overall memory usage.
2) You can have multiple contexts active at a single time.

1M LOC with 100% test coverage, using a 4-byte unsigned int with a single tag per line, comes out to ~4M bytes or 4MB.

This seems small enough that you could worry about making this work first and managing memory later?

Or, if my memory calculations are off, track how many LOC are being watched and if that translates to too much memory being used (pick an arbitrary number), throw an exception indicating high memory usage. Offer an API to set this limit higher and/or turn it off if desired. This too would allow you to focus on getting the basics working without having to worry about non-memory storage management (like SQLite).
I just released SQLite storage for testmon last week (https://github.com/tarpas/pytest-testmon). I definitely think it's the way to go.

However while thinking about answers to your questions it seems to me that coverage.py is fine as it is. This feature should probably be implemented in the test runners.

The problems I see for my usage are:
I need to store the outcomes of the tests (passing/failing + exception)
I need to optimize the storage for performance of my use case. E.g. I'll store modified times and checksums of the files. And of course I'm storing checksums of code blocks instead of the line numbers (so that a newline somewhere doesn't invalidate data of the whole file)

So I guess I would not be able to use the new data. I'll think about it some more ....
After a basic change I ran coverage.py test suite with testmon. The resulting sqlite file is here: https://drive.google.com/file/d/0ByTESd2M3LzEVGlSamJIeWJ6dzg/view?usp=sharing

Of course it has less information than the full coverage file should have (no line numbers). But it's a self explanatory example of how to represent/use the data we're thinking about.

testmon nomenclature:
context => nodeid
"context for the whole run" => variant
Pamela McANulty 3:23 AM on 13 Dec 2016
@Ned wrt querying a SQLite file - I'd use SQL from the command line
You say an HTML report would be too cluttered to read, but I can imagine with the application of some Javascript it could be manageable. I'm picturing a code listing, with colour highlighting in the gutter corresponding to the number of tests that touched each line, with the individual tests listed perhaps in a hover. This could even be expanded to allow jumping between the "who tested this" and "what else did they test" views (I'm imagining the underlying data structure as a bit like a bipartite graph of tests and lines of code).
I would like to have that feature when CI runs against my merge / pull requests. It would tell me which tests I should modify or mimic to verify my changes. For instance:

* A pull request is sent to coverage.py and modifies parser.py
* The CI runs and reports coverage is missing, for modified lines but also for deleted lines. It is useful to know, when code is removed, if it was tested or not.
* The pull request commits are displayed by the CI with lines annotated with the file:line testing them. The annotated lines include the context of the hunk.

On the local machine I would use coverage who-test-what themodifiedfile.py to get the annotated source with the list of file:line that tested each line.
Talking about the usage:
I'm considering triggering a full regression of the unit test suite after each git commit in my project, but the full regression takes far too long. So my options are either to shorten the whole suite's execution time or to run only the cases affected by the commit. The latter is clearly better.
If this feature could tell me exactly "who tests what", then together with the output of git diff I could dynamically run only the necessary cases rather than a full regression. It would be a big improvement if that dynamic run finished within minutes.
PS: A full regression is still necessary overnight. In my project, the full regression takes 100+ minutes...
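A sketch of that selection step, assuming the changed lines have already been extracted from git diff and that the who-tests-what data is available as a mapping from (file, line) to test names. Both inputs and the `select_tests` function are hypothetical.

```python
def select_tests(who_tests_what, changed_lines):
    """Return the tests that ran any of the changed lines.

    `who_tests_what` maps (filename, lineno) to a set of test names;
    `changed_lines` maps filename to the line numbers touched by a commit,
    numbered as they were when coverage was measured.
    """
    selected = set()
    for filename, linenos in changed_lines.items():
        for lineno in linenos:
            selected.update(who_tests_what.get((filename, lineno), ()))
    return sorted(selected)


who_tests_what = {
    ("parser.py", 10): {"test_parse", "test_report"},
    ("parser.py", 11): {"test_parse"},
    ("report.py", 5): {"test_report"},
}
changed_lines = {"parser.py": [11, 12]}
to_run = select_tests(who_tests_what, changed_lines)
print(to_run)
```

Newly added lines have no recorded data, and edits shift line numbers, so a real tool would need a fallback such as the nightly full regression.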
I am not sure how far you have got with your implementation, but I think two changes would be useful:
- in the code coverage report a clickable button which then reports on the tests which exercised that line.
- a new report which lists each test case that was executed, each with a clickable button. Under this button is a new report which displays the call tree, either as a tree or as nested rectangles. It would be useful to be able to view the arguments passed in each call, if possible. Obviously there would need to be some way to show deep levels of a call stack, maybe by showing only a limited number of levels and allowing drill-down if a function/method makes deeper calls.
I was wondering if we could have a "heat map" of coverage. For example, a function/method that is called 30 times (say) would be against a white background whereas a function/method called once would be against a deep blue one.

This way, one could tell visually which pieces of code were not exercised as much as others.

Would this be useful?…
@Sadrathrion: it sounds like you would like to count how many times each line is executed. There isn't a way to do that yet, but issue #669 is asking for it. Feel free to add your idea there.

Also, coverage.py currently doesn't do any per-function reporting, only per-line and per-file.
I have a usecase that might not have been suggested before but which I think is valuable. I actually found this post while looking into ways of hacking my usecase.

I generally have a dedicated test file for each of my source files, so what I'd like to do is check that each source file is 100% covered by the tests in its test file.
This is a second level of coverage where the first level is "covered by something" and the second level is "covered by the associated test file"
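A sketch of that second level, assuming contexts are strings of the form "test_file::test_name" and that foo.py is paired with test_foo.py. Both conventions, and the `second_level_missing` function, are assumptions for illustration.

```python
import os


def second_level_missing(data, executable):
    """Per source file, list lines not covered by its own test file.

    `data` is {context: {filename: set_of_lines}}, where each context
    is assumed to look like "test_file::test_name".
    `executable` is {filename: set_of_lines_that_could_run}.
    """
    result = {}
    for filename, lines in executable.items():
        expected = "test_" + os.path.basename(filename)
        covered = set()
        for context, files in data.items():
            if context is None:
                continue  # code run outside any test
            test_file = context.split("::", 1)[0]
            if os.path.basename(test_file) == expected:
                covered |= files.get(filename, set())
        result[filename] = sorted(lines - covered)
    return result


data = {
    "test_foo.py::test_a": {"foo.py": {1}},
    "test_bar.py::test_b": {"foo.py": {2, 3}, "bar.py": {1}},
}
executable = {"foo.py": {1, 2, 3}, "bar.py": {1}}
missing = second_level_missing(data, executable)
print(missing)
```

Here foo.py is fully covered at the first level, but lines 2 and 3 are only run by test_bar.py, so they show up as missing at the second level.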
