SQLite data storage for coverage.py

Tuesday 14 August 2018

I’m starting to make some progress on Who Tests What. The first task is to change how coverage.py records the data it collects during execution. Currently, all of the data is held in memory, and then written to a JSON file at the end of the process.

But Who Tests What is going to increase the amount of data. If your test suite has N tests, you will have roughly N times as much data to store. Keeping it all in memory will become unwieldy. Also, since the data is more complicated, you’ll want a richer way to access the data.

To solve both these problems, I’m switching over to using SQLite to store the data. This will give us a way to write the data as it is collected, rather than buffering it all to write at the end. BTW, there’s a third side-benefit to this: we would be able to measure processes without having to control their ending.
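To make the difference concrete, here is a minimal sketch of what incremental storage looks like. The table name, columns, and function are made up for this example; they are not coverage.py's actual schema:

```python
import sqlite3

# Illustrative sketch: write each measurement as it is collected instead of
# buffering everything in memory.  A real tracer would connect to a file on
# disk, so the data survives even if the process never ends cleanly.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE line_data (file TEXT, lineno INTEGER)")

def record_line(filename, lineno):
    # Called from the trace function: one small insert per event, so there
    # is no big end-of-process dump.
    con.execute("INSERT INTO line_data VALUES (?, ?)", (filename, lineno))
    con.commit()

record_line("example.py", 1)
record_line("example.py", 2)
```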

When running with --parallel, coverage adds the process id and a random number to the name of the data file, so that many processes can be measured independently. With JSON storage, we didn’t need to decide on this filename until the end of the process. With SQLite, we need it at the beginning. This has required a surprising amount of refactoring. (You can follow the carnage on the data-sqlite branch.)
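The kind of naming involved looks roughly like this (the exact suffix format here is illustrative, not necessarily coverage.py's real one):

```python
import os
import random

def data_filename(base=".coverage"):
    # Append the process id and a random string so that concurrently
    # measured processes never collide on a data file.  (Illustrative;
    # coverage.py's actual suffix format may differ.)
    rand = "".join(random.choice("0123456789abcdef") for _ in range(6))
    return f"{base}.{os.getpid()}.{rand}"
```

With JSON storage this name could be computed lazily at exit; with SQLite it has to be chosen before the first write.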

There’s one problem I don’t know how to solve: a process can start coverage measurement, then fork, and continue measurement in both of the child processes, as described in issue 56. With JSON storage, the in-memory data is naturally forked when the processes fork, and then each copy proceeds on its way. When each process ends, it writes its data to a file that includes the (new) process id, and all the data is recorded.

How can I support that use case with SQLite? The file name will be chosen before the fork, and data will be dribbled into the file as it happens. After the fork, both child processes will be trying to write to the same database file, which will not work (SQLite is not good at concurrent access).

Possible solutions:

  1. Even with SQLite, buffer all the data in memory. This imposes a memory penalty on everyone just for the rare case of measuring forking processes, and loses the extra benefit of measuring non-ending processes.
  2. Make buffer-it-all be an option. This adds to the complexity of the code, and will complicate testing. I don’t want to run every test twice, with buffering and not. Does pytest offer tools for conveniently doing this only for a subset of tests?
  3. Keep JSON storage as an option. This doesn’t have an advantage over #2, and has all the complications.
  4. Somehow detect that two processes are now writing to the same SQLite file, and separate them then?
  5. Use a new process just to own the SQLite database, with coverage talking to it over IPC. That sounds complicated.
  6. Monkeypatch os.fork so we can deal with the split? Yuck.
  7. Some other thing I haven’t thought of?

Expect to see an alpha of coverage.py in the next few weeks with SQLite data storage, and please test it. I’m sure there are other use cases that might experience some turbulence...


If you've selected a file, say coverage.sql, each process can write to coverage.[process-id].sql and be guaranteed that only one process at a time will ever be writing to it. Then the problem reduces to knowing when everything has settled down and the results can be coalesced.

I've used a variation on this theme for handling coverage data from multiple AppDomains in .net and Mono -- there, process termination makes an unmistakeable signal that everything is done and ready to process, which does make that end of things simpler.
@Steve Thanks, yes, that's the basic mechanism. But if the process forks, both the parent and the child process will still have the database file created before the fork open, and both processes will be trying to write to the same file.
In my experience, multiple writers work fine with sqlite. Have you tried it and had problems?
On a quick read, I'd add another option that I've used before: have the child open a new DB file and then have the parent merge it in at the end (end of child process, end of program, you decide). This might have solved your problem with needing to know the file name at the beginning: open using a tmp name, merge or rename at the end when you've decided on a file name.
Ah, thinking about it a bit more (and my answer was just repeating what Steve had said), I think @andershovmoller gave you the best answer: check your pid before you write to the db, if it's changed, start a new db. Coalesce at the end. Hopefully you're batching your writes anyway, so the syscall is negligible.
I'm with Gary on the "try it all to the same database". It'll lock per-process for any given commit, but you're going to be batching up queries for your commits anyway for performance reasons.

If that doesn't work out, could you get away with making a more efficient binary representation during the execution/storage phase, dumping those as contexts end, and then, if necessary, converting them to something more nicely queryable in the reporting phase? I imagine that most line coverage data is going to be really bursty and should compress well, both for small contexts that cover very little and for large ones that cover very large swaths of code...?

Do you have the planned data model written up somewhere?

Thank you.
I think it's clear at this point that I'll have to do some empirical experiments on this.

About the data model: it's very simple at the moment. Here's the SQL: https://github.com/nedbat/coveragepy/blob/da37af9a65b144ce6b1f26430bcbc9786e055f8b/coverage/sqldata.py#L21-L52
The usual way to handle a difficulty with forking is to add an atfork handler. In C you'd call pthread_atfork; there's a pyatfork package that implements a Python interface. (But requiring this package would complicate the installation of coverage.py and so it would be simpler and probably cover 99% of cases to monkey-patch os.fork.)
Many people are telling me that concurrent access will be fine, but using the same db file from two processes would mean I couldn't keep some data in memory (like the list of files measured so far). So I might prefer to avoid concurrent access anyway.

Right now, an idea from a Twitter thread seems like the most promising: when opening the database, keep the process id. Any time we are about to access the database, check if our process id has changed. If it has, a fork must have happened somehow, so start a new database. The new database will only have activity since the fork, but that's OK, since the two databases will be combined in a later step anyway.
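A sketch of that pid-checking idea, with a hypothetical helper class and table (neither is coverage.py's actual code):

```python
import os
import sqlite3
import tempfile

class ForkAwareWriter:
    """Hypothetical helper: start a fresh database if our pid changes."""

    def __init__(self, basename):
        self.basename = basename
        self._open_db()

    def _open_db(self):
        # Remember which process opened this connection.
        self.pid = os.getpid()
        self.filename = f"{self.basename}.{self.pid}"
        self.con = sqlite3.connect(self.filename)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS line_data (file TEXT, lineno INTEGER)"
        )

    def record(self, filename, lineno):
        if os.getpid() != self.pid:
            # Our pid changed, so a fork must have happened.  Start a new
            # database holding only post-fork data; all the databases get
            # combined in a later step anyway.
            self._open_db()
        self.con.execute("INSERT INTO line_data VALUES (?, ?)", (filename, lineno))
        self.con.commit()

writer = ForkAwareWriter(os.path.join(tempfile.mkdtemp(), "coverage"))
writer.record("example.py", 1)
```

The pid check is cheap on Linux (getpid is a fast syscall, cached by glibc in some versions), and if writes are batched it happens rarely anyway.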
@Gareth: thanks for the suggestion, and for chiming in after all these years! :) It turns out that Python 3.7 has atfork support in the stdlib (https://docs.python.org/3/library/os.html#os.register_at_fork). You are right that a third-party package is off the table for coverage.py.
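For reference, registering that hook is a one-liner; the callback body here is just a placeholder for whatever reconnection logic is needed:

```python
import os

def restart_data_file():
    # Placeholder: in coverage.py this would close the inherited database
    # connection and open a new data file for the child process.
    pass

# Python 3.7+, Unix only: run the callback in the child after every fork.
os.register_at_fork(after_in_child=restart_data_file)
```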
The first thing I thought of is to make the PID a column in the data you’re logging, then the data from multiple processes can naturally live in a single database. The only question is if you can write concurrently to the same database fast enough for your purposes.
> Does pytest offer tools for conveniently doing this only for a subset of tests?

Yes, there are ways to accomplish that. If the tests in the subset all use the same fixture, you can parametrize the fixture and those tests will run twice automatically:

import pytest

@pytest.fixture(params=['sqlite', 'inmemory'])
def backend(request):
    if request.param == 'sqlite':
        return ...  # construct the SQLite-backed storage here
    elif request.param == 'inmemory':
        return ...  # construct the in-memory storage here
All tests which use the "backend" fixture will run twice.

There are other ways involving marks as well. It depends on how you structured your tests, really.

Please do create an issue in pytest's tracker and we will be happy to help out. :)
