« | » Main « | »

D'oh: Coverage.py 4.3.1

Wednesday 28 December 2016

Yesterday I released five months' of fixes as Coverage.py 4.3, and today I am releasing Coverage.py 4.3.1. This is not because releasing is fun, but because releasing is error-prone.

Two bad problems were very quickly reported by my legions of adoring fans, and they are now fixed. I'll sheepishly tell you that one of them was a UnicodeError in a bit of too-cute code in setup.py.

Perhaps I should have released a 4.3 beta. But my experience in the past is that betas do not get the kind of attention that final releases do. Partly this is just due to people's attention budget: lots of people won't install a beta. But it's also due to continuous integration servers. When a final release is out, hundreds if not thousands of CI servers will install it automatically as part of the next triggered build. They won't install pre-releases.

So there's a not-great choice to make: should I put out a beta, and hope that people try it and tell me what went wrong? Will enough people in enough disparate environments take that step to truly test the release?

Or should I skip that step, jump straight to a final release, and prepare instead to quickly fix whatever problems occur? I chose the latter course for 4.3. I guess I could use meta-feedback about which form of feedback I should pursue in the future...

Coverage.py 4.3

Tuesday 27 December 2016

The latest Coverage.py release: Coverage.py 4.3 is ready.

This version adds --skip-covered support to the HTML report, implements sys.excepthook support, reads configuration from tox.ini, and contains improvements that close 18 issues. The complete change history is in the source.

A special shout-out to Loïc Dachary: he read my blog post about Who Tests What, and got interested in contributing. And I mean, really interested. Suddenly he seemed to be everywhere, making pull requests and commenting on issues. In a week, I had 122 emails due to his activity. That energy really helped push me along, and is a big reason why this release happened, five months after 4.2.

Random trivia: this is the 30th version on PyPI; it's the 57th if you include pre-releases.

Finding test coupling

Thursday 22 December 2016

Before we get started: this is a story about a problem I had and how I solved it. This retelling is leaving out lots of small false trails and hard learnings, which I will summarize at the end. I report these stories not to lecture from on high, but to share with peers, help people learn, and ideally, elicit teachings from others so that I can do it better next time. The main qualities I am demonstrating here are not intelligence and experience, but perseverance, patience, and optimism.

OK, on with the story:

Running our large test suite the other day, we got a test failure. It seemed unrelated to the changes we were making, but you can never be sure, so we investigated. Along the way I used a few techniques to narrow down, widen, and identify suspects.

Running just that one test passed, but running the whole test suite, it failed, and this behavior was repeatable. So we had some kind of coupling between tests. Ideally, all tests would be isolated from each other. Perfect test isolation would mean that no matter what order you ran tests, and no matter what subset of tests you ran, the results would be the same. Clearly we did not have perfect test isolation.

The job now was to find the test we were coupled with, or perhaps one of the many possible tests that we were coupled with.

The test failure itself was a UnicodeError while trying to log a warning message involving a username with a non-ASCII character in it. Apparently this is something that doesn't work well: when warnings are routed through the logging system, if the message is actually logged, and the message has a non-ASCII Unicode string, an exception will happen. That's unfortunate, but we'll have to live with that for the moment.

Our best guess at the moment is that when the test passes, it's because either the warnings settings, or the logging settings, are deciding not to log the warning. When the test fails, it's because some previous test has changed one (or both!) of those settings, causing the message to proceed all the way through the warnings/logging pipeline, to the point of producing the UnicodeError. This is a plausible theory because those settings are global to the process, and would be easy to change without realizing the consequences for test suites.

But we still have to find that test. Here's the command that runs just the one test, that failed:

python ./manage.py lms test --verbosity=2 --with-id --settings=test \
    --xunitmp-file=/edx/app/edxapp/edx-platform/reports/lms/nosetests.xml \
    --with-database-isolation \

This is the Django test runner, using nose. That last line selects one particular test method in one particular class in one specific test file. To try to find a failing combination, we'll widen the scope of our test run by peeling off trailing components. This will give us progressively more tests in the run, and eventually (we hope), the test will fail:


This last one finally failed, with 1810 tests. That's still too many to examine manually. We can run those tests again, with nose-randomly to randomize the order of the tests. This gives us an opportunity to run experiments where the randomization can tell us something about coupling. If we run the 1810 tests, and our failing test doesn't fail, then none of the tests that ran before it were the one that cause the problem. If the test does fail, then the tests that ran before it might be bad.

I used a bash loop to run those 1810 test over and over, capturing the output in separate result files:

export FLAGS=blah blah omitted for brevity
for i in $(seq 9999)do
    echo --- $i
    python ./manage.py lms test -v $FLAGS openedx/core/djangoapps -v > test$i.txt 2>&1

Overnight, this gave me 72 test result files to look at. The -v and --with-id flags gave us output that looked like this:

... lots of throat-clearing here ...
Synchronizing apps without migrations:
  Creating tables...
    Creating table coursewarehistoryextended_studentmodulehistoryextended
    Running deferred SQL...
  Installing custom SQL...
Running migrations:
  No migrations to apply.
Using --randomly-seed=1482407901
#19 test_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#20 test_configuration_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#21 test_get_value (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#22 test_get_value_for_org (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#23 test_get_dict (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#24 test_get_value_for_org_2 (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
... much more ...

A small Python program provided the analysis: test_analysis.py. (Warning: this is for Python 3.6, so f-strings ahead!)

Although I had 72 runs, the results converged after 11 runs: 179 tests were in the maybe-bad set, and more runs didn't reduce the set. That's because of nose-randomly's behavior, which I didn't fully understand: it doesn't completely shuffle the tests. Because of the possibility of module-level and class-level setup code, it randomizes within those scopes, but will not intermix between scopes. The test modules are run in a random order, but everything in one module will always run contiguously. The classes within a module will run in a random order, but all of the methods within a class will run contiguously.

The list of classes that test_analysis.py provided made clear what was going on: all of the maybe-bad tests were in credit/tests/test_views.py. There are 179 tests in that file, and something in there is causing our test failure. Because they always run contiguously, there's no way nose-randomly could give us more information about the true culprit.

Time for some low-tech divide and conquer: we'll run one class from test_views.py, and then our failing test. If we do that once for each class in test_views.py, we should get information about which class to examine. I'd love to tell you I had some clever way to get the list of test classes, but I just manually picked them out of the file and wrote this loop:

export FAILING_TEST=openedx/core/djangoapps/external_auth/tests/test_openid_provider.py:OpenIdProviderTest.test_provider_login_can_handle_unicode_email_inactive_account 
for c in CreditCourseViewSetTests CreditEligibilityViewTests CreditProviderCallbackViewTests CreditProviderRequestCreateViewTests CreditProviderViewSetTests; do
    echo ------------- $c
    python ./manage.py lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/test_views.py:$c \
        $FAILING_TEST 2>&1 | tee ${c}_out | grep unicode_email

(My bash-looping skillz were improving each time!) This showed me that three of the five classes were failing. These classes use mixins, and the thing the three class had in common was the AuthMixin, which provides four test methods. So it's probably one of those methods. I picked the first of the test classes, and ran a new experiment four times, once for each of the four test methods:

for t in test_authentication_required test_oauth test_session_auth test_jwt_auth; do
    echo ---------- $t
    python ./manage.py lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/test_views.py:CreditCourseViewSetTests.$t \
        $FAILING_TEST 2>&1 | tee ${t}_out | grep unicode_email

And this showed that test_jwt_auth was the problem! Now I had a two-test scenario that would produce the failure.

To find the line in the test, I could comment-out or otherwise neuter parts of the test method and run my fast two-test scenario. The cause was a JWT authorization header in a test client get() function call. JWT-related code is scarce enough in our huge code base, that I could identify a likely related function, place a pudb breakpoint, and start walking through code until I found the problem: a line deep in a library that changed the warnings settings! (cue the dramatic music)

Commenting out that line, and running my reproducer confirmed that it was the root cause. A simple pull request fixes the problem. Note in the pull request that the library's test case had a simple mistake that might have been the reason for the bad line to begin with.

It felt really good to find and fix the problem, perhaps because it took so long to find.

As promised: things I didn't know or did wrong:

  • Our test names have random data in them. I didn't realize this until my results started showing the sum of known-good and maybe-bad as greater than the total number of tests. Until then, the numbers were skittering all over the place. Once I canonicalized the file names, the numbers converged quickly.
  • I should have understood how nose-randomly worked earlier.
  • I had to (once again) Google the bash loop syntax.
  • I fat-fingered a few bash loops which gave me mysterious and sometimes discouraging false results.
  • A few times, I spent a while just reading through possible culprit tests looking for clues, which was fruitless, since the actual problem line was in a library in a different repo.

We're all learning. Be careful out there.

Dragon iterators

Saturday 17 December 2016

Advent of Code is running again this year, and I love it. It reveals a new two-part Santa-themed puzzle each day for the first 25 days in December. The puzzles are algorithmically slanted, and the second part is only revealed after you've solved the first part. The second part often requires you to refactor your code, or deal with growing computational costs.

I've long been fascinated with Python's iteration tools, so Day 16: Dragon Checksum was especially fun.

Here's an adapted version of the first part of the directions:

You'll need to use a modified dragon curve. Start with an appropriate initial string of 0's and 1's. Then, for as long as you need, repeat the following steps:

  • Call the data you have at this point, A.
  • Make a copy of A; call this copy B.
  • Reverse the order of the characters in B.
  • In B, replace all instances of 0 with 1 and all 1's with 0.
  • The resulting data is A, then a single 0, then B.

For example, after a single step of this process,

  • 1 becomes 100.
  • 0 becomes 001.
  • 11111 becomes 11111000000.
  • 111100001010 becomes 1111000010100101011110000.

We have a few options for how to produce these strings. My first version took an initial seed, and a number of steps to iterate:

ZERO_ONE = str.maketrans("01", "10")

def reverse01(s):
    """Reverse a string, and swap 0 and 1."""
    return s.translate(ZERO_ONE)[::-1]

def dragon_iterative(seed, steps):
    d = seed
    for _ in range(steps):
        d = d + "0" + reverse01(d)
    return d

(BTW, I also wrote tests as I went, but I'll omit those for brevity. The truly curious I'm sure can find the full code on GitHub.) This is a simple iterative function.

The problem statement sounds like it would lend itself well to recursion, so let's try that too:

def dragon_recursive(seed, steps):
    if steps == 0:
        return seed
        d = dragon_recursive(seed, steps-1)
        return d + "0" + reverse01(d)

Both of these functions have the same downside: they produce complete strings. One thing I know about Advent of Code is that they love to give you problems that can be brute-forced, but then turn up the dials high enough that you need a cleverer algorithm.

I don't know if this will be needed, but let's try writing a recursive generator that doesn't create the entire string before returning. This was tricky to write. In addition to the seed and the steps, we'll track whether we are going forward (for the first half of a step), or backward for the second half:

def dragon_gen(seed, steps, reverse=False):
    if reverse:
        if steps == 0:
            yield from reverse01(seed)
            yield from dragon_gen(seed, steps-1, reverse=not reverse)
            yield "1"
            yield from dragon_gen(seed, steps-1, reverse=reverse)
        if steps == 0:
            yield from seed
            yield from dragon_gen(seed, steps-1, reverse=reverse)
            yield "0"
            yield from dragon_gen(seed, steps-1, reverse=not reverse)

If you are still using Python 2, the "yield from" may be new to you: it yields all the values from an iterable. This function works, but feels unwieldy. There may be a way to fold the similar lines together more nicely, but maybe not.

In any case, all of these function still have a common problem: they require the caller to specify the number of steps to execute. The actual Advent of Code problem instead tells us how many characters of result we need. There's a simple way to calculate how many steps to run based on the length of the seed and the desired length of the result. But more interesting is an infinite dragon generator:

def dragon_infinite(seed):
    """Generate characters of dragon forever."""
    yield from seed
    for steps in itertools.count():
        yield "0"
        yield from dragon_gen(seed, steps, reverse=True)

This relies on the fact that the first part of an N-step dragon string is the N-1-step dragon string. Each time around the for-loop, we've produced a dragon string for a particular number of steps. To extend it, we just have to output the "0", and then a reverse version of the string we've already produced. It's a little surprising that this only calls dragon_gen with reverse=True, but discovering that is part of the fun of these sorts of exercises.

Now we can write dragon_finite, to give us a result string of a desired length:

def dragon_finite(seed, length):
    return "".join(itertools.islice(dragon_infinite(seed), length))

The second part of the puzzle involved a checksum over the result, which I wrote in a simple way. It meant that although I had written generators which could produce the dragon string a character at a time, I was using them to create a complete string before computing the checksum. I could have continued down this path and written a checksum function that didn't need a complete string, but this is as far as I got.

One other implementation I didn't tackle: a function that could produce the Nth character in the dragon string, given the seed and N, but without generating the entire sequence.

Who Tests What

Saturday 10 December 2016

The next big feature for coverage.py is what I informally call "Who Tests What." People want a way to know more than just what lines were covered by the tests, but also, which tests covered which lines.

This idea/request is not new: it was first suggested over four years ago as issue 170, and two other issues (#185 and #311) have been closed as duplicates. It's a big job, but people keep asking for it, so maybe it's time.

There are a number of challenges. I'll explain them here, and lay out some options and questions. If you have opinions, answers, or energy to help, get in touch.

First, it's important to understand that coverage.py works in two main phases, with an optional phase in the middle:

  • The first phase is measurement, where your test suite runs. Coverage.py notes which code was executed, and collects that information in memory. At the end of the run, that data is written to a file.
  • If you are combining data from a number of test runs, perhaps for multiple versions of Python, then there's an optional combination phase. Multiple coverage data files are combined into one data file.
  • The reporting phase is where your project is analyzed to understand what code could have run, and the data files are read to understand what code was run. The difference between the two is the code that did not run. That information is reported in some useful way: HTML, XML, or textually.

OK, let's talk about what has to be done...


The measurement phase has to collect and record the data about what ran.

What is Who?

At the heart of "Who Tests What" is the Who. Usually people want to know what tests run each line of code, so during measurement we need to figure out what test is being run.

I can see two ways to identify the test being run: either coverage.py figures it out by examining function names being run for "test_*" patterns, or the test runner tells coverage.py when each test starts.

But I think the fully general way to approach Who Tests What is to not assume that Who means "which test." There are other uses for this feature, so instead of hard-coding it to "test", I'm thinking in terms of the more general concept of "context." Often, the context would be "the current test," but maybe you're only interested in "Python version", or "subsystem," or "unit/integration/load."

So the question is, how to know when contexts begin and end? Clearly with this general an idea, coverage.py can't know. Coverage.py already has a plugin mechanism, so it seems like we should allow a plugin to determine the boundaries of contexts. Coverage.py can provide a plugin implementation that suffices for most people.

A context will be a string, and each different context will have its own collected coverage data. In the discussion on issue 170, you can see people suggesting that we collect an entire stack trace for each line executed. This seems to me to be enormously more bulky to collect, more difficult to make use of, and ultimately not as flexible as simply noting a string context.

There might be interesting things you can glean from that compendium of stack traces. I'd like to hear from you if you have ideas of things to do with stack traces that you can't do with contexts.

Another minor point: what should be done with code executed before any context is established? I guess an None context would be good enough.

Storing data

Having multiple contexts will multiply the amount of data to be stored. It's hard to guess how much more, since that will depend on how overlapping your contexts are. My crude first guess is that large projects would have roughly C/4 times more data, where C is the number of contexts. If you have 500 tests in your test suite, you might need to manage 100 to 200 times more data, which could be a real problem.

Recording the data on disk isn't a show-stopper, but keeping the data in memory might be. Today coverage.py keeps everything in memory until the end of the process, then writes it all to disk. Q: Will we need something more sophisticated? Can we punt on that problem until later?

The data in memory is something like a dictionary of ints. There are much more compact ways to record line numbers. Is it worth it? Recording pairs of line numbers (for branch coverage) is more complicated to compact (see Speeding up coverage data storage for one experiment on this). Eventually, we might get to counting the number of times a line is executed, rather than just a yes/no, which again would complicate things. Q: Is it important to try to conserve memory?

Today, the .coverage data files are basically JSON. This much data might need a different format. Q: Is it time for a SQLite data file?


The combine command won't change much, other than properly dealing with the context information that will now be in the data files.

But thinking about combining adds another need for the measurement phase: when running tests, you should be able to specify a context that applies to the entire run. For example, you run your test suite twice, once on Python 2, and again on Python 3. The first run should record that it was a "python2" context, and the second, "python3". Then when the files are combined, they will have the correct context recorded.

This also points up the need for context labels that can indicate nesting, so that we can record that lines were run under Python 2 and also note the test names that ran them. Contexts might look like "python2.test_function_one", for example.


Reporting is where things get really murky. If I have a test suite with 500 tests, how do I display all the information about those 500 tests? I can't create an HTML report where each line of code is annotated with the names of all the tests that ran it. It's too bulky to write, and far too cluttered to read.

Partly the problem here is that I don't know how people will want to use the data. When someone says, "I want to know which tests covered which lines," are they going to start from a line of code, and want to see which tests ran it? Or will they start from a test, and want to see what lines it ran? Q: How would you use the data?

One possibility is a new command, the opposite of "coverage combine": it would take a large data file, and subset it to write a smaller data file. You could specify a pattern of contexts to include in the output. This would let you slice and dice arbitrarily, and then you can report as you want from the resulting segmented data file. Q: Would this be too clumsy?

Perhaps the thing to do is to provide a SQLite interface. A new "report" would produce a SQLite database with a specified schema. You can then write queries against that database to your heart's content. Q: Is that too low-level? Will it be possible to write a useful report from it?

What's already been done

I started hacking on this context idea a year ago. Coverage.py currently has some support for it. The measurement support is there, and data is collected in memory. I did it to test whether the plugin idea would be fast enough, and it seems to be. If you are interested to see it, search for "wtw" in the code.

The data is not written out to a .coverage data file, and there is zero support for combining, segmenting, or reporting on context data.

How you can help

I'm interested to hear about how you would use this feature. I'm interested to hear ideas for implementation. If you want to help, let me know.

« | » Main « | »