A tale of two exceptions

Sunday 22 January 2017

It was the best of times, it was the worst of times...

This week saw the release of three different versions of Coverage.py. This is not what I intended. Clearly something was getting tangled up. It had to do with some tricky exception handling. The story is kind of long and intricate, but has a number of chewy nuggets that fascinate me. Your mileage may vary.

Writing it all out, many of these missteps seem obvious and stupid. If you take nothing else from this, know that everyone makes mistakes, and we are all still trying to figure out the best way to solve some problems.

It started because I wanted to get the test suite running well on Jython. Jython is hard to support in Coverage.py: it can do "coverage run", but because it doesn't have the same internals as CPython, it can't do "coverage report" or any of the other reporting code. Internally, there's one place in the common reporting code where we detect this, and raise an exception. Before all the changes I'm about to describe, that code looked like this:

for attr in ['co_lnotab', 'co_firstlineno']:
    if not hasattr(self.code, attr):
        raise CoverageException(
            "This implementation of Python doesn't support code analysis.\n"
            "Run coverage.py under CPython for this command."

The CoverageException class is derived from Exception. Inside of Coverage.py, all exceptions raised are derived from CoverageException. This is a good practice for any library. For the coverage command-line tool, it means we can catch CoverageException at the top of main() so that we can print the message without an ugly traceback from the internals of Coverage.py.
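
In sketch form (not coverage.py's actual code; the names here are illustrative), that top-level catch looks something like this:

```python
import sys

class CoverageException(Exception):
    """Stand-in for coverage.py's base exception class."""

def run_command(argv):
    # Stand-in for the real command dispatch; the real code can raise
    # CoverageException from anywhere in the library's internals.
    raise CoverageException("Couldn't parse the command line")

def main(argv=None):
    try:
        run_command(argv)
    except CoverageException as exc:
        # The user sees one clean message, not an internal traceback.
        print(exc, file=sys.stderr)
        return 1
    return 0
```

Any exception that derives from the library's base class gets the friendly treatment; anything else is a genuine bug and deserves its traceback.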

The problem with running the test suite under Jython is that this "can't support code analysis" exception was being raised from hundreds of tests. I wanted to get to zero failures or errors, either by making the tests pass (where the operations were supported on Jython) or skipping the tests (where the operations were unsupported).

There are lots of tests in the Coverage.py test suite that are skipped for all sorts of reasons. But I didn't want to add decorators or conditionals to hundreds of tests for the Jython case. First, it would be a lot of noise in the tests. Second, it's not always immediately clear from a test that it is going to touch the analysis code. Lastly and most importantly, if someday in the future I figured out how to do analysis on Jython, or if it grew the features to make the current code work, I didn't want to have to then remove all that test-skipping noise.

So I wanted to somehow automatically skip tests when this particular exception was raised. The unittest module already has a way to do this: tests are skipped by raising a unittest.SkipTest exception. If the exception raised for "can't support code analysis" derived from SkipTest, then the tests would be skipped automatically. Genius idea!
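
A tiny demonstration of the mechanism (contrived names, just to show the behavior): any exception deriving from unittest.SkipTest marks the test as skipped rather than failed, even when it's raised far below the test itself.

```python
import unittest

class AnalysisUnsupported(unittest.SkipTest):
    """Raised deep in library code; unittest treats it as a skip."""

class DemoTests(unittest.TestCase):
    def test_needs_analysis(self):
        # Imagine this raise happening far below, inside library internals.
        raise AnalysisUnsupported("no code analysis on this Python")

suite = unittest.TestLoader().loadTestsFromTestCase(DemoTests)
res = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(res.skipped), len(res.failures), len(res.errors))  # 1 0 0
```

The test is reported as a skip, with the exception's message as the reason.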

So in 4.3.2, the code changed to this (spread across a few files):

from coverage.backunittest import unittest

class StopEverything(unittest.SkipTest):
    """An exception that means everything should stop.

    This derives from SkipTest so that tests that spring this trap will be
    skipped automatically, without a lot of boilerplate all over the place.
    """


class IncapablePython(CoverageException, StopEverything):
    """An operation is attempted that this version of Python cannot do."""


# Alternative Python implementations don't always provide all the
# attributes on code objects that we need to do the analysis.
for attr in ['co_lnotab', 'co_firstlineno']:
    if not hasattr(self.code, attr):
        raise IncapablePython(
            "This implementation of Python doesn't support code analysis.\n"
            "Run coverage.py under another Python for this command."

It felt a little off to derive a product exception (StopEverything) from a testing exception (SkipTest), but that seemed acceptable. One place in the code, I had to deal specifically with StopEverything. In an inner loop of reporting, we catch exceptions that might happen on individual files being reported. But if this exception happens once, it will happen for all the files, so we wanted to end the report, not show this failure for every file. In pseudo-code, the loop looked like this:

for f in files_to_report:
    try:
        report_one_file(f)
    except StopEverything:
        # Don't report this on single files, it's a systemic problem.
        raise
    except Exception as ex:
        record_exception_for_file(f, ex)

This all seemed to work well: the tests skipped properly, without a ton of noise all over the place. There were no test failures in any supported environment. Ship it!

Uh-oh: very quickly, reports came in that coverage didn't work on Python 2.6 any more. In retrospect, it was obvious: the whole point of the "from coverage.backunittest" import in the code above was that Python 2.6 doesn't have unittest.SkipTest. For the Coverage.py tests on 2.6, I install unittest2 to get a backport of the things 2.6 is missing, and that gave me SkipTest. But without my test requirements installed, it doesn't exist.

So my tests passed on 2.6 because I installed a package that provided what was missing, but in the real world, unittest.SkipTest is truly missing.

This is a conundrum that I don't have a good answer to:

How can you test your code to be sure it works properly when the testing requirements aren't installed?

To fix the problem, I changed the definition of StopEverything. Coverage.py 4.3.3 went out the door with this:

class StopEverything(unittest.SkipTest if env.TESTING else object):
    """An exception that means everything should stop."""

The env.TESTING setting was a pre-existing variable: it's true if we are running the coverage.py test suite. This also made me uncomfortable: as soon as you start conditionalizing on whether you are running tests or not, you have a very slippery slope. In this case it seemed OK, but it wasn't: it hid the fact that deriving an exception from object is a dumb thing to do.

So 4.3.3 failed also, and not just on Python 2.6. As soon as an exception was raised inside that reporting loop that I showed above, Python noticed that I was trying to catch a class that doesn't derive from Exception. Of course, my test suite didn't catch this, because when I was running my tests, my exception derived from SkipTest.

Changing "object" to "Exception" would fix the problem, but I didn't like the test of env.TESTING anyway. So for 4.3.4, the code is:

class StopEverything(getattr(unittest, 'SkipTest', Exception)):
    """An exception that means everything should stop."""

This is better, first because it uses Exception rather than object. But also, it's duck-typing the base class rather than depending on env.TESTING.

But as I kept working on getting rid of test failures on Jython, I got to this test failure (pseudo-code):

def test_sort_report_by_invalid_option(self):
    msg = "Invalid sorting option: 'Xyzzy'"
    with self.assertRaisesRegex(CoverageException, msg):
        self.command_line("report --sort=Xyzzy")

This is a reporting operation, so Jython will fail with a StopEverything exception saying, "This implementation of Python doesn't support code analysis." StopEverything is a CoverageException, so the assertRaisesRegex will catch it, but it will fail because the messages don't match.

StopEverything is both a CoverageException and a SkipTest, but the SkipTest is the more important aspect. To fix the problem, I did this, but felt silly:

def test_sort_report_by_invalid_option(self):
    msg = "Invalid sorting option: 'Xyzzy'"
    with self.assertRaisesRegex(CoverageException, msg):
        try:
            self.command_line("report --sort=Xyzzy")
        except SkipTest:
            raise SkipTest()

I knew this couldn't be the right solution. Talking it over with some co-workers (OK, I was griping and whining), we came up with the better solution. I realized that CoverageException is used in the code base to mean, "an ordinary problem from inside Coverage.py." StopEverything is not an ordinary problem. It reminded me of typical mature exception hierarchies, where the main base class, like Exception, isn't actually the root of the hierarchy. There are always a few special-case classes that derive from a real root higher up.

For example, in Python, the classes Exception, SystemExit, and KeyboardInterrupt all derive from BaseException. This is so "except Exception" won't interfere with SystemExit and KeyboardInterrupt, two exceptions meant to forcefully end the program.
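
A quick demonstration of that hierarchy at work: KeyboardInterrupt derives from BaseException but not Exception, so a blanket "except Exception" lets it pass through untouched.

```python
def swallow_ordinary_errors():
    try:
        raise KeyboardInterrupt()
    except Exception:
        return "swallowed"   # not reached: KeyboardInterrupt isn't an Exception

caught = None
try:
    swallow_ordinary_errors()
except BaseException as exc:
    caught = type(exc).__name__

print(caught)  # KeyboardInterrupt
```

The "end the program now" exceptions sail right past the ordinary error handling, which is exactly the behavior I needed for StopEverything.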

I needed the same thing here, for the same reason. I want to have a way to catch "all" exceptions without interfering with the exceptions that mean "end now!" I adjusted my exception hierarchy, and now the code looks like this:

class BaseCoverageException(Exception):
    """The base of all Coverage exceptions."""

class CoverageException(BaseCoverageException):
    """A run-of-the-mill exception specific to coverage.py."""

class StopEverything(
        BaseCoverageException,
        getattr(unittest, 'SkipTest', Exception)
    ):
    """An exception that means everything should stop."""

Now I could remove the weird SkipTest dance in that test. The catch clause in my main() function changes from CoverageException to BaseCoverageException, and things work great. The end...?

One of the reasons I write this stuff down is because I'm hoping to get feedback that will improve my solution, or advance my understanding. As I lay out this story, I can imagine points of divergence: places in this narrative where a reader might object and say, "you should blah blah blah." For example:

  • "You shouldn't bother supporting 2.6." Perhaps not, but that doesn't change the issues explored here, just makes them less likely.
  • "You shouldn't bother supporting Jython." Ditto.
  • "You should just have dependencies for the things you need, like unittest2." Coverage.py has a long-standing tradition of having no dependencies. This is driven by a desire to be available to people porting to new platforms, without having to wait for the dependencies to be ported.
  • "You should have more realistic integration testing." I agree. I'm looking for ideas about how to test the scenario of having no test dependencies installed.

That's my whole tale. Ideas are welcome.

Evil ninja module initialization

Tuesday 10 January 2017

A question about import styles on the Python-Dev mailing list asked about imports like this:

import os as _os

Understanding why people do this is an interesting lesson in how modules work. A module is nothing more than a collection of names. When you define a name in a .py file, it becomes an attribute of the module, and is then importable from the module.

An underlying simplicity in Python is that many statements are really just assignment statements in disguise. All of these define the name X:

X = 17
def X(): print("look!")
import X

When you create a module, you can make the name "X" importable from that module by assigning to it, or defining it as a function. You can also make it importable by importing it yourself.

Suppose your module looks like this:

# yourmodule.py
import os

def doit():
    # ... do something, probably using os ...
    pass

This module has two names defined in it: "doit", and "os". Someone else can now do this:

# someone.py
from yourmodule import os

# or worse, this imports os and doit:
from yourmodule import *

This bothers some people. "os" is not part of the actual interface of yourmodule. That first import I showed prevents this leaking of your imports into your interface. Importing star doesn't pull in names starting with underscores. (Another solution is to define __all__ in your module.)
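
To see __all__ at work without creating files, here's a contrived in-memory demonstration (the module and its names are made up for illustration):

```python
import sys
import types

# Build a throwaway "yourmodule" in memory, with an import, a function,
# and an __all__ that names only the real interface.
mod = types.ModuleType("yourmodule")
exec(
    "import os\n"
    "__all__ = ['doit']\n"
    "def doit():\n"
    "    return os.sep\n",
    mod.__dict__,
)
sys.modules["yourmodule"] = mod

# Import-star honors __all__: "doit" comes through, "os" does not.
ns = {}
exec("from yourmodule import *", ns)
print(sorted(n for n in ns if not n.startswith("__")))  # ['doit']
```

Note that __all__ only affects import-star; "from yourmodule import os" would still work, so it's a convention, not a fence.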

Most people, though, don't worry about this kind of name leaking. Import-star is discouraged anyway, and people know not to import os from other modules. The solution of renaming os to _os just makes your code ugly for little benefit.

The part of the discussion thread that really caught my eye was Daniel Holth's winking suggestion of the "evil ninja mode pattern" of module initialization:

def ninja():
    global exported
    import os
    def exported():
        # ... do something with os ...
        pass

ninja()
del ninja

What's going on here!? Remember that def is an assignment statement like any other. When used inside a function, it defines a local name, as assignment always does. But an assignment in a function can define a global name if the name is declared as global. It's a little unusual to see a global statement without an explicit assignment at the top-level, but it works just fine. The def statement defines a global "exported" function, because the global statement told it to. "os" is now a local in our function, because again, the import statement is just another form of assignment.

So we define ninja(), and then execute it immediately. This defines the global "exported", and doesn't define a global "os". The only problem is the name "ninja" has been defined, which we can clean up with a del statement.

Please don't ever write code this way. It's a kind of over-defensiveness that isn't needed in typical Python code. But understanding what it does, and why it does it, is a good way to flex your understanding of Python workings.

For more about how names (and values) work in Python, people seem to like my PyCon talk, Python Names and Values.

No PyCon for me this year

Thursday 5 January 2017

2017 will be different for me in one specific way: I won't be attending PyCon. I've been to ten in a row:

Ten consecutive PyCon badges

This year, Open edX con is in Madrid two days after PyCon, actually overlapping with the sprints. I'm not a good enough traveler to do both. Crossing nine timezones is not something to be taken lightly.

I'll miss the usual love-fest at PyCon, but after ten in a row, it should be OK to miss one. I can say that now, but probably in May I will feel like I am missing the party. Maybe I really will watch talks on video for a change.

I usually would be working on a presentation to give. I like making presentations, but it is a lot of work. This spring I'll have that time back.

In any case, this will be a new way to experience the Python community. See you all in 2018 in Cleveland!

D'oh: Coverage.py 4.3.1

Wednesday 28 December 2016

Yesterday I released five months' of fixes as Coverage.py 4.3, and today I am releasing Coverage.py 4.3.1. This is not because releasing is fun, but because releasing is error-prone.

Two bad problems were very quickly reported by my legions of adoring fans, and they are now fixed. I'll sheepishly tell you that one of them was a UnicodeError in a bit of too-cute code in setup.py.

Perhaps I should have released a 4.3 beta. But my experience in the past is that betas do not get the kind of attention that final releases do. Partly this is just due to people's attention budget: lots of people won't install a beta. But it's also due to continuous integration servers. When a final release is out, hundreds if not thousands of CI servers will install it automatically as part of the next triggered build. They won't install pre-releases.

So there's a not-great choice to make: should I put out a beta, and hope that people try it and tell me what went wrong? Will enough people in enough disparate environments take that step to truly test the release?

Or should I skip that step, jump straight to a final release, and prepare instead to quickly fix whatever problems occur? I chose the latter course for 4.3. I guess I could use meta-feedback about which form of feedback I should pursue in the future...

Coverage.py 4.3

Tuesday 27 December 2016

The latest Coverage.py release: Coverage.py 4.3 is ready.

This version adds --skip-covered support to the HTML report, implements sys.excepthook support, reads configuration from tox.ini, and contains improvements that close 18 issues. The complete change history is in the source.

A special shout-out to Loïc Dachary: he read my blog post about Who Tests What, and got interested in contributing. And I mean, really interested. Suddenly he seemed to be everywhere, making pull requests and commenting on issues. In a week, I had 122 emails due to his activity. That energy really helped push me along, and is a big reason why this release happened, five months after 4.2.

Random trivia: this is the 30th version on PyPI; it's the 57th if you include pre-releases.

Finding test coupling

Thursday 22 December 2016

Before we get started: this is a story about a problem I had and how I solved it. This retelling is leaving out lots of small false trails and hard learnings, which I will summarize at the end. I report these stories not to lecture from on high, but to share with peers, help people learn, and ideally, elicit teachings from others so that I can do it better next time. The main qualities I am demonstrating here are not intelligence and experience, but perseverance, patience, and optimism.

OK, on with the story:

Running our large test suite the other day, we got a test failure. It seemed unrelated to the changes we were making, but you can never be sure, so we investigated. Along the way I used a few techniques to narrow down, widen, and identify suspects.

Running just that one test passed, but running the whole test suite, it failed, and this behavior was repeatable. So we had some kind of coupling between tests. Ideally, all tests would be isolated from each other. Perfect test isolation would mean that no matter what order you ran tests, and no matter what subset of tests you ran, the results would be the same. Clearly we did not have perfect test isolation.

The job now was to find the test we were coupled with, or perhaps one of the many possible tests that we were coupled with.

The test failure itself was a UnicodeError while trying to log a warning message involving a username with a non-ASCII character in it. Apparently this is something that doesn't work well: when warnings are routed through the logging system, if the message is actually logged, and the message has a non-ASCII Unicode string, an exception will happen. That's unfortunate, but we'll have to live with that for the moment.

Our best guess at the moment is that when the test passes, it's because either the warnings settings, or the logging settings, are deciding not to log the warning. When the test fails, it's because some previous test has changed one (or both!) of those settings, causing the message to proceed all the way through the warnings/logging pipeline, to the point of producing the UnicodeError. This is a plausible theory because those settings are global to the process, and would be easy to change without realizing the consequences for test suites.

But we still have to find that test. Here's the command that runs just the one test, that failed:

python ./manage.py lms test --verbosity=2 --with-id --settings=test \
    --xunitmp-file=/edx/app/edxapp/edx-platform/reports/lms/nosetests.xml \
    --with-database-isolation \
    openedx/core/djangoapps/external_auth/tests/test_openid_provider.py:OpenIdProviderTest.test_provider_login_can_handle_unicode_email_inactive_account

This is the Django test runner, using nose. That last line selects one particular test method in one particular class in one specific test file. To try to find a failing combination, we'll widen the scope of our test run by peeling off trailing components. This will give us progressively more tests in the run, and eventually (we hope), the test will fail:


python ./manage.py lms test ... test_openid_provider.py:OpenIdProviderTest
python ./manage.py lms test ... test_openid_provider.py
python ./manage.py lms test ... external_auth/tests
python ./manage.py lms test ... external_auth
python ./manage.py lms test ... openedx/core/djangoapps

This last one finally failed, with 1810 tests. That's still too many to examine manually. We can run those tests again, with nose-randomly to randomize the order of the tests. This gives us an opportunity to run experiments where the randomization can tell us something about coupling. If we run the 1810 tests, and our failing test doesn't fail, then none of the tests that ran before it were the one that caused the problem. If the test does fail, then the tests that ran before it might be bad.

I used a bash loop to run those 1810 tests over and over, capturing the output in separate result files:

export FLAGS="blah blah, omitted for brevity"
for i in $(seq 9999); do
    echo --- $i
    python ./manage.py lms test -v $FLAGS openedx/core/djangoapps > test$i.txt 2>&1
done

Overnight, this gave me 72 test result files to look at. The -v and --with-id flags gave us output that looked like this:

... lots of throat-clearing here ...
Synchronizing apps without migrations:
  Creating tables...
    Creating table coursewarehistoryextended_studentmodulehistoryextended
    Running deferred SQL...
  Installing custom SQL...
Running migrations:
  No migrations to apply.
Using --randomly-seed=1482407901
#19 test_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#20 test_configuration_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#21 test_get_value (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#22 test_get_value_for_org (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#23 test_get_dict (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#24 test_get_value_for_org_2 (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
... much more ...

A small Python program provided the analysis: test_analysis.py. (Warning: this is for Python 3.6, so f-strings ahead!)

Although I had 72 runs, the results converged after 11 runs: 179 tests were in the maybe-bad set, and more runs didn't reduce the set. That's because of nose-randomly's behavior, which I didn't fully understand: it doesn't completely shuffle the tests. Because of the possibility of module-level and class-level setup code, it randomizes within those scopes, but will not intermix between scopes. The test modules are run in a random order, but everything in one module will always run contiguously. The classes within a module will run in a random order, but all of the methods within a class will run contiguously.

The list of classes that test_analysis.py provided made clear what was going on: all of the maybe-bad tests were in credit/tests/test_views.py. There are 179 tests in that file, and something in there is causing our test failure. Because they always run contiguously, there's no way nose-randomly could give us more information about the true culprit.

Time for some low-tech divide and conquer: we'll run one class from test_views.py, and then our failing test. If we do that once for each class in test_views.py, we should get information about which class to examine. I'd love to tell you I had some clever way to get the list of test classes, but I just manually picked them out of the file and wrote this loop:

export FAILING_TEST=openedx/core/djangoapps/external_auth/tests/test_openid_provider.py:OpenIdProviderTest.test_provider_login_can_handle_unicode_email_inactive_account 
for c in CreditCourseViewSetTests CreditEligibilityViewTests CreditProviderCallbackViewTests CreditProviderRequestCreateViewTests CreditProviderViewSetTests; do
    echo ------------- $c
    python ./manage.py lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/test_views.py:$c \
        $FAILING_TEST 2>&1 | tee ${c}_out | grep unicode_email
done
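
For what it's worth, a quick grep would have pulled the class list out automatically. Demonstrated here on a stand-in file, since the real test_views.py lives in the edx-platform repo:

```shell
# Stand-in for openedx/core/djangoapps/credit/tests/test_views.py
cat > /tmp/test_views.py <<'EOF'
class CreditCourseViewSetTests(AuthMixin, TestCase):
    pass

class CreditEligibilityViewTests(TestCase):
    pass
EOF

# List the top-level test class names.
grep -oE '^class [A-Za-z]+' /tmp/test_views.py | sed 's/^class //'
```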

(My bash-looping skillz were improving each time!) This showed me that three of the five classes were failing. These classes use mixins, and the thing the three classes had in common was AuthMixin, which provides four test methods. So it's probably one of those methods. I picked the first of the test classes, and ran a new experiment four times, once for each of the four test methods:

for t in test_authentication_required test_oauth test_session_auth test_jwt_auth; do
    echo ---------- $t
    python ./manage.py lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/test_views.py:CreditCourseViewSetTests.$t \
        $FAILING_TEST 2>&1 | tee ${t}_out | grep unicode_email
done

And this showed that test_jwt_auth was the problem! Now I had a two-test scenario that would produce the failure.

To find the line in the test, I could comment out or otherwise neuter parts of the test method and run my fast two-test scenario. The cause was a JWT authorization header in a test client get() call. JWT-related code is scarce enough in our huge code base that I could identify a likely related function, place a pudb breakpoint, and start walking through code until I found the problem: a line deep in a library that changed the warnings settings! (cue the dramatic music)

Commenting out that line, and running my reproducer confirmed that it was the root cause. A simple pull request fixes the problem. Note in the pull request that the library's test case had a simple mistake that might have been the reason for the bad line to begin with.

It felt really good to find and fix the problem, perhaps because it took so long to find.

As promised: things I didn't know or did wrong:

  • Our test names have random data in them. I didn't realize this until my results started showing the sum of known-good and maybe-bad as greater than the total number of tests. Until then, the numbers were skittering all over the place. Once I canonicalized the test names, the numbers converged quickly.
  • I should have understood how nose-randomly worked earlier.
  • I had to (once again) Google the bash loop syntax.
  • I fat-fingered a few bash loops which gave me mysterious and sometimes discouraging false results.
  • A few times, I spent a while just reading through possible culprit tests looking for clues, which was fruitless, since the actual problem line was in a library in a different repo.

We're all learning. Be careful out there.

Dragon iterators

Saturday 17 December 2016

Advent of Code is running again this year, and I love it. It reveals a new two-part Santa-themed puzzle each day for the first 25 days in December. The puzzles are algorithmically slanted, and the second part is only revealed after you've solved the first part. The second part often requires you to refactor your code, or deal with growing computational costs.

I've long been fascinated with Python's iteration tools, so Day 16: Dragon Checksum was especially fun.

Here's an adapted version of the first part of the directions:

You'll need to use a modified dragon curve. Start with an appropriate initial string of 0's and 1's. Then, for as long as you need, repeat the following steps:

  • Call the data you have at this point, A.
  • Make a copy of A; call this copy B.
  • Reverse the order of the characters in B.
  • In B, replace all instances of 0 with 1 and all 1's with 0.
  • The resulting data is A, then a single 0, then B.

For example, after a single step of this process,

  • 1 becomes 100.
  • 0 becomes 001.
  • 11111 becomes 11111000000.
  • 111100001010 becomes 1111000010100101011110000.

We have a few options for how to produce these strings. My first version took an initial seed, and a number of steps to iterate:

ZERO_ONE = str.maketrans("01", "10")

def reverse01(s):
    """Reverse a string, and swap 0 and 1."""
    return s.translate(ZERO_ONE)[::-1]

def dragon_iterative(seed, steps):
    d = seed
    for _ in range(steps):
        d = d + "0" + reverse01(d)
    return d
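
The worked examples above make handy unit checks. Repeating the two definitions so this stands alone:

```python
ZERO_ONE = str.maketrans("01", "10")

def reverse01(s):
    """Reverse a string, and swap 0 and 1."""
    return s.translate(ZERO_ONE)[::-1]

def dragon_iterative(seed, steps):
    d = seed
    for _ in range(steps):
        d = d + "0" + reverse01(d)
    return d

# The four worked examples, each a single step:
assert dragon_iterative("1", 1) == "100"
assert dragon_iterative("0", 1) == "001"
assert dragon_iterative("11111", 1) == "11111000000"
assert dragon_iterative("111100001010", 1) == "1111000010100101011110000"
print("all examples check out")
```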

(BTW, I also wrote tests as I went, but I'll omit those for brevity. The truly curious I'm sure can find the full code on GitHub.) This is a simple iterative function.

The problem statement sounds like it would lend itself well to recursion, so let's try that too:

def dragon_recursive(seed, steps):
    if steps == 0:
        return seed
    else:
        d = dragon_recursive(seed, steps-1)
        return d + "0" + reverse01(d)

Both of these functions have the same downside: they produce complete strings. One thing I know about Advent of Code is that they love to give you problems that can be brute-forced, but then turn up the dials high enough that you need a cleverer algorithm.

I don't know if this will be needed, but let's try writing a recursive generator that doesn't create the entire string before returning. This was tricky to write. In addition to the seed and the steps, we'll track whether we are going forward (for the first half of a step), or backward for the second half:

def dragon_gen(seed, steps, reverse=False):
    if reverse:
        if steps == 0:
            yield from reverse01(seed)
        else:
            yield from dragon_gen(seed, steps-1, reverse=not reverse)
            yield "1"
            yield from dragon_gen(seed, steps-1, reverse=reverse)
    else:
        if steps == 0:
            yield from seed
        else:
            yield from dragon_gen(seed, steps-1, reverse=reverse)
            yield "0"
            yield from dragon_gen(seed, steps-1, reverse=not reverse)

If you are still using Python 2, the "yield from" may be new to you: it yields all the values from an iterable. This function works, but feels unwieldy. There may be a way to fold the similar lines together more nicely, but maybe not.

In any case, all of these functions still have a common problem: they require the caller to specify the number of steps to execute. The actual Advent of Code problem instead tells us how many characters of result we need. There's a simple way to calculate how many steps to run based on the length of the seed and the desired length of the result. But more interesting is an infinite dragon generator:

def dragon_infinite(seed):
    """Generate characters of dragon forever."""
    yield from seed
    for steps in itertools.count():
        yield "0"
        yield from dragon_gen(seed, steps, reverse=True)

This relies on the fact that the first part of an N-step dragon string is the N-1-step dragon string. Each time around the for-loop, we've produced a dragon string for a particular number of steps. To extend it, we just have to output the "0", and then a reverse version of the string we've already produced. It's a little surprising that this only calls dragon_gen with reverse=True, but discovering that is part of the fun of these sorts of exercises.

Now we can write dragon_finite, to give us a result string of a desired length:

def dragon_finite(seed, length):
    return "".join(itertools.islice(dragon_infinite(seed), length))

The second part of the puzzle involved a checksum over the result, which I wrote in a simple way. It meant that although I had written generators which could produce the dragon string a character at a time, I was using them to create a complete string before computing the checksum. I could have continued down this path and written a checksum function that didn't need a complete string, but this is as far as I got.

One other implementation I didn't tackle: a function that could produce the Nth character in the dragon string, given the seed and N, but without generating the entire sequence.

Who Tests What

Saturday 10 December 2016

The next big feature for coverage.py is what I informally call "Who Tests What." People want a way to know more than just what lines were covered by the tests, but also, which tests covered which lines.

This idea/request is not new: it was first suggested over four years ago as issue 170, and two other issues (#185 and #311) have been closed as duplicates. It's a big job, but people keep asking for it, so maybe it's time.

There are a number of challenges. I'll explain them here, and lay out some options and questions. If you have opinions, answers, or energy to help, get in touch.

First, it's important to understand that coverage.py works in two main phases, with an optional phase in the middle:

  • The first phase is measurement, where your test suite runs. Coverage.py notes which code was executed, and collects that information in memory. At the end of the run, that data is written to a file.
  • If you are combining data from a number of test runs, perhaps for multiple versions of Python, then there's an optional combination phase. Multiple coverage data files are combined into one data file.
  • The reporting phase is where your project is analyzed to understand what code could have run, and the data files are read to understand what code was run. The difference between the two is the code that did not run. That information is reported in some useful way: HTML, XML, or textually.

OK, let's talk about what has to be done...


The measurement phase has to collect and record the data about what ran.

What is Who?

At the heart of "Who Tests What" is the Who. Usually people want to know what tests run each line of code, so during measurement we need to figure out what test is being run.

I can see two ways to identify the test being run: either coverage.py figures it out by examining function names being run for "test_*" patterns, or the test runner tells coverage.py when each test starts.

But I think the fully general way to approach Who Tests What is to not assume that Who means "which test." There are other uses for this feature, so instead of hard-coding it to "test", I'm thinking in terms of the more general concept of "context." Often, the context would be "the current test," but maybe you're only interested in "Python version", or "subsystem," or "unit/integration/load."

So the question is, how to know when contexts begin and end? Clearly with this general an idea, coverage.py can't know. Coverage.py already has a plugin mechanism, so it seems like we should allow a plugin to determine the boundaries of contexts. Coverage.py can provide a plugin implementation that suffices for most people.

A context will be a string, and each different context will have its own collected coverage data. In the discussion on issue 170, you can see people suggesting that we collect an entire stack trace for each line executed. This seems to me to be enormously more bulky to collect, more difficult to make use of, and ultimately not as flexible as simply noting a string context.

There might be interesting things you can glean from that compendium of stack traces. I'd like to hear from you if you have ideas of things to do with stack traces that you can't do with contexts.

Another minor point: what should be done with code executed before any context is established? I guess a None context would be good enough.

Storing data

Having multiple contexts will multiply the amount of data to be stored. It's hard to guess how much more, since that will depend on how overlapping your contexts are. My crude first guess is that large projects would have roughly C/4 times more data, where C is the number of contexts. If you have 500 tests in your test suite, you might need to manage 100 to 200 times more data, which could be a real problem.

Recording the data on disk isn't a show-stopper, but keeping the data in memory might be. Today coverage.py keeps everything in memory until the end of the process, then writes it all to disk. Q: Will we need something more sophisticated? Can we punt on that problem until later?

The data in memory is something like a dictionary of ints. There are much more compact ways to record line numbers. Is it worth it? Recording pairs of line numbers (for branch coverage) is more complicated to compact (see Speeding up coverage data storage for one experiment on this). Eventually, we might get to counting the number of times a line is executed, rather than just a yes/no, which again would complicate things. Q: Is it important to try to conserve memory?
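As a rough illustration of "more compact": line numbers could be packed one bit per line instead of one dict entry per line. This is just a sketch, not coverage.py's actual storage:

```python
def lines_as_bitset(lines, max_line):
    """Pack a collection of line numbers into a compact bit set."""
    bits = bytearray(max_line // 8 + 1)
    for n in lines:
        bits[n // 8] |= 1 << (n % 8)
    return bytes(bits)

def bitset_has(bits, n):
    """Check whether line `n` is recorded in the bit set."""
    return bool(bits[n // 8] & (1 << (n % 8)))
```

A 1000-line file needs 126 bytes this way, versus tens of bytes per entry for a dictionary of ints, but membership tests and merges become bit-twiddling instead of dict operations.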

Today, the .coverage data files are basically JSON. This much data might need a different format. Q: Is it time for a SQLite data file?
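If it is, the schema might be as simple as a table of contexts and a table of measured lines. This layout is purely hypothetical, just to give a feel for the shape of such a data file:

```python
import sqlite3

# A real data file would live on disk; :memory: keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE context (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
    CREATE TABLE line (
        file TEXT,
        lineno INTEGER,
        context_id INTEGER REFERENCES context (id)
    );
""")
conn.execute("INSERT INTO context (label) VALUES (?)", ("python2.test_one",))
conn.execute("INSERT INTO line VALUES (?, ?, ?)", ("mod.py", 17, 1))
```

One appeal of SQLite here is that slicing by context becomes a WHERE clause instead of custom file-format code.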


The combine command won't change much, other than properly dealing with the context information that will now be in the data files.

But thinking about combining adds another need for the measurement phase: when running tests, you should be able to specify a context that applies to the entire run. For example, you run your test suite twice, once on Python 2, and again on Python 3. The first run should record that it was a "python2" context, and the second, "python3". Then when the files are combined, they will have the correct context recorded.

This also points up the need for context labels that can indicate nesting, so that we can record that lines were run under Python 2 and also note the test names that ran them. Contexts might look like "python2.test_function_one", for example.


Reporting is where things get really murky. If I have a test suite with 500 tests, how do I display all the information about those 500 tests? I can't create an HTML report where each line of code is annotated with the names of all the tests that ran it. It's too bulky to write, and far too cluttered to read.

Partly the problem here is that I don't know how people will want to use the data. When someone says, "I want to know which tests covered which lines," are they going to start from a line of code, and want to see which tests ran it? Or will they start from a test, and want to see what lines it ran? Q: How would you use the data?

One possibility is a new command, the opposite of "coverage combine": it would take a large data file, and subset it to write a smaller data file. You could specify a pattern of contexts to include in the output. This would let you slice and dice arbitrarily, and then you can report as you want from the resulting segmented data file. Q: Would this be too clumsy?
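A sketch of what that segmenting might look like, assuming the data amounts to a mapping from context label to measured lines (the real data model would be richer):

```python
import fnmatch

def subset_by_context(data, pattern):
    """Keep only the measurements whose context label matches `pattern`."""
    return {
        context: lines
        for context, lines in data.items()
        if fnmatch.fnmatch(context, pattern)
    }
```

With nested labels like "python2.test_function_one", glob patterns such as "python2.*" or "*.test_function_one" would slice along either axis.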

Perhaps the thing to do is to provide a SQLite interface. A new "report" would produce a SQLite database with a specified schema. You can then write queries against that database to your heart's content. Q: Is that too low-level? Will it be possible to write a useful report from it?

What's already been done

I started hacking on this context idea a year ago. Coverage.py currently has some support for it. The measurement support is there, and data is collected in memory. I did it to test whether the plugin idea would be fast enough, and it seems to be. If you are interested to see it, search for "wtw" in the code.

The data is not written out to a .coverage data file, and there is zero support for combining, segmenting, or reporting on context data.

How you can help

I'm interested to hear about how you would use this feature. I'm interested to hear ideas for implementation. If you want to help, let me know.

Mac un-installs

Monday 7 November 2016

The Mac is a nice machine and operating system, but there's one part of the experience I don't understand: software installation and uninstallation. I'm sure the App Store is meant to solve some of this, but the current situation is oddly manual.

Usually when I install applications on the Mac, I get a .dmg file, I open it, and there's something to copy to the Applications folder. Often, the .dmg window that opens has a cute graphic as a background, to encourage me to drag the application to the folder.

Proponents of this say, "it's so simple! The whole app is just a folder, so you can just drag it to Applications, and you're done. When you don't want the application any more, you just drag the application to the Trash."

This is not true. Applications may start self-contained in a folder, but they write data to other places on the disk. Those places are orphaned when you discard the application. Why is there no uninstaller to clean up those things?

As an example, I was cleaning up my disk this morning. Grand Perspective helped me find some big stuff I didn't need. One thing it pointed out to me was in a Caches folder. I wondered how much stuff was in folders called Caches:

sudo find / -type d -name '*Cache*' -exec du -sk {} \; -prune 2>&-

(Find every directory with 'Cache' in its name, show its disk usage in KB, and don't show any errors along the way.) This found all sorts of interesting things, including folders from applications I had long ago uninstalled.

Now I could search for other directories belonging to these long-gone applications. For example:

sudo find / -type d -name '*TweetDeck*' -exec du -sh {} \; -prune 2>&-
 12K    /Users/ned/Library/Application Support/Fluid/FluidApps/TweetDeck
 84K    /Users/ned/Library/Caches/com.fluidapp.FluidApp.TweetDeck
 26M    /Users/ned/Library/Containers/com.twitter.TweetDeck
1.7M    /Users/ned/Library/Saved Application State/com.fluidapp.FluidApp.TweetDeck.savedState
sudo find / -type d -name '*twitter-mac*' -exec du -sh {} \; -prune 2>&-
288K    /private/var/folders/j2/gr3cj3jn63s5q8g3bjvw57hm0000gp/C/com.twitter.twitter-mac
 99M    /Users/ned/Library/Containers/com.twitter.twitter-mac
4.0K    /Users/ned/Library/Group Containers/N66CZ3Y3BX.com.twitter.twitter-mac.today-group

That's about 128MB of junk left behind by two applications I no longer have. In the scheme of things, 128MB isn't that much, but it's a lot more disk space than I want to devote to applications I've discarded. And what about other apps I tried and removed? Why leave this? Am I missing something that should have handled this for me?

One of Them

Thursday 3 November 2016

I have not written here about this year's presidential election. I am as startled, confused, and dismayed as many others about how Trump has managed to con people into following him, with nothing more than bluster and lies.

It feels enormous to take it on in writing. Nathan Uno also feels as I do, but for different reasons. I've never met Nathan: he's an online friend, part of a small close-knit group who mostly share a religious background, and who enjoy polite intellectual discussions of all sorts of topics. I'm not sure why they let me in the group... :)

Nathan and I started talking about our feelings about the election, and it quickly became clear that he had a much more visceral reason to oppose Trump than I did. I encouraged him to write about it, and he did. Here it is, "One of Them."

•    •    •

One of Them

Armed police came in the middle of the night and in the middle of winter, to take a husband away from his wife and a father away from his children. No explanation was given and his family was not allowed to see him or even know where he was being held. A few months later the man’s wife and children were also rounded up and taken away. They had only the belongings that they could carry with them, leaving everything else to be lost or stolen or claimed by others, including some of the family’s most precious possessions. The family was imprisoned in a camp surrounded by barbed wire and armed soldiers. They had little food and little heat and absolutely no freedom. A few months after the wife and children arrived they were finally reunited with their husband and father, seven months after he was taken from them in the night. They remained together at the camp for years until being released, given $25 and a bus ticket each, and left to try to put their shattered lives back together.

No member of the family was ever charged with a crime. In fact, no member of the family was ever even suspected of a crime. They were imprisoned, along with tens of thousands of others, simply for being “one of them.”

This is the story of my grandfather’s family. And my grandmother’s family. And tens of thousands of other families of Japanese descent who had the misfortune of living on the Pacific coast of the United States after the attack on Pearl Harbor.

In the 1980s the U.S. government formally apologized, acknowledging their mistake, and financial reparations were made. Growing up I believed that we, as a country, had moved on, had learned a lesson. It never occurred to me that such a thing could happen again. And yet here we are, with a presidential candidate who has openly advocated violence against his opponents and detractors, offered to pay legal fees for those who break the law on his behalf, recommended policies that would discriminate against people based on their ethnicity, religion, or country of ancestry, suggested that deliberately killing women and children might be an appropriate response to terrorism, and yes, even said that he “might have” supported the policies that imprisoned my family.

Xenophobic public policy leaves enduring scars on our society, scars that may not be obvious at first. We have Chinatowns today largely because public policy in San Francisco in the late 1800s pushed Chinese immigrants to live in a specific neighborhood. The proliferation of Chinese restaurants and Chinese laundries in our country can be traced back to the same time period, when policy restricted employment opportunities for Chinese immigrants and pushed them into doing low-paying “women’s work,” like cooking and cleaning.

I’ve chosen to make my point with these simple examples from the history of Asian Americans because that’s my heritage. But these examples are trivial compared to the deep, ugly scars left on our society by slavery, and Jim Crow, and the near genocide of the Native American peoples. And despite many positive gains, women continue to be at a significant disadvantage from millennia of policies designed to keep “them” from being on equal footing with “us.”

But the real danger of Donald Trump isn’t that he, himself, is a xenophobe and threatens to enact xenophobic policy. The danger is that Trump rallies xenophobes, and justifies and condones their behavior and attitudes. The harsh, unfair internment of my family during World War II was only the beginning of decades of discrimination and abuse. Members of my family were spat upon and threatened and passed over for employment and educational opportunities. And they were the lucky ones — other Japanese Americans were shot at and had their homes set on fire.

In 1945, four men were accused of causing an explosion and a fire on the property of the Doi family, who had recently returned from Colorado’s Grenada internment camp. One of the men confessed and implicated the others. At trial, their lawyer simply argued that “this is a white man’s country” and that his clients’ actions were necessary to keep it that way. All four men were acquitted by the jury, a jury doubtless influenced by the fact that the federal government had chosen to imprison the Doi family for years. The federal government declared them to be a danger simply because of their Japanese heritage, a declaration that was used to justify violence.

And we’re seeing the same again today: violence at Trump’s rallies and by some of Trump’s supporters. Violence that is either condoned or ignored by Donald Trump. My wife is not an American, nor is the rest of her family who currently reside in the United States. I am not white, nor is the rest of my family, which means that my children aren’t white either. We have family members of various ethnicities and friends of different ethnicities and religions. Donald Trump’s rhetoric and proposed policies pose an existential threat to myself, my family, and a number of our friends. But Donald Trump’s supporters may pose a physical threat to our collective safety.

While it worries me that, at the time of this writing, FiveThirtyEight puts Donald Trump’s chances of winning at somewhere around 33%, what I simply cannot fathom is their prediction that roughly 45% of the American public will choose to vote for Donald Trump. 45% of Americans apparently consider themselves to be “one of us,” and seem unconcerned about what might happen to “them.” If you are still reading this you may not be one of those people. But if you are considering voting for Donald Trump, or know others who are, I implore you to carefully consider your decision.

Donald Trump does not deserve your support, because he is not on your side. He does not share your ideology. He does not support your viewpoints in any meaningful way. Donald Trump is many things, but more than anything he’s an opportunist. His pursuit of the presidency is about his own self interest, whether that be feeding his ego or preparing for his next set of business schemes. It’s not about what’s best for you, or for the country.

Perhaps you’re a Republican and believe that your party’s interests are of paramount importance. Donald Trump is not a champion of your party’s interests - he is an opportunist who only cares about his own interests. He does not hold to the Republican party line, has attacked key members of your leadership, and is actively dividing, and possibly destroying, your party right now. A vote for Donald Trump isn’t a vote to save the Republican party, it’s a vote for the destruction of the Republican party so that one man can promote his own public persona and guarantee himself the attention he so desperately craves.

Perhaps you’re a Christian and believe the Christian leaders who’ve told you that Trump is the right choice for Christians. Donald Trump is not a defender of the Christian faith - he is an opportunist interested only in defending his own fame and expanding his power and influence. His behavior is consistently antithetical to Christian values and he has shown a dramatic lack of understanding of Christ and the Bible. A vote for Donald Trump isn’t a vote to protect Christian values, it’s a vote to protect the personal interests and appalling lack of character of a man whose behavior is entirely un-Christ-like.

Perhaps you’re pro-life and believe that the sanctity of human life must take precedence over all other issues. Donald Trump doesn’t care about the sanctity of human life - he is an opportunist who puts the sanctity of his own life above all others, and is happy to look out for the lives of those who support him, but cares not about the lives of those who oppose him. A man who openly advocates the murder of the wives and children of suspected terrorists does not care about the sanctity of a pregnant woman’s life or the sanctity of the life of that woman’s unborn child. Trump has no real plans to end abortion. In fact, if you look carefully, you can find the week in his campaign where he changed his position on abortion five different times, carefully experimenting to find the position that would gain him the most support. A vote for Donald Trump isn’t a vote to protect the sanctity of human life, it’s a vote that protects the idea that “our” lives matter and “their” lives don’t.

Perhaps you’re concerned about the threat of terrorism and value the safety of our country over all other concerns. Donald Trump is not interested in defusing the threat of terrorism - he is an opportunist who can’t wait to exercise more power than he’s ever had before. His approach to guaranteeing the “safety” of our nation is to abandon our allies, pulling out of strategic partnerships like NATO, and ramp up the level of violence against terrorists and “terrorist nations.” He has openly talked about attacking countries in the Middle East simply to seize their oil, without any regard to how that might affect America’s relationship with other nations or encourage additional forms of terrorism. A vote for Donald Trump isn’t a vote to fight the growing threat of terrorism, it’s a vote to give dangerous amounts of power to a man committed to wielding that power to fight whomever he sees as an opponent, regardless of the consequences or the impact on others.

Perhaps you’ve faced economic hardship for some time and you hope that he will provide you with more financial or job security. Donald Trump is unconcerned with your economic security - he is an opportunist who is concerned only with his own economic security. He doesn’t want you to see his tax returns because he doesn’t want you to see how much he’s earned while you’ve suffered, or how many taxes he’s avoided paying while you’ve been struggling to pay yours. He’s been consistently accused of refusing to pay people for work that they’ve done on his behalf. A vote for Donald Trump isn’t a vote to improve the prosperity of the working class, it’s a vote to improve the prosperity of Donald Trump, perhaps not in the short term, but certainly in the long term.

Or perhaps you have an entirely different reason for voting for Trump. Regardless of your reason, Trump is not on your side. He is an opportunist, and nothing more. It’s possible that you might benefit if your interests are directly aligned with his, but please consider the many many lives that may be negatively impacted along the way, and understand that Trump has a history of taking people from the “us” category and putting them into the “them” category at the slightest provocation. A vote for Trump is a vote guaranteed only to benefit Donald Trump. Others might benefit, but only as a secondary effect to the benefits gained by Donald Trump.

To be clear: I am not a fan of Hillary Clinton. Or of Bill Clinton. Or of the Democratic party, or of their policies. I disagree with many so-called “liberal” viewpoints. The prospect of Hillary Clinton as a president is not at all ideal from my perspective. But that prospect does not fill me with fear, and so I will be obliged, for the first time in my life, to cast a vote for the Democratic party’s candidate for president. I implore you to carefully consider doing the same.

Multi-parameter Jupyter notebook interaction

Saturday 29 October 2016

I'm working on figuring out retirement scenarios. I wasn't satisfied with the usual online calculators. I made a spreadsheet, but it was hard to see how the different variables affected the outcome. Aha! This sounds like a good use for a Jupyter Notebook!

Using widgets, I could make a cool graph with sliders for controlling the variables, and affecting the result. Nice.

But there was a way to make the relationship between the variables and the outcome more apparent: choose one of the variables, and plot its multiple values on a single graph. And of course, I took it one step further, so that I could declare my parameters, and have the widgets, including the selection of the variable to auto-slide, generated automatically.

I'm pleased with the result, even if it's a little rough. You can download retirement.ipynb to try it yourself.

The general notion of a declarative multi-parameter model with an auto-slider is contained in a class:

%pylab --no-import-all inline

from collections import namedtuple

from ipywidgets import interact, IntSlider, FloatSlider

class Param(namedtuple('Param', "default, range")):
    """A parameter for `Model`."""

    def make_widget(self):
        """Create a widget for a parameter."""
        is_float = isinstance(self.default, float)
        is_float = is_float or any(isinstance(v, float) for v in self.range)
        wtype = FloatSlider if is_float else IntSlider
        return wtype(
            value=self.default,
            min=self.range[0], max=self.range[1], step=self.range[2],
        )

class Model:
    """A multi-parameter model."""

    output_limit = None
    num_auto = 7

    def _show_it(self, auto_param, **kw):
        if auto_param == 'None':
            plt.plot(self.inputs, self.run(self.inputs, **kw))
        else:
            autop = self.params[auto_param]
            auto_values = np.arange(*autop.range)
            if len(auto_values) > self.num_auto:
                lo, hi = autop.range[:2]
                auto_values = np.arange(lo, hi, (hi-lo)/self.num_auto)
            for auto_val in auto_values:
                kw[auto_param] = auto_val
                output = self.run(self.inputs, **kw)
                plt.plot(self.inputs, output, label=str(auto_val))
            plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
        if self.output_limit is not None:
            plt.ylim(*self.output_limit)

    def interact(self):
        widgets = {
            name: p.make_widget() for name, p in self.params.items()
        }
        param_names = ['None'] + sorted(self.params)
        interact(self._show_it, auto_param=param_names, **widgets)

To make a model, derive a class from Model. Define a dict called params as a class attribute. Each parameter has a default value, and a range of values it can take, expressed as (min, max, step):

class Retirement(Model):
    params = dict(
        invest_return=Param(3, (1.0, 8.0, 0.5)),
        p401k=Param(10, (0, 25, 1)),
        retire_age=Param(65, (60, 75, 1)),
        live_on=Param(100000, (50000, 150000, 10000)),
        inflation=Param(2.0, (1.0, 4.0, 0.25)),
        inherit=Param(1000000, (0, 2000000, 200000)),
        inherit_age=Param(70, (60, 90, 5)),
    )

Your class can also have some constants:

start_savings = 100000
salary = 100000
socsec = 10000

Define the inputs to the graph (the x values), and the range of the output (the y values):

inputs = np.arange(30, 101)
output_limit = (0, 10000000)

Finally, define a run method that calculates the output from the inputs. It takes the inputs as an argument, and also has a keyword argument for each parameter you defined:

def run(self, inputs,
    invest_return, p401k, retire_age, live_on,
    inflation, inherit, inherit_age
):
    for year, age in enumerate(inputs):
        if year == 0:
            yearly_money = [self.start_savings]
            continue
        inflation_factor = (1 + inflation/100)**year
        money = yearly_money[-1]
        money = money*(1+(invest_return/100))
        if age == inherit_age:
            money += inherit
        if age <= retire_age:
            money += self.salary * inflation_factor * (p401k/100)
        else:
            money += self.socsec
            money -= live_on * inflation_factor
        yearly_money.append(money)

    return np.array(yearly_money)

To run the model, just instantiate it and call interact():

Retirement().interact()


You'll get widgets and a graph like this:

Jupyter notebook, in action

There are things I would like to be nicer about this:

  • The sliders are a mess: if you make too many parameters, the slider and the graph don't fit on the screen.
  • The values chosen for the auto parameter are not "nice", like tick marks on a graph are nice.
  • It'd be cool to be able to auto-slide two parameters at once.
  • The code isn't packaged in a way people can easily re-use.

I thought about fixing a few of these things, but I likely won't get to them. The code is here in this blog post or in the notebook file if you want it. Ideas welcome about how to make improvements.

BTW: my retirement plans are not based on inheriting a million dollars when I am 70, but it's easy to add parameters to this model, and it's fun to play with...


Even older...