Rubik's algorithms

Sunday 26 February 2017

Recently, a nephew asked about how to solve a Rubik's Cube. I couldn't sit down with him to show him what I knew, so I looked around the web for explanations. I was surprised by two things: first, that all the pages offering solutions seemed to offer the same one, even down to the colors discussed: "Start by making a white cross, ..., finally, finish the yellow side."

Second, that the techniques (or "algorithms") were often given without explanation. They're presented as something to memorize.

My own solving technique uses a few algorithms constructed in a certain way that I describe in Two-Part Rubik's Algorithms. I wrote them up as a resource I hope my nephew will be able to use.

A Rubik's Cube with two edges flipped

BTW, that page makes use of Conrad Rider's impressive TwistSim library.

Https

Saturday 25 February 2017

Someone posted a link to my latest blog post on /r/Python, but somehow got an https link for it. That's odd: my site doesn't even properly serve content over https. People were confused by the broken link.

I should say, my site didn't even serve content over https, because now it does. I'd been meaning to enable https, and force its use, for a long time. This broken link pushed it to the top of the list.

Let's Encrypt is the certificate authority of choice these days, because they are free and automatable. And people say they make it easy, but I have to say, I would not have classified this as easy. I'm sure it's easier than it used to be, but it's still a confusing maze of choices, with decision points you are expected to navigate.

Actually getting everything installed requires sudo, or, without sudo, third-party tools and instructions from obscure blog posts. There's clearly still room for improvement.

Once you have the certificate in place, you need to redirect your http site to https. Then you have to fix the http references in your site. Protocol-relative (or scheme-relative) URLs are handy here: a reference like //example.com/lib.js uses whichever scheme the page itself was loaded with.

It's all done now: the entire site should always be https. I'm glad I finally got the kick in the pants to do it. If you find something wrong, let me know.

A tale of two exceptions, continued

Thursday 23 February 2017

In my last blog post, A tale of two exceptions, I laid out the long drawn-out process of trying to get a certain exception to make tests skip in my test runner. I ended on a solution I liked at the time.

But it still meant having test-specific code in the product code, even if it was only a single line to set a base class for an exception. It didn't feel right to say "SkipTest" in the product code, even once.

In that blog post, I said,

One of the reasons I write this stuff down is because I'm hoping to get feedback that will improve my solution, or advance my understanding. ... a reader might object and say, "you should blah blah blah."

Sure enough, Ionel said,

A better way is to handle this in coverage's test suite. Possible solution: wrap all your tests in a decorator that reraises with a SkipException.

I liked this idea. The need was definitely a testing need, so it should be handled in the tests. First I tried doing something with pytest to get it to do the conversion of exceptions for me. But I couldn't find a way to make it work.

So: how to decorate all my tests? The decorator itself is fairly simple. Just call the method with all the arguments, and return its value, but if it raises StopEverything, then raise SkipTest instead:

def convert_skip_exceptions(method):
    """A decorator for test methods to convert StopEverything to SkipTest."""
    def wrapper(*args, **kwargs):
        """Run the test method, and convert exceptions."""
        try:
            result = method(*args, **kwargs)
        except StopEverything:
            raise unittest.SkipTest("StopEverything!")
        return result
    return wrapper

But decorating all the test methods would mean adding a @convert_skip_exceptions line to hundreds of test methods, which I clearly was not going to do. I could use a class decorator, which meant I would only have to add a decorator line to dozens of classes. That also felt like too much to do and remember to do in the future when I write new test classes.

It's not often I say this, but: it was time for a metaclass. Metaclasses are one of the darkest magics Python has, and they can be mysterious. At heart, they are simple, but in a place you don't normally think to look. Just as a class is used to make objects, a metaclass is used to make classes. Since there's something I want to do every time I make a new class (decorate its methods), a metaclass gives me the tools to do it.

class SkipConvertingMetaclass(type):
    """Decorate all test methods to convert StopEverything to SkipTest."""
    def __new__(mcs, name, bases, attrs):
        for attr_name, attr_value in attrs.items():
            right_name = attr_name.startswith('test_')
            right_type = isinstance(attr_value, types.FunctionType)
            if right_name and right_type:
                attrs[attr_name] = convert_skip_exceptions(attr_value)

        return super(SkipConvertingMetaclass, mcs).__new__(mcs, name, bases, attrs)

There are details here that you can skip as incantations if you like. Classes are all instances of "type", so if we want to make a new thing that makes classes, it derives from type to get those same behaviors. The method that gets called when a new class is made is __new__. It gets passed the metaclass itself (just as classmethods get cls and instance methods get self), the name of the class, the tuple of base classes, and a dict of all the names and values defining the class (the methods, attributes, and so on).

The important part of this metaclass is what happens in the __new__ method. We look at all the attributes being defined on the class. If the name starts with "test_", and it's a function, then it's a test method, and we decorate the value with our decorator. Remember that @-syntax is just a shorthand for passing the function through the decorator, which we do here the old-fashioned way.
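
As a reminder of that equivalence, here's the same decoration written both ways (a toy illustration, using the convert_skip_exceptions decorator from above):

# Using the @-syntax:
@convert_skip_exceptions
def test_something():
    ...

# ...is exactly the same as decorating "the old-fashioned way":
def test_something():
    ...
test_something = convert_skip_exceptions(test_something)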

Then we use super to let the usual class-defining mechanisms in "type" do their thing. Now all of our test methods are decorated, with no explicit @-lines in the code. There's only one thing left to do: make sure all of our test classes use the metaclass:

CoverageTestMethodsMixin = SkipConvertingMetaclass('CoverageTestMethodsMixin', (), {})

class CoverageTest(
    ... some other mixins ...
    CoverageTestMethodsMixin,
    unittest.TestCase,
):
    """The base class for all coverage.py test classes."""

Metaclasses make classes, just the way classes make instances: you call them. Here we call ours with the arguments it needs (class name, base classes, and attributes) to make a class called CoverageTestMethodsMixin.
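
For comparison, calling the built-in metaclass type directly does the same kind of thing: it builds a class from a name, a tuple of bases, and a dict of attributes (a standalone illustration, not coverage.py code):

# Equivalent to writing "class Point: x = 0" as a class statement:
Point = type('Point', (), {'x': 0})

assert isinstance(Point, type)
assert Point().x == 0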

Then we use CoverageTestMethodsMixin as one of the base classes of CoverageTest, which is the class used to derive all of the actual test classes.

Pro tip: if you are using unittest-style test classes, make a single class the base of all of your test classes; you will be glad you did.

After all of these class machinations, what have we got? Our test classes all derive from a base class which uses a metaclass to decorate all the test methods. As a result, any test which raises StopEverything will instead raise SkipTest to the test runner, and the test will be skipped. There's now no mention of SkipTest in the product code at all. Better.

A tale of two exceptions

Sunday 22 January 2017

It was the best of times, it was the worst of times...

This week saw the release of three different versions of Coverage.py. This is not what I intended. Clearly something was getting tangled up. It had to do with some tricky exception handling. The story is kind of long and intricate, but has a number of chewy nuggets that fascinate me. Your mileage may vary.

Writing it all out, many of these missteps seem obvious and stupid. If you take nothing else from this, know that everyone makes mistakes, and we are all still trying to figure out the best way to solve some problems.

It started because I wanted to get the test suite running well on Jython. Jython is hard to support in Coverage.py: it can do "coverage run", but because it doesn't have the same internals as CPython, it can't do "coverage report" or any of the other reporting code. Internally, there's one place in the common reporting code where we detect this, and raise an exception. Before all the changes I'm about to describe, that code looked like this:

for attr in ['co_lnotab', 'co_firstlineno']:
    if not hasattr(self.code, attr):
        raise CoverageException(
            "This implementation of Python doesn't support code analysis.\n"
            "Run coverage.py under CPython for this command."
        )

The CoverageException class is derived from Exception. Inside of Coverage.py, all exceptions raised are derived from CoverageException. This is a good practice for any library. For the coverage command-line tool, it means we can catch CoverageException at the top of main() so that we can print the message without an ugly traceback from the internals of Coverage.py.
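
To make that pattern concrete, here is a minimal sketch of a main() that catches a library-wide base exception. This is not coverage.py's actual code; run_command and the message are hypothetical stand-ins:

import sys

class CoverageException(Exception):
    """Stand-in for coverage.py's base exception, for illustration."""

def run_command(argv):
    # Hypothetical dispatcher standing in for the real command handling.
    raise CoverageException("Something went wrong, in a way we anticipated.")

def main(argv=None):
    try:
        run_command(argv)
    except CoverageException as exc:
        # Print just the message, with no traceback from the internals.
        print(exc, file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())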

The problem with running the test suite under Jython is that this "can't support code analysis" exception was being raised from hundreds of tests. I wanted to get to zero failures or errors, either by making the tests pass (where the operations were supported on Jython) or skipping the tests (where the operations were unsupported).

There are lots of tests in the Coverage.py test suite that are skipped for all sorts of reasons. But I didn't want to add decorators or conditionals to hundreds of tests for the Jython case. First, it would be a lot of noise in the tests. Second, it's not always immediately clear from a test that it is going to touch the analysis code. Lastly and most importantly, if someday in the future I figured out how to do analysis on Jython, or if it grew the features to make the current code work, I didn't want to have to then remove all that test-skipping noise.

So I wanted to somehow automatically skip tests when this particular exception was raised. The unittest module already has a way to do this: tests are skipped by raising a unittest.SkipTest exception. If the exception raised for "can't support code analysis" derived from SkipTest, then the tests would be skipped automatically. Genius idea!
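
Here's that skipping mechanism in isolation (a plain unittest example, nothing coverage-specific):

import unittest

class ExampleTest(unittest.TestCase):
    def test_gets_skipped(self):
        # Raising SkipTest (or anything derived from it) marks the test
        # as skipped rather than failed.
        raise unittest.SkipTest("demonstrating the skip mechanism")

if __name__ == "__main__":
    unittest.main()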

So in 4.3.2, the code changed to this (spread across a few files):

from coverage.backunittest import unittest

class StopEverything(unittest.SkipTest):
    """An exception that means everything should stop.

    This derives from SkipTest so that tests that spring this trap will be
    skipped automatically, without a lot of boilerplate all over the place.

    """
    pass

class IncapablePython(CoverageException, StopEverything):
    """An operation is attempted that this version of Python cannot do."""
    pass

...

# Alternative Python implementations don't always provide all the
# attributes on code objects that we need to do the analysis.
for attr in ['co_lnotab', 'co_firstlineno']:
    if not hasattr(self.code, attr):
        raise IncapablePython(
            "This implementation of Python doesn't support code analysis.\n"
            "Run coverage.py under another Python for this command."
        )

It felt a little off to derive a product exception (StopEverything) from a testing exception (SkipTest), but that seemed acceptable. In one place in the code, I had to deal specifically with StopEverything. In an inner loop of reporting, we catch exceptions that might happen on individual files being reported. But if this exception happens once, it will happen for all the files, so we wanted to end the report, not show this failure for every file. In pseudo-code, the loop looked like this:

for f in files_to_report:
    try:
        generate_report_for_file(f)
    except StopEverything:
        # Don't report this on single files, it's a systemic problem.
        raise
    except Exception as ex:
        record_exception_for_file(f, ex)

This all seemed to work well: the tests skipped properly, without a ton of noise all over the place. There were no test failures in any supported environment. Ship it!

Uh-oh: very quickly, reports came in that coverage didn't work on Python 2.6 any more. In retrospect, it was obvious: the whole point of the "from coverage.backunittest" line in the code above was because Python 2.6 doesn't have unittest.SkipTest. For the Coverage.py tests on 2.6, I install unittest2 to get a backport of things 2.6 is missing, and that gave me SkipTest, but without my test requirements installed, it doesn't exist.

So my tests passed on 2.6 because I installed a package that provided what was missing, but in the real world, unittest.SkipTest is truly missing.

This is a conundrum that I don't have a good answer to:

How can you test your code to be sure it works properly when the testing requirements aren't installed?

To fix the problem, I changed the definition of StopEverything. Coverage.py 4.3.3 went out the door with this:

class StopEverything(unittest.SkipTest if env.TESTING else object):
    """An exception that means everything should stop."""
    pass

The env.TESTING setting was a pre-existing variable: it's true if we are running the coverage.py test suite. This also made me uncomfortable: as soon as you start conditionalizing on whether you are running tests or not, you have a very slippery slope. In this case it seemed OK, but it wasn't: it hid the fact that deriving an exception from object is a dumb thing to do.

So 4.3.3 failed also, and not just on Python 2.6. As soon as an exception was raised inside that reporting loop that I showed above, Python noticed that I was trying to catch a class that doesn't derive from Exception. Of course, my test suite didn't catch this, because when I was running my tests, my exception derived from SkipTest.

Changing "object" to "Exception" would fix the problem, but I didn't like the test of env.TESTING anyway. So for 4.3.4, the code is:

class StopEverything(getattr(unittest, 'SkipTest', Exception)):
    """An exception that means everything should stop."""
    pass

This is better, first because it uses Exception rather than object. But also, it's duck-typing the base class rather than depending on env.TESTING.

But as I kept working on getting rid of test failures on Jython, I got to this test failure (pseudo-code):

def test_sort_report_by_invalid_option(self):
    msg = "Invalid sorting option: 'Xyzzy'"
    with self.assertRaisesRegex(CoverageException, msg):
        coverage.report(sort='Xyzzy')

This is a reporting operation, so Jython will fail with a StopEverything exception saying, "This implementation of Python doesn't support code analysis." StopEverything is a CoverageException, so the assertRaisesRegex will catch it, but it will fail because the messages don't match.

StopEverything is both a CoverageException and a SkipTest, but the SkipTest is the more important aspect. To fix the problem, I did this, but felt silly:

def test_sort_report_by_invalid_option(self):
    msg = "Invalid sorting option: 'Xyzzy'"
    with self.assertRaisesRegex(CoverageException, msg):
        try:
            coverage.report(sort='Xyzzy')
        except SkipTest:
            raise SkipTest()

I knew this couldn't be the right solution. Talking it over with some co-workers (OK, I was griping and whining), we came up with the better solution. I realized that CoverageException is used in the code base to mean, "an ordinary problem from inside Coverage.py." StopEverything is not an ordinary problem. It reminded me of typical mature exception hierarchies, where the main base class, like Exception, isn't actually the root of the hierarchy. There are always a few special-case classes that derive from a real root higher up.

For example, in Python, the classes Exception, SystemExit, and KeyboardInterrupt all derive from BaseException. This is so "except Exception" won't interfere with SystemExit and KeyboardInterrupt, two exceptions meant to forcefully end the program.
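
A quick way to see that hierarchy for yourself (plain Python, nothing coverage-specific):

# SystemExit and KeyboardInterrupt sit outside the Exception branch:
assert issubclass(SystemExit, BaseException)
assert issubclass(KeyboardInterrupt, BaseException)
assert not issubclass(SystemExit, Exception)
assert not issubclass(KeyboardInterrupt, Exception)
# ...so an "except Exception:" clause will not swallow either of them.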

I needed the same thing here, for the same reason. I want to have a way to catch "all" exceptions without interfering with the exceptions that mean "end now!" I adjusted my exception hierarchy, and now the code looks like this:

class BaseCoverageException(Exception):
    """The base of all Coverage exceptions."""
    pass

class CoverageException(BaseCoverageException):
    """A run-of-the-mill exception specific to coverage.py."""
    pass

class StopEverything(
        BaseCoverageException,
        getattr(unittest, 'SkipTest', Exception)
    ):
    """An exception that means everything should stop."""
    pass

Now I could remove the weird SkipTest dance in that test. The catch clause in my main() function changes from CoverageException to BaseCoverageException, and things work great. The end...?

One of the reasons I write this stuff down is because I'm hoping to get feedback that will improve my solution, or advance my understanding. As I lay out this story, I can imagine points of divergence: places in this narrative where a reader might object and say, "you should blah blah blah." For example:

  • "You shouldn't bother supporting 2.6." Perhaps not, but that doesn't change the issues explored here, just makes them less likely.
  • "You shouldn't bother supporting Jython." Ditto.
  • "You should just have dependencies for the things you need, like unittest2." Coverage.py has a long-standing tradition of having no dependencies. This is driven by a desire to be available to people porting to new platforms, without having to wait for the dependencies to be ported.
  • "You should have more realistic integration testing." I agree. I'm looking for ideas about how to test the scenario of having no test dependencies installed.

That's my whole tale. Ideas are welcome.

Update: the story continues, but fair warning: metaclasses ahead!

Evil ninja module initialization

Tuesday 10 January 2017

A question about import styles on the Python-Dev mailing list asked about imports like this:

import os as _os

Understanding why people do this is an interesting lesson in how modules work. A module is nothing more than a collection of names. When you define a name in a .py file, it becomes an attribute of the module, and is then importable from the module.

An underlying simplicity in Python is that many statements are really just assignment statements in disguise. All of these define the name X:

X = 17
def X(): print("look!")
import X

When you create a module, you can make the name "X" importable from that module by assigning to it, or defining it as a function. You can also make it importable by importing it yourself.

Suppose your module looks like this:

# yourmodule.py
import os

def doit():
    os.something_or_other()

This module has two names defined in it: "doit", and "os". Someone else can now do this:

# someone.py
from yourmodule import os

# or worse, this imports os and doit:
from yourmodule import *

This bothers some people. "os" is not part of the actual interface of yourmodule. That first import I showed prevents this leaking of your imports into your interface. Importing star doesn't pull in names starting with underscores. (Another solution is to define __all__ in your module.)
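
For reference, the __all__ approach looks like this (a sketch of the same hypothetical yourmodule.py):

# yourmodule.py
import os

__all__ = ["doit"]      # "from yourmodule import *" now only brings in doit

def doit():
    os.something_or_other()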

Most people though, don't worry about this kind of name leaking. Import-star is discouraged anyway, and people know not to import os from other modules. The solution of renaming os to _os just makes your code ugly for little benefit.

The part of the discussion thread that really caught my eye was Daniel Holth's winking suggestion of the "evil ninja mode pattern" of module initialization:

def ninja():
    global exported
    import os
    def exported():
        os.do_something()

ninja()
del ninja

What's going on here!? Remember that def is an assignment statement like any other. When used inside a function, it defines a local name, as assignment always does. But an assignment in a function can define a global name if the name is declared as global. It's a little unusual to see a global statement without an explicit assignment at the top-level, but it works just fine. The def statement defines a global "exported" function, because the global statement told it to. "os" is now a local in our function, because again, the import statement is just another form of assignment.
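
Here's that mechanism on its own, stripped of the ninja trappings (a toy example):

def make_it():
    global exported
    # "def" is an assignment, and the global statement makes that
    # assignment bind the name at module level.
    def exported():
        print("hello")

make_it()
exported()    # prints "hello"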

So we define ninja(), and then execute it immediately. This defines the global "exported", and doesn't define a global "os". The only problem is the name "ninja" has been defined, which we can clean up with a del statement.

Please don't ever write code this way. It's a kind of over-defensiveness that isn't needed in typical Python code. But understanding what it does, and why it does it, is a good way to flex your understanding of Python workings.

For more about how names (and values) work in Python, people seem to like my PyCon talk, Python Names and Values.

No PyCon for me this year

Thursday 5 January 2017

2017 will be different for me in one specific way: I won't be attending PyCon. I've been to ten in a row:

Ten consecutive PyCon badges

This year, Open edX con is in Madrid two days after PyCon, actually overlapping with the sprints. I'm not a good enough traveler to do both. Crossing nine timezones is not something to be taken lightly.

I'll miss the usual love-fest at PyCon, but after ten in a row, it should be OK to miss one. I can say that now, but probably in May I will feel like I am missing the party. Maybe I really will watch talks on video for a change.

I usually would be working on a presentation to give. I like making presentations, but it is a lot of work. This spring I'll have that time back.

In any case, this will be a new way to experience the Python community. See you all in 2018 in Cleveland!

D'oh: Coverage.py 4.3.1

Wednesday 28 December 2016

Yesterday I released five months of fixes as Coverage.py 4.3, and today I am releasing Coverage.py 4.3.1. This is not because releasing is fun, but because releasing is error-prone.

Two bad problems were very quickly reported by my legions of adoring fans, and they are now fixed. I'll sheepishly tell you that one of them was a UnicodeError in a bit of too-cute code in setup.py.

Perhaps I should have released a 4.3 beta. But my experience in the past is that betas do not get the kind of attention that final releases do. Partly this is just due to people's attention budget: lots of people won't install a beta. But it's also due to continuous integration servers. When a final release is out, hundreds if not thousands of CI servers will install it automatically as part of the next triggered build. They won't install pre-releases.

So there's a not-great choice to make: should I put out a beta, and hope that people try it and tell me what went wrong? Will enough people in enough disparate environments take that step to truly test the release?

Or should I skip that step, jump straight to a final release, and prepare instead to quickly fix whatever problems occur? I chose the latter course for 4.3. I guess I could use meta-feedback about which form of feedback I should pursue in the future...

Coverage.py 4.3

Tuesday 27 December 2016

The latest Coverage.py release: Coverage.py 4.3 is ready.

This version adds --skip-covered support to the HTML report, implements sys.excepthook support, reads configuration from tox.ini, and contains improvements that close 18 issues. The complete change history is in the source.

A special shout-out to Loïc Dachary: he read my blog post about Who Tests What, and got interested in contributing. And I mean, really interested. Suddenly he seemed to be everywhere, making pull requests and commenting on issues. In a week, I had 122 emails due to his activity. That energy really helped push me along, and is a big reason why this release happened, five months after 4.2.

Random trivia: this is the 30th version on PyPI; it's the 57th if you include pre-releases.

Finding test coupling

Thursday 22 December 2016

Before we get started: this is a story about a problem I had and how I solved it. This retelling is leaving out lots of small false trails and hard learnings, which I will summarize at the end. I report these stories not to lecture from on high, but to share with peers, help people learn, and ideally, elicit teachings from others so that I can do it better next time. The main qualities I am demonstrating here are not intelligence and experience, but perseverance, patience, and optimism.

OK, on with the story:

Running our large test suite the other day, we got a test failure. It seemed unrelated to the changes we were making, but you can never be sure, so we investigated. Along the way I used a few techniques to narrow down, widen, and identify suspects.

Running just that one test passed, but running the whole test suite, it failed, and this behavior was repeatable. So we had some kind of coupling between tests. Ideally, all tests would be isolated from each other. Perfect test isolation would mean that no matter what order you ran tests, and no matter what subset of tests you ran, the results would be the same. Clearly we did not have perfect test isolation.

The job now was to find the test we were coupled with, or perhaps one of the many possible tests that we were coupled with.

The test failure itself was a UnicodeError while trying to log a warning message involving a username with a non-ASCII character in it. Apparently this is something that doesn't work well: when warnings are routed through the logging system, if the message is actually logged, and the message has a non-ASCII Unicode string, an exception will happen. That's unfortunate, but we'll have to live with that for the moment.

Our best guess at the moment is that when the test passes, it's because either the warnings settings, or the logging settings, are deciding not to log the warning. When the test fails, it's because some previous test has changed one (or both!) of those settings, causing the message to proceed all the way through the warnings/logging pipeline, to the point of producing the UnicodeError. This is a plausible theory because those settings are global to the process, and would be easy to change without realizing the consequences for test suites.
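
For a sense of what kind of global state is involved, these are the sorts of process-wide switches a previous test could flip without restoring them (illustrative only; this is not the actual culprit code):

import logging
import warnings

# Both of these change process-wide state: every test that runs afterward
# sees the new behavior unless it is carefully restored.
logging.captureWarnings(True)       # route warnings.warn() through logging
warnings.simplefilter("always")     # stop suppressing repeated warnings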

But we still have to find that test. Here's the command that runs just the one test that failed:

python ./manage.py lms test --verbosity=2 --with-id --settings=test \
    --xunitmp-file=/edx/app/edxapp/edx-platform/reports/lms/nosetests.xml \
    --with-database-isolation \
    openedx/core/djangoapps/external_auth/tests/test_openid_provider.py:OpenIdProviderTest.test_provider_login_can_handle_unicode_email_inactive_account

This is the Django test runner, using nose. That last line selects one particular test method in one particular class in one specific test file. To try to find a failing combination, we'll widen the scope of our test run by peeling off trailing components. This will give us progressively more tests in the run, and eventually (we hope), the test will fail:

openedx/core/djangoapps/external_auth/tests/test_openid_provider.py:OpenIdProviderTest
openedx/core/djangoapps/external_auth/tests/test_openid_provider.py
openedx/core/djangoapps/external_auth
openedx/core/djangoapps

This last one finally failed, with 1810 tests. That's still too many to examine manually. We can run those tests again, with nose-randomly to randomize the order of the tests. This gives us an opportunity to run experiments where the randomization can tell us something about coupling. If we run the 1810 tests, and our failing test doesn't fail, then none of the tests that ran before it were the one that cause the problem. If the test does fail, then the tests that ran before it might be bad.

I used a bash loop to run those 1810 test over and over, capturing the output in separate result files:

export FLAGS="blah blah omitted for brevity"
for i in $(seq 9999); do
    echo --- $i
    python ./manage.py lms test -v $FLAGS openedx/core/djangoapps -v > test$i.txt 2>&1
done

Overnight, this gave me 72 test result files to look at. The -v and --with-id flags gave us output that looked like this:

... lots of throat-clearing here ...
Synchronizing apps without migrations:
  Creating tables...
    Creating table coursewarehistoryextended_studentmodulehistoryextended
    Running deferred SQL...
  Installing custom SQL...
Running migrations:
  No migrations to apply.
Using --randomly-seed=1482407901
#19 test_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#20 test_configuration_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#21 test_get_value (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#22 test_get_value_for_org (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#23 test_get_dict (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#24 test_get_value_for_org_2 (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
... much more ...

A small Python program provided the analysis: test_analysis.py. (Warning: this is for Python 3.6, so f-strings ahead!)
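
The gist of the analysis is easy to sketch (this is just the idea, not the actual test_analysis.py):

def narrow_suspects(runs, target):
    """Shrink the set of tests that might be poisoning `target`.

    `runs` is a list of (ordered_test_names, target_failed) pairs,
    one pair per randomized run.
    """
    known_good = set()
    maybe_bad = None
    for names, target_failed in runs:
        before_target = set(names[:names.index(target)])
        if target_failed:
            # The culprit ran before the target in every failing run.
            maybe_bad = before_target if maybe_bad is None else maybe_bad & before_target
        else:
            # Anything that ran before a passing target is exonerated.
            known_good |= before_target
    return (maybe_bad or set()) - known_good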

Although I had 72 runs, the results converged after 11 runs: 179 tests were in the maybe-bad set, and more runs didn't reduce the set. That's because of nose-randomly's behavior, which I didn't fully understand: it doesn't completely shuffle the tests. Because of the possibility of module-level and class-level setup code, it randomizes within those scopes, but will not intermix between scopes. The test modules are run in a random order, but everything in one module will always run contiguously. The classes within a module will run in a random order, but all of the methods within a class will run contiguously.

The list of classes that test_analysis.py provided made clear what was going on: all of the maybe-bad tests were in credit/tests/test_views.py. There are 179 tests in that file, and something in there is causing our test failure. Because they always run contiguously, there's no way nose-randomly could give us more information about the true culprit.

Time for some low-tech divide and conquer: we'll run one class from test_views.py, and then our failing test. If we do that once for each class in test_views.py, we should get information about which class to examine. I'd love to tell you I had some clever way to get the list of test classes, but I just manually picked them out of the file and wrote this loop:

export FAILING_TEST=openedx/core/djangoapps/external_auth/tests/test_openid_provider.py:OpenIdProviderTest.test_provider_login_can_handle_unicode_email_inactive_account 
for c in CreditCourseViewSetTests CreditEligibilityViewTests CreditProviderCallbackViewTests CreditProviderRequestCreateViewTests CreditProviderViewSetTests; do
    echo ------------- $c
    python ./manage.py lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/test_views.py:$c \
        $FAILING_TEST 2>&1 | tee ${c}_out | grep unicode_email
done

(My bash-looping skillz were improving each time!) This showed me that three of the five classes were failing. These classes use mixins, and the thing the three classes had in common was the AuthMixin, which provides four test methods. So it's probably one of those methods. I picked the first of the test classes, and ran a new experiment four times, once for each of the four test methods:

for t in test_authentication_required test_oauth test_session_auth test_jwt_auth; do
    echo ---------- $t
    python ./manage.py lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/test_views.py:CreditCourseViewSetTests.$t \
        $FAILING_TEST 2>&1 | tee ${t}_out | grep unicode_email
done

And this showed that test_jwt_auth was the problem! Now I had a two-test scenario that would produce the failure.

To find the line in the test, I could comment out or otherwise neuter parts of the test method and run my fast two-test scenario. The cause was a JWT authorization header in a test client get() function call. JWT-related code is scarce enough in our huge code base that I could identify a likely related function, place a pudb breakpoint, and start walking through code until I found the problem: a line deep in a library that changed the warnings settings! (cue the dramatic music)

Commenting out that line, and running my reproducer confirmed that it was the root cause. A simple pull request fixes the problem. Note in the pull request that the library's test case had a simple mistake that might have been the reason for the bad line to begin with.

It felt really good to find and fix the problem, perhaps because it took so long to find.

As promised: things I didn't know or did wrong:

  • Our test names have random data in them. I didn't realize this until my results started showing the sum of known-good and maybe-bad as greater than the total number of tests. Until then, the numbers were skittering all over the place. Once I canonicalized the file names, the numbers converged quickly.
  • I should have understood how nose-randomly worked earlier.
  • I had to (once again) Google the bash loop syntax.
  • I fat-fingered a few bash loops which gave me mysterious and sometimes discouraging false results.
  • A few times, I spent a while just reading through possible culprit tests looking for clues, which was fruitless, since the actual problem line was in a library in a different repo.

We're all learning. Be careful out there.

Dragon iterators

Saturday 17 December 2016

Advent of Code is running again this year, and I love it. It reveals a new two-part Santa-themed puzzle each day for the first 25 days in December. The puzzles are algorithmically slanted, and the second part is only revealed after you've solved the first part. The second part often requires you to refactor your code, or deal with growing computational costs.

I've long been fascinated with Python's iteration tools, so Day 16: Dragon Checksum was especially fun.

Here's an adapted version of the first part of the directions:

You'll need to use a modified dragon curve. Start with an appropriate initial string of 0's and 1's. Then, for as long as you need, repeat the following steps:

  • Call the data you have at this point, A.
  • Make a copy of A; call this copy B.
  • Reverse the order of the characters in B.
  • In B, replace all instances of 0 with 1 and all 1's with 0.
  • The resulting data is A, then a single 0, then B.

For example, after a single step of this process,

  • 1 becomes 100.
  • 0 becomes 001.
  • 11111 becomes 11111000000.
  • 111100001010 becomes 1111000010100101011110000.

We have a few options for how to produce these strings. My first version took an initial seed, and a number of steps to iterate:

ZERO_ONE = str.maketrans("01", "10")

def reverse01(s):
    """Reverse a string, and swap 0 and 1."""
    return s.translate(ZERO_ONE)[::-1]

def dragon_iterative(seed, steps):
    d = seed
    for _ in range(steps):
        d = d + "0" + reverse01(d)
    return d

(BTW, I also wrote tests as I went, but I'll omit those for brevity. The truly curious I'm sure can find the full code on GitHub.) This is a simple iterative function.

The problem statement sounds like it would lend itself well to recursion, so let's try that too:

def dragon_recursive(seed, steps):
    if steps == 0:
        return seed
    else:
        d = dragon_recursive(seed, steps-1)
        return d + "0" + reverse01(d)

Both of these functions have the same downside: they produce complete strings. One thing I know about Advent of Code is that they love to give you problems that can be brute-forced, but then turn up the dials high enough that you need a cleverer algorithm.

I don't know if this will be needed, but let's try writing a recursive generator that doesn't create the entire string before returning. This was tricky to write. In addition to the seed and the steps, we'll track whether we are going forward (for the first half of a step), or backward for the second half:

def dragon_gen(seed, steps, reverse=False):
    if reverse:
        if steps == 0:
            yield from reverse01(seed)
        else:
            yield from dragon_gen(seed, steps-1, reverse=not reverse)
            yield "1"
            yield from dragon_gen(seed, steps-1, reverse=reverse)
    else:
        if steps == 0:
            yield from seed
        else:
            yield from dragon_gen(seed, steps-1, reverse=reverse)
            yield "0"
            yield from dragon_gen(seed, steps-1, reverse=not reverse)

If you are still using Python 2, the "yield from" may be new to you: it yields all the values from an iterable. This function works, but feels unwieldy. There may be a way to fold the similar lines together more nicely, but maybe not.

In any case, all of these functions still have a common problem: they require the caller to specify the number of steps to execute. The actual Advent of Code problem instead tells us how many characters of result we need. There's a simple way to calculate how many steps to run based on the length of the seed and the desired length of the result. But more interesting is an infinite dragon generator:

def dragon_infinite(seed):
    """Generate characters of dragon forever."""
    yield from seed
    for steps in itertools.count():
        yield "0"
        yield from dragon_gen(seed, steps, reverse=True)

This relies on the fact that the first part of an N-step dragon string is the N-1-step dragon string. Each time around the for-loop, we've produced a dragon string for a particular number of steps. To extend it, we just have to output the "0", and then a reverse version of the string we've already produced. It's a little surprising that this only calls dragon_gen with reverse=True, but discovering that is part of the fun of these sorts of exercises.

Now we can write dragon_finite, to give us a result string of a desired length:

def dragon_finite(seed, length):
    return "".join(itertools.islice(dragon_infinite(seed), length))

The second part of the puzzle involved a checksum over the result, which I wrote in a simple way. It meant that although I had written generators which could produce the dragon string a character at a time, I was using them to create a complete string before computing the checksum. I could have continued down this path and written a checksum function that didn't need a complete string, but this is as far as I got.

One other implementation I didn't tackle: a function that could produce the Nth character in the dragon string, given the seed and N, but without generating the entire sequence.

Who Tests What

Saturday 10 December 2016

The next big feature for coverage.py is what I informally call "Who Tests What." People want a way to know more than just what lines were covered by the tests, but also, which tests covered which lines.

This idea/request is not new: it was first suggested over four years ago as issue 170, and two other issues (#185 and #311) have been closed as duplicates. It's a big job, but people keep asking for it, so maybe it's time.

There are a number of challenges. I'll explain them here, and lay out some options and questions. If you have opinions, answers, or energy to help, get in touch.

First, it's important to understand that coverage.py works in two main phases, with an optional phase in the middle:

  • The first phase is measurement, where your test suite runs. Coverage.py notes which code was executed, and collects that information in memory. At the end of the run, that data is written to a file.
  • If you are combining data from a number of test runs, perhaps for multiple versions of Python, then there's an optional combination phase. Multiple coverage data files are combined into one data file.
  • The reporting phase is where your project is analyzed to understand what code could have run, and the data files are read to understand what code was run. The difference between the two is the code that did not run. That information is reported in some useful way: HTML, XML, or textually.

OK, let's talk about what has to be done...

Measurement

The measurement phase has to collect and record the data about what ran.

What is Who?

At the heart of "Who Tests What" is the Who. Usually people want to know what tests run each line of code, so during measurement we need to figure out what test is being run.

I can see two ways to identify the test being run: either coverage.py figures it out by examining function names being run for "test_*" patterns, or the test runner tells coverage.py when each test starts.

But I think the fully general way to approach Who Tests What is to not assume that Who means "which test." There are other uses for this feature, so instead of hard-coding it to "test", I'm thinking in terms of the more general concept of "context." Often, the context would be "the current test," but maybe you're only interested in "Python version", or "subsystem," or "unit/integration/load."

So the question is, how to know when contexts begin and end? Clearly with this general an idea, coverage.py can't know. Coverage.py already has a plugin mechanism, so it seems like we should allow a plugin to determine the boundaries of contexts. Coverage.py can provide a plugin implementation that suffices for most people.

A context will be a string, and each different context will have its own collected coverage data. In the discussion on issue 170, you can see people suggesting that we collect an entire stack trace for each line executed. This seems to me to be enormously more bulky to collect, more difficult to make use of, and ultimately not as flexible as simply noting a string context.

There might be interesting things you can glean from that compendium of stack traces. I'd like to hear from you if you have ideas of things to do with stack traces that you can't do with contexts.

Another minor point: what should be done with code executed before any context is established? I guess a None context would be good enough.

Storing data

Having multiple contexts will multiply the amount of data to be stored. It's hard to guess how much more, since that will depend on how overlapping your contexts are. My crude first guess is that large projects would have roughly C/4 times more data, where C is the number of contexts. If you have 500 tests in your test suite, you might need to manage 100 to 200 times more data, which could be a real problem.

Recording the data on disk isn't a show-stopper, but keeping the data in memory might be. Today coverage.py keeps everything in memory until the end of the process, then writes it all to disk. Q: Will we need something more sophisticated? Can we punt on that problem until later?

The data in memory is something like a dictionary of ints. There are much more compact ways to record line numbers. Is it worth it? Recording pairs of line numbers (for branch coverage) is more complicated to compact (see Speeding up coverage data storage for one experiment on this). Eventually, we might get to counting the number of times a line is executed, rather than just a yes/no, which again would complicate things. Q: Is it important to try to conserve memory?
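
To give a feel for what "much more compact" could mean, here's a toy comparison; it is purely illustrative and not coverage.py's real data structures:

# Executed line numbers roughly as they're kept today: a dict keyed by
# line number (the values aren't interesting).
lines_as_dict = {17: None, 18: None, 23: None, 99: None}

def to_bitmap(line_numbers):
    """Pack line numbers into bytes, one bit per possible line."""
    bitmap = bytearray(max(line_numbers) // 8 + 1)
    for n in line_numbers:
        bitmap[n // 8] |= 1 << (n % 8)
    return bytes(bitmap)

packed = to_bitmap(lines_as_dict)
assert all(packed[n // 8] & (1 << (n % 8)) for n in lines_as_dict)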

Today, the .coverage data files are basically JSON. This much data might need a different format. Q: Is it time for a SQLite data file?

Combining

The combine command won't change much, other than properly dealing with the context information that will now be in the data files.

But thinking about combining adds another need for the measurement phase: when running tests, you should be able to specify a context that applies to the entire run. For example, you run your test suite twice, once on Python 2, and again on Python 3. The first run should record that it was a "python2" context, and the second, "python3". Then when the files are combined, they will have the correct context recorded.

This also points up the need for context labels that can indicate nesting, so that we can record that lines were run under Python 2 and also note the test names that ran them. Contexts might look like "python2.test_function_one", for example.
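
One way to picture what the combined, context-labeled data might look like (purely illustrative; no file format or API is implied):

# Context label -> file name -> set of executed line numbers.
combined_data = {
    "python2.test_function_one": {"mycode.py": {1, 2, 3, 17}},
    "python3.test_function_one": {"mycode.py": {1, 2, 3, 17, 18}},
    "python3.test_function_two": {"mycode.py": {1, 2, 5}},
}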

Reporting

Reporting is where things get really murky. If I have a test suite with 500 tests, how do I display all the information about those 500 tests? I can't create an HTML report where each line of code is annotated with the names of all the tests that ran it. It's too bulky to write, and far too cluttered to read.

Partly the problem here is that I don't know how people will want to use the data. When someone says, "I want to know which tests covered which lines," are they going to start from a line of code, and want to see which tests ran it? Or will they start from a test, and want to see what lines it ran? Q: How would you use the data?

One possibility is a new command, the opposite of "coverage combine": it would take a large data file, and subset it to write a smaller data file. You could specify a pattern of contexts to include in the output. This would let you slice and dice arbitrarily, and then you can report as you want from the resulting segmented data file. Q: Would this be too clumsy?

Perhaps the thing to do is to provide a SQLite interface. A new "report" would produce a SQLite database with a specified schema. You can then write queries against that database to your heart's content. Q: Is that too low-level? Will it be possible to write a useful report from it?

What's already been done

I started hacking on this context idea a year ago. Coverage.py currently has some support for it. The measurement support is there, and data is collected in memory. I did it to test whether the plugin idea would be fast enough, and it seems to be. If you are interested to see it, search for "wtw" in the code.

The data is not written out to a .coverage data file, and there is zero support for combining, segmenting, or reporting on context data.

How you can help

I'm interested to hear about how you would use this feature. I'm interested to hear ideas for implementation. If you want to help, let me know.
