Finding test coupling

Thursday 22 December 2016

Before we get started: this is a story about a problem I had and how I solved it. This retelling is leaving out lots of small false trails and hard learnings, which I will summarize at the end. I report these stories not to lecture from on high, but to share with peers, help people learn, and ideally, elicit teachings from others so that I can do it better next time. The main qualities I am demonstrating here are not intelligence and experience, but perseverance, patience, and optimism.

OK, on with the story:

Running our large test suite the other day, we got a test failure. It seemed unrelated to the changes we were making, but you can never be sure, so we investigated. Along the way I used a few techniques to narrow down, widen, and identify suspects.

Running just that one test passed, but running the whole test suite, it failed, and this behavior was repeatable. So we had some kind of coupling between tests. Ideally, all tests would be isolated from each other. Perfect test isolation would mean that no matter what order you ran tests, and no matter what subset of tests you ran, the results would be the same. Clearly we did not have perfect test isolation.

The job now was to find the test we were coupled with, or perhaps one of the many possible tests that we were coupled with.

The test failure itself was a UnicodeError while trying to log a warning message involving a username with a non-ASCII character in it. Apparently this is something that doesn’t work well: when warnings are routed through the logging system, if the message is actually logged, and the message has a non-ASCII Unicode string, an exception will happen. That’s unfortunate, but we’ll have to live with that for the moment.

Our best guess at the moment is that when the test passes, it’s because either the warnings settings, or the logging settings, are deciding not to log the warning. When the test fails, it’s because some previous test has changed one (or both!) of those settings, causing the message to proceed all the way through the warnings/logging pipeline, to the point of producing the UnicodeError. This is a plausible theory because those settings are global to the process, and would be easy to change without realizing the consequences for test suites.

But we still have to find that test. Here’s the command that runs just the one test, that failed:

python ./ lms test --verbosity=2 --with-id --settings=test \
    --xunitmp-file=/edx/app/edxapp/edx-platform/reports/lms/nosetests.xml \
    --with-database-isolation \

This is the Django test runner, using nose. That last line selects one particular test method in one particular class in one specific test file. To try to find a failing combination, we’ll widen the scope of our test run by peeling off trailing components. This will give us progressively more tests in the run, and eventually (we hope), the test will fail:



This last one finally failed, with 1810 tests. That’s still too many to examine manually. We can run those tests again, with nose-randomly to randomize the order of the tests. This gives us an opportunity to run experiments where the randomization can tell us something about coupling. If we run the 1810 tests, and our failing test doesn’t fail, then none of the tests that ran before it were the one that cause the problem. If the test does fail, then the tests that ran before it might be bad.

I used a bash loop to run those 1810 test over and over, capturing the output in separate result files:

export FLAGS=blah blah omitted for brevity

for i in $(seq 9999); do
    echo --- $i
    python ./ lms test -v $FLAGS openedx/core/djangoapps -v > test$i.txt 2>&1

Overnight, this gave me 72 test result files to look at. The -v and --with-id flags gave us output that looked like this:

... lots of throat-clearing here ...
Synchronizing apps without migrations:
  Creating tables...
    Creating table coursewarehistoryextended_studentmodulehistoryextended
    Running deferred SQL...
  Installing custom SQL...
Running migrations:
  No migrations to apply.
Using --randomly-seed=1482407901
#19 test_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#20 test_configuration_platform_name (openedx.core.djangoapps.site_configuration.tests.test_context_processors.ContextProcessorTests) ... ok
#21 test_get_value (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#22 test_get_value_for_org (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#23 test_get_dict (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
#24 test_get_value_for_org_2 (openedx.core.djangoapps.site_configuration.tests.test_helpers.TestHelpers) ... ok
... much more ...

A small Python program provided the analysis: (Warning: this is for Python 3.6, so f-strings ahead!)

Although I had 72 runs, the results converged after 11 runs: 179 tests were in the maybe-bad set, and more runs didn’t reduce the set. That’s because of nose-randomly’s behavior, which I didn’t fully understand: it doesn’t completely shuffle the tests. Because of the possibility of module-level and class-level setup code, it randomizes within those scopes, but will not intermix between scopes. The test modules are run in a random order, but everything in one module will always run contiguously. The classes within a module will run in a random order, but all of the methods within a class will run contiguously.

The list of classes that provided made clear what was going on: all of the maybe-bad tests were in credit/tests/ There are 179 tests in that file, and something in there is causing our test failure. Because they always run contiguously, there’s no way nose-randomly could give us more information about the true culprit.

Time for some low-tech divide and conquer: we’ll run one class from, and then our failing test. If we do that once for each class in, we should get information about which class to examine. I’d love to tell you I had some clever way to get the list of test classes, but I just manually picked them out of the file and wrote this loop:

export FAILING_TEST=openedx/core/djangoapps/external_auth/tests/ 

for c in CreditCourseViewSetTests CreditEligibilityViewTests CreditProviderCallbackViewTests CreditProviderRequestCreateViewTests CreditProviderViewSetTests; do
    echo ------------- $c
    python ./ lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/$c \
        $FAILING_TEST 2>&1 | tee ${c}_out | grep unicode_email

(My bash-looping skillz were improving each time!) This showed me that three of the five classes were failing. These classes use mixins, and the thing the three class had in common was the AuthMixin, which provides four test methods. So it’s probably one of those methods. I picked the first of the test classes, and ran a new experiment four times, once for each of the four test methods:

for t in test_authentication_required test_oauth test_session_auth test_jwt_auth; do
    echo ---------- $t
    python ./ lms test -v $FLAGS \
        openedx/core/djangoapps/credit/tests/$t \
        $FAILING_TEST 2>&1 | tee ${t}_out | grep unicode_email

And this showed that test_jwt_auth was the problem! Now I had a two-test scenario that would produce the failure.

To find the line in the test, I could comment-out or otherwise neuter parts of the test method and run my fast two-test scenario. The cause was a JWT authorization header in a test client get() function call. JWT-related code is scarce enough in our huge code base, that I could identify a likely related function, place a pudb breakpoint, and start walking through code until I found the problem: a line deep in a library that changed the warnings settings! (cue the dramatic music)

Commenting out that line, and running my reproducer confirmed that it was the root cause. A simple pull request fixes the problem. Note in the pull request that the library’s test case had a simple mistake that might have been the reason for the bad line to begin with.

It felt really good to find and fix the problem, perhaps because it took so long to find.

As promised: things I didn’t know or did wrong:

  • Our test names have random data in them. I didn’t realize this until my results started showing the sum of known-good and maybe-bad as greater than the total number of tests. Until then, the numbers were skittering all over the place. Once I canonicalized the file names, the numbers converged quickly.
  • I should have understood how nose-randomly worked earlier.
  • I had to (once again) Google the bash loop syntax.
  • I fat-fingered a few bash loops which gave me mysterious and sometimes discouraging false results.
  • A few times, I spent a while just reading through possible culprit tests looking for clues, which was fruitless, since the actual problem line was in a library in a different repo.

We’re all learning. Be careful out there.


Nice! It doesn't look like nose-randomly bought you much, though: could you have run {all currently suspect tests} followed by the failing test, and reduced the set of suspect tests recursively using binary search?
@pozorvlak nose-randomly got me down from 1810 to 179, which isn't bad. If I had tried explicit binary search from the beginning, it might have been faster, yes.
Interesting article. I might be missing something but wouldn't it be possible to do something like (disclaimer : bash code typed on a mobile phone is most probably wrong):

for t in $(list all tests); do
run_test "$t" "$test_that_fails";

(Assuming that it is easy to have the list of test names on the standard output and that it is possible to run two unrelated tests and only these two)
@Sylvain, yes that would have given me an answer too, and if I had to do this again, I might try that. But it might take too long. My two-test case took 20 seconds. The full test suite has something like 6000 tests. That comes out to 33 hours. I'm not sure I would have started that experiment if I couldn't narrow it down first.
> I had to (once again) Google the bash loop syntax.
> I fat-fingered a few bash loops which gave me mysterious and sometimes discouraging false results.

Glad I'm not the only seasoned programmer who has a hard time with Bash. :)
There is another reason for nose-randomly's scoping behavior, but it's a bit subtle.
If you wish to find coupling soon after it is introduced then it is good to always run with randomly.
However then when you do get a failure you need to rerun the suite in the same random order. Randomly prints the random seed it is using and has a command line option for specifying a seed for just this purpose.
Of course with a failure in a large test suite the first thing you want to do is narrow the scope (as you did).
And importantly you want the random ordering for that narrowed scope to be the same as it was for that subset of the suite when the full suite was run.
And the easiest way to achieve that is to generate the random ordering separately at each scope level, which is what nose-randomly does internally.
When I had problems with test interdependencies in the past, I got good results with instrumenting my test runner to take snapshots of global state before and after each test, and comparing them, to warn about tests that change global state in unexpected ways.

I had plugins to take snapshots of logging configuration, the Zope component registry, the size of memory allocations made by libxml2 (that was before lxml and you had to free libxml2's objects explicitly) etc.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.