tl;dw: Speedily practical large-scale tests

Monday 11 June 2012

At PyCon 2012, Erik Rose gave a talk entitled, Speedily Practical Large-Scale Tests. This is my textual summary of the talk, one of a series of summaries. Here’s the video of Erik:

For those in too much of a hurry to even read the text here, the tl;dr:

Slow test suites are usually slow due to I/O. Erik reduced test fixture I/O by moving fixture setup and teardown to the class instead of the tests, by rearranging classes to reuse fixtures, and by removing fixtures in favor of constructed objects. Two nose plugins, django-nose and nose-progressive, helped Erik improve the usability, maintainability and usefulness of his tests.

This was a long (40-minute) talk, and Erik covered a number of points; I hope I’ve captured them all. In the text that follows, times in square brackets are positions in the video you might want to look at, for example, code samples. Here’s (roughly) what Erik said:

What Erik said

I work at Mozilla, we have lots of large web services to keep running, some with 30 billion requests a month. I work on sumo, support.mozilla.com. We take testing seriously, and use a lot of automated testing. Throwing servers at the testing problem isn’t interesting, we need to make the tests work better.

First I’ll cover how to make the tests fast, then how to keep them maintainable, then how to make them as informative as possible.

Phase 1: Fast

The sumo project has really clean code, but the tests are really slow. It’s Django, but the speed tricks here will work for anything.

The tests take 20 minutes on the build server, 5 minutes on a developer’s laptop. 5 minutes may not sound like a lot, but it wastes your time, you switch contexts, you lose flow, and so on. Or people just don’t run the tests, and you have broken code.

My goal is to run the entire test suite in 1 minute on a local machine.

Where to start? With web app tests with a db, I suspect I/O, it’s always the problem. Here’s a figure of the memory hierarchy [6:50]:

  • 1 dot == 1ns
  • L1 cache: 1ns
  • L2 cache: 4.7ns
  • RAM: 83ns
  • Disk: 13.7ms

Disk is really slow; avoid it at all costs. First, profile: guessing is no good. But the Python profiler only shows CPU time, not I/O. The Unix “time” command can give you a view into the split between CPU and I/O.

If you have a db, that’s the first place to look. “top” can find the processes using the most CPU, which might also be the ones doing the most I/O. Looking at the sumo tests, MySQL was at the top.

“lsof” lists all the open files. [9:30] Give it a process id, it gives you a list of files. Scan through it for anything unexpected. In my case, I found files containing database test fixtures, one was 98,123,728 bytes.

Test fixtures are fake model data created from JSON files. Here’s how it works [10:30]: the test data goes into JSON files. This is an actual sumo test fixture, not a big one: 39 small objects, each equating to a SQL insert statement, so it adds up. Then you list the fixture files in your test cases.
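
For illustration, here’s a minimal sketch of that pattern. The class and fixture file names are made up, but the fixtures attribute is the standard Django mechanism Erik is describing:

    from django.test import TestCase

    class QuestionVoteTests(TestCase):
        # Django loads each listed JSON file before every test method
        # and rolls the database back afterward.
        fixtures = ['users.json', 'questions.json']

        def test_asker_can_vote(self):
            pass  # test body elided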

But loading these files, even three of them, shouldn’t take 4 minutes, so where is that coming from? The trouble became clear when I turned on logging in MySQL. Great technique: log in as the MySQL root user, run “set global general_log=on”, and tail the log file.

Doing this showed that fixtures are reloaded separately for each test. Each test begins a transaction, loads the fixtures, runs, and rolls back, leaving the db in a pristine state, which is tidy but inefficient. Load/rollback/load/rollback, etc. In sumo, this produced 37,583 queries; we can do a lot better.

Why not per-class test fixtures, loading and committing the fixtures once for each class? Then each test rolls back to the pristine, committed fixtures. The class teardown has to remove the fixtures explicitly, since we can’t roll back twice. We run a modified version of Django’s test-fixture loading apparatus that keeps track of what was added, so we can remove it later. We use TRUNCATE because it’s faster than DELETE.
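
A rough sketch of the per-class idea, using unittest’s class-level hooks and the Django of that era; Erik’s real implementation (the polished version ended up in django-nose) modifies Django’s fixture loader to track exactly what it inserted. The fixture file and table names here are placeholders:

    from django.core.management import call_command
    from django.db import connection
    from django.test import TestCase

    class DocumentTests(TestCase):

        @classmethod
        def setUpClass(cls):
            super(DocumentTests, cls).setUpClass()
            # Load and commit the fixtures once for the whole class.
            call_command('loaddata', 'documents.json')

        @classmethod
        def tearDownClass(cls):
            # Committed rows can't be rolled back, so remove them
            # explicitly; TRUNCATE is faster than DELETE.
            connection.cursor().execute('TRUNCATE TABLE wiki_document')
            super(DocumentTests, cls).tearDownClass()

        def test_title(self):
            # Each test still runs in a transaction and rolls back to
            # the committed fixture state, so tests stay independent.
            pass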

With the stock Django fixture loader, sumo issued 37,583 queries. With per-class fixtures, that dropped to 4,116 queries, about a ninth of the traffic to MySQL. In terms of time, stock fixtures took 302 seconds and per-class fixtures took 97 seconds, at the bounds of tolerability. Another 4 seconds were saved by reusing the database connection instead of Django’s style of opening and closing it each time.

A minute and a half is a big improvement, just by getting rid of I/O, with no change to the tests at all.

Phase 2: Fixture bundling

Here [14:10] are three actual test cases from sumo. They all use the same set of test fixtures. Imagine how many times those fixtures are loaded: three times, once for each test class. We could combine them by combining the test classes, but I don’t want to have my class structure dictated to me by the test fixtures.

We use nose to get more speed. nose is generally great: it lets you skip all the boilerplate from unittest, and run profiling, coverage, and so on. Plugins give you tremendous power over every phase of running tests: finding and picking tests, dealing with errors, formatting results, and the real dark magic, mucking with the test suite before it runs. We can do better than per-class fixture loading with this last power.

When nose runs your tests, it runs them in alphabetical order. [16:00] The trouble is that consecutive test classes may have very similar test fixtures. Even with class-loaded test fixtures, a class may tear down a fixture only to have the next class re-create it. If we sort our test classes to keep similar fixtures together, then we can add advisory flags to the classes. One indicates the first class in a series with the same fixtures, which sets up the fixtures, and another indicates the last, which will tear them down.
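
Conceptually, the bundling step looks something like the sketch below. This is not the real plugin (that lives in django-nose as the fixture-bundling plugin), and the flag attribute names are made up; it just shows the sort-and-flag idea:

    from itertools import groupby

    def bundle(test_classes):
        """Sort classes so identical fixture sets are adjacent, then flag
        the first class in each run to set fixtures up and the last one
        to tear them down."""
        key = lambda cls: tuple(sorted(getattr(cls, 'fixtures', ())))
        ordered = sorted(test_classes, key=key)
        for _, group in groupby(ordered, key):
            group = list(group)
            for cls in group:
                cls.should_setup_fixtures = False
                cls.should_teardown_fixtures = False
            group[0].should_setup_fixtures = True     # first: load
            group[-1].should_teardown_fixtures = True  # last: tear down
        return ordered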

Test independence is preserved; we’re just factoring out pointlessly repeated setup. In the future, we could make one more improvement: if we’ve already set up A, B, and C, why tear them down just to set up A, B, C, and D? We should leave A, B, and C in place and just set up D. This could become a computational issue, but computation is cheap as long as it saves you I/O.

We did this with sumo: before bundling, we had 114 test classes with fixtures, so we did the loading and unloading 114 times, even though there were only 11 distinct sets of fixtures. With bundling, we did it only 11 times, reducing the time from 97 seconds to 74 seconds.

Phase 3: Startup speedups

Sometimes it isn’t the tests that are slow, it’s the harness itself. Sometimes I want to run a trivial test that takes, say, 1/10 second, but I have to wait 15 seconds at the beginning for all the test databases to be set up, even though I don’t need a new database: it was already valid at the end of the last test run, so we could have re-used it.

Building on some work we’d already done in this direction, I decided to skip the tear-down of the test db, and also the set-up of the test db on future test runs. This is risky: if you make a schema change, you need to give it a hint, “you should re-initialize here,” but it’s a tremendous net win. I force a full initialization on the build farm, and maybe before committing, but other than that, I get a fast start-up time. This change took the test runner overhead from 15 seconds to 7 seconds. That brings the total sumo test suite time down from 74 seconds to 62 seconds, within 2 seconds of the magic one-minute run time.

To wrap up:

  • Stock Django: 302 seconds
  • Per-class fixtures: 97 seconds
  • Fixture bundling: 74 seconds
  • Startup speedups: 62 seconds

Now we’re saving something like 4 minutes per test run. It may not sound like a big number, but it really adds up. At Mozilla we had a team of 4, and if we conservatively estimate that we each ran the test suite four times a day (which is a gross under-estimate), that’s 64 minutes per day, which comes out to 261 hours, or 32 working days a year: we can each take a day off a month!

So if you happen to be using Django and you have a lot of fixture-heavy tests, all these optimizations are available as part of the django-nose package.
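
Wiring it up is roughly a one-line settings change (per django-nose’s README; check the current docs for details):

    # settings.py: switch Django to the nose-based test runner.
    INSTALLED_APPS += ('django_nose',)
    TEST_RUNNER = 'django_nose.NoseTestSuiteRunner'

The startup speedup corresponds to running the suite with the REUSE_DB=1 environment variable set (for example, REUSE_DB=1 ./manage.py test), which skips creating and destroying the test database; leave it off, or force a full run on the build farm, whenever the schema changes.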

Maintainability

Shared setup is evil. The unittest module encourages you to create common setup shared by many tests. This is a coupling problem: if your requirements change, you have to change setup, and now you don’t know which tests have been invalidated. Especially if you try hard not to repeat yourself, your setup will be shared by many tests.

[21:50] Here we break the setup into individual helpers, which makes it much clearer which tests are using what. This can be more efficient since tests only invoke the helpers they really need, instead of all tests running the full setup. Memoized properties can make the code more readable if you like.
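
A small sketch of that shape, using the document() model maker Erik describes next; the class, helper, and test names are invented:

    from django.test import TestCase

    class DocumentEditTests(TestCase):

        def _doc(self):
            # Memoized helper: only tests that need a document pay for
            # building one, and they pay only once per test.
            if not hasattr(self, '_cached_doc'):
                self._cached_doc = document(title=u'An Article', save=True)
            return self._cached_doc

        def test_edit_changes_title(self):
            doc = self._doc()   # this test needs a document
            # ... exercise the edit view against doc ...

        def test_listing_with_no_documents(self):
            # This test never calls _doc(), so it skips that I/O entirely.
            pass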

Here’s [21:30] an example of a test that referred to a specific user from a fixture by a cryptic primary key. It’s difficult to read the test and know what it does. Model makers can help with this. Model makers are a pattern, not a library; they are simple to write.

Here’s [23:30] an example of a model maker. document() instantiates an ORM object, filling in just enough of it to be a valid object. If you pass other data into it, it sets that too. Here we give the document a title, but don’t care about the rest. Everything you care about, you set explicitly; the rest you let default. Now your tests are self-documenting.

You can nest these model makers if you build them right. Here’s [24:25] a test for sumo’s wiki: the revision has an explicit document, but I could omit the document if I didn’t care which document it referred to. There are no database hits here, and the code lexically represents the structure of the objects. Here’s [25:00] the implementation of the document() model maker: six lines, and this is one of the complicated ones. I got fancy here and put a Unicode character into the title to expose Unicode problems.

The one library-ish thing I did is the @with_save decorator [25:30]: it lets you include “save=True” in the arguments when you create the object, and the object is saved to the database immediately, so you can create and save it in one line.
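
A sketch of what the pair might look like; this is not Erik’s exact code, and the Document import and default field are placeholders:

    from functools import wraps

    from wiki.models import Document   # hypothetical model import

    def with_save(func):
        """Let a model maker accept save=True and persist the object."""
        @wraps(func)
        def saving_func(*args, **kwargs):
            save = kwargs.pop('save', False)
            obj = func(*args, **kwargs)
            if save:
                obj.save()
            return obj
        return saving_func

    @with_save
    def document(**kwargs):
        """Return a valid Document, defaulting anything not passed in."""
        defaults = {'title': u'đocument title'}   # Unicode on purpose
        defaults.update(kwargs)
        return Document(**defaults)

    # Everything you care about is explicit; the rest just defaults.
    doc = document(title=u'An Important Doc', save=True)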

Some people have written libraries for model makers. There’s a Django-centric one called factory_boy which lets you do this in a more declarative way. I’m up in the air about whether the extra complexity is worth it or not. It tries to look at your models and figure out what a valid instance would look like. Sometimes it guesses wrong.
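
For comparison, a factory_boy declaration looks roughly like this (current factory_boy API, which has changed since 2012; the model is the same hypothetical Document as above):

    import factory

    from wiki.models import Document   # hypothetical model import

    class DocumentFactory(factory.django.DjangoModelFactory):
        class Meta:
            model = Document

        # factory_boy generates a distinct title for each instance.
        title = factory.Sequence(lambda n: u'Document %d' % n)

    # DocumentFactory() creates and saves; DocumentFactory.build() doesn't.
    doc = DocumentFactory(title=u'An Important Doc')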

In summary, shared setup routines make tests:

  • coupled to each other
  • brittle
  • hard to understand
  • lexically far from the test code
  • slow

Local setup gives you:

  • decoupling
  • robustness
  • clarity
  • efficiency
  • freedom to refactor; tests aren’t bound to class setup methods.

There are some situations where you don’t want model instances at all. Then you use mocking. A mock is used when the real implementation is too performance-intensive, or too complicated to understand. A mock is a self-evidently correct lightweight substitute for more complicated things. We use mocks to test code that operates on those things.

[27:40] Django’s HTTP request object: complicated! A mock for it is two lines! It’s self-evident what it does; we don’t have to make up arguments to instantiate it. You don’t need a library to create this sort of mock.
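
The shape is roughly this; the particular attributes are just examples, chosen to be whatever the code under test actually reads:

    class FakeHTTPRequest(object):
        # A self-evidently correct stand-in for django.http.HttpRequest.
        method = 'GET'
        locale = u'en-US'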

For more elaborate mocks that can record what they’ve done, or return fake values that change over time, you can use a library. My rule of thumb: when I need tests for my test infrastructure, I should use someone else’s already-tested library.

There are two mocking libraries I really like: mock, and fudge. Mock is very imperative, and fudge is more declarative. Sometimes one feels more natural, sometimes the other one does. I’ll show you both.

Mock [28:30]: Here a with block calls patch.object on APIVoterManager to replace the _raw_voters method with something very simple and predictable. It binds the new mocked-out _raw_voters method as “voters”, and then I can say, “your return value is such-and-such.” Then I do my test, make one assert, and make a second assert that the method was called. The mock returns the value I want and tracks whether it was called. It’s very fast: it doesn’t run through all the logic in the real _raw_voters method, which is a couple hundred lines; it doesn’t hit the database; it avoids the brittleness of depending on test data; and it gets the sleeps out of your code, because other servers don’t need to be started and stopped. It’s a huge win all around.
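
A sketch of that pattern with the mock library; APIVoterManager and _raw_voters are from the talk, while vote_count() and the surrounding test are invented stand-ins for the code under test:

    from mock import patch
    from nose.tools import eq_

    def test_num_voters():
        # Replace the expensive _raw_voters() with a canned return value.
        with patch.object(APIVoterManager, '_raw_voters') as raw_voters:
            raw_voters.return_value = [{'id': 3}, {'id': 7}]
            eq_(vote_count(), 2)         # assert on the behavior...
            assert raw_voters.called     # ...and that the mock was used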

Fudge is the more declarative path; I find it preferable when writing facades. This piece of code [29:45] tests oedipus, a library that sits in front of the Sphinx search engine to make it more Pythonic.

[30:00] At the bottom is the S(Biscuit) call that we want to test. Since it’s a facade, all it does is make API calls through to the native Sphinx API. We use fudge to say: these are the API calls I expect my code to make against this interface. You don’t see the assertions; fudge takes care of all the asserting for you.
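
In fudge that style looks roughly like the sketch below. The dotted patch path, the Sphinx method names, and run_the_search() are all illustrative, not oedipus’s real test code:

    import fudge

    @fudge.patch('oedipus.sphinxapi.SphinxClient')
    def test_search_api_calls(SphinxClient):
        # Declare the calls the facade is expected to make; fudge asserts
        # them for you when the patched test finishes.
        (SphinxClient.expects_call().returns_fake()
                     .expects('SetFilter').with_args('category', [1])
                     .expects('RunQueries').returns([]))
        run_the_search()   # made-up driver for the S(Biscuit) query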

Informative tests

[31:00] How do you make your tests more useful to people as they are running, and after they are running? How do they help you diagnose and debug? I hate plain dots. If I get an error, I get an E, and I can’t get any information about it until the end of the run, when we get a pile of messy tracebacks.

[33:00] Wouldn’t it be nice if we had more useful output? I put together an alternative test runner called nose-progressive. It works with your existing tests. It shows a progress bar, the name of the current test, and the tracebacks as you go. The tracebacks are more compact: they use relative paths, the procedure names are colored, and the test-harness stack frames are omitted. The best part of all is the editor shortcuts: each source reference in the traceback is an editor invocation you can paste to jump to that file in your editor.
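
If you’re already on django-nose, enabling it is roughly a one-line setting (flag name per nose-progressive’s README; otherwise pass the same flag to nosetests directly):

    # settings.py: have django-nose pass the plugin flag through to nose.
    NOSE_ARGS = ['--with-progressive']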

Questions?

Comments: The zope test runner does many of the things described in nose-progressive. So does twisted.trial.

Q: What about running the tests in parallel? A: Parallelization support in nose needs work. Py.test is better, but we haven’t tried it yet. Comment: twisted.trial is good at parallelization.

Q: Any recommendations for integration testing over unit testing? A: Sure, if you have limited resources, do integration testing, since it gives you broader coverage, but is a blunt tool.

Comments

Wolfgang Schnerring 12:05 AM on 12 Jun 2012
"In the future, we could make one more improvement: we've already set up A, B, and C, why tear them down just to set up A, B, C, and D? We should leave A, B, and C, and just set up D."

FWIW, zope.testrunner implements shared setup including this kind of stacking of setups, via so called "test layers". You write a layer object that has its own setUp/tearDown, and your TestCases say "layer = MY_LAYER". The layer's setUp is called (once!) before any of the corresponding TestCases are run, and the tearDown after.
(See http://pypi.python.org/pypi/plone.testing#layers for a more detailed description of layers).

This is a really, really powerful feature, on the one hand for speed, since it allows you to run expensive setup code only once, but moreover for organizing your test code, since you can split your setup code into separate concerns, put each of those in a separate layer, and then combine them at will.

I've been thinking for quite a while that I'd like to implement test layers as a nose plugin, now I'm just waiting to get enough round toits to do it. ;)
