
Turning 0x32

Saturday 16 June 2012

I was born on this day in 1962, which means, somehow, inexplicably, I am 50 years old.

Yesterday I celebrated with Susan and the boys with a balloon ride in Quechee, VT, a great time. Today is the Special Olympics swim meet, and tomorrow is Father's Day, so the weekend is stuffed full with special events.

50 is a really big number. Not only does it put me into a different decade of life, but a different round fraction of the century. On top of that, statistically speaking, I will not live another 50 years, so I am decidedly in the second half of my life.

But on the other hand, a birthday is a day like any other, only one day older, and all of that. It's hard to know what to make of these milestones. Yesterday I was 49.99726, and now I am 50.00000, which is hardly any difference at all. Tomorrow I will be only 50.00274. But 50 is very different from 20, and those changes come from somewhere. Birthdays are simply when we mark them.

Although I still spend most of my waking hours on a computer, it seems to be more about people and less about the computer. I create more English, and less code. I don't know if this is an extended trend, or just a phase. Luckily I don't have to know, I just do what interests me.

I freelance and enjoy my independence, but sometimes miss the extended shared focus of a cohesive work group. Again, trend or phase? Don't know.

So 50 hasn't brought me wisdom, except to be thoughtful about what matters to me, and to be good to those around me. Perhaps that's all anyone needs to know.

tl;dw: Speedily practical large-scale tests

Monday 11 June 2012

At PyCon 2012, Erik Rose gave a talk entitled, Speedily Practical Large-Scale Tests. This is my textual summary of the talk, one of a series of summaries. Here's the video of Erik:

For those in too much of a hurry to even read the text here, the tl;dr:

Slow test suites are usually slow due to I/O. Erik reduced test fixture I/O by moving fixture setup and teardown to the class instead of the tests, by rearranging classes to reuse fixtures, and by removing fixtures in favor of constructed objects. Two nose plugins, django-nose and nose-progressive, helped Erik improve the usability, maintainability and usefulness of his tests.

This was a long (40 minute) talk, and Erik covered a number of points; I hope I captured them all. In the text that follows, times in square brackets are positions in the video you might want to look at, for example, code samples. Here's (roughly) what Erik said:

What Erik said

I work at Mozilla, we have lots of large web services to keep running, some with 30 billion requests a month. I work on sumo, support.mozilla.com. We take testing seriously, and use a lot of automated testing. Throwing servers at the testing problem isn't interesting, we need to make the tests work better.

First I'll cover how to make the tests fast, then how to keep them maintainable, then how to make them as informative as possible.

Phase 1: Fast

The sumo project has really clean code, but the tests are really slow. It's Django, but the speed tricks here will work for anything.

The tests take 20 minutes on the build server, 5 minutes on a developer's laptop. 5 minutes may not sound like a lot, but it wastes your time, you switch contexts, you lose flow, and so on. Or people just don't run the tests, and you have broken code.

My goal is to run the entire test suite in 1 minute on a local machine.

Where to start? With web app tests with a db, I suspect I/O, it's always the problem. Here's a figure of the memory hierarchy [6:50]:

  • 1 dot == 1ns
  • L1 cache: 1ns
  • L2 cache: 4.7ns
  • RAM: 83ns
  • Disk: 13.7 ms

Disk is really slow, avoid it at all costs. First, profile, guessing is no good. But the Python profiler only shows CPU time, not I/O. The Unix "time" command can give you a view into the split between CPU and I/O.
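As a rough illustration of the CPU-versus-I/O split that "time" reports, the same comparison can be made from inside Python. This is a minimal sketch, not something from the talk: the names cpu_and_wall and waits_on_io are made up here, and time.sleep stands in for a test that is waiting on the disk or the database.

```python
import time

def cpu_and_wall(fn):
    """Run fn() and return (cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()    # CPU time used by this process
    wall_start = time.perf_counter()   # elapsed real time
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall

def waits_on_io():
    # Stand-in for a test that blocks on disk or database I/O.
    time.sleep(0.2)

cpu, wall = cpu_and_wall(waits_on_io)
print("cpu %.3fs, wall %.3fs" % (cpu, wall))
```

A large gap between the two numbers suggests the time is going to waiting rather than computing, which is the case the Python profiler won't show you.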

If you have a db, that's the first place to look. "top" can find the processes with the most cpu, which might also be the ones with the most i/o. Looking at the sumo tests, MySQL was at the top.

"lsof" lists all the open files. [9:30] Give it a process id, it gives you a list of files. Scan through it for anything unexpected. In my case, I found files containing database test fixtures, one was 98,123,728 bytes.

Test fixtures are fake model data created from json files. Here's how it works [10:30]: the test data goes into JSON files. This is an actual sumo test fixture, not a big one, 39 small objects, each equates to a SQL insert statement, so it tends to add up. Then you list the fixture files in your test cases.

But loading these files, even three of them, shouldn't take 4 minutes, so where is that coming from? The trouble became clear when I turned on logging in MySQL. Great technique: log in as the MySQL root user, and "set global general_log=on", and tail the log file.

Doing this showed that fixtures are reloaded separately for each test. Each test begins a transaction, loads the fixture, runs, and rolls back, leaving db in a pristine state, which is tidy but inefficient. Load/rollback/load/rollback, etc. In sumo, this produced 37583 queries, we can do a lot better.

Why not per-class test fixtures, loading and committing the fixtures once for each class? Now each test rolls back to pristine test fixtures. Then the class teardown has to explicitly remove the fixtures, since we can't rollback twice. We run a modified version of Django test-fixture loading apparatus that keeps track of what was added, so we can remove it later. We use truncate because it's faster than "delete *".
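Here is a small sketch of the per-class idea using plain unittest and an in-memory SQLite database. It is not the django-nose implementation: the table, the fixture rows and the LOADS list are invented for illustration, and SQLite has no TRUNCATE statement, so the class teardown uses DELETE.

```python
import sqlite3
import unittest

FIXTURE_ROWS = [(1, "alice"), (2, "bob")]   # made-up fixture data
LOADS = []                                   # records each fixture load

class PerClassFixtureTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # isolation_level=None: we issue BEGIN/ROLLBACK ourselves,
        # so the fixture rows below are committed immediately.
        cls.db = sqlite3.connect(":memory:", isolation_level=None)
        cls.db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        cls.db.executemany("INSERT INTO users VALUES (?, ?)", FIXTURE_ROWS)
        LOADS.append(cls.__name__)

    @classmethod
    def tearDownClass(cls):
        # Committed rows can't be rolled back, so remove them explicitly.
        cls.db.execute("DELETE FROM users")
        cls.db.close()

    def setUp(self):
        self.db.execute("BEGIN")

    def tearDown(self):
        # Each test rolls back to the pristine, already-loaded fixtures.
        self.db.execute("ROLLBACK")

    def count(self):
        return self.db.execute("SELECT COUNT(*) FROM users").fetchone()[0]

    def test_delete_is_rolled_back(self):
        self.db.execute("DELETE FROM users WHERE id = 1")
        self.assertEqual(self.count(), 1)

    def test_sees_pristine_fixture(self):
        self.assertEqual(self.count(), 2)

def run():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PerClassFixtureTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Two tests run, but the fixture is loaded once: after run(), LOADS has a single entry for the class.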

With the stock Django fixture loader, sumo used 37583 queries. With per-class fixtures, it was down to 4116 queries, nine times less traffic to MySQL. In terms of time, stock fixtures was 302 seconds, and per-class fixtures were 97 seconds, at the bounds of tolerability. Another 4 seconds were saved by reusing the database connection instead of Django's style of opening and closing the connection each time.

A minute and a half is a big improvement, just by getting rid of I/O, with no change to the tests at all.

Phase 2: Fixture bundling

Here [14:10] are three actual test cases from sumo. They all use the same set of test fixtures. Imagine how many times those fixtures are loaded: three times, once for each test class. We could combine them by combining the test classes, but I don't want to have my class structure dictated to me by the test fixtures.

We use nose to get more speed. nose is generally great, it lets you skip all the boilerplate from unittest, run profiling, coverage, etc. Plugins give you tremendous power over every phase of running tests: finding and picking tests, dealing with errors, formatting results, and the real dark magic: mucking with the test suite before it runs. We can do better than per-class fixture loading with this last power.

When nose runs your tests, it runs them in alphabetical order. [16:00] The trouble is that consecutive test classes may have very similar test fixtures. Even with class-loaded test fixtures, a class may tear down a fixture only to have the next class re-create it. If we sort our test classes to keep similar fixtures together, then we can add advisory flags to the classes. One indicates the first class in a series with the same fixtures, which sets up the fixtures, and another indicates the last, which will tear them down.

Test independence is preserved, we're just factoring out pointlessly repeated setup. In the future, we could make one more improvement: we've already set up A, B, and C, why tear them down just to set up A, B, C, and D? We should leave A, B, and C, and just set up D. This could be a computational issue, but computes are cheap as long as they save you I/O.

We did this with sumo: before bundling, we have 114 test classes with fixtures, so we did the loading and unloading 114 times, but there were only 11 distinct sets of fixtures. With bundling, we only did it 11 times, reducing the time from 97 seconds to 74 seconds.
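To see why ordering matters, here is a toy model of bundling. It is only a sketch: the class names and fixture files are hypothetical, and it counts fixture set-ups rather than touching a database or nose itself.

```python
from itertools import groupby

# Hypothetical test classes mapped to the fixture files each one lists.
TEST_CLASSES = {
    "TestAnswers":   ("users.json", "questions.json"),
    "TestForums":    ("users.json", "posts.json"),
    "TestQuestions": ("users.json", "questions.json"),
    "TestThreads":   ("users.json", "posts.json"),
    "TestWiki":      ("users.json",),
}

def fixture_key(name):
    """The fixture set for a class, in a canonical order."""
    return sorted(TEST_CLASSES[name])

def count_loads(class_names):
    """Count set-ups when consecutive classes with identical fixtures share one."""
    return sum(1 for _ in groupby(class_names, key=fixture_key))

alphabetical = sorted(TEST_CLASSES)
bundled = sorted(TEST_CLASSES, key=fixture_key)

print(count_loads(alphabetical), count_loads(bundled))
```

In alphabetical order no two neighbors share a fixture set, so every class pays for its own set-up. Sorted by fixture set, the five classes need only three.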

Phase 3: Startup speedups

Sometimes, it isn't the test that's slow, it's the harness itself. Sometimes I want to run a trivial test, it takes, say, 1/10 second. I have to wait 15 seconds for all the databases to set up at the beginning, even though I don't need a new database. It was already valid, from the end of the last test run, so we could have re-used it.

Building on some work we'd already done in this direction, I decided to skip the tear-down of the test db, and also the set-up of the test db on future test runs. This is risky: if you make a schema change, you need to give it a hint, "you should re-initialize here," but it's a tremendous net win. I force a full initialization on the build farm, and maybe before committing, but other than that, I get a fast start-up time. This change took the test runner overhead from 15 seconds to 7 seconds. That brings the total sumo test suite time down from 74 seconds to 62 seconds, within 2 seconds of the magic one-minute run time.

To wrap up:

  • Stock Django: 302 seconds
  • Per-class fixtures: 97 seconds
  • Fixture bundling: 74 seconds
  • Startup speedups: 62 seconds

Now we're saving something like 4 minutes per test run. It may not sound like a big number, but it really adds up. At Mozilla we had a team of 4, and if we conservatively estimate that we each ran the test suite four times a day (which is a gross under-estimate), that's 64 minutes per day, which comes out to 261 hours, or 32 working days a year: we can each take a day off a month!
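The arithmetic can be checked in a few lines. The number of working days per year isn't given in the talk; 245 is my assumption, chosen because it reproduces the 261-hour figure.

```python
people = 4
runs_per_day = 4
minutes_saved_per_run = 4        # roughly 302 seconds down to 62
working_days_per_year = 245      # assumption, not stated in the talk

minutes_per_day = people * runs_per_day * minutes_saved_per_run
hours_per_year = minutes_per_day * working_days_per_year / 60
days_per_year = hours_per_year / 8   # eight-hour working days

print(minutes_per_day, round(hours_per_year), int(days_per_year))
```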

So if you happen to be using Django and you have a lot of fixture-heavy tests, all these optimizations are available as part of the django-nose package.


Shared setup is evil. The unittest module encourages you to create common setup shared by many tests. This is a coupling problem: if your requirements change, you have to change setup, and now you don't know which tests have been invalidated. Especially if you try hard not to repeat yourself, your setup will be shared by many tests.

[21:50] Here we break the setup into individual helpers, which makes it much clearer which tests are using what. This can be more efficient since tests only invoke the helpers they really need, instead of all tests running the full setup. Memoized properties can make the code more readable if you like.
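A sketch of what helpers as memoized properties might look like. This is my guess at the shape, not the code from the slide: the test class and its dictionaries are invented, and functools.cached_property (Python 3.8 and later) stands in for whatever memoizing decorator sumo used.

```python
import unittest
from functools import cached_property

class WikiTest(unittest.TestCase):
    # No setUp: each helper builds only what it names, on first use.
    @cached_property
    def author(self):
        return {"username": "author"}

    @cached_property
    def document(self):
        return {"title": "Test doc", "creator": self.author}

    def test_title(self):
        # Uses document (and, through it, author).
        self.assertEqual(self.document["title"], "Test doc")

    def test_author_only(self):
        # Never builds a document at all.
        self.assertEqual(self.author["username"], "author")

def run_wiki_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WikiTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because unittest makes a fresh instance for every test, the memoized values are never shared between tests.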

[21:30] An example of a test that referred to a specific user from a fixture with a cryptic primary key. It's difficult to read the test and know what it does. Model makers can help with this. Model makers are a pattern, not a library, they are simple to write.

Here's [23:30] an example of a model maker. document() instantiates an ORM object, filling in just enough of it to be a valid object. If you pass other data into it, it will set it. Here we give the document a title, but don't care about the rest. Everything you care about, you set explicitly, the rest you let default. Now your tests are self-documenting.

You can nest these model makers if you make them right. Here's [24:25] a test for sumo's wiki: the revision has an explicit document, but I could omit document if I didn't care which document it referred to. There are no database hits here, and lexically represents the structure of the objects. Here's [25:00] the implementation of the document() model maker, six lines, and this is one of the complicated ones. I got fancy here and put a Unicode character into the title to expose Unicode problems.

The one library-ish thing I did is the @with_save decorator [25:30], which makes it so when you create the object, you can include "save=True" in the arguments, and it will save the object in the database immediately, so you can create and save the object in one line.
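Here is roughly how a model maker and a with_save decorator could fit together. It's a sketch under assumptions: Document is a plain stand-in class rather than a Django model, and the default field values are invented. Only the names document() and with_save, and the save=True behavior, come from the talk.

```python
import functools

class Document:
    """Stand-in for an ORM model; save() just records that it was called."""
    def __init__(self, title, locale, slug):
        self.title, self.locale, self.slug = title, locale, slug
        self.saved = False

    def save(self):
        self.saved = True

def with_save(maker):
    """Let a model maker accept save=True to save the new object immediately."""
    @functools.wraps(maker)
    def wrapper(*args, **kwargs):
        save = kwargs.pop("save", False)
        obj = maker(*args, **kwargs)
        if save:
            obj.save()
        return obj
    return wrapper

@with_save
def document(**kwargs):
    """Build a valid Document, overriding defaults with what the test cares about."""
    # The non-ASCII first letter is there to flush out Unicode problems.
    defaults = {"title": "\u0111ocument", "locale": "en-US", "slug": "test-document"}
    defaults.update(kwargs)
    return Document(**defaults)

doc = document(title="Installing Firefox", save=True)
```

The test states the one thing it cares about, the title, and everything else takes a default.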

Some people have written libraries for model makers. There's a Django-centric one called factory_boy which lets you do this in a more declarative way. I'm up in the air about whether the extra complexity is worth it or not. It tries to look at your models and figure out what a valid instance would look like. Sometimes it guesses wrong.

In summary, shared setup routines make tests:

  • coupled to each other
  • brittle
  • hard to understand
  • lexically far from the test code
  • slow

Local setup gives you:

  • decoupling
  • robustness
  • clarity
  • efficiency
  • freedom to refactor, tests aren't bound to class setup methods.

There are some situations where you don't want model instances at all. Then you use mocking. A mock is used when the real implementation is too performance-intensive, or too complicated to understand. A mock is a self-evidently correct lightweight substitute for more complicated things. We use mocks to test code that operates on those things.

[27:40] Django's HTTP request object: complicated! A mock for it is two lines! It's self-evident what it does, we don't make up arguments to instantiate it. You don't need a library to create this sort of mock.
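I don't have the slide's code, but a hand-rolled mock of this kind might be as small as the sketch below. Both FakeRequest and is_read_only are hypothetical names; the point is only that the stand-in carries the one attribute the code under test reads.

```python
class FakeRequest:
    """A two-line stand-in for an HTTP request object."""
    method = "GET"

def is_read_only(request):
    # Hypothetical code under test: it only looks at request.method.
    return request.method in ("GET", "HEAD")

print(is_read_only(FakeRequest()))
```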

For more elaborate mocks that can record what they've done, or send fake return values that change over time, you can use a library. My rule of thumb: when I need tests for my test infrastructure, I should use someone else's already-tested library.

There are two mocking libraries I really like: mock, and fudge. Mock is very imperative, and fudge is more declarative. Sometimes one feels more natural, sometimes the other one does. I'll show you both.

Mock [28:30]: Here a with block calls patch.object on APIVoterManager to replace the _raw_voters method. I want to replace it with something that is very simple and predictable. This binds the new mocked-out _raw_voters method as "voters", and then I can say, "your return value is such-and-such." Then I can do my test, make one assert, then make a second assert that the method was called. The mock returns the value I want and tracks whether it was called. It's very fast, it doesn't have to run through all the logic in the real _raw_voters method, which is a couple hundred lines, it doesn't have to hit the database, it prevents the brittleness of depending on test data, it gets the sleeps out of your code (because other servers don't need to be started and stopped), it's a huge win all around.
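The pattern Erik describes looks something like this sketch. The class body is invented (only the names APIVoterManager and _raw_voters come from the talk), and it uses unittest.mock, the standard-library version of the mock library.

```python
from unittest import mock

class APIVoterManager:
    def _raw_voters(self):
        # Stand-in for the real method: long, slow, and database-bound.
        raise RuntimeError("would hit the database")

    def voter_names(self):
        return sorted(v["name"] for v in self._raw_voters())

def check_voter_names():
    # Replace _raw_voters for the duration of the with block.
    with mock.patch.object(APIVoterManager, "_raw_voters") as voters:
        voters.return_value = [{"name": "zed"}, {"name": "amy"}]
        names = APIVoterManager().voter_names()
        assert names == ["amy", "zed"]     # the behavior under test
        voters.assert_called_once_with()   # and the mock really was used
    return names
```

When the with block ends, the real _raw_voters is back in place.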

Fudge is the more declarative path, I find it preferable when writing facades. This piece of code [29:45] tests oedipus, a library to put in front of the Sphinx search engine, to make it more Pythonic.

[30:00] At the bottom is the S(Biscuit) call that we want to test. Since it's a facade, all it does is make API calls through to the Sphinx native API. We use fudge to say, these are the API calls I expect my code to make against this interface. You don't see the assertions, fudge takes care of all the asserting for you.

Informative tests

[31:00] How do you make your tests more useful to people as they are running, and after they are running? How do they help you diagnose and debug? I hate plain dots. If I get an error, I get an E, and can't get any information about it until I get more dots. At the very end, we get a pile of messy tracebacks.

[33:00] Wouldn't it be nice if we had more useful output? I put together an alternative test runner called nose-progressive. It works with your existing tests. It shows a progress bar, the name of the current test, and it shows the tracebacks as you go. The tracebacks are more compact, they use relative paths, the procedure names are colored, it omits the test harness stack frames. The best part of all is the editor short-cuts: each source reference in the traceback is an editor invocation you can paste to jump to that file in your editor.


Comments: The zope test runner does many of the things described in nose-progressive. So does twisted.trial.

Q: What about running the tests in parallel? A: Parallelization support in nose needs work. Py.test is better, but we haven't tried it yet. Comment: twisted.trial is good at parallelization.

Q: Any recommendations for integration testing over unit testing? A: Sure, if you have limited resources, do integration testing, since it gives you broader coverage, but it is a blunt tool.

Eval really is dangerous

Wednesday 6 June 2012

Python has an eval() function which evaluates a string of Python code:

assert eval("2 + 3 * len('hello')") == 17

This is very powerful, but is also very dangerous if you accept strings to evaluate from untrusted input. Suppose the string being evaluated is "os.system('rm -rf /')" ? It will really start deleting all the files on your computer. (In the examples that follow, I'll use 'clear' instead of 'rm -rf /' to prevent accidental foot-shootings.)

Some have claimed that you can make eval safe by providing it with no globals. eval() takes a second argument, a dictionary of the global values to use during the evaluation. If you don't provide a globals dictionary, then eval uses the current globals, which is why "os" might be available. If you provide an empty dictionary, then there are no globals. This now raises a NameError, "name 'os' is not defined":

eval("os.system('clear')", {})

But we can still import modules and use them, with the builtin function __import__. This succeeds:

eval("__import__('os').system('clear')", {})

The next attempt to make things safe is to refuse access to the builtins. The reason names like __import__ and open are available to you in Python 2 is because they are in the __builtins__ global. We can explicitly specify that there are no builtins by defining that name as an empty dictionary in our globals. Now this raises a NameError:

eval("__import__('os').system('clear')", {'__builtins__':{}})

Are we safe now? Some say yes, but they are wrong. As a demonstration, running this in CPython will segfault your interpreter:

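s = """
(lambda fc=(
    lambda n: [
        c for c in ().__class__.__bases__[0].__subclasses__()
            if c.__name__ == n
        ][0]
    ):
    fc("function")(fc("code")(0,0,0,0,"KABOOM",(),(),(),"","",0,""),{})()
)()
"""
eval(s, {'__builtins__':{}})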

Let's unpack this beast and see what's going on. At the center we find this:

().__class__.__bases__[0]

which is a fancy way of saying "object". The first base class of a tuple is "object". Remember, we can't simply say "object", since we have no builtins. But we can create objects with literal syntax, and then use attributes from there.

Once we have object, we can get the list of all the subclasses of object:

().__class__.__bases__[0].__subclasses__()

or in other words, a list of all the classes that have been defined to this point in the program. We'll come back to this at the end. If we shorthand this as ALL_CLASSES, then this is a list comprehension that examines all the classes to find one named n:

[c for c in ALL_CLASSES if c.__name__ == n][0]

We'll use this to find classes by name, and because we need to use it twice, we'll create a function for it:

lambda n: [c for c in ALL_CLASSES if c.__name__ == n][0]

But we're in an eval, so we can't use the def statement, or the assignment statement to give this function a name. But default arguments to a function are also a form of assignment, and lambdas can have default arguments. So we put the rest of our code in a lambda function to get the use of the default arguments as an assignment:

(lambda fc=(
    lambda n: [
        c for c in ALL_CLASSES if c.__name__ == n
        ][0]
    ):
    # code goes here...
)()
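If the default-argument trick is unfamiliar, here it is on its own in a harmless form. This sketch is mine, not part of the exploit: the default argument binds the name square inside a single expression, with no builtins available.

```python
# A lambda's default argument gives a helper function a name,
# without using def or an assignment statement.
result = eval(
    "(lambda square=(lambda n: n * n): square(3) + square(4))()",
    {"__builtins__": {}},
)
print(result)
```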

Now that we have our "find class" function fc, what will we do with it? We can make a code object! It isn't easy, you need to provide 12 arguments to the constructor, but most can be given simple default values.

fc("code")(0,0,0,0,"KABOOM",(),(),(),"","",0,"")

The string "KABOOM" is the actual bytecodes to use in the code object, and as you can probably guess, "KABOOM" is not a valid sequence of bytecodes. Actually, any one of these bytecodes would be enough, they are all binary operators that will try to operate on an empty operand stack, which will segfault CPython. "KABOOM" is just more fun, thanks to lvh for it.

This gives us a code object: fc("code") finds the class "code" for us, and then we invoke it with the 12 arguments. You can't invoke a code object directly, but you can create a function with one:

fc("function")(CODE_OBJECT, {})

And of course, once you have a function, you can call it, which will run the code in its code object. In this case, that will execute our bogus bytecodes, which will segfault the CPython interpreter. Here's the dangerous string again, in more compact form:

(lambda fc=(lambda n: [c for c in ().__class__.__bases__[0].__subclasses__()
    if c.__name__ == n][0]): fc("function")(fc("code")(0,0,0,0,"KABOOM",(),
    (),(),"","",0,""),{})())()

So eval is not safe, even if you remove all the globals and the builtins!

We used the list of all subclasses of object here to make a code object and a function. You can of course find other classes and use them. Which classes you can find depends on where the eval() call actually is. In a real program, there will be many classes already created by the time the eval() happens, and all of them will be in our list of ALL_CLASSES. As an example:

s = """
[
    c for c in ().__class__.__bases__[0].__subclasses__()
    if c.__name__ == "Quitter"
][0](0)()
"""

The standard site module defines a class called Quitter, it's what the name "quit" is bound to, so that you can type quit() at the interactive prompt to exit the interpreter. So in eval we simply find Quitter, instantiate it, and call it. This string cleanly exits the Python interpreter.

Of course, in a real system, there will be all sorts of powerful classes lying around that an eval'ed string could instantiate and invoke. There's no end to the havoc that could be caused.

The problem with all of these attempts to protect eval() is that they are blacklists. They explicitly remove things that could be dangerous. That is a losing battle because if there's just one item left off the list, you can attack the system.

While I was poking around on this topic, I stumbled on Python's restricted evaluation mode, which seems to be an attempt to plug some of these holes. Here we try to access the code object for a lambda, and find we aren't allowed to:

>>> eval("(lambda:0).func_code", {'__builtins__':{}})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
RuntimeError: function attributes not accessible in restricted mode

Restricted mode is an explicit attempt to blacklist certain "dangerous" attribute access. It's specifically triggered when executing code if your builtins are not the official builtins. There's a much more detailed explanation and links to other discussion on this topic on Tav's blog. As we've seen, the existing restricted mode isn't enough to prevent mischief.

So, can eval be made safe? Hard to say. At this point, my best guess is that you can't do any harm if you can't use any double underscores, so maybe if you exclude any string with double underscores you are safe. Maybe...
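For concreteness, that guess could be written down like this. It is only a sketch of the idea, under the name cautious_eval that I've made up, and nothing here shows it to be safe: it encodes the guess, not a proof, and it does nothing about expressions that simply burn CPU or memory.

```python
def cautious_eval(source):
    """Refuse any source containing a double underscore, then eval with no builtins.

    This encodes a guess about what is safe, not a guarantee.
    """
    if "__" in source:
        raise ValueError("double underscores are not allowed")
    return eval(source, {"__builtins__": {}})

print(cautious_eval("2 + 3 * 4"))
```

The earlier attacks are all rejected up front, since every one of them leans on a name like __class__ or __import__.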

Update: from a thread on Reddit about recovering cleared globals, a similar snippet that will get you the original builtins:

[
    c for c in ().__class__.__base__.__subclasses__()
    if c.__name__ == 'catch_warnings'
][0]()._module.__builtins__

tl;dw: Stop mocking, start testing

Saturday 2 June 2012

At PyCon 2012, Augie Fackler and Nathaniel Manista gave a talk entitled, Stop Mocking, Start Testing. This is my textual summary of the talk, the first of a series of summaries. You can look at Augie and Nathaniel's slides themselves, or watch the video:

If you not only don't have time to watch the video, but don't even want to read this summary, here's the tl;dr:

Test doubles are useful, but can get out of hand. Use fakes, not mocks. Have one authoritative fake implementation of each service.

Here's (roughly) what Augie and Nathaniel said:

We work on Google Code, which has been a project since July 2006. There are about 50 engineer-years of work on it so far. Median time on the project is 2 years, people rotate in and out, which is usual for Google. Google Code offers svn, hg, git, wiki, issue tracker, download service, offline batch, etc. They started off with a few implementation languages, now there are at least eight.

There are many servers and processes, components, including RPC services, all talking to each other, until finally at the bottom there's persistence. Your code is probably like this too: stateless components, messages sent between components, user data stored statefully at the bottom.

What's been the evolution of the testing process? Standard operating procedure as of 2006: Limited test coverage. We inherited the svn test suite, but it had to be run manually against a preconfigured dev environment, and then the output had to be examined manually! It took all afternoon!

"Tests? We have users to test!" An effective but stressful way to find bugs. Users are not a test infrastructure. Tests that cost more people time than CPU time are bad. A project can't grow this way. If the feature surface area grows linearly, the time spent testing grows quadratically.

Starting to Test (2009): A new crew of engineers rolled onto the project, but they didn't understand the existing code. Policy: tests are required for new and modified code. Untouched code remained untested. The core persistence is changed a lot, so it's well tested, but the layers above might not be, and that untested code would break on deploy. We set up a continuous build server, with a red/green light, though a few engineers are red/green color-blind, so we had to find just the right colors!

We thought we were doing well, adding tests was helping, but the tests were problems themselves. Everyone made their own mock objects. We had N different implementations of a mock. When the real code changed, you had to find all N mocks and update them.

It wasn't just N mocks: even with one mock, it would tell us what we wanted to hear. The mocks do what we said, instead of accurately modeling the real code. Tests would pass, then the product would break on deploy. The mocks had diverged from the real code.

Lessons so far:

  • Share mocks among test modules.
  • Maybe you don't need a mock: if an object is cheap, then don't mock it.
  • If you need a mock, have exactly one well-tested mock.
  • Test the mock against the real implementation.
  • If you don't have time to fully test the mock, at least use Python introspection to confirm that the interfaces are the same. The inspect module has the tools to do this.
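The introspection check in the last bullet might look something like this. It's a sketch with invented classes, not Google's code. It compares only the methods the fake actually implements, so a partial fake passes as long as what it has matches the real thing.

```python
import inspect

class RealIssueService:
    """Stand-in for the expensive real service."""
    def get(self, issue_id):
        raise NotImplementedError("talks to the network")
    def update(self, issue_id, fields, notify=True):
        raise NotImplementedError("talks to the network")
    def search(self, query):
        raise NotImplementedError("talks to the network")

class FakeIssueService:
    """Partial in-memory fake: only what the tests have needed so far."""
    def __init__(self):
        self._issues = {}
    def get(self, issue_id):
        return self._issues[issue_id]
    def update(self, issue_id, fields, notify=True):
        self._issues.setdefault(issue_id, {}).update(fields)

class DriftedFake:
    """A fake whose update() no longer matches the real signature."""
    def update(self, issue_id, fields):
        pass

def public_signatures(cls):
    """Map each public method name on cls to its signature."""
    return {
        name: inspect.signature(func)
        for name, func in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    }

def drifted_methods(real, fake):
    """Names of fake methods that are missing from real or differ in signature."""
    real_api = public_signatures(real)
    return sorted(
        name for name, sig in public_signatures(fake).items()
        if real_api.get(name) != sig
    )

print(drifted_methods(RealIssueService, FakeIssueService))
print(drifted_methods(RealIssueService, DriftedFake))
```

A check like this catches signature drift only. It says nothing about whether the fake behaves like the real service, which is why testing the fake against the real implementation is the better option when there's time.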

We tried to use full Selenium system tests to make up for gaps in unit coverage. Selenium is slow, race conditions creep in, and problems are difficult to diagnose. The system tests weren't a good replacement for unit tests; unit tests give much better information.

We tested user stories with full system tests, this worked much better. Still use system tests, but test the user story, not the edge conditions.

We went through Enlightenment, now we have modern mocking:

  • A common collection of authoritative mocks.
  • The mock collection is narrow, only the things we really need to test.
  • The mock is isolated. No code dependency between the mocks and the real code. Mocks don't inherit from real implementations.
  • Mocks are actually fakes. Lots of terms: mocks, stubs, dummies, fakes, doubles, etc. Fakes are full in-memory implementations of the interface they are mocking.
  • (from a question at the end:) Mocks are works in progress, they only implement what is needed, so strong interface checking wouldn't work to confirm they have the right interface.
  • What gets mocked? Only expensive external services. Everything else is real code.
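To make "fake" concrete: a fake is a working implementation that happens to live in memory. Here's a sketch with a made-up blob storage interface. Neither FakeBlobStore nor archive comes from the talk.

```python
class FakeBlobStore:
    """In-memory fake of a hypothetical blob storage service.

    It doesn't inherit from the real client; it just honors the same interface.
    """
    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]

    def delete(self, key):
        self._blobs.pop(key, None)

def archive(store, key, text):
    """Real code under test: it depends only on the store's interface."""
    store.put(key, text.encode("utf-8"))
    return len(store.get(key))

store = FakeBlobStore()
size = archive(store, "notes", "h\u00e9llo")
```

Unlike a mock that returns whatever the test told it to, the fake stores and retrieves real data, so the code above it runs for real.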

Testing today: Tests are written to the interface, not the implementation. When writing tests ask yourself, "how much could the implementation change, and not have to change the test?" Running against mocks in CI makes the tests go faster, and reduces cycles.

We used to do bad things:

  • use a framework to inject dependencies.
  • use a framework to create mock objects.
  • have constructors automatically create resources if they were not passed in.
  • twist code into weird shapes to make it work.

Now we do good things:

  • Object dependencies are required constructor params. Implicit params are bad, because it's hard to track all those implicit connections. If you forget a required parameter, it's very apparent. If object A doesn't make sense without object B, then don't default it.
  • Separate state from behavior. (code sample at 22:30 in the video) An instance method that reads an attribute, performs calculations on it, and assigns it back to an attribute. The calculation in the middle can be pulled into a pure function, and the method can change to self.b = pure_fn(self.a).
  • Classes shrink before your eyes. Functional programming is very testable.
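The state-versus-behavior split from the list above, sketched with invented names (the talk's own sample is in the video): the calculation moves into a pure function, and the method shrinks to the self.b = pure_fn(self.a) shape.

```python
def normalized(labels):
    """Pure function: no self, no I/O, so it can be tested with plain values."""
    cleaned = {label.strip().lower() for label in labels}
    return sorted(label for label in cleaned if label)

class Issue:
    def __init__(self, raw_labels):
        self.raw_labels = raw_labels
        self.labels = []

    def update_labels(self):
        # The method only moves state around; the logic lives in normalized().
        self.labels = normalized(self.raw_labels)

issue = Issue([" Bug", "bug ", "", "UI"])
issue.update_labels()
```

Testing normalized() needs no Issue at all, and the method left on the class is now too simple to hide a bug.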

Define clear interfaces between components. If you can't figure out how to write a test, it's a code smell, you need to think more about the product code.


Saturday 2 June 2012

One of the common dynamics at PyCon is that you can't see all the talks you want to. There are five tracks going at once, and often, you find yourself in a face-to-face conversation with a real person, and it seems a shame to walk away from an IRL interaction to watch a talk that will be available later on video.

So you tell yourself, "I'll watch it later on video," but if you're like me, you rarely go back to watch the video. There's something about 30-minute videos that is hard to sit still for. My mind (and mouse finger) wanders, and I realize I've missed something, and I don't get out of it what I thought I would.

But I want to hear the talk! So both to keep myself focused on the talk, and to maybe give others another way to "hear the talk," I'm going to watch some videos and write blog posts summarizing each one. I'm calling it the "too long; didn't watch" series, and I'm starting with the six testing talks at PyCon.

When I give talks, I like to write them out in prose, partly as part of my preparation, and partly so the content is available textually. This will be kind of like that, except the text will be sketchier, and they aren't my words. I know I appreciate having text instead of video, so this is my contribution to the text-loving world.
