A server memory leak

Saturday 27 September 2008

We pushed new code to our production servers last week. There were a lot of changes, including our upgrade to Django 1.0. As soon as the servers restarted, they immediately suffered, with Python processes bloated to 2Gb or more memory each. Yikes! We reverted to the old code, and began the process of finding the leak.

These are details on what we (Dave, Peter, and I, mostly them) did to find and fix the problem.

We used Guppy, a very capable Python memory diagnostic tool. It showed that the Python heap was much smaller than the memory footprint of the server process, so the leak seemed to be in memory managed by C extensions.

We identified these C extensions:

We tried to keep these possibilities in mind as we worked through our next steps. PIL and PDFlib in particular seemed likely given how heavily we use them, and because they traffic in large data (high-res images).

We had some unit tests that showed fat memory behavior. We ran valgrind on them hoping they would demonstrate a leak that we could fix. Valgrind is a very heavy-weight tool, requiring re-compiling the Python interpreter to get good results, and even so, we were overwhelmed with data and noise. The tests took long enough to run that other techniques proved more productive.

Our staging server had been running the code for over a week, and showed no ill effects. We tried to reason out what is the important difference between the staging server and the production server? We figured the biggest difference is the traffic they each receive. We tried to load up the staging server with traffic. An aggressive test downloading many dynamic PDFs quickly ballooned the memory on the staging server, so we suspected PDFlib as the culprit.

Closely reading the relevant code, we realized we had a memory leak if an exception occurred:

p = PDF_new()
# Lots of stuff, including an exception
PDF_delete(p)   # Not called: leak!

We felt pretty good about finding that, and fixed it up with a lot of unfortunate try/finally clauses. We put the code on our staging server, and it behaved much better. Lots of PDF downloads would still cause the memory to grow, but when the requests were done, it would settle back down again. So we liked the theory that this was the fix. The only flaw in the theory was it didn't provide a reason why our old code was good and our new code was bad. We put the fixed code on the production server: boom, the app server processes ballooned immediately. Apparently as good as this exception fix was for our PDFlib code, it wasn't the real problem.

We tried chopping out functionality to isolate the problem. Certain subsets of URLs were removed from the URL map to remove the traffic from the server. We ran the code for short five-minute bursts to see the behavior under real traffic, and it was no better. To be sure it wasn't still PDFlib somehow, we tried removing PDFlib by raising an exception at the one place in our code where PDF contexts are allocated. Memory still exploded. We tried removing PIL by writing a dummy Image.py that raises exceptions unconditionally. It didn't help.

We tried logging requests and memory footprints, but correlations elusive. We tried changing the process architecture to use only one thread per process, no luck.

We tried reverting all the Django 1.0 changes, to move back to the Django version we had been using before. This changed back the Django code, and the adaptations we'd made to that code, but (in theory) left in place all of the feature work and bug fixes we had done.

We pushed that to the servers, and everything performed beautifully, the server processes used reasonable amounts of memory, and didn't grow and shrink. So now we know the leak is either in the Django 1.0 code, or in our botched adaptation to it, or in some combination of the two. Many people are using Django 1.0, so it seemed unlikely to be as simple as a Django leak, so we focused on our Django-intensive code.

Now that we'd narrowed it down to the Django upgrade, how to find it? We went back to the request logs, examining them more closely for any clues. We found one innocuous-seeming URL that appeared near a number of the memory explosions.

We took one app server out of rotation, so that it wasn't serving any live requests. Our nginx load balancer is configured so that a URL parameter can direct a request to a particular app server. We used that to hit the isolated app server once with the suspect request. Sure enough, the process ballooned to 1Gb, and stayed there. Then we killed that process, and did it again. The Python process grew to 1Gb again. Yay! We had a single URL that reproduced the problem!

Now we could review the code that handled that URL, and eyeball everything for suspects. We found this:

def getRecentStories(num=5):
    """ Return num most recent stories. Only public stories are returned.
    stories = Story.objects.published(access=kAccess.public).\
    if num:
        stories = stories[:num]
    return stories

Our @memoize decorator here caches the result of the function, based on its argument values. The result of the function is a QuerySet. Most of the code that calls getRecentStories uses a specific num value, so it returns a QuerySet for a small number of stories, and the caller simply uses that value (for example, in a template context variable).

However, in this case, the getRecentStories function is called like this:

next_story = getRecentStories(0).filter(published_date__lt=the_date)[0]

The QuerySet is left unlimited until after it is filtered by published_date, and then the first story is limited off.

Now we're getting to the heart of one of our mysteries: why was the old Django code good, and the new Django code bad? The Django ORM changed a great deal in 1.0, and one of the changes was in what happened when you pickle a QuerySet.

To cache a QuerySet, you have to pickle it. Django's QuerySets are lazy: they only actually query the database when they need to. For as long as possible, they simply collect up the parameters that define the query. In Django 0.96, pickling a QuerySet didn't force the query to execute, you simply got a pickled version of the query parameters. In Django 1.0, pickling the query causes it to query the database, and the results of the query are part of the pickle.

Looking at how the getRecentStories function is called, you see that it returns a QuerySet for all the public stories in the database, which is then narrowed by the caller first on the published_date, but more importantly, with the [0] slice.

In Django 0.96, the query wasn't executed against the database until the [0] had been applied, meaning the SQL query had a "LIMIT 1" clause added. In Django 1.0, the query is executed when cached, meaning we request a list of all public stories from the database, then cache that result list. Then the caller further filters the query, and executes it again to get just one result.

So in Django 0.96, this code resulted in one query to the database, with a LIMIT 1 clause included, but in Django 1.0, this code resulted in two queries. The first was executed when the result was cached by the @memoize decorator, the second when that result was further refined in the caller. The second query is the same one the old code ran, but the first query is new, and it returns a lot of results because it has no LIMIT clause at all.

The fix to reduce the database query was to split getRecentStories into two functions: one that caches its result, and is used when the result will not be filtered further, and another uncached function to use when it will be filtered:

def getRecentStories(num=5):
    """ Return num most recent stories. Only public stories are returned.
        Use this function if you want to filter the results yourself.
        Otherwise use getCachedRecentStories.
    stories = Story.objects.published(access=kAccess.public).\
    if num:
        stories = stories[:num]
    return stories
def getCachedRecentStories(num=5):
    """ Return num most recent stories. Only public stories are returned.
        If you need to filter the results further, use getRecentStories.
    return list(getRecentStories(num=num))

One last point about the Django change: should we have known this from reading the docs? Neither the QuerySet refactoring notes nor the 1.0 backwards incompatible changes pages mention this change, or address the question of pickled QuerySets directly. Interestingly, an older version of the docs does describe this exact behavior. This changes was explicitly made and discussed, but seems to have been misplaced in the 1.0 doc refactoring. Of course, we may not have realized we had this behavior even if we had read about the change.

So we've found a big difference in the queries made using the old code and the new code. But why the leak? The theory is that MySQLdb has a leak which has been fixed on its trunk. Looking at the MySQLdb code, it's pretty clear that they've been developing for a while since releasing version 1.2.2. Unfortunately, the MySQLdb trunk doesn't work under Django yet, so we can't verify the theory that MySQLdb is the source of the leak.

Ironically, MySQLdb was not on our list of C extensions to look at. If it had been, we might have identified it as the culprit with a Google search. Since the MySQLdb trunk doesn't work under Django, I guess we would have hacked MySQLdb or Django to get them to work together. We would have run leak-free, but would be unknowingly executing the giant database query.

The last mystery: why didn't the problem appear on our staging server? Because it was running with a much smaller database than our production servers, so the "all public stories" query wasn't a big deal. We learned a lesson there: sometimes subtle difference can make all the difference. We need to keep the staging server's database as current as we can to make sure it's replicating the production environment as much as possible. It's impossible to make them identical (for example, the staging server doesn't get traffic from search bots), but at times like this, it's important to understand what all the differences are, and minimize them where you can.


Yoav Shapira 1:46 PM on 27 Sep 2008

Thank you for taking the time to document in detail what you ran into, how you debugged it, and what you found. Such detailed accounts are priceless.

Jeethu Rao 2:39 PM on 27 Sep 2008

I'd been bitten by the QuerySet pickling feature about a week ago. My workaround was to cache the QuerySet's query attribute (qs.query) and do something like the following to restore the queryset from the cached query.

if cached_query is not None :
    qs = model.objects.all()
    qs.query = cached_query
    return qs

Andrew 2:42 PM on 27 Sep 2008

Great post! I highly recommend the with statement for ensuring destructors are called as soon as possible.

Richard Moore 4:16 PM on 27 Sep 2008

I've found that a good way to build test servers is to restore the backup of the production system on to a new machine. This not only ensures you have a data set that matches your live systems, but also lets you test that your backup procedures work correctly.

Michael Abshoff 5:02 PM on 27 Sep 2008

Very nice article indeed. I have done extensive debugging of Python applications under valgrind but some times that does not help to isolate the problem, even with custom valgrind suppression files. Guppy looks very interesting and I was not aware of its existence, so that you for mentioning it.

Alan Fairless 5:05 PM on 27 Sep 2008

What a chase. Thanks for posting. Indeed, guppy (and also winpdb) has become one of my favorite tools lately.

FYI, awhile ago we were trying to solve some memory related segfaults of our desktop SpiderOak backup client, and created some integration code so valgrind can show which Python lines are leaking memory (similar to how gdb does.) It's available here: https://spideroak.com/blog/200805150001

Al 8:41 PM on 27 Sep 2008

I've been bitten by a 'not big enough database' in a staging environment before as well - very frustrating but a lesson well learned. We now make sure our staging environment is not more than a week behind our production servers, just in case something strange has happened!

Ned Batchelder 8:10 AM on 28 Sep 2008

The Django team has updated the docs to include the QuerySet caching issue: http://code.djangoproject.com/changeset/9090

Ned Batchelder 8:12 AM on 28 Sep 2008

@Alan: thanks for the pointer to valgrind integration. I'll definitely be keeping that on hand for the next "impossible" debugging situation.

@Andrew: the with statement is a likely tool for us to use in solving the deallocation problem with PDFlib. We've got it solved now with finally clauses, but we're contemplating cleaning it up with with's.

Carl Meyer 11:47 AM on 28 Sep 2008

Thanks for this very illuminating writeup!

One thing I find curious is why you were bothering to memoize the results of getRecentStories in the first place? Given the old behavior of QuerySets (caching doesn't execute the query), and the fact that getRecentStories doesn't seem to include any expensive operations (just the code time to build up the query parameters), that memoization can't have been addressing any kind of real bottleneck, unless I'm missing something.

Ned Batchelder 11:53 AM on 28 Sep 2008

@Carl, that's a very astute question. The fact is, the actual code was a little more complicated that I showed here: it sometimes returned the QuerySet, but sometimes returned the list of results, depending on how it was called. That was another code smell that should have pointed us toward splitting it into the two functions it is now.

wolf550e 1:53 PM on 29 Sep 2008

Your DBA could and should have seen a massive new query returning a lot of data when you changed app code version and told you about it. Seeing the SQL and comparing it with similar queries from the previous version of the app, you would have noticed that the query without the LIMIT clause is not supposed to be executed. This would have helped you find the problem.

A very interesting read, but you should consider logging your db queries to find perf. regressions.

Ned Batchelder 3:24 PM on 29 Sep 2008

@wolf550e: your point is a good one, we should be monitoring the db more closely to see egregious changes like this. One small point, though: we don't have a DBA: it's us.

zgoda 2:45 PM on 23 Oct 2008

The leak is still present in MySQLdb. I'm waiting eagerly for fix...

frank 3:08 AM on 17 Dec 2008

a good post;
maybe others having the same problem will get the answer;

but to my puzzled, Python's GC don't free large-object(>256k),so when the python program allocate large-object memory from system..it coudln't be stopped until it reach the limit of the OS.

Add a comment:

Ignore this:
not displayed and no spam.
Leave this empty:
not searched.
Name and either email or www are required.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.