Saturday 14 March 2015 — This is more than ten years old. Be careful.

One of the things that is very useful about Python is its extreme introspectability and malleability. Taken too far, it can make your code an unmaintainable mess, but it can be very handy when trying to debug large and complex projects.

Open edX is one such project. Its main repository has about 200,000 lines of Python spread across 1500 files. The test suite has 8000 tests.

I noticed that running the test suite left a number of temporary directories behind in /tmp. They all had names like tmp_dwqP1Y, made by the tempfile module in the standard library. Our tests have many calls to mkdtemp, which requires the caller to delete the directory when done. Clearly, some of these cleanups were not happening.

To find the misbehaving code, I could grep through the code for calls to mkdtemp, and then reason through which of those calls eventually deleted the directory, and which did not. That sounded tedious, so instead I took the fun route: an aggressive monkeypatch to find the litterbugs for me.

My first thought was to monkeypatch mkdtemp itself. But most uses of the function in our code look like this:

from tempfile import mkdtemp
...
d = mkdtemp()

Because the function was imported directly, if my monkeypatching code ran after this import, the call wouldn’t be patched. (BTW, this is one more small reason to prefer importing modules, and using module.function in the code.)
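To make the import problem concrete, here is a minimal sketch. The replacement function patched_mkdtemp is hypothetical and exists only for illustration. It shows that a name bound by "from tempfile import mkdtemp" keeps pointing at the original function after the module attribute is replaced:

```python
import tempfile
from tempfile import mkdtemp  # binds the current function object to a local name

original = tempfile.mkdtemp

def patched_mkdtemp(*args, **kwargs):
    # Hypothetical stand-in for a patch; it is never actually called here.
    raise RuntimeError("patched mkdtemp was called")

tempfile.mkdtemp = patched_mkdtemp
try:
    # module.function is looked up at call time, so it sees the patch.
    sees_patch = tempfile.mkdtemp is patched_mkdtemp
    # The directly imported name was bound earlier and is unaffected.
    direct_unpatched = mkdtemp is original
finally:
    # Always restore the real function.
    tempfile.mkdtemp = original
```

Both sees_patch and direct_unpatched end up True, which is why code using the directly imported name would bypass a patch installed later.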

The implementation of mkdtemp makes use of a helper function in the tempfile module, _get_candidate_names. This helper is a generator that produces those typical random tempfile names. If I monkeypatched that internal function, then all callers would use my code regardless of how they had imported the public function. Monkeypatching the internal helper had the extra advantage that using any of the public functions in tempfile would call that helper, and get my changes.
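As a quick check of that idea, the following sketch replaces the helper with a trivial generator and then calls a directly imported mkdtemp. The generator marked_names and the "MARK" prefix are made up for this illustration. It assumes a CPython version where mkdtemp still obtains its names from tempfile._get_candidate_names, which is a private detail that could change:

```python
import os
import shutil
import tempfile
from tempfile import mkdtemp  # imported directly, before any patching

# Private helper: an implementation detail of CPython's tempfile module.
real_names = tempfile._get_candidate_names

def marked_names():
    # Hypothetical replacement: tag every candidate name with "MARK".
    for name in real_names():
        yield "MARK" + name

tempfile._get_candidate_names = marked_names
try:
    d = mkdtemp()  # the direct import still reaches the patched helper
finally:
    tempfile._get_candidate_names = real_names

dir_name = os.path.basename(d)
shutil.rmtree(d)  # clean up, unlike the litterbugs
```

The directory name contains "MARK" even though mkdtemp was imported before the patch, because mkdtemp looks up the helper in the tempfile module's globals each time it runs.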

To find the problem code, I would put information about the caller into the name of the temporary file. Then each temp file left behind would be a pointer of sorts to the code that created it. So I wrote my own _get_candidate_names like this:

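import inspect
import os.path
import tempfile

# Keep a reference to the real helper so the wrapper can delegate to it.
real_get_candidate_names = tempfile._get_candidate_names

def get_candidate_names_hacked():
    # Build "<file><line>" entries for frames 4, 3, 2 of the call stack,
    # skipping this generator (0) and its immediate caller (1).
    stack = "-".join(
        "{}{}".format(
            os.path.basename(t[1]).replace(".py", ""),  # t[1]: file name
            t[2],                                       # t[2]: line number
        )
        for t in inspect.stack()[4:1:-1]
    )
    # Prefix each real random name with the caller description.
    for name in real_get_candidate_names():
        yield "_" + stack + "_" + name

# Install the wrapper in place of the private helper.
tempfile._get_candidate_names = get_candidate_names_hacked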

This code uses inspect.stack to get the call stack. Index 0 is our own generator and index 1 is the tempfile function that asked it for a name, so the slice [4:1:-1] picks the next three calling frames, ordered from outermost to innermost. From each frame we take the filename, strip off the “.py”, and append the line number, then join the three pieces with hyphens. This gives us a string that indicates the caller.
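The odd slice is easier to see in isolation. In this small sketch, the functions level1 through level4 are hypothetical stand-ins for a real call chain, and the helper reports which frames the slice [4:1:-1] selects:

```python
import inspect

def frame_names():
    # Index 0 is this function, index 1 is its caller, and so on outward.
    frames = inspect.stack()[4:1:-1]
    # Element 3 of each stack entry is the function name.
    return [frame[3] for frame in frames]

def level1():
    return frame_names()

def level2():
    return level1()

def level3():
    return level2()

def level4():
    return level3()

picked = level4()

# The same slice on a plain list: indexes 4, 3, 2, in that order.
indexes = list(range(6))[4:1:-1]
```

Here picked is ["level4", "level3", "level2"]: the slice skips frame_names and its immediate caller level1, and the negative step puts the outermost of the three frames first.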

The real _get_candidate_names function is used to get a generator of good random names, and we add our stack inspection onto the name, and yield it.

Then we can monkeypatch our function into tempfile. Now as long as this module gets imported before any temporary files are created, the files will have names like this:

tmp_case53-case78-test_import_export289_DVPmzy/
tmp_test_video36-test_video143-tempfile455_2upTdS.srt

The first shows that the directory was created in test_import_export.py at line 289, called from case.py line 78, from case.py line 53. The second shows a call at test_video.py line 36 leading to test_video.py line 143, which called into tempfile.py, where line 455 made the file.
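With many leftovers to read, a small decoder can turn these names back into file and line pairs. This is only a sketch: decode_temp_name is a hypothetical helper, and it assumes names shaped exactly like the two above, with a prefix that has no underscore, exactly three frames, and module names that contain no digits:

```python
import re

# One frame entry: a module name followed by its line number.
FRAME = re.compile(r"(?P<module>.*?\D)(?P<line>\d+)")

def decode_temp_name(name):
    """Return [(filename, line), ...], outermost caller first."""
    body = name.rstrip("/")
    # Drop the "tmp" prefix and the underscore that follows it.
    _prefix, _, rest = body.partition("_")
    frames = []
    # Three frames; the last part still carries "_<random>" and any suffix.
    for part in rest.split("-", 2):
        match = FRAME.match(part)
        if match is None:
            raise ValueError("unrecognized temp name: {!r}".format(name))
        frames.append((match.group("module") + ".py", int(match.group("line"))))
    return frames

first = decode_temp_name("tmp_case53-case78-test_import_export289_DVPmzy/")
second = decode_temp_name("tmp_test_video36-test_video143-tempfile455_2upTdS.srt")
```

For the first name this gives case.py line 53, case.py line 78, and test_import_export.py line 289, matching the reading above.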

I would be very reluctant to monkeypatch private functions inside other modules for production code. But as a quick debugging trick, it works great.

Comments

EricH 9:01 PM on 14 Mar 2015

Why did this make me think of "Hackers share the surgeon's secret pleasure in poking about in gross innards" (http://www.paulgraham.com/popular.html) :-)

Nice work, detective.

suresh 4:51 PM on 15 Mar 2015

Check out marmoset patching in python

nde 8:14 PM on 15 Mar 2015

waw - impressive!!

Christopher Allen-Poole 11:11 AM on 16 Mar 2015

Brilliant. Well done.

Marius Gedminas 6:22 AM on 17 Mar 2015

I once created a mkdtemp() wrapper that created a file in each new directory with the output of traceback.print_stack(), so I could hunt down unit tests that failed to clean them up. Then I had to search'n'replace mkdtemp calls in the entire testsuite to use it.

The idea of monkey-patching an internal function is genius.

andrew 11:29 AM on 17 Mar 2015

Did it work to find the offending calls?

Ned Batchelder 2:27 PM on 17 Mar 2015

@Andrew: I guess I forgot to mention, yes, it worked! :)

Graham Dumpleton 4:52 AM on 18 Mar 2015

You may find my related post about monkey patching interesting. I cited your post. :-)

http://blog.dscpl.com.au/2015/03/ordering-issues-when-monkey-patching-in.html

satoru 7:49 AM on 21 Mar 2015

Brilliant.

BTW, when I need to write file system related tests, I often use pyfakefs so that I don't need to care about file setting up and tearing down myself.

masklinn 12:42 PM on 22 Mar 2015

@Ned couldn't you have patched/replaced tempfile.mkdtemp() in a custom site.py rather than hunted down a private function to patch?

Ned Batchelder 3:23 PM on 22 Mar 2015

@masklinn: this is something I hadn't considered. Modifying site.py seems dangerous (it's a large file already), but it will try to import sitecustomize and usercustomize, which would be good places to try this kind of "before starting" customization. Thanks for the idea!

Sef Kloninger 3:46 PM on 25 Mar 2015

Nice hack and thanks for taking the time to share with everyone Ned.

Adam Chainz 8:53 PM on 26 Dec 2015

@Ned you can modify the code inside a function using Patchy: https://github.com/adamchainz/patchy . It doesn't care how you've imported the function - the function object is unchanged, it's the underlying code object (func.__code__) that gets edited, allowing you to not worry about module.function imports in the codebase.

Finding temp file creators
