Finding temp file creators

Saturday 14 March 2015

This is close to ten years old. Be careful.

One of the things that is very useful about Python is its extreme introspectability and malleability. Taken too far, it can make your code an unmaintainable mess, but it can be very handy when trying to debug large and complex projects.

Open edX is one such project. Its main repository has about 200,000 lines of Python spread across 1500 files. The test suite has 8000 tests.

I noticed that running the test suite left a number of temporary directories behind in /tmp. They all had names like tmp_dwqP1Y, made by the tempfile module in the standard library. Our tests have many calls to mkdtemp, which requires the caller to delete the directory when done. Clearly, some of these cleanups were not happening.
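For reference, here is a minimal sketch of the cleanup that mkdtemp expects from its caller. The function name use_scratch_dir and the file scratch.txt are made up for illustration; they are not from the Open edX code.

```python
import os
import shutil
from tempfile import mkdtemp

def use_scratch_dir():
    """Create a temp directory, use it, and always remove it afterwards."""
    d = mkdtemp()
    try:
        # Stand-in for real test work: write one file into the directory.
        with open(os.path.join(d, "scratch.txt"), "w") as f:
            f.write("hello")
    finally:
        # mkdtemp never deletes anything itself; cleanup is the caller's job.
        shutil.rmtree(d)
    return d

leftover = use_scratch_dir()
print(os.path.exists(leftover))  # False: nothing left behind
```

A test that skips the finally clause, or that raises before reaching its cleanup, is exactly the kind of litterbug that leaves directories in /tmp.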

To find the misbehaving code, I could grep through the code for calls to mkdtemp, and then reason through which of those calls eventually deleted the directory, and which did not. That sounded tedious, so instead I took the fun route: an aggressive monkeypatch to find the litterbugs for me.

My first thought was to monkeypatch mkdtemp itself. But most uses of the function in our code look like this:

from tempfile import mkdtemp
...
d = mkdtemp()

Because the function was imported directly, if my monkeypatching code ran after this import, the call wouldn’t be patched. (BTW, this is one more small reason to prefer importing modules, and using module.function in the code.)
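A small sketch of the problem, using a hypothetical patched_mkdtemp that only records that it was called. The directly imported name keeps pointing at the original function, so only the module.function call sees the patch:

```python
import os
import tempfile
from tempfile import mkdtemp  # binds the original function to a local name

original_mkdtemp = tempfile.mkdtemp
calls = []

def patched_mkdtemp(*args, **kwargs):
    # Hypothetical replacement: note the call, then do the real work.
    calls.append("patched")
    return original_mkdtemp(*args, **kwargs)

tempfile.mkdtemp = patched_mkdtemp
try:
    d1 = mkdtemp()           # direct import: still the original function
    d2 = tempfile.mkdtemp()  # attribute lookup on the module: patched
finally:
    tempfile.mkdtemp = original_mkdtemp  # undo the patch

os.rmdir(d1)
os.rmdir(d2)
print(calls)  # ['patched'] -- only the second call was intercepted
```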

The implementation of mkdtemp uses a helper function in the tempfile module, _get_candidate_names. This helper returns an iterator that produces those typical random tempfile names. If I monkeypatched that internal function, then all callers would use my code regardless of how they had imported the public function. Monkeypatching the internal helper had an extra advantage: the other public functions in tempfile call the same helper, so they would get my changes too.
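Here is a rough sketch of why patching the helper sidesteps the import problem. It assumes CPython's tempfile module, where _get_candidate_names is a private function that the public functions look up at call time; the "TAG" prefix and the tagged_names generator are made up for the demonstration.

```python
import os
import tempfile
from tempfile import mkdtemp, mkstemp  # imported directly, before patching

# Private CPython helper; not a documented API, so this may change.
real_get_candidate_names = tempfile._get_candidate_names

def tagged_names():
    # Hypothetical stand-in: mark every candidate name so we can spot it.
    for name in real_get_candidate_names():
        yield "TAG" + name

tempfile._get_candidate_names = tagged_names
try:
    d = mkdtemp()         # directly imported, yet it uses the patched helper
    fd, path = mkstemp()  # a different public function, same helper
finally:
    tempfile._get_candidate_names = real_get_candidate_names

# Clean up after ourselves.
os.close(fd)
os.remove(path)
os.rmdir(d)

print(os.path.basename(d))
print(os.path.basename(path))
```

Both names come out with the tag in them, even though neither mkdtemp nor mkstemp was itself replaced.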

To find the problem code, I would put information about the caller into the name of the temporary file. Then each temp file left behind would be a pointer of sorts to the code that created it. So I wrote my own _get_candidate_names like this:

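import inspect
import os.path
import tempfile

# Keep a reference to the real (private) helper so we can delegate to it.
real_get_candidate_names = tempfile._get_candidate_names

def get_candidate_names_hacked():
    # Frames 4, 3, 2: the three nearest callers, outermost first.
    # t[1] is the frame's file name, t[2] is its line number.
    stack = "-".join(
        "{}{}".format(
            os.path.splitext(os.path.basename(t[1]))[0],
            t[2],
        )
        for t in inspect.stack()[4:1:-1]
    )
    # Tag each of the real random names with the caller information.
    for name in real_get_candidate_names():
        yield "_" + stack + "_" + name

# Swap our generator in for the private helper.
tempfile._get_candidate_names = get_candidate_names_hacked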

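This code uses inspect.stack to get the call stack. The odd slice, [4:1:-1], takes the frames at indexes 4, 3, and 2, so we get the closest three calling frames with the outermost first. Indexes 0 and 1 are skipped: they are our own generator and the tempfile code that called it. Then we extract the file name from each frame, strip off the “.py”, append the frame's line number, and join the pieces with hyphens. This gives us a string that indicates the caller.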
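To see what that slice selects, here is a toy call chain. All of the function names are invented stand-ins: fake_generator plays the role of the patched helper, and fake_mkdtemp plays the tempfile function that calls it.

```python
import inspect

# On a plain list, the slice picks indexes 4, 3, 2 in that order.
print(list(range(6))[4:1:-1])  # [4, 3, 2]

def fake_generator():
    # Index 0 is this frame; higher indexes are progressively older callers.
    frames = inspect.stack()
    return [f.function for f in frames[4:1:-1]]

def fake_mkdtemp():   # index 1: skipped, like the tempfile code
    return fake_generator()

def test_body():      # index 2: the direct caller we care about
    return fake_mkdtemp()

def run_test():       # index 3
    return test_body()

def run_suite():      # index 4
    return run_test()

picked = run_suite()
print(picked)  # ['run_suite', 'run_test', 'test_body']
```

The direct caller ends up last, which is why the most useful file and line number sit just before the random part of the temp file name.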

The real _get_candidate_names function is used to get a generator of good random names, and we add our stack inspection onto the name, and yield it.

Then we can monkeypatch our function into tempfile. Now as long as this module gets imported before any temporary files are created, the files will have names like this:

tmp_case53-case78-test_import_export289_DVPmzy/
tmp_test_video36-test_video143-tempfile455_2upTdS.srt

The first shows that the directory was created in test_import_export.py at line 289, called from case.py line 78, which was called from case.py line 53. The second shows test_video.py line 36 calling test_video.py line 143, which calls into tempfile.py at line 455.

I would be very reluctant to monkeypatch private functions inside other modules for production code. But as a quick debugging trick, it works great.

Comments

[gravatar]
Why did this make me think of "Hackers share the surgeon's secret pleasure in poking about in gross innards" (http://www.paulgraham.com/popular.html) :-)

Nice work, detective.
[gravatar]
Check out marmoset patching in python
[gravatar]
waw - impressive!!
[gravatar]
I once created a mkdtemp() wrapper that created a file in each new directory with the output of traceback.print_stack(), so I could hunt down unit tests that failed to clean them up. Then I had to search'n'replace mkdtemp calls in the entire testsuite to use it.

The idea of monkey-patching an internal function is genius.
[gravatar]
Did it work to find the offending calls?
[gravatar]
@Andrew: I guess I forgot to mention, yes, it worked! :)
[gravatar]
You may find my related post about monkey patching interesting. I cited your post. :-)

http://blog.dscpl.com.au/2015/03/ordering-issues-when-monkey-patching-in.html
[gravatar]
Brilliant.

BTW, when I need to write file system related tests, I often use pyfakefs so that I don't need to care about file setting up and tearing down myself.
[gravatar]
@Ned couldn't you have patched/replaced tempfile.mkdtemp() in a custom site.py rather than hunted down a private function to patch?
[gravatar]
@masklinn: this is something I hadn't considered. Modifying site.py seems dangerous (it's a large file already), but it will try to import sitecustomize and usercustomize, which would be good places to try this kind of "before starting" customization. Thanks for the idea!
[gravatar]
Nice hack and thanks for taking the time to share with everyone Ned.
[gravatar]
@Ned you can modify the code inside a function using Patchy: https://github.com/adamchainz/patchy . It doesn't care how you've imported the function - the function object is unchanged, it's the underlying code object (func.co_code) that gets edited, allowing you to not worry about module.function imports in the codebase.
