Saturday 14 March 2015 — This is nearly ten years old. Be careful.
One of the things that is very useful about Python is its extreme introspectability and malleability. Taken too far, it can make your code an unmaintainable mess, but it can be very handy when trying to debug large and complex projects.
Open edX is one such project. Its main repository has about 200,000 lines of Python spread across 1500 files. The test suite has 8000 tests.
I noticed that running the test suite left a number of temporary directories behind in /tmp. They all had names like tmp_dwqP1Y, made by the tempfile module in the standard library. Our tests have many calls to mkdtemp, which requires the caller to delete the directory when done. Clearly, some of these cleanups were not happening.
To find the misbehaved code, I could grep through the code for calls to mkdtemp, and then reason through which of those calls eventually deleted the file, and which did not. That sounded tedious, so instead I took the fun route: an aggressive monkeypatch to find the litterbugs for me.
My first thought was to monkeypatch mkdtemp itself. But most uses of the function in our code look like this:
from tempfile import mkdtemp
...
d = mkdtemp()
Because the function was imported directly, if my monkeypatching code ran after this import, the call wouldn’t be patched. (BTW, this is one more small reason to prefer importing modules, and using module.function in the code.)
Looking at the implementation of mkdtemp, it makes use of a helper function in the tempfile module, _get_candidate_names. This helper is a generator that produces those typical random tempfile names. If I monkeypatched that internal function, then all callers would use my code regardless of how they had imported the public function. Monkeypatching the internal helper had the extra advantage that using any of the public functions in tempfile would call that helper, and get my changes.
To find the problem code, I would put information about the caller into the name of the temporary file. Then each temp file left behind would be a pointer of sorts to the code that created it. So I wrote my own _get_candidate_names like this:
import inspect
import os.path
import tempfile
real_get_candidate_names = tempfile._get_candidate_names
def get_candidate_names_hacked():
stack = "-".join(
"{}{}".format(
os.path.basename(t[1]).replace(".py", ""),
t[2],
)
for t in inspect.stack()[4:1:-1]
)
for name in real_get_candidate_names():
yield "_" + stack + "_" + name
tempfile._get_candidate_names = get_candidate_names_hacked
This code uses inspect.stack to get the call stack. We slice it oddly, to get the closest three calling frames in the right order. Then we extract the filenames from the frames, strip off the “.py”, and concatenate them together along with the line number. This gives us a string that indicates the caller.
The real _get_candidate_names function is used to get a generator of good random names, and we add our stack inspection onto the name, and yield it.
Then we can monkeypatch our function into tempfile. Now as long as this module gets imported before any temporary files are created, the files will have names like this:
tmp_case53-case78-test_import_export289_DVPmzy/
tmp_test_video36-test_video143-tempfile455_2upTdS.srt
The first shows that the file was created in test_import_export.py at line 289, called from case.py line 78, from case.py line 53. The second shows that test_video.py has a few functions calling eventually into tempfile.py.
I would be very reluctant to monkeypatch private functions inside other modules for production code. But as a quick debugging trick, it works great.
Comments
Nice work, detective.
The idea of monkey-patching an internal function is genius.
http://blog.dscpl.com.au/2015/03/ordering-issues-when-monkey-patching-in.html
BTW, when I need to write file system related tests, I often use pyfakefs so that I don't need to care about file setting up and tearing down myself.
Add a comment: