Isolated @memoize

Saturday 16 January 2016

When calling functions that are expensive, and expected to return the same results for the same input, lots of people like using an @memoize decorator. It uses a cache to quickly return the same results if they have been produced before. Here's a simplified one, adapted from a collection of @memoize implementations:

def memoize(func):
    cache = {}

    def memoizer(*args, **kwargs):
        key = str(args) + str(kwargs)
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return memoizer

@memoize
def expensive_fn(a, b):
    return a + b        # Not actually expensive!

This is great, and does what we want: repeated calls to expensive_fn with the same arguments will use the cached values instead of actually invoking the function.

But there's a potential problem: the cache dictionary is a global. Don't be fooled by the fact that it isn't literally a global: it doesn't use the global keyword, and it isn't a module-level variable. But it is global in the sense that there is only one cache dictionary for expensive_fn for the entire process.

Globals can interfere with disciplined testing. One ideal of automated tests in a suite is that each test be isolated from all the others. What happens in test1 shouldn't affect test99. But here, if test1 and test99 both call expensive_fn with arguments (1, 2), then test1 will run the function, but test99 will get the cached value. Worse, if I run the complete suite, test99 gets a cached value, but if I run test99 alone, it runs the function.

This might not be a problem, if expensive_fn is truly a pure function with no side effects. But sometimes that's not the case.
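
Here is a minimal sketch of that order dependence, using the simplified decorator from above. The names fetch_config, calls, and real_calls_during are hypothetical, and appending to a list stands in for a real side effect such as a network request.

```python
def memoize(func):
    cache = {}

    def memoizer(*args, **kwargs):
        key = str(args) + str(kwargs)
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return memoizer

calls = []  # hypothetical log of side effects

@memoize
def fetch_config(name):
    calls.append(name)  # stand-in for a network request
    return "config for " + name

def test1():
    assert fetch_config("site") == "config for site"

def test99():
    assert fetch_config("site") == "config for site"

def real_calls_during(test):
    """Run a test and report how many uncached calls it caused."""
    before = len(calls)
    test()
    return len(calls) - before

first = real_calls_during(test1)    # the cache is empty, so the function runs
second = real_calls_during(test99)  # same arguments, served from the cache
```

Both tests pass, but only test1 causes the side effect. In a fresh process where test99 runs alone, it would be the one to cause it.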

I inherited a project that used @memoize to retrieve some fixed data from a web site. @memoize is great here because it means each resource will be fetched only once, no matter how the program uses them. The test suite used Betamax to fake the network access.

Betamax is great: it automatically monitors network access, and stores a "cassette" for each test case, which is a JSON record of what was requested and returned. The next time the tests are run, the cassette is used, and the network access is faked.

The problem is that test1's cassette will have the network request for the memoized resource, and test99's cassette will not: test99 never requested the resource, because @memoize made the request unnecessary. Now if I run test99 by itself, it has no way to get the resource, and the test fails. test1 and test99 weren't properly isolated, because they shared the global cache of memoized values.

My solution was to use an @memoize that I could clear between tests. Instead of writing my own, I used the lru_cache decorator from functools (or from the functools32 backport if you are still using Python 2.7). It offers a .cache_clear() method that clears all the values from the hidden global cache. The method lives on each decorated function, so we have to keep a list of them:

import functools

# A list of all the memoized functions, so that
# `clear_memoized_values` can clear them all.
_memoized_functions = []

def memoize(func):
    """Cache the value returned by a function call."""
    func = functools.lru_cache()(func)
    _memoized_functions.append(func)
    return func

def clear_memoized_values():
    """Clear all the values saved by @memoize, to ensure isolated tests."""
    for func in _memoized_functions:
        func.cache_clear()
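
As a quick check of how this behaves, here is a small sketch that repeats the two helpers so it runs on its own. load_resource and call_count are hypothetical names added for illustration; cache_info() is the statistics method that lru_cache puts on each decorated function.

```python
import functools

_memoized_functions = []

def memoize(func):
    """Cache the value returned by a function call."""
    func = functools.lru_cache()(func)
    _memoized_functions.append(func)
    return func

def clear_memoized_values():
    """Clear all the values saved by @memoize."""
    for func in _memoized_functions:
        func.cache_clear()

call_count = 0  # hypothetical counter of real invocations

@memoize
def load_resource(name):
    global call_count
    call_count += 1
    return name.upper()

load_resource("a")
load_resource("a")  # served from the cache
hits_before = load_resource.cache_info().hits

clear_memoized_values()
size_after_clear = load_resource.cache_info().currsize

load_resource("a")  # the cache is empty again, so the function runs
```

After the clear, the third call invokes the function again, which is what a test running in isolation needs.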

Now an automatic fixture (for py.test) or a setUp method can clear the cache before each test:

# For py.test:

import pytest

@pytest.fixture(autouse=True)
def reset_all_memoized_functions():
    """Clears the values cached by @memoize before each test."""
    clear_memoized_values()

# For unittest:

import unittest

class MyTestCaseBase(unittest.TestCase):
    def setUp(self):
        super().setUp()
        clear_memoized_values()

In truth, it might be better to distinguish between the various reasons for using @memoize. A pure function might be fine to cache between tests: who cares when the value is computed? But other uses clearly should be isolated. @memoize isn't magic. You have to think about what it is doing for you, and when you want to have more control.
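
One way that distinction could look, as a sketch only: two decorators, where only one registers its functions for clearing. The names memoize_pure, memoize_isolated, and clear_isolated_values are hypothetical, and maxsize=None (an unbounded cache) is my assumption rather than something the project used.

```python
import functools

# Only functions whose caching must not leak between tests are listed here.
_isolated_functions = []

def memoize_pure(func):
    """For pure functions: cache for the life of the process."""
    return functools.lru_cache(maxsize=None)(func)

def memoize_isolated(func):
    """For impure functions: cache, but allow clearing between tests."""
    func = functools.lru_cache(maxsize=None)(func)
    _isolated_functions.append(func)
    return func

def clear_isolated_values():
    """Clear only the caches registered by @memoize_isolated."""
    for func in _isolated_functions:
        func.cache_clear()

@memoize_pure
def square(n):
    return n * n

fetched = []  # hypothetical log standing in for network access

@memoize_isolated
def fetch(name):
    fetched.append(name)
    return "contents of " + name

square(3)
fetch("a")
clear_isolated_values()  # what a test fixture would call
fetch("a")               # runs again, because its cache was cleared
```

The pure function keeps its cached value across the clear, while the impure one is recomputed.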

Comments

Connelly Barnes 4:34 PM on 16 Jan 2016

Reminds me of this post of mine from many years ago:

http://code.activestate.com/recipes/440678-memoization-with-cache-cleared-on-return-of-last-f/

I believe I was trying to solve a sequence of dynamic programming problems but wanted the cache to clear out in between each problem.

Elliot Cameron 10:03 PM on 16 Jan 2016

Another approach is to introduce memoization at some point lower in the stack (instead of globally) and dependency inject it into your code. You could also put it on an object so that it lasts as long as the object is around. These are usually better when you're caching an impure operation instead of a pure one.

It's worth noting that, as written, this code is also not thread safe.
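
A minimal sketch of the object-scoped idea in this comment, not anyone's actual code. ResourceClient and fake_fetch are hypothetical names, the fetching callable is injected, and the cache lives only as long as the client object. Like the post's code, this sketch makes no attempt at thread safety.

```python
class ResourceClient:
    """Caches fetched resources for the lifetime of this object only."""

    def __init__(self, fetch):
        self._fetch = fetch  # injected callable that does the real work
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]

requests = []  # hypothetical log standing in for network access

def fake_fetch(name):
    requests.append(name)
    return "contents of " + name

client_a = ResourceClient(fake_fetch)
client_a.get("index")
client_a.get("index")  # served from client_a's cache

client_b = ResourceClient(fake_fetch)
client_b.get("index")  # a new object starts with an empty cache
```

A test that builds its own client gets its own cache, so nothing needs to be cleared between tests.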

Chris 10:47 AM on 18 Jan 2016

Another approach I've used is to require the user to explicitly control the cache's lifetime with a with statement, for instance in their main code block. You can't then forget to clear something in your tests, silently breaking isolation, because the code just won't work outside of an explicit scope.
