Thursday 14 February 2013 — This is almost 12 years old. Be careful.
At edX, we have Python behind the scenes in courses to initialize the state of problems presented to students. Often, these problems are randomized so that different students will see different details in quantitative problems, but each student’s random seed is saved so that the student will see the same problem if they revisit the page.
The seed is used to seed the random module before executing any chunk of course Python, so that you can simply use the random module and know that you’ll get an appropriate value.
Today I found code like this in a course:
import q
random.seed(the_seed)
# ... generate the problem ...
My task was to refactor how information flowed around, and the_seed wasn’t going to be available, so I asked why the code was like this. It seemed odd, because the random module had just been seeded before this code was invoked, so why had the author bothered to re-seed the module with the same seed?
The answer was that it was a mysterious bug from months ago where the first time the code was run, it would produce a different result than any other time, and the re-seeding solved it. The q import seemed to be messing with the random seed, but only the first time.
The “only first time” clue pointed to it being code that is run on import. Remember, Python modules are just a series of statements, and when you import a module, it really executes all the statements. There’s no “import mode” that just collects function definitions. If you write a statement with a side effect at the top level of a module, that side effect will happen when you import the module.
But statements in module are only executed the first time the module is imported in a process. Subsequent imports simply produce another reference to the existing module object. Everything pointed to a statement running during import which stomped on the random module.
The q module imported a number of other modules, including numpy and sympy. But why would importing a module re-seed the random module?
A little experimenting showed that sympy was at fault here:
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # Some baseline data
>>> import random
>>> random.seed(17)
>>> random.random()
0.5219839097124932
>>> random.random()
0.8066907771186791
>>> random.random()
0.9604947743238768
>>> random.random()
0.2896253777644655
>>>
>>> # Re-seed, and import sympy
>>> random.seed(17)
>>> import sympy
>>> random.random()
0.8066907771186791
>>>
Looking at the values, after importing sympy, we’ve skipped ahead one number in our random sequence. So sympy isn’t re-seeding the generator, it’s consuming a random number.
To find out where, we resorted to a monkey-patching trick: Replace random.random with a booby-trap:
Python 2.7.3 (default, Aug 1 2012, 05:16:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> random.random = lambda: 1/0
>>> import sympy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/venv/lib/python2.7/site-packages/sympy/__init__.py", line 20, in <module>
from sympy.core import *
File "/venv/lib/python2.7/site-packages/sympy/core/__init__.py", line 8, in <module>
from expr import Expr, AtomicExpr
File "/venv/lib/python2.7/site-packages/sympy/core/expr.py", line 2020, in <module>
from mul import Mul
File "/venv/lib/python2.7/site-packages/sympy/core/mul.py", line 1197, in <module>
from numbers import Rational, igcd
File "/venv/lib/python2.7/site-packages/sympy/core/numbers.py", line 1993, in <module>
from function import FunctionClass
File "/venv/lib/python2.7/site-packages/sympy/core/function.py", line 43, in <module>
from sympy.utilities import default_sort_key
File "/venv/lib/python2.7/site-packages/sympy/utilities/__init__.py", line 13, in <module>
from runtests import test, doctest
File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 472, in <module>
class SymPyTests(object):
File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 475, in SymPyTests
seed=random.random()):
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: integer division or modulo by zero
OK, not sure why it’s importing its tests when I try to use the package, but looking at the code, here’s the culprit:
class SymPyTests(object):
def __init__(self, ..., seed=random.random()):
#...
self._seed = seed
Here we can see the problem. Remember that function arguments are computed once, when the function is defined. Since this function is defined when the module is imported, random.random() will be called during import, consuming one of our random numbers.
Better would be to define it like this:
class SymPyTests(object):
def __init__(self, ..., seed=None):
#...
self._seed = seed
if self._seed is None:
self._seed = random.random()
I’m not quite sure which behavior the author wanted, one seed for all the instances, or one seed per instance. I know I don’t want importing this module to change my random number sequence.
Amusingly enough, the behavior of the initializer is irrelevant, it’s only called in one place, and never defaults the seed argument:
def test(*paths, **kwargs):
...
seed = kwargs.get("seed", None)
if seed is None:
seed = random.randrange(100000000)
t = SymPyTests(..., seed)
The best solution for our code would be to not rely on the module-level random number sequence, and instead use our own Random object. Come to think of it, that’s what sympy should do too.
BTW, looking at why sympy is importing test infrastructure when I import it, there’s this in sympy/utilities/__init__.py:
"""Some utilities that may help.
"""
from iterables import (flatten, group, take, subsets,
variations, numbered_symbols, cartes, capture, dict_merge,
postorder_traversal, preorder_traversal, interactive_traversal,
prefixes, postfixes, sift, topological_sort)
from lambdify import lambdify
from source import source
from decorator import threaded, xthreaded
from runtests import test, doctest
from cythonutils import cythonized
from timeutils import timed
from misc import default_sort_key
This makes using utilities very convenient, since it contains everything at the top level. But the downside is it means you must always take everything. There is no way to import only part of utilities. Even if you use “from utilities.lambdify import lambdify,” Python will execute the utilities/__init__.py file, importing everything.
Lessons:
- Modules really execute when imported,
- Function defaults are computed once when the function is defined,
- Modifying global state like the random module’s implicit sequence is bad,
- Importing stuff into __init__.py makes your code monolithic and harder to use as you want, and
- Monkey-patching can be a great way to debug problems.
Comments
I've made the change to SymPy (development version).
https://github.com/sympy/sympy/pull/1785
Instead, you should use random.Random, which seeds without global state. We should also change SymPy to use this as well.
If you have any suggestions on how to improve SymPy's import structure, we'd love to hear it. We still want `from sympy import *` to import everything useful, because using SymPy interactively is very common. Will changing everything to use __all__ make things better?
• Don't use string seeds, or the sequence will not be predictable (Python 3.2 uses a different hashing method, while the hash in Python < 3.2 returns different results on 32-bit and 64-bit machines).
• Don't use randrange(), randint(), choice(), or shuffle(), because the algorithms differ in different Python versions. Only random.random() is guaranteed to return the same numbers for the same seed.
Aside: I took a long time to verify what I wrote above. When I finally hit the Preview button, I got the edit box back with an error message above it saying "You took a long time entering this post. Please preview it and submit it again." Hey, I was trying to preview!)
@Marius: thanks for the heads up about repeatability, it makes total sense, though I'd never considered it before. (Sorry about the preview snafu.. spam prevention...)
https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/__init__.py
btw, it's just about syntax cosmetics but what do you think about writing:
self._seed = random.random() if seed is None else seed
or
self._seed = seed or random.random()
Add a comment: