Sunday 4 November 2012
About a month ago, I found a performance bug in the tokenize standard library module, and with help from Aron Griffis, submitted a patch to fix it. Yesterday was a Python bug day, and Ezio Melotti merged my change, so I have officially contributed to CPython!
The bug in tokenize was an obscure case: if the code ends with a line that starts with non-whitespace, ends with many spaces, and has no trailing newline, then the tokenizer exhibits O(N²) running time, where N is the number of spaces. Each space is tokenized as an error token (because no valid token starts there), so N tokens are produced, but producing each one takes linear time while the regex scans ahead and confirms there's no good token following it, which adds up to O(N²) overall.
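You can see the error tokens directly by printing the token stream for a tiny input. A sketch (on a tokenize without the fix, each trailing space should come back as its own ERRORTOKEN):

import tokenize
from cStringIO import StringIO

# One good token ("@"), then three trailing spaces and no newline.
for tok in tokenize.generate_tokens(StringIO("@   ").readline):
    # tok is a 5-tuple; tok[0] is the token type, tok[1] the token text.
    print tokenize.tok_name[tok[0]], repr(tok[1])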
I discovered this while working on code that grades student submissions at edX. For some reason there was a submission ending with 40,000 spaces and no newline, and it was taking 20 minutes to tokenize!
Simple demonstration:
import tokenize
import time
from cStringIO import StringIO

def time_to_tokenize_trailing(spaces):
    # One non-space character, then `spaces` trailing spaces, and no newline.
    source = StringIO("@" + " "*spaces)
    start = time.time()
    list(tokenize.generate_tokens(source.readline))
    end = time.time()
    return end - start

for spaces in xrange(1000, 15000+1, 1000):
    print "%5d: %.2fs" % (spaces, time_to_tokenize_trailing(spaces))
Ouch:
1000: 0.71s
2000: 2.83s
3000: 6.47s
4000: 11.52s
5000: 17.68s
6000: 26.16s
7000: 35.35s
8000: 46.65s
9000: 58.35s
10000: 72.80s
11000: 89.53s
12000: 107.27s
13000: 126.44s
14000: 147.60s
15000: 166.81s
The growth is clearly quadratic: fifteen times the spaces takes over 200 times as long. If you are running a server that tokenizes untrusted Python code, you might want to throw an .rstrip() into it to prevent this case...
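For example, a minimal sketch of that mitigation, assuming the untrusted source arrives as a string (the function and variable names here are hypothetical):

import tokenize
from cStringIO import StringIO

def tokenize_untrusted(source_text):
    # Strip trailing whitespace (the pathological case), then put back a
    # final newline so the tokenizer sees a properly terminated last line.
    cleaned = source_text.rstrip() + "\n"
    return list(tokenize.generate_tokens(StringIO(cleaned).readline))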
Comments
Congrats on the first patch acceptance. I ran the test code just now and I'm getting timings on the same order as yours:
1000: 0.66s
2000: 2.59s
3000: 5.86s
4000: 10.39s
5000: 16.44s
6000: 23.83s
7000: 32.91s
8000: 42.12s
9000: 52.77s
10000: 65.42s
11000: 80.21s
12000: 98.65s
13000: 114.85s
14000: 134.36s
15000: 156.47s
~Jack