Wednesday 19 November 2008 — This is exactly 16 years old. Be careful.
At work we’ve been using the well-regarded feedparser module to parse RSS feeds, and it works great for the most part, but we’d occasionally get a stuck server process. The CPU would spike to 100%, and wouldn’t make any progress.
We discovered a particular feed would cause a particular regular expression in the code to spin endlessly. The regex was intended to determine if a style attribute is valid CSS:
if not re.match("^(\s*[-\w]+\s*:\s*[^:;]*(;|$))*$", style):
return ''
Breaking this out into verbose regex syntax shows how it matches valid CSS:
"""(?x) # use verbose regex syntax
^(
# A single CSS clause is:
\s* # leading whitespace
[-\w]+ # a dash-word, the property name
\s*:\s* # space, colon, space
[^:;]* # anything but :;, the value
(;|$) # ends with a semi or the end of the string
)* # Valid CSS is any number of clauses
$
"""
And here’s the snippet discovered in the feed that spun us hard (with whitespace added for readability):
<var style="COLOR: #fffafe; coming: ; basket: ; philologist: ; gradually: ;
encyclic: ; whitechapel: ; left: ; albino: ; lamelliform: ; foment: ;
adjuvant: ; Room: ; Milk: ; buynow: ; wheelwork: ; unseal: ; reasons: ;
socalled: ; dazed: ; Brain: ; Kaleidoscope: ; hardheaded: ; asthenic: ;
preferred: ; Barbecue: ; Comet: ; Nail: ; lubberly: ; School: ;
Mist: ; undercurrent: ; intwine: ; isotonic: ; Chief: ; miscellaneous: ;
Book: ; Shoes: ; Chocolates: ; deuced: ; you: ; Man: ; federalize: ;
Rainbow: ; Satellite: ; Printer: ; amicus: ; tautophony: ; taking: ;
regrater: ; waggon: ; prescient: ; God: ; prosing: ; Bank: ; hariolation: ;
patriarchs: ; Pyramid: ; Data Base: ; PaintBrush: ; ingenu: ; Rope: ;
parenchyma: ; price: ; Alphabet: ; Circle: ; seeks: ; frankhearted: ;
vituperate: ; dysmeromorph: ; Shop: ; firm: ; imperforation: ; lane: ;
Gemstone: ; slatternly: ; Fire: ; impudence: ; Carrot: ; Fan: ;
inoccupation: ; uncover: ; Liquid: ; drawee: ; Pocket: ;barbacan: ;
fornicatress: ; chimes: ; Crystal: ;innovation: ; years: ; untiring: ;
Freeway: ;desertful: ; unreined: ; Compass: ; Hose: ;prelusive: ;
impenetrability: ; Fruit: ; direct: ; "></var>
(yes, it’s garbage, and yes, spam sucks.)
It’s hard to see the problem here, but this is not valid CSS because they used “Data Base” as a property name about half-way through and spaces aren’t allowed in property names.
The CPU spins because when the regex encounters the failure to match “Data Base”, it backtracks to reconsider previous matches in the hopes that it can still make the regex work. In fact, it isn’t in an infinite loop, just a very very very long one. Eventually this regex will finish and decide that the string doesn’t match.
But we don’t need it to backtrack: going back to re-match previous CSS clauses isn’t going to help.
Some regex libraries offer solutions to this problem. Possessive quantifiers let you use *+ to mean, match as many as possible, and once matched, don’t try matching fewer during backtracking. They’re called possessive because once the operator claims part of the string, it won’t give it back for other operators to match later.
But Python doesn’t offer possessive quantifiers (yet yet). So we have to choose a different technique than trying to match the whole string in one large regex. In this case, since we don’t need the match data, we’re just checking that the whole string matches, so we can use re.sub to remove matching clauses and then check that there’s nothing left over:
if re.sub("\s*[-\w]+\s*:\s*[^:;]*;?\s*", '', style):
return ''
Because re.sub grabs matches, performs the replacement, and moves on, there’s no needless backtracking to throw a wrench in the works. Now our crazy CSS spam is speedily dispatched as invalid.
As an interesting side effect, if the string is not empty, what remains is the invalid part of the string.
Comments
There are now nice parsec-like libraries in many languages. And they give you a more readable grammar, without comments.
^( \s* D (;|$) )*$
(with D a "declaration", a name and value pair), in order to match any of
- nothing
- D (;)?
- D ; D (;)?
- D ; D ; D (;)?
- D ; D ; D ; D (;)?
- etc.
Let's leave out the first case (empty string, which you can easily test for independently), and you match it with the following generic form:
^( D ; )* D (;)? $
With the semicolon between declaration *required*, this likely won't be so pathetic.
In Perl (I don't do Python), doing the pattern matching my way, on this data, took 4.6 milliseconds.
^ D ( ; D )* ;? $
From memory FAs can perform worse for some average-case inputs (smaller constants in the O(n^2)?)
@Bart, I appreciate the refactoring, but it doesn't seem to help Python. Even your latest form is pathological.
@Simon: thanks for the info on compile-time issues. That would have been my first guess.
1) it is a mess to guarantee leftmost/longest matching of subexpressions (either in the POSIX sense, or in the Perl sense); it is even worse if you add the greedy and stingy operators. DFAs are basically useful only if you want to match a regex against a string, but you don't care about what parts matched;
2) DFAs are naturally "stingy": they can tell you the shortest match, and you have to iterate in order to get longer matches. Unfortunately, in some cases (such as ^a.*b$) a well-optimized regex matcher can process greedy operators in sublinear time (only if the regex matches).
Besides, does Python have the (?> ...) syntax, so that (?> ...)* would be the same as (?: ...)*+ but without the possessive quantifier?
(True, a bit slower, but unnoticeable for most inputs. The readability is very noticeable, however)
Add a comment: