Pathological backtracking

Wednesday 19 November 2008

At work we've been using the well-regarded feedparser module to parse RSS feeds, and it works great for the most part, but we'd occasionally get a stuck server process. The CPU would spike to 100%, and wouldn't make any progress.

We discovered a particular feed would cause a particular regular expression in the code to spin endlessly. The regex was intended to determine if a style attribute is valid CSS:

if not re.match("^(\s*[-\w]+\s*:\s*[^:;]*(;|$))*$", style):
    return ''

Breaking this out into verbose regex syntax shows how it matches valid CSS:

"""(?x)             # use verbose regex syntax
    ^(
                    # A single CSS clause is:
    \s*             #   leading whitespace
    [-\w]+          #   a dash-word, the property name
    \s*:\s*         #   space, colon, space
    [^:;]*          #   anything but :;, the value
    (;|$)           #   ends with a semi or the end of the string
    
    )*              # Valid CSS is any number of clauses
    $
"""

And here's the snippet discovered in the feed that spun us hard (with whitespace added for readability):

<var style="COLOR: #fffafe; coming: ; basket: ; philologist: ; gradually: ;
encyclic: ; whitechapel: ; left: ; albino: ; lamelliform: ; foment: ;
adjuvant: ; Room:  ; Milk:  ; buynow: ; wheelwork: ; unseal: ; reasons: ;
socalled: ; dazed: ; Brain:  ; Kaleidoscope:  ; hardheaded: ; asthenic: ;
preferred: ;  Barbecue:  ; Comet:  ; Nail:  ; lubberly: ; School:  ;
Mist:  ; undercurrent: ; intwine: ; isotonic: ; Chief:  ; miscellaneous: ;
Book:  ; Shoes:  ; Chocolates:  ; deuced: ; you: ; Man:  ; federalize: ;
Rainbow:  ; Satellite:  ; Printer:  ; amicus: ; tautophony: ; taking: ;
regrater: ; waggon: ; prescient: ; God:  ; prosing: ; Bank:  ; hariolation: ;
patriarchs: ; Pyramid:  ; Data Base:  ; PaintBrush:  ; ingenu: ; Rope:  ;
parenchyma: ; price: ; Alphabet:  ; Circle:  ; seeks: ; frankhearted: ;
vituperate: ; dysmeromorph: ; Shop:  ; firm: ;  imperforation: ; lane: ;
Gemstone:  ; slatternly: ; Fire:  ; impudence: ; Carrot:  ; Fan:  ;
inoccupation: ; uncover: ; Liquid:  ; drawee: ; Pocket:  ;barbacan: ;
fornicatress: ; chimes: ; Crystal:  ;innovation: ; years: ; untiring: ;
Freeway:  ;desertful: ; unreined: ; Compass:  ; Hose:  ;prelusive: ;
impenetrability: ; Fruit:  ; direct: ; "></var>

(yes, it's garbage, and yes, spam sucks.)

It's hard to see the problem here, but this is not valid CSS because they used "Data Base" as a property name about half-way through and spaces aren't allowed in property names.

The CPU spins because when the regex encounters the failure to match "Data Base", it backtracks to reconsider previous matches in the hopes that it can still make the regex work. In fact, it isn't in an infinite loop, just a very very very long one. Eventually this regex will finish and decide that the string doesn't match.

But we don't need it to backtrack: going back to re-match previous CSS clauses isn't going to help.

Some regex libraries offer solutions to this problem. Possessive quantifiers let you use *+ to mean, match as many as possible, and once matched, don't try matching fewer during backtracking. They're called possessive because once the operator claims part of the string, it won't give it back for other operators to match later.

But Python doesn't offer possessive quantifiers (yet yet). So we have to choose a different technique than trying to match the whole string in one large regex. In this case, since we don't need the match data, we're just checking that the whole string matches, so we can use re.sub to remove matching clauses and then check that there's nothing left over:

if re.sub("\s*[-\w]+\s*:\s*[^:;]*;?\s*", '', style):
    return ''

Because re.sub grabs matches, performs the replacement, and moves on, there's no needless backtracking to throw a wrench in the works. Now our crazy CSS spam is speedily dispatched as invalid.

As an interesting side effect, if the string is not empty, what remains is the invalid part of the string.

Comments

[gravatar]
No one of consequence 11:41 AM on 19 Nov 2008

Or you could find a posix compatible regular expression library for python that uses a finite automaton (neither an NFA or DFA do backtracking) and then all matching will take time which is always proportional to the length of the input text. No more hidden exponential time traps.

[gravatar]
Chung-chieh Shan 12:13 PM on 19 Nov 2008

What you discovered is that regular expressions in Python are broken.

[gravatar]
Ned Batchelder 5:59 PM on 19 Nov 2008

It's interesting to read that not only are there alternative algorithms for implementing regexes, but that they perform so differently. Something tells me there's a reason Python and Perl both use the algorithm they do. Does anyone know the reason? Lots of effort goes into the regex libraries, it can't be ignorance that leaves us with exponential backtracking.

[gravatar]
nraynaud 6:29 PM on 19 Nov 2008

maybe you should stop using regex to parse grammars that are clearly not meant to be parsed with it ?

There are now nice parsec-like libraries in many languages. And they give you a more readable grammar, without comments.

[gravatar]
well done 7:34 PM on 19 Nov 2008

This is a great example of the inappropriate use of regular expressions.

[gravatar]
Bart 8:08 PM on 19 Nov 2008

You could rethink your problem before thinking of trying to find an alternative regexp engine. What you're trying to match looks like this:

^( \s* D (;|$) )*$

(with D a "declaration", a name and value pair), in order to match any of

- nothing
- D (;)?
- D ; D (;)?
- D ; D ; D (;)?
- D ; D ; D ; D (;)?
- etc.

Let's leave out the first case (empty string, which you can easily test for independently), and you match it with the following generic form:

^( D ; )* D (;)? $

With the semicolon between declaration *required*, this likely won't be so pathetic.

In Perl (I don't do Python), doing the pattern matching my way, on this data, took 4.6 milliseconds.

[gravatar]
Bart 8:14 PM on 19 Nov 2008

update: swapping the single and the repeated match in the regexp still doubles the speed (it clocks at 2.1 milliseconds!):

^ D ( ; D )* ;? $

[gravatar]
Simon Buchan 2:38 AM on 20 Nov 2008

@Ned: I would assume the reason Python and Perl don't use FAs is because there is a non-negligable compile cost, and I believe neither languages' default implementation precompiles. Perhaps Psycho or the fancy Perl 6 compiler I've heard such glowing praise of precompile?

From memory FAs can perform worse for some average-case inputs (smaller constants in the O(n^2)?)

[gravatar]
Ned Batchelder 6:10 AM on 20 Nov 2008

Thanks for all the suggestions and info. Those of you suggesting I not use a regex, remember, this was a fix to a 3rd-party library, so I wanted to make it work and move on, not bolt in a whole new technology.

@Bart, I appreciate the refactoring, but it doesn't seem to help Python. Even your latest form is pathological.

@Simon: thanks for the info on compile-time issues. That would have been my first guess.

[gravatar]
Paolo Bonzini 9:08 AM on 20 Nov 2008

No, the reasons why DFAs are not used are that:

1) it is a mess to guarantee leftmost/longest matching of subexpressions (either in the POSIX sense, or in the Perl sense); it is even worse if you add the greedy and stingy operators. DFAs are basically useful only if you want to match a regex against a string, but you don't care about what parts matched;

2) DFAs are naturally "stingy": they can tell you the shortest match, and you have to iterate in order to get longer matches. Unfortunately, in some cases (such as ^a.*b$) a well-optimized regex matcher can process greedy operators in sublinear time (only if the regex matches).

Besides, does Python have the (?> ...) syntax, so that (?> ...)* would be the same as (?: ...)*+ but without the possessive quantifier?

[gravatar]
Chung-chieh Shan 9:32 AM on 20 Nov 2008

The two points brought up by Paolo Bonzini are addressed in the article I linked to. Effort is not inconsistent with ignorance.

[gravatar]
Erez 9:39 AM on 22 Nov 2008

Matching with regexps/grammars is cute, but sometimes over-complicated. Just split on semicolons and match each entry by itself.
(True, a bit slower, but unnoticeable for most inputs. The readability is very noticeable, however)

[gravatar]
Dominic Cronin 8:54 AM on 26 Nov 2008

Jeffrey Friedl's book on the subject (http://regex.info/) has a full discussion of all those things.

Add a comment:

name
email
Ignore this:
not displayed and no spam.
Leave this empty:
www
not searched.
 
Name and either email or www are required.
Don't put anything here:
Leave this empty:
URLs auto-link and some tags are allowed: <a><b><i><p><br><pre>.