Regular expression diversity

Friday 10 August 2007

I'm amazed at how many regular expression libraries there are, and at how each invents new syntax for some new feature. The Oniguruma library, for example, describes character class operators:

^...    negative class (lowest precedence operator)
x-y     range from x to y
[...]   set (character class in character class)
..&&..  intersection (low precedence at the next of ^)
  ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

as well as greedy, reluctant, and possessive quantifiers? Yikes.
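To see where that example's result comes from, here is a minimal sketch that models the class operators as plain Python sets. It does not use Oniguruma at all (Python's re module has no && operator), the helper name char_range is made up for illustration, and the negated class is assumed to be taken over lowercase ASCII letters only, which is enough for this example.

```python
import re
import string

# Assumption: restrict the universe to lowercase ASCII letters.
letters = set(string.ascii_lowercase)

def char_range(first, last):
    """Hypothetical helper: the set of characters from first to last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

a_to_w = char_range("a", "w")                # [a-w]
not_c_to_g = letters - char_range("c", "g")  # [^c-g], within our universe
inner = not_c_to_g | {"z"}                   # [^c-g]z  (union with z)
result = a_to_w & inner                      # [a-w&&[^c-g]z]  (intersection)

# Cross-check against the simplified class using the standard re module.
simplified = {ch for ch in letters if re.fullmatch("[abh-w]", ch)}
assert result == simplified

print("".join(sorted(result)))  # abhijklmnopqrstuvw
```

The z is in the inner union but not in a-w, so the intersection throws it away, which is why it never shows up in [abh-w].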


Ben Finney 9:47 AM on 10 Aug 2007

> ^... negative class (lowest precedence operator)
> x-y range from x to y

These two are straight from POSIX (i.e. standard, lowest-common-denominator) regular expressions.

> [...] set (character class in character class)
> ..&&.. intersection (low precedence at the next of ^)

These just seem confusing. Who the heck needs such convoluted character classes? If it's that complex, spell out the list of characters in the class.

> ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

It's quite revealing that the best example they can give of "class-in-a-class" and "intersection" is far more confusing to read than the resulting class. If someone chose to write '[a-w&&[^c-g]z]' where they could write '[abh-w]', I'd keep them away from any programs I care about.

Ned Batchelder 3:28 PM on 10 Aug 2007

I couldn't agree more about the example of the character class set operators! It took me a while to figure out what was happening to the z in that example...
