Regular expression diversity

Friday 10 August 2007

This is almost 16 years old. Be careful.

I’m amazed at how many regular expression libraries there are, and at how each invents new syntax for some new feature. The Oniguruma library, for example, describes character class operators:

^...    negative class (lowest precedence operator)
x-y     range from x to y
[...]   set (character class in character class)
..&&..  intersection (low precedence at the next of ^)

  ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

as well as greedy, reluctant, and possessive quantifiers? Yikes.
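The arithmetic in that example can be checked by hand with ordinary sets. This is a minimal Python sketch, not Oniguruma itself: it assumes the universe of characters is just the lowercase ASCII letters, and `char_range` is a helper made up for the illustration.

```python
import string

# Assumption: only lowercase ASCII letters exist, to keep the sets small.
universe = set(string.ascii_lowercase)

def char_range(first, last):
    """Hypothetical helper: the set of characters from first to last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

a_to_w = char_range("a", "w")                  # a-w
not_c_to_g = universe - char_range("c", "g")   # [^c-g]
right_side = not_c_to_g | {"z"}                # [^c-g]z  (union with z)
result = a_to_w & right_side                   # a-w && [^c-g]z

expected = char_range("a", "b") | char_range("h", "w")   # [abh-w]

print("".join(sorted(result)))   # abhijklmnopqrstuvw
print(result == expected)        # True
```

Under these assumptions the z turns out to do nothing: [^c-g] already contains z, and the intersection with a-w throws it away again.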


> ^... negative class (lowest precedence operator)
> x-y range from x to y

These two are straight from POSIX (i.e. standard, lowest-common-denominator) regular expressions.

> [...] set (character class in character class)
> ..&&.. intersection (low precedence at the next of ^)

These just seem confusing. Who the heck needs such convoluted character classes? If it's that complex, spell out the list of characters in the class.

> ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

It's quite revealing that the best example they can give of "class-in-a-class" and "intersection" is far more confusing to read than the resulting class. If someone chose to write '[a-w&&[^c-g]z]' where they could write '[abh-w]', I'd keep them away from any programs I care about.

I couldn't agree more about the example of the character class set operators! It took me a while to figure out what was happening to the z in that example...
