A tour of some real code showing little-used power features of the Python regular expression module.

Python regexes have a number of features that bring new power to text manipulation. I’m not talking about fancy matching features like negative look-behinds, but ways you can construct and use regexes. As a demonstration, I’ll show you some real code from a real project.

Coverage.py will expand environment variables in values read from its configuration files. It does this with a function called substitute_variables:

def substitute_variables(
    text: str,
    variables: dict[str, str],
) -> str:
    """
    Substitute ``${VAR}`` variables in `text`.

    Variables in the text can take a number of
    shell-inspired forms::

        $VAR
        ${VAR}
        ${VAR?}         strict: an error if no VAR.
        ${VAR-miss}     defaulted: "miss" if no VAR.
        $$              just a dollar sign.

    `variables` is a dictionary of variable values.

    Returns the resulting text with values substituted.

    """

Call it with a string and a dictionary, and it makes the substitutions:

>>> substitute_variables(
...     text="Look: $FOO ${BAR-default} $$",
...     variables={'FOO': 'Xyzzy'},
... )
'Look: Xyzzy default $'

We use a regex to pick apart the text:

dollar_pattern = r"""(?x)   # Verbose regex syntax
    \$                      # A dollar sign,
    (?:                     # then
        (?P<dollar> \$ ) |      # a dollar sign, or
        (?P<word1> \w+ ) |      # a plain word, or
        \{                      # a {-wrapped
            (?P<word2> \w+ )        # word,
            (?:                         # either
                (?P<strict> \? ) |      # strict or
                -(?P<defval> [^}]* )    # defaulted
            )?                      # maybe
        }
    )
    """

This isn’t a super-fancy regex: it doesn’t use advanced pattern matching. But there are some useful regex features at work here:

The (?x) flag at the beginning turns on “verbose” regex syntax. In this mode, white space in the pattern is ignored (except inside a character class or when escaped with a backslash), so the regex can span multiple lines and be indented to show its structure, and comments are allowed at the ends of lines.
Named groups like (?P<word1> … ) are used to capture parts of the text that we can retrieve later by name.
There are also two groups used to get the precedence of operators right, but we don’t want to capture those values separately, so I use the non-capturing group syntax for them: (?: … ). In this code, we only ever access groups by name, so I could have left them as regular capturing groups, but I think it’s clearer to indicate up-front that we won’t be using them.
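
Here’s a minimal sketch of those three features on a smaller, made-up pattern. The version-string format and the group names are assumptions for illustration, not anything from coverage.py:

```python
import re

# Hypothetical pattern: a version string like "v3.12" or "1.2.3".
version_pattern = r"""(?x)      # Verbose regex syntax
    (?: v )?                    # optional "v" prefix, not captured
    (?P<major> \d+ )            # major number
    \.
    (?P<minor> \d+ )            # minor number
    (?: \. (?P<patch> \d+ ) )?  # optional patch number
    """

m = re.fullmatch(version_pattern, "v3.12")
print(m["major"], m["minor"], m["patch"])   # 3 12 None
print(m.groups())                           # ('3', '12', None)
```

Only the named groups show up in m.groups(). The two (?: … ) groups shape the match without capturing anything.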

The verbose syntax in particular makes it easier to understand the regex. Compare to what it would look like in one line:

r"\$(?:(?P<dollar>\$)|(?P<word1>\w+)|\{(?P<word2>\w+)(?:(?P<strict>\?)|-(?P<defval>[^}]*))?})"

Once we have the regex, we can use re.sub() to replace the variables with their values:

re.sub(dollar_pattern, dollar_replace, text)

But we’re going to use another power feature of Python regexes: dollar_replace here isn’t a string, it’s a function! Each fragment the regex matches will be passed as a match object to our dollar_replace function. It returns a string which re.sub() uses as the replacement in the text:

def dollar_replace(match: re.Match[str]) -> str:
    """Called for each $replacement."""
    # Get the one group that matched.
    groups = match.group('dollar', 'word1', 'word2')
    word = next(g for g in groups if g)

    if word == "$":
        return "$"
    elif word in variables:
        return variables[word]
    elif match["strict"]:
        msg = f"Variable {word} is undefined: {text!r}"
        raise NameError(msg)
    else:
        return match["defval"]

First we use match.group(). Called with a number of names, it returns a tuple of what those named groups matched. They could be the matched text, or None if the group didn’t match anything.
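
To see that tuple-returning behavior in isolation, here’s a small sketch using a made-up two-alternative pattern (the group names num and name are hypothetical):

```python
import re

# Hypothetical pattern: either a run of digits or a run of lowercase letters.
m = re.search(r"(?P<num>\d+)|(?P<name>[a-z]+)", "abc 123")

# Several names at once: one tuple entry per group, None where a group didn't match.
print(m.group("num", "name"))   # (None, 'abc')

# A single group can also be read by indexing the match object.
print(m["name"])                # abc
```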

The way our regex is written, only one of those three groups will match, so the tuple will have one string and two Nones. To get the matched string, we use next() to find the first true value. If the built-in any() returned the first true thing it found, this code could be simpler, but it returns only True or False, so we have to do it this way.
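
A quick sketch of the difference, using a stand-in tuple shaped like the one match.group() would return. The truthiness test is safe for our regex because each of the three groups has to match at least one character, so a matched group is never an empty string:

```python
# Stand-in for the tuple returned by match.group('dollar', 'word1', 'word2').
groups = (None, "FOO", None)

# any() only reports whether some item is true.
print(any(groups))                      # True

# next() over a filtered generator returns the first true item itself.
print(next(g for g in groups if g))     # FOO
```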

Now we can check the value to decide on the replacement:

If the match was a dollar sign, we return a dollar sign.
If the word is one of our defined variables, we return the value of the variable.
Since the word isn’t a defined variable, we check if the “strict” marker was found, and if so, raise an exception.
Otherwise we return the default value provided.

The final piece of the implementation is to use re.sub() and return the result:

return re.sub(dollar_pattern, dollar_replace, text)
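
Putting the fragments together, the whole thing could look roughly like this. This is my assembly of the pieces shown above, not a copy of the coverage.py source, so details there may differ. I’ve assumed dollar_replace is nested inside substitute_variables so it can see text and variables, and the `or ""` on the last line is an addition in this sketch:

```python
import re

dollar_pattern = r"""(?x)   # Verbose regex syntax
    \$                      # A dollar sign,
    (?:                     # then
        (?P<dollar> \$ ) |      # a dollar sign, or
        (?P<word1> \w+ ) |      # a plain word, or
        \{                      # a {-wrapped
            (?P<word2> \w+ )        # word,
            (?:                         # either
                (?P<strict> \? ) |      # strict or
                -(?P<defval> [^}]* )    # defaulted
            )?                      # maybe
        }
    )
    """

def substitute_variables(text: str, variables: dict[str, str]) -> str:
    """Substitute ``${VAR}`` variables in `text`."""

    def dollar_replace(match: re.Match[str]) -> str:
        """Called for each $replacement."""
        # Get the one group that matched.
        groups = match.group("dollar", "word1", "word2")
        word = next(g for g in groups if g)

        if word == "$":
            return "$"
        elif word in variables:
            return variables[word]
        elif match["strict"]:
            msg = f"Variable {word} is undefined: {text!r}"
            raise NameError(msg)
        else:
            # `or ""` is an addition in this sketch: with no default given,
            # "defval" is None, and this keeps the return value a str.
            return match["defval"] or ""

    return re.sub(dollar_pattern, dollar_replace, text)

print(substitute_variables("Look: $FOO ${BAR-default} $$", {"FOO": "Xyzzy"}))
# Look: Xyzzy default $
```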

Regexes are often criticized for being too opaque and esoteric. But done right, they can be very powerful and don’t have to be a burden. What we’ve done here is pair simple pattern matching with useful API features to compactly write a useful transformation.

BTW, if you are interested, the real code is in coverage.py.

Comments

bjh 8:08 AM on 24 Apr 2025

The next(...if g) as “coalesce” is nifty, though in another situation if there isn’t guaranteed to be a true value in the iterable, it could raise StopIteration; adding a default can make it a little more robust, though not as pretty:

first = next((x for x in itr if x), None)

And, yeah, it’s too bad that any() doesn’t work this way.

Marcos Dione 5:29 AM on 25 Apr 2025

My approach for complex regexps is to split them into chunks and test them, first separately and then integrated. I wrote about that here:

https://www.grulic.org.ar/~mdione/glob/posts/is-dinant-dead-or-a-tip-for-writing-regular-expressions/

Regexps are (very dense) code, so you should test them!

Rodrigo Girão Serrão 2:00 PM on 9 May 2025

You may or may not like it, but you can use any with assignment expressions to achieve what you’re looking for:

any((word := g) for g in match.group(...))

If match.group(...) is not empty, word will either be the first truthy group or the very last group.

Before I wrote down the code I was hoping it would actually look good in this instance, but it doesn’t… But I’ll leave this here, for reference.

Regex affordances
