Python regexes have a number of features that bring new power to text
manipulation. I’m not talking about fancy matching features like negative
look-behinds, but ways you can construct and use regexes. As a demonstration,
I’ll show you some real code from a real project.
Coverage.py will expand environment variables in values read from its
configuration files. It does this with a function called substitute_variables:
def substitute_variables(
    text: str,
    variables: dict[str, str],
) -> str:
    """
    Substitute ``${VAR}`` variables in `text`.

    Variables in the text can take a number of
    shell-inspired forms::

        $VAR
        ${VAR}
        ${VAR?}         strict: an error if no VAR.
        ${VAR-miss}     defaulted: "miss" if no VAR.
        $$              just a dollar sign.

    `variables` is a dictionary of variable values.

    Returns the resulting text with values substituted.
    """
Call it with a string and a dictionary, and it makes the substitutions:
>>> substitute_variables(
...     text="Look: $FOO ${BAR-default} $$",
...     variables={'FOO': 'Xyzzy'},
... )
'Look: Xyzzy default $'
We use a regex to pick apart the text:
dollar_pattern = r"""(?x)   # Verbose regex syntax
    \$                      # A dollar sign,
    (?:                     # then
        (?P<dollar> \$ ) |      # a dollar sign, or
        (?P<word1> \w+ ) |      # a plain word, or
        \{                      # a {-wrapped
            (?P<word2> \w+ )        # word,
            (?:                     # either
                (?P<strict> \? ) |      # strict or
                -(?P<defval> [^}]* )    # defaulted
            )?                      # maybe
        }
    )
    """
This isn’t a super-fancy regex: it doesn’t use advanced pattern matching.
But there are some useful regex features at work here:
- The (?x) flag at the beginning turns on “verbose” regex syntax. In
this mode, all white space is ignored, so the regex can be multi-line, we can
indent to help see the structure, and comments are allowed at the ends of
lines.
- Named groups like (?P<word1> … ) are used to capture parts of
the text that we can retrieve later by name.
- There are also two groups used to get the precedence of operators right, but
we don’t want to capture those values separately, so I use the non-capturing
group syntax for them: (?: … ). In this code, we only ever access groups
by name, so I could have left them as regular capturing groups, but I think it’s
clearer to indicate up-front that we won’t be using them.
The verbose syntax in particular makes it easier to understand the regex.
Compare to what it would look like in one line:
r"\$(?:(?P<dollar>\$)|(?P<word1>\w+)|\{(?P<word2>\w+)(?:(?P<strict>\?)|-(?P<defval>[^}]*))?})"
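To see which groups capture what, we can try the pattern ourselves. This is a quick interactive check of the same pattern shown above, not part of the original code:

```python
import re

dollar_pattern = r"""(?x)   # Same pattern as above
    \$
    (?:
        (?P<dollar> \$ ) |
        (?P<word1> \w+ ) |
        \{
            (?P<word2> \w+ )
            (?:
                (?P<strict> \? ) |
                -(?P<defval> [^}]* )
            )?
        }
    )
    """

# A {-wrapped word with a default: only word2 and defval capture.
m = re.match(dollar_pattern, "${BAR-default}")
print(m.groupdict())
# {'dollar': None, 'word1': None, 'word2': 'BAR', 'strict': None, 'defval': 'default'}
```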
Once we have the regex, we can use re.sub() to replace the variables
with their values:

re.sub(dollar_pattern, dollar_replace, text)

But we’re going to use another power feature of Python regexes:
dollar_replace here isn’t a string, it’s a function! Each fragment the
regex matches will be passed as a match object to our dollar_replace
function. It returns a string which re.sub() uses as the replacement in the
text:
def dollar_replace(match: re.Match[str]) -> str:
    """Called for each $replacement."""
    # Get the one group that matched.
    groups = match.group('dollar', 'word1', 'word2')
    word = next(g for g in groups if g)

    if word == "$":
        return "$"
    elif word in variables:
        return variables[word]
    elif match["strict"]:
        msg = f"Variable {word} is undefined: {text!r}"
        raise NameError(msg)
    else:
        return match["defval"]
First we use match.group(). Called with a number of names, it returns
a tuple of what those named groups matched. Each element could be the matched
text, or None if the group didn’t match anything.

The way our regex is written, only one of those three groups will match, so
the tuple will have one string and two None’s. To get the matched string, we
use next() to find it. If the built-in any() returned the first
true thing it found, this code could be simpler, but it doesn’t, so we have to
do it this way.
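As a tiny illustration of that trick, here is a made-up two-group pattern (the group names a and b are just for this example):

```python
import re

# Only one of the two alternatives can match.
m = re.match(r"(?P<a>x)|(?P<b>y)", "y")
groups = m.group("a", "b")
print(groups)   # (None, 'y')

# next() pulls out the one group that actually matched.
word = next(g for g in groups if g)
print(word)     # y
```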
Now we can check the value to decide on the replacement:
- If the match was a dollar sign, we return a dollar sign.
- If the word is one of our defined variables, we return the value of the
variable.
- Since the word isn’t a defined variable, we check if the “strict” marker was
found, and if so, raise an exception.
- Otherwise we return the default value provided.
The final piece of the implementation is to use re.sub() and return
the result:

return re.sub(dollar_pattern, dollar_replace, text)
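Assembled from the pieces above, the whole function looks like this (condensed slightly here so it runs on its own):

```python
import re

def substitute_variables(text: str, variables: dict[str, str]) -> str:
    """Substitute ``${VAR}`` variables in `text`."""
    dollar_pattern = r"""(?x)   # Verbose regex syntax
        \$                      # A dollar sign,
        (?:                     # then
            (?P<dollar> \$ ) |      # a dollar sign, or
            (?P<word1> \w+ ) |      # a plain word, or
            \{                      # a {-wrapped
                (?P<word2> \w+ )        # word,
                (?:                     # either
                    (?P<strict> \? ) |      # strict or
                    -(?P<defval> [^}]* )    # defaulted
                )?                      # maybe
            }
        )
        """

    def dollar_replace(match: re.Match[str]) -> str:
        """Called for each $replacement."""
        # Get the one group that matched.
        groups = match.group('dollar', 'word1', 'word2')
        word = next(g for g in groups if g)

        if word == "$":
            return "$"
        elif word in variables:
            return variables[word]
        elif match["strict"]:
            msg = f"Variable {word} is undefined: {text!r}"
            raise NameError(msg)
        else:
            return match["defval"]

    return re.sub(dollar_pattern, dollar_replace, text)

print(substitute_variables("Look: $FOO ${BAR-default} $$", {"FOO": "Xyzzy"}))
# Look: Xyzzy default $
```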
Regexes are often criticized for being too opaque and esoteric. But done
right, they can be very powerful and don’t have to be a burden. What we’ve done
here is use simple pattern matching paired with useful API features to write a
useful transformation compactly.
BTW, if you are interested, the real code is in
coverage.py.
A parenting story from almost 30 years ago.
My wife told me about something her dad did when she was young: in the car,
knowing they were approaching an exit on the highway, he’d say to himself, but
loud enough for his daughters in the back to hear, “If only I could find exit
10...” The girls would look out the window and soon spot the very sign he
needed! “There it is Dad, we found it!” I liked it, it was clever and
sweet.
When my son Max was six or so, we were headed into
Boston to visit the big FAO Schwarz toy store that used to be on Boylston St.
They had a large bronze statue of a teddy bear on the corner in front of the
store. It must have been 10 or 12 feet tall. I wanted to try my father-in-law’s
technique with it.
Max had always been observant, competent and confident. The kind of kid who
could quickly tell you if a piece was missing from a Lego kit. I figured he’d
be the perfect target for this.
We got off the T (the subway if you aren’t from Boston), and had to walk a
bit. When we were a half block from the store, I could clearly see the bear
up ahead. I said, “If only I could find that bear statue...”
Max responded, “Oh Dad, I knew you didn’t know where it was!”
• • •
The store closed in 2004 and the bear was removed. I thought it was gone for
good. But on a walk a few weeks ago, I happened upon it outside the Tufts
Children’s Hospital.
Now I definitely know where it is:
Well, Anthropic and I were not a good fit, though
as predicted it was an experience. I’ve
started a new job on the Python language team at Netflix. It feels like a much
better match in a number of ways.
When sorting strings, you’d often like the order to make sense to a person.
That means numbers need to be treated numerically even if they are in a larger
string.
For example, sorting Python versions with the default sort() would give
you:
Python 3.10
Python 3.11
Python 3.9
when you want it to be:
Python 3.9
Python 3.10
Python 3.11
I wrote about this long ago (Human sorting), but have
continued to tweak the code and needed to add it to a
project recently. Here’s the latest:
import re

def human_key(s: str) -> tuple[list[str | int], str]:
    """Turn a string into a sortable value that works how humans expect.

    "z23A" -> (["z", 23, "a"], "z23A")

    The original string is appended as a last value to ensure the
    key is unique enough so that "x1y" and "x001y" can be distinguished.
    """
    def try_int(s: str) -> str | int:
        """If `s` is a number, return an int, else `s` unchanged."""
        try:
            return int(s)
        except ValueError:
            return s
    return ([try_int(c) for c in re.split(r"(\d+)", s.casefold())], s)

def human_sort(strings: list[str]) -> None:
    """Sort a list of strings how humans expect."""
    strings.sort(key=human_key)
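Trying it on the version strings from the introduction (human_key is repeated here, minus the annotations, so the snippet runs on its own):

```python
import re

def human_key(s):
    """Turn a string into a sortable value that works how humans expect."""
    def try_int(s):
        try:
            return int(s)
        except ValueError:
            return s
    # Interleave words and numbers, with numbers as actual ints.
    return ([try_int(c) for c in re.split(r"(\d+)", s.casefold())], s)

versions = ["Python 3.10", "Python 3.11", "Python 3.9"]
versions.sort(key=human_key)
print(versions)
# ['Python 3.9', 'Python 3.10', 'Python 3.11']
```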
The central idea here is to turn a string like "Python 3.9" into the
key ["Python ", 3, ".", 9] so that numeric components will be sorted by
their numeric value. The re.split() function gives us interleaved words and
numbers, and try_int() turns the numbers into actual numbers, giving us sortable
key lists.
There are two improvements from the original:
- The sort is made case-insensitive by using casefold() to lower-case the
string.
- The key returned is now a two-element tuple: the first element is the list
of intermixed strings and integers that gives us the ordering we want. The
second element is the original string unchanged, to ensure that unique strings
will always result in distinct keys. Without it, "x1y" and "x001Y"
would both produce the same key. This solves a problem that actually happened
when sorting the items of a dictionary.
# Without the tuple: different strings, same key!!
human_key("x1y") -> ["x", 1, "y"]
human_key("x001Y") -> ["x", 1, "y"]
# With the tuple: different strings, different keys.
human_key("x1y") -> (["x", 1, "y"], "x1y")
human_key("x001Y") -> (["x", 1, "y"], "x001Y")
If you are interested, there are many different ways to split the string into
the word/number mix. The comments on the old post
have many alternatives, and there are certainly more.
This still makes some assumptions about what is wanted, and doesn’t cover all
possible options (floats? negative/positive? full file paths?). For those, you
probably want the full-featured natsort (natural sort)
package.
AI is everywhere these days, and everyone has opinions and thoughts. These
are some of mine.
Full disclosure: for a time I worked for Anthropic, the makers of
Claude.ai. I no longer do, and nothing in
this post (or elsewhere on this site) is their opinion or is proprietary to
them.
How to use AI
My advice about using AI is simple: use AI as an assistant, not an expert,
and use it judiciously. Some people will object, “but AI can be wrong!” Yes,
and so can the internet in general, but no one now recommends avoiding online
resources because they can be wrong. They recommend taking it all with a grain
of salt and being careful. That’s what you should do with AI help as well.
We are all learning how to use AI well. Prompt engineering is a new
discipline. It surprises me that large language models (LLMs) give better
answers if you include phrases like “think step-by-step” or “check your answer
before you reply” in your prompt, but they do improve the result. LLMs are not
search engines, but like search engines, you have to approach them as unique
tools that will do better if you know how to ask the right questions.
If you approach AI thinking that it will hallucinate and be wrong, and then
discard it as soon as it does, you are falling victim to
confirmation bias. Yes, AI will be wrong sometimes. That
doesn’t mean it is useless. It means you have to use it carefully.
I’ve used AI to help me write code when I didn’t know how to get started
because it needed more research than I could afford at the moment. The AI
didn’t produce finished code, but it got me going in the right direction, and
iterating with it got me to working code.
One thing it seemed to do well was to write more tests given a few examples
to start from. Your workflow probably has steps where AI can help you. It’s
not a magic bullet, it’s a tool that you have to learn how to use.
The future of coding
In beginner-coding spaces like Python Discord,
anxious learners ask if there is any point in learning to code, since won’t AI
take all the jobs soon anyway?
Simon Willison seems to be our best guide to the
head-spinning pace of AI development these days (if you can keep up with the
head-spinning pace of his blog!) I like what he said
recently about how AI will affect new programmers:
There has never been a better time to learn to code — the
learning curve is being shaved down by these new LLM-based tools, and the
amount of value people with programming literacy can produce is going up by
an order of magnitude.
People who know both coding and LLMs will be a whole lot more attractive
to hire to build software than people who just know LLMs for many years to
come.
Simon has also emphasized in his writing what I have found: AI lets me write
code that I wouldn’t have undertaken without its help. It doesn’t produce the
finished code, but it’s a helpful pair-programming assistant.
Can LLMs think?
Another objection I see often: “but LLMs can’t think, they just predict the
next word!” I’m not sure we have a consensus understanding of what “think” means
in this context. Airplanes don’t fly in the same way that birds do.
Automobiles don’t run in the same way that horses do. The important thing is
that they accomplish many of the same tasks.
OK, so AI doesn’t think the same way that people do. I’m fine with that.
What’s important to me is that it can do some work for me, work that could also
be done by people thinking. Cars (“horseless carriages”) do work that used to
be done by horses running. No one now complains that cars work differently than
horses.
If “just predict the next word” is an accurate description of what LLMs are
doing, it’s a demonstration of how surprisingly powerful predicting the next
word can be.
Harms
I am concerned about the harms that AI can cause. Some people and
organizations are focused on Asimov-style harms (will society collapse, will
millions die?) and I am glad they are. But I’m more concerned with
Dickens-style harms: people losing jobs not because AI can do their work, but
because people in charge will think AI can do other people’s work. Harms due to
people misunderstanding what AI does and doesn’t do well and misusing it.
I don’t see easy solutions to these problems. To go back to the car analogy:
we’ve been a car society for about 120 years. For most of that time we’ve been
leaning more and more towards cars. We are still trying to find the right
balance, the right way to reduce the harm they cause while keeping the benefits
they give us.
AI will be similar. The technology is not going to go away. We will not turn
our back on it and put it back into the bottle. We’ll continue to work on
improving how it works and how we work with it. There will be good and bad.
The balance will depend on how well we collectively use it and educate each
other, and how well we pay attention to what is happening.
Future
The pro-AI hype in the industry is now at a fever pitch; it’s completely
overblown. But the anti-AI crowd also seems to be railing against it without a
clear understanding of its current capabilities or the useful approaches.
I’m going to be using AI more, and learning where it works well and where it
doesn’t.
After nearly two years, I think this is finally ready: coverage.py can use
sys.monitoring to more efficiently measure branch
coverage.
I would love for people to try it, but it’s a little involved at the
moment:
Once you have both of those things, set the environment variable
COVERAGE_CORE=sysmon and run coverage as you usually do. If all goes
well, it should be faster. Please let me know!
Feedback is welcome in GitHub issues or in the
#coverage-py channel in the Python Discord server.
This has been a long journey, starting when I first
commented on PEP 669 that underpins this work. Mark Shannon and I have had
many back and forths about the behavior of sys.monitoring, finally landing on
something that would work for us both.
For the curious: traditionally coverage.py relied on sys.settrace.
Python calls my recording function for every line of Python executed. It’s
simple and effective, but inefficient. After I’ve been told a line was executed
once, I don’t need to be told again, but settrace keeps calling my function.
The new sys.monitoring that arrived in Python 3.12 lets me disable an event
once it’s fired, so after the first ping there’s no overhead to running that
same code multiple times.
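A rough sketch of the settrace approach (a toy tracer, not coverage.py’s actual code) shows the inefficiency: the same line numbers keep arriving even after we’ve recorded them once. With sys.monitoring, the callback can instead return sys.monitoring.DISABLE for a line it has already seen.

```python
import sys

line_events = []

def tracer(frame, event, arg):
    # "line" fires for every line executed, even lines we have
    # already recorded -- that's the inefficiency.
    if event == "line":
        line_events.append(frame.f_lineno)
    return tracer

def demo():
    total = 0
    for i in range(3):
        total += i
    return total

sys.settrace(tracer)
result = demo()
sys.settrace(None)

# The loop lines show up repeatedly in line_events.
```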
It took a while to iron out the event behavior that lets us measure branches
as well as lines, but Python 3.14.0 after alpha 5 has it, so we’re finally able
to announce coverage.py support for people to try out.