PyCon summer camp

Thursday 15 May 2025

I’m headed to PyCon today, and I’m reminded about how it feels like summer camp, in mostly good ways, but also in a tricky way.

You take some time off from your “real” life, you go somewhere else, you hang out with old friends and meet some new friends. You do different things than in your real life, some are playful, some take real work. These are all good ways it’s like summer camp.

Here’s the tricky thing to watch out for: like summer camp, you can make connections to people or projects that are intense and feel like they could last forever. You make friends at summer camp, or even have semi-romantic crushes on people. You promise to stay in touch, you think it’s the “real thing.” When you get home, you write an email or two, maybe a phone call, but it fades away. The excitement of the summer is overtaken by your autumnal real life again.

PyCon can be the same way, either with people or projects. Not a romance, but the exciting feeling that you want to keep doing the project you started at PyCon, or be a member of some community you hung out with for those days. You want to keep talking about that exciting thing with that person. These are great feelings, but it’s easy to emotionally over-commit to those efforts and then have it fade away once PyCon is over.

How do you know which projects are just crushes, and which are permanent relationships? Maybe it doesn’t matter, and we should just get excited about things.

I know I started at least one effort last year that I thought would be done in a few months, but has since stalled. Now I am headed back to PyCon. Will I become attached to yet more things this time? Is that bad? Should I temper my enthusiasm, or is it fine to light a few fires and accept that some will peter out?

Filtering GitHub actions by changed files

Sunday 4 May 2025

Coverage.py has a large test suite that runs in many environments, which can take a while. But some changes don’t require running the test suite at all. I’ve changed the actions to detect when they need to run based on what files have changed, but there were some twists and turns along the way.

The dorny/paths-filter action can check which files have changed for pull requests or branches. I added it to my tests action like this:

jobs:

  changed:
    name: "Check what files changed"
    outputs:
      python: ${{ steps.filter.outputs.python }}
    steps:
      - name: "Check out the repo"
        uses: actions/checkout

      - name: "Examine changed files"
        uses: dorny/paths-filter
        id: filter
        with:
          filters: |
            python:
              - "**.py"

  tests:
    # Don't run tests if the branch name includes "-notests".
    # Only run tests if Python files changed.
    needs: changed
    if: ${{ !contains(github.ref, '-notests') && needs.changed.outputs.python == 'true' }}

The “changed” job checks what files have changed, then the “tests” job examines its output to decide whether to run at all.

It’s a little awkward having an output for the “changed” job as an intermediary, but this did what I wanted: if any .py file changed, run the tests, otherwise don’t run them. I left in an old condition: if the branch name includes “-notests”, then don’t run the tests.

This worked, but I realized I needed to run the tests on other conditions also. What if no Python file changed, but the GitHub action file itself had changed? So I added that as a condition. The if-expression was getting long, so I made it a multi-line string:

jobs:

  changed:
    name: "Check what files changed"
    outputs:
      python: ${{ steps.filter.outputs.python }}
      workflow: ${{ steps.filter.outputs.workflow }}
    steps:
      - name: "Check out the repo"
        uses: actions/checkout

      - name: "Examine changed files"
        uses: dorny/paths-filter
        id: filter
        with:
          filters: |
            python:
              - "**.py"
            workflow:
              - ".github/workflows/testsuite.yml"

  tests:
    # Don't run tests if the branch name includes "-notests".
    # Only run tests if Python files or this workflow changed.
    needs: changed
    if: |
      ${{
        !contains(github.ref, '-notests')
        && (
          needs.changed.outputs.python == 'true'
          || needs.changed.outputs.workflow == 'true'
        )
      }}

This seemed to work, but it has a bug that I will get to in a bit.

Thinking about it more, I realized there are other files that could affect the test results: requirements files, test output files, and the tox.ini. Rather than add them as three more conditions, I combined them all into one:

jobs:

  changed:
    name: "Check what files changed"
    outputs:
      run_tests: ${{ steps.filter.outputs.run_tests }}
    steps:
      - name: "Check out the repo"
        uses: actions/checkout

      - name: "Examine changed files"
        uses: dorny/paths-filter
        id: filter
        with:
          filters: |
            run_tests:
              - "**.py"
              - ".github/workflows/testsuite.yml"
              - "tox.ini"
              - "requirements/*.pip"
              - "tests/gold/**"

  tests:
    # Don't run tests if the branch name includes "-notests".
    # Only run tests if files that affect tests have changed.
    needs: changed
    if: |
      ${{
        needs.changed.outputs.run_tests == 'true'
        && !contains(github.ref, '-notests')
      }}

BTW: these commits also update the quality checks workflow, which has other kinds of mix-and-match conditions you might be interested in.

All seemed good! Then I made a commit that only changed my Makefile, and the tests ran! Why!? The Makefile isn’t one of the checked files. The paths-filter action helpfully includes debug output that showed that only the Makefile was considered changed, and that the “run_tests” output was false.

I took a guess that GitHub actions don’t like expressions with newlines in them. Using the trusty YAML multi-line string cheat sheet, I tried changing from the literal block style (with a pipe) to the folded style (with a greater-than):

if: >
  ${{
    needs.changed.outputs.run_tests == 'true'
    && !contains(github.ref, '-notests')
  }}

The literal form includes all newlines, while the folded style turns newlines into spaces. To check that I had it right, I tried parsing the YAML files: to my surprise, both forms included all the newlines; there was no difference at all. It turns out that YAML “helpfully” notices changes in indentation and includes newlines for indented lines. My expression is nicely indented, so it has newlines no matter which syntax I use.
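You can see this for yourself by parsing both forms and comparing. This is a small sketch assuming PyYAML is installed; because every continuation line is more indented than the first line of the scalar, the folded form keeps the newlines too:

```python
import yaml  # PyYAML, third-party

# Literal (|) and folded (>) block scalars, both with
# indented continuation lines, like the workflow expression.
literal = "if: |\n  ${{\n    A\n    && B\n  }}\n"
folded = "if: >\n  ${{\n    A\n    && B\n  }}\n"

lit_val = yaml.safe_load(literal)["if"]
fold_val = yaml.safe_load(folded)["if"]
# YAML preserves line breaks around more-indented lines even
# when folding, so the two values come out identical.
```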

The GitHub actions docs don’t mention it, but it seems that newlines do break expression evaluation. Sigh. My expressions are not as long now as they had gotten during this exploration, so I changed them all back to one line, and now it all works as I wanted.
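For reference, the working condition is just the same expression folded back onto one line (a sketch of the condensed form):

```yaml
tests:
  needs: changed
  if: ${{ needs.changed.outputs.run_tests == 'true' && !contains(github.ref, '-notests') }}
```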

There are some other things I’d like to tweak: when the tests are skipped, the final status is “success”, but I’m wondering if there’s a way to make it “skipped”. I’m also torn about whether every change to master should run all the workflows or if they should also filter based on the changed files. Currently they are filtered.

Continuous integration and GitHub workflows are great, but they always seem to involve this kind of fiddling in environments that are difficult to debug. Maybe I’ve saved you some grief.

Regex affordances

Friday 18 April 2025

Python regexes have a number of features that bring new power to text manipulation. I’m not talking about fancy matching features like negative look-behinds, but ways you can construct and use regexes. As a demonstration, I’ll show you some real code from a real project.

Coverage.py will expand environment variables in values read from its configuration files. It does this with a function called substitute_variables:

def substitute_variables(
    text: str,
    variables: dict[str, str],
) -> str:
    """
    Substitute ``${VAR}`` variables in `text`.

    Variables in the text can take a number of
    shell-inspired forms::

        $VAR
        ${VAR}
        ${VAR?}         strict: an error if no VAR.
        ${VAR-miss}     defaulted: "miss" if no VAR.
        $$              just a dollar sign.

    `variables` is a dictionary of variable values.

    Returns the resulting text with values substituted.

    """

Call it with a string and a dictionary, and it makes the substitutions:

>>> substitute_variables(
...     text="Look: $FOO ${BAR-default} $$",
...     variables={'FOO': 'Xyzzy'},
... )

'Look: Xyzzy default $'

We use a regex to pick apart the text:

dollar_pattern = r"""(?x)   # Verbose regex syntax
    \$                      # A dollar sign,
    (?:                     # then
        (?P<dollar> \$ ) |      # a dollar sign, or
        (?P<word1> \w+ ) |      # a plain word, or
        \{                      # a {-wrapped
            (?P<word2> \w+ )        # word,
            (?:                         # either
                (?P<strict> \? ) |      # strict or
                -(?P<defval> [^}]* )    # defaulted
            )?                      # maybe
        }
    )
    """

This isn’t a super-fancy regex: it doesn’t use advanced pattern matching. But there are some useful regex features at work here:

  • The (?x) flag at the beginning turns on “verbose” regex syntax. In this mode, all white space is ignored so the regex can be multi-line and we can indent to help see the structure, and comments are allowed at the ends of lines.
  • Named groups like (?P<word1> … ) are used to capture parts of the text that we can retrieve later by name.
  • There are also two groups used to get the precedence of operators right, but we don’t want to capture those values separately, so I use the non-capturing group syntax for them: (?: … ). In this code, we only ever access groups by name, so I could have left them as regular capturing groups, but I think it’s clearer to indicate up-front that we won’t be using them.

The verbose syntax in particular makes it easier to understand the regex. Compare to what it would look like in one line:

r"\$(?:(?P<dollar>\$)|(?P<word1>\w+)|\{(?P<word2>\w+)(?:(?P<strict>\?)|-(?P<defval>[^}]*))?})"
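The two spellings compile to the same matcher. Here’s a quick sanity check of that (the test string is my own):

```python
import re

verbose = r"""(?x)      # Verbose regex syntax
    \$
    (?: (?P<dollar> \$ ) | (?P<word1> \w+ ) |
        \{ (?P<word2> \w+ )
            (?: (?P<strict> \? ) | -(?P<defval> [^}]* ) )?
        }
    )
    """
compact = (
    r"\$(?:(?P<dollar>\$)|(?P<word1>\w+)"
    r"|\{(?P<word2>\w+)(?:(?P<strict>\?)|-(?P<defval>[^}]*))?})"
)

m1 = re.search(verbose, "price is ${X-5}")
m2 = re.search(compact, "price is ${X-5}")
# Same groups either way: word2 is "X", defval is "5".
```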

Once we have the regex, we can use re.sub() to replace the variables with their values:

re.sub(dollar_pattern, dollar_replace, text)

But we’re going to use another power feature of Python regexes: dollar_replace here isn’t a string, it’s a function! Each fragment the regex matches will be passed as a match object to our dollar_replace function. It returns a string which re.sub() uses as the replacement in the text:

def dollar_replace(match: re.Match[str]) -> str:
    """Called for each $replacement."""
    # Get the one group that matched.
    groups = match.group('dollar', 'word1', 'word2')
    word = next(g for g in groups if g)

    if word == "$":
        return "$"
    elif word in variables:
        return variables[word]
    elif match["strict"]:
        msg = f"Variable {word} is undefined: {text!r}"
        raise NameError(msg)
    else:
        return match["defval"]

First we use match.group(). Called with a number of group names, it returns a tuple of what those groups matched: each element is either the matched text, or None if that group didn’t match anything.

The way our regex is written, only one of those three groups will match, so the tuple will have one string and two None’s. To get the matched string, we use next() to find it. If the built-in any() returned the first true thing it found, this code could be simpler, but it doesn’t, so we have to do it this way.
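In miniature, with a standalone toy pattern (the group names here are made up, not from the real code), that looks like:

```python
import re

# Three alternative named groups; only one can match.
m = re.match(r"(?P<num>\d+)|(?P<word>\w+)|(?P<space>\s+)", "hello")
groups = m.group("num", "word", "space")
# groups is (None, 'hello', None): a tuple with one hit.
word = next(g for g in groups if g)
```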

Now we can check the value to decide on the replacement:

  • If the match was a dollar sign, we return a dollar sign.
  • If the word is one of our defined variables, we return the value of the variable.
  • Since the word isn’t a defined variable, we check if the “strict” marker was found, and if so, raise an exception.
  • Otherwise we return the default value provided.

The final piece of the implementation is to use re.sub() and return the result:

return re.sub(dollar_pattern, dollar_replace, text)
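Putting the pieces together, here is the whole thing as one self-contained sketch, condensed from the fragments above (the real code in coverage.py differs in details). Note how dollar_replace is nested inside substitute_variables so it can close over text and variables:

```python
import re

DOLLAR_PATTERN = r"""(?x)
    \$
    (?: (?P<dollar> \$ ) | (?P<word1> \w+ ) |
        \{ (?P<word2> \w+ )
            (?: (?P<strict> \? ) | -(?P<defval> [^}]* ) )?
        }
    )
    """

def substitute_variables(text: str, variables: dict[str, str]) -> str:
    """Substitute ``${VAR}``-style variables in `text`."""
    def dollar_replace(match: re.Match[str]) -> str:
        # Get the one group that matched.
        groups = match.group("dollar", "word1", "word2")
        word = next(g for g in groups if g)
        if word == "$":
            return "$"
        elif word in variables:
            return variables[word]
        elif match["strict"]:
            raise NameError(f"Variable {word} is undefined: {text!r}")
        else:
            return match["defval"]
    return re.sub(DOLLAR_PATTERN, dollar_replace, text)

result = substitute_variables(
    "Look: $FOO ${BAR-default} $$",
    {"FOO": "Xyzzy"},
)
# result == 'Look: Xyzzy default $'
```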

Regexes are often criticized for being too opaque and esoteric. But done right, they can be very powerful and don’t have to be a burden. What we’ve done here is used simple pattern matching paired with useful API features to compactly write a useful transformation.

BTW, if you are interested, the real code is in coverage.py.

Find the bear

Sunday 6 April 2025

A parenting story from almost 30 years ago.

My wife told me about something her dad did when she was young: in the car, knowing they were approaching an exit on the highway, he’d say to himself, but loud enough for his daughters in the back to hear, “If only I could find exit 10...” The girls would look out the window and soon spot the very sign he needed! “There it is, Dad, we found it!” I liked it, it was clever and sweet.

When my son Max was six or so, we were headed into Boston to visit the big FAO Schwarz toy store that used to be on Boylston St. They had a large bronze statue of a teddy bear on the corner in front of the store. It must have been 10 or 12 feet tall. I wanted to try my father-in-law’s technique with it.

Max had always been observant, competent and confident. The kind of kid who could quickly tell you if a piece was missing from a Lego kit. I figured he’d be the perfect target for this.

We got off the T (the subway if you aren’t from Boston), and had to walk a bit. When we were a half block from the store, I could clearly see the bear up ahead. I said, “If only I could find that bear statue...”

Max responded, “Oh Dad, I knew you didn’t know where it was!”

•    •    •

The store closed in 2004 and the bear was removed. I thought it was gone for good. But on a walk a few weeks ago, I happened upon it outside the Tufts Children’s Hospital.

Now I definitely know where it is:

The bronze bear statue

Nedflix

Thursday 3 April 2025

Well, Anthropic and I were not a good fit, though as predicted it was an experience. I’ve started a new job on the Python language team at Netflix. It feels like a much better match in a number of ways.

Human sorting improved

Saturday 29 March 2025

When sorting strings, you’d often like the order to make sense to a person. That means numbers need to be treated numerically even if they are in a larger string.

For example, sorting Python versions with the default sort() would give you:

Python 3.10
Python 3.11
Python 3.9

when you want it to be:

Python 3.9
Python 3.10
Python 3.11

I wrote about this long ago (Human sorting), but have continued to tweak the code and needed to add it to a project recently. Here’s the latest:

import re

def human_key(s: str) -> tuple[list[str | int], str]:
    """Turn a string into a sortable value that works how humans expect.

    "z23A" -> (["z", 23, "a"], "z23A")

    The original string is appended as a last value to ensure the
    key is unique enough so that "x1y" and "x001y" can be distinguished.

    """
    def try_int(s: str) -> str | int:
        """If `s` is a number, return an int, else `s` unchanged."""
        try:
            return int(s)
        except ValueError:
            return s

    return ([try_int(c) for c in re.split(r"(\d+)", s.casefold())], s)

def human_sort(strings: list[str]) -> None:
    """Sort a list of strings how humans expect."""
    strings.sort(key=human_key)

The central idea here is to turn a string like "Python 3.9" into the key ["Python ", 3, ".", 9] so that numeric components will be sorted by their numeric value. The re.split() function gives us interleaved words and numbers, and try_int() turns the numbers into actual numbers, giving us sortable key lists.

There are two improvements from the original:

  • The sort is made case-insensitive by using casefold() to lower-case the string.
  • The key returned is now a two-element tuple: the first element is the list of intermixed strings and integers that gives us the ordering we want. The second element is the original string unchanged to ensure that unique strings will always result in distinct keys. Without it, "x1y" and "x001Y" would both produce the same key. This solves a problem that actually happened when sorting the items of a dictionary.
    # Without the tuple: different strings, same key!!
    human_key("x1y") -> ["x", 1, "y"]
    human_key("x001Y") -> ["x", 1, "y"]

    # With the tuple: different strings, different keys.
    human_key("x1y") -> (["x", 1, "y"], "x1y")
    human_key("x001Y") -> (["x", 1, "y"], "x001Y")

If you are interested, there are many different ways to split the string into the word/number mix. The comments on the old post have many alternatives, and there are certainly more.

This still makes some assumptions about what is wanted, and doesn’t cover all possible options (floats? negative/positive? full file paths?). For those, you probably want the full-featured natsort (natural sort) package.
