Python nugget: renumbering ids

Sunday 5 January 2014This is almost 11 years old. Be careful.

While writing a test suite, I wrote a helper function to renumber the ids in SVG figures. It had enough interesting bits that I’ll share it here, and maybe I’ll get suggestions on better ways to do it.

BTW: The test is for an SVG-drawing library that I hacked together to replace the graphviz diagrams I made for Facts and myths about Python names and values. Graphviz is frustrating if you know what you want things to look like, so when I turned that article into a presentation, I re-did the diagrams in SVG. Now I want to make that library more formal, so I need tests!

The first tests are just the figures from the presentation, packaged as unit tests. But the figures have ids in them which are auto-assigned, and if the tests run in a different order than the original figures, the ids will be different. So I wrote a helper function that finds the ids and renumbers them, to canonicalize the SVG.

I chose to use regexes, since formally parsing the SVG to find the ids would involve not just XML parsing but CSS parsing, and this domain is specialized enough and tightly-controlled enough that I’m condfident that a regex will do a good job.

First the code, then we’ll go over it line by line:

1def renumber_svg_ids(svg):
2    """Renumber the ids in `svg`.
3
4    Ids are either "id='id10'" or "#id10".  Same ids get
5    the same renumbered id, to keep the meaning the same.
6
7    Return the same svg string, but with new ids.
8
9    """
10    id_map = {}
11    new_ids = ("newid{}".format(i) for i in itertools.count())
12
13    def new_repl(match, new_id_fmt):
14        r"""re.sub function for renumbering.
15
16        The `match` object has an id in \1.  Re-number it with a new id,
17        then return `new_id_fmt` with the new id in place of "{}".
18
19        """
20        found_id = match.group(1)
21        if found_id not in id_map:
22            id_map[found_id] = next(new_ids)
23        return new_id_fmt.format(id_map[found_id])
24
25    # Replace ids that look like: id="id123"
26    svg = re.sub(
27        r"""\bid=['"](id\d+)['"]""",
28        functools.partial(new_repl, new_id_fmt="id='{}'"),
29        svg
30    )
31    # Replace ids that look like: #id123
32    svg = re.sub(
33        r"""#(id\d+)\b""",
34        functools.partial(new_repl, new_id_fmt="#{}"),
35        svg
36    )
37
38    return svg

At heart, this function is conceptually simple: take a string, and return the string with the ids replaced. Since the same id can appear multiple times in the string, we need to be careful to replace the same id with the same replacement. To keep track of what was replaced with what, at line 10, id_map is a dictionary mapping old ids to new ids.

At line 11, we have a generator expression that will make new ids for us. itertools.count is an infinite sequence of integers; we format those into the form “newid123”, and the generator expression gives us an infinite stream of those ids.

The heart of renumber_svg_ids is a function for use with re.sub. The simple and common way to use re.sub is to give it a regex pattern and a string replacement. But instead of a string replacement, you can use a function. Every match is passed to the function as a match object, and the string returned by the function is used as the replacement.

Our function new_repl on line 13 takes a match object and a format string for the replacement. Line 20 gets the actual id out of the match object: match.group(1) returns the string matched by the first parenthesized group in the regex pattern, so found_id will be something like “id123”.

On line 21, if the id isn’t in our map, then we haven’t seen this id yet, so we make a new id by pulling the next value from our generator expression. Generators are usually consumed in a for-loop of some kind, but you can use the next() builtin to just grab the next value from one.

Finally we use the new_id_fmt format string, giving it the new id, and return the result on line 23.

It’s unusual to see nested functions in Python, but they work fine. One issue is variable scope: notice that we use id_map inside the new_repl function, but id_map is defined in the outer function. This works so long as we don’t reassign the id_map name. That won’t work right in Python 2, you’d need the Python 3 nonlocal keyword. Luckily we don’t need to reassign the name, we just use methods on it. Those methods modify the value, but that’s still not an assignment to the name, so we are OK.

Now that we have our re.sub replacement function, we’re going to use it twice, once to replace “id=’id123’” instances, and once for “#id123”. You may have noticed an odd thing about our new_repl function: it takes two arguments. But the function re.sub will call only takes one: the match object. We need two arguments so that the replacement format could be different for the two times we’re going to use it.

To turn our two-argument function into two different one-argument functions, we use functools.partial. You give it a function, and some arguments, and it returns a new function that will call your function with those arguments pre-supplied. In our case, line 28 uses functools.partial to make a new function that is our new_repl with the given string as new_id_fmt. The result is a function of only the one remaining argument, the match object, which is just what re.sub wants.

Lines 26 and 32 are our two calls to re.sub, they each make replacements in the svg string, and the final result is returned at the end.

A few minor things to note: on line 14, the docstring for new_repl is a raw string, because I have a backslash in it that I want to remain literal, although the “\1” is an obscure way to refer to the first group, and in any case the docstring of an inner function is unlikely to ever appear anywhere else, so who’s reading it? On line 27, I used a triple-quoted string even for a single-line string, because it let me avoid escaping the two kinds of quotes I have in the regex.

Of course, there’s still room for new ways to do things. Line 21, the check if the value is already in the dictionary, raises an eyebrow: Python has better ways to do that sort of thing. The defaultdict class can automatically create values for missing keys.

So we can re-write the top of our function like this (with docstrings removed for brevity):

def renumber_svg_ids(svg):
    new_ids = ("newid{}".format(i) for i in itertools.count())
    id_map = collections.defaultdict(lambda: next(new_ids))

    def new_repl(match, new_id_fmt):
        found_id = match.group(1)
        return new_id_fmt.format(id_map[found_id])

The new_ids generator is exactly the same. But now we use it in a defaultdict. When a key is missing, defaultdict will invoke the lambda function, which will use next() to get the next id. Now the body of new_repl has no conditional in it at all, it simply looks up the found id in the map. If it’s not there already, the defaultdict will make a new one, and if it is there, it will simply return the saved value. For bonus points, you could replace our new lambda function with another call to functools.partial.

In the back of my mind, I’m wondering if there isn’t a better way to accomplish this entirely. Maybe find all the ids in one pass, and then replace them all in another?

Comments

[gravatar]
I don't think splitting the match and replace phases will offer much benefit in terms of having the code state what it means.

You could get rid of the functools.partial invocations in line 28 and 34 if you change the patterns to add captures for the prefix and postfix, so instead of r"""\bid=['"](id\d+)['"]""" you'd use r"""\b(id=['"])(id\d+)(['"])""" and instead of r"""#(id\d+)\b""" you'd use r"""(#)(id\d+)()\b""" (note the empty capture).

You could then loop on the regexp patterns to avoid duplicating the re.sub call in your code. The new_repl function then returns match.group(1) + id_map[match.group(2)] + match.group(3).

This also has the benefit of leaving the original quote characters intact. Also note that generators have their own next method, which you could use directly instead of the next()-builtin, thus your id_map becomes: id_map = collections.defaultdict(new_ids.next)

No functools.partial or lambda necessary. The code, sans docstrings:
def renumber_svg_ids(svg):
    new_ids = ("newid{}".format(i) for i in itertools.count())
    id_map = collections.defaultdict(new_ids.next)

    def new_repl(match):
        return match.group(1) + id_map[match.group(2)] + match.group(3)

    for pattern in (r"""\b(id=['"])(id\d+)(['"])""", r"""(#)(id\d+)()\b"""):
        svg = re.sub(pattern, new_repl, svg)

    return svg
As an aside, nested functions aren't that unusual - in fact they are preferred over lambda expressions (because their name, if chosen well, can make their meaning more obvious) and you will see quite a few of them in the standard libs. In comparison, functools usage is positively exotic.
[gravatar]
Since you're importing itertools, might as well use imap instead of a gen expr:

new_ids = itertools.imap("newid{}".format, itertools.count())

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.