
When I wrote Facts and myths about Python names and values, I included figures drawn with Graphviz. I made a light wrapper to help, but the figures were mostly just Graphviz. For example, this code:

g = CogGraphviz()
g.codelabel("nums = [1, 2, 3]")
g.dot("""
    nums [NAMEATTRS]
    list1 [shape=record, label="{<0>|<1>|<2>}"]
    nums -> list1:0
    subgraph nums {
        rankdir=LR;
        1 [INTATTRS]
        2 [INTATTRS]
        3 [INTATTRS]
        }
    list1:0 -> 1
    list1:1 -> 2
    list1:2 -> 3
    """)
g.img(alt="nums refers to a list, which refers to ints")

produced this figure:

A list of three numbers

The beauty of Graphviz is that you describe the topology of your graph, and it lays it out for you. The problem is, if you care about the layout, it is very difficult to control. Why are the 1, 2, 3 where they are? How do I move them? I couldn't figure it out. And I wanted to have new shapes, and control where the lines joined, etc.

So I wrote a library called Cupid to help me make the figures the way I wanted them. Cupid is an API that generates SVG figures. Half of it is generic "place this rectangle here" kinds of SVG operations, and half is very specific to the names-and-values diagrams I wanted to create.

Now to make that same figure, I use this code:

fig = start_figure(title="nums refers to a list, which refers to ints")
fig.top_code("nums = [1, 2, 3]")
n_nums = fig.auto_name("nums")
l_nums = fig.list(length=3, pos=fig.val_for_name(n_nums))
fig.reference(n_nums, l_nums[0])
y_ints = fig.y+10
for i in range(3):
    i_num = fig.int(top=(l_nums[1].cx+(i-1)*35, y_ints), text=str(i+1))
    fig.connect(
        l_nums[i].center, 90, i_num.north, 90, 
        start_marker=fig.DOT, class_="arrow"
    )
finish_figure(fig)

nums refers to a list, which refers to ints

The new code is more complicated than the old code, but I can predict what it will do, and if I want it to do something new, I can extend Cupid.

Cupid isn't the handiest thing, but I can make it do what I want. This is my way with this site: I want it the way I want it, I enjoy writing the tools that let me make it that way, and I don't mind writing code to produce figures when other people would use a mouse.

I chatted this morning with Alex Gaynor about speeding up coverage.py on PyPy. He's already made some huge improvements, by changing PyPy so that settrace doesn't defeat the JIT.

We got to talking about speeding up how coverage stores data during execution. What coverage stores depends on how you use it. For statement coverage, it needs to store a set of numbers for each file. The numbers are the line numbers that have been executed. Currently coverage stores these in a dictionary, using the numbers as keys, and None as a value.

If you use branch coverage, then coverage needs to store pairs of numbers, the line number jumped from and the line number jumped to, for each line transition. It currently stores these in a dictionary with a tuple of two numbers as keys, and None as a value. (Yes, it should really be an actual set.)
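To make the current scheme concrete, here is a rough sketch of the shape of the data (this isn't coverage.py's actual code, and the names here are made up):

# Statement coverage: executed line numbers as dict keys, None as the value.
line_data = {}                 # {lineno: None}, one dict per file
line_data[17] = None

# Branch coverage: (from_line, to_line) pairs as dict keys, None as the value.
arc_data = {}                  # {(from_lineno, to_lineno): None}, one dict per file
arc_data[(17, 23)] = None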

For statement coverage, a faster data structure would be a bit-vector: a large chunk of contiguous memory where each bit represents a number. To store a number, set the bit corresponding to the number. This would definitely be faster than the dictionary.
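Here's a minimal Python sketch of the bit-vector idea, just to show the shape of it; the real implementation would live in the C tracer:

class LineBits(object):
    """A growable bit-vector: bit n is set if line n has been executed."""

    def __init__(self):
        self.bits = bytearray()

    def add(self, lineno):
        byte, bit = divmod(lineno, 8)
        if byte >= len(self.bits):
            self.bits.extend(b"\0" * (byte - len(self.bits) + 1))
        self.bits[byte] |= 1 << bit

    def __contains__(self, lineno):
        byte, bit = divmod(lineno, 8)
        return byte < len(self.bits) and bool(self.bits[byte] & (1 << bit))

lines = LineBits()
lines.add(17)
assert 17 in lines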

But for branch pairs, what options do we have? A general pair-of-integers data structure is probably too complex. But I had a hunch that our data had idiosyncrasies we could take advantage of. In particular, for most of our pairs of numbers, the two numbers will be close together, since many branches are just a few statements forward in the code.

To test this hunch, I made a quick experiment. I reached for the nearest large project (Open edX), and changed its .coveragerc files to use the Python tracer. Then I hacked a counter into the Python tracer which would tally up, for each line transition, the delta between the source line and the destination line.
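The hack itself isn't shown here, but the gist was something like this sketch (the hook name is made up; the real change was inside coverage's Python tracer):

import collections

delta_counts = collections.Counter()

def tally_transition(from_lineno, to_lineno):
    # Called on every line transition the tracer records.
    delta_counts[to_lineno - from_lineno] += 1

# After the test run, dump the histogram:
# for delta in sorted(delta_counts):
#     print("{:5d}: {}".format(delta, delta_counts[delta]))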

After running the tests, I had data like this:

-159: 6
   ...
  -6: 10439
  -5: 5548
  -4: 7856
  -3: 22892
  -2: 13976
  -1: 363659
   0: 141834
   1: 2160602
   2: 471227
   3: 474249
   4: 159185
   5: 67638
   6: 93216
   7: 54037
   ...
  21: 565
  22: 5197
  23: 1834
  24: 975
  25: 117251
  26: 5316
  27: 9907
  28: 3030
  29: 3397
  30: 1555
   ...
  37: 20413
  38: 40443
  39: 2732
  40: 23349
   ...
 348: 222
 349: 810
 350: 86
 351: 80478
 352: 9
   ...
 367: 2
 368: 318
 369: 179
 370: 14903
 371: 1530
 372: 713
 373: 14
   ...
1553: 15
1558: 81
1561: 1014
   ...
2396: 1
2412: 35
2431: 34
2512: 34
2555: 37
2575: 34
2624: 37
2656: 37
2678: 37

The data showed what I suspected: there's a huge bulge around zero, since most branches are to points nearby in the code. There are also spikes elsewhere, like the 80k branches that went 351 lines forward. And I don't understand why the seven largest jumps all occurred exactly 34 or 37 times.

In any case, this pointed to a possible storage strategy. My first idea was a handful of bit-vectors, say 10. To store the pair (a, b), you use the difference b-a (which is the value our data above shows) to choose a particular bit-vector, and store a in that vector. If b-a is more than 10, then you fall back to a dictionary. Since most values of b-a are in that small range, we'll mostly use the bit-vector, and save time.

A variant of this idea: instead of a handful of bit-vectors, use a single vector of ints. To store (a, b), you set bit number b-a in the a'th int. If b-a is too large to fit in the int, then you fall back to a dictionary. To properly capture the biggest bulge around zero, you'd pick an offset like -2, and instead of b-a, you'd use (b-a) minus the offset as the bit index, so that a delta of -2 would be stored in bit 0.
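A sketch of that variant, with made-up constants, might look like this (the real thing would be C in the tracer, but the logic is the same):

INT_BITS = 64          # width of each int in the vector
OFFSET = -7            # the smallest delta that fits, stored in bit 0
MAX_LINE = 10000       # length of the vector; bigger starting lines fall back

int_vector = [0] * MAX_LINE
fallback = {}          # {(a, b): None} for pairs the vector can't hold

def store_arc(a, b):
    """Record a branch from line a to line b."""
    bit = (b - a) - OFFSET          # a delta of OFFSET lands in bit 0
    if a < MAX_LINE and 0 <= bit < INT_BITS:
        int_vector[a] |= 1 << bit
    else:
        fallback[(a, b)] = None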

A quick program tried out various values for the width of the int and for the offset from zero. The results:

Int size 8, offset 0: 68.05%
Int size 32, offset 0: 73.28%
Int size 64, offset 0: 77.73%
Int size 8, offset -1: 73.87%
Int size 32, offset -6: 80.74%
Int size 64, offset -7: 85.50%

This says that if we use a vector of bytes with an offset of -1, then 73% of the data would use the vector and the rest would go to the dictionary. If we use 64-bit ints with an offset of -7, then 85% of the transitions would be fast.
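The quick program isn't shown, but its core is just a tally over the delta histogram: for a candidate int width and offset, count what fraction of the transitions would fit. Using the delta_counts Counter from the earlier sketch, it might look like this:

def vector_hit_rate(delta_counts, int_bits, offset):
    """Percentage of transitions whose delta fits in bits [0, int_bits)."""
    total = sum(delta_counts.values())
    storable = sum(
        count for delta, count in delta_counts.items()
        if 0 <= (delta - offset) < int_bits
    )
    return 100.0 * storable / total

# for int_bits in (8, 32, 64):
#     for offset in range(-10, 1):
#         print(int_bits, offset, vector_hit_rate(delta_counts, int_bits, offset))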

One factor not accounted for here: you'd also have to limit the length of the int vector, so starting line numbers larger than the length of the vector would have to go into the dictionary as well, but that should be a small factor.

I'm not taking on any of this work now, but it's fascinating to work through the details. It might make an interesting pull request...


Seth Robertson has written a great tutorial on how to fix mistakes in git: On undoing, fixing, or removing commits in git. I love it for two reasons.

First, it really does help you find your way through a thicket of options in the confusing world of git. I love git, but it is a power tool that has hurt many, and good instructions for the baffled are hard to come by.

More interestingly, Seth has come up with an innovative way to provide the instructions. When writing detailed technical help, things are never straightforward, literally. There are always points in the flow where you have to say, "Now, you may have done A or you may have done B, and what happens next depends on your choice." Rather than try to keep all that in a linear English flow as most of us do, Seth has provided an actual branching structure to the text.

When you try to keep it linear, you end up with something that reads like the instructions for an IRS tax form, because they have a similarly complex documentation challenge. By actually branching, Seth has made it possible to focus on the situation you are actually in.

And although the page looks as bare-bones as they come (it could have been designed in 1994), the choices you make are recorded and appear as breadcrumbs right where you are in the text.

I can see extending this technique. Imagine a set of instructions where the user has to choose the name of a directory: they enter the name into a text box, and then the rest of the instructions have the actual name filled in, so the commands they have to type don't need metasyntactic placeholders like <YOURDIRECTORY>.

Help for confused people should be more helpful, and keeping track of the twisty paths like this is a really nice way to do it. I'd love to see other examples. Explaining things is hard, and new ways to do it are fascinating.

If you really want to write high-quality polished code, you have to attend to many details. A small one that is often overlooked: comments should be complete sentences. (Throughout, "comment" includes docstrings and any other English you write about your code.) Start comments with a capital, have a subject and a verb, and end them with a period.

There are a few reasons for this. First, your program should be readable. Just as you choose variable names to have meaning so that people can read your code, you should write your comments so that people can read them. Capitals are a clear indication of the start of a sentence, and a period is a clear indication of the end:

# BAD:
# try to read the thing

# BETTER:
# Try to read the thing.

Is that first comment complete? Was there meant to be a qualifying clause? The thing that what? Sure, the second sentence doesn't explain the thing either, but at least we can see that the author didn't just wander away in the middle of a thought.

By writing complete sentences you are more likely to include some small helper that will get your meaning across, and people will be better able to grasp what your words mean. Isn't that why you wrote them?

The second reason to write complete sentences is to focus you on what you are writing. You think about each statement in your program; why aren't you thinking about the comments? Maybe if you capitalize and punctuate your text, you'll realize that a few grunted words aren't really what you want to put there:

# EVEN BETTER:
# Try to read the configuration file, but it's OK if
# it's missing because we don't require a config file.

Some would argue that this comment is too many words, that you shouldn't explain in that much detail, that the code should be more self-explanatory. That's great: now you're thinking about what the comment is doing, and making it better. I'm not suggesting writing longer comments just for bulk; if you can say it in fewer words, do, but make them good words in a complete sentence.

Paying more attention to the comments will help you write better code. I can't tell you how many times I've written what I thought was a perfectly good function or line of code, then gone to write the comment or docstring, and realized a better way to do it, or even just a better name. Explaining yourself to others is a really good way to understand what you are doing. Understanding what you are doing is a really good way to write good code.

Another reason to make your comments complete sentences: it makes it easier to extend them later:

# ORIGINALLY:
# try to read the thing

# THEN SOMEONE ADDS:
# try to read the thing. Don't worry if it
# doesn't exist

The person who added the second sentence had to add a period to the first, and now it looks really strange that the second sentence is unterminated. If the first sentence had been a real sentence, the second could have been added naturally.

Many coders will look at this advice and complain that it is way too nit-picking, that punctuation in comments is irrelevant, that since it's natural language, it's readable as it is, we don't have to worry about trivialities like punctuation they are wrong text needs punctuation to be readable leaving it out just makes it hard to parse the sentences see what i did there?

One last reason for full sentences: the programming variant of the broken windows theory says that if you take care of small things, others are more likely to take care of the bigger things. Polished code is more likely to be maintained well and will set the tone for more polished code in the future.

And isn't that what we all want? Write complete sentences.


While writing a test suite, I wrote a helper function to renumber the ids in SVG figures. It had enough interesting bits that I'll share it here, and maybe I'll get suggestions on better ways to do it.

BTW: The test is for an SVG-drawing library that I hacked together to replace the graphviz diagrams I made for Facts and myths about Python names and values. Graphviz is frustrating if you know what you want things to look like, so when I turned that article into a presentation, I re-did the diagrams in SVG. Now I want to make that library more formal, so I need tests!

The first tests are just the figures from the presentation, packaged as unit tests. But the figures have ids in them which are auto-assigned, and if the tests run in a different order than the original figures, the ids will be different. So I wrote a helper function that finds the ids and renumbers them, to canonicalize the SVG.

I chose to use regexes, since formally parsing the SVG to find the ids would involve not just XML parsing but CSS parsing, and this domain is specialized enough and tightly controlled enough that I'm confident a regex will do a good job.

First the code, then we'll go over it line by line:

 1  def renumber_svg_ids(svg):
 2      """Renumber the ids in `svg`.
 3  
 4      Ids are either "id='id10'" or "#id10".  Same ids get
 5      the same renumbered id, to keep the meaning the same.
 6  
 7      Return the same svg string, but with new ids.
 8  
 9      """
10      id_map = {}
11      new_ids = ("newid{}".format(i) for i in itertools.count())
12  
13      def new_repl(match, new_id_fmt):
14          r"""re.sub function for renumbering.
15  
16          The `match` object has an id in \1.  Re-number it with a new id,
17          then return `new_id_fmt` with the new id in place of "{}".
18  
19          """
20          found_id = match.group(1)
21          if found_id not in id_map:
22              id_map[found_id] = next(new_ids)
23          return new_id_fmt.format(id_map[found_id])
24  
25      # Replace ids that look like: id="id123"
26      svg = re.sub(
27          r"""\bid=['"](id\d+)['"]""",
28          functools.partial(new_repl, new_id_fmt="id='{}'"),
29          svg
30      )
31      # Replace ids that look like: #id123
32      svg = re.sub(
33          r"""#(id\d+)\b""",
34          functools.partial(new_repl, new_id_fmt="#{}"),
35          svg
36      )
37  
38      return svg

At heart, this function is conceptually simple: take a string, and return the string with the ids replaced. Since the same id can appear multiple times in the string, we need to be careful to replace the same id with the same replacement. To keep track of what was replaced with what, line 10 creates id_map, a dictionary mapping old ids to new ids.

At line 11, we have a generator expression that will make new ids for us. itertools.count is an infinite sequence of integers; we format those into the form "newid123", and the generator expression gives us an infinite stream of those ids.

The heart of renumber_svg_ids is a function for use with re.sub. The simple and common way to use re.sub is to give it a regex pattern and a string replacement. But instead of a string replacement, you can use a function. Every match is passed to the function as a match object, and the string returned by the function is used as the replacement.

Our function new_repl on line 13 takes a match object and a format string for the replacement. Line 20 gets the actual id out of the match object: match.group(1) returns the string matched by the first parenthesized group in the regex pattern, so found_id will be something like "id123".

On line 21, if the id isn't in our map, then we haven't seen this id yet, so we make a new id by pulling the next value from our generator expression. Generators are usually consumed in a for-loop of some kind, but you can use the next() builtin to just grab the next value from one.

Finally we use the new_id_fmt format string, giving it the new id, and return the result on line 23.

It's unusual to see nested functions in Python, but they work fine. One issue is variable scope: notice that we use id_map inside the new_repl function, but id_map is defined in the outer function. This works so long as we don't reassign the id_map name. That wouldn't work right in Python 2; you'd need the Python 3 nonlocal keyword. Luckily we don't need to reassign the name, we just use methods on it. Those methods modify the value, but that's still not an assignment to the name, so we are OK.
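As a tiny illustration of the scoping rule, separate from the SVG code:

def outer():
    id_map = {}

    def inner():
        id_map["a"] = 1       # fine: mutating the dict, not reassigning the name
        # id_map = {}         # this assignment would make id_map local to inner,
        #                     # and you'd need nonlocal (Python 3) to allow it.

    inner()
    return id_map

print(outer())                # prints: {'a': 1}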

Now that we have our re.sub replacement function, we're going to use it twice, once to replace "id='id123'" instances, and once for "#id123". You may have noticed an odd thing about our new_repl function: it takes two arguments. But the function re.sub will call only takes one: the match object. We need two arguments so that the replacement format can be different for the two times we're going to use it.

To turn our two-argument function into two different one-argument functions, we use functools.partial. You give it a function, and some arguments, and it returns a new function that will call your function with those arguments pre-supplied. In our case, line 28 uses functools.partial to make a new function that is our new_repl with the given string as new_id_fmt. The result is a function of only the one remaining argument, the match object, which is just what re.sub wants.
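If functools.partial is new to you, here's a throwaway example, unrelated to the SVG code:

import functools

def greet(greeting, name):
    return "{}, {}!".format(greeting, name)

hello = functools.partial(greet, "Hello")   # "Hello" is pre-supplied
print(hello("world"))                       # prints: Hello, world!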

Lines 26 and 32 are our two calls to re.sub; they each make replacements in the svg string, and the final result is returned at the end.
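As a quick check of the whole thing (the sample SVG here is made up, and the module also imports itertools, functools, and re, which aren't shown in the listing):

svg_in = """<style>#id12 {fill:red}</style><rect id='id12'/><circle id='id7'/>"""
print(renumber_svg_ids(svg_in))
# <style>#newid0 {fill:red}</style><rect id='newid0'/><circle id='newid1'/>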

A few minor things to note: on line 14, the docstring for new_repl is a raw string, because I have a backslash in it that I want to remain literal, although the "\1" is an obscure way to refer to the first group, and in any case the docstring of an inner function is unlikely to ever appear anywhere else, so who's reading it? On line 27, I used a triple-quoted string even for a single-line string, because it let me avoid escaping the two kinds of quotes I have in the regex.

Of course, there's still room for new ways to do things. Line 21, the check if the value is already in the dictionary, raises an eyebrow: Python has better ways to do that sort of thing. The defaultdict class can automatically create values for missing keys.

So we can re-write the top of our function like this (with docstrings removed for brevity):

def renumber_svg_ids(svg):
    new_ids = ("newid{}".format(i) for i in itertools.count())
    id_map = collections.defaultdict(lambda: next(new_ids))

    def new_repl(match, new_id_fmt):
        found_id = match.group(1)
        return new_id_fmt.format(id_map[found_id])

The new_ids generator is exactly the same. But now we use it in a defaultdict. When a key is missing, defaultdict will invoke the lambda function, which will use next() to get the next id. Now the body of new_repl has no conditional in it at all; it simply looks up the found id in the map. If it's not there already, the defaultdict will make a new one, and if it is there, it will simply return the saved value. For bonus points, you could replace our new lambda function with another call to functools.partial.
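That bonus-points version is a one-liner, something like:

id_map = collections.defaultdict(functools.partial(next, new_ids))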

In the back of my mind, I'm wondering if there isn't a better way to accomplish this entirely. Maybe find all the ids in one pass, and then replace them all in another?
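For what it's worth, here's a sketch of what that two-pass version might look like (it numbers ids in overall document order, which isn't exactly the same order as the helper above, but it's just as canonical):

def renumber_svg_ids_two_pass(svg):
    id_pattern = r"""\bid=['"](id\d+)['"]|#(id\d+)\b"""

    # First pass: find every id, in document order, and assign new names.
    id_map = {}
    for quoted, hashed in re.findall(id_pattern, svg):
        old_id = quoted or hashed
        if old_id not in id_map:
            id_map[old_id] = "newid{}".format(len(id_map))

    # Second pass: replace every occurrence, in whichever form it appears.
    def repl(match):
        if match.group(1):
            return "id='{}'".format(id_map[match.group(1)])
        return "#{}".format(id_map[match.group(2)])

    return re.sub(id_pattern, repl, svg)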


New

Wednesday 1 January 2014

Welcome to 2014! Who knows what it will bring? In 2013 I joined edX full-time (including open-sourcing the whole thing) and had a little surgery. On the side, I was greatly drawn to Python community work, notably Boston Python. I didn't write as much here (or elsewhere!) as I thought I would. Maybe that's OK, though it's definitely something I would like to improve on.

At home, I watched and helped as my family continued to develop their own paths in various ways: Nat in a group home; Max at NYU; Ben in high school and the arts; and Susan as a writer and advocate.

For the record, here's us:

Us

And here's us, more realistically:

Us

I couldn't have predicted 2013, so I won't try to predict 2014. As always, here's to living mindfully, and having it come out the way you want!
