Circle of Mad Libs

Monday 11 March 2019

Sometimes you find an unexpected real-world connection even in the geekiest of places. I (nedbat) was hanging out in the #python IRC channel on Freenode, and I recommended to someone that they write a Mad Libs game for a project.

Calvin Spealman (aka ironfroggy) chimed in:

[ironfroggy] didn't you write a madlibs python blog post like... forever ago?
    [nedbat] yes :)  14 years ago I think.
    [nedbat] my son was 13, and he just turned 27...
[ironfroggy] nedbat: fun fact: i read that when my wife was pregnant.
[ironfroggy] my son turns 13 in a few weeks.
[ironfroggy] we make games together now
    [nedbat] :) i like the symmetry

My post from 14 years ago is Programming madlibs, written based on a project I did with my then 13-year-old. To think that Calvin read it on the brink of becoming a father, and now has a son the same age that mine was then, is mind-bending.

It’s kind of like a circle of life or something, but I guess it’s just a circle of Mad Libs, which is still good.

Mutmut

Saturday 2 March 2019

Mutation testing is an old idea that I haven’t yet seen work out, but it’s fascinating. The idea is that your test suite should catch any bugs in your code, so what if we artificially insert bugs into the code, and see if the test suite catches them?

Mutation testers modify (mutate) your project code in small ways, then run your test suite. If the tests all pass, then that mutation is considered a problem: a bug that your tests didn’t catch. The theory is that a mutation will change the behavior of your program, so if your test suite is testing closely enough, some test should fail for each mutation. If a mutation doesn’t produce a test failure, then you need to add to your tests.
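To make the idea concrete, here is a minimal sketch of mutation testing in miniature. It is not how mutmut works internally: the `is_adult` function, the `FlipGtE` transformer, and the two tiny "suites" are all invented for illustration. One mutation (`>=` becomes `>`) is applied, and each suite is run against both the original and the mutant.

```python
import ast

# Hypothetical code under test.
SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGtE(ast.NodeTransformer):
    """Replace every `>=` comparison with `>` (one possible mutation)."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the namespace it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace


def weak_suite(ns):
    """Never checks the boundary value, so it can't tell >= from >."""
    return ns["is_adult"](30) and not ns["is_adult"](5)


def strong_suite(ns):
    """Adds the boundary case that distinguishes the mutant."""
    return weak_suite(ns) and ns["is_adult"](18)


original = load(ast.parse(SOURCE))
mutant = load(ast.fix_missing_locations(FlipGtE().visit(ast.parse(SOURCE))))

for suite in (weak_suite, strong_suite):
    # A mutant "survives" when the suite still passes on the mutated code.
    survived = suite(original) and suite(mutant)
    print(suite.__name__, "-> mutant survived" if survived else "-> mutant killed")
```

The weak suite lets the mutant survive, which is the signal a mutation tester reports; adding the boundary case kills it.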

There are a few problems with this plan. The first is that it is time-consuming. Most people feel like it takes too long to run their entire test suite just once. Mutation testers run the whole suite once for each mutation, and there can be thousands of mutations.

But my larger concern is false positives: not all mutations are bugs, and if the mutation tester reports too many non-bugs as bugs, then its usefulness is diminished or even negated. I wanted to examine this idea more closely.

There are a few mutation testers out there for Python. I thought I would give them a try, starting with mutmut. [Mutmut’s author Anders Hovmöller helped by commenting on a draft of this post. I’ve included some of his commentary.]

I needed a test suite to use, so I created a slightly artificial project. The templite module in coverage.py is almost standalone, and is well-tested. And it’s small enough that its test suite runs in less than a second. I extracted templite, wrote some project scaffolding, and gave it its own repository.

Now I had a project that tested well:

$ coverage run -m pytest
============================= test session starts ==============================
platform darwin -- Python 3.7.1, pytest-4.3.0, py-1.8.0, pluggy-0.9.0
rootdir: /Users/ned/lab/templite, inifile:
collected 26 items

test_templite.py ..........................                              [100%]

========================== 26 passed in 0.09 seconds ===========================

$ coverage report -m
Name              Stmts   Miss Branch BrPart  Cover   Missing
-------------------------------------------------------------
src/templite.py     144      1     60      1    99%   137, 136->137

(The one line missing coverage is a conditional for Python 2 vs Python 3.)

Running mutmut was easy:

$ pip install mutmut
Collecting mutmut
...
Installing collected packages: mutmut
Successfully installed mutmut-1.3.1

$ mutmut run

- Mutation testing starting -

These are the steps:
1. A full test suite run will be made to make sure we
   can run the tests successfully and we know how long
   it takes (to detect infinite loops for example)
2. Mutants will be generated and checked

Mutants are written to the cache in the .mutmut-cache
directory. Print found mutants with `mutmut results`.

Legend for output:
🎉 Killed mutants. The goal is for everything to end up in this bucket.
⏰ Timeout. Test suite took 10 times as long as the baseline so were killed.
🤔 Suspicious. Tests took a long time, but not long enough to be fatal.
🙁 Survived. This means your tests needs to be expanded.

mutmut cache is out of date, clearing it...
1. Running tests without mutations
⠇ Running... Done

2. Checking mutants
⠧ 154/154  🎉 146  ⏰ 0  🤔 0  🙁 8

This ran 154 different mutations, which took about a minute for my half-second-ish test suite. 146 of them resulted in test suite failures, as they should. But 8 passed the test suite, so they have to be examined as potential test gaps.

One nice touch: if you interrupt mutmut, when you run it again, it picks up where it left off, which is great for a long-running process like this.

I’m not sure how mutmut decides where to find the code to mutate. In this case it found it implicitly. For other projects I tried, I had to add some configuration to setup.cfg, even though I thought the projects were laid out similarly.

[Anders says it looks for “src”, “lib”, or a directory with the same name as the current directory. My other project has a quirk: edx-lint/edx_lint has the code, so the punctuation difference threw it off.]
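The lookup rule Anders describes can be sketched roughly like this. The `guess_paths_to_mutate` function is hypothetical and only mirrors the rule as stated above; it is not mutmut's actual implementation. The temporary directories stand in for the two project layouts.

```python
import os
import tempfile


def guess_paths_to_mutate(project_dir):
    """Return the first candidate directory that exists, or None.

    Hypothetical helper: it follows the rule as described in the text,
    not mutmut's real code.
    """
    project_name = os.path.basename(os.path.abspath(project_dir))
    for candidate in ("src", "lib", project_name):
        if os.path.isdir(os.path.join(project_dir, candidate)):
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Layout like templite: the code lives in src/, so it is found.
    templite_dir = os.path.join(tmp, "templite")
    os.makedirs(os.path.join(templite_dir, "src"))
    found_templite = guess_paths_to_mutate(templite_dir)

    # Layout like edx-lint: the package directory is edx_lint, which
    # differs from the project directory name by one character.
    edx_dir = os.path.join(tmp, "edx-lint")
    os.makedirs(os.path.join(edx_dir, "edx_lint"))
    found_edx = guess_paths_to_mutate(edx_dir)

print(found_templite, found_edx)
```

Under this rule the hyphen-versus-underscore mismatch means nothing is found, which is when explicit configuration becomes necessary.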

To look at the mutants, use the results command:

$ mutmut results
To apply a mutant on disk:
    mutmut apply <id>

To show a mutant:
    mutmut show <id>


Survived 🙁 (8)

---- src/templite.py (8) ----

10, 29, 37, 45, 46, 58, 108, 152

This gives me the ids of the mutants that survived, that is, the mutations that didn’t cause a failure in the test suite.

We can see the actual code mutation with the show command:

$ mutmut show 10
--- src/templite.py
+++ src/templite.py
@@ -48,7 +48,7 @@
         self.code.append(section)
         return section

-    INDENT_STEP = 4      # PEP8 says so!
+    INDENT_STEP = 5      # PEP8 says so!

     def indent(self):
         """Increase the current indent for following lines."""

The mutation is shown as a diff. The old line is prefixed with minus, and the new line with plus. Here the INDENT_STEP constant was changed from 4 to 5.

Right off the bat, we have a philosophical decision to make. A bit about how templite works: it converts template files into Python code. Rendering a template is done by executing the generated Python code. This INDENT_STEP constant is the indentation amount used in the generated code.

I have no tests that examine the generated code. That code is an implementation detail. The important thing is that the templates render properly, so that is what’s tested. When mutmut changed the indent level to 5, the generated code was different, but only in white space, so it ran the same, and still produced the right output.
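Here is a small sketch of why that mutant can't be caught by output-only tests. The `generate` function is an invented stand-in for a code generator, not templite's real output; it builds the same function with two different indent steps and runs both.

```python
def generate(indent_step):
    """Build source for a tiny render function, indenting by indent_step.

    A stand-in for a code generator; this is not templite's real output.
    """
    pad = " " * indent_step
    lines = [
        "def render(name):",
        pad + "result = []",
        pad + "if name:",
        pad * 2 + "result.append('Hello, ' + name)",
        pad + "return ''.join(result)",
    ]
    return "\n".join(lines) + "\n"


def compile_render(source):
    """Execute generated source and return its render function."""
    namespace = {}
    exec(source, namespace)
    return namespace["render"]


source4 = generate(4)
source5 = generate(5)
render4 = compile_render(source4)
render5 = compile_render(source5)

# The generated text differs, but the behavior does not.
print(source4 == source5)
print(render4("Ned") == render5("Ned"))
```

Any consistent indent width is valid Python, so the two versions differ only as text. A test would have to inspect the generated source itself to tell them apart.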

Does this mutation point to a problem in the test suite? I don’t think I should test that the indentation level in the generated code is 4 spaces. Mutmut provides a way to mark the line to exempt it from mutation, but I’m not sure I want to start adding those pragmas. This is one of the things I wanted to understand: what kind of false positives would appear, and how would I deal with them?

Let’s see how the next mutant looks:

$ mutmut show 29
--- src/templite.py
+++ src/templite.py
@@ -134,7 +134,7 @@
         code.add_line("append_result = result.append")
         code.add_line("extend_result = result.extend")
         if sys.version_info.major == 2:
-            code.add_line("to_str = unicode")
+            code.add_line("XXto_str = unicodeXX")
         else:
             code.add_line("to_str = str")

The second mutant has found the one line of code that is not covered by the test suite, because it’s for Python 2, and we are only running under Python 3. Mutmut has a --use-coverage flag, which uses coverage data to skip mutations on lines that are not covered by the test suite. If I had used it to begin with, this mutant wouldn’t have appeared. Nice.

Next:

$ mutmut show 37
--- src/templite.py
+++ src/templite.py
@@ -144,7 +144,7 @@
             """Force `buffered` to the code builder."""
             if len(buffered) == 1:
                 code.add_line("append_result(%s)" % buffered[0])
-            elif len(buffered) > 1:
+            elif len(buffered) >= 1:
                 code.add_line("extend_result([%s])" % ", ".join(buffered))
             del buffered[:]

This is a classic false positive. The condition has been changed from greater to greater-or-equal, but it doesn’t change the behavior of the code. This mutation is in an “elif” clause and the equal case was already handled by the previous if clause, so greater-or-equal is the same as greater.

On this point, Anders commented:

Mutmut here does point out that your code is overly complex. Just “elif buffered” can’t be mutated but has the same functionality. I’ve found this to be a weird little side effect to using mutation testing. If I follow this the code gets better and more “just so”. This specific case isn’t a super strong argument, but I’ve had many similar things that build on top of each other in small increments.

I can see Anders’ point here, though I’m not sure I want to change the code that way.
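A quick way to check the equivalence is to run all three versions of the condition side by side. This is a simplified sketch: `make_flush` is an invented helper that returns the line that would be generated instead of calling code.add_line, so the three variants can be compared directly.

```python
def make_flush(second_condition):
    """Build a flush function whose elif test is second_condition."""

    def flush(buffered):
        if len(buffered) == 1:
            return "append_result(%s)" % buffered[0]
        elif second_condition(buffered):
            return "extend_result([%s])" % ", ".join(buffered)
        return None

    return flush


original = make_flush(lambda b: len(b) > 1)     # the code as written
mutant = make_flush(lambda b: len(b) >= 1)      # mutant 37
simplified = make_flush(lambda b: bool(b))      # Anders' "elif buffered"

# Try buffers of length 0 through 4.
cases = [["'x'"] * n for n in range(5)]
outputs = [[flush(list(case)) for case in cases]
           for flush in (original, mutant, simplified)]

print(outputs[0] == outputs[1] == outputs[2])
```

Because the length-1 case is consumed by the first branch, no input reaches the elif where the three conditions would disagree.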

Mutant 45 gives us our first true success:

$ mutmut show 45
--- src/templite.py
+++ src/templite.py
@@ -153,7 +153,7 @@
         # Split the text to form a list of tokens.
         tokens = re.split(r"(?s)({{.*?}}|{%.*?%}|{#.*?#})", text)

-        squash = False
+        squash = True

         for token in tokens:
             if token.startswith('{'):

Templite can squash white space around tokens, and here we are changing the initial value of the “should I squash white space?” flag. How can it not cause a test failure? Because we never tested a template that started with white space! Adding this simple test kills the mutant:

self.try_render("  hello  ", {}, "  hello  ")

I thought that running mutmut again would clear the mutant from the results, but the only way I could find to clear it was to delete the mutmut cache and run all the mutations again. [Anders wrote an issue about this.]

Mutant 46 is another false positive:

$ mutmut show 46
--- src/templite.py
+++ src/templite.py
@@ -153,7 +153,7 @@
         # Split the text to form a list of tokens.
         tokens = re.split(r"(?s)({{.*?}}|{%.*?%}|{#.*?#})", text)

-        squash = False
+        squash = None

         for token in tokens:
             if token.startswith('{'):

Here squash is the same boolean flag we saw in mutant 45. I only ever check it with if squash:, so of course False and None produce the same results. Notice here if I wanted to prevent this mutant by adding a pragma to the line, I would also have prevented the first success we had. Adding that pragma would be counter-productive.

Next:

$ mutmut show 58
--- src/templite.py
+++ src/templite.py
@@ -160,7 +160,7 @@
                 start, end = 2, -2
                 squash = (token[-3] == '-')
                 if squash:
-                    end = -3
+                    end = -4

                 if token.startswith('{#'):
                     # Comment: ignore it and move on.

This is another useful result. It turns out that in my tests, I always wrote space-squashing tags with a space, like {{a -}}. The mutated line adjusts how much punctuation is trimmed to account for the dash. Because I always had a space before the dash, the change to -4 went unnoticed: the extra character it trimmed was white space that would have been stripped anyway. I killed this mutant by changing some tags in my tests to have no space: {{a-}}, and also added some with many spaces for good measure.
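The slicing can be checked in isolation. This sketch assumes the sliced text is stripped of white space afterwards; `tag_body` is an invented helper that mimics only the lines in the diff, with the squash end offset passed in so the original (-3) and the mutant (-4) can be compared.

```python
def tag_body(token, squash_end):
    """Return the text inside a tag, using squash_end when the tag ends in '-'."""
    start, end = 2, -2
    if token[-3] == "-":
        end = squash_end
    # Assumption: the sliced text is stripped before use.
    return token[start:end].strip()


for token in ("{{a -}}", "{{a   -}}", "{{a-}}"):
    original = tag_body(token, -3)
    mutant = tag_body(token, -4)
    print(repr(token), repr(original), repr(mutant), original == mutant)
```

With one or more spaces before the dash, both offsets give "a". Without the space, the mutant's slice eats the expression itself and returns an empty string, which is what lets a test notice.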

Mutant 108 sure looks like it’s real:

$ mutmut show 108
--- src/templite.py
+++ src/templite.py
@@ -211,7 +211,7 @@
             else:
                 # Literal content.  If it isn't empty, output it.
                 if squash:
-                    token = token.lstrip()
+                    token = None
                 if token:
                     buffered.append(repr(token))

Seems like we have no tests of non-white-space literal content after a squashing tag. Add that test, and that mutant is killed.

Our last mutant is another interesting case:

$ mutmut show 152
--- src/templite.py
+++ src/templite.py
@@ -283,7 +283,7 @@
                     value = value[dot]
                 except (TypeError, KeyError):
                     raise TempliteValueError(
-                        "Couldn't evaluate %r.%s" % (value, dot)
+                        "XXCouldn't evaluate %r.%sXX" % (value, dot)
                     )
             if callable(value):
                 value = value()

Here the error message has been mutated by adding chaff to the beginning and end. We do have a test for this error, including its message:

def test_exception_during_evaluation(self):
    msg = "Couldn't evaluate None.bar"
    with self.assertRaisesRegex(TempliteValueError, msg):
        self.try_render(
            "Hey {{foo.bar.baz}} there", {'foo': None}, "Hey ??? there"
        )

The test still passes because it’s finding the expected error message somewhere in the actual error message. If mutmut had added chaff in the middle of the string as well, it would have failed the test. Is this clever of mutmut? Hard to say!

When I change the test, the mutant is killed:

regex = "^Couldn't evaluate None.bar$"
with self.assertRaisesRegex(TempliteValueError, regex):
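The difference comes down to search-style matching: assertRaisesRegex looks for the pattern anywhere in the message. A small sketch with the re module shows the same behavior. The `strict` pattern is an extra suggestion not in the original test; it uses re.escape so the dot matches only a literal dot.

```python
import re

original_message = "Couldn't evaluate None.bar"
mutated_message = "XXCouldn't evaluate None.barXX"

loose = "Couldn't evaluate None.bar"
anchored = "^Couldn't evaluate None.bar$"
# Assumption for illustration: escape the message so "." is literal.
strict = "^" + re.escape(original_message) + "$"

# The unanchored pattern is found inside the mutated message.
print(bool(re.search(loose, mutated_message)))
# Anchoring both ends rejects the added chaff.
print(bool(re.search(anchored, mutated_message)))
# An unescaped "." still matches any character.
print(bool(re.search(anchored, "Couldn't evaluate None-bar")))
print(bool(re.search(strict, "Couldn't evaluate None-bar")))
```

The anchored pattern is enough to kill this mutant; escaping is a further tightening that would only matter if a mutation changed the dot.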

BTW, the first time I ran mutmut, it created another nonsensical mutant:

--- src/__init__.py
+++ src/__init__.py
@@ -1,2 +1,2 @@
-from .templite import *
+from .templite import /

This mutant survived because this file was never executed. That in itself was a useful clue to the fact that I had made a useless file. Delete the file, and the mutant is killed. [mutmut has changed so that it won’t create this mutation any more.]

So after all this, how did mutmut do? Setting aside the uncovered Python 2 line, it gave me seven mutations, four of which resulted in improving the tests. That’s not a bad outcome. But I don’t know how I would use this regularly. I don’t have a good way to silence the three false positives, so if I run mutmut again in the future, I will have to consider them again.

As another data point about the cost of mutation testing, I tried mutmut on another project with a 10-second test suite. It took mutmut 43 minutes to run 513 mutants, of which 165 survived. I haven’t looked through them yet to see what they mean.

All in all, I am pleased with the results. As an occasional (but expensive) way to assess your test suite, mutmut works well.

A Boston story

Sunday 24 February 2019

The other day, I woke up to find it had snowed. Not a lot, but enough to have to shovel the driveway and the walk. Working at the end of the driveway, I saw a hired guy about my age clearing the sidewalk across the street. He had the stocky, almost chubby kind of build of someone who is strong because he works all day, a bull of a guy. He was wearing a bright orange knit cap.

I waved to him and called out, “I didn’t expect to have to shovel, I’m going to be late for work.”

He came over, and we started chatting. He had a classic thick Boston accent. Was it going to be warm enough later that it would just melt? Or would it just freeze again and cause a real mess? It was the kind of friendly bonding over a shared experience that snow shoveling can bring about, even between two people without much else in common.

I ended with, “Well, I choose to live in Boston, so I can’t complain,” to which he responded, smiling, “And if you did complain, who would listen!?” It was that kind of conversation.

I went back up to the top of my driveway. I was faced away from the street, clearing the car, when I heard something behind me. I turned to see the guy in his pickup truck, zooming toward me. The blade of the snowplow was about three feet from me. He stopped, lowered the blade, and backed down the driveway, scraping away the snow. He came up again, scraped again, and then pushed the resulting snow pile in the street out of the way.

I just stood aside, grinning, pleased at the work he had saved me. As he turned the truck to drive away down the street, I joked to him, “Thanks a lot! But next time give me some warning, you almost gave me a heart attack!”

To which he smiled and responded, “Ah, you wasn’t gonna have no haht attack,” and drove away to the next job.

Git tools: tig etc

Saturday 23 February 2019

Recently I’ve had a few chats with engineers where I mentioned a git helper, and they hadn’t heard of it. So maybe other people would like to hear about these too:

tig is a full-window terminal UI for git. It’s great for spelunking through a git repo, looking at branches, history, blames, and so on. For a quick overview of what it does, this old blog post from Atlassian describes it pretty well.

You should play with it to see all of its features. To be honest, I haven’t quite internalized how it displays new panes, but I still love it for its speedy compact presentation of just the information I want.

hub is a command-line tool for doing GitHub things that are not pure git, or for supplying helpful GitHub-centric defaults. For example, cloning a repo with “hub clone username/repo”, or opening a pull request for the current branch with “hub pr”. It can do a ton of stuff. If you use a lot of GitHub features, but like the command line, you should definitely give it a try.

A global .gitignore file is like the .gitignore file in your repos: it specifies files that should never be committed to git. But instead of being part of a specific repo, this one is global to all of your repos on your machine. This is great for IDE-specific files, or data files for your own quirky tools.

Do you have other good helpers to recommend?

Drawing Cairo SVG in a Jupyter notebook

Sunday 27 January 2019

Quick tip: if you want to draw figures using Cairo in a Jupyter notebook, here’s how to do it, or at least how I did it:

from io import BytesIO

import cairo
import IPython.display

svgio = BytesIO()
with cairo.SVGSurface(svgio, 200, 200) as surface:
    # These lines are copied verbatim from the
    # pycairo page: https://pycairo.readthedocs.io
    context = cairo.Context(surface)
    x, y, x1, y1 = 0.1, 0.5, 0.4, 0.9
    x2, y2, x3, y3 = 0.6, 0.1, 0.9, 0.5
    context.scale(200, 200)
    context.set_line_width(0.04)
    context.move_to(x, y)
    context.curve_to(x1, y1, x2, y2, x3, y3)
    context.stroke()
    context.set_source_rgba(1, 0.2, 0.2, 0.6)
    context.set_line_width(0.02)
    context.move_to(x, y)
    context.line_to(x1, y1)
    context.move_to(x2, y2)
    context.line_to(x3, y3)
    context.stroke()
    # end of pycairo copy
IPython.display.SVG(data=svgio.getvalue())

Counting lines of code

Saturday 19 January 2019

I wrote an Open edX blog post about the need to move from Python 2 to Python 3. For emphasis, I wanted to say how much code there was. Open edX is a large project spread across a number of repos. Why spend 30 minutes writing a blog post when you can first spend two hours fiddling around with line-counting tools to get a vague factoid for the blog post?

The old standard tool for line-counting is cloc. It has way too many options, many of which don’t work quite the way I would have expected, but it gets the job done, with some bash support. My resulting monster is below.

It over-counts JavaScript code because a lot of the JavaScript checked into git isn’t code we wrote. I don’t know what to do about that. Oh well.

BTW, on the subject of line counting: once, helping someone with a program, I saw they were using semicolons to end their Python statements. I said they didn’t need them, and they replied, “Yes I do, because my manager’s line-counting software requires them.” !!!

Be careful out there...

#!/bin/bash

# Count lines of code in a tree of git repos.
# Needs cloc (https://github.com/AlDanial/cloc)

REPORTDIR=/tmp/cloc-reports
mkdir -p $REPORTDIR
rm -rf $REPORTDIR/*

cat <<EOF > $REPORTDIR/exclude-files.txt
package-lock.json
EOF

cat <<EOF > $REPORTDIR/more-langs.txt
reStructured Text
    filter remove_matches xyzzy
    extension rst
    3rd_gen_scale 1.0
SVG Graphics
    filter remove_html_comments
    extension svg
    3rd_gen_scale 1.0
EOF

find . -name .git -type d -prune | while IFS= read -r d; do
    dd=$(dirname "$d")
    if [[ $dd == ./src/third-party/* ]]; then
        # Ignore repos in the "third-party" tree.
        continue
    fi
    echo "==== $dd =============================================="
    # Skip the repo if we can't enter it, rather than counting the wrong tree.
    cd "$dd" || continue
    git remote -v

    # Name each report after the repo's directory.
    REPORTHEAD="$REPORTDIR/${dd##*/}"
    cloc \
        --report-file="$REPORTHEAD.txt" \
        --read-lang-def="$REPORTDIR/more-langs.txt" \
        --ignored="$REPORTHEAD.ignored" \
        --vcs=git \
        --not-match-d='.*\.egg-info' \
        --exclude-dir=node_modules,vendor,locale \
        --exclude-ext=png,jpg,gif,ttf,eot,woff,mo,xcf \
        --exclude-list-file="$REPORTDIR/exclude-files.txt" \
        .
    cd -
done

cloc \
    --sum-reports \
    --read-lang-def=$REPORTDIR/more-langs.txt \
    $REPORTDIR/*.txt
