
The scalability of programming languages

Saturday 24 October 2009

Tabblo is written on the Django framework, and therefore, in Python. Ever since we were acquired by Hewlett-Packard two and a half years ago, there's been a debate about whether we should start working in Java, a far more common implementation language within HP. These debates come and go, with varying degrees of seriousness.

The latest wave of "Java?" debating is upon us, and Mike Vanier's The Scalability of Programming Languages has been entered into evidence. I found it a very interesting read, especially about static vs. dynamic typing. At one point, Mike says,

What typically happens in large projects written in these languages is that extensive unit tests are written to catch type errors as well as logical errors ...

I think Mike meant this as a negative, but I don't see how it is. Extensive unit tests are a good thing, especially since they catch logical errors as well as type errors. The static type people either don't have such tests, in which case nothing is catching their logic errors, or they do have such tests, in which case they didn't need the static type checking in the first place.

Static type adherents claim that their type declarations give them both documentation of what's expected, and automatic checking of code. But they only give you a small amount of either.

For example, a parameter to a function has to be a string, so you declare it as String, and the compiler can guarantee that it is a String. But that's just one small aspect of the rules about the parameter. Can it be NULL? Can it be empty? What's it supposed to represent? An IP address? Can it be a wild-carded IP address? Can it be a comma-separated list of such addresses?

The questions beyond "String" go on and on, and static type checking gives us help with none of them. There's the temptation to slice the universe ever more finely to get the type system to carry some of this information. So you'll end up with IpAddress types, and WildcardableIpAddress, and so on. Those are good things, since you will likely have methods on IP addresses that you want to perform, so building classes will help. But there are always distinctions between instances that can't be expressed in the type system. The only way to get at them is at run time. You can decide which run time you want to find them: in tests or in real use. Tests are the better answer.
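To make the point concrete, here's a minimal sketch of what "slicing the universe more finely" looks like: a hypothetical IpAddress class (the name and rules are illustrative, not from any real library) whose type name documents intent, but whose actual constraints are only enforced at run time:

```python
import re

class IpAddress(object):
    """A hypothetical validated type: the class name documents intent,
    but the real rules are only enforced at run time."""

    def __init__(self, text, wildcard_ok=False):
        # The static type system stops at "String"; everything past
        # that point is a run-time check.
        octet = r"(?:\d{1,3}|\*)" if wildcard_ok else r"\d{1,3}"
        if not re.match(r"^%s(?:\.%s){3}$" % (octet, octet), text):
            raise ValueError("not a valid IP address: %r" % text)
        self.text = text
```

A unit test exercising this constructor catches both the "wrong type" mistakes and the "wrong value" mistakes, which is the point: the checking happens at run time either way, and tests are where you want that run time to be.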

The rest of the essay is interesting, especially Mike's postscripts where his changes of viewpoint are recorded. It's worth a read, if only for its exposition of the considerations that go into programming language design. He doesn't get caught up in shallow issues like syntax, but gets at the deeper factors in programming languages that affect the outcome of projects that use them.

Mozilla Raindrop

Friday 23 October 2009

Mozilla Labs has announced Mozilla Raindrop, "an exploration in messaging innovation", or a next-gen mail and messaging client.

I'm interested in this for a few reasons. First, it could someday be the way I read email, and goodness knows we could use better ways to deal with that mess. Second, it's built on Python and CouchDB.

On the down side, it's a rethinking of messaging, which is a quagmire where plenty of other enthusiastic innovators have lost their way before. Also, it's in the very early stages, and so has a long way to go before it's something people can use.

I'll be interested to follow its progress, both as a usable piece of software and as a showcase for favorite foundational technologies.

Physical type

Wednesday 21 October 2009

The opening titles for the Typophile 5 Film Festival are quite a work of art. It's a rumination on the five senses, expressed typographically, and it's very well done. It's great to see the type shapes used as shapes first, as text second. But the really surprising thing about it in this day and age is that it is all filmed physical objects; no CG effects were used.

There's an interview with the team, and behind-the-scenes stills here, as well as the opening titles from the previous four festivals.

Today's KenKen

Sunday 11 October 2009

The raging success of Sudoku has created demand for more abstract analytical puzzles, and KenKen seems to fit the bill nicely. Like Sudoku, completing the puzzle requires satisfying a number of overlapping constraints. For KenKen, the rows and columns must have each number only once, and the outlined areas (called cages) must total to the correct number using the specified operator. So a cage marked "10+" must sum to 10, while in one marked "2÷", the numbers must be a pair that divides to 2.
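The cage rule can be expressed in a few lines of code. Here's a sketch (cage_satisfied is a made-up helper name, not from any puzzle library):

```python
from functools import reduce

def cage_satisfied(values, target, op):
    """Check whether the numbers in one cage meet its target.

    For the - and / operators, KenKen allows the pair in either
    order, so we compare the larger against the smaller.
    """
    if op == "+":
        return sum(values) == target
    if op == "*":
        return reduce(lambda a, b: a * b, values) == target
    big, small = max(values), min(values)
    if op == "-":
        return big - small == target
    if op == "/":
        return big == small * target
    raise ValueError("unknown operator: %r" % op)
```

So a "10+" cage holding 4 and 6 passes, and a "2÷" cage holding 6 and 3 passes, regardless of which cell holds which number.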

I find KenKen more interesting than Sudoku, because of the variety of logic that has to be applied to solve one. As an example, here is today's New York Times KenKen:

KenKen, as presented

I'll show here step-by-step how I solved it. First we fill in the forced numbers:

KenKen, step 1

At cage C4:D4, we need a sum of 5, which is either 1,4 or 2,3. Column 4 already has a 3, so it must be 1,4, though we don't know exactly where. Fill in the possibilities:

KenKen, step 2

Looking at row E, the 5 can't go in either of the 2÷ cages, so it has to go into the 6+, making it 1,5. Column 4 already has a 1, so we know exactly where they go:

KenKen, step 3

G5:G6 is either 1,3 or 2,6, but G4 is already 3, so it must be 2,6:

KenKen, step 4

Now, consider rows F and G together. The sum of all the cells in both rows must be 56 (2×(1+2+3+4+5+6+7)). Other than F4:F5 and G2:G3, we have a sum of 10+7+10+3+8, or 38. So F4:F5 + G2:G3 is 18. If one of them is 1,4, then the other must sum to 18-5 or 13, making it 5,8, which is impossible, so neither is 1,4. If one of them is 3,6, then the other is also 3,6 but G2:G3 can't be 3,6 since G4 is already 3. The last remaining possibility is that one of them is 2,5 and the other is 4,7. F4:F5 can't be 4,7 (since F3 is 7), so it is 2,5, and G2:G3 is 7,4. E4 determines where the 5 will go, and F3 determines where the 7 will go:

KenKen, step 5
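The case analysis in that step can be double-checked mechanically. A quick sketch, assuming (as the candidates above suggest) that both unknown cages are 3− cages drawing from digits 1-7:

```python
# Pairs of digits 1-7 that a 3- cage allows:
pairs = [(b, b + 3) for b in range(1, 5)]   # (1,4) (2,5) (3,6) (4,7)

# Which two such pairs together sum to 18?
splits = [(p, q) for p in pairs for q in pairs
          if p <= q and sum(p) + sum(q) == 18]
# Only 2,5 with 4,7 and 3,6 with 3,6 survive -- the same two cases
# the prose walks through (5,8 never appears, since 8 is off-grid).
```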

In row A, we have a similar arrangement with two 3− cages. Because the entire row sums to 28, and the third cage is a 10+, the two 3− cages together sum to 18, just like we saw in rows F and G. Again 1,4 and 5,8 isn't a possibility, and because they're on the same row, we can't use 3,6 and 3,6. So these cages will also be filled with 2,5 and 4,7, though we don't know which is which.

That leaves 1,3,6 to fill the 10+ on row A. The 1 has to go in column 5 since columns 3 and 4 already have 1's. Column 4 has a 3, so A4 gets the 6, and A3 gets the 3:

KenKen, step 6

The last number in column 4 is the 7 for B4. The 56× cage needs a 7, which can't go in row B or column 5, so it goes in C6:

KenKen, step 7

Now comes some complicated logic. Look at rows B, C, and D. All together, their cells have to sum to 84. We know the 56× cage has to be 7,2,4, so the only cages we don't know the sums of are 36×, 3−, and 2−. The others sum to 54.

There are three possibilities for the 36× cage: 3,4,3 or 6,2,3 or 6,1,6. Let's consider them in turn:

  • If 36× is 3,4,3, then the 3− and 2− cages have to sum to 20, which isn't possible: the sum of a 3− pair is odd and the sum of a 2− pair is even, and odd plus even can never make an even number like 20.
  • If 36× is 6,2,3, then the 3− and 2− cages sum to 19. There are three ways to do that, none of which are allowed:
    • 7,4 and 5,3: won't work, because columns 2 and 3 already have 7's.
    • 6,3 and 6,4: that would put two 6's in row D.
    • 5,2 and 7,5: two 5's in row D.
  • So 36× must be 6,1,6. Let's look at the possibilities for the 3− and 2−:
    • 7,4 and 4,2: nope, two 4's in row D.
    • 6,3 and 5,3: nope, two 3's in row D.
    • 5,2 and 6,4: that's valid!
    • 4,1 and 7,5: nope, columns 5 and 6 both have 7's already.

So there's only one way to complete the 3− and 2− in row D, and we know the solution to the 36×. Also, the possibilities for the 2− cage force the positions of the 1 and 4 in column 4:

KenKen, step 8
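The parity argument that ruled out the first 36× case can be verified in a couple of lines:

```python
# Pairs differing by 3 have one even and one odd member, so an odd
# sum; pairs differing by 2 have members of the same parity, so an
# even sum.  Odd plus even can never make an even total like 20.
diff3 = [(b, b + 3) for b in range(1, 5)]
diff2 = [(b, b + 2) for b in range(1, 6)]
assert all((a + b) % 2 == 1 for a, b in diff3)
assert all((a + b) % 2 == 0 for a, b in diff2)
```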

Column 3 is almost completed. D3 is either 2 or 5, so B3 is either 5 or 2. If it's 2, then B5 must be 6, but column 5 will get a 6 from either D5 or G5, so B3 must be 5, and D2 and D3 are determined also:

KenKen, step 9

The 56× cage has to be 7,2,4 and we now have enough squares filled in to see where they go. These then fix the positions of two cages on rows D and G:

KenKen, step 10

Returning to the two 3− cages in row A: they are 2,5 and 4,7. With column 6 more filled in, we can see that A6:A7 has to be 5,2, and A1:A2 is 7,4:

KenKen, step 11

The two 2÷ cages in row E will be 2,4 and 3,6. Neither 2 nor 4 can go in column 6, so those cages can be filled in, along with the last remaining numbers in columns 2 and 6:

KenKen, step 12

Now our rows have only two numbers each remaining. In each row we can consider the two missing numbers and place each in the column that doesn't already have it:

KenKen, step 13

And the rest is easy:

KenKen, completed

DevDays Boston

Wednesday 7 October 2009

Today was the DevDays Boston one-day conference. I did an hour introducing Python. My slides are here: DevDays Whirlwind Python (sorry there's no text).

It was a really good crowd, lots of intelligent questions at the end, lots of good energy. The other talks were really interesting too; it was great to have diversity: iPhone, Javascript, Mono, and so on.

Joel of course was entertaining and announced a few new products from FogBugz which look interesting, including Mercurial hosting as a plugin, called Kiln (can't find a link).

Running the same code on Python 2.x and 3.x

Saturday 3 October 2009

Last weekend I released a version of Coverage.py for Python 3.x. Getting to that point took a while because 3.x was new to me, and, it seems, everyone is still figuring out how to support it.

I experimented with using 2to3 to create my 3.x code from my 2.x code base, and that worked really well; see Coverage.py on Python 3.x for some details. For a while, I developed like this, with 3.x code translated by 2to3 so that I could run the tests under Python 3.1. But then I had to figure out how to package it.

I didn't want to have to create a separate package in PyPI for the 3.x support. I tried for a while to make one source package with two distinct trees of code in it, but I never got setup.py to be comfortable with that. Setup.py is run during kitting, and building, and installation, and the logic to get it to pick the right tree at all times became twisted and confusing.

(As an aside, setuptools has forked to become Distribute, and they've just released their Python 3 support which includes being able to run 2to3 as part of build and install. That may have been a way to go, but I didn't know it at the time.)

Something, I forget what, made me consider having one source tree that ran on both Python 2 and Python 3. When I looked at the changes 2to3 was making, it seemed doable. I adapted my code to a 2-and-3 idiomatic style, and now the source runs on both.

Changes I had to make:

¶   I already had a file called backward.py that defined 2.5 stuff for 2.3, now I also used it to deal with import differences between 2 and 3. For example:

try:
    from cStringIO import StringIO
except ImportError:
    from io import StringIO

and then in another file:

from backward import StringIO

¶   exec changed from a statement to a function. Syntax changes like this are the hardest to deal with because code won't even compile if the syntax is wrong. For the exec issue, I used this (perhaps too) clever conditional code:

# Exec is a statement in Py2, a function in Py3
import sys

if sys.hexversion > 0x03000000:
    def exec_function(source, filename, global_map):
        """A wrapper around exec()."""
        exec(compile(source, filename, "exec"), global_map)
else:
    # OK, this is pretty gross.  In Py2, exec was a statement, but that will
    # be a syntax error if we try to put it in a Py3 file, even if it isn't
    # executed.  So hide it inside an evaluated string literal instead.
    eval(compile("""\
def exec_function(source, filename, global_map):
    exec compile(source, filename, "exec") in global_map
""",
    "<exec_function>", "exec"
    ))

¶   All print statements have to adopt an ambiguous print(s) syntax. The string to be printed has to be a single string, so some comma-separated lists turned into formatted strings.
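A minimal illustration of the idiom (the strings here are just for show):

```python
# Parenthesized print works under both versions: Python 2 parses the
# parentheses as grouping an expression, Python 3 as a function call.
print("coverage")

# Comma-separated arguments would print as a tuple under Python 2,
# so build one formatted string instead:
print("%s version %s" % ("coverage", "3.1"))
```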

¶   2to3 is obsessive about converting any d.keys() use into list(d.keys()), since keys returns a dictionary view object. If the dict isn't being modified, you can just loop over it without the list(), but in a few places, I really was returning a list, so I included the list() call.
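In other words, the distinction looks like this:

```python
d = {"a": 1, "b": 2}

# Plain iteration needs no list() wrapper in either version:
found = [k for k in d]

# But when a real list is required, ask for one explicitly,
# since Python 3's d.keys() returns a view, not a list:
keys = list(d.keys())
```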

¶   A few 2to3 changes are fine to run on both, so these:

d.has_key(k)
d.itervalues()
callable(o)
xrange(limit)

became:

k in d
d.values()
hasattr(o, '__call__')
range(limit)

¶   Exception handling has changed when you want to get a reference to the exception. This is one of those syntax differences, and it's structural, so a tricky function definition isn't going to bridge the gap.

Where Python 2 had this:

try:
    # .. blah blah ..
except SomeErrorClass, err:
    # use err

now Python 3 wants:

try:
    # .. blah blah ..
except SomeErrorClass as err:
    # use err

The only way to make both versions of Python happy is to use the more cumbersome:

try:
    # .. blah blah ..
except SomeErrorClass:
    _, err, _ = sys.exc_info()
    # use err

This is uglier, but there were only a few places I needed it, so it's not too bad.

¶   Simple imports are relative or absolute in Python 2, but only absolute in Python 3. The new relative import syntax in Python 3 won't compile in Python 2, so I can't use it. I was only using relative imports in my test modules, so I used this hack to make them work:

import os, sys

sys.path.insert(0, os.path.split(__file__)[0]) # Force relative import
from myotherfile import MyClass

By explicitly adding the module's own directory to the path, Python 3's absolute-only importer can find the file sitting alongside this one.

¶   One area that still tangles me up is str/unicode and bytes/str. Python 3 is making a good change here, but it feels like we're still in transition. The docs aren't always clear on what will be returned, and trying to get the same code to do the right thing under both versions still seems to require experiments with decode and encode.
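Here's a sketch of the kind of defensive helper this leads to (to_text is a hypothetical function, not part of coverage.py):

```python
def to_text(data, encoding="utf-8"):
    """Return text, whether we were handed bytes or text."""
    if isinstance(data, bytes) and not isinstance(data, str):
        # Python 3: bytes is a distinct type and needs decoding.
        return data.decode(encoding)
    # Python 2 str/unicode, or Python 3 str: leave it alone.
    return data
```

Writing it is easy; knowing which of your callers will hand you bytes and which will hand you text is the part that still takes experimenting.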

After making all of these changes, I had a single code base that ran on both Python versions, without too much strangeness. It's way better than having to maintain two packages at PyPI, or trying to trick setup.py into installing different code on different versions.

Others have written about the same challenge:
