
Pilkington, LBP, and Inferno cakes

Wednesday 13 April 2011

Ben's birthday was a few weeks ago, and we ended up with three different occasions for cakes to celebrate. For the day itself, we honored the animated Karl Pilkington:

Karl Pilkington, as cake

For those who haven't enjoyed The Ricky Gervais Show, it's an animated version of Ricky's podcasts, in which he discusses anything and everything with his colleagues Stephen Merchant and especially Karl Pilkington:

Karl Pilkington, as cartoon

For an extended family gathering, we made cupcakes based on Little Big Planet sackboys:

Sackboy cupcakes

Finally, for his delayed party, a monstrosity based on Dante's Inferno, a fascination for Ben:

Inferno cake

There are three levels here: the top is suicides turned into trees, the bottom is the icy level with a demon guarding the place, and in the middle is the gluttonous third circle, with tormented souls swimming in their own excretions (don't worry, just chocolate pudding and Tootsie rolls).

OK, so this is an unusual theme for a cheerful birthday party, but believe me, they loved this, and it's right up Ben's alley. Take a look at some of his art: Forgiveness Pt. 2

Cog in Matlab

Tuesday 12 April 2011

Cog, my templating and code generation tool, seems to be like the little engine that could. I wrote it years ago to bring a little Python power to a non-Python job. But then it was unexpectedly useful while preparing my slides for PyCon this year. I did a lightning talk explaining why (I start at about 8:00 minutes in).

One of the things I didn't expect when I released Cog was that people would take the concept and port it to other languages. There are implementations for PHP, Ruby, and Perl. And now Doug Harriman has written another, so you can Cog in Matlab. I don't know anything about Matlab, and I didn't realize this was even a sensible idea, but now it's real.

When I look at Cog now, I see things I'd like to change about it. Maybe there will be a more modern implementation some day. But it does its job well now. If you have text files that you want to do a little bit of processing on, look into Cog; people seem to like it.

A Javascript lexer in Python, and the saga behind it

Sunday 10 April 2011

In the last week I've written a new Javascript lexer, jslex. Why I did it is one of those open source adventures that starts innocently enough.

I'm working on a Django project for a client, and it needs to be localized into their language. Django has good support for localization, providing tools for extracting strings from Python, HTML, and Javascript files. But something wasn't right: the client reported that some of the strings were still in English. Usually this means that they made a small mistake during the translation process, and the English in the source doesn't match the English in the message file.

But when I looked, it turned out the English was completely missing from the message file. Check the source: yup, it's properly marked for translation. Then I remembered: parsing Javascript source files for messages is fragile. I'd encountered this before, and had simply fiddled with the Javascript source to make the problem go away. But this time, as one message was re-harvested, other messages would disappear. The problem seemed more severe than I had encountered in the past. I decided to learn more about why it was happening.

Like many open source projects, Django uses GNU gettext to manage the message files, including using the xgettext tool to parse the source files to find strings to translate. But xgettext doesn't support parsing Javascript. Django has a strange accommodation to deal with this: it performs a simple transformation on the Javascript source, then tells xgettext that it's Perl.

I can only guess why Perl was chosen: because Javascript and Perl both have regex literals, which as we'll see, play a large part in this story. But Django's Javascript-to-Perl transformation is simplistic: it just converts all //-comments on their own line into #-comments. So this Javascript:

// My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");

gets transformed into this "Perl":

# My awesome Javascript
x = 1;  // Don't start x at zero.
gettext("Please translate me!");

I assume the reason //-comments that share a line with code are skipped is to avoid clobbering strings with // in them, though with multi-line strings, even that is not enough to protect them.
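To make the transformation concrete, here is a rough Python sketch of the behavior described above. It is not Django's actual code; the function name `naive_js_to_perl` is made up for illustration, and it only assumes that whole-line //-comments become #-comments while everything else passes through untouched.

```python
import re

def naive_js_to_perl(src):
    """Rewrite //-comments that occupy a whole line as #-comments.

    Hypothetical sketch of the transformation described in the post,
    not the code Django ships.
    """
    # Match // only when nothing but spaces or tabs precedes it on the line,
    # so trailing comments after code are left alone.
    return re.sub(r"(?m)^([ \t]*)//", r"\1#", src)

sample = (
    "// My awesome Javascript\n"
    "x = 1;  // Don't start x at zero.\n"
    'gettext("Please translate me!");\n'
)
print(naive_js_to_perl(sample))
```

Running it on the sample reproduces the "Perl" shown above: only the first line changes, and the trailing comment with its troublesome apostrophe survives.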

Of course, this transformation is insufficient to properly carry the strings into the "Perl" so that xgettext can find them. For example, in the above sample, the Javascript comment on line 2 is still executable Perl code after the transformation, and the apostrophe in the comment is considered the start of a string literal, so the gettext call is skipped as part of a multi-line string.

In fact, depending on the version of gettext, which determines how advanced its Perl parsing is, all sorts of innocuous Javascript constructs can throw off the parser:

gettext("Message on 1");
var x = y;
gettext("Message on 3");
gettext("Message on 4");
gettext("Message on 5");

Here messages 1 and 5 are found, and 3 and 4 are not. How come? Perl's y operator (a synonym for tr, the transliteration operator) consumes two strings delimited by the character that follows it, in this case a semicolon, so lines 3 and 4 are treated as literals rather than code.
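A small Python simulation shows which text gets swallowed. This is only an illustration of the delimiter rule, not how xgettext is implemented, and the helper name `perl_y_span` is hypothetical. It assumes the same delimiter is used three times, as in y;...;...;.

```python
def perl_y_span(src, start):
    """Return the text a Perl-style scanner would consume for y<d>...<d>...<d>.

    `start` is the index of the `y`. Illustration only: the character right
    after `y` is taken as the delimiter, and the operator runs through the
    next two occurrences of it.
    """
    delim = src[start + 1]
    first = src.index(delim, start + 2)    # end of the first "string"
    second = src.index(delim, first + 1)   # end of the second "string"
    return src[start:second + 1]

js = (
    'gettext("Message on 1");\n'
    "var x = y;\n"
    'gettext("Message on 3");\n'
    'gettext("Message on 4");\n'
    'gettext("Message on 5");\n'
)
swallowed = perl_y_span(js, js.index("y;"))
print(repr(swallowed))
```

The consumed span runs from the y through the semicolon that ends line 4, which is why only messages 1 and 5 remain visible as code.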

In truth, Django's accommodation for Javascript is an egregious hack. So I wanted to find a better solution. I figured that if I could properly lex Javascript, then I could manipulate the token stream to create something that could reliably be parsed by gettext.

The result is jslex, a pure-Python lexer for Javascript. Lexing Javascript turns out to be tricky due to our old friend the regex literal. When a slash character is found, it could mean one of four things: a division operator (either / or /=), a line comment (//), a multi-line comment (/*), or a regex literal. The two comment forms are simple to deal with, because a regex literal can't be empty, so // is always a comment, and a regex can't start with a star, so /* is always a comment.

But distinguishing between division and regexes is impossible to do at a purely lexical level, and can be quite subtle:

for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}
for (var x = a in foo && "</x>" || mot ?  z/x:3;x<5;y</g/i) {xyz(x++);}

The first line has a regex of /x:3;x<5;y</g, the second has /g/i.

The ECMAScript standard says you need to parse the code: if you're at a point where a regex literal would be a valid next token, then lex it as a regex, but if you're at a point where a division would be valid, then lex it as division.

I wasn't willing to write a full parser, so I've taken an approach similar to that of other lightweight Javascript tools: use the previous token to decide whether the next token can be a division or a regex. It seems to work well.

The lexer is a general-purpose multi-state lexer built on regular expressions. The rules create a two-state lexer with a state for "division possible," and "regex possible." When I thought I had it working, I outsourced the QA to Stack Overflow, finally finding something to do with my too-many reputation points: pay a bounty to find Javascript it doesn't lex properly. Mind-twistingly, a respondent there found a useful test: a Javascript lexer written in Javascript, which when fed through my lexer, failed because my regex-matching regex couldn't properly lex his regex-matching regex!
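Here is a much-reduced sketch of the two-state idea in Python. It is not jslex itself: the rule set, the keyword list, and the names (`STATES`, `lex`) are simplified assumptions for illustration, and it ignores plenty of real Javascript (for example, it always treats } as ending an expression, which is not always right).

```python
import re

# Hypothetical, tiny subset of Javascript keywords after which a regex may follow.
KEYWORDS = {"for", "var", "in", "return", "typeof", "new", "delete", "void", "case"}

COMMON = [
    ("ws", r"\s+"),
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("id", r"[A-Za-z_$][\w$]*"),
    ("num", r"\d+(?:\.\d+)?"),
    ("string", r'"(?:[^"\\\n]|\\.)*"' + r"|'(?:[^'\\\n]|\\.)*'"),
]
# A regex literal: non-empty body, with [...] classes allowed to contain slashes.
REGEX = ("regex", r"/(?:[^/\\\n\[]|\\.|\[(?:[^\]\\\n]|\\.)*\])+/[A-Za-z]*")
DIV = ("div", r"/=?")
PUNCT_LIST = ["===", "!==", "==", "!=", "<=", ">=", "&&", "||", "++", "--",
              "+", "-", "*", "%", "<", ">", "=", "!", "?", ":", ";", ",", ".",
              "(", ")", "{", "}", "[", "]", "&", "|", "^", "~"]
PUNCT = ("punct", "|".join(re.escape(p) for p in PUNCT_LIST))

def build(rules):
    # One alternation per state; the first matching rule wins.
    return re.compile("|".join("(?P<%s>%s)" % r for r in rules), re.DOTALL)

# Comments are tried before the slash rule in both states.
STATES = {
    "reg": build(COMMON[:2] + [REGEX] + COMMON[2:] + [PUNCT]),  # regex possible
    "div": build(COMMON[:2] + [DIV] + COMMON[2:] + [PUNCT]),    # division possible
}

def lex(src):
    """Yield (kind, text) pairs, switching state based on the previous token."""
    state, pos = "reg", 0
    while pos < len(src):
        m = STATES[state].match(src, pos)
        if m is None:
            raise ValueError("cannot lex at offset %d" % pos)
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("ws", "comment"):
            continue  # these never change the state
        if kind == "id" and text in KEYWORDS:
            kind = "keyword"
        elif kind == "div":
            kind = "punct"
        # After a value-like token a slash means division; otherwise a regex.
        ends_value = kind in ("id", "num", "string", "regex")
        ends_value = ends_value or text in (")", "]", "}", "++", "--")
        state = "div" if ends_value else "reg"
        yield kind, text

line1 = 'for (var x = a in foo && "</x>" || mot ? z:/x:3;x<5;y</g/i) {xyz(x++);}'
line2 = 'for (var x = a in foo && "</x>" || mot ?  z/x:3;x<5;y</g/i) {xyz(x++);}'
for line in (line1, line2):
    print([text for kind, text in lex(line) if kind == "regex"])
```

Fed the two tricky lines from above, the sketch picks out the same regex literals in each. The only difference between the two states is which rule gets to claim a slash that isn't the start of a comment.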

To bridge Javascript code to xgettext, I chose to transform it into "C" instead of Perl. That means getting rid of the regex literals by turning them all into the C string "REGEX", and changing single-quoted strings into double-quoted strings.
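As a sketch of that last step, assume the lexer hands back (kind, text) pairs, with whitespace included as tokens so the output can be rejoined. That token shape and the function names below are hypothetical; this is not the patch itself, just the shape of the rewrite.

```python
import re

def to_c_string(js_single):
    """Turn a single-quoted Javascript string literal into a double-quoted one."""
    body = js_single[1:-1]

    def fix(m):
        s = m.group()
        if s == "\\'":
            return "'"      # \' no longer needs escaping
        if s == '"':
            return '\\"'    # a bare " now does
        return s            # leave every other escape sequence alone

    return '"' + re.sub(r'\\.|"', fix, body, flags=re.DOTALL) + '"'

def js_tokens_to_c(tokens):
    """Rebuild source text that a C-aware string extractor can read."""
    out = []
    for kind, text in tokens:
        if kind == "regex":
            out.append('"REGEX"')           # regex literals become a dummy string
        elif kind == "string" and text.startswith("'"):
            out.append(to_c_string(text))
        else:
            out.append(text)
    return "".join(out)

tokens = [
    ("id", "x"), ("ws", " "), ("punct", "="), ("ws", " "),
    ("regex", "/a'b/g"), ("punct", ";"), ("ws", " "),
    ("id", "gettext"), ("punct", "("), ("string", "'say \"hi\"'"),
    ("punct", ")"), ("punct", ";"),
]
print(js_tokens_to_c(tokens))
```

The apostrophe inside the regex can no longer start a phantom string, because the whole literal has been replaced before xgettext ever sees it.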

The next phase is to determine whether this gets into Django or not. I've prepared it as a patch, but there was already some momentum to replace gettext with Babel, and it's looking like it might all have to wait for 1.4 in any case. As someone who's recently lost time to this bug, I would really rather get something into 1.3.1, so we'll see where that ends up.

In any case, if you have need for lexing Javascript in Python, use jslex, it works.
