Friday 31 December 2010 — This is almost 15 years old. Be careful.

Internationalizing an application consists of two broad areas of work: marking all the human-readable text for translation, and then localizing (translating) it into whatever languages you want to support. The first phase is tricky because after you’ve marked a string for translation, it still looks the same, since there isn’t yet a translation for it. So you start with an English application, do a bunch of work to find and mark all the strings, and, if you’ve done it right, end up with something that looks and behaves exactly the same.

This makes it difficult to know that you’ve marked all the strings. The end result is precisely as if you had done nothing at all. If you miss a string, you won’t find out until you get back a translation and try it out. And then you’re looking at your application in a foreign language, which, if you are like me, means you don’t understand what you’re reading.
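A small sketch of why the marking work is invisible. It uses Python's standard gettext module instead of Django's wrappers (an assumption, so that it runs without Django): with no catalog loaded, a marked string comes back unchanged.

```python
import gettext

# NullTranslations is the stdlib's "no catalog" fallback: it returns every
# message unchanged, which is how an app behaves before any translation exists.
translations = gettext.NullTranslations()
_ = translations.gettext

marked = _("Please log in to comment.")    # marked for translation
unmarked = "Please log in to comment."     # forgotten, never marked

# Both render identically, so the screen gives no hint about which is which.
print(marked == unmarked)
```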

To solve these problems on a recent project, I wrote a little script, poxx.py. Before describing it, let’s review the mechanics of localizing an application, in this case a Django application:

1. Edit all your source to mark strings for translation, with trans tags and gettext() function calls.
2. Run the makemessages script to extract all the marked strings into .po files, one for each language.
3. Have someone edit each .po file, entering translations for every string.
4. Compile the edited .po files into binary .mo files that will be used during execution.
5. Set the language for the app, and run it to see your awesome translated application!

I can do all of these steps myself except number 3. Step 3 is also the time-consuming one, and it will likely be done by someone else, far away from you: it’s the difficult part. poxx.py replaces step 3. It munges a .po file, creating synthetic “translations” for your strings. With it, you can see your application in a pseudo-translated state that shows you which strings you’ve properly marked for translation and which you haven’t yet marked. I use poxx.py to create a translation for the language “xx” (hence the name poxx.py), then set my application to use language “xx”.
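As a rough sketch, the automated part of that workflow could be driven from Python like this. The function names are hypothetical, the .po path follows the layout shown in the script's docstring, and the django-admin commands assume they are run from the project directory with Django installed.

```python
import subprocess

def pseudo_translate_commands(lang="xx", poxx="./poxx.py"):
    """Return the command lines for the extract, fake-translate, compile cycle."""
    po_file = "locale/%s/LC_MESSAGES/django.po" % lang
    return [
        ["django-admin", "makemessages", "-l", lang],  # extract marked strings
        [poxx, po_file],                               # fake the translation
        ["django-admin", "compilemessages"],           # compile .po into .mo
    ]

def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.check_call(argv)

if __name__ == "__main__":
    # Only print the plan here; call run_all() inside a real project.
    for argv in pseudo_translate_commands():
        print(" ".join(argv))
```

After running the three commands, setting LANGUAGE_CODE to 'xx' remains a manual step.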

What poxx.py does is create a “translation” by swapping the case of all the vowels. So where your English site shows “Please log in to comment,” your poxx’ed site will show “PlEAsE lOg In tO cOmmEnt.” You can still read the text, but the translated and the un-translated stand out from one another, all without need for an actual speaker of another language.
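The core transform fits in a few lines. This is a standalone sketch (the function name swap_vowel_case is mine, not part of poxx.py) that ignores HTML and placeholders:

```python
import re

def swap_vowel_case(text):
    """Return text with the case of every ASCII vowel flipped."""
    return re.sub(r"[aeiouAEIOU]", lambda m: m.group(0).swapcase(), text)

print(swap_vowel_case("Please log in to comment."))
# prints: PlEAsE lOg In tO cOmmEnt.
```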

Most of the complexity in poxx.py arises from the fact that the text in a .po file is not all human readable: HTML tags and data replacement tokens should be left alone. So it uses a simple HTML parser to find the pieces that will be displayed, and only munges them.
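To illustrate the placeholder handling on its own, here is a minimal sketch with hypothetical helper names. Splitting on a regex with a capturing group keeps the placeholders in the result list at the odd indexes, so only the even-indexed pieces get munged. It covers only the %(name)s style of token.

```python
import re

# The capturing group makes re.split keep the placeholders in its output.
PLACEHOLDER = re.compile(r"(%\(\w+\)s)")

def swap_vowel_case(text):
    """Flip the case of every ASCII vowel."""
    return re.sub(r"[aeiouAEIOU]", lambda m: m.group(0).swapcase(), text)

def munge_text(text):
    """Munge the readable text, leaving %(name)s placeholders untouched."""
    parts = PLACEHOLDER.split(text)
    # Even indexes are ordinary text, odd indexes are placeholders.
    return "".join(
        part if i % 2 else swap_vowel_case(part)
        for i, part in enumerate(parts)
    )

print(munge_text("Hello %(name)s, you have %(count)s messages"))
# prints: HEllO %(name)s, yOU hAvE %(count)s mEssAgEs
```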

It works great for me, and I hope you find it useful too. You’ll need polib as a prerequisite.

#!/usr/bin/env python
"""Munge a .po file so we English-bound can see what strings aren't marked 
for translation yet.

Run this with a .po file as an argument.  It will set the translated strings 
to be the same as the English, but with vowels in the wrong case:

    ./poxx.py locale/xx/LC_MESSAGES/django.po    

Then set LANGUAGE_CODE='xx' in settings.py, and you'll see wacky case for
translated strings, and normal case for strings that still need translating.

This code is in the public domain.

"""

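import re, sys
import polib    # from http://bitbucket.org/izi/polib

try:
    from html.parser import HTMLParser    # Python 3
except ImportError:
    from HTMLParser import HTMLParser     # Python 2

class HtmlAwareMessageMunger(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Python 3 unescapes &amp; and &#38; before handle_data sees them
        # unless convert_charrefs is off; keep them verbatim instead.
        self.convert_charrefs = False
        self.s = ""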

    def result(self):
        return self.s

    def xform(self, s):
        return re.sub("[aeiouAEIOU]", self.munge_vowel, s)

    def munge_vowel(self, v):
        v = v.group(0)
        if v.isupper():
            return v.lower()
        else:
            return v.upper()

    def handle_starttag(self, tag, attrs, closed=False):
        self.s += "<" + tag
        for name, val in attrs:
            self.s += " "
            self.s += name
            self.s += '="'
            if name in ['alt', 'title']:
                self.s += self.xform(val)
            else:
                self.s += val
            self.s += '"'
        if closed:
            self.s += " /"
        self.s += ">"

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs, closed=True)

    def handle_endtag(self, tag):
        self.s += "</" + tag + ">"

    def handle_data(self, data):
        # We don't want to munge placeholders, so split on them, keeping them
        # in the list, then xform every other token.
        toks = re.split(r"(%\(\w+\)s)", data)
        for i, tok in enumerate(toks):
            if i % 2:
                self.s += tok
            else:
                self.s += self.xform(tok)

    def handle_charref(self, name):
        self.s += "&#" + name + ";"

    def handle_entityref(self, name):
        self.s += "&" + name + ";"

def munge_one_file(fname):
    po = polib.pofile(fname)
    count = 0
    for entry in po:
        hamm = HtmlAwareMessageMunger()
        hamm.feed(entry.msgid)
        entry.msgstr = hamm.result()
        if 'fuzzy' in entry.flags:
            entry.flags.remove('fuzzy') # clear the fuzzy flag
        count += 1
    print("Munged %d messages in %s" % (count, fname))
    po.save()

if __name__ == "__main__":
    for fname in sys.argv[1:]:
        munge_one_file(fname)

Comments

Bradley Grainger 11:42 AM on 31 Dec 2010

I've done something similar in the past, but translated all input into "Fāķė Ĕńĝĺĩŝħ Ťėxŧ" (using characters from Latin-1 Supplement and Latin Extended-A). It's still fairly readable, but it has the side-effect of testing that Unicode is being handled correctly, and that combining marks above and below aren't being clipped.
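A minimal sketch of the accented-text idea described in this comment. The character table is my own choice of lookalikes from the Latin Extended-A block, not the commenter's, and it covers only a few lowercase letters. It also does nothing to protect HTML or placeholders.

```python
# Map a handful of ASCII letters to Latin Extended-A lookalikes
# (U+0100 to U+017F). Escapes are used so the source stays ASCII-only.
ACCENTED = str.maketrans(
    "aeioucgnst",
    "\u0101\u0117\u0129\u014d\u016b\u010b\u011d\u0144\u015d\u0167",
)

def accent(text):
    """Return text with the mapped letters replaced by accented lookalikes."""
    return text.translate(ACCENTED)

print(accent("Fake text"))
```

Letters without an entry in the table pass through unchanged, so the output stays readable while still exercising non-ASCII rendering.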

Richard Schwartz 11:48 AM on 31 Dec 2010

Years ago, at Wang Labs, our suggestion to developers was to create a "pig-latin" translation. English is more compact than most other languages, so the extra letters added to each word by the pig latin suffixes give you an idea of whether your UI will still look reasonable in translation.
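A rough sketch of a pig-latin munger in the same spirit. The rules here are a simplified version of my own: leading consonants move to the end followed by "ay", and vowel-initial words get "way". It ignores capitalization, HTML, and placeholders.

```python
import re

def pig_latin_word(match):
    """Convert one matched word to simplified pig latin."""
    word = match.group(0)
    head = re.match(r"[^aeiouAEIOU]*", word).group(0)  # leading consonants
    if not head:
        return word + "way"        # starts with a vowel
    if head == word:
        return word + "ay"         # no vowels at all
    return word[len(head):] + head + "ay"

def pig_latin(text):
    """Apply pig latin to every run of ASCII letters in text."""
    return re.sub(r"[A-Za-z]+", pig_latin_word, text)

original = "log in to comment"
print(pig_latin(original))
print(len(original), len(pig_latin(original)))
```

Every word grows by at least two characters, which is what makes this useful for spotting layouts that are too tight for longer translations.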

Ned Batchelder 11:49 AM on 31 Dec 2010

@Bradley: very nice. I pondered a few other xform ideas, and this is better than all of mine!

Seung Soo,Ha 6:12 PM on 31 Dec 2010

Wow, this is really nice!!

Some copyright and license information would be helpful too.

Ned Batchelder 9:59 AM on 1 Jan 2011

Kent Johnson 7:16 PM on 1 Jan 2011

IANAL but I think copyright and public domain are mutually exclusive. Perhaps you mean "copyright 2010 by Ned Batchelder, may be used for any purpose without restriction." Or better, "This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/"

Ned Batchelder 8:46 AM on 2 Jan 2011

@Kent, although YANAL, you make a good point. The code is in the public domain.

Michael Chermside 6:54 PM on 2 Jan 2011

Although I am not a lawyer and this is not legal advice, I HAVE spent a good deal of time reading up on copyright law. There is a "flaw" in the US copyright law such that it is extremely difficult to put something in the public domain even if you want to. (That is actually one reason why things like the creative commons licenses are useful.) Basically (and I may have the details of this wrong), for good reasons, the law says you can't give up your copyright without a contract, and for good reasons it says a contract isn't valid unless both parties receive something of value. The combination of these two means your contract to give up ownership of the code may not be valid.

Of course, all this is only meaningful in the imaginary world of legal arguments, not in the real world. Clearly you are trying to say anyone can use it, and I would personally feel confident using it without fear that you would sue me later. But in terms of best practices, it's legally better to use a license than say the words "public domain".

Maybe someday Congress will fix the flaws in our legal system. And I have no idea how any of this applies in any other jurisdiction.

F Wolff 4:53 AM on 3 Jan 2011

There is a tool doing exactly this with support for several translation formats, including Gettext PO files. It also has quite a few more possibilities. It does the extended Latin version that Bradley mentions, does one for boundaries checking, one for testing right-to-left functionality, and can even imitate the Swedish Chef :-)

Have a look at podebug:
http://translate.sourceforge.net/wiki/toolkit/podebug

Ned Batchelder 6:19 AM on 3 Jan 2011

@F.Wolff: very nice! The podebug tool indeed does exactly this, but times 20! Thanks for the pointer.

Christian Wyglendowski 9:31 PM on 4 Jan 2011

Like Richard, I used pig-latin for testing translation functionality in an application I used to work on. It worked quite well and was probably the only translation I was qualified to write.

Graham Fawcett 9:18 PM on 28 Mar 2011

Nice! One little problem with the script is that it doesn't poxxify the "msgid_plural" forms, if any, in your PO file. This seems to work, though it could be shortened:

    for entry in po:
        if entry.msgid_plural:
            hamm = HtmlAwareMessageMunger()
            hamm.feed(entry.msgid)
            entry.msgstr_plural['0'] = hamm.result()
            hamm = HtmlAwareMessageMunger()
            hamm.feed(entry.msgid_plural)
            entry.msgstr_plural['1'] = hamm.result()
        else:
            hamm = HtmlAwareMessageMunger()
            hamm.feed(entry.msgid)
            entry.msgstr = hamm.result()
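One possible way to shorten the loop body above is to pull the munging into a helper. This is only a sketch: munge_entry and plural_keys are hypothetical names, swap_vowel_case stands in for HtmlAwareMessageMunger, and a SimpleNamespace stands in for a polib entry so the example runs without polib. The string keys '0' and '1' are copied from the comment above; some polib versions may index msgstr_plural with integers instead, so check which your version uses.

```python
import re
from types import SimpleNamespace

def swap_vowel_case(text):
    """Stand-in for HtmlAwareMessageMunger: flip the case of ASCII vowels."""
    return re.sub(r"[aeiouAEIOU]", lambda m: m.group(0).swapcase(), text)

def munge_entry(entry, munge, plural_keys=("0", "1")):
    """Fill in fake translations for the singular and plural forms of entry."""
    if entry.msgid_plural:
        entry.msgstr_plural[plural_keys[0]] = munge(entry.msgid)
        entry.msgstr_plural[plural_keys[1]] = munge(entry.msgid_plural)
    else:
        entry.msgstr = munge(entry.msgid)

# A fake entry with the same attribute names that the loop above uses.
entry = SimpleNamespace(
    msgid="one comment", msgid_plural="many comments",
    msgstr="", msgstr_plural={},
)
munge_entry(entry, swap_vowel_case)
print(entry.msgstr_plural)
```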

Faked translations: poxx.py
