Localization is a bitch

Sunday 6 June 2010

The latest product release at work seems to mostly have been about localization: the site is available in 26 languages. Django provides good tools to manage the translations, but there’s still an awful lot of detail to attend to.

  • All the different country and language codes are easy to get wrong. Did you know that Slovenia’s country code is SI, but Slovenian’s language code is sl?
  • Language codes get more complicated when dealing with Chinese, which comes in both traditional and simplified forms. Do we use zh-tw, or zh_TW, or zh-hant for traditional?
  • Every visible string has to be marked with a {% trans %} tag, and you have to do it in a way that the translators can make use of. Sentences can’t be broken into fragments: if you have data to plug into a sentence, the slot has to be part of a single string to translate, with code to plug in the value:
    {% trans "There are" %} {{n_things}} {% trans "things here." %} {# BAD #}
    {% trans "There are <span>10</span> things here." %}  {# GOOD, needs code #}
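The same one-string-with-a-slot rule applies to strings marked in Python code. Here's a minimal sketch using the standard library's gettext module, with no catalog loaded so the English text passes through unchanged. The function name things_message and the placeholder name count are made up for illustration:

```python
import gettext

# NullTranslations returns each message unchanged; a real site would load
# a compiled catalog for the active language instead.
translations = gettext.NullTranslations()
_ = translations.gettext
ngettext = translations.ngettext


def things_message(count):
    """Build the sentence from one translatable string with a named slot."""
    # ngettext lets the translator supply the plural forms the language needs.
    template = ngettext(
        "There is %(count)d thing here.",
        "There are %(count)d things here.",
        count,
    )
    # The value is plugged in only after the whole sentence is translated.
    return template % {"count": count}
```

Because the placeholder is named, a translator can move it to wherever it belongs in the translated sentence.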

On top of these sorts of technical issues, there are huge coordination problems to be solved. You have to add an extra step to your march toward shipping, which is waiting for the translations to come in.

And typically, there are last-minute copy tweaks that “have to” go live, and there’s no time left to get them localized. We ended up sprinkling conditionals into the site, so that English-speaking users would see updated copy, and others would see the older, translated copy:

{% ifequal LANGUAGE_CODE "en" %}
    {# We think this is a better way to put it.. #}
    {% trans "The product is awesome!" %}
{% else %}
    {% trans "The product may be satisfactory." %}
{% endifequal %}

The plan was that at the next translation pass, all the new text would be translated, and we could remove all these conditionals (about a dozen).

Of course, once we had those tweaks in place, it turned out there were ultra-high-priority text changes that had to happen as soon as possible, and that had to be translated into all 26 languages in some expensive expedited way. Once we made those emergency changes, and ran “messages make” to pull out all the strings, we of course also pulled in all the low-priority conditionalized tweak strings. We didn’t want to clog up the double-secret-probation translation pass with anything extra, so we hacked some more.

We added a simple no-op template tag called english_trans:

# Assumes register = template.Library() earlier in this templatetags module.
@register.simple_tag
def english_trans(s):
    return s

and then could change our English-only tweak to:

{% ifequal LANGUAGE_CODE "en" %}
    {# We think this is a better way to put it.. #}
    {% english_trans "The product is awesome!" %}
{% else %}
    {% trans "The product may be satisfactory." %}
{% endifequal %}

Now when we extract the strings from the source, only the emergency changes need to be translated, because the string extractor doesn’t recognize english_trans as an indicator of a translatable string. When we next do a full translation pass, we can remove the English customization and change english_trans back to trans, and hopefully be out of the craziness for a while...
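To see why the trick works, here's a toy version of string extraction. It is not Django's real extractor, just a regular expression that assumes translatable strings always look like {% trans "..." %}. The names TRANS_TAG and extract_strings are made up for this sketch:

```python
import re

# Matches only the literal tag name "trans" followed by a double-quoted string.
TRANS_TAG = re.compile(r'\{%\s*trans\s+"([^"]*)"\s*%\}')


def extract_strings(template_source):
    """Return the strings a trans-only extractor would send to translators."""
    return TRANS_TAG.findall(template_source)


template_source = '''
{% ifequal LANGUAGE_CODE "en" %}
    {% english_trans "The product is awesome!" %}
{% else %}
    {% trans "The product may be satisfactory." %}
{% endifequal %}
'''
```

Running extract_strings(template_source) picks up only the string inside the real trans tag, so the English-only tweak never reaches the translators.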

And we haven’t even attempted the right-to-left languages...


{% trans "There are 10 things here." %}
is the wrong way to do it.
What you want is the blocktrans-tag that will give you placeholders:

I think it's fine that language codes don't match the code of the country where they were invented. After all, most (if not all) languages are spoken in two or more countries, and in some situations there may be more native speakers outside of the origin country.

I agree that locales should be more consistent. It's crazy that they can be written as language-country, language_country, language_COUNTRY, etc. However, zh-tw and zh-hant are not the same thing -- only the latter means "traditional Chinese".

I also agree that it's nasty to let translators deal with code... But unfortunately there's no way around that :(

The other problem you're talking about is just one of the limitations in Django, though. I've done internationalization and localization with other Python and PHP tools, and Django is by far the worst ever -- That's the reason why you've had to come up with those ugly workarounds.

Gettext supports something called "fuzzy translations", which Django doesn't support. In your example, "This product may be satisfactory" could have been translated into Castilian as "Este producto puede ser satisfactorio" and if that phrase is changed to "This product is awesome", Gettext might have been able to tell that it's a newer version of an existing message. If Gettext doesn't detect that automatically, you can always mark that as fuzzy by hand quite easily.

That way, "This product is awesome" would be translated to "Este producto puede ser satisfactorio" if there's no translation for the new message. Later, the Castilian translator will update the translation to "¡Este producto es buenísimo!" and mark it as not fuzzy.

You may want to try Babel and its Django plugin instead.

I very much want to see your "@register.simple_tag" class/function. I've written a bunch of factory classes (several of which can be found on my past PyCon talk slides), and none of them exceeds 20 lines (including docstrings). But they all do slightly different stuff.

@Stefan, you are right, blocktrans with substitution is the right way to go. We do use that; I was going too fast when writing the post and didn't get the right server-side technique! Thanks for the improvement.

The troubles you had are why I wrote Dragoman. It is a really simple translation service and has been getting a good deal of usage where I work. Just FYI in case it helps.

