I’m working on projects for Threepress, and they have a good, extensive test suite. I was surprised when a test failed on Ubuntu that had always passed on their Macs.
The test in question was trying to open a file by name. No big deal, right? Well, in this case, the filename had an accented character, so it was a big deal. Getting to the bottom of it, I learned some new things about Python and Unicode.
On the disk is a file named lé.txt. On the Mac, this file can be opened by name; on Ubuntu, it cannot. Looking into it, the filename we're using and the filename the file actually has on disk are different:
>>> fname = u"l\u00e9.txt".encode('utf8')
On the Mac, that filename will open that file:
>>> open(fname)
<open file 'lé.txt', mode 'r' at 0x1004250c0>
On Ubuntu, not so much:
>>> open(fname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'l\xc3\xa9.txt'
What's with the two different strings that seem to both represent the same text? Wasn't Unicode supposed to get us out of character set hell by having everyone agree on how to store text? Turns out it doesn't make everything simple: there are still multiple ways to store one string.
In this case, the accented é is represented as two different UTF-8 strings: both as ‘\xc3\xa9’ and as ‘e\xcc\x81’. In pure Unicode terms, the first is a single code point, U+00E9, or LATIN SMALL LETTER E WITH ACUTE. The second is two code points: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT). Turns out Unicode has both a single combined code point for accented e, and also two code points that together can mean accented é.
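To see the two spellings side by side, here is a small sketch. The `describe` helper is my own name for illustration, not something from the standard library, and the code is written to run on either Python 2.6+ or Python 3:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

composed = u"\u00e9"      # one code point
decomposed = u"e\u0301"   # base letter plus combining accent

print(describe(composed))
print(describe(decomposed))

# The UTF-8 encodings match the two byte strings discussed above.
print(composed.encode("utf-8") == b"\xc3\xa9")       # True
print(decomposed.encode("utf-8") == b"e\xcc\x81")    # True

# As plain strings, the two spellings do not compare equal.
print(composed == decomposed)                        # False
```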
This demonstrates a complicated Unicode concept known as equivalence and normalization. Unicode defines complex rules under which our two strings, although they are different sequences of code points, count as “equivalent”.
On the Mac, trying to open the file with either string works. On Ubuntu, you have to use the same form that is stored on disk. So to open the file reliably, we have to try a number of different Unicode normalization forms.
Python provides the unicodedata.normalize function which can perform the normalizations for us:
>>> import unicodedata
>>> fname = u"l\u00e9.txt"
>>> unicodedata.normalize("NFD", fname)
u'le\u0301.txt'
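Normalization also gives a way to check whether two strings are equivalent: normalize both to the same form and compare the results. A minimal sketch, where `equivalent` is a hypothetical helper name and NFC is an arbitrary choice of form:

```python
import unicodedata

def equivalent(a, b, form="NFC"):
    """True if a and b are equal once both are normalized to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = u"l\u00e9.txt"
decomposed = u"le\u0301.txt"

print(composed == decomposed)            # False
print(equivalent(composed, decomposed))  # True
```

On Python 2, both arguments need to be unicode objects rather than byte strings, since `unicodedata.normalize` only accepts unicode.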
Unfortunately, you can't be sure which normalization form a filename is stored in. The Mac likes to create them in decomposed form, but Ubuntu seems to prefer composed form. It seems a fool-proof file opener would need to try all four normalization forms (NFD, NFC, NFKD, NFKC) to be sure of opening a file with non-ASCII characters in its name, but that also seems like a huge pain. Is it really true I have to jump through those hoops to open these files?
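For what it's worth, here is roughly what those hoops might look like. This is only a sketch of the try-every-form idea, not a recommendation: `candidate_names` and `open_any_form` are names I made up, the name is assumed to be a unicode string, and only "no such file" errors are treated as a reason to try the next form.

```python
import errno
import os
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def candidate_names(name):
    """Return name followed by its distinct normalized variants."""
    candidates = [name]
    for form in FORMS:
        variant = unicodedata.normalize(form, name)
        if variant not in candidates:
            candidates.append(variant)
    return candidates

def open_any_form(directory, name, mode="rb"):
    """Open name inside directory, trying each normalized spelling in turn."""
    last_error = None
    for candidate in candidate_names(name):
        try:
            return open(os.path.join(directory, candidate), mode)
        except (IOError, OSError) as exc:
            # Only a missing file justifies trying another spelling.
            if exc.errno != errno.ENOENT:
                raise
            last_error = exc
    raise last_error
```

For lé.txt this only produces two distinct spellings, the composed and decomposed ones, because the compatibility forms (NFKC, NFKD) give the same results as NFC and NFD for this name.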