Last week was PyCon 2012, and I had a blast as always. I gave a talk entitled Pragmatic Unicode, or, How Do I Stop the Pain?

I chose the topic because I thought it would appeal to many Python developers, and because I knew all about it. Turns out I didn't! But it was great learning more details as I went. And then I filled in a few more tidbits by chatting with Martin v. Löwis at PyCon.

Part of the fun of this talk was finding the Unicode characters to decorate it with, and then building the credits slide at the end on the plane. It's all built with Cog to avoid cut-and-paste nightmares. Look at the HTML source of the actual presentation if you're interested in the Cog twistiness.

Of course, Unicode is a much bigger topic than this, but 25 minutes is what it is. Enjoy: the video, slides, and full text are all there.

tolomea 11:29 PM on 15 Mar 2012

To make the unicode sandwich work right in python 2 wouldn't you need to convert all string literals to unicode?

Alex Garel 4:00 AM on 16 Mar 2012

Thanks for your slides, they made me laugh a lot :-)

Ned Batchelder 6:57 AM on 16 Mar 2012

@tolomea, strictly speaking, yes, you would. But if you know your literals are actually ASCII, as most of them are, then you can leave them alone, knowing that they'll convert properly with the default encoding.
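
A rough sketch of why ASCII-only literals are safe. Python 2 performs this conversion implicitly, using the ASCII default encoding, whenever a byte string meets a unicode string. The hypothetical `combine` helper below spells the decode out explicitly so the same code also runs on Python 3; it is an illustration, not code from the talk.

```python
def combine(unicode_text, byte_literal):
    """Mimic Python 2's implicit coercion: decode the bytes as ASCII, then join."""
    return unicode_text + byte_literal.decode("ascii")

# A pure-ASCII byte literal converts cleanly.
ok = combine(u"caf\xe9 ", b"menu")

# A byte literal holding non-ASCII (here, UTF-8 encoded) data does not.
try:
    combine(u"caf\xe9 ", b"cr\xc3\xa8me")
    failed = False
except UnicodeDecodeError:
    failed = True
```

So leaving literals alone only works as long as they really are ASCII; anything else should be written as a unicode literal.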

John 2:09 PM on 20 Mar 2012


That was a great talk. Fast-moving, well-organized, and informative. Thanks so much!

One thing I wish you'd have addressed though: in your examples of unicode strings vs. byte strings (and converting between them), you had u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24". There's an "\xf8" in there with the "\uNNNN" codepoints/characters.

It appears that "\xf8" == "\u00f8".

When the unicode string is encoded as UTF-8, the result has a lot of \xNN bytes, but no \xf8.

Ned Batchelder 3:15 PM on 20 Mar 2012

@John: \xf8 is an escape that inserts a hex f8 value into the string. In a unicode string, it is the same as \u00f8. When that string is encoded as UTF-8, there is no \xf8 because \u00f8 in UTF-8 is two bytes: \xc3\xb8. The only code points that are one byte in UTF-8 are the ASCII values, which are all below \x80.
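
For anyone who wants to check this at the prompt, here is a small sketch using the string from John's question. The variable names are only for illustration, and the u"" prefix is there so it runs on Python 2 as well as Python 3.3+.

```python
# The example string from the talk.
text = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# In a unicode string, \xf8 and \u00f8 name the same code point, U+00F8.
same_code_point = (u"\xf8" == u"\u00f8")

# Encoded as UTF-8, U+00F8 becomes two bytes, neither of which is 0xF8.
o_slash_bytes = u"\u00f8".encode("utf-8")

# Bytes needed per character: the ASCII prefix "Hi " versus the rest.
ascii_lengths = [len(ch.encode("utf-8")) for ch in text[:3]]
non_ascii_lengths = [len(ch.encode("utf-8")) for ch in text[3:]]

# The whole string as UTF-8 bytes, for inspection.
encoded = text.encode("utf-8")
```

Here `o_slash_bytes` comes out as the two bytes \xc3\xb8, every character in "Hi " takes one byte, and every character after it takes two or three.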

John 9:53 AM on 21 Mar 2012

Ned, thanks for the clarification. :)

Jarno Virtanen 8:16 AM on 18 Apr 2012

Great talk, thanks!
