Pragmatic unicode

Thursday 15 March 2012 — This is over 13 years old. Be careful.

Last week was PyCon 2012, and I had a blast as always. I gave a talk entitled Pragmatic Unicode, or, How Do I Stop the Pain?

I chose the topic because I thought it would appeal to many Python developers, and because I knew all about it. Turns out I didn’t! But it was great learning more details as I went. And then I filled in a few more tidbits by chatting with Martin v. Löwis at PyCon.

Part of the fun of this talk was finding the Unicode characters to decorate it with, and then building the credits slide at the end on the plane. It’s all built with Cog to avoid cut-and-paste nightmares. Look at the HTML source of the actual presentation if you’re interested in the Cog twistiness.

Of course, Unicode is a much bigger topic than this, but 25 minutes is what it is. Enjoy: the video, slides, and full text are all there.

Comments

tolomea 11:29 PM on 15 Mar 2012

To make the Unicode sandwich work right in Python 2, wouldn't you need to convert all string literals to unicode?

Alex Garel 4:00 AM on 16 Mar 2012

Thanks for your slides, they made me laugh a lot :-)

Ned Batchelder 6:57 AM on 16 Mar 2012

@tolomea, strictly speaking, yes, you would. But if you know your literals are actually ASCII, as most of them are, then you can leave them alone, knowing that they'll convert properly with the default encoding.
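
A rough sketch of why ASCII-only literals are safe, written for Python 3 as an assumption (the thread is about Python 2). Python 2 converted byte strings to unicode implicitly with the ASCII codec; Python 3 does no implicit conversion, so the decode is spelled out by hand here. The variable names are only illustrative.

```python
# Python 2's implicit str -> unicode conversion used the ASCII codec by default.
# Python 3 has no implicit conversion, so the decode is written out explicitly.

ascii_literal = b"Hi there"            # a byte string containing only ASCII
text = ascii_literal.decode("ascii")   # roughly what Python 2 did implicitly

# Bytes outside ASCII cannot be decoded with the ASCII codec.
non_ascii = u"\xf8".encode("utf-8")    # two bytes, neither of them ASCII
try:
    non_ascii.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

Under these assumptions, the ASCII literal converts cleanly and the non-ASCII bytes do not, which is the case where explicit unicode literals matter.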

John 2:09 PM on 20 Mar 2012

Ned,

That was a great talk. Fast-moving, well-organized, and informative. Thanks so much!

One thing I wish you'd have addressed though: in your examples of unicode strings vs. byte strings (and converting between them), you had u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24". There's an "\xf8" in there with the "\uNNNN" codepoints/characters.

It appears that "\xf8" == "\u00f8".

When the unicode string is utf-8 encoded, there results a lot of \xNN bytes --- but no \xf8.

Ned Batchelder 3:15 PM on 20 Mar 2012

@John: \xf8 is an escape that inserts a hex f8 value into the string. In a unicode string, it is the same as \u00f8. When that string is encoded as UTF-8, there is no \xf8 because \u00f8 in UTF-8 is two bytes: \xc3\xb8. The only code points that are one byte in UTF-8 are the ASCII values, which are all below \x80.
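
A small sketch of the same point, using the string from the talk. It is written to run on Python 3, where the u prefix is still accepted; the variable names are only illustrative.

```python
text = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# In a unicode string, \xf8 and \u00f8 name the same code point.
same_code_point = (u"\xf8" == u"\u00f8")

# Encoding to UTF-8 turns U+00F8 into two bytes, so no f8 byte appears.
encoded = text.encode("utf-8")
o_slash_bytes = u"\xf8".encode("utf-8")
has_f8_byte = b"\xf8" in encoded

# Only code points below 0x80 (ASCII) encode to a single UTF-8 byte.
ascii_lengths = {len(chr(cp).encode("utf-8")) for cp in range(0x80)}
first_non_ascii_length = len(chr(0x80).encode("utf-8"))
```

Here o_slash_bytes is b'\xc3\xb8', has_f8_byte is False, and every ASCII code point encodes to exactly one byte while U+0080 already needs two.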

John 9:53 AM on 21 Mar 2012

Ned, thanks for the clarification. :)

Jarno Virtanen 8:16 AM on 18 Apr 2012

Great talk, thanks!

Todd S. 6:06 PM on 29 Aug 2014

Ned, re: \xf8

The way I understand it, this codepoint is decimal 248, and the reason Python is using \xf8 is because it can. In retrospect, would it not have been better for Python to have used \u00f8, for the sake of consistency? It is a Unicode string after all.

I know this is beyond your control, just asking for your opinion.

I am watching your talk on YouTube, and it is the best use of 36 minutes I can recall in a long while. Thanks.

PS: Related link:

https://docs.python.org/2/reference/lexical_analysis.html#string-literals

which shows all string escapes including those that are of the form \xhh .

Ned Batchelder 12:46 PM on 30 Aug 2014

@Todd S: there are four possible ways to display code points in a Unicode string: literally, as an \x escape, as a \u escape, or as a \U escape. For example:

>>> u"A" == u"\x41" == u"\u0041" == u"\U00000041"
True

All four of those representations create the exact same string. We don't expect Python to show us capital A's as u"\u0041". So why expect it to show U+00F8 as u"\u00F8" when it can use u"\xf8" instead?

Python chooses to use the shortest representation. Don't think of \x as meaning "a byte", think of it as meaning, "an integer value that fits in a byte."
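
A sketch of the shortest-escape behavior, with one caveat: the discussion above is about Python 2's repr. Python 3's repr shows printable non-ASCII characters literally, so this example uses the built-in ascii() function, which escapes non-ASCII characters in much the same way Python 2's repr did. The sample characters are my own choices.

```python
# All four spellings build the same one-character string.
same = u"A" == u"\x41" == u"\u0041" == u"\U00000041"

# ascii() picks the shortest escape that can hold each code point:
# literal for ASCII, \x up to U+00FF, \u up to U+FFFF, \U beyond that.
shown = [ascii(s) for s in (u"A", u"\xf8", u"\u2119", u"\U0001f40d")]

for s in shown:
    print(s)
```

The four printed lines use a literal, an \x escape, a \u escape, and a \U escape respectively.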

Todd S. 12:54 PM on 30 Aug 2014

Ned, you are right, in your talk, the "Hi " part of the string is shown literally.

> ... "an integer value that fits in a byte."

Yes, this helps. Thanks.
