Pragmatic unicode

Thursday 15 March 2012

This is almost 13 years old. Be careful.

Last week was PyCon 2012, and I had a blast as always. I gave a talk entitled Pragmatic Unicode, or, How Do I Stop the Pain?

I chose the topic because I thought it would appeal to many Python developers, and because I knew all about it. Turns out I didn’t! But it was great learning more details as I went. And then I filled in a few more tidbits by chatting with Martin v. Löwis at PyCon.

Part of the fun of this talk was finding the Unicode characters to decorate it with, and then building the credits slide at the end on the plane. It’s all built with Cog to avoid cut-and-paste nightmares. Look at the HTML source of the actual presentation if you’re interested in the Cog twistiness.

Of course, Unicode is a much bigger topic than this, but 25 minutes is what it is. Enjoy: the video, slides, and full text are there.

Comments

[gravatar]
To make the Unicode sandwich work right in Python 2, wouldn't you need to convert all string literals to unicode?
[gravatar]
Thanks for your slides, they made me laugh a lot :-)
[gravatar]
@tolomea, strictly speaking, yes, you would. But if you know your literals are actually ASCII, as most of them are, then you can leave them alone, knowing that they'll convert properly with the default encoding.
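
A rough sketch of why that works. Python 2's default encoding is ASCII, and it applies that decode implicitly when byte strings and unicode strings are mixed. The code below is written in Python 3 syntax, where the decode has to be spelled out, so treat it as an illustration of the same idea rather than what Python 2 literally runs. The variable names are made up for the example.

```python
# An ASCII-only byte string decodes cleanly with the ASCII codec.
ascii_literal = b"Hello"
text = ascii_literal.decode("ascii")

# A byte string holding non-ASCII bytes (here, UTF-8 for "café") does not.
non_ascii_literal = b"caf\xc3\xa9"
try:
    non_ascii_literal.decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(text)        # the ASCII literal converted without trouble
print(decoded_ok)  # the non-ASCII literal did not
```

So the shortcut only holds for literals that really are pure ASCII; anything else should be a unicode literal from the start.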
[gravatar]
Ned,

That was a great talk. Fast-moving, well-organized, and informative. Thanks so much!

One thing I wish you'd have addressed though: in your examples of unicode strings vs. byte strings (and converting between them), you had u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24". There's an "\xf8" in there with the "\uNNNN" codepoints/characters.

It appears that "\xf8" == "\u00f8".

When the unicode string is UTF-8 encoded, the result contains a lot of \xNN bytes, but no \xf8.
[gravatar]
@John: \xf8 is an escape that inserts a hex f8 value into the string. In a unicode string, it is the same as \u00f8. When that string is encoded as UTF-8, there is no \xf8 because \u00f8 in UTF-8 is two bytes: \xc3\xb8. The only code points that are one byte in UTF-8 are the ASCII values, which are all below \x80.
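
Here is a small check of those claims, using the string from the talk. It is written in Python 3 syntax (no u prefix needed), and the variable names are just for illustration.

```python
# In a text string, \xf8 and \u00f8 are the same code point.
same_code_point = "\xf8" == "\u00f8"

# Encoded as UTF-8, U+00F8 becomes two bytes, neither of which is 0xf8.
o_slash_bytes = "\u00f8".encode("utf-8")

# The string from the talk: 9 code points, more bytes once encoded.
talk_string = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
talk_bytes = talk_string.encode("utf-8")

# Only the ASCII characters ("H", "i", " ") stay one byte each,
# and every one of those byte values is below 0x80.
single_byte_chars = [c for c in talk_string if len(c.encode("utf-8")) == 1]

print(same_code_point)
print(o_slash_bytes)
print(len(talk_string), len(talk_bytes))
print(b"\xf8" in talk_bytes)
print(single_byte_chars)
```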
[gravatar]
Ned, thanks for the clarification. :)
[gravatar]
Ned, re: \xf8

The way I understand it, this codepoint is decimal 248, and the reason Python is using \xf8 is because it can. In retrospect, would it not have been better for Python to have used \u00f8, for the sake of consistency? It is a Unicode string after all.

I know this is beyond your control, just asking for your opinion.

I am watching your talk on YouTube, and it is the best use of 36 minutes I can recall in a long while. Thanks.

PS: Related link:

https://docs.python.org/2/reference/lexical_analysis.html#string-literals

which shows all string escapes including those that are of the form \xhh .
[gravatar]
@Todd S: there are four possible ways to display code points in a Unicode string: literally, as an \x escape, as a \u escape, or as a \U escape. For example:
>>> u"A" == u"\x41" == u"\u0041" == u"\U00000041"
True
All four of those representations create the exact same string. We don't expect Python to show us capital A's as u"\u0041". So why expect it to show U+00F8 as u"\u00F8" when it can use u"\xf8" instead?

Python chooses to use the shortest representation. Don't think of \x as meaning "a byte", think of it as meaning, "an integer value that fits in a byte."
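
To see the shortest-escape rule in action, here is a sketch in Python 3 syntax. One caveat: in Python 3, repr() shows printable non-ASCII characters literally, so the example uses the built-in ascii(), which is the closest match to what Python 2's repr() did for unicode strings. The dictionary name and labels are just for the example.

```python
# All four spellings build the same one-character string.
spellings = ["A", "\x41", "\u0041", "\U00000041"]
all_same = all(s == "A" for s in spellings)

# ascii() escapes each non-ASCII code point with the shortest form that fits:
# \x for values up to 0xff, \u up to 0xffff, \U beyond that.
shown = {
    "U+0041": ascii("A"),
    "U+00F8": ascii("\u00f8"),
    "U+2119": ascii("\u2119"),
    "U+1F40D": ascii("\U0001f40d"),
}

print(all_same)
for label, escaped in shown.items():
    print(label, escaped)
```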
[gravatar]
Ned, you are right, in your talk, the "Hi " part of the string is shown literally.

> ... "an integer value that fits in a byte."

Yes, this helps. Thanks.
