Thursday 15 March 2012 — This is almost 13 years old. Be careful.
Last week was PyCon 2012, and I had a blast as always. I gave a talk entitled Pragmatic Unicode, or, How Do I Stop the Pain?
I chose the topic because I thought it would appeal to many Python developers, and because I knew all about it. Turns out I didn’t! But it was great learning more details as I went. And then I filled in a few more tidbits by chatting with Martin v. Löwis at PyCon.
Part of the fun of this talk was finding the Unicode characters to decorate it with, and then building the credits slide at the end on the plane. It’s all built with Cog to avoid cut-and-paste nightmares. Look at the HTML source of the actual presentation if you’re interested in the Cog twistiness.
Of course, Unicode is a much bigger topic than this, but 25 minutes is what it is. Enjoy: the video, slides, and full text are there.
Comments
That was a great talk. Fast-moving, well-organized, and informative. Thanks so much!
One thing I wish you'd have addressed though: in your examples of unicode strings vs. byte strings (and converting between them), you had u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24". There's an "\xf8" in there with the "\uNNNN" codepoints/characters.
It appears that "\xf8" == "\u00f8".
When the Unicode string is UTF-8 encoded, the result contains a lot of \xNN bytes, but no \xf8.
The way I understand it, this codepoint is decimal 248, and the reason Python is using \xf8 is because it can. In retrospect, would it not have been better for Python to have used \u00f8, for the sake of consistency? It is a Unicode string after all.
I know this is beyond your control, just asking for your opinion.
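The observations in this comment can be checked directly. The following is a minimal sketch, assuming Python 3 (where every string literal is Unicode, so the `u` prefix from the talk's Python 2 example is not needed); the variable names `s` and `encoded` are only illustrative.

```python
# The string from the talk, written as a Python 3 str literal.
s = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# \xf8 and \u00f8 are two spellings of the same code point, U+00F8 (decimal 248).
assert "\xf8" == "\u00f8"
assert ord("\xf8") == 248

# Encoding to UTF-8 turns code points into bytes.
encoded = s.encode("utf-8")

# U+00F8 becomes the two bytes 0xC3 0xB8, so no 0xF8 byte appears in the result.
assert "\xf8".encode("utf-8") == b"\xc3\xb8"
assert b"\xf8" not in encoded

print(encoded)
```

In the string literal, \xf8 names a code point; in the encoded result, \xNN names a byte. The same escape syntax is used for both, which is where the confusion comes from.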
I am watching your talk on YouTube, and it is the best use of 36 minutes I can recall in a long while. Thanks.
PS: Related link:
https://docs.python.org/2/reference/lexical_analysis.html#string-literals
which shows all string escapes including those that are of the form \xhh .
Python chooses the shortest representation. Don't think of \x as meaning "a byte"; think of it as meaning "an integer value that fits in a byte."
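The shortest-representation rule can be seen in a small sketch. It assumes Python 3, where the built-in `ascii()` produces the escaped form that Python 2's `repr()` showed for Unicode strings; the helper name `shortest_escape` is made up for this illustration.

```python
def shortest_escape(ch):
    """Return the escape Python picks for one character, without the quotes."""
    # ascii() escapes non-ASCII characters using \x, \u, or \U,
    # whichever is the shortest form that can hold the code point.
    return ascii(ch)[1:-1]

# One sample per escape width: two, four, and eight hex digits.
for ch in ["\u00f8", "\u2119", "\U0001f40d"]:
    print(hex(ord(ch)), shortest_escape(ch))
```

A code point up to 0xff gets a two-digit \x escape, one up to 0xffff gets \u, and anything larger gets \U, so \xf8 here is just the compact spelling of U+00F8.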
> ... "an integer value that fits in a byte."
Yes, this helps. Thanks.