In "Pragmatic Unicode, or, How Do I Stop the Pain?", I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.

It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren't a valid sequence of bytes at all. We've all seen web pages with A-hats on them:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
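A common way for text to pick up A-hats is for UTF-8 bytes to be decoded with the wrong codec somewhere upstream. Here is a minimal sketch of that, assuming the wrong codec is Windows-1252 (cp1252); real pipelines can mangle text in other ways too.

```python
# Assumption: the upstream component decoded UTF-8 bytes as Windows-1252.
original = "If numbers aren’t beautiful, I don’t know what is. –Paul Erdős"

utf8_bytes = original.encode("utf-8")   # the correct bytes on the wire
garbled = utf8_bytes.decode("cp1252")   # decoded with the wrong codec

# Each curly quote has become three characters, starting with an A-hat.
assert garbled != original
assert len(garbled) > len(original)

# In this simple case the damage is reversible: undo the wrong decode,
# then decode the bytes the way they should have been decoded.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == original
```

The round trip only works here because every byte involved happens to map to a character in cp1252. Text that has been mangled more than once, or by a different codec, needs more care.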

Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post "Fixing common Unicode mistakes with Python — after they’ve been made" explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.
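To give a feel for the simplest possible heuristic, here is a rough sketch. It is not Rob's code: naive_fix_mojibake is a hypothetical name, and the sketch only handles the single case of UTF-8 text that was wrongly decoded as Windows-1252. It leaves the text alone unless the reverse trip succeeds cleanly.

```python
def naive_fix_mojibake(text):
    """Hypothetical helper: try to undo one UTF-8-read-as-cp1252 mistake.

    Returns the text unchanged if it doesn't look like that kind of damage.
    """
    try:
        # If the text holds characters cp1252 can't represent, it was not
        # produced by a cp1252 decode, so there is nothing to undo.
        raw = text.encode("cp1252")
    except UnicodeEncodeError:
        return text
    try:
        # If the bytes aren't valid UTF-8, the text was probably fine as it was.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return text


print(naive_fix_mojibake("donâ€™t") == "don’t")   # damaged text is repaired
print(naive_fix_mojibake("café") == "café")       # good text is left alone
```

A check this simple can be fooled. Text that really is meant to contain a sequence like "Ã©" would be "fixed" into something the author never wrote, which is why a real solution needs more than one heuristic.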

This is the kind of code I'm not sure I would have attempted, given the "impossibility" of the task. Bravo to Rob for taking it on.

Comments

Adam Parkin 10:35 AM on 23 Aug 2012

Interesting, but how applicable is his work to Python 3 given that all strings in Python 3 are Unicode already?

Ned Batchelder 11:40 AM on 23 Aug 2012

@Adam, the point here is that he's dealing with data which is wrong by virtue of having been mishandled somewhere else. Python 3 doesn't fix that.
