Fixing broken Unicode

Tuesday 21 August 2012

In Pragmatic Unicode, or, How Do I Stop the Pain?, I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.

It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren’t a valid sequence of bytes at all. We’ve all seen web pages with A-hats on them:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
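
One common way a quote like this picks up A-hats: its UTF-8 bytes get decoded as Windows-1252 somewhere upstream. Here's a minimal sketch of that one mix-up, assuming cp1252 is the wrong codec involved (other mismatches produce different debris):

```python
# The intended text, written with escapes so this file's own encoding
# can't interfere: U+2019 is the curly apostrophe, U+2013 the en dash,
# U+0151 the o with double acute.
original = (
    "If numbers aren\u2019t beautiful, I don\u2019t know what is. "
    "\u2013Paul Erd\u0151s"
)

# Step 1: some component correctly encodes the text as UTF-8.
utf8_bytes = original.encode("utf-8")

# Step 2 (the assumed mistake): another component decodes those bytes
# as Windows-1252, turning each multi-byte character into two or three
# unrelated characters, often starting with an accented A.
garbled = utf8_bytes.decode("cp1252")

print(garbled)
```

Each non-ASCII character becomes several characters, and the first of them is usually an accented capital A, which is where the A-hats come from.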

Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post Fixing common Unicode mistakes with Python — after they’ve been made explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.
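
To give a feel for what such a repair involves, here's a minimal sketch that reverses just one mistake: UTF-8 bytes that were decoded as Windows-1252. This is not Rob's code. The function name `undo_utf8_as_cp1252` is made up for illustration, and it has none of his heuristics for deciding whether a repair is actually an improvement.

```python
def undo_utf8_as_cp1252(text: str) -> str:
    """Hypothetical helper: reverse one specific mix-up, or return text unchanged.

    Assumes the damage, if any, came from decoding UTF-8 bytes as cp1252.
    """
    try:
        # Recover the bytes the wrong decoder saw, then decode them properly.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError (text has characters cp1252 can't
        # represent) and UnicodeDecodeError (the bytes aren't valid UTF-8).
        # Either way this wasn't the mix-up we know how to undo.
        return text


# Garbled input is repaired...
print(undo_utf8_as_cp1252("Erd\u00c5\u2018s"))
# ...while text that was fine to begin with is left alone.
print(undo_utf8_as_cp1252("caf\u00e9"))
```

The try/except does most of the work here: correctly decoded text with accents rarely happens to re-encode into valid UTF-8, so it falls through unchanged. Rarely isn't never, though, which is why a real fixer needs heuristics on top of the round trip.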

This is the kind of code I’m not sure I would have attempted, given the “impossibility” of the task. Bravo to Rob for taking it on.

Comments

Interesting, but how applicable is his work to Python 3 given that all strings in Python 3 are Unicode already?

@Adam, the point here is that he's dealing with data which is wrong by virtue of having been mishandled somewhere else. Python 3 doesn't fix that.