Fixing broken Unicode

Tuesday 21 August 2012

In Pragmatic Unicode, or, How Do I Stop the Pain?, I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.

It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren't a valid sequence of bytes at all. We've all seen web pages with A-hats on them:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős
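
As a rough illustration of where the A-hats come from, here is a minimal sketch that assumes the most common cause: UTF-8 bytes that some upstream component decoded as Windows-1252. The variable names are mine, and real pages can be damaged in other ways.

```python
# Assumption: the text was correctly encoded as UTF-8, then wrongly
# decoded as Windows-1252 (cp1252) somewhere upstream.
original = "If numbers aren\u2019t beautiful, I don\u2019t know what is. \u2013Paul Erd\u0151s"

utf8_bytes = original.encode("utf-8")   # what the author actually sent
garbled = utf8_bytes.decode("cp1252")   # what the careless reader produced

print(garbled)
# If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
```

Each curly quote is three bytes in UTF-8, and each of those bytes turns into its own Windows-1252 character, which is why one apostrophe becomes three characters of junk.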

Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post Fixing common Unicode mistakes with Python — after they’ve been made explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.
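
To give a feel for the idea, here is a toy sketch of a single repair heuristic. It is not Rob's fix_bad_unicode, which handles far more cases; naive_fix is a hypothetical name, and it assumes the only damage is one round of UTF-8 misread as Windows-1252.

```python
def naive_fix(text: str) -> str:
    """Hypothetical helper: undo one round of UTF-8-read-as-cp1252 damage.

    If the text cannot be reversed cleanly, assume it was fine and
    return it unchanged.
    """
    try:
        # Turn the characters back into the bytes they came from,
        # then decode those bytes the way they were meant to be read.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # this text does not match the assumed kind of damage.
        return text


print(naive_fix("don\u00e2\u20ac\u2122t"))  # damaged input: repaired
print(naive_fix("caf\u00e9"))               # clean input: left alone
```

The try/except is the whole heuristic: correctly decoded text rarely survives the round trip, so a successful round trip is taken as evidence of damage. A real fixer needs more care than this, since some legitimate strings would round-trip too.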

This is the kind of code I'm not sure I would have attempted, given the "impossibility" of the task. Bravo to Rob for taking it on.


Adam Parkin 10:35 AM on 23 Aug 2012

Interesting, but how applicable is his work to Python 3 given that all strings in Python 3 are Unicode already?

Ned Batchelder 11:40 AM on 23 Aug 2012

@Adam, the point here is that he's dealing with data which is wrong by virtue of having been mishandled somewhere else. Python 3 doesn't fix that.
