Fixing broken Unicode

Tuesday 21 August 2012 — This is 13 years old. Be careful.

In Pragmatic Unicode, or, How Do I Stop the Pain?, I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.

It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren’t a valid sequence of bytes at all. We’ve all seen web pages with A-hats on them:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
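As a rough illustration of how text ends up like this, here is a minimal sketch that assumes the most common cause: UTF-8 bytes decoded by some upstream component as Windows-1252. The helper name mangle is made up for this example, and real-world damage can come from other encoding mix-ups too.

```python
def mangle(text):
    """Hypothetical helper: simulate an upstream component that reads
    UTF-8 bytes as if they were Windows-1252 (an assumed failure mode)."""
    utf8_bytes = text.encode("utf-8")   # the correct bytes on the wire
    return utf8_bytes.decode("cp1252")  # the wrong decoding, one char per byte


# U+2019 is the curly apostrophe; its three UTF-8 bytes become three characters.
print(mangle("aren\u2019t"))

# U+0151 is the o with double acute in "Erdos"; its two bytes become two characters.
print(mangle("Erd\u0151s"))
```

Each non-ASCII character turns into two or three characters, which is why the damage is so recognizable: the curly apostrophe becomes an a-circumflex, a euro sign, and a trademark sign.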

Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post Fixing common Unicode mistakes with Python â€” after they’ve been made explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.
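To give a feel for the core idea, here is a deliberately naive sketch of undoing one round of that damage. This is not Rob's fix_bad_unicode: the name naive_unmangle is hypothetical, it assumes the text was UTF-8 misread as Windows-1252, and it has none of the heuristics that decide whether a repair is actually plausible.

```python
def naive_unmangle(text):
    """Hypothetical helper: try to reverse UTF-8-read-as-Windows-1252.

    If the round trip fails, assume the text was not mangled in this
    particular way and return it unchanged.
    """
    try:
        # Recover the original bytes, then decode them the way they
        # should have been decoded in the first place.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


# The mangled apostrophes from the quote above, written with escapes:
# \u00e2 \u20ac \u2122 are a-circumflex, euro sign, trademark sign.
garbled = (
    "If numbers aren\u00e2\u20ac\u2122t beautiful, "
    "I don\u00e2\u20ac\u2122t know what is."
)
print(naive_unmangle(garbled))

# Correct text that merely contains non-ASCII characters is left alone,
# because its Windows-1252 bytes are not valid UTF-8.
print(naive_unmangle("caf\u00e9"))
```

The hard part, and the reason heuristics are needed, is text where this round trip happens to succeed even though nothing was wrong, or where only part of the string was damaged.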

This is the kind of code I’m not sure I would have attempted, given the “impossibility” of the task. Bravo to Rob for taking it on.

Comments

Adam Parkin 10:35 AM on 23 Aug 2012

Interesting, but how applicable is his work to Python 3 given that all strings in Python 3 are Unicode already?

Ned Batchelder 11:40 AM on 23 Aug 2012

@Adam, the point here is that he's dealing with data which is wrong by virtue of having been mishandled somewhere else. Python 3 doesn't fix that.

Spark 10:52 AM on 26 Oct 2014

Nice
