Ned Batchelder
Fixing broken Unicode
August 2012
In "Pragmatic Unicode, or, How Do I Stop the Pain?", I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.
It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren't a valid sequence of bytes at all. We've all seen web pages with A-hats (stray "Â" characters) on them.
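To make the A-hat failure concrete, here's a minimal Python 3 sketch of one common way it happens: text is encoded as UTF-8, then some upstream component decodes those bytes as Latin-1. The sample string and variable names are mine, chosen for illustration, and real pages can be damaged in other ways too.

```python
# Illustrative only: one common way an "A-hat" shows up.
original = "\u00a9 2012"                 # "© 2012"

# The text is correctly encoded as UTF-8...
utf8_bytes = original.encode("utf-8")    # b'\xc2\xa9 2012'

# ...but an upstream component wrongly decodes the bytes as Latin-1.
garbled = utf8_bytes.decode("latin-1")   # 'Â© 2012'

# When you know exactly which mistake was made, it can be reversed.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The two bytes UTF-8 uses for the copyright sign each become their own character under Latin-1, and the first of them is the A-hat. The repair on the last line only works because we already know which wrong decoding was applied, which is exactly the information you usually don't have.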
Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post "Fixing common Unicode mistakes with Python â€” after they’ve been made" explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.
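To give a feel for what a heuristic of this kind might look like, here's a deliberately naive sketch. It is not Rob's fix_bad_unicode, and the name naive_fix_mojibake is hypothetical. It handles only one mistake, UTF-8 bytes that were decoded as Latin-1, and it assumes the damage happened only once.

```python
def naive_fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo one round of UTF-8-read-as-Latin-1 damage.

    If the text does not look like that kind of damage, return it unchanged.
    """
    try:
        # Recover the bytes a wrong Latin-1 decode would have started from,
        # then decode them the way they were presumably meant to be decoded.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or the
        # recovered bytes are not valid UTF-8. Leave the text alone.
        return text


fixed = naive_fix_mojibake("\u00c2\u00a9 2012")   # garbled "Â© 2012"
untouched = naive_fix_mojibake("caf\u00e9")       # correct "café"
```

The trick is that correctly decoded text with accented characters almost never round-trips into valid UTF-8, so a successful round trip is decent evidence of damage. It can still misfire on text that really was meant to contain a sequence like "Â©", and it knows nothing about Windows-1252 or text that was mangled more than once, which is where real-world heuristics have to do the hard work.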
This is the kind of code I'm not sure I would have attempted, given the "impossibility" of the task. Bravo to Rob for taking it on.