Tuesday 21 August 2012 — This is more than 12 years old. Be careful.
In “Pragmatic Unicode, or, How Do I Stop the Pain?”, I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong.
It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren’t a valid sequence of bytes at all. We’ve all seen web pages with A-hats on them:
If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
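A-hats of this kind typically appear when text is encoded as UTF-8 and some component then decodes those bytes as a single-byte encoding such as Windows-1252. The sketch below reproduces that damage on the Erdős quote. The helper name make_mojibake is hypothetical, and cp1252 is only an assumption about which wrong encoding was involved.

```python
def make_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec.

    Hypothetical helper for illustration: it reproduces the damage,
    it does not repair anything.
    """
    return text.encode("utf-8").decode(wrong_encoding)


# The quote with its real punctuation: curly apostrophes (U+2019),
# an en dash (U+2013), and o with double acute (U+0151).
quote = (
    "If numbers aren\u2019t beautiful, I don\u2019t know what is. "
    "\u2013Paul Erd\u0151s"
)

garbled = make_mojibake(quote)
print(garbled)

# In this simple case the damage is reversible by undoing the two steps.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == quote)
```

Each curly apostrophe is three bytes in UTF-8, so it turns into three characters once those bytes are read one at a time, and the first of the three is the "a" with a circumflex that gives A-hats their name.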
Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post “Fixing common Unicode mistakes with Python — after they’ve been made” explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics.
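To give a feel for what such a heuristic might look like, here is a minimal sketch. It is not Rob's fix_bad_unicode and does not reproduce his heuristics: the function name naive_fix_mojibake, the list of suspicious characters, and the choice of cp1252 are all assumptions made for illustration.

```python
# Characters that commonly appear when UTF-8 lead bytes are misread as
# cp1252: A-circumflex, A-tilde, A-ring, a-circumflex. This list is an
# assumption for the sketch, not an exhaustive set.
SUSPICIOUS = "\u00c2\u00c3\u00c5\u00e2"


def naive_fix_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Try to undo one round of UTF-8-read-as-cp1252 damage.

    Hypothetical sketch: returns the text unchanged whenever the
    repair attempt does not produce valid UTF-8.
    """
    if not any(ch in text for ch in SUSPICIOUS):
        return text  # nothing that looks like mojibake
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was probably fine, or was damaged some other way.
        return text


print(naive_fix_mojibake("aren\u00e2\u20ac\u2122t"))  # the garbled apostrophe
print(naive_fix_mojibake("\u00c2ge"))  # legitimate A-circumflex, left alone
```

The try/except is what keeps this from mangling good text: a legitimate "Â" followed by an ordinary letter does not round-trip into valid UTF-8, so the input comes back untouched. A sketch like this still misses plenty of cases, such as text that was damaged twice or that mixes good and bad characters, which is where a larger set of real-world heuristics comes in.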
This is the kind of code I’m not sure I would have attempted, given the “impossibility” of the task. Bravo to Rob for taking it on.