Ned Batchelder : Blog
Fixing broken Unicode

Tuesday 21 August 2012

In Pragmatic Unicode, or, How Do I Stop the Pain?, I said that you have to know the encoding of your bytes, so that you can properly decode them to Unicode. I also said the world is a really messy place, and that the declarations that should tell you the encoding are sometimes wrong. It gets even worse than that: your bytes may have been incorrectly handled by an upstream component, so that they aren't a valid sequence of bytes at all. We've all seen web pages with A-hats (stray "Â" characters) on them.
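As a rough illustration of how A-hats come about, here is a minimal sketch that simulates one common upstream mistake: text is encoded as UTF-8, and then those bytes are decoded as Latin-1. The function name `mangle` is made up for this example, and real-world breakage can take other paths than this one.

```python
def mangle(text: str) -> str:
    """Simulate an upstream bug: UTF-8 bytes wrongly decoded as Latin-1.

    Hypothetical helper for illustration only.
    """
    utf8_bytes = text.encode("utf-8")   # the correct bytes for the text
    return utf8_bytes.decode("latin-1")  # the wrong codec: one char per byte


# The pound sign is two bytes in UTF-8 (0xC2 0xA3). Read as Latin-1,
# 0xC2 becomes an A-hat and 0xA3 becomes the pound sign again.
print(mangle("£5"))  # Â£5
```

The result is a perfectly valid Unicode string, which is why nothing downstream complains: the damage is in the meaning, not in the type.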
Rob Speer deals with data like this at his day job at Luminoso, and decided to do something about it. His blog post Fixing common Unicode mistakes with Python — after they’ve been made explains his function fix_bad_unicode(text), which detects common mistakes and fixes them with a handful of real-world heuristics. This is the kind of code I'm not sure I would have attempted, given the "impossibility" of the task. Bravo to Rob for taking it on.
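To give a feel for the idea, here is a minimal sketch of the simplest possible repair: assume the text was UTF-8 that got decoded as Latin-1, and try to undo that. This is not Rob's fix_bad_unicode; the name `fix_mojibake_sketch` is invented for this example, and it handles only this one case with none of his heuristics for deciding when a fix is actually an improvement.

```python
def fix_mojibake_sketch(text: str) -> str:
    """Try to undo UTF-8 bytes that were mis-decoded as Latin-1.

    Hypothetical helper for illustration; returns the input unchanged
    when the round trip is not possible.
    """
    try:
        # Recover the original bytes, then decode them the right way.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid UTF-8, so it probably was not mangled this way.
        return text


print(fix_mojibake_sketch("Â£5"))   # £5
print(fix_mojibake_sketch("café"))  # café (left alone: not valid UTF-8 bytes)
```

Even this tiny version shows why the task looks "impossible": the code has to guess what went wrong, and a string that happens to survive the round trip could in principle have been correct to begin with.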
tagged: python, characters
Comments
Interesting, but how applicable is his work to Python 3 given that all strings in Python 3 are Unicode already?
@Adam, the point here is that he's dealing with data which is wrong by virtue of having been mishandled somewhere else. Python 3 doesn't fix that.