|Ned Batchelder : Blog | Code | Text | Site|
Phishing fun with Unicode
» Home : Blog : April 2005
Damien got a phishing email that looked like total gibberish, which he described as the Worst Phisher Ever. Turns out it wasn't a moronic spammer, but someone clever enough to use obscure Unicode features to sneak past spam filters.
The scrambled text looked like this in his Firefox browser:
But he discovered that when viewed with IE, the text was perfectly readable:
Here is the actual text. Try viewing it in Firefox (scrambled) and IE (readable):
What's going on? Confusing matters even more, if you view source in Firefox, you see scrambled text, and in IE you see readable text. How can the same series of bytes look different in the source?
Reading the page directly with readurl -x, I saw this:
000fa0: 65 22 3e 0a 0a 3c 70 3e 20 20 20 20 44 65 e2 80  e">..<p>    De..
Between the "De" and the "ra" are the bytes "e2 80 ae", and after the "ra" are the bytes "e2 80 ac". This smells like UTF-8. An interactive Python prompt and the decode() method reveal the Unicode code points:
>>> for c in 'De\xE2\x80\xAEra\xE2\x80\xAC'.decode('utf-8'):
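The prompt above is Python 2, where a plain string holds bytes. As a rough sketch of the same check in Python 3 (an assumption on my part, not the original session), the bytes can be decoded and each character looked up by name with the standard unicodedata module. The helper name describe is made up for illustration:

```python
import unicodedata

# The byte sequence quoted above: "De", e2 80 ae, "ra", e2 80 ac.
raw = b"De\xe2\x80\xaera\xe2\x80\xac"
text = raw.decode("utf-8")


def describe(s):
    """Return (code point, Unicode name) pairs for each character in s.

    'describe' is a hypothetical helper, not something from the original post.
    """
    return [("U+%04X" % ord(c), unicodedata.name(c, "<unnamed>")) for c in s]


for code_point, name in describe(text):
    print(code_point, name)
```

Six characters come out of eleven bytes, because each of the two control characters takes three bytes in UTF-8.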
So Unicode U+202E and U+202C are behind the mischief. They are Right-To-Left Override and Pop Directional Formatting, respectively, and they control the rendering of bidirectional text. What's going on here is that the "D" and "e" are written left-to-right, as is usual for English, then the writing direction is switched to right-to-left, "r" and "a" are written, and the writing direction is restored to left-to-right. The result, in a renderer that properly handles these codes, is "Dear". The result, in a renderer that ignores Unicode characters it doesn't understand, is "Dera". Unicode Standard Annex #9: The Bidirectional Algorithm has all the details.
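To make the two outcomes concrete, here is a toy model in Python 3. It is a deliberately simplified sketch, not the real Bidirectional Algorithm from Annex #9: it handles only a single, un-nested override run, and the function names are invented for this example.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def render_ignoring_controls(s):
    """Model a renderer that drops the controls and keeps storage order."""
    return s.replace(RLO, "").replace(PDF, "")


def render_with_override(s):
    """Model a renderer that honors RLO ... PDF by reversing the run between them.

    Simplification: nested overrides and every other bidi rule are ignored.
    """
    out = []
    run = None  # characters collected inside an override, or None outside one
    for c in s:
        if c == RLO:
            if run is None:
                run = []  # start collecting the right-to-left run
        elif c == PDF:
            if run is not None:
                out.extend(reversed(run))  # emit the run in visual order
                run = None
        elif run is not None:
            run.append(c)
        else:
            out.append(c)
    if run is not None:  # an unterminated override lasts to the end of the text
        out.extend(reversed(run))
    return "".join(out)


sample = "De" + RLO + "ra" + PDF
print(render_ignoring_controls(sample))  # Dera
print(render_with_override(sample))      # Dear
```

The stored order of the letters never changes; only the display order does, which is why software that reads the characters in storage order sees "Dera".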
Some spammers are very clever.