Phishing fun with Unicode

Saturday 9 April 2005

Damien got a phishing email that looked like total gibberish, which he described as the Worst Phisher Ever. Turns out it wasn't a moronic spammer, but someone clever enough to use obscure Unicode features to sneak past spam filters.

The scrambled text looked like this in his Firefox browser:

Dera Bcralays Merebm,

Tsih eamil was setn by the Balcrays svreer to virefy yruo eamil adsserd. You mtsu comelpte tsih procses by cilcking on the lkni bwole and entegnir in the samll wiwodn yruo Balcrays Mbmeership nebmur, paedocss and mbaromele wrod.

But he discovered that when viewed with IE, the text was perfectly readable:

Dear Barclays Member,

This email was sent by the Barclays server to verify your email address. You must complete this process by clicking on the link below and entering in the small window your Barclays Membership number, passcode and memorable word.

Here is the actual text. Try viewing it in Firefox (scrambled) and IE (readable):

De‮ra‬ B‮cra‬lays Me‮rebm‬,

T‮sih‬ e‮am‬il was se‮tn‬ by the Ba‮lcr‬ays s‮vre‬er to v‮ire‬fy y‮ruo‬ e‮am‬il ad‮sserd‬. You m‮tsu‬ com‮elp‬te t‮sih‬ proc‮se‬s by c‮il‬cking on the l‮kni‬ b‮wole‬ and ente‮gnir‬ in the s‮am‬ll wi‮wodn‬ y‮ruo‬ B‮alcra‬ys M‮bme‬ership n‮ebmu‬r, pa‮edocss‬ and m‮barome‬le w‮ro‬d.

What's going on? Confusing matters even more, if you view source in Firefox, you see scrambled text, and in IE you see readable text. How can the same series of bytes look different in the source?

Reading the page directly with readurl -x, I saw this:

000fa0: 65 22 3e 0a 0a 3c 70 3e  20 20 20 20 44 65 e2 80  e">..<p>    De..
000fb0: ae 72 61 e2 80 ac 20 42  e2 80 ae 63 72 61 e2 80  .ra... B...cra..
000fc0: ac 6c 61 79 73 20 4d 65  e2 80 ae 72 65 62 6d e2  .lays Me...rebm.
000fd0: 80 ac 2c 3c 62 72 3e 3c  62 72 3e 20 20 20 20 3c  ..,<br><br>    <

Between the "De" and "ra" are bytes "e2 80 ae", and after the "ra" are bytes "e2 80 ac". This smells like UTF-8. An interactive Python prompt and the decode() function reveal the Unicode code points:

>>> for c in 'De\xE2\x80\xAEra\xE2\x80\xAC'.decode('utf-8'):
...     print hex(ord(c))

So Unicode U+202E and U+202C are behind the mischief. They are Right-To-Left Override and Pop Directional Formatting respectively, and they control the rendering of bidirectional text. So what's going on here is the "D" and "e" are written left-to-right, as is usual for English, then the writing direction is switched to right-to-left, "r" and "a" are written, and the writing direction is restored to left-to-right. The result, in a renderer that properly handles these codes, is "Dear". The result, in a renderer that ignores the Unicode characters it doesn't understand, is "Dera". Unicode Standard Annex #9: The Bidirectional Algorithm has all the details.
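To make the two renderings concrete, here is a minimal Python 3 sketch that decodes the bytes from the hex dump, names the code points, and imitates both kinds of renderer. The function names are made up for illustration, and render_with_override is a toy that only handles a flat override run ended by a pop; it is not an implementation of the full bidirectional algorithm.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# The bytes for the first word, as seen in the hex dump.
raw = b"De\xe2\x80\xaera\xe2\x80\xac"
text = raw.decode("utf-8")


def describe(s):
    """List each character as (code point, Unicode name)."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]


def render_ignoring_controls(s):
    """Imitate a renderer that drops the controls and keeps storage order."""
    return s.replace(RLO, "").replace(PDF, "")


def render_with_override(s):
    """Imitate a renderer that honors the override (toy version).

    Characters between an RLO and the next PDF are shown in reverse.
    Nested overrides are ignored and a stray PDF is dropped.
    """
    out = []
    run = None  # characters collected while an override is active
    for c in s:
        if c == RLO:
            if run is None:
                run = []
        elif c == PDF:
            if run is not None:
                out.extend(reversed(run))
                run = None
        elif run is not None:
            run.append(c)
        else:
            out.append(c)
    if run is not None:
        # An override that is never popped runs to the end of the text.
        out.extend(reversed(run))
    return "".join(out)


for code_point, name in describe(text):
    print(code_point, name)
print(render_ignoring_controls(text))
print(render_with_override(text))
```

The same two functions applied to "Me" + RLO + "rebm" + PDF give "Merebm" and "Member", matching the scrambled and readable versions of the email.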

Some spammers are very clever.


Darren 6:03 PM on 9 Apr 2005

That is clever.
I wonder if there's a legitimate use for this technique.

Richard H. Schwartz 6:33 PM on 9 Apr 2005

The legitimate use would presumably be to embed some Hebrew or Arabic text within a Roman (or Greek or Cyrillic) alphabetic text.


Ned Batchelder 6:51 PM on 9 Apr 2005

I think he means, is there a legitimate use for exploiting the difference in the rendering in different browsers. I was wondering that myself. It only affects text, though, and it has to be one string, displayed either forward or backward.

Mark Pursey 8:24 PM on 9 Apr 2005

A quick way to check if you are looking at text that has been created this way is to drag-select it slowly -- as the cursor moves along the line the end of the selection will flicker, as it too must alternate between LTR and RTL directionality.

Also worth checking whether this exploit could be used in URLs themselves... I don't think it's possible, but am still planning to raise my paranoia level up yet another notch and check source on everything.

andrew 9:17 PM on 9 Apr 2005

I got that email too. BANGO! I never even looked at it, but I was surprised that it got past gmail's spam filters, which are usually quite good.

Mark Pursey 9:22 PM on 9 Apr 2005

Just added a post on this and noted something interesting, which is probably a well-duh! for anyone familiar with mixing character directions.
Bracketing characters are reversed when text-direction is switched, so reversing a[b]c results in c[b]a, rather than c]b[a. Neat!
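
The bracket behavior can be imitated in a few lines of Python 3. This is only a sketch: unicodedata.mirrored() reports whether a character has the mirrored property, but the standard library does not supply the counterpart glyph, so the small MIRROR_PAIRS table and the reverse_run name are assumptions made for this example.

```python
import unicodedata

# Hand-written table of a few mirrored pairs (an assumption for this
# sketch; the real mirroring data covers many more characters).
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def reverse_run(s, mirror=True):
    """Reverse a run of text, optionally swapping mirrored brackets."""
    out = []
    for c in reversed(s):
        if mirror and unicodedata.mirrored(c):
            c = MIRROR_PAIRS.get(c, c)
        out.append(c)
    return "".join(out)


print(reverse_run("a[b]c"))                # with mirroring
print(reverse_run("a[b]c", mirror=False))  # plain reversal
```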

K. 8:25 AM on 10 Apr 2005

The idea of using strange unicode characters in text is very clever. A similar attack that would probably work in both IE and Firefox (I've not tested this) would be to insert zero width spaces (U+200B) after every letter of a word. If the spam filter doesn't understand it and does a straight string compare, it would not find any matches. However, since the zero width space is not rendered, the word would be rendered normally on the screen.
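
A rough Python 3 sketch of that idea, and of one way a filter might defend against it: strip characters in the Unicode "Cf" (format) category before comparing. The helper names are hypothetical, and the filter here is a plain substring test standing in for whatever matching a real spam filter does.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def obfuscate(word):
    """Insert a zero width space after every letter of the word."""
    return "".join(c + ZWSP for c in word)


def strip_format_chars(s):
    """Drop characters in category Cf (format), which are not displayed.

    This also removes U+202E and U+202C, since they are Cf as well.
    """
    return "".join(c for c in s if unicodedata.category(c) != "Cf")


def naive_match(keyword, text):
    """Straight string compare, as a simple filter might do."""
    return keyword in text


def normalized_match(keyword, text):
    """Compare after removing invisible format characters."""
    return keyword in strip_format_chars(text)


hidden = obfuscate("passcode")
print(naive_match("passcode", hidden))
print(normalized_match("passcode", hidden))
```

Stripping every Cf character is a blunt choice: the category also includes the soft hyphen and the zero width joiners that some scripts use legitimately, so a real filter might prefer to count them rather than discard them.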

Don Lawrence 9:35 AM on 10 Apr 2005

I arrived at this post via Damien via Pete Lyons' respective blogs.

Just as an FYI, the text is scrambled by all browsers at my disposal in Mac OS X (10.3.8) - Firefox, Camino, Safari, Mozilla, Netscape, and Omni. Microsoft ceased production of IE for Mac a number of years ago when Apple began development of Safari.

Peatey 2:26 PM on 10 Apr 2005

Ned, I noticed a couple months ago that IE 'fixes' faulty HTML, like a missing closing bracket (>), that Firefox might not. The interesting thing was that IE's 'view source' would show the nonexistent bracket whereas Firefox's source view would not show it. I think IE's view source is messed up, probably shows the source AFTER 'correction,' which makes it a non-source, imho. I think the same thing is happening with Unicode here.

Ben Butler-Cole 8:19 AM on 12 Apr 2005

Clever, but not that clever. 0x202e and 0x202c are extremely unlikely to be present in ham emails, so any learning statistical filters out there are only going to be caught once.
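
As a small illustration of why these characters make such a strong signal, here is a Python 3 sketch that counts explicit directional formatting characters (U+202A through U+202E) in a message. It is a hand-rolled heuristic, not a learning statistical filter, and the function names and threshold are assumptions for the example.

```python
# U+202A..U+202E: LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)}


def count_bidi_controls(s):
    """Count explicit directional formatting characters in the text."""
    return sum(1 for c in s if c in BIDI_CONTROLS)


def looks_suspicious(s, threshold=1):
    """Flag text with at least `threshold` bidi controls (hypothetical rule)."""
    return count_bidi_controls(s) >= threshold


spam = "De\u202era\u202c B\u202ecra\u202clays"
ham = "Dear Barclays Member,"
print(count_bidi_controls(spam), looks_suspicious(spam))
print(count_bidi_controls(ham), looks_suspicious(ham))
```

Mail that really does mix Hebrew or Arabic with Latin text can contain these characters legitimately, so a count like this is better treated as one feature among many than as a verdict.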

jake 2:51 AM on 14 Apr 2005

So IE handles unicode better than Firefox? or worse? I'm not sure which is the _correct_ behavior (although it seems like you're saying it's IE's)

Ned Batchelder 6:54 AM on 14 Apr 2005

It seems like IE is doing the right thing, though I would think these behaviors really only need to be properly implemented for right-to-left languages. In some ways, I think it is an "error" to display ascii characters while in right-to-left mode.
