XSS with UTF-7

Saturday 21 April 2007

I’ve been reading a lot about XSS issues these days. It’s fascinating to learn about the various vulnerabilities and exploits. Understanding them helps me make my own applications more bullet-proof, and forces a better understanding of the interaction between layers of the computing infrastructure.

A good example is the UTF-7 vulnerability. When I first read this one in the XSS Bestiary, I didn’t understand what it meant. How could character sets be a vulnerability? After reading a few explanations, I finally get it. Here’s the deal.

Broadly, XSS vulnerabilities exist when content from the user is displayed on your page as executable HTML. A malicious user can include script content, which then executes in the browser, with access to the cookies from your domain.

The most famous example of a UTF-7 vulnerability was in Google’s forbidden access page, where the URL requested was displayed in the browser. In this case, the URL itself was the content from the user.

Here’s a very simple PHP implementation of a similar page, for 404 errors:

<?php
echo "<h1>Oops</h1>";
echo "<p>Can't find " . $_SERVER['REQUEST_URI'];
echo ". Have you looked under the couch?</p>";
?>

Before we get into the UTF-7 issues, notice that this page could be vulnerable to a garden-variety XSS attack. If you visit a URL like

http://example.com/<script>alert('xss')</script>

the page will appear, and the script tags will be executed, leading to an alert box that says “xss”, the canonical benign example of script used to test or demonstrate vulnerabilities. In a real exploit, the script would steal cookies or post requests.

A side note: one of the complexities of analyzing these exploits is that there are so many moving parts. In fact, you can’t visit that URL directly, because the browser will replace < with %3C and so forth. And when I put that URL in a link and follow it, Apache gives me a Forbidden page before the PHP script is even invoked; I don’t know why.

Just to be safe, let’s fix the XSS hole in this page. The problem is that we use the URL verbatim on the HTML page, so what should be treated as plain text (the URL) is inserted into the page as if it were executable HTML. The solution is to escape the data so that it becomes static text in the HTML page. PHP’s htmlentities function does the escaping we need:

<?php
echo "<h1>Oops</h1>";
echo "<p>Can't find " . htmlentities($_SERVER['REQUEST_URI']);
echo ". Have you looked under the couch?</p>";
?>

Now, regardless of the other layers of the system that might protect us from nasty URLs, this script will escape the URL anyway, and it’s safe to display it on the page, right? Wrong. This is where UTF-7 comes into the picture.

UTF-7 is an encoding of Unicode that uses only 7-bit characters, designed for email transmission. In a nutshell, many characters are used as-is, but others must be encoded as a plus sign, a modified base-64 encoding of the character’s UTF-16 code units, and a minus sign. For example, our script snippet can be represented in UTF-7 as:

+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-

+ADw- is an opening angle bracket, +AD4- is a closing angle bracket, +ACc- is an apostrophe, and +AC8- is a slash.
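To check that reading of the encoded string, here is a small sketch in Python, chosen only because its standard library ships a UTF-7 codec. The variable names are mine, not part of the original page. It decodes the bytes as UTF-7 and shows the script tag reappearing:

```python
# The UTF-7 string from the example above, as raw ASCII bytes.
payload = b"+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-"

# Each +...- run is modified base-64 for UTF-16 code units;
# everything outside those runs is taken literally.
decoded = payload.decode("utf-7")

print(decoded)  # prints <script>alert('xss')</script>
```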

If we include that string as part of our URL, it will be passed through the htmlentities function unchanged, and will appear in the HTML page as-is. The resulting page doesn’t have any explicit declaration of its character set, so depending on your browser settings, the browser may try to auto-detect the character set, and seeing the distinctive UTF-7 byte sequences, will choose UTF-7. In UTF-7, the string is executable HTML, and the script will be executed.

XSS vulnerabilities often boil down to a piece of data interpreted in two different ways. The server believes it to be static, but the browser decides it is executable. In this case, the server thought the string was static because it interpreted the byte string as ISO-8859-1 characters (the default encoding for htmlentities before PHP 5.4). As ISO-8859-1, these characters are static. The browser, though, decides that the bytes are UTF-7, where they are executable, leading to the vulnerability.
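The two interpretations can be seen side by side in a short Python sketch. This is an illustration under assumptions: html.escape stands in for PHP’s htmlentities, and decoding as UTF-7 stands in for a browser that has guessed that character set.

```python
import html

payload = "+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-"

# Server's view: the payload contains no <, >, &, or quote characters,
# so escaping (html.escape here, standing in for htmlentities) changes nothing.
escaped = html.escape(payload)
print(escaped == payload)  # True

# Browser's view, if it decides the page is UTF-7: the very same bytes
# now decode to markup containing a script element.
as_utf7 = escaped.encode("ascii").decode("utf-7")
print("<script>" in as_utf7)  # True
```

The escaping step did its job for the character set it assumed. The hole only opens when the reader of the bytes assumes a different one.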

The solution in this case is simple: force the browser to interpret the bytes the same way the server did, by declaring the character set:

<?php
echo "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
echo "<h1>Oops</h1>";
echo "<p>Can't find " . htmlentities($_SERVER['REQUEST_URI']);
echo ". Have you looked under the couch?</p>";
?>

The Content-Type declaration keeps the browser from auto-detecting the character set. By keeping the two interpretations synched up, the discrepancy that caused the vulnerability is removed, and the hole is closed.
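The same declaration can also travel in the HTTP Content-Type response header, which the browser sees before any of the page body arrives. Here is a minimal sketch of that idea, written as a Python WSGI handler rather than PHP purely for illustration. The name not_found_app and the use of PATH_INFO are my assumptions, not part of the original page. In PHP, the equivalent would be a header() call made before any output.

```python
import html


def not_found_app(environ, start_response):
    """Hypothetical WSGI 404 handler mirroring the PHP page above."""
    # Declare the charset in the HTTP header, before any body bytes are sent,
    # so the browser has no reason to auto-detect the encoding.
    start_response(
        "404 Not Found",
        [("Content-Type", "text/html; charset=iso-8859-1")],
    )
    # Escape the user-controlled path so it is displayed as plain text.
    path = html.escape(environ.get("PATH_INFO", ""))
    body = (
        "<h1>Oops</h1>"
        "<p>Can't find " + path + ". Have you looked under the couch?</p>"
    )
    # Encode with the same charset that the header declares.
    return [body.encode("iso-8859-1", errors="xmlcharrefreplace")]
```

The point of the design is that the escaping, the byte encoding, and the declared charset all agree, so the server and the browser read the bytes the same way.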

BTW: if you want to see a UTF-7 page in action, try this one I made. It doesn’t do anything malicious, simply pops up an alert:

+ADw-p+AD4-Welcome to UTF-7!+ADw-+AC8-p+AD4-
+ADw-script+AD4-alert(+ACc-utf-7!+ACc-)+ADw-+AC8-script+AD4-

Comments

I found the meta-tag not helping a lot. Other things to look into are header('Content-Type: text/html; charset=utf-8'); from within PHP, and the default and additional character sets in the Apache configuration.

The meta-tag approach doesn't work at all when there's dynamic content before the browser sees the meta-tag and starts reinterpreting the page.
Jan, you are right: headers are more reliable than meta tags. On the small sample page I was working with, the meta tag did a fine job preventing the auto-detection, but larger pages will need a more robust declaration.
Your test page didn't pop up an alert for me in either Firefox or IE7. Seems that I've got the encoding auto-detect turned off in both. I don't recall doing that on purpose.
The auto-detection seems somewhat mysterious. My IE7 shows the alert. My Firefox does not, but if I select View - Character Encoding - More Encodings - Unicode - UTF-7, then it shows the alert, and what's more, shows it every time I visit the page thereafter, until I restart the browser.
Thanks for this clear explanation. I'd been wondering what this TAG issue was about but having not understood the initial comments I didn't get round to digging a bit deeper.

http://www.w3.org/2001/tag/issues.html#utf7Encoding-55
The test page doesn't do anything on Safari 2.0.4 either. It also seems that Safari doesn't have an obvious way to switch to UTF-7 encoding.
Ned, I recall IE doing input encoding heuristics regardless of what you tell it. The others don't do that, AFAIK.
