Accidental HTML entities in URLs

Friday 19 December 2008

Here’s yet another browser mystery which seems to affect Internet Explorer, though after the last one (about conditional definition of JavaScript functions) turned out to be maybe the correct behavior, I’m reluctant to lay the blame at IE’s feet.

We had a problem where certain links in IE were producing mysterious stack traces about not being able to convert characters somewhere deep in the HTTP handling code. Turns out we had a URL like this:

http://nedbatchelder.com?foo=1&copy_from=quux

and by the time it got to the servers, it looked like this:

/?foo=1©_from=quux

Here’s the odd thing: the URL wasn’t just an HTML attribute. If the HTML had looked like this:

<a href='http://nedbatchelder.com?foo=1&copy_from=quux'>go</a>

then I could understand the problem: HTML needs to have ampersands escaped. &copy; is the HTML entity for the copyright character, so the URL string really does have a copyright symbol in it. In this case, HTML doesn’t care that the trailing semicolon is omitted, and as an extra twist, the underscore doesn’t count as a word character in HTML, so &copy_from becomes ©_from.
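To make that lenient matching concrete, here is a toy model of it in JavaScript. This is only a sketch of the behavior described above, not what any browser actually runs: the `ENTITIES` table and the `lenientDecode` function are hypothetical names, and the table holds just a handful of entities.

```javascript
// Hypothetical, deliberately tiny entity table (real HTML defines many more).
var ENTITIES = { copy: '\u00A9', amp: '&', lt: '<', gt: '>' };

// Toy model: an entity name is a run of letters and digits after '&',
// and the trailing semicolon is optional. An underscore ends the name.
function lenientDecode(text) {
    return text.replace(/&([A-Za-z0-9]+);?/g, function (match, name) {
        // Unknown names are left exactly as they were written.
        return ENTITIES.hasOwnProperty(name) ? ENTITIES[name] : match;
    });
}

console.log(lenientDecode('/?foo=1&copy_from=quux'));     // /?foo=1©_from=quux
console.log(lenientDecode('/?foo=1&copyfrom=quux'));      // unchanged
console.log(lenientDecode('/?foo=1&amp;copy_from=quux')); // /?foo=1&copy_from=quux
```

Under this model, the underscore stops the name at "copy", which is a known entity, while "copyfrom" is read as one unknown name and left alone.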

To use this URL in HTML, you’d have to escape the ampersand to avoid the entity conversion, like this:

<a href='http://nedbatchelder.com?foo=1&amp;copy_from=quux'>go</a>
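If the link is generated by code rather than typed by hand, the escaping can be done mechanically. Here is a minimal sketch, assuming a hypothetical helper named `escapeForHtmlAttribute` (it is not part of any library), that escapes the characters that matter inside an attribute value:

```javascript
// Hypothetical helper: make a string safe to place inside an HTML
// attribute value. The ampersand must be replaced first, or the
// ampersands introduced by the other replacements would be re-escaped.
function escapeForHtmlAttribute(value) {
    return value
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

var url = 'http://nedbatchelder.com?foo=1&copy_from=quux';
var link = "<a href='" + escapeForHtmlAttribute(url) + "'>go</a>";
console.log(link);
// <a href='http://nedbatchelder.com?foo=1&amp;copy_from=quux'>go</a>
```

The browser turns &amp; back into a plain ampersand when it parses the attribute, so the server sees the URL as originally intended.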

But in my case, the URL wasn’t in HTML, it was in a JavaScript string, like this:

document.location = "http://nedbatchelder.com?foo=1&copy_from=quux";

In this case, the ampersand ends up in the string as a plain ampersand. But when the time comes to set the document location, IE applies another entity replacement, while other browsers don’t. In IE 6, this code results in a visit to a URL with a copyright symbol in it.

If you’d like to try it for yourself, here’s a small HTML page to experiment with:

<html>
<head>
<script>
function try_1() {
    // A simple URL string with &copy in it:
    var url = 'http://nedbatchelder.com?foo=1&copy_from=2';
    alert(url);
    document.location = url;
}
function try_2() {
    // Try to avoid entity replacement by breaking up the string:
    var url = 'http://nedbatchelder.com?foo=1&' + 'copy_from=2';
    alert(url);
    document.location = url;
}
</script>
</head>
<body>
<p><a href='http://nedbatchelder.com?foo=1&bar=2'>bar</a></p>
<p><a href='http://nedbatchelder.com?foo=1&copy_from=2'>copy</a></p>
<p><a href='http://nedbatchelder.com?foo=1&amp;copy_from=2'>copy</a></p>
<p><a href='javascript:try_1()'>js try 1</a></p>
<p><a href='javascript:try_2()'>js try 2</a></p>
</body>
</html>

When you click either of the JavaScript links, an alert will appear to show the URL you’re about to visit, so you can see that the string really is “&copy_from=2”. In Firefox or Safari, you end up at the URL shown in the alert. In IE, you see “&copy_from=2” in the alert box, but the URL you visit is “©_from=2”.

One of the tricky aspects of any kind of programming, but web programming in particular, is understanding all the different phases of interpretation your code and data are subjected to. Where exactly is &copy; turned into ©? When the browsers don’t agree, and subtle manipulations are being applied where you don’t expect them, the debugging gets that much trickier.

Once the problem was identified, the fix was easy: change the parameter name from copy_from to copyfrom, so that &copyfrom doesn’t match an HTML entity. Unfortunately, this means we can never use a parameter name that conflicts with an HTML entity name, and there are 252 HTML entities, including ones with potentially useful names like &lang (〈) and &times (×).
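One way to guard against this is to check URLs for risky parameter names before shipping them. The sketch below is only an illustration under stated assumptions: `findRiskyParams` and `ENTITY_NAMES` are made-up names, the entity list is a small sample rather than the full set, and the matching rule (entity name followed by something other than a letter or digit) is just the behavior described in this post, not a documented browser algorithm.

```javascript
// Illustrative sample only; a real check would need the complete entity list.
var ENTITY_NAMES = ['copy', 'reg', 'lang', 'times', 'amp', 'lt', 'gt', 'quot'];

// Return the query parameter names that begin with an entity name
// followed by a non-alphanumeric character (or nothing at all).
function findRiskyParams(url, entityNames) {
    var queryStart = url.indexOf('?');
    if (queryStart === -1) {
        return [];
    }
    var query = url.slice(queryStart + 1).split('#')[0];
    var pairs = query.split('&');
    var risky = [];
    // Skip the first pair: it follows '?', not '&', so it can't form an entity.
    for (var i = 1; i < pairs.length; i++) {
        var name = pairs[i].split('=')[0];
        for (var j = 0; j < entityNames.length; j++) {
            var entity = entityNames[j];
            var next = name.charAt(entity.length); // '' if the name ends here
            if (name.indexOf(entity) === 0 && !/[A-Za-z0-9]/.test(next)) {
                risky.push(name);
                break;
            }
        }
    }
    return risky;
}

console.log(findRiskyParams('http://nedbatchelder.com?foo=1&copy_from=quux', ENTITY_NAMES));
// [ 'copy_from' ]
console.log(findRiskyParams('http://nedbatchelder.com?foo=1&copyfrom=quux', ENTITY_NAMES));
// []
```

A check like this could run in a test suite over the URLs an application generates, so a collision is caught before a browser ever sees it.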

I couldn’t find any discussion of this issue in a quick search, other than this blog post about HTML escaping.

Comments

You're not enclosing the