Accidental HTML entities in URLs

Friday 19 December 2008

Here’s yet another browser mystery which seems to affect Internet Explorer, though after the last one (about conditional definition of JavaScript functions) turned out to be maybe the correct behavior, I’m reluctant to lay the blame at IE’s feet.

We had a problem where certain links in IE were producing mysterious stack traces about not being able to convert characters somewhere deep in the HTTP handling code. Turns out we had a URL like this:

and by the time it got to the servers, it looked like this:


Here’s the odd thing: the URL wasn’t just an HTML attribute. If the HTML had looked like this:

<a href=''>go</a>

then I could understand the problem: HTML needs to have ampersands escaped. &copy; is an HTML entity for the copyright character, so the URL string really does have a copyright symbol: In this case, HTML doesn’t care that the trailing semicolon is omitted, and as an extra twist, the underscore doesn’t count as a word character in HTML, so &copy_from becomes ©_from.
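As a rough illustration of the rule just described, here is a minimal sketch of lenient entity decoding. It is a simplified model and not a real HTML parser: the function name `decodeLeniently` and the three-entry entity table are assumptions made for this example. The entity name is taken to be the run of letters and digits after the ampersand, and the trailing semicolon is optional.

```javascript
// Simplified model of lenient entity decoding. NOT a real HTML parser:
// only this small, hypothetical subset of entity names is recognized.
const SAMPLE_ENTITIES = { copy: "\u00A9", times: "\u00D7", amp: "&" };

function decodeLeniently(text) {
  // The name is the run of letters/digits after "&"; an underscore ends it.
  // The trailing semicolon is optional.
  return text.replace(/&([A-Za-z0-9]+);?/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(SAMPLE_ENTITIES, name)
      ? SAMPLE_ENTITIES[name] // known entity: substitute its character
      : match                 // unknown name: leave the text untouched
  );
}

console.log(decodeLeniently("a=1&copy_from=quux")); // a=1©_from=quux
console.log(decodeLeniently("a=1&copyfrom=quux"));  // a=1&copyfrom=quux
```

Under this model `&copy_from` is affected because the underscore ends the name at `copy`, while `&copyfrom` is left alone because `copyfrom` is not an entity name.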

To use this URL in HTML, you’d have to escape the ampersand to avoid the entity conversion, like this:

<a href=';copy_from=quux'>go</a>
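If the markup is generated by code, the escaping can be done by a small helper. The sketch below assumes a hypothetical function `escapeAttribute` and a made-up URL path (`/page?foo=bar`); only the `copy_from=quux` parameter comes from the example above.

```javascript
// Hypothetical helper: escape characters that are special inside a quoted
// HTML attribute value.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;") // must run first so later escapes aren't re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The path here is an assumption for illustration only.
const href = escapeAttribute("/page?foo=bar&copy_from=quux");
console.log("<a href='" + href + "'>go</a>");
// <a href='/page?foo=bar&amp;copy_from=quux'>go</a>
```

This kind of escaping belongs only where the string is being placed into HTML; it is not meant for a URL that stays inside a JavaScript string.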

But in my case, the URL wasn’t in HTML, it was in a JavaScript string, like this:

document.location = "";

In this case, the ampersand ends up in the string as a plain ampersand. But when the time comes to set the document location, IE applies another entity replacement, while other browsers don’t. In IE 6, this code results in a visit to a URL with a copyright symbol in it.

If you’d like to try for yourself, here’s a small HTML fragment to try:

<script>
function try_1() {
    // A simple URL string with &copy in it:
    var url = '';
    // Show the exact string before navigating:
    alert(url);
    document.location = url;
}
function try_2() {
    // Try to avoid entity replacement by breaking up the string:
    var url = '' + 'copy_from=2';
    // Show the exact string before navigating:
    alert(url);
    document.location = url;
}
</script>
<p><a href=''>bar</a></p>
<p><a href=''>copy</a></p>
<p><a href=';copy_from=2'>copy</a></p>
<p><a href='javascript:try_1()'>js try 1</a></p>
<p><a href='javascript:try_2()'>js try 2</a></p>

When you click either of the JavaScript links, an alert will appear to show the URL you’re about to visit, so you can see that the string really is “&copy_from=2”. In Firefox or Safari, you end up at the URL shown in the alert. In IE, you see “&copy_from=2” in the alert box, but the URL you visit is “©_from=2”.

One of the tricky aspects of any kind of programming, but web programming in particular, is understanding all the different phases of interpretation your code and data are subjected to. Where exactly is &copy; turned into ©? When the browsers don’t agree, and subtle manipulations are being applied where you don’t expect, the debugging gets that much trickier.

Once the problem was identified, it was easy to fix: change the parameter name from &copy_from to &copyfrom, so that it doesn’t match an HTML entity. Unfortunately, this means we can never use a parameter name that conflicts with an HTML entity name, and there are 252 named entities in HTML 4, including ones with potentially useful names like &lang (〈) and &times (×).
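A check like this could be automated. The following is only a sketch: `findRiskyParams` is a hypothetical helper, and the entity list is a small illustrative sample, not the full set of 252 names. It flags any parameter after the first whose leading run of letters and digits is exactly an entity name, which is the model of the behavior described above.

```javascript
// Partial, illustrative sample of HTML 4 entity names (the full list has 252).
const SAMPLE_ENTITY_NAMES = new Set([
  "copy", "reg", "lang", "times", "para", "sect", "not", "yen", "cent", "deg",
]);

// Hypothetical helper: return parameter names that could be misread as
// "&entity" when the query string is entity-decoded.
function findRiskyParams(queryString) {
  const pairs = queryString.replace(/^\?/, "").split("&");
  const risky = [];
  // The first parameter has no "&" in front of it, so it is skipped.
  for (const pair of pairs.slice(1)) {
    const name = pair.split("=")[0];
    // Leading letters/digits only: "copy_from" -> "copy", "copyfrom" -> "copyfrom".
    const leading = (name.match(/^[A-Za-z0-9]+/) || [""])[0];
    if (SAMPLE_ENTITY_NAMES.has(leading)) {
      risky.push(name);
    }
  }
  return risky;
}

console.log(findRiskyParams("?foo=bar&copy_from=quux&lang=en&copyfrom=1"));
// [ 'copy_from', 'lang' ]
```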

I couldn’t find any discussion of this issue in a quick search, other than this blog post about HTML escaping.

» 21 reactions



You're not enclosing the <script> tag in CDATA sections - so isn't IE's behavior sorta correct here? (I'm not sure you're right about not needing the semicolon, but I'll take it as given.) Firefox is trying to be nice to you.


Yep. The real solution is to just not inline JavaScript. Otherwise, you need to use a super-special incant to "protect" your JavaScript in all user-agents. It's quite a mess, really.


Wait, how would CDATA fix his try_2? It looks like he's pretty clearly pinned it on document.location magic.


CDATA isn't the answer, since this is an HTML document not an XHTML document. In HTML script elements are defined as CDATA. You don't need to mark it explicitly (which you do in XHTML because XML DTDs aren't as powerful as SGML DTDs and can't do implicit CDATA).

I'm pretty sure this is just another Internet Explorer bug.

(Coincidentally, I wrote about the history of CDATA and script elements a couple of weeks ago)


@Cory, @Kelly, @Edward: This really isn't an issue of properly escaping data so that HTML can consume it. Did you try your proposed solutions? They don't help. As Braden points out, try_2 was crafted specifically to get away from HTML entity issues in the source file, and the alert() calls are there to show what data is in the url variable.

There's something buggy about IE's behavior here.


I like the phrase "accidental entities." Are we not all just accidental entities?

David Boudreau 11:26 PM on 19 Dec 2008

No Jim, as these particular accidental entities are easy to fix, as Ned described: "Once this problem was identified, it was easy to fix". Fixing us, on the other hand, is not so easy (nor is "identifying", for that matter).


Totally unrelated, but the company I work for decided we'd make our user names more user friendly, so we switched to all numeric. Yup. People pay us to log in as '8323423994.' You can't make that up.

Regardless, on some components, we use basic authentication over SSL. I found that if you attempt to embed authentication information in the URL, and the username is numeric, IE 6 interprets the '[0-9]+', followed by a colon as an IP address and the request bombs.


I believe that the W3C validator requires all &'s in URLs to be escaped into &amp;s. So to have valid HTML, they all should be escaped that way. I used to make sure my pages were valid even though it would take me forever, especially with those pesky little &'s that it complained about, because I was always missing one of them. You might still get away with it in JavaScript with the validator, but I bet technically it would still be invalid. Which is why it's not consistent among browsers.


@Bryan The validator just tells you when you deviate from the requirements of the DTD you claim to follow, and the HTML DTDs are far from requiring that all ampersands be represented as entities. I'm not going to list all the exceptions, but the big one is when element content is marked as being CDATA (as is the case for script elements).


@David I only claim that I'm an idiot! :) I thought that might be the case, and exposed my bare (bear?) hind end.

I just have this tendency to always escape them. Systems that never show & unless you type &amp; always cause me to go meta and talk about using &amp; to show the entity, and &amp;amp; to show the source of the entity. Zzzzzzz.

(not enough sleep... sorry...)


Good languages (like Perl!) allow you to use either & or ; as parameter separators, such that your link becomes;copy_from=quux -> no more &amp; in the links, no more taking care to avoid "reserved" words, just a happy programmer.


@Berserk: we can't quite credit the language, but the library. We could use custom parameter separators in any language, and there may be good reason for it, but there's also something to be said for sticking to common practice.


Using semicolons in query strings instead of ampersands might not be common, but it is blessed by the HTML spec.
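For what it's worth, accepting both separators on the receiving side takes only a few lines. This is a minimal sketch with a hypothetical `parseQuery` function, not any particular library's API; repeated parameter names simply overwrite earlier ones.

```javascript
// Hypothetical parser that accepts either "&" or ";" between parameters.
function parseQuery(queryString) {
  const params = {};
  // "+" means a space in form-encoded query strings.
  const decode = (s) => decodeURIComponent(s.replace(/\+/g, " "));
  const body = queryString.replace(/^\?/, "");
  for (const pair of body.split(/[&;]/)) {
    if (pair === "") continue; // tolerate empty segments such as "a=1;;b=2"
    const eq = pair.indexOf("=");
    const rawName = eq === -1 ? pair : pair.slice(0, eq);
    const rawValue = eq === -1 ? "" : pair.slice(eq + 1);
    params[decode(rawName)] = decode(rawValue);
  }
  return params;
}

console.log(parseQuery("?foo=bar;copy_from=quux"));
// { foo: 'bar', copy_from: 'quux' }
```

A literal semicolon inside a value would then need to be percent-encoded as %3B.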


@David: That is a nice piece of information to file away and actually use.

Let's see, go through file and replace every & with ;

Then again, maybe not.

(Feeling much more rested today!)

(Just still silly!)


This feels like another place for yet another adapter layer for an idealized web browser interface. I would expect a cool plug-in to have traction as a cost savings among interested parties, given the total time spent on browser quirks and the specialization that it has become. At the least, a cool "lint meets the cryptonomicon" software debugging tool.


Hey. Escaping the ampersand in &copy with an &amp; does fix the problem in IE, but breaks Firefox 3. (Firefox puts the literal "&amp;" into the query string). Again, this is using JavaScript to open a new browser window. I don't know whether escaping the ampersand in the href of an A tag has the same effect.

Just to point out, this bug/feature has been around since IE 4 (according to a newsgroup post from 2001 I just found).



Another tricky case:

<a href=";copy_from=1" onclick="; return false;">Popup link</a>

All good browsers pass the entity through javascript and open the correct page in the popup window.

Internet Explorer (tested with IE8/WinXP) seems to parse the entity twice during the javascript call, from &amp;copy_from=1 to &copy_from=1 to ©_from=1 !

The only solution, despite the use of correctly escaped entities, is to avoid parameter names which match HTML entities...


I still need to test this, but a workaround can be to create a FORM in the background and submit that form, instead of using window.location=....
