Ned Batchelder
Accidental HTML entities in URLs
December 2008
We had a problem where certain links in IE were producing mysterious stack traces about not being able to convert characters somewhere deep in the HTTP handling code. Turns out we had a URL like this:
http://nedbatchelder.com?foo=1&copy_from=quux
and by the time it got to the servers, it looked like this:
http://nedbatchelder.com?foo=1©_from=quux
Here's the odd thing: the URL wasn't just an HTML attribute. If the HTML had looked like this:
<a href="http://nedbatchelder.com?foo=1&copy_from=quux">link</a>
then I could understand the problem: HTML needs to have ampersands escaped. &copy is an HTML entity for the copyright character, so the URL string really does have a copyright symbol. In this case, HTML doesn't care that the trailing semicolon is omitted, and as an extra twist, the underscore doesn't count as a word character in HTML, so &copy_from becomes ©_from.
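The lenient matching described here can be modelled in a few lines of JavaScript. This is only a simplified sketch for a single entity, not how any browser implements its parser, and decodeCopyLeniently is a hypothetical name. It treats &copy as an entity when it is followed by a semicolon, by the end of the text, or by any character that is not a letter or digit (which includes the underscore).

```javascript
// Simplified model (an assumption, not real browser code): decode "&copy"
// when it is followed by ";" or by something that is not a letter or digit.
function decodeCopyLeniently(text) {
  return text.replace(/&copy(?:;|(?![A-Za-z0-9]))/g, "\u00A9");
}

const original = "http://nedbatchelder.com?foo=1&copy_from=quux";
const decoded = decodeCopyLeniently(original);

// The underscore ends the entity name, so the copyright sign appears.
console.log(decoded);

// A letter right after "copy" means no entity is recognized in this model.
console.log(decodeCopyLeniently("http://nedbatchelder.com?foo=1&copyfrom=quux"));
```

In this model the first URL comes back with a copyright sign in place of &copy, while the second is returned unchanged.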
To use this URL in HTML, you'd have to escape the ampersand to avoid the entity conversion, like this:
<a href="http://nedbatchelder.com?foo=1&amp;copy_from=quux">link</a>
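When links are generated by code, that escaping is easy to forget, so it helps to do it in one place. Here is a minimal sketch, assuming the markup is built as a string in JavaScript; escapeAttr is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: escape the characters that are special inside a
// double-quoted HTML attribute value. The ampersand is replaced first so
// that the entities produced by the later replacements are not re-escaped.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

const url = "http://nedbatchelder.com?foo=1&copy_from=quux";
const anchor = '<a href="' + escapeAttr(url) + '">link</a>';

// The href now contains &amp;copy_from, which decodes back to &copy_from.
console.log(anchor);
```

This only covers the HTML attribute case. It does nothing for a URL that is handed directly to the browser from script.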
But in our case, the URL was in JavaScript:
document.location = "http://nedbatchelder.com?foo=1&copy_from=quux";
In this case, the ampersand ends up in the string as a plain ampersand. But when the time comes to set the document location, IE applies another entity replacement, while other browsers don't. In IE 6, this code results in a visit to a URL with a copyright symbol in it.
If you'd like to try for yourself, here's a small HTML fragment to try:
One of the tricky aspects of any kind of programming, but web programming in particular, is understanding all the different phases of interpretation your code and data are subjected to. Where exactly is &copy turned into ©? When the browsers don't agree, and subtle manipulations are being applied where you don't expect, the debugging gets that much trickier.
Once this problem was identified, it was easy to fix: change the parameter name from &copy_from to &copyfrom, so that it doesn't match an HTML entity. Unfortunately, this means we can never use a parameter name that conflicts with an HTML entity name, and there are 252 HTML entities, including ones with potentially useful names like &lang (〈) and &times (×).
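A check like the following could flag risky parameter names before they ship. It is a sketch under stated assumptions: ENTITY_NAMES is a small, deliberately incomplete sample rather than the full list of 252, looksLikeEntity is a hypothetical helper, and the matching rule only mirrors the behavior described in this post (an entity name followed by the end of the name or by a non-alphanumeric character), which other browsers may not share.

```javascript
// Incomplete sample of HTML entity names (assumption: a real check would
// load the complete list from the HTML specification).
const ENTITY_NAMES = ["amp", "lt", "gt", "quot", "copy", "reg", "lang", "times"];

// Hypothetical helper: returns true when the parameter name starts with an
// entity name that is followed by nothing or by a character that is not a
// letter or digit, so "&" + paramName could be read as an entity.
function looksLikeEntity(paramName) {
  return ENTITY_NAMES.some((entity) => {
    if (!paramName.startsWith(entity)) {
      return false;
    }
    const next = paramName.charAt(entity.length);
    return next === "" || !/[A-Za-z0-9]/.test(next);
  });
}

for (const name of ["copy_from", "copyfrom", "lang", "language", "foo"]) {
  console.log(name, looksLikeEntity(name));
}
```

With this sample list, copy_from and lang are flagged, while copyfrom, language, and foo are not.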
I couldn't find any discussion of this issue in a quick search, other than this blog post about HTML escaping.