Thursday 29 March 2007
A brief detour through some docs led me to PHP’s htmlspecialchars function, where I noticed that double-quote becomes &quot;, but apostrophe becomes &#039;. Seemed odd, since we’re all so used to &apos; as the apostrophe entity. A comment on the docs claimed that there’s no such thing as &apos; in HTML. I was already three or four levels deep on the distraction stack, so I went and looked.
Sure enough, the HTML 4.0 spec defines 252 different character entities, and &apos; is not among them.
What does it mean? Nothing, really, since the browsers all understand the entity, but it demonstrates that sticking to a standard may be tougher than you think, since common practice so often exceeds what the standard guarantees.
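For a quick check of the same gap outside the spec, here is a minimal sketch in Python rather than PHP. It assumes the standard library's html.entities.name2codepoint table reflects the HTML 4 entity list and that html.entities.html5 reflects the later HTML5 list; the escape_attr helper is a hypothetical name used only for illustration.

```python
import html
from html.entities import html5, name2codepoint

# name2codepoint maps HTML 4 entity names (no "&" or ";") to code points.
apos_in_html4 = "apos" in name2codepoint
quot_in_html4 = "quot" in name2codepoint

# The html5 table keys include the trailing semicolon.
apos_in_html5 = "apos;" in html5


def escape_attr(text: str) -> str:
    """Escape text for an HTML attribute value (hypothetical helper).

    html.escape with quote=True emits a numeric reference for the
    apostrophe, much as PHP's htmlspecialchars emits &#039;.
    """
    return html.escape(text, quote=True)


print("apos in HTML 4 table:", apos_in_html4)
print("quot in HTML 4 table:", quot_in_html4)
print("apos; in HTML5 table:", apos_in_html5)
print(escape_attr("it's \"quoted\""))
```

The numeric form is the safe choice if the output might be read by something that only knows the HTML 4 names.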
Comments
I'm not sure if this is still the case, but for a long time, Internet Explorer didn't recognise &apos; as an entity in HTML, but Mozilla would. So if you used an XML escape function to escape your HTML, it would render fine in Firefox, but be full of &apos;s in IE.
The idea of Internet Explorer being inconvenient because it was following the standard always struck me as amusing.
One thing I didn't know until last week was that in SGML, hence HTML, the semicolon at the end of the entity is optional (unless it's needed for tokenization). But in IE (up to v6, at least) this doesn't work for entities added with HTML 4.0. E.g. (look at this in I.E.),
These are okay:
&lt &eacute &Phi;
But this is not:
&Phi
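A similar split can be reproduced without IE. This is only a sketch, assuming Python's html.unescape, which follows the HTML5 parsing rules rather than IE's: it accepts a missing semicolon for the older names but not for Phi.

```python
import html

# Entity references with and without the trailing semicolon.
samples = ["&lt", "&eacute", "&Phi;", "&Phi"]

# Decode each one; references that are not recognised are left unchanged.
decoded = {s: html.unescape(s) for s in samples}

for source, result in decoded.items():
    status = "decoded" if result != source else "left as-is"
    print(f"{source!r} -> {result!r} ({status})")
```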
http://www.w3.org/TR/xhtml1/#C_16