Fossilized hack-arounds

Friday 7 May 2010

A few weeks ago, we had a baffling problem with our web application: some JSON responses were being gzipped incorrectly. I asked about it on Server Fault: Incorrect gzipping of http requests, can’t find who’s doing it.

The final resolution was that Akamai was gzipping the response and adding a “Content-Encoding: gzip” header. But we’d already put in a “Content-Encoding: identity” header. The browser saw both, attended only to the first (identity), and couldn’t interpret the gzipped gibberish it saw in the content.
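To make the failure concrete, here is a minimal sketch of a client that honors only the first Content-Encoding header it sees. The helper names (first_header, decode_body) are made up for illustration, and real browsers handle headers in more involved ways; this only models the behavior described above.

```python
import gzip
import json


def first_header(headers, name):
    """Return the first value for a header name (case-insensitive), or None."""
    for key, value in headers:
        if key.lower() == name.lower():
            return value
    return None


def decode_body(headers, body):
    """Decode the body as a client that trusts only the first Content-Encoding would."""
    encoding = first_header(headers, "Content-Encoding")
    if encoding == "gzip":
        return gzip.decompress(body)
    return body  # "identity" or no header: the bytes are used as-is


payload = json.dumps({"ok": True}).encode("utf-8")
headers = [
    ("Content-Encoding", "identity"),  # set by the application
    ("Content-Encoding", "gzip"),      # added later by the intermediary
]
body = gzip.compress(payload)  # what actually goes over the wire

# The client sees "identity" first, so it hands the raw gzip bytes to the JSON parser.
decoded = decode_body(headers, body)
```

With the identity header removed, the same function finds “gzip” first and recovers the original JSON.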

It turns out we aren’t supposed to use “Content-Encoding: identity” on responses, and removing that from our JSON code solved the problem.

But there was a mystery remaining: Akamai also adds an “X-N: S” header to the response. What the heck is that?

A friend has friends at Akamai, and sent them the question. Back came the answer:

A long time ago, when there was a browser called “Netscape”, :-) there was a bug that prevented embedded images from rendering if the HTTP headers were exactly some length. (If the terminating \r\n begins on character 256, 257, or 258.) So if the header size is in this range the Akamai server adds that header...
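I don’t know what Akamai’s code actually looks like, but the check described there could be sketched roughly like this. The function names are invented, and I’m assuming positions are counted from 1 and that the count starts at the status line; the quote doesn’t say.

```python
# Hypothetical sketch of the padding workaround; not Akamai's actual code.
BAD_POSITIONS = {256, 257, 258}  # positions named in the quoted explanation
PAD_HEADER = "X-N: S"            # the header Akamai was seen adding


def terminator_position(status_line, headers):
    """Return the 1-based position where the terminating CRLF begins."""
    head = status_line + "\r\n" + "".join(h + "\r\n" for h in headers)
    return len(head) + 1


def pad_if_needed(status_line, headers):
    """Append the padding header only when the terminator lands in the bad range."""
    if terminator_position(status_line, headers) in BAD_POSITIONS:
        return headers + [PAD_HEADER]
    return headers
```

The padding header is eight characters including its CRLF, so any terminator that would have started at 256, 257, or 258 moves to 264, 265, or 266, safely out of range.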

Wow, talk about bug workarounds encased in amber. That’s a really old bug and code is still trying to sidestep it. Looking on Google, it looks like other web intermediaries are also adding headers to fix it: Apache used to send X-Pad, and WebSTAR sent X-BrowserAlignment.

I doubt the affected browser is even out there in the wild any more, but Akamai is still adding this header to responses, plugging away a decade later. It’s astounding to think of the labyrinth of special checks and bug adaptations in software like this, the extra cycles expended in the name of obsolete components that are no longer even listening on the other end.

The problem of course is that once you’ve added code like this, how can you be sure it’s safe to remove? Who’s even checking over the code to consider that it might be safe? Accommodations like this get in the code and generally never come out, though Apache removed theirs.

One last micro-mystery: what do the N and S mean in “X-N: S”? I’m betting on “Netscape Sucks”!


At work we spend quite a bit of time going over old functionality and removing it. Usually for something like this we will create a Boolean global parameter and have its default be True. That way when we do normal reviews of existing parameters we will be able to keep track of the workaround. After forking for a product release, the first thing we do is go over all the older file versions and parameters to see what we can now remove. Lots of going to research and asking 'do you still need this parameter? should we intern the X value? can we remove this feature? what EXACTLY are you still using it for?'. We are not popular during this time. We are in code freeze right now (only approved and reviewed bug fixes go in). Once QA gets past the next milestone next week the party begins and we start looking for crap to remove. Actually we already have an initial list, and some clients are decidedly not happy. Oh well.

The only thing more fun than completing a feature you have been working on for 3 years: being able to finally rip it out of the code base. No, this is not a joke; it is actually quite cathartic.
It's this exact thing which makes me wonder about the wisdom of the choice our Internet forbears made adopting 'Be conservative in what you do; be liberal in what you accept from others' as the guiding principle behind our Internet protocols.

While neighbourly, I'd argue that it's bad engineering. A hard fail (Internet Protocol test suites anyone? :-) results in bugs getting fixed. If a system 'seems to work', the bugs often don't get fixed. It's just human (and corporate) nature.

An example of the waste this sort of thing engenders is the time my company spends working around Jurassic bugs in the browsers that run our markup and code. We do web applications, and a substantial part of the budget on each project is spent on 'backwards compatibility'. Libraries and experience reduce the factor, but it's still time that would be better spent 'adding value'. I shudder to think how much time and energy is wasted globally...
@Doug, I'm not sure I understand. You describe asking your clients if they still need a feature, then you describe them being mad when you remove it. Why remove it if people are still using it?

@Leon, yes, we'd have a more efficient streamlined ecosystem if the web had adopted an XML-like strictness from the beginning, but would we have as large an ecosystem? What would the adoption of the internet have looked like if many pages displayed error messages instead of a partial rendering? I think the loose connection between browsers and servers has allowed growth that wouldn't have happened under a different strategy. This is especially true when you consider the extension of the protocols. Would we ever have started putting images in web pages if the first ones displayed error messages in the older browsers that didn't implement the img tag? Graceful (or not) fallback has allowed us to build out HTML and other capabilities.

Two reasons why they are mad, which boil down to 'I want to do research, not make any modifications for your unimportant code management reasons. Code management is your problem, not mine.' Yes that is a direct quote.

1. When we remove an old file format which they no longer need and which is a problem for code support and/or implementing new features, the old files need to be upgraded. This is agreed upon by Research management, who are all also researchers. Anyone with an old experiment using an old version of the file needs to use the existing version of the research interface to upgrade their files before moving to the new release. We warn them with plenty of time, have deprecation warnings, and tell them to upgrade their files; but they never do. They get mad when that deprecation warning becomes an error and they now need to go to the previous release and run an upgrade. It is much worse if the upgrade requires making decisions; very rare but can happen. Their job is to do research and all these engineering requirement hoops piss them off to no end. Why not just support all the old formats going back in time? It would save them the effort of doing an upgrade or making a 'mrec.SaveUser()' function call; saves auto-upgrade.

2. Researchers love to copy the parameters that other people's experiments use without bothering to determine if that parameter is even needed for their experiment. They do not bother looking at the deprecation warnings. So when everyone agrees to remove a feature based on a parameter and then we remove the parameter after it has been deprecated for 10 releases, they get an error saying that it was removed on XXX in release YYY and either a set of instructions on what to do, or a url to the release notes. But what the researcher sees is that their experiments are failing and it's our fault.
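For anyone who hasn't lived through this, the warning-then-error lifecycle described in both points can be sketched roughly as below. This is not our actual engine code: the parameter names, the release number, and the registry layout are all made up for illustration.

```python
import warnings


class ParameterRemovedError(Exception):
    """Raised when a script sets a parameter that no longer exists."""


# Hypothetical registries; a real system would load these from release metadata.
DEPRECATED = {"old_score_mode": "use log_score_mode instead"}
REMOVED = {"legacy_pel": ("3.0", "see the release notes for 3.0")}


def check_parameter(name):
    """Warn on deprecated parameters and fail loudly on removed ones."""
    if name in REMOVED:
        release, advice = REMOVED[name]
        raise ParameterRemovedError(
            f"parameter {name!r} was removed in release {release}; {advice}"
        )
    if name in DEPRECATED:
        warnings.warn(
            f"parameter {name!r} is deprecated; {DEPRECATED[name]}",
            DeprecationWarning,
            stacklevel=2,
        )
    return name
```

Moving a name from the first table to the second is the moment the warning nobody read turns into the error everybody reads.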

It is one thing to agree that the old way of doing the best pel calculation is wrong and that we need to switch to the new system based on a log value, not a score/prob value, but it is another for every person in a 600+ research organization to understand the long tail impacts that making a minor change can have. There is a cognitive dissonance between the research mindset, which is cultivated way back as a freshman in college all the way through a doctorate, and the software 'engineering' mindset cultivated through years of code and release management.

It is a social engineering problem which I have seen at every large company with a significant R:D ratio that I have been involved with. Talking with the folks over at IBM, we are in pretty good shape overall.

Bonus: IT upgrades the version of something like SciPy which they have been asking for because some new feature or bug fix is holding them back and costing them time, and then it is our fault that they have to change their scripts if an API has changed in that package. It's the core speech recognition engine team's fault that IT did what they asked. I love that one. A few people do not understand that 'python' the language and tools is different from the 'python' engine interface, as it is all just python. This problem is very very rare and stems from a confusion over which things are really 'engine' features and which are 'python' features. This has been seen in the Django project as well, where there will be issues with improper use of module X which is not part of Django, but because the person's only experience with python is 'Django' they do not realize the difference.

The biggest problem is really #2, as we have fairly good data package management now, and it is only researchers who do not follow the research guidelines who run into problems there. The parameter problem is much bigger, because while we have parameter set files, it is very easy and convenient to just set a parameter in a script somewhere, and many scripts are not properly version controlled as they are 'just to try something out'. Even when in version control it is a dvcs with tons of branches, and you never know which are actually still active. The research mentality is to keep everything, as it might be useful at some point in the future.

Thanks for the chance to rant ;-) it was quite cathartic.
I should note that I am painting with a very very wide brush here when I say 'research'. We have a number of research teams all over the world, and these issues are just not a problem for the vast majority of them, and most do understand. 'We' the engine team (which is technically part of research) just only hear from the people who are upset. I understand their frustration and sympathize to a point. Though I do take a bit of perverse glee in complaining to specific people about all the problems feature X, which has not been used in product for 2 years, is causing me for trying to implement Y.
@Leon, I agree regarding Postel's Law being a mistake. The problem with being liberal in what you accept is the implied promise to maintain forward compatibility with broken clients. Whose fault is it in the future when the server changes slightly (within the defined protocol) and the client-server connection is broken? It's hard to blame the client when the error wasn't flagged earlier.

However in the case at hand, I think competition and pragmatism are responsible rather than Postel's Law. Netscape's browser was released, people were using it, and to users and customers it would appear that the server was broken. If one vendor's server worked around the problem and others didn't, it would be a point in favor of the ones that did.

@Ned, defining a protocol strictly (and enforcing adherence) doesn't prevent extensibility, rather it increases the freedom of the designer to extend because the rules are well-understood.
A bit more trivia about the Akamai problem, for anybody that's curious. The problem only occurred a small percentage of the time, specifically when the server generated an HTTP/1.1 chunked response. Chunked responses lack an overall Content-Length header; instead, each chunk has a length prepended and the end of the response is marked by a zero-length chunk. Akamai opportunistically gzips under the assumption, mistaken in this case, that the overall reply is large. Interesting to note that for Akamai, CPU is apparently cheaper than bandwidth.
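For anyone who hasn't looked at chunked framing, here is a minimal sketch of the encoding and decoding. The function names are made up, and trailers and most error handling are left out; chunk extensions after a ";" are simply ignored.

```python
def encode_chunked(pieces):
    """Frame byte strings as an HTTP/1.1 chunked body."""
    out = bytearray()
    for piece in pieces:
        if piece:  # skip empty pieces: a zero-length chunk would end the body early
            out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # zero-length chunk marks the end of the response
    return bytes(out)


def decode_chunked(data):
    """Reassemble the body from chunked framing (no trailer support)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size_field = data[pos:line_end].split(b";")[0]  # drop any chunk extension
        size = int(size_field.decode("ascii"), 16)      # sizes are hexadecimal
        pos = line_end + 2
        if size == 0:
            break
        body += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)
```

Note that the receiver can't know the total size until the final zero-length chunk arrives, which is why an intermediary has to guess whether the reply is big enough to be worth compressing.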
Thanks for the information. But there still must be a meaning to X-N: S? I wonder what it is.
