Crash-only software

Wednesday 1 September 2004

George Candea and Armando Fox have an interesting paper entitled "Crash-Only Software". They noticed that it is faster to crash a system and reboot it than it is to shut it down cleanly and reboot it (because of the time saved not doing an orderly shutdown). And since robust systems have to be prepared to recover properly from crashes anyway, why not skip the orderly shutdown and just crash? I haven’t gotten my head around all of the implications, but their logic is both counter-intuitive and compelling.


Having written a handful of "crash-proof" programs, I've found it's often assumed that after a crash you're willing to sacrifice startup time to make sure everything is in order. The total reboot time (shutdown + startup) will be faster if you don't crash, since, ideally, the system doesn't have to piece together where it was because it made a note at shutdown. That's also why shutdowns can often take a while: the system has to record enough information to be able to start up quickly.
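For illustration, the "made a note at shutdown" approach described above might look something like this minimal Python sketch. The marker file name and the `save_state`, `fast_load`, and `slow_recover` callbacks are hypothetical names, not taken from any real system; the only point is that a clean shutdown buys a fast startup, and anything else falls back to the slow check.

```python
import os

MARKER = "clean_shutdown"  # hypothetical marker file name


def shutdown(directory, save_state):
    """Orderly shutdown: save state first, then leave a note saying it is trustworthy."""
    save_state()
    with open(os.path.join(directory, MARKER), "w") as f:
        f.write("ok")


def startup(directory, fast_load, slow_recover):
    """Take the fast path only if the previous run left its note behind."""
    marker = os.path.join(directory, MARKER)
    if os.path.exists(marker):
        # Remove the note right away so a later crash is not mistaken
        # for a clean shutdown on the next start.
        os.remove(marker)
        return fast_load()
    # No note: assume a crash and piece things back together the slow way.
    return slow_recover()
```

Note that `slow_recover` still has to exist and still has to be correct, even though it only runs after a crash.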
When I worked in VC6 with my company's monstrous solution and project, MSDEV would grind for about 10 minutes on shutdown and then crash anyway. Presumably, it was writing some cruft back to one of the little files it litters the system with. But since it crashed anyway, there was no downside to just nuking it in taskmgr. Which, eventually, is what everybody did.

VC7.1 shuts down cleanly and (relatively) quickly.
Andrew: remember how we used the same approach with WebLogic? Why let it go through all of the trouble (and time) of grinding through shutdown code when you could just kill it and it would restart just fine? There's not much worse than a program that takes longer to go away than it does to come up in the first place.

Btw, Andrew, you need a geek rant blog ;-)
Omer, that’s the point. Software written that way seems to take longer to do a reboot cycle under normal circumstances, and it doesn't really have the benefit of having less code: you have to write disaster code that detects and recovers from partially written data anyway. If you design your system to be able to restart quickly in a crash scenario, then it’s by design better in the face of crashes, which can be important in systems where high availability matters.

Also, if the software is always crashed, then each time you start up you’re going through your disaster code, so it gets much more thorough testing. That disaster code is important, but if it gets run rarely, which is what is supposed to happen with most software, then the disaster code rarely gets tested in production scenarios. That’s a bad thing.
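To make that concrete, here is a rough Python sketch of the crash-only idea, under my own assumptions rather than anything from the paper: a tiny key-value store with no shutdown method at all. The `CrashOnlyStore` class, the `state.json` file name, and the `.tmp` suffix are all made up for illustration. Every start goes through the recovery path, and every write is a complete new file swapped in with an atomic rename, so a kill at any moment leaves either the old state or the new one.

```python
import json
import os
import tempfile


class CrashOnlyStore:
    """Hypothetical key-value store: no shutdown path, startup is always recovery."""

    def __init__(self, directory):
        self.directory = directory
        self.path = os.path.join(directory, "state.json")
        os.makedirs(directory, exist_ok=True)
        self.data = self._recover()

    def _recover(self):
        # Runs on every start, so the disaster code is exercised constantly.
        # Leftover temp files are writes that a crash interrupted; discard them.
        for name in os.listdir(self.directory):
            if name.endswith(".tmp"):
                os.remove(os.path.join(self.directory, name))
        if not os.path.exists(self.path):
            return {}
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def put(self, key, value):
        new_data = dict(self.data)
        new_data[key] = value
        # Write a complete copy to a temp file and force it to disk...
        fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=self.directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(new_data, f)
            f.flush()
            os.fsync(f.fileno())
        # ...then swap it in atomically; readers never see a half-written file.
        os.replace(tmp_path, self.path)
        self.data = new_data
```

"Stopping" this store just means killing the process. The trade-off is that every write rewrites the whole file, which is fine for a sketch but not something you'd do for large state.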
