Reporting server reliability

Thursday 19 June 2008

The next time someone inquires about how reliable your system is, say this:

We’re almost at five 9’s: we’re at five 8’s!

If you don’t know what I’m talking about, the Wikipedia article on uptime explains it pretty well: “Five nines” refers to a system being available 99.999% of the time, and is considered really good. “Five eights” would be 88.888% of the time, which would be horrible, and is in no way considered “almost” five nines.


If you really want to slam a system, say it has "nine fives" reliability. It's usually good for a laugh about three seconds later.
My server has five nines... 9.9999% reliability!
99.999% uptime means a system is only down for about 26 seconds/month, or about the time it takes for a machine to boot up.
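
To make the arithmetic behind that figure concrete, here is a minimal sketch that converts an availability percentage into a downtime budget. The function name and the choice of a 30-day month are assumptions made for illustration; a 31-day month allows slightly more downtime.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, days=30):
    """Seconds of downtime permitted over `days` at the given availability.

    Hypothetical helper for illustration; `days=30` is an assumed month length.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * SECONDS_PER_DAY


# Compare a few common targets, per 30-day month and per 365-day year.
for target in (99.9, 99.99, 99.999):
    per_month = allowed_downtime_seconds(target, days=30)
    per_year = allowed_downtime_seconds(target, days=365)
    print(f"{target}%: {per_month:.1f} s/month, {per_year / 60:.1f} min/year")
```

Each extra nine cuts the budget by a factor of ten: five nines leaves roughly 26 seconds in a 30-day month, or a little over five minutes in a year.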

This gets harder to achieve when you have a system with multiple dependencies, because the system is only up when every dependency is up, so their availabilities multiply (assuming they fail independently). For example, if you need to validate users against an authentication system (OpenID), fetch data from an external service (POP mail servers), and save data to a persistent data store (Amazon S3), you would need each of those systems to sustain closer to six nines for the system to boast 5-nines reliability.
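
A rough sketch of that calculation, assuming the three dependencies fail independently and the system is up only when all of them are up. The function names are hypothetical, and real dependencies rarely fail independently, so treat the numbers as an approximation.

```python
import math


def series_availability(component_availabilities):
    """Availability of a system that needs every component up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    return math.prod(component_availabilities)


def required_per_component(target, n_components):
    """Availability each of n equally reliable components needs to hit `target`."""
    return target ** (1 / n_components)


five_nines = 0.99999

# Three dependencies in series: authentication, external mail fetch, data store.
combined = series_availability([five_nines] * 3)
needed = required_per_component(five_nines, 3)

print(f"three five-nines dependencies in series: {combined:.5%}")
print(f"each dependency needs: {needed:.5%}")
```

Under these assumptions, three five-nines dependencies give the overall system only about 99.997%, and each one would need roughly 99.9997% for the whole to reach five nines.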

Any thoughts on whether 5-nines is harder or easier to achieve in the era of Web 2.0? The technology we rely on has certainly matured, but web services are definitely becoming more intertwined.
At my current company, the management that sets uptime goals simply has no understanding of how uptime is measured and what it means. One year, our goal was 99.9% uptime. The next year they thought it should be 99.99% -- after all, "it's only one more nine". This year I think they were willing to back down to 99.95%. The truth is that whether we reach these goals or not doesn't depend at ALL on how reliable our systems are; it depends entirely on which downtimes get defined as "counting" and which "don't count" -- making it an entirely political decision.
@Michael: I hear ya. "Uptime" can be a pretty meaningless number, especially when the pointy-heads start bandying it about. You're much better off redirecting the discussion towards quantifiable impact to users. E.g., how much downtime ("User can't log in") is acceptable per month? How many user-noticeable disconnects are tolerable? What is the maximum allowable latency?

I think these are much easier concepts for people to deal with. Moreover, as you can see by my last question, you quickly find yourself asking important questions that aren't necessarily directly impacted by downtime.
Yes, I often find that I keep two sets of goals: the official ones that I get evaluated on and the REAL goals. It is unfortunate, but probably an unavoidable part of corporate life.
