Reporting server reliability

Thursday 19 June 2008

The next time someone inquires about how reliable your system is, say this:

We're almost at five 9's: we're at five 8's!

If you don't know what I'm talking about, the Wikipedia article on uptime explains it pretty well: "Five nines" refers to a system being available 99.999% of the time, and is considered really good. "Five eights" would be 88.888% of the time, which would be horrible, and is in no way considered "almost" five nines.


Jeff Darcy 6:15 AM on 19 Jun 2008

If you really want to slam a system, say it has "nine fives" reliability. It's usually good for a laugh about three seconds later.

Roberto 7:52 AM on 19 Jun 2008

My server has five nines... 9.9999% reliability!

Robert Kieffer 12:11 AM on 20 Jun 2008

99.999% uptime means a system is only down for 27 seconds/month, or about the time it takes for a machine to boot up.
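A quick sketch of the arithmetic behind that figure, assuming a 30-day month by default (the helper name `downtime_seconds` is made up for illustration). A 30-day month allows about 26 seconds of downtime at five nines, and a 31-day month just under 27:

```python
def downtime_seconds(availability_pct, days=30):
    """Allowed downtime in seconds over `days` days at the given availability (in percent)."""
    period_seconds = days * 24 * 60 * 60
    return period_seconds * (100 - availability_pct) / 100


if __name__ == "__main__":
    # Compare the "five nines" target with the joke "five eights" figure.
    for label, pct in [("five nines", 99.999), ("five eights", 88.888)]:
        print(f"{label} ({pct}%): {downtime_seconds(pct):,.1f} s per 30-day month")
```

By the same calculation, "five eights" works out to roughly 80 hours of downtime in a 30-day month.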

This is proportionally harder to achieve when you have a system with multiple dependencies. E.g., if you need to validate users against an authentication system (OpenID), fetch data from an external service (POP mail servers), and save data to a persistent data store (Amazon S3), you would need each of those systems to sustain closer to six nines for the system to boast 5-nines reliability.
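To make the dependency math concrete, here is a rough sketch. It assumes the dependencies fail independently, that the system is up only when all of them are up, and that the system itself adds no downtime of its own. The function names are made up for illustration.

```python
import math


def combined_availability(*availabilities):
    """Availability of a system that needs every dependency up (independent failures assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def required_per_dependency(target, n):
    """Availability each of n equally reliable dependencies needs to reach `target` overall."""
    return target ** (1 / n)


def nines(availability):
    """Express an availability fraction as a (fractional) count of nines."""
    return -math.log10(1 - availability)


if __name__ == "__main__":
    # Three five-nines dependencies in series fall short of five nines overall.
    print(f"three five-nines services combined: {combined_availability(0.99999, 0.99999, 0.99999):.7f}")
    # What each of the three would need for the whole to hit five nines.
    each = required_per_dependency(0.99999, 3)
    print(f"needed per service: {each:.8f} (about {nines(each):.2f} nines)")
```

Under these assumptions, each of the three services needs roughly 99.99967%, which sits about halfway between five and six nines on a log scale.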

Any thoughts on whether 5-nines is harder or easier to achieve in the era of Web 2.0? The technology we rely on has certainly matured, but web services are definitely becoming more intertwined.

Michael Chermside 7:41 AM on 20 Jun 2008

At my current company, the management that sets uptime goals simply has no understanding of how uptime is measured and what it means. One year, our goal was 99.9% uptime. The next year they thought it should be 99.99% -- after all, "it's only one more nine". This year I think they were willing to back down to 99.95%. The truth is that whether we reach these goals or not doesn't depend at ALL on how reliable our systems are; it depends entirely on which downtimes get defined as "counting" and which "don't count" -- making it an entirely political decision.

Robert Kieffer 7:53 AM on 20 Jun 2008

@Michael: I hear ya. "Uptime" can be a pretty meaningless number, especially when the pointy-heads start bandying it about. You're much better off redirecting the discussion towards quantifiable impact to users. E.g., how much downtime ("User can't log in") is acceptable per month? How many user-noticeable disconnects are tolerable? What is the maximum allowable latency?

I think these are much easier concepts for people to deal with. Moreover, as you can see by my last question, you quickly find yourself asking important questions that aren't necessarily directly impacted by downtime.

Michael Chermside 11:23 AM on 21 Jun 2008

Yes, I often find that I keep two sets of goals: the official ones that I get evaluated on and the REAL goals. It is unfortunate, but probably an unavoidable part of corporate life.
