Monday, September 8, 2008

Followup on downtime computations

Last week, I posted a quick blurb about calculating minutes of downtime per 9. In one of the comments, Jeff Hengesbach wondered how other people calculate uptime.

The particular question was whether, when computing your availability numbers, you should include planned downtime in the final figure. In other words, if your maintenance window runs from 1am till 2am, do you count that hour-long window as downtime?

I can see both sides of the argument, but my general view is that as long as you are up front about your maintenance windows, and the customers have no reasonable expectation of service during those published times, then it shouldn't count against your availability.
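As a rough sketch of how much the choice matters, here is the same month computed both ways. Everything in it is an assumption made for illustration: the function names, the single published window, and the two outages are hypothetical, and it assumes outages don't overlap each other and windows don't overlap each other. When planned time is excluded, the window is removed from both the downtime and the total service time.

```python
from datetime import datetime, timedelta

def overlap(a_start, a_end, b_start, b_end):
    """Time shared by two intervals (zero if they do not overlap)."""
    return max(min(a_end, b_end) - max(a_start, b_start), timedelta(0))

def availability(period_start, period_end, outages, windows, count_planned):
    """Fraction of time the service was up during the period.

    outages and windows are lists of (start, end) datetime pairs.
    With count_planned=False, published windows are removed from the
    service time, and outage time inside a window is not counted.
    """
    total = period_end - period_start
    down = sum((end - start for start, end in outages), timedelta(0))
    if count_planned:
        return (total - down) / total
    planned = sum((overlap(ws, we, period_start, period_end)
                   for ws, we in windows), timedelta(0))
    excused = sum((overlap(os_, oe, ws, we)
                   for os_, oe in outages for ws, we in windows), timedelta(0))
    service_time = total - planned
    return (service_time - (down - excused)) / service_time

# Hypothetical 30-day month with one published 1am-2am window.
start, end = datetime(2008, 9, 1), datetime(2008, 10, 1)
windows = [(datetime(2008, 9, 7, 1), datetime(2008, 9, 7, 2))]
outages = [
    (datetime(2008, 9, 7, 1), datetime(2008, 9, 7, 2)),        # planned work
    (datetime(2008, 9, 19, 14), datetime(2008, 9, 19, 14, 30)),  # unplanned
]

strict = availability(start, end, outages, windows, count_planned=True)
lenient = availability(start, end, outages, windows, count_planned=False)
print(f"counting planned downtime:  {strict:.5%}")
print(f"excluding planned downtime: {lenient:.5%}")
```

With these made-up numbers, the strict figure charges 90 minutes against 43,200, while the lenient one charges 30 minutes against 43,140.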

However, sometimes service availability is of paramount importance. In these cases, you typically don't have any downtime, as normally recognized. In my case, I provide financial data to time zones spread all over the world, which means that there is no time of day when someone isn't trying to access services. From a customer's perspective, I literally have no downtime.

To deal with this, I use an array of methods to provide redundancy at every level. Though my budget doesn't allow me to use truly enterprise-level solutions, I try to do the best with what I have and what I know. Every iteration of the infrastructure has improved upon the previous one. I'm currently using technologies like blade servers, a small SAN, and a cluster of Linux machines, along with redundant switches and network access. In addition, we never have fewer than two physical sites. For me, any time a client can't access any part of the service counts against our availability.

I am curious, too, how everyone else does it. What do you count against your uptime? Do you even keep track of your uptime or availability? I've never had anyone ask me what mine is, and to be completely honest, I have never actually tracked it. It's always just been "down as little as possible", but I'm rapidly approaching the time where the Mean Time Between Failures (MTBF) of the equipment counts, and I may be asked what my uptime has been and what it is calculated to be. I don't want to have to make stuff up, so I'm appealing once again to everyone else.
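For the "calculated to be" half of that question, one common approximation is steady-state availability, MTBF / (MTBF + MTTR). A minimal sketch follows. MTTR (mean time to repair) is not mentioned above and is brought in as an assumption, the function names are hypothetical, and the hours are invented purely to show the arithmetic.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a single component is expected to be up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def expected_downtime_minutes_per_year(avail):
    """Expected minutes of downtime in a 365-day year at this availability."""
    return (1 - avail) * MINUTES_PER_YEAR

# Hypothetical component: fails every 999 hours on average, 1 hour to repair.
a = steady_state_availability(mtbf_hours=999, mttr_hours=1)
print(f"availability: {a:.3%}")
print(f"expected downtime: {expected_downtime_minutes_per_year(a):.1f} min/year")
```

This describes a single component with no redundancy. A redundant setup would need a different model, so treat it as a starting point rather than a prediction.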

How do you do it?