Recently on Twitter, there was a conversation about vendors certifying the availability of their arrays: which vendors certify their arrays as five nines, and so on. I am going to argue that these figures lull people into a false sense of security, because no one actually knows what they mean!
If a vendor says that their array is 99.999% available, what does that really mean to you? Probably not a lot in practical terms. Does it mean that individual components are 99.999% available? Or does it mean that the array itself, in some shape or form, is available?
If the array is still powered on and not in flames, is that available?
If 75% of disks are working, is the array still available?
If the array can service any I/O, is that available?
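For a sense of scale, here is a minimal sketch of the arithmetic behind the "nines". It assumes a 365-day year and says nothing about what "unavailable" means, which is the whole problem; the function name is just mine for illustration.

```python
# Downtime per year that a given availability percentage still permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a system can be 'unavailable' and still meet the figure."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes/year")
```

Five nines works out to a little over five minutes a year, but the figure only tells you how long, never what was actually down.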
What do vendor figures actually mean and do they matter to you? More importantly, do they matter to your customers? Your customer doesn't care whether the array is still working; all they care about is whether they have access to their data and their service is available. So ultimately, vendor availability figures are pretty much meaningless in the larger scheme of things.
So those vendors who read my blog, what do your availability figures actually mean?
agree on the 5 9s, it's a rather useless stat, but if they're less than 5 9s I question it.
& wouldn't availability = access to all data via at least 1 connection? theoretically 50% of the links could be down
Posted by: clarke thomas | January 12, 2010 at 01:31 PM
At IBM we measure what we call CIEs (well, we do love our TLAs): Customer Impact Events. Any CIE may have varying impacts on the customer, but generally it's associated with a time period during which said customer was unable to access their data as a result of a fault with said bit of IBM hardware or software.
This gives you a time of non-availability. Add up all of these events over a month, work out the power-on hours of all the installed bits of hardware or software over the same month, and comparing the two gives you the availability number.
IBM, however, does not generally state any availability numbers publicly, but they can be discussed and backed up with data if requested by customers or potential customers.
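The calculation described above can be roughly sketched as follows. This is only an illustration: it assumes every installed unit is powered on for the whole month, the function name is hypothetical, and all the figures are invented rather than IBM data.

```python
# Fleet availability from customer impact events (CIEs).
# All numbers here are invented for illustration; they are not IBM data.


def fleet_availability(outage_hours, installed_units, hours_in_period):
    """Fraction of fleet power-on hours during which data was accessible."""
    power_on_hours = installed_units * hours_in_period  # assumes always powered on
    downtime_hours = sum(outage_hours)
    return 1 - downtime_hours / power_on_hours


# Hypothetical month: 1,000 installed units, 720 hours each, three CIEs.
cie_hours = [0.5, 2.0, 4.7]
availability = fleet_availability(cie_hours, installed_units=1000, hours_in_period=720)
print(f"{availability:.5%}")
```

Note that a fleet-wide figure like this averages over every installed unit, so the customer who took the 4.7-hour event sees something very different from the headline number.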
Posted by: Barry Whyte | January 12, 2010 at 02:30 PM
Barry, so say I have a failure which takes out a RAID rank (love to hear Jonathan Ross say that); how would IBM know that I had lost access to data? Is it as granular as that?
Posted by: Martin G | January 12, 2010 at 03:21 PM
The DS8000 monitors the health of the RAID ranks themselves and will call home to IBM and report problems as they are found. We can also notify the customer in parallel with the call home process.
The health monitoring goes below the level of the physical bits (drives, device adapters, etc.) all the way into the logical components (RAID ranks, volumes, etc.) as well. We also make extensive use of integrity checking bits at the data storage level to make sure that the drives actually write what we ask them to.
All in all, we have a very granular problem tracking system in the DS8000 and consequently are extremely aware of problems that cause "customer impact". As Barry described, these events are calculated over the full field inventory of DS8000s and tracked very closely by IBM System Storage.
Posted by: K.T. Stevenson | January 12, 2010 at 08:00 PM
IMHO and with all due respect, nonsensical questions (such as a value for a theoretical storage array availability metric) will get nonsensical (and non-verifiable!) answers. Especially in a customer/vendor conversation. Any "number of nines" claim reminds me of "The Emperor's new clothes" (http://en.wikipedia.org/wiki/The_Emperor's_New_Clothes). Everyone knows it's nonsense but plays the game nevertheless, because everyone else does.
On a more constructive note, I think it **does** make sense to measure/estimate practical, end-to-end, application defined availability metrics. "What is the probability I will not be able to call up an NMR scan from a PACS system?". "What is the probability the transaction will fail when I try to book a seat online on Airline X's web site?".
To come up with reasonable estimates, you'll need serious functional insight into the different underlying components as well as historical data in a comparable setting. Merely multiplying theoretical availability numbers is a garbage in garbage out proposition.
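For reference, the naive multiplication looks like the sketch below. The component list and every number in it are hypothetical placeholders, and the model assumes a strictly serial path with independent failures, which real systems rarely satisfy.

```python
import math

# Hypothetical per-component availabilities for one request path.
# Placeholder figures only; not measured or vendor-published data.
components = {
    "array": 0.99999,
    "fabric": 0.9999,
    "server": 0.999,
    "application": 0.999,
}

# Naive serial model: the path is up only if every component is up,
# and failures are assumed to be independent of each other.
end_to_end = math.prod(components.values())
print(f"end-to-end: {end_to_end:.4%}")
```

The product is always lower than the weakest component, and it is only as trustworthy as the inputs and the independence assumption: garbage in, garbage out.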
Posted by: Paul Carpentier | January 14, 2010 at 11:44 AM
Paul, this is why I am asking the question: I am in complete agreement with you. They mean very little, but yes, we still persist in asking for them. I am not sure why, but it is a very easy metric to manipulate, which is why many departments use it internally.
For example, when I worked for a large retail bank that took over another retail bank, we got together to compare service availability metrics. At first glance they looked very similar, until you looked in detail: one was total availability including planned outages, the other was total availability excluding planned outages. The latter got very upset when their availability figures were significantly revised downwards.
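To show how much the definition alone can move the number, here is a small sketch. The outage hours are made up, and "excluding planned outages" is assumed to mean simply not counting planned downtime against the service (other definitions also shrink the denominator).

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


unplanned = 4.0   # hypothetical hours of unplanned outage in a year
planned = 48.0    # hypothetical hours of planned maintenance in a year

including_planned = availability(unplanned + planned)
excluding_planned = availability(unplanned)

print(f"including planned outages: {including_planned:.3%}")
print(f"excluding planned outages: {excluding_planned:.3%}")
```

Same service, same year, same outages; the only thing that changed is what was counted.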
Posted by: Martin G | January 14, 2010 at 01:15 PM