Today’s XaaS environments claim better service availability and data durability than many IT directors would be comfortable signing up for. These claims need to be understood, but for many of us that understanding often comes only through the pain of loss. The well-known Amazon Web Services platform, by just about any measure a model of reliability, makes an interesting statement: “Amazon S3 is designed for 99.99% availability and 99.999999999% durability.” For this discussion, let’s set aside the words “designed for,” which indicate this is an objective, not a promise.
It may help to look at the words being used here and make sure we ascribe the right meanings. One path is to consider opposites: it could be said that the opposite of durable is fragile, and the opposite of available is scarce. Nobody wants their data to be fragile or their service to be scarce. Even after trying to master the meaning of entropy in college thermodynamics classes, I think we all struggle with the idea that the natural state of the data we’ve stored and the systems we’ve built is chaos. Where, then, should we invest our energy in resisting the natural: protecting the fragile or stockpiling the scarce?
It would appear that durability is, in fact, the easier problem
In some ways we get clues from the number of 9s that Amazon is willing to put down on paper: 11x9s for durability versus 4x9s for availability gives a notion of the relative difficulty of the challenges. Even so, 11 nines are not easy. Our math indicates that 3 copies of data, never spending more than 4 hours missing a copy, bring a system to only about 9 nines of durability; it takes 4 copies to reach 11 nines, and word on the street is that Amazon uses even more. But all those disks aren’t free. Erasure coding schemes can save cost, but they too must be cared for. Conclusion: to get high durability, plan for sufficient redundancy and fix lost or corrupt data quickly, very quickly.
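The back-of-envelope math above can be sketched as a toy model. Everything here is an illustrative assumption: independent disk failures, a hypothetical 4% annual failure rate, and the rule that data is lost only if every remaining copy also fails before the repair window closes. It will not reproduce the exact figures quoted above, since those depend on the failure-rate and repair assumptions chosen, but it shows the shape of the problem: nines grow with copy count and shrink as repair time stretches.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760


def durability_nines(copies: int, afr: float = 0.04, repair_hours: float = 4.0) -> float:
    """Rough nines of annual durability under independent-failure assumptions.

    afr is an assumed annual failure rate per copy (4% is hypothetical);
    repair_hours is how long the system runs with a copy missing.
    """
    hourly_rate = afr / HOURS_PER_YEAR
    # Expected "first copy lost" events per year, summed across all copies.
    p_loss = copies * afr
    # For data loss, every surviving copy must then also fail
    # inside the repair window, one after another.
    for survivors in range(copies - 1, 0, -1):
        p_loss *= survivors * hourly_rate * repair_hours
    return -math.log10(p_loss)


for n in (2, 3, 4):
    print(f"{n} copies, 4h repair: ~{durability_nines(n):.1f} nines")
```

Note that in this model, shrinking the repair window buys nines nearly as effectively as adding copies, which is the article’s point about fixing lost data very quickly.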
Repair time and durability
So what’s the challenge with availability?
To start with, in this case we can’t stockpile the scarce. No matter how long your system has been working, it’s only 52.56 minutes away from not being 99.99% available.
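The 52.56-minute figure falls straight out of the arithmetic: a year has 525,600 minutes, and a 99.99% target forfeits one ten-thousandth of them. A quick sanity check:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for nines in (3, 4, 5):
    target = 1.0 - 10.0 ** -nines
    print(f"{nines} nines -> {downtime_budget_minutes(target):.2f} min/year")
```

Three nines still allows roughly 8.76 hours a year; five nines leaves barely five minutes.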
What’s abundant and easily available is unavailability. Unavailability is like air: it can easily be stored by compressing it in a tank, but getting rid of it by creating a vacuum is very hard, and zero downtime is a perfect vacuum. The only thing we can act on directly is making sure we don’t waste downtime.
To further investigate availability, let’s work with Donald Rumsfeld’s famous statements: “…as we know, there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know.”
So first, what are the known knowns? We know that systems need maintenance, and we know when it’s going to happen. Unfortunately, in today’s worldwide service solutions, it’s only an occasional IT person who dares to claim that a maintenance window is not downtime. If we are really measuring downtime, we had better include maintenance windows, which means that past 3x9s there is no such thing as a maintenance window, ever. That’s simple enough: the known known.
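A concrete (and entirely hypothetical) example makes the point: suppose a modest one-hour maintenance window per month. That alone spends 12 hours a year, while a three-nines target permits only about 8.76 hours, so planned maintenance by itself sinks the target before a single unplanned failure occurs.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

# Budget allowed by a 99.9% (three-nines) availability target.
three_nines_budget_hours = HOURS_PER_YEAR / 1000  # 8.76 hours/year

# Assumed maintenance schedule: one 1-hour window per month (hypothetical).
maintenance_hours_per_year = 12 * 1.0

print(f"budget: {three_nines_budget_hours} h/year, "
      f"maintenance alone: {maintenance_hours_per_year} h/year")
print(maintenance_hours_per_year > three_nines_budget_hours)  # True
```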
So what are the known unknowns? We know that equipment will break. Disk drives will fail, network interfaces will fail, and motherboards and literally every other component of a system will fail given enough time. Unfortunately, we just don’t know when these nasty things will happen: known unknowns.
We also know that when things fail, it takes time to figure out what failed and how to fix it, and at 4x9s we’ve got about 52 minutes to figure things out and get them fixed. At 5x9s there is barely enough time to determine that the service is unavailable, let alone any hope of repair.
So to keep a system available despite these known unknowns, none of the likely failures can require diagnosis or repair for the system to stay online. The key to improving availability is to stockpile those precious 52 minutes and not waste them on known knowns or known unknowns.
So we reach a simple conclusion: in order to provide high availability we must save all the downtime for Rumsfeld’s “unknown unknowns.” We cannot use any of it for maintenance or diagnosis of known component failure scenarios.
Finally, if we take an interest in gracefully handling Rumsfeld’s “unknown unknowns,” we must push on to geo-distributed systems. How do we keep a system alive and prevent data loss when very unexpected things happen? Multiple sites can save us when two electricity grids and the diesel generators all fail, or during a 100-year flood, an earthquake, or some other lurking unknown like, heaven forbid, WMD. Geo-distributed systems are worthy of a longer discussion, but, like Scality, they start by managing the fragile and the scarce.