Changing the Way I Think About Storage
It’s not every day you get to ride the wave of a fundamental shift in technology. I’d like to talk a bit about how I am staying afloat.
My introduction to shared storage was with the original HP EVA series. They were pretty awesome – a single, user-friendly web interface could get you from power up to production in a couple of hours. LUNs (Logical Unit Numbers) so massive that they could host upwards of 20 virtual machines were automatically spread across hundreds of drives, utilizing the aggregate capacity and performance to deliver what would eventually be coined ‘private clouds’. Dedicated, segregated, and redundant Fibre Channel (FC) networks guaranteed efficient data delivery under the most extreme utilization.
Over the years, these private clouds have grown exponentially. 20 virtual machines can be spun up by anyone with a credit card and an extra 10 minutes or so on their lunch break. Unstructured data has been doing laps around the more traditional structured stuff for years, and the finish line has extended into the exabytes. I’ve made my rounds through the large enterprise storage vendors, working with some of the biggest, fastest, and coolest arrays out there. Hulking, sub-millisecond workhorses that can literally take a bullet and keep going will always hold a special place in my heart, but I’ve recently decided to take a leap forward and hop on the bandwagon of the massively scalable, software-defined variety.
I’m far from a guru, but I’d like to share with you some of the shift in thinking required when moving from one world into the next. I’m still learning every day, but here are some of the major mental hurdles I’ve had to clear:
RAID is a liability.
In the past, parity disks gave me a warm fuzzy feeling inside, and hot spares helped me sleep at night. RAID protection is a solid choice for a single storage system that sits in a single rack, inside a single datacenter, and serves data to a single application. But when things begin to scale, issues arise. It is standard practice these days to pool disk groups together for management efficiency, which means the loss of a single RAID group inside a modern FC array can take down the entire system – I’ve seen it happen. All it takes is 2 or 3 drives failing at the same time, and someone gets a phone call at 3AM to deal with the failure and resolve it as quickly as possible. To combat this, system administrators must protect the data in their primary storage system with RAID while maintaining fully replicated copies on a secondary on-premises system. If the data is critical, it is also replicated to another dedicated system off site. That’s 3 full copies of your data on systems that are already giving up 25% of their raw capacity to RAID-6 parity.
Here’s what that looks like: 100TB of usable capacity on a shared array with R6 (6+2) disk groups will require about 133TB of raw disk space. This needs to be snapped locally to a standby system in the same datacenter and replicated off site for disaster recovery (DR). That’s 400TB of raw storage to host a 100TB application. Also factor in the FC switches, the management overhead of providing clustering and failover capability, and possibly even local backups to protect against corruption replicating across the wires. That’s a whole lot of system for a little bit of data.
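The arithmetic above can be sketched in a few lines. This is just back-of-the-envelope math (the function name is mine, not any vendor’s tool), assuming one primary copy plus a local standby and an off-site DR copy, each carrying the same RAID-6 overhead:

```python
def raw_capacity_tb(usable_tb, data_disks, parity_disks):
    """Raw capacity required when usable space is striped across
    data_disks + parity_disks drives per RAID group."""
    return usable_tb * (data_disks + parity_disks) / data_disks

usable = 100                             # TB the application actually needs
primary = raw_capacity_tb(usable, 6, 2)  # R6 (6+2) primary array: ~133 TB raw
total = primary * 3                      # primary + local standby + off-site DR
print(f"primary: {primary:.0f} TB raw, total footprint: {total:.0f} TB raw")
```

Running this prints roughly 133 TB for the primary array and 400 TB across all three systems – a 300% overhead on the original 100 TB of data.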
New ways to protect data have been in the works for years, and the two that make the most sense in a distributed system are software-based erasure coding and full-copy replicas. Each has its strengths and weaknesses, which is why it is important for modern storage systems to support both, with automated policy assignment based on things like file size and access point. There is enough material here for a completely separate post, but know that we can now have discussions around numbers like 28% overhead as opposed to the 300% overhead mentioned above, and at availability levels several orders of magnitude greater than traditional storage can offer.
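To make the overhead comparison concrete, here is a minimal sketch. An erasure-coded layout splits each object into k data fragments plus m coded fragments, and the extra capacity cost is simply m/k. The specific k+m layouts below are illustrative choices of mine, not from any particular product, but a wide stripe in the neighborhood of 14+4 lands near the 28% figure:

```python
def overhead_pct(data_frags, coded_frags):
    """Extra raw capacity beyond usable, as a percentage of usable,
    for a layout of data_frags + coded_frags fragments per object."""
    return 100 * coded_frags / data_frags

# Illustrative layouts; each survives the loss of `m` fragments:
for k, m in [(6, 2), (14, 4)]:
    print(f"{k}+{m}: {overhead_pct(k, m):.1f}% overhead, survives {m} fragment losses")
```

A 14+4 layout costs about 28.6% extra capacity yet tolerates four simultaneous fragment losses spread across nodes or racks, compared with the 300% cost of three full copies.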
The world does not run on microseconds.
SAP HANA. Oracle. HPC. These are environments that thrive on IOPS (Input/Output Operations per Second) and very low latency. The systems that support them are very cool and very expensive. They continue to hold a critical place in the datacenter, but it is a corner that is shrinking away – dwarfed by the incredible scale of unstructured data created and consumed by users around the world.
When I sit down and get ready to stream the newest episode of Game of Thrones, I want it now. I don’t expect to watch a wheel spin for 10 minutes while the content buffers enough to play. However, I can tolerate a little leeway – let’s say 20ms or so for the backend to retrieve the first bits and a few tens of milliseconds more for the light to travel across the country into my living room. This is the type of data that is becoming more and more prevalent in the datacenter, and things like accessibility, manageability, and ease of scale take precedence over sub-millisecond performance requirements.
We need to start thinking about performance differently – in terms of a storage system’s ability to deliver content at or better than the end user’s expectations. “Objects per Second” (OPS) is a term you’ll start to hear more and more. Throughput will remain a key performance indicator, and it is simply the product of OPS and the size of the objects being delivered.
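The relationship above is trivial, but writing it out keeps the units honest (the function name and the example numbers are mine, purely for illustration):

```python
def throughput_mb_per_s(objects_per_sec, avg_object_mb):
    """Throughput is the product of object delivery rate (OPS)
    and average object size."""
    return objects_per_sec * avg_object_mb

# e.g. a system sustaining 500 objects/s at an average of 4 MB per object:
print(throughput_mb_per_s(500, 4))  # 2000 MB/s
```

The same OPS number yields wildly different throughput depending on object size, which is why both metrics matter when sizing a system for unstructured data.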
Purpose-Built is not the Holy Grail.
Application administrators and storage architects may not always see eye to eye, but they are all working towards a common goal: generating revenue and reducing risk. At times, they butt heads trying to get to the same destination: app folks want their application to run in the most efficient, available, and highest-performing environment possible. They want to deploy more application resources whenever they need them, and they want everything to just work. Storage people want to maintain their service-level agreements (SLAs) without building a new datacenter every time a new initiative hits the floor.
In recent years, application folks have bypassed the traditional storage purchasing processes by shopping for systems on their own. They have budget, and they have influence. Recently, companies have been cracking down on this “Shadow IT” methodology, and the systems end up back in the datacenter – and guess who has to manage them? That’s right, the storage team.
Wouldn’t it be great if the storage team could provide the application people the exact workloads they need, accessible via the protocols they want, and in a timeframe that doesn’t push back anyone’s deadlines? One-off acquisitions of one-trick ponies may serve in a pinch, but they are detrimental to the long term goals of all parties involved. A general-purpose, multi-protocol solution built on industry-standard hardware to address the many needs of a given industry IS the Holy Grail, and it is ready today.
As the ratio of unstructured data to structured data increases, so will the ratio of storage systems built to handle the former versus those built to handle the latter. I don’t think dedicated storage arrays are going away. In fact, I think some really cool innovation is coming from the major storage vendors to create bigger and badder systems to handle some of the operations required to really glean value from all of this unstructured data. However, the world is changing and we need to continue thinking differently if we want to remain relevant.
Think peer-to-peer technology. Industry-standard components. Abstraction of traditional functions from the hardware to the software, creating an application-centric environment that scales with the times.
Think software-defined storage.