What is Erasure Coding?

Erasure coding is a data protection technique that divides data into fragments, creates redundancy through mathematical encoding, and enables recovery from multiple simultaneous failures without storing complete duplicate copies.

Data protection at petabyte scale creates profound economic challenges. Mirroring—storing complete copies of data on separate disks—provides protection but doubles storage costs. For large enterprises managing petabytes of backup, archival, and secondary storage, the cost of mirroring is often prohibitive. Erasure coding solves this problem by providing equivalent or superior protection while reducing storage overhead to 1.3x-1.5x rather than 2x or more. For infrastructure architects designing cost-efficient storage for thousands of employees in large organizations, erasure coding is not an advanced technique—it is a foundational architecture enabling petabyte-scale storage economics.

Why Erasure Coding Is Essential for Large-Scale Storage

Traditional RAID storage protects against failure by maintaining duplicate copies or parity information at the disk level. RAID 6 protects against two simultaneous disk failures by striping data and two parity blocks across multiple disks. This provides excellent protection, but only against the failure of a small, fixed number of drives. When disk capacity reaches multiple terabytes and data sets span hundreds of drives, RAID's limitations become apparent: recovering a single failed drive requires reading terabytes of data from the surviving drives, a process that takes hours and stresses the remaining storage throughout the rebuild.

Erasure coding extends data protection to arbitrary numbers of failures. Instead of storing complete copies or simple parity information, erasure coding divides data into k fragments, creates m parity fragments through mathematical computation, and enables recovery from any m fragment failures. Typical erasure coding schemes might use k=8 data fragments and m=4 parity fragments. This allows storage of data in 8 fragments, with capacity to recover from any 4 simultaneous failures. Storage overhead is 12/8 = 1.5x instead of the 2x overhead of mirroring.
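The arithmetic above generalizes: a k+m scheme stores (k+m)/k bytes per logical byte and tolerates the loss of any m fragments. A minimal sketch:

```python
def ec_profile(k: int, m: int) -> dict:
    """Storage overhead and failure tolerance for a k+m erasure code."""
    return {
        "total_fragments": k + m,
        "tolerated_failures": m,          # any m fragments may be lost
        "storage_overhead": (k + m) / k,  # raw bytes stored per logical byte
    }

# The 8+4 scheme from the text: 12 fragments, any 4 losses survivable, 1.5x overhead
print(ec_profile(8, 4))
```

By contrast, mirroring is effectively k=1, m=1: a 2x overhead that tolerates only one loss per copy pair.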

Cost efficiency improves dramatically at scale. In backup and archival applications where data protection is essential but rapid recovery from individual failures is less critical, erasure coding enables organizations to manage petabytes of protected data on thousands of drives while maintaining manageable storage costs. The same organization might be unable to afford complete mirroring at petabyte scale but find erasure coding economically optimal.

How Erasure Coding Operates

Erasure coding uses mathematical techniques to create redundancy without duplication. The simplest erasure code, XOR (exclusive OR), was actually invented decades before modern erasure coding became practical. XOR parity divides data into blocks, computes XOR of all blocks, and stores the XOR result. If any single block fails, it can be recomputed from all other blocks and the parity. This is the foundation of RAID 5.
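The XOR scheme described above can be shown in a few lines of Python: compute parity across equal-length blocks, then rebuild any single lost block from the survivors plus parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data_blocks = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data_blocks)

# Simulate losing block 1 and recomputing it from the other blocks and parity.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
assert recovered == data_blocks[1]
```

This is exactly why RAID 5 survives one failure but not two: a single XOR equation can only solve for one unknown.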

Modern erasure codes like Reed-Solomon compute parity chunks using polynomial mathematics. From any k of the k+m total chunks, original data is recoverable. If k=8 and m=4, from any 8 of 12 chunks, data is recoverable—4 chunks can be lost without data loss.
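The "any k of k+m" property can be illustrated with a toy Reed-Solomon-style code. This sketch works over the prime field GF(257) for readability; production codes use GF(2^8) with table-driven arithmetic, but the recoverability argument is the same: k chunks determine a degree-(k-1) polynomial, and every chunk is an evaluation of that polynomial.

```python
P = 257  # a prime, so every nonzero element has a multiplicative inverse

def eval_at(points, x):
    """Lagrange interpolation over GF(P): value at x of the unique
    polynomial passing through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def encode(data, m):
    """Systematic encode: chunks 0..k-1 carry the data values themselves,
    chunks k..k+m-1 carry polynomial evaluations (parity)."""
    k = len(data)
    points = list(enumerate(data))
    return points + [(x, eval_at(points, x)) for x in range(k, k + m)]

def decode(chunks, k):
    """Recover the original k data values from ANY k surviving chunks."""
    survivors = chunks[:k]
    return [eval_at(survivors, x) for x in range(k)]

data = [65, 66, 67, 68]                   # k = 4 data values
chunks = encode(data, m=2)                # 6 chunks, tolerates any 2 losses
survivors = [chunks[1], chunks[2], chunks[4], chunks[5]]  # chunks 0 and 3 lost
assert decode(survivors, k=4) == data
```

Because the code is systematic, reads that hit surviving data chunks pay no decoding cost; interpolation is only needed when a requested chunk is missing.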

Implementing erasure coding at scale requires careful engineering. Encoding is computationally expensive, so storage systems rely on hardware accelerators and parallel processing to keep up with write throughput. Decoding is similarly intensive, which makes recovery time after a failure a critical design constraint.

Key Considerations for Erasure Coding Deployment

Recovery time and network impact during failures are critical considerations. Rebuilding each lost fragment requires reading from k surviving fragments, so recovery traffic can saturate networks and stress the remaining storage. Organizations must design networks around recovery requirements, not just steady-state traffic.
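A back-of-envelope calculation shows why recovery traffic shapes network design. This sketch assumes hypothetical numbers (16 TB of lost fragments, an 8+4 scheme, 25 Gbps reserved for recovery) purely for illustration:

```python
def rebuild_read_volume_tb(lost_tb: float, k: int) -> float:
    """Reconstructing each lost fragment requires reading k surviving
    fragments, so recovery reads roughly k bytes per lost byte."""
    return lost_tb * k

def rebuild_hours(lost_tb: float, k: int, network_gbps: float) -> float:
    """Lower bound on rebuild time if recovery is network-bound."""
    read_bits = rebuild_read_volume_tb(lost_tb, k) * 8e12  # TB -> bits
    return read_bits / (network_gbps * 1e9) / 3600

# Hypothetical: a 16 TB drive's fragments lost, k=8, 25 Gbps recovery bandwidth
print(round(rebuild_hours(16, 8, 25), 1), "hours minimum, reading",
      rebuild_read_volume_tb(16, 8), "TB from survivors")
```

Spreading fragments across many nodes lets this read load parallelize, which is why erasure-coded clusters typically rebuild faster as they grow.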

Capacity headroom for rebuilds (sometimes called rebuild overhead) impacts storage economics. Storage clusters must maintain sufficient spare capacity to hold reconstructed fragments after a failure. If a cluster operates near full capacity, failure recovery becomes impossible—there is nowhere to write the reconstructed data. This requires designing clusters with substantial headroom above active data, reducing the effective storage capacity available for customer data. Erasure coding schemes must balance protection strength against the capacity headroom they require.
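As a rough sketch of the economics, effective capacity can be estimated from raw capacity, the coding overhead, and the reserved headroom. The 15% headroom figure here is an illustrative assumption, not a recommendation:

```python
def usable_capacity_tb(raw_tb: float, k: int, m: int,
                       headroom_fraction: float) -> float:
    """Customer-facing capacity after erasure-coding overhead and
    reserved rebuild headroom. headroom_fraction is an assumed policy."""
    overhead = (k + m) / k
    return raw_tb * (1 - headroom_fraction) / overhead

# Hypothetical 1 PB raw cluster, 8+4 coding, 15% rebuild headroom
print(round(usable_capacity_tb(1000, 8, 4, 0.15), 1), "TB usable")
```

Even with headroom, the result comfortably beats mirroring: the same raw capacity mirrored with 15% headroom would yield only 425 TB usable.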

Performance implications differ between erasure coding schemes. XOR-based coding (as in RAID 6) is computationally simple and fast. Reed-Solomon is more flexible but more computationally expensive. Degraded performance during reconstruction—when one or more data fragments are missing and reads must reconstruct the missing data—is an important consideration. Some applications cannot tolerate the latency impact of reconstructing data on every read, making degraded-mode read latency a critical metric.

Latency sensitivity affects erasure coding applicability. Real-time systems requiring consistent sub-millisecond latency cannot tolerate the variable latency of erasure-coded storage. Storage for transactional databases might require mirroring or other simpler protection. Erasure coding is most applicable to batch-oriented workloads and latency-insensitive applications where throughput is more important than individual request latency.

Erasure Coding in Backup and Archive Storage

Backup storage systems increasingly use erasure coding to reduce costs while maintaining data protection. Backup data doesn’t require the low latency of production storage, making degraded-mode performance acceptable. Organizations can protect weeks or months of backup data using erasure coding instead of mirroring, cutting backup storage costs substantially.

Archive storage for compliance retention uses erasure coding extensively. Data retained for seven or ten years rarely requires rapid recovery, making the latency impact of erasure coding acceptable. The cost efficiency of erasure coding enables organizations to store decades of archived data economically. Archive storage systems built on erasure coding can maintain petabytes of protected archival data on modest budgets.

Further Reading