Human DNA’s areal density
2 TB. 2,000 gigabytes. Three or four times the average amount of storage available on a laptop today. Now, wrap your head around the fact that that is roughly the size of our genome once we decode it. Talk about Storage Areal Density. A single cell is approximately 1.55 × 10⁻⁵ square inches, which makes the areal density of a cell roughly 130 PB per sq.in. Now that's effective. Nowadays, we're up to 900 GB per sq.in in the macro world 1. We've got some things to learn from Evolution, it seems.
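Sanity-checking that density figure, taking the 2 TB genome and the cell area from the text as givens:

```python
# Back-of-the-envelope check: one ~2 TB genome per cell, with a cell
# footprint of ~1.55e-5 square inches (both figures from the text).
genome_tb = 2.0
cell_area_sq_in = 1.55e-5

density_pb_per_sq_in = genome_tb / cell_area_sq_in / 1000  # 1 PB = 1000 TB
print(f"~{density_pb_per_sq_in:.0f} PB per sq.in")  # ~129 PB per sq.in
```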
It is of course not that simple. But the important point is that decoding and storing someone's genome takes 2 TB. That's a full disk nowadays. How can we store this efficiently and sustainably, knowing that it's not an archival use case? That genome will be processed and picked apart many times, and even more data will come from it.
BGI in China creates and stores 2,000 human genomes per day 2?! That's 4 PB a day. This is going at a pace far outstripping Moore's law.
This deluge of genomics data is a real problem, as David Haussler, director of the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz, put it: “Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome.” I'd say that the cost mostly comes from storage.
Let’s make a quick calculation:
Taking into account the overhead for security and data protection (i.e. RAID, ca. 1.3×), the overhead for DR capabilities (ca. 1.5×), and the fact that you can't fill disks to 100% without hurting performance, 4 usable PB quickly becomes 12 PB of raw storage. 12 PB a day.
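The arithmetic above, as a sketch. The ~65% fill factor is my assumption, picked to leave performance headroom; the other multipliers come from the text:

```python
# From 2,000 genomes/day at ~2 TB each to raw capacity per day.
genomes_per_day = 2000
tb_per_genome = 2
usable_pb = genomes_per_day * tb_per_genome / 1000   # 4 usable PB/day

raid_overhead = 1.3   # security / data protection (e.g. RAID)
dr_overhead = 1.5     # disaster-recovery copies
fill_factor = 0.65    # don't fill disks to 100% (assumed headroom)

raw_pb = usable_pb * raid_overhead * dr_overhead / fill_factor
print(f"~{raw_pb:.0f} PB of raw storage per day")  # ~12 PB
```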
At such a rate of data creation, operations would spend all their time racking new machines and disks. And even then, I don't think even Amazon or Google could rack 4 PB (usable!) a day.
Compression and deduplication are key
DNA is a sequence of nucleotide bases that, once broken down, compresses fairly easily; on top of that, many sequences are recurring, making it a very good deduplication target.
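As a toy illustration of the deduplication angle (the chunk size and sequences are invented; real dedup engines work on disk blocks or variable-size chunks):

```python
import hashlib

def dedupe(sequence: str, chunk_size: int = 8):
    """Store each unique chunk once, keyed by its hash."""
    store = {}    # digest -> chunk, one copy per unique chunk
    recipe = []   # ordered digests, enough to rebuild the sequence
    for i in range(0, len(sequence), chunk_size):
        chunk = sequence[i:i + chunk_size]
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        store[digest] = chunk
        recipe.append(digest)
    return store, recipe

# A highly repetitive synthetic fragment deduplicates very well:
seq = "ACGTACGT" * 1000 + "TTGACCAA" * 1000
store, recipe = dedupe(seq)
print(len(seq), "bases,", len(store), "unique chunks")  # 16000 bases, 2 unique chunks
```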
There are some studies on applying the very well-known LZW compression algorithm, used in the Unix compress utility and the GIF format, to compress DNA sequences with extremely high ratios. One of the newest proposed tools for genome sequencing and resequencing data compression, called GReEn, boasts compression ratios as high as 1:150 (saving 99.4% of space: 3 GB becomes about 17 MB) 3.
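A quick way to see why nucleotide text compresses so well is to run a repetitive synthetic sequence through a general-purpose compressor (bz2 here; this is not GReEn, whose 1:150 figure relies on specialized, reference-aware techniques):

```python
import bz2

# Synthetic, highly repetitive fragment; real genomes are less regular,
# so a generic compressor won't reach GReEn-like ratios on them.
seq = ("ACGTACGTTTGACCAA" * 4000).encode()  # 64,000 bases
compressed = bz2.compress(seq)
ratio = len(seq) / len(compressed)
print(f"{len(seq)} bytes -> {len(compressed)} bytes ({ratio:.0f}:1)")
```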
With these ratios, 4 PB becomes roughly 27 TB, which is far more reasonable to handle, but is still significant growth.
So what are the options for genomics storage? Right now, SANs and NAS are very likely being used, but that has to be unsustainable in the long run, if it isn't already. Maybe tape with LTFS? But as stated above, this is not an archival use case: genome analysis needs to resequence and compare data from different genomes all the time, and multiple times. So tape is not the solution either. The data needs to be online, or very nearline.
Some companies, like DNANexus (backed by Google), are using the cloud, in this case Amazon's AWS S3 service, to store DNA sequences. And I believe that is the only answer to this kind of growth: Object Storage.
All the components needed for a sustainable DNA-sequencing storage system are inherent to most Object Stores: elasticity, to add terabytes of capacity daily; cost-effectiveness, from commodity servers and improved operational efficiency; and a flat namespace, with no real limits on growth. There are other characteristics, like rich metadata, that probably apply very well to this use case, but I'm not enough of an expert to say.
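A minimal sketch of what that model looks like. The names and metadata fields here are invented; a real store like S3 exposes the same put/get-with-metadata semantics over HTTP, at scale:

```python
class ObjectStore:
    """Flat namespace: every object lives under a single key."""

    def __init__(self):
        self._objects = {}  # key -> (data, metadata)

    def put(self, key, data, **metadata):
        self._objects[key] = (data, dict(metadata))

    def get(self, key):
        return self._objects[key]

store = ObjectStore()
# Hypothetical genome object, with searchable metadata stored alongside it:
store.put("genomes/sample-0001", b"ACGT...",
          sequencer="hypothetical-X", compression="green")
data, meta = store.get("genomes/sample-0001")
print(meta["compression"])  # green
```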
Whether it is using the public cloud (like DNANexus) or deploying their own private object store (the latter being the more cost-efficient option), such a technology will drive down the cost of DNA sequencing and handling, and pave a shorter path to accessible personal genome sequencing and massive advances in healthcare and curing diseases.