It’s Raining CAATTs – Deluge of DNA Sequencing Data Poses Huge Data Analysis and Management Challenges
As part of ZE’s continuing commitment to staying abreast of new information-system trends, Yi-Jeng recently attended Bio-IT World Asia, held June 6-8, 2012 in Singapore. The conference brought together IT professionals, bioinformaticians, and pharmaceutical and biomedical scientists from all over the world to discuss the increasingly intertwined fields of IT and the life sciences.
First, here’s some background on the world of genomics and bioinformatics. Bioinformatics, simply put, is the application of information technology to biology and medicine. Modern bioinformatics really took off after 1990, when the Human Genome Project began. Thirteen years and roughly $3 billion later, the project was declared complete in 2003 with great fanfare, two years ahead of schedule (a working ‘rough draft’ of the human genome had already been announced in 2000).
Today, a mere decade later, the formerly herculean task of sequencing the 3 billion A, T, G, C bases that make up a human genome takes a modern “next-gen” machine about a week and roughly $3,000. Large sequencing centers, such as China’s BGI, are driving costs down further through the relentless application of economies of scale. Experts widely predict that the $1,000-per-genome barrier will be broken in the near future, bringing truly personalized medicine (and the risks dramatized in the 1997 film Gattaca) within the grasp of average consumers.
Bioinformatics Storage Requirements Quickly Outpacing Moore’s Law
Even as sequencing technology races forward, spurred both by technological leaps and by growing economies of scale, the capacity to store and make sense of this new ‘deluge’ of data has not kept up. We at ZE have noticed this recurring cross-industry trend of data explosion in many markets, especially power (electricity).
Currently, the cost of sequencing a base (a basic unit of DNA information) is dropping much faster than the cost of storing that data. In essence, DNA sequencing improvements are outstripping Moore’s Law. Back in 2010, the journal Nature Methods published an article (Figure 2) that foresaw this exact predicament. Raw data for a complete human genome runs around 300-700 GB. Analyzing that data often generates a ten-fold increase in analytical data and metadata. Following that logic, one complete human genome plus analysis may generate up to 7 terabytes [TB] (7×10^12 bytes) of information. Indeed, some scientists at the conference mused that re-sequencing a sample from the freezer may soon be preferable to the expensive alternative of storing its sequence digitally.
Recently, a large consortium of academic institutions launched the 1000 Genomes Project (http://www.1000genomes.org/about), which will deepen our understanding of human genetic diversity. As a consequence, the computational and storage requirements of 1,000 genomes are suddenly at the forefront of everyone’s mind. Using our earlier ballpark, that much genetic information may require as much as 7 petabytes [PB] (7×10^15 bytes) of storage space, which is no small feat for smaller labs.
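The back-of-envelope arithmetic above can be sketched in a few lines of Python. The input figures (up to 700 GB of raw data per genome, a ten-fold analysis overhead, 1,000 genomes for the project) are the ballpark numbers used in this article, not measured values:

```python
# Ballpark storage estimate for sequencing projects, using this
# article's assumed figures (not measured values).

RAW_GB_PER_GENOME = 700    # upper-bound raw data per human genome, in GB
ANALYSIS_MULTIPLIER = 10   # ten-fold expansion from analysis + metadata

def storage_estimate_tb(n_genomes: int) -> float:
    """Total storage in terabytes for n genomes, raw data plus analysis."""
    total_gb = n_genomes * RAW_GB_PER_GENOME * ANALYSIS_MULTIPLIER
    return total_gb / 1000  # decimal GB -> TB

print(storage_estimate_tb(1))     # one genome: 7.0 TB
print(storage_estimate_tb(1000))  # 1000 Genomes Project: 7000.0 TB = 7 PB
```

Even at the lower bound of 300 GB per genome, the same arithmetic still lands the project at roughly 3 PB, so the storage problem does not hinge on the exact figure chosen.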
The bioinformatics industry, much like other modern information-based industries, has been undergoing rapid cycles of technological advances and growing economies of scale. This positive feedback loop means that some key industrial processes, such as DNA sequencing, are dropping in price per unit faster than others, such as IT hardware, data management, and compression. ZE currently provides enterprise-level data acquisition and management solutions for many international companies in the energy and financial sectors (see our latest clients here). We will, of course, be looking for opportunities in the bioinformatics sector as it further develops.
We’ve described the current data challenges facing the bioinformatics sector. Next week, we’ll take a look at the consensus solutions proposed by Bio-IT World Asia attendees. Are some of these challenges familiar to your industry? We’d love to hear from you!