Feb 15, 2011

Archiving NGS data

Anyone who has worked with NextGen sequence data quickly gains an appreciation for the difficulties associated with long-term data storage. The current 'state of the art,' at least for Illumina machines, involves saving fairly raw data files, such as fastq text, to the NCBI Sequence Read Archive (SRA).


Our GAIIx is producing about 30 million reads per lane, which gives files of 8-10 GB (72 cycles) per lane in either qseq (completely unfiltered) or fastq (quality scored) format. If we max out two runs per week, that is about 140 GB of raw sequence data per Illumina machine per week.
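The weekly figure is simple arithmetic. Here is a rough sketch in Python, assuming the standard 8-lane flow cell (the lane count is my assumption for the calculation; the function and constant names are just for illustration):

```python
# Back-of-the-envelope estimate of raw sequence output per machine per week.
LANES_PER_FLOWCELL = 8  # assumption: standard 8-lane flow cell
RUNS_PER_WEEK = 2       # "if we max out two runs per week"

def weekly_raw_gb(gb_per_lane, lanes=LANES_PER_FLOWCELL, runs=RUNS_PER_WEEK):
    """Raw data volume in GB for one machine over one week."""
    return gb_per_lane * lanes * runs

low = weekly_raw_gb(8)    # 8 GB per lane
high = weekly_raw_gb(10)  # 10 GB per lane
print(f"{low}-{high} GB per week")
```

With 8-10 GB per lane this gives a range of 128-160 GB, which is where the "about 140 GB" comes from.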

There has been some recent discussion about the possibility of phasing out the SRA at NCBI.
[see this post which claims to be a memo from NCBI director David Lipman: "The Sequence Read Archive (SRA) will also be phased out over the next 12 months."]
If cost cutting is truly necessary for our national biomedical research infrastructure, I can see why the SRA would be a target: its raw data is growing at an awkwardly rapid rate and has less value than the highly used databases such as GenBank non-redundant nucleotide, GEO, etc.

I think that it is interesting to turn this discussion around and ask: why are we archiving all of this raw sequence data? The trivial argument is that "Journals require open access to raw data as a condition of publication." But that argument ignores the more interesting question: What is the 'raw data' for a sequencing project? No one is loading Illumina (or SOLiD or 454) image data into public archives. The impracticality of saving multiple terabytes of image data for each run made that approach moot a couple of years ago. We are saving raw qseq or fastq files right now because our methods for basecalling and SNP calling (and indel/translocation/copy number calling) are imprecise. I have seen data analysts go back into the primary sequence reads for a single sample and find a SNP that was not called because a few reads had below-threshold quality scores.

If we consider the actual "useful" data content of an NGS run on a single sample, the landscape looks quite different. ChIP-seq is our most common NGS application. The useful data from a ChIP-seq run is actually just a set of genome positions where read starts are mapped. At most, this is 20-30 million positions. In practice, about 30% of reads are not mapped, and another 10-50% are duplicates (multiple reads that map to the exact same position), so the final data set might be compressed to about 10 million genomic loci with a read count at each spot. After sorting and indexing, this information could be efficiently stored in a very compact file.
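To make the idea concrete, here is a minimal sketch in Python of collapsing mapped reads down to loci with counts. The input format (a tuple of chromosome, start, and strand per read, with None for an unmapped read) and the function name are assumptions for illustration, not the output of any particular aligner. Keeping the strand is also my assumption; drop it if only positions matter.

```python
from collections import Counter

def collapse_read_starts(alignments):
    """Collapse mapped reads to sorted (chrom, start, strand, count) rows.

    alignments: iterable of (chrom, start, strand) tuples, or None for
    a read that did not map (hypothetical input format).
    """
    # Unmapped reads are dropped; duplicates become a count at one locus.
    counts = Counter(a for a in alignments if a is not None)
    # Plain tuple sort: chromosome names sort as strings, so "chr10"
    # comes before "chr2". A real tool would use the reference order.
    return sorted((chrom, start, strand, n)
                  for (chrom, start, strand), n in counts.items())

reads = [
    ("chr1", 1500, "+"),
    ("chr1", 1500, "+"),  # duplicate of the read above
    None,                 # unmapped read
    ("chr2", 42, "-"),
    ("chr1", 900, "-"),
]
for row in collapse_read_starts(reads):
    print(*row, sep="\t")
```

Five reads go in and three rows come out, each a locus plus a count. The sequence and quality strings, which make up nearly all of a fastq file, are gone.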

RNA sequencing is becoming increasingly popular. Our clients are typically not interested in the sequence data itself, only in gene expression counts - essentially the same data as produced by a microarray. However, there are some cool new applications that look at alternative splicing, so we may have to keep the actual sequence reads on hand for a while longer.

Human (and mouse) SNP/indel/CNV detection is another popular NGS application. We are only really interested in the variants. However, SNP calling software requires both the number of reads with reference vs. variant bases and the quality score for each basecall. Some software also uses context-dependent quality metrics, such as distance from other SNPs, distance from indels, etc. Given the highly diverse collection of existing SNP detection software, and the likelihood of new software development, it seems impossible to compress this class of data to a set of variant calls and discard the raw reads. This is very unfortunate, since typical variant detection projects use anything from 20x to 50x coverage of the genome. So we are storing 150 GB of raw sequence data in order to track a few million bytes' worth of actual variation in the genome of each research sample.
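A toy example in Python shows why the per-base quality scores have to be kept. It is not the algorithm of any real SNP caller; the function name, the input format, and the threshold of 20 are all assumptions for illustration. It tallies reference vs. variant bases at one position and counts the basecalls that fall below the quality cutoff.

```python
def summarize_position(ref_base, pileup, min_qual=20):
    """Tally the basecalls covering a single genome position.

    pileup: list of (base, phred_quality) pairs, one per read
    (hypothetical input format). Calls below min_qual are not
    counted as either reference or variant.
    """
    ref = alt = low = 0
    for base, qual in pileup:
        if qual < min_qual:
            low += 1
        elif base == ref_base:
            ref += 1
        else:
            alt += 1
    return {"ref": ref, "alt": alt, "below_threshold": low}

pileup = [("A", 35), ("A", 30), ("G", 32), ("G", 18), ("G", 15), ("A", 40)]
print(summarize_position("A", pileup))               # default threshold of 20
print(summarize_position("A", pileup, min_qual=15))  # relaxed threshold
```

At the default cutoff only one read supports the variant G, against three for the reference. At the relaxed cutoff it is three against three. If only the final call had been stored, there would be no way to revisit that decision.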

Other applications, such as de novo genome sequencing of new organisms, or metagenomic sequencing of environmental or medical samples, will not be easily compressed. Fortunately, these data are currently archived in places other than the SRA.