The decreased cost of genome sequencing technologies has made genome sequencing a viable tool for clinical and population genomics applications. The efficiency of genome sequencing has been further improved through large projects like the Human Genome Project, which have assembled reference genomes for medically and agriculturally important organisms. These reference-quality assemblies have enabled the creation of genome resequencing pipelines, in which the genome of a single sample is determined by computing the differences between that sample and the reference genome for the organism. While sequencing cost has decreased by more than 10,000× since the Human Genome Project concluded in 2003, resequencing pipelines have struggled to keep pace with the growing volume of genomic data. These pipelines offer limited parallelism because they were not designed to use parallel or distributed computing techniques, and they rely on asymptotically inefficient algorithms. In this thesis, we introduce two tools, ADAM and avocado. ADAM provides an efficient framework for performing distributed genomic analyses, and avocado implements efficient local reassembly to discover genomic variants. ADAM presents high-level APIs that allow genomic analyses to be parallelized across more than 1,000 processors. Using these APIs, we achieve linear speedups when parallelizing several common analysis stages.



