Parallel Algorithms for De Novo Long Read Genome Assembly via Sparse Linear Algebra

Guidi, Giulia

Description

Significant advances in genome sequencing over the past decade have produced a flood of genomic data that poses enormous computational challenges and requires new bioinformatics approaches. As the cost of sequencing has decreased and genomics has become an increasingly important tool for health and the environment, genomic data has grown exponentially, often requiring parallel computing on high-performance computing (HPC) systems. However, genomic applications are often characterized by irregular and unstructured computation and data layout, making them a difficult target for distributed-memory parallelism.

In this dissertation, we show that it is possible to productively write highly parallel code for irregular genomic computation using the appropriate abstraction. Genomic algorithms are often based on graph analysis and processing. For individual graph algorithms, it has been previously shown that graphs can be viewed as sparse matrices and the computations become a series of matrix operations. Here, we take this idea to a new level by demonstrating its applicability and challenges for a data- and compute-intensive end-to-end application in genomics: de novo long-read genome assembly, in which an unknown genome is reconstructed from short, redundant, and erroneous DNA sequences. Our main contribution is the design and development of a set of scalable distributed and parallel algorithms for de novo long-read genome assembly that can run on hundreds of nodes of an HPC system, reducing the runtime for mammalian genomes from days on a single processor to less than 20 minutes on a supercomputer. Our algorithms are presented as the Extreme-Scale Long-Read Berkeley Assembler (ELBA) pipeline, which encompasses the major phases of the overlap-layout-consensus paradigm, the most popular paradigm for long-read sequencing data. In ELBA, we view assembly through the lens of sparse linear algebra, where the core data structure is a sparse matrix. This dissertation paves the way for a highly productive paradigm for writing massively parallel codes for irregular and unstructured real-world computation.
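To make the "graphs as sparse matrices" idea concrete, here is a minimal single-process sketch in plain Python. It is an illustration only and is not taken from ELBA. It assumes a reads-by-k-mers boolean matrix A, and computes the sparse product C = A·Aᵀ, whose nonzero entry C[i, j] counts the k-mers shared by reads i and j and can therefore be read as a candidate-overlap graph. The function names, the toy reads, and the choice k = 3 are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_read_by_kmer(reads, k):
    """Sparse boolean matrix A, stored row-wise.

    Row r is the set of k-mer "columns" that are nonzero for read r.
    """
    return [kmers(read, k) for read in reads]


def transpose(rows):
    """Column-wise view of A: each k-mer maps to the reads containing it."""
    cols = defaultdict(list)
    for r, row in enumerate(rows):
        for kmer in row:
            cols[kmer].append(r)  # read indices arrive in ascending order
    return cols


def candidate_overlaps(rows):
    """Upper triangle of C = A * A^T, accumulated column by column.

    Each column of A contributes an outer product with itself, so every
    pair of reads sharing that k-mer gets its count incremented by one.
    """
    counts = defaultdict(int)
    for readers in transpose(rows).values():
        for a in range(len(readers)):
            for b in range(a + 1, len(readers)):
                counts[(readers[a], readers[b])] += 1
    return dict(counts)


# Toy input (hypothetical): reads 0 and 1 share sequence, read 2 does not.
reads = ["ACGTAC", "GTACGG", "TTTTTT"]
C = candidate_overlaps(build_read_by_kmer(reads, 3))
print(C)
```

The nonzero pattern of C is the adjacency structure of a graph over reads, so the graph never has to be built explicitly. A distributed version would partition A and C across processes, which this sketch does not attempt.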

ELBA is built for HPC systems with high-speed networks and batch scheduling. However, we recognize that not every research community has access to government or institutional supercomputing facilities that have the necessary scale (e.g., hundreds of nodes) and hardware characteristics (e.g., a low-latency network) to realize the full potential of massively parallel algorithms such as those developed in this work. Thus, we believe that a long-term goal of HPC research is to democratize large-scale computing for science, not only through highly productive programming but also through widely accessible large-scale resources and systems. As a first step in demonstrating the applicability of the ideas presented in this dissertation to a cloud computing environment, we perform a benchmarking exercise to compare HPC and cloud systems. Our study shows that today's cloud systems can compete with traditional HPC systems, at least at moderate scales, due to significant advances in networking technologies.

Details

Title

Parallel Algorithms for De Novo Long Read Genome Assembly via Sparse Linear Algebra

Creator

Guidi, Giulia, Author

Published

EECS Department, University of California at Berkeley, Berkeley, California, 08/11/22

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2022-196

Type

Text

Format

technical reports

Extent

125 p

Language

eng

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
