Data-efficient De novo Genome Assembly Algorithm : Theory and Practice

Lam, Ka Kit

PDF

Description

We study data-efficient and also practical de-novo genome assembly algorithm. Due to the advancement in high-throughput sequencing, the cost of read data has gone down rapidly in recent years. Thus, there is a growing need to do genome assembly in a cost effective manner on a large scale. However, the state-of-the-art assemblers are not designed to be data-efficient. This leads to wastage of read data, which translates to a significant monetary loss for large scale sequencing projects. Moreover, as technology advances, the nature of the read data also evolves. This makes the old assemblers insufficient for discovering all the information behind the data. Therefore, there is a need to re-invent genome assemblers to suit the growing demand. In this dissertation, we address the data efficiency aspect of genome assembly as follows. First, we show that even when there is noise in the reads, one can successfully reconstruct the genomes with information requirements close to the noiseless fundamental limit. We develop a new assembly algorithm, X-phased Multibridging, is designed based on a probabilistic model of the genome. We show, through analysis that it performs well on the model, and through simulation that it performs well on real genomes. Second, we introduce FinisherSC, a repeat-aware and scalable tool for upgrading de-novo haploid genome assembly using long and noisy reads. Experiments with real data suggest that FinisherSC can provide longer and higher quality contigs than existing tools while maintaining high concordance. Thus, FinisherSC achieves higher data efficiency than state-of-the-art haploid genome assembly pipelines. Third, we study whether post-processing metagenomic assemblies with the original input long reads can result in quality improvement. Previous approaches have focused on pre-processing reads and optimizing assemblers. We introduce BIGMAC, which takes an alternative perspective to focus on the post-processing step. Using both the assembled contigs and original long reads as input, BIGMAC first breaks the contigs at potentially mis-assembled locations and subsequently scaffolds contigs. Our experiments on metagenomes assembled from long reads show that BIGMAC can improve assembly quality by reducing the number of mis-assemblies while maintaining/increasing N50 and N75. Thus, BIGMAC achieves higher data efficiency than state-of-the-art metagenomic assembly pipelines. Moreover, we also discuss some theoretical work regarding genome assembly in terms of data, time and space efficiency; and a promising post-processing tool POSTME in improving metagenomic assembly using both short and long reads.

Details

Title

Data-efficient De novo Genome Assembly Algorithm : Theory and Practice

Creator

Lam, Ka Kit, Author

Published

2016-05-09

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2016-43

Type

Text

Format

technical reports

Extent

155 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports

Files

Statistics

Download Full History

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Add to Basket