Algorithms for Next-Generation High-Throughput Sequencing Technologies

Kao, Wei-Chun; EECS Department, University of California

Description

Recent advances in DNA sequencing technologies allow researchers to generate immense amounts of data quickly and cost-effectively, enabling genome-wide association studies and population genetic research that could not have been done a decade ago. These advances, however, raise numerous computational challenges. Examples include efficient algorithms for processing the raw data generated by sequencing instruments, algorithms for detecting and correcting sequencing errors, and algorithms for detecting genome variations from sequence data, to name just a few. Because different sequencing technologies can have drastically different characteristics, these algorithms often need to be adapted to each technology in order to produce the most accurate results.

In this thesis, I will address a few of the aforementioned problems. First, I will describe two model-based basecalling algorithms for the Illumina sequencing platforms: BayesCall and naiveBayesCall. The novelty of the BayesCall algorithm is that it is fully unsupervised, requiring no training data with known labels, and is therefore applicable to data without a reference sequence. It also dramatically improves sequencing accuracy. Built upon the BayesCall algorithm, naiveBayesCall dramatically improves computational efficiency by approximating the original model without sacrificing accuracy. I will also show that improved basecalling can have positive effects on downstream sequence analysis, such as the detection of single nucleotide polymorphisms and the assembly of novel genomes.

In the third chapter, I will describe ECHO, an algorithm for correcting short-read sequencing errors. The algorithm efficiently computes all overlaps between sequencing reads and corrects errors using statistical models. Because it does not rely on a reference genome, ECHO can also be applied to de novo sequencing. Most other error correction algorithms require users to specify key parameters, but the optimal values of these parameters are unknown to users and can be hard to determine. When these parameters are not optimized, the effectiveness of an error correction algorithm can sometimes be dramatically reduced. ECHO instead uses its statistical models to optimize these parameters automatically. I will show that ECHO can significantly reduce sequencing error rates and also facilitate downstream sequence analysis. I will also demonstrate that ECHO can be extended to detect heterozygosity from sequencing data.
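
The general idea of reference-free, overlap-based correction can be illustrated with a toy sketch. This is not ECHO itself: the abstract says ECHO corrects errors with statistical models and tunes its own parameters, whereas the sketch below substitutes a simple majority vote and uses fixed, hypothetical parameters (`k`, `min_overlap`, `max_mismatch_rate`). All function names are illustrative assumptions, and reverse-complement reads and quality scores are ignored.

```python
from collections import Counter, defaultdict


def kmer_index(reads, k):
    """Map every k-mer to the (read id, position) pairs where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((rid, pos))
    return index


def candidate_overlaps(reads, k):
    """Return a set of (i, j, offset) triples: reads i and j share a k-mer,
    and read j would start at position `offset` of read i (may be negative)."""
    overlaps = set()
    for hits in kmer_index(reads, k).values():
        for i, pos_i in hits:
            for j, pos_j in hits:
                if i != j:
                    overlaps.add((i, j, pos_i - pos_j))
    return overlaps


def correct_read(i, reads, overlaps, min_overlap=6, max_mismatch_rate=0.2):
    """Correct read i by majority vote over reads that overlap it well.

    The thresholds are hypothetical fixed values chosen for the toy example;
    a model-based method would estimate such quantities from the data.
    """
    read = reads[i]
    # The read's own base is counted first, so ties keep the original base.
    votes = [Counter({base: 1}) for base in read]
    for a, j, offset in overlaps:
        if a != i:
            continue
        other = reads[j]
        start = max(0, offset)
        end = min(len(read), offset + len(other))
        length = end - start
        if length < min_overlap:
            continue
        mismatches = sum(read[p] != other[p - offset] for p in range(start, end))
        if mismatches > max_mismatch_rate * length:
            continue  # too dissimilar: probably not a true overlap
        for p in range(start, end):
            votes[p][other[p - offset]] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)
```

A call such as `correct_read(3, reads, candidate_overlaps(reads, 4))` returns a corrected copy of read 3. The fixed thresholds are exactly the kind of user-specified parameters the abstract describes as hard to choose, which is the motivation for estimating them from the data instead.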

These algorithms were developed in the hope of making downstream analysis of sequence data easier and ultimately facilitating genome research.

Details

Title

Algorithms for Next-Generation High-Throughput Sequencing Technologies

Creator

Kao, Wei-Chun, Author
EECS Department, University of California, Publisher

Published

2011-09-02

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2011-99

Type

Text

Format

technical reports

Extent

108 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
