Scalable Machine Learning Algorithms for Biological Sequence Data

Chan, Jeffrey

PDF

Description

Recent advances in sequencing and synthesis technologies have sparked extraordinary growth in large-scale biological experimentation and data collection. This explosive growth necessitates the development of scalable yet accurate methods to investigate increasingly complex biological questions. Machine learning has become a vital tool for addressing the needs of computational biology blending complex statistical models with efficient computation to uncover the underpinnings of biology.

In this dissertation, I develop three novel machine learning algorithms tailored towards biological sequence data to aid in answering such biological questions. The first method is a general-purpose statistical framework for inference of population genetic parameters. Previous methods focused on developing model approximation methods for a restricted class of models or reducing datasets to a set of hand-crafted summary statistics and comparing them against simulated data. Our framework uses a exchangeable neural network which respects the permutation-invariant symmetries of the data to learn the mapping from simulated datasets to the population genetic parameters of interest.

The second method extends the ideas from the first method to a more challenging setting where segmentation of the genotypes is necessary to determine tracts of archaic admixture. In this setting, the data are permutation-equivariant requiring a neural network architecture that results in accurate segmentation of archaic admixture tracts.

Finally, the third method focuses on the problem of search in protein engineering to discover high fitness protein sequences of interest. Standard bandit optimization methods often focus on experimental feedback that is purely sequential. In protein engineering, advances in high-throughput synthesis and experimentation can often lead to large batches of size as large as 10^5 where the size of the batch can often be much larger than the number of rounds of experimentation. We propose a family of parallel contextual linear bandit algorithms and analyze their regret bounds.

Details

Title

Scalable Machine Learning Algorithms for Biological Sequence Data

Creator

Chan, Jeffrey, Author

Published

2021-05-14

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Type

Text

Format

technical reports

Extent

102 p

Language

eng

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports

Files

Statistics

Download Full History

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Add to Basket