Description
In this dissertation, I develop three novel machine learning algorithms tailored towards biological sequence data to aid in answering such biological questions. The first method is a general-purpose statistical framework for inference of population genetic parameters. Previous methods focused on developing model approximation methods for a restricted class of models or reducing datasets to a set of hand-crafted summary statistics and comparing them against simulated data. Our framework uses a exchangeable neural network which respects the permutation-invariant symmetries of the data to learn the mapping from simulated datasets to the population genetic parameters of interest.
The second method extends the ideas from the first method to a more challenging setting where segmentation of the genotypes is necessary to determine tracts of archaic admixture. In this setting, the data are permutation-equivariant requiring a neural network architecture that results in accurate segmentation of archaic admixture tracts.
Finally, the third method focuses on the problem of search in protein engineering to discover high fitness protein sequences of interest. Standard bandit optimization methods often focus on experimental feedback that is purely sequential. In protein engineering, advances in high-throughput synthesis and experimentation can often lead to large batches of size as large as 10^5 where the size of the batch can often be much larger than the number of rounds of experimentation. We propose a family of parallel contextual linear bandit algorithms and analyze their regret bounds.