The rise of data center computing and Internet-connected devices has led to an unparalleled explosion in the volume of data collected across a multitude of industries and academic disciplines. This data serves as fuel for statistical machine learning techniques that in turn enable some of today's most advanced applications, including those powered by image classification, speech recognition, and natural language understanding; we broadly term these machine learning applications.
Unfortunately, the tools and techniques needed to apply recent advances in machine learning at the scales demanded by modern datasets, and thus to develop these applications, have until recently been available only to experts in fields such as distributed computing, statistics, and optimization.
I describe my efforts to make these tools accessible to a broader audience of application developers, and I demonstrate that, by taking a holistic approach and capturing end-to-end, high-level specifications of machine learning applications, the systems presented here can make novel, high-impact optimizations that decrease resource consumption while simultaneously increasing throughput. These improvements are designed to shorten development time, improve application quality, and increase developer productivity. I demonstrate the viability of these optimizations through experiments on a number of real-world applications in domains such as collaborative filtering, computer vision, and natural language processing.
Many of the ideas presented in this thesis have already had practical impact, as embodied in the open-source software packages KeystoneML and Apache Spark MLlib.
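As a rough illustration of what an end-to-end, high-level pipeline specification looks like in practice, the sketch below uses the Apache Spark MLlib Pipeline API (one of the packages named above) to chain featurization and model training into a single declarative specification. The specific stages, column names, and toy data are illustrative assumptions, not material taken from the thesis itself; KeystoneML exposes a similar chained-operator style in Scala.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object EndToEndPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("EndToEndPipelineSketch").getOrCreate()

    // Toy labeled text data, purely illustrative.
    val training = spark.createDataFrame(Seq(
      (0L, "large scale machine learning with pipelines", 1.0),
      (1L, "unrelated text with no relevant content", 0.0)
    )).toDF("id", "text", "label")

    // Each stage declares one step of the application; chaining them into a
    // Pipeline captures the end-to-end specification from raw text to a
    // trained classifier, which the system can then optimize and execute.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting the whole specification at once, rather than each stage in
    // isolation, is what gives the system room for end-to-end optimization.
    val model = pipeline.fit(training)
    model.transform(training).select("id", "prediction").show()

    spark.stop()
  }
}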
Title
End-to-End Large Scale Machine Learning with KeystoneML
Published
2016-12-15
Full Collection Name
Electrical Engineering & Computer Sciences Technical Reports
Other Identifiers
EECS-2016-200
Type
Text
Extent
143 p.
Archive
The Engineering Library