Description
We address two challenges in AutoML research: first, how to represent ML programs in a way suitable for metalearning; and second, how to improve the evaluation of AutoML systems so that approaches, and not just predictions, can be compared.
To this end, we have designed and implemented a framework for ML programs that provides all the components needed to describe ML programs in a standard way. The framework is extensible and its components are decoupled from each other; for example, it can be used to describe ML programs that use neural networks. We provide reference tooling for executing programs described in the framework. We have also designed and implemented a metalearning database, a service that stores information about executed ML programs generated by different AutoML systems.
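To give a sense of what a standard, declarative description of an ML program might look like, the sketch below shows one possible structure as plain data. The field and primitive names are hypothetical illustrations and do not reproduce the framework's actual schema.

```python
# A minimal sketch of a standard ML program (pipeline) description as plain data.
# All field and primitive names are hypothetical, not the framework's real schema.
import json

pipeline_description = {
    "id": "example-pipeline-0001",
    "inputs": [{"name": "dataset"}],
    "steps": [
        {
            "primitive": "example.preprocessing.Imputer",        # hypothetical primitive
            "hyperparams": {"strategy": "mean"},
            "arguments": {"inputs": "inputs.0"},
        },
        {
            "primitive": "example.classification.RandomForest",  # hypothetical primitive
            "hyperparams": {"n_estimators": 100},
            "arguments": {"inputs": "steps.0.produce"},
        },
    ],
    "outputs": [{"data": "steps.1.produce"}],
}

# Serializing the description makes it straightforward to store executed
# programs in a metalearning database and to compare pipelines produced
# by different AutoML systems.
print(json.dumps(pipeline_description, indent=2))
```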
We evaluate our framework by measuring the computational overhead of using it compared to executing ML programs that call underlying libraries directly. We observe that execution of ML programs through the framework is an order of magnitude slower, and that memory usage is twice that of equivalent ML programs which do not use the framework.
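The kind of overhead comparison described above can be sketched with standard-library instrumentation alone. In the sketch below, `run_direct` and `run_via_framework` are hypothetical placeholders standing in for the same ML program executed directly and through the framework; they are not the programs used in the actual evaluation.

```python
# Generic sketch of measuring runtime and peak-memory overhead.
# The two workload functions are placeholders, not real ML programs.
import time
import tracemalloc

def run_direct():
    # Stand-in for an ML program that calls underlying libraries directly.
    return sum(i * i for i in range(1_000_000))

def run_via_framework():
    # Stand-in for the same ML program executed through the framework.
    return sum(i * i for i in range(1_000_000))

def measure(workload):
    """Return (wall-clock seconds, peak memory in bytes) for one run."""
    tracemalloc.start()
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    _, peak_memory = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak_memory

direct_time, direct_mem = measure(run_direct)
framework_time, framework_mem = measure(run_via_framework)
print(f"time overhead: {framework_time / direct_time:.1f}x, "
      f"memory overhead: {framework_mem / direct_mem:.1f}x")
```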
We demonstrate our framework's ability to evaluate AutoML systems by comparing 10 different AutoML systems that use the framework. The results show that the framework can be used both to describe a diverse set of ML programs and to determine unambiguously which AutoML system produced the best ML programs. In many cases, the produced ML programs outperformed ML programs written by human experts.