Traditional resource management techniques that rely on simple heuristics often fail to achieve predictable performance in contemporary complex systems that span physical servers, virtual servers, private and/or public clouds. My research aims to bring the benefits of data-driven models to resource management of such complex systems. In my dissertation, I argue that the advancements in machine learning can be leveraged to manage and optimize today’s systems by deriving actionable insights from the performance and utilization data these systems generate. To realize this vision of model-based resource management, we need to deal with the key challenges data-driven models raise: uncertainty in predictions, cost of training, generalizability from benchmark datasets to real-world systems datasets, and interpretability of the models.
In this dissertation, to demonstrate how to handle these challenges, we chose two main problem domains: (I) Scheduling in parallel data intensive computational frameworks for improved tail latencies, and (II) Performance-aware resource allocation in the public cloud environments for meeting user-specified performance and cost goals.
We begin by presenting Wrangler, a system that predicts when stragglers (slow-running tasks) are going to occur based on cluster resource utilization counters and makes scheduling decisions to avoid such situations. Wrangler introduces a notion of a confidence measure with these predictions to overcome modeling uncertainty. We then describe our Multi-Task Learning formulations that share information between the various models, allowing us to significantly reduce the cost of training. To capture the challenges of resource allocation in the public cloud environments, we present key observations from our empirical analysis based on performance profiles of workloads executing across different public cloud environments. Finally, we describe PARIS, a Performance-Aware Resource Inference System, that we built to enable cloud users to select the best VM (virtual machine) for their applications in the public cloud environments so as to satisfy any performance and cost constraints.
Title
Machine Learning for Automatic Resource Management in the Datacenter and the Cloud
Published
2018-08-10
Full Collection Name
Electrical Engineering & Computer Sciences Technical Reports
Other Identifiers
EECS-2018-110
Type
Text
Extent
125 p
Archive
The Engineering Library
Usage Statement
Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).