Description
The advent of algorithms capable of leveraging vast quantities of data and computational resources has led to the proliferation of systems and tools aimed at facilitating the development and use of these algorithms. Hardware trends, including the end of Moore's Law and the maturation of cloud computing, have placed a premium on the development of scalable algorithms designed for parallel architectures. The combination of these factors has made distributed computing an integral part of machine learning in practice.
This thesis examines the design of systems and algorithms to support machine learning in the distributed setting. The distributed computing landscape today consists of many domain-specific tools. We argue that these tools underestimate the generality of many modern machine learning applications and consequently struggle to support them. We examine the requirements of a system capable of supporting modern machine learning workloads and present a general-purpose distributed system architecture that satisfies them. In addition, we examine several examples of specific distributed learning algorithms. We explore the theoretical properties of these algorithms and show how they can leverage such a system.