Recent years have seen many advances in both machine learning and high performance computing. Although computing power has steadily increased and become more widely available, many widely used machine learning techniques fail to take full advantage of the parallelism offered by large-scale computing clusters. Exploring techniques for scaling machine learning algorithms on distributed and high performance systems can help reduce training time and make machine learning research more accessible. To this end, this thesis investigates methods for scaling up deep learning using a variety of optimization techniques on distributed systems ranging from clusters of Intel Xeon Phi processors to Tensor Processing Unit (TPU) pods. Training machine learning models on such systems while fully utilizing their compute requires overcoming multiple challenges at both the algorithmic and the systems level. This thesis presents and evaluates scaling methods that address these challenges and, more broadly, help bridge the gap between high performance computing and machine learning.



