Description
The past ten years have seen tremendous growth in the volume of data in Deep Learning (DL) applications. As a result, the long training time of Deep Neural Networks (DNNs) has become a bottleneck for Machine Learning (ML) developers and researchers. For example, it takes 29 hours to finish 90-epoch ImageNet/ResNet-50 training on eight P100 GPUs, and 81 hours to finish BERT pre-training on 16 v3 TPU chips. This thesis focuses on fast and accurate ML training. Although production teams want to fully utilize supercomputers to speed up the training process, traditional optimizers fail to scale to thousands of processors. In this thesis, we design a series of fundamental optimization algorithms that extract more parallelism for DL systems. Our algorithms power state-of-the-art distributed systems at Google, Intel, Tencent, NVIDIA, and others. The focus of this thesis is bridging the gap between High Performance Computing (HPC) and ML.
There was a huge gap between HPC and ML in 2017. On the one hand, we had powerful supercomputers that could execute 2x10^17 floating point operations per second. On the other hand, we could not make use of even 1% of this computational power to train a state-of-the-art machine learning model. The reason is that supercomputers need extremely high parallelism to reach their peak performance, but such high parallelism leads to poor convergence for ML optimizers. To solve this problem, my co-authors and I proposed the LARS optimizer, the LAMB optimizer, and the CA-SVM framework. These new methods enable ML training to scale to thousands of processors without losing accuracy. Over the past three years, the training time of ResNet-50 dropped from 29 hours to 67.1 seconds. In fact, all state-of-the-art ImageNet training speed records since December 2017 have been made possible by LARS, and LARS became an industry metric in MLPerf v0.6. Moreover, our approach is faster than existing solvers even without supercomputers. If we fix the training budget (e.g., 1 hour on 1 GPU), our optimizer can achieve higher accuracy than state-of-the-art baselines.
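To illustrate the core idea behind layer-wise adaptive scaling, the sketch below shows a LARS-style update step in NumPy: each layer receives a local learning rate proportional to the ratio of its weight norm to its gradient norm, which keeps the update magnitude stable at very large batch sizes. The function name, hyperparameter values, and the momentum formulation here are illustrative assumptions for exposition, not the exact implementation described in this thesis.

    import numpy as np

    def lars_step(weights, grads, velocities, global_lr=0.9,
                  trust_coef=0.001, weight_decay=5e-4, momentum=0.9):
        """One LARS-style update (illustrative sketch).

        Each layer gets a local learning rate scaled by the ratio of its
        weight norm to its gradient norm (the layer-wise trust ratio).
        """
        for i, (w, g) in enumerate(zip(weights, grads)):
            w_norm = np.linalg.norm(w)
            g_norm = np.linalg.norm(g)
            # Layer-wise trust ratio; guard against zero norms.
            if w_norm > 0 and g_norm > 0:
                local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
            else:
                local_lr = 1.0
            # Momentum update applied to the locally scaled gradient.
            update = local_lr * (g + weight_decay * w)
            velocities[i] = momentum * velocities[i] + global_lr * update
            weights[i] = w - velocities[i]
        return weights, velocities

Because the trust ratio is computed per layer, layers whose gradients are disproportionately large relative to their weights are automatically damped, which is what allows the batch size (and hence the number of processors) to grow without destabilizing convergence.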