Description
This project studies the impact of communication overhead on the throughput and resource utilization of large-scale machine learning models. Recent scale-out frameworks such as ZionEX and ZeRO-Infinity have shown how strongly interconnect bandwidth affects computation efficiency; in this project, we measured the impact of communication overhead and interconnect bandwidth on the GShard Mixture-of-Experts architecture. We measured and analyzed training performance on Google Cloud Platform using v3 TPUs and its profiling tool, TensorBoard. The results showed that the communication share of the training process grows as the model size increases and as the model is scaled out. Given the trend of increasing model size to improve accuracy, interconnect bandwidth must therefore scale with model size to maintain computation efficiency.
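As a concrete illustration of this kind of measurement setup, the sketch below shows how a TPU trace can be captured from a running training job with TensorFlow's profiler API and then inspected in TensorBoard. This is a minimal sketch, not the project's actual script; the TPU worker address and GCS bucket are hypothetical placeholders.

```python
# Minimal sketch of capturing a profile from a running Cloud TPU training
# job using TensorFlow's profiler client (TF 2.x). The worker address and
# GCS bucket below are hypothetical placeholders.
import tensorflow as tf

# While training is in progress, request a trace from the TPU worker's
# profiler service (port 8466 is the Cloud TPU default) and write it to
# a GCS bucket that TensorBoard can later read.
tf.profiler.experimental.client.trace(
    service_addr="grpc://10.0.0.2:8466",  # hypothetical TPU worker address
    logdir="gs://my-bucket/profiles",     # hypothetical output location
    duration_ms=2000,                     # capture a 2-second window
)
```

Pointing TensorBoard at the same directory (`tensorboard --logdir gs://my-bucket/profiles`) opens the Profile tab, whose trace viewer and op breakdown separate communication operations (e.g., all-reduce, all-to-all) from compute; a breakdown of this kind is what lets the communication share of a training step be read off as model size and scale grow.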