Description

This project studies the impact of communication overhead in large-scale machine learning models on throughput and resource utilization. Recent machine learning scale-out frameworks such as ZionEX and ZeRO-Infinity have shown the importance of interconnect bandwidth to computation efficiency; in this project, we measure the impact of communication overhead and interconnect bandwidth on the GShard Mixture-of-Experts architecture. We measured and analyzed training performance on Google Cloud Platform v3 TPUs using its profiling tool, TensorBoard. The results show that the communication portion of the training process grows as the model size increases and the model is scaled out across more devices. Given the trend of increasing model size in machine learning to improve accuracy, interconnect bandwidth must therefore scale with model size to maintain computation efficiency.
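The trend described above can be illustrated with a back-of-envelope cost model: a GShard-style MoE layer dispatches tokens over an all-to-all, runs an expert feed-forward network, then combines results over a second all-to-all. The sketch below is a minimal analytical model with entirely hypothetical constants (link bandwidth, peak FLOP rate, per-peer latency, layer dimensions) — it is not based on the project's TPU measurements, and the function and parameter names are our own. It shows qualitatively why the communication fraction rises as the model scales out.

```python
# Back-of-envelope model of the communication fraction of an MoE layer.
# All constants are illustrative placeholders, not measured TPU numbers.

def comm_fraction(num_devices, d_model=1024, d_ff=4096,
                  tokens_per_device=2048, bytes_per_elem=2,
                  link_bw=1e11, peak_flops=1e14, latency=1e-6):
    """Estimated fraction of step time spent in all-to-all communication.

    Assumes each device sends (N-1)/N of its token activations in each
    all-to-all and pays a per-peer latency term that grows with N.
    """
    # Bytes each device contributes to one all-to-all.
    volume = tokens_per_device * d_model * bytes_per_elem
    frac_remote = (num_devices - 1) / num_devices
    # Two all-to-alls per MoE layer: dispatch and combine.
    t_comm = 2 * (frac_remote * volume / link_bw + (num_devices - 1) * latency)
    # Two matmuls of the expert FFN: 2 * d_model * d_ff MACs per token.
    flops = 2 * 2 * tokens_per_device * d_model * d_ff
    t_compute = flops / peak_flops
    return t_comm / (t_comm + t_compute)

# Per-device compute is fixed here, yet the communication share still grows
# with scale-out, matching the trend reported above.
for n in (4, 16, 64, 256):
    print(f"{n:4d} devices: comm fraction ~ {comm_fraction(n):.3f}")
```

Under these assumptions the compute time per device stays constant while the all-to-all term grows with the device count, so the communication fraction climbs with scale-out unless interconnect bandwidth grows alongside the model.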
