Description

Stochastic gradient descent (SGD) is a widely used method for finding locally optimal models in machine learning and data mining. However, it is inherently a sequential algorithm, and parallelizing it involves severe compromises because the cost of synchronizing across a cluster is much larger than the time required to compute an optimally sized gradient step. Here we explore butterfly mixing, in which gradient steps are interleaved with the k stages of a butterfly network on 2^k nodes. UDP-based butterfly mix steps should be extremely fast and failure-tolerant, and convergence is almost as fast as a full mix (AllReduce) on every step.
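The sketch below is a minimal single-process simulation of the idea described above, not the authors' implementation: each SGD step is followed by one stage of a butterfly exchange (partner chosen by XOR of the node index with a power of two), so after k steps every node's update has propagated to every other node. The names grad_fn, models, lr, and num_steps are illustrative assumptions, and the UDP transport is not modeled.

import numpy as np

def butterfly_mix_sgd(models, grad_fn, lr, num_steps):
    """Simulate butterfly-mixed SGD on 2^k nodes (one process, no networking).

    models:   list of 2^k parameter vectors (numpy arrays), one per node
    grad_fn:  grad_fn(node, params) -> stochastic gradient for that node's shard
    """
    n = len(models)
    k = n.bit_length() - 1              # number of butterfly stages; n == 2^k
    assert 1 << k == n, "node count must be a power of two"
    for step in range(num_steps):
        # Local SGD step on every node.
        for i in range(n):
            models[i] = models[i] - lr * grad_fn(i, models[i])
        # One butterfly stage per gradient step, cycling through the k stages.
        stage = step % k
        for i in range(n):
            partner = i ^ (1 << stage)  # XOR gives the butterfly partner
            if i < partner:             # mix each pair once per stage
                avg = 0.5 * (models[i] + models[partner])
                models[i] = avg
                models[partner] = avg.copy()
    return models

Because only one pairwise exchange happens per step, each mix costs a single message round rather than the log-depth AllReduce, which is what keeps the per-step synchronization overhead small.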
