Description

Research in AI is dominated by model experimentation, and training a single model can be extremely expensive. Yet there is no efficient way of recovering information if something goes wrong or the model behaves differently than expected. Flor is a system that provides a record-replay approach to ML training, giving developers the flexibility to retrieve the data they want after execution. It takes intermittent checkpoints during model training to speed up and parallelize replay. However, record-replay requires serializing expensive checkpoints during program execution, which incurs high overhead and makes the approach less palatable. Flor achieves fast, low-overhead materialization with its multiprocessing materializer, which exploits parallelism to offload the burden of serialization to other processes. The multiprocessing materializer uses forking both to spawn processes and to provide one-way interprocess communication (IPC), allowing Flor to quickly share and serialize expensive checkpoints. We also show that our multiprocessing materializer outperforms other popular multiprocessing and IPC methods. Notably, it achieves checkpointing at only 1.74% additional runtime cost.
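
To make the fork-based idea above concrete, here is a minimal Python sketch, assuming a pickle-serializable checkpoint object; materialize_async and the checkpoint names are hypothetical illustrations, not Flor's actual API. After os.fork(), the child holds a copy-on-write snapshot of the parent's memory, so training can resume immediately while the child performs the slow serialization; the fork itself serves as the one-way IPC channel.

    import os
    import pickle

    def materialize_async(obj, path):
        # Fork a child that inherits a copy-on-write snapshot of the
        # parent's address space; the snapshot is the one-way IPC.
        pid = os.fork()
        if pid == 0:
            # Child: do the expensive serialization off the critical
            # path, then exit without running parent-side cleanup.
            try:
                with open(path, "wb") as f:
                    pickle.dump(obj, f)
            finally:
                os._exit(0)
        # Parent: returns immediately and resumes training; the child
        # can be reaped later with os.waitpid(pid, 0).
        return pid

In a training loop, one might call pid = materialize_async(model_state, "ckpt_%d.pkl" % epoch) at each checkpoint and reap the children at the end of the run; model_state here stands in for whatever checkpoint object the application maintains.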
