ML model development has become increasingly challenging as models and their workloads grow in scale and complexity, rendering standard logging and debugging practices insufficient. Because exhaustive logging during model training is impractical and complete re-execution of training for debugging is infeasible, we turn to hindsight logging. Hindsight logging is an efficient post-hoc logging method that periodically collects checkpoints during model training and performs parallelized checkpoint resume to recover execution context during debugging. With hindsight logging, model developers can defer logging overhead and postpone deciding which execution data they need for analysis until model training completes.
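The checkpoint-and-replay idea can be illustrated with a minimal sketch. This is a hypothetical toy example, not the FlorET API: training saves periodic checkpoints with no logging, and a later debugging pass resumes from each checkpoint in parallel to recover values that were never logged.

```python
# Hypothetical sketch of hindsight logging (not the FlorET API).
from concurrent.futures import ThreadPoolExecutor

def train_epoch(state):
    # Toy stand-in for one epoch of training.
    return state + 1

def train(epochs, ckpt_every):
    # Training loop: checkpoint periodically, log nothing.
    state, ckpts = 0, {}
    for e in range(epochs):
        if e % ckpt_every == 0:
            ckpts[e] = state          # cheap periodic checkpoint
        state = train_epoch(state)
    return ckpts

def replay_segment(args):
    # Resume from one checkpoint and re-execute its segment,
    # this time with the log statement the developer added post hoc.
    start, state, length = args
    logs = []
    for e in range(start, start + length):
        state = train_epoch(state)
        logs.append((e, state))       # hindsight log statement
    return logs

def hindsight_replay(ckpts, ckpt_every):
    # Each checkpoint seeds an independent segment, so segments
    # can be replayed in parallel rather than end to end.
    segments = [(e, s, ckpt_every) for e, s in sorted(ckpts.items())]
    with ThreadPoolExecutor() as pool:
        results = pool.map(replay_segment, segments)
    return [entry for seg in results for entry in seg]

ckpts = train(epochs=8, ckpt_every=2)
logs = hindsight_replay(ckpts, ckpt_every=2)
```

Because each segment depends only on its own checkpoint, replay time is bounded by the longest segment rather than the full training run, which is what makes deferring the logging decision affordable.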

However, model developers and ML practitioners face a continually evolving landscape of datasets, processing pipelines, analysis frameworks, and models to explore and investigate. In this paper, we introduce FlorET (Fast Low-Overhead Recovery Extending Time), a system that empowers model developers to perform hands-free analysis and debugging across multiple model training sessions. The key features enabling this advance are (i) automatic version control, (ii) robust log statement propagation, and (iii) parallelized inter-version replay. We evaluate the efficacy of log statement propagation and code alignment techniques for temporal hindsight logging, and then analyze system overhead.



