Description
We present ADDA, a replay-debugging system for datacenter applications. We observe that these applications often consist of a separate "control plane" and "data plane," and that the applications' initial inputs are typically persisted in append-only storage for reasons unrelated to debugging. Building upon these observations, ADDA does three things: it leverages the control/data plane separation to make recording of debug-critical data scalable even in large clusters; it deterministically re-synthesizes intermediate data from the (already available) initial inputs; and it performs reduced-scale replay, i.e., it recreates failed executions on just a subset of the original cluster.
We show that ADDA scales well and deterministically replays real-world failures in Hypertable and Memcached. We also argue that ADDA's techniques generalize to a broader set of datacenter applications.