Debugging data-intensive distributed applications running in a datacenter ("datacenter applications") is complex and time-consuming. Developers wish they had a way to deterministically replay failed executions with little human effort, but unfortunately no such tool exists today. We see two challenges in replay-based debugging: First, the clusters used to run datacenter applications consist of many nodes, so the nondeterminism resulting from multithreaded execution on a single node is compounded by the size of the cluster. Second, datacenter applications produce terabytes of intermediate data shipped from one node to the next; the total data volume, itself proportional to cluster size, makes full input recording for potential subsequent replay infeasible.

We present ADDA, a replay-debugging system for datacenter applications. We observe that these applications often consist of a separate "control plane" and "data plane," and that the applications' initial inputs are typically persisted in append-only storage for reasons unrelated to debugging. Building upon these observations, ADDA leverages the control/data plane separation to make recording of debug-critical data scalable even in large clusters, deterministically re-synthesizes intermediate data based on the (already available) initial inputs, and performs reduced-scale replay, i.e., recreates failed executions on just a subset of the original cluster.

We show that ADDA scales well and deterministically replays real-world failures in Hypertable and Memcached. We also argue that ADDA's techniques generalize to a broader set of datacenter applications.



