Debugging is hard, but debugging production datacenter applications such as Cassandra, Hadoop, and Hypertable is downright daunting. The key obstacle is non-deterministic failures --- hard-to-reproduce program misbehaviors that are immune to traditional cyclicdebugging techniques. Datacenter applications are rife with such failures because they operate in highly nondeterministic environments: a typical setup employs thousands of nodes, spread across multiple datacenters, to process terabytes of data per day. In these environments, existing methods for debugging non-deterministic failures are of limited use. They either incur excessive production overheads or don't scale to multi-node, terabyte-scale processing.

To help remedy the situation, we have built a new replay debugging tool. Our tool, called DCR, enables the reproduction and debugging of non-deterministic failures in production datacenter runs. The key observation behind DCR is that debugging does not always require a precise replica of the original datacenter run. Instead, it often suffices to produce some run that exhibits the original behaviors of the control-plane --- the most errorprone component of datacenter applications. DCR leverages this observation to relax the determinism guarantees offered by the system, and consequently, to address all key requirements of production datacenter applications: lightweight recording of long-running programs, causally consistent replay of large scale systems, and out of the box operation on real-world applications.




Download Full History