Today we face distributed systems of unprecedented scale, built from hundreds of thousands of heterogeneous computers. These range from large datacenter installations to swarms of mobile devices embedded in and interacting with the physical world. They are pushing the limits of what problems we can solve, be it organizing all the world's information, making sense of the genome, or understanding the world's climate. As our dependency on such systems increases, so does the importance of their availability, reliability, and efficiency, along with the cost and impact of their failures. The problem, however, is that our ability to build and program these systems is progressing faster than our ability to understand how they work and, especially, how they fail.

In this dissertation we argue that distributed systems should treat traceability, the ability to follow the execution of defined tasks or activities across the different components involved, as a first-class concept. Traceability creates the basis upon which to build tools that give visibility into a system's execution and help us understand its behavior, performance, and failures.

To demonstrate that this is feasible and useful, we designed and implemented two instrumentation frameworks targeted at two widely different points in the distributed systems space. The first framework, X-Trace, tracks execution and records the causal relationships among arbitrary programmer-defined events in large-scale, loosely coupled distributed systems. X-Trace is general and lightweight, and is designed to span different layers, components, machines, and administrative domains. We instrumented several protocols and applications with X-Trace, including two systems deployed in the wide area, the Coral CDN and the OASIS Anycast service, and used the instrumentation to find and fix several performance and correctness bugs. The second framework, Quanto, applies to wireless embedded sensor networks and tracks the execution of programmer-specified activities to understand energy and resource consumption. Quanto provides network-wide visibility into the resource consumption of related events on extremely resource-constrained devices. We show how we instrumented TinyOS with Quanto and how we use it to build a complete map of where and why energy is spent, both by a single node and across the network.



