We are faced today with distributed systems of unprecedented scale, built from hundreds of thousands of heterogeneous computers. These range from large datacenter installations to swarms of mobile devices embedded in and interacting with the physical world. They are pushing the limits of what problems we can solve, be it organizing all the world's information or making sense of the genome or the world's climate. As our dependency on such systems increases, so does the importance of their availability, reliability, and efficiency, as well as the costs and impacts of their failures. The problem, however, is that our ability to build and program these systems is progressing faster than our ability to understand how they work and, especially, how they fail.
In this dissertation we argue that distributed systems should treat traceability, the ability to follow the execution of defined tasks or activities across the different components involved, as a first-class concept. Traceability creates the basis upon which to build tools that give visibility into a system's execution and help explain its behavior, performance, and failures.
To demonstrate that this is feasible and useful, we designed and implemented two instrumentation frameworks targeted at two widely different points in the distributed systems space. The first framework, X-Trace, tracks execution and records the causal relationships among arbitrary programmer-defined events in large-scale, loosely coupled distributed systems. X-Trace is general, lightweight, and designed to span different layers, components, machines, and administrative domains. We instrumented several protocols and applications with X-Trace, including two systems deployed in the wide area, the Coral CDN and the OASIS Anycast service, and used the instrumentation to find and solve several performance and correctness bugs. The second framework, Quanto, applies to wireless embedded sensor networks and tracks the execution of programmer-specified activities to understand energy and resource consumption. Quanto provides network-wide visibility into resource consumption for related events in extremely resource-constrained devices. We show how we instrumented TinyOS with Quanto and how the instrumentation can be used to build a complete map of where and why energy is spent, both by a single node and across the network.
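The idea of following a task across components and recording causal relationships among events can be illustrated with a minimal sketch. This is not X-Trace's actual API or wire format; the names `TraceContext`, `TraceLog`, `record`, and `causal_path` are hypothetical, and the sketch assumes that each request carries a task identifier plus the identifier of the event that caused it, and that every component reports its events to a single in-memory log.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    """Hypothetical metadata propagated along with a request."""
    task_id: str
    parent_event: Optional[int] = None


@dataclass
class TraceLog:
    """Hypothetical collector: stores events and parent -> child edges."""
    events: dict = field(default_factory=dict)  # event_id -> (task_id, component, label)
    edges: list = field(default_factory=list)   # (parent_event_id, child_event_id)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def record(self, ctx, component, label):
        """Log one event and return the context to pass downstream."""
        event_id = next(self._ids)
        self.events[event_id] = (ctx.task_id, component, label)
        if ctx.parent_event is not None:
            self.edges.append((ctx.parent_event, event_id))
        return TraceContext(ctx.task_id, event_id)


def causal_path(log, event_id):
    """Walk parent edges back to the root; return component names in causal order."""
    parent_of = {child: parent for parent, child in log.edges}
    path = [event_id]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return [log.events[e][1] for e in reversed(path)]


# One task fans out from a proxy to two servers.
log = TraceLog()
root = TraceContext(task_id="task-1")
at_client = log.record(root, "client", "send request")
at_proxy = log.record(at_client, "proxy", "forward")
at_a = log.record(at_proxy, "server-a", "handle")
at_b = log.record(at_proxy, "server-b", "handle")

print(log.edges)
print(causal_path(log, at_b.parent_event))
```

Because each component only needs the context it received in order to report a new event, the causal graph can be reconstructed offline from the edges alone, without the components knowing about each other.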
Title: Improving Visibility of Distributed Systems through Execution Tracing
Published: 2008-12-18
Full Collection Name: Electrical Engineering & Computer Sciences Technical Reports
Other Identifiers: EECS-2008-167
Type: Text
Extent: 243 p
Archive: The Engineering Library