In this dissertation we argue that distributed systems should treat traceability, the ability to follow the execution of defined tasks or activities across the different components involved, as a first-class concept. Traceability provides the basis upon which to build tools that give visibility into a system's execution and help explain its behavior, performance, and failures.
To demonstrate that this is feasible and useful, we designed and implemented two instrumentation frameworks targeted at two widely different points in the distributed systems space. The first framework, X-Trace, tracks execution and records the causal relationships among arbitrary programmer-defined events in large-scale, loosely coupled distributed systems. X-Trace is general and lightweight, and it is designed to span different layers, components, machines, and administrative domains. We instrumented several protocols and applications with X-Trace, including two systems deployed in the wide area, the Coral CDN and the OASIS Anycast service, and used the instrumentation to find and fix several performance and correctness bugs. The second framework, Quanto, applies to wireless embedded sensor networks and tracks the execution of programmer-specified activities to understand energy and resource consumption. Quanto provides network-wide visibility into the resource consumption of related events on extremely resource-constrained devices. We show how we instrumented TinyOS with Quanto and how the instrumentation can be used to build a complete map of where and why energy is spent, both within a node and across the network.