Description
To improve visibility into distributed systems, we have developed an integrated tracing framework called X-Trace. A user or operator invokes X-Trace when initiating an application task (e.g., a web request) by inserting X-Trace metadata, which carries a task identifier, into the resulting request. This metadata is then propagated down to lower layers through protocol interfaces (which may need to be modified to carry X-Trace metadata) and, by modified software stacks, along all recursive requests that result from the original task. The X-Trace infrastructure uses this metadata to build a task graph, which represents a trace of the execution of the distributed application. Using these recovered task graphs, we have identified correctness and performance bugs in a wide variety of distributed applications, from web and overlay applications to the Hadoop Map/Reduce system. In this work, we present the design and implementation of X-Trace, show its application to the 802.1X network authentication protocol, and then present an API and software tool for manipulating large-scale network traces in a scalable manner.
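The propagation and task-graph idea described above can be illustrated with a small sketch. This is not the X-Trace API: the names `Metadata`, `start_task`, `propagate`, `build_task_graph`, and the in-memory `reports` list are all hypothetical, and the real system carries metadata inside protocol messages rather than in Python objects. The sketch only assumes that each operation reports its task identifier together with the operation that caused it.

```python
import itertools
from collections import defaultdict
from dataclasses import dataclass

_op_ids = itertools.count(1)  # hypothetical source of unique operation ids


@dataclass(frozen=True)
class Metadata:
    task_id: str  # shared by every operation belonging to the same task
    op_id: int    # identifies the operation that emitted this metadata


# Stand-in for a report collector: (task_id, parent_op_id, op_id, label).
reports = []


def start_task(task_id, label):
    """Insert metadata where the task is initiated (e.g., a web request)."""
    md = Metadata(task_id, next(_op_ids))
    reports.append((task_id, None, md.op_id, label))
    return md


def propagate(md, label):
    """Carry the metadata into a lower layer or a recursive request."""
    child = Metadata(md.task_id, next(_op_ids))
    reports.append((md.task_id, md.op_id, child.op_id, label))
    return child


def build_task_graph(task_id):
    """Rebuild parent -> children edges for one task from collected reports."""
    mine = [r for r in reports if r[0] == task_id]
    labels = {op: label for _, _, op, label in mine}
    edges = defaultdict(list)
    for _, parent, _, label in mine:
        if parent is not None:
            edges[labels[parent]].append(label)
    return dict(edges)


# Usage: one web request that touches a lower layer and a recursive request.
root = start_task("task-1", "HTTP client")
tcp = propagate(root, "TCP")                 # down to a lower layer
proxy = propagate(root, "HTTP proxy")        # next hop at the same layer
server = propagate(proxy, "HTTP server")     # recursive request
graph = build_task_graph("task-1")
# graph == {"HTTP client": ["TCP", "HTTP proxy"], "HTTP proxy": ["HTTP server"]}
```

Because every report carries the same task identifier, reports gathered from different layers and machines can be joined into one graph without the components knowing about each other.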