Researchers in networks and computer systems have developed exciting new distributed applications in recent years; however, adoption of real-world prototypes has been slow. The development of stable, usable services has been hindered by the tremendous effort required to debug distributed applications that are deployed across the Internet. We believe that more powerful debugging tools are needed to address this problem. This dissertation presents the progress we have made on this front, in the form of two new tools, Liblog and Friday.

The first, Liblog, is a replay debugging library for libc- and POSIX-based distributed applications. It logs the execution of deployed application processes and replays them deterministically, faithfully reproducing race conditions and non-deterministic failures so that they can be analyzed carefully offline.

To our knowledge, Liblog is the first replay tool to address the requirements of large distributed systems: lightweight support for long-running programs, consistent replay of arbitrary subsets of application nodes, and operation in a mixed environment of logging and non-logging processes. In addition, it runs on generic Linux/x86 computers without special hardware or kernel patches and supports unmodified application executables.

The second tool, Friday, combines the deterministic replay provided by Liblog with the power of symbolic, low-level debugging and a simple language for expressing higher-level distributed conditions and actions. Friday lets the programmer examine the collective state and dynamics of a distributed collection of coordinated application components as part of the debugging process.

This dissertation presents the design of Liblog and Friday, an evaluation of the performance overhead that they impose at runtime, and a set of case studies that illustrate the new functionality enabled for real distributed applications.




