Description
The first tool, Liblog, is a replay debugging library for libc- and POSIX-based distributed applications. It logs the execution of deployed application processes and replays them deterministically, faithfully reproducing race conditions and non-deterministic failures so that they can be analyzed carefully offline.
To our knowledge, Liblog is the first replay tool to address the requirements of large distributed systems: lightweight support for long-running programs, consistent replay of arbitrary subsets of application nodes, and operation in a mixed environment of logging and non-logging processes. In addition, it runs on generic Linux/x86 computers without special hardware or kernel patches and supports unmodified application executables.
The second tool, Friday, combines the deterministic replay provided by Liblog with the power of symbolic, low-level debugging and a simple language for expressing higher-level distributed conditions and actions. Friday lets the programmer examine the collective state and dynamics of a distributed collection of coordinated application components as part of the debugging process.
This dissertation presents the design of Liblog and Friday, an evaluation of the performance overhead that they impose at runtime, and a set of case studies that illustrate the new functionality enabled for real distributed applications.