This thesis presents fast crash recovery: a simple, efficient, and inexpensive method for increasing availability in distributed systems. In fast crash recovery, we assume that critical resources will fail, and we do not attempt to mask the failures with redundant hardware or software. Instead, we design the system to recover so quickly that there is little downtime. This approach is intended for environments that can tolerate occasional failures but cannot afford the cost and overhead of redundant resources.

In particular, I focus on fast recovery of distributed state. An example of distributed state is the file caching information maintained by servers in most modern file systems. This information describes the state of file caches on client workstations. After a crash, a server must recover this information in order to guarantee the consistency of the caches. Unfortunately, distributed state recovery can be slow and complex. The techniques I have developed reduce state recovery time from several minutes to under six seconds for a Sprite file server with 40 clients.

This thesis evaluates three distributed state recovery techniques based on their speed, complexity, and performance overhead. In client-driven recovery, clients send their state information to the server after a crash. The server uses this information to regenerate its copy of the distributed state. Server-driven recovery is a modification of client-driven recovery that is faster and eliminates cache inconsistencies that can arise during client-driven recovery. The fastest technique is transparent recovery, so called because client workstations do not communicate with the server during recovery. Instead, the server stores its distributed state in a protected area of its own main memory called the recovery box. The interface to the recovery box helps detect and prevent corruption of this state information.
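
As a rough illustration of the recovery box idea, the following Python sketch stores each piece of state together with a checksum and refuses to return an entry whose checksum no longer matches. Everything here is hypothetical: the names `RecoveryBox`, `insert`, `fetch`, and `recover_state`, the use of CRC-32, and the dictionary standing in for protected main memory are assumptions made for illustration, not the interface or mechanism used in Sprite.

```python
import zlib
from typing import Dict, Iterable, List, Optional, Tuple


class RecoveryBox:
    """Toy stand-in for a protected memory area that survives a reboot.

    Hypothetical interface: each entry is kept with a CRC-32 checksum so
    that damaged entries can be detected when they are read back.
    """

    def __init__(self) -> None:
        # key -> (data, checksum); a dict stands in for protected memory.
        self._slots: Dict[str, Tuple[bytes, int]] = {}

    def insert(self, key: str, data: bytes) -> None:
        """Store a copy of the data along with its checksum."""
        copy = bytes(data)
        self._slots[key] = (copy, zlib.crc32(copy))

    def fetch(self, key: str) -> Optional[bytes]:
        """Return the stored data, or None if it is missing or corrupted."""
        entry = self._slots.get(key)
        if entry is None:
            return None
        data, checksum = entry
        if zlib.crc32(data) != checksum:
            return None  # checksum mismatch: treat the entry as unusable
        return data


def recover_state(
    box: RecoveryBox, keys: Iterable[str]
) -> Tuple[Dict[str, bytes], List[str]]:
    """Split keys into those recovered intact and those that failed the check."""
    recovered: Dict[str, bytes] = {}
    damaged: List[str] = []
    for key in keys:
        data = box.fetch(key)
        if data is None:
            damaged.append(key)
        else:
            recovered[key] = data
    return recovered, damaged
```

In this sketch, a server restarting after a crash would call `recover_state` on the entries it expects to find and would have to rebuild anything reported as damaged by some other means. A checksum only detects corruption; preventing it in the first place would need additional protection that this toy version does not model.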

To achieve fast overall recovery times, we must also recover other parts of the system quickly. For example, we can eliminate a lengthy file system consistency check by using a log-structured file system that recovers in seconds. By combining the improvements described in this thesis, a Sprite file server can reboot in under 30 seconds. This is two orders of magnitude faster than recovery in most modern file systems.

In addition to evaluating distributed state recovery techniques, this thesis presents some overall guidelines for designing distributed systems that will recover quickly from crashes.



