This dissertation presents a new software architecture for fault-tolerant distributed programs. This new architecture allows replication to be added transparently and flexibly to existing programs. Tuning the availability of a replicated program becomes a programming-in-the-large problem that a programmer need address only after the individual modules have been written and verified.

The increasing reliance that people place on computer systems makes it essential that those systems remain available. The low cost of computer hardware and the high cost of computer software make replicated distributed programs an attractive solution to the problem of providing fault-tolerant operation.

A troupe is a set of replicas of a module, executing on machines that have independent failure modes. Troupes are the building blocks of replicated distributed programs and the key to achieving high availability. Individual members of a troupe do not communicate among themselves, and are unaware of one another's existence; this property is what distinguishes troupes from other software architectures for fault tolerance.

Replicated procedure call is introduced to handle the many-to-many pattern of communication between troupes. Replicated procedure call is an elegant and powerful way of expressing many distributed algorithms. The semantics of replicated procedure call can be summarized as exactly-once execution at all replicas.

An implementation of troupes and replicated procedure call is described. Experiments were conducted to measure the performance of this implementation; an analysis of the results of these experiments is presented.

The problem of concurrency control for troupes is examined, and algorithms for replicated atomic transactions are presented as a solution. Binding and reconfiguration mechanisms for replicated distributed programs are described, and the problem of when to replace failed troupe members is analyzed.

Several issues relating to programming languages and environments for reliable distributed applications are discussed. Integration of the replication mechanisms into current programming languages is accomplished by means of stub compilers. Four stub compilers are examined, and some lessons learned form them are presented. A language for specifying troupe configurations is described, and the design of a configuration manager, a programming-in-the-large tool for configuring replicated distributed programs, is presented.




Download Full History