Description
A number of methods have already been developed to perform fault recovery in distributed systems: recovery lines, recoverable transactions, and shadow processes. In order to effect time-bounded recovery, each of these methods requires interaction with the user application. This interaction may sometimes fit naturally into the application program. However, in many instances, the lack of transparency of the recovery system may significantly restrict the application programmer's style. Also, existing programs need to be rewritten to make use of these methods.
Making recovery transparent to the program being recovered is, in the most general case, a difficult and perhaps unsolvable problem. However, by considering only message-based systems, the problem can be greatly simplified. Message-based systems, especially those connected by low-cost broadcast media, represent the most common type of distributed system. We have developed a new communications model for such systems called published communications. In this model, a passive recorder reliably stores all messages broadcast onto the network. Coupled with the idea of deterministic programs, published communications allows the transparent recovery of processes in a distributed system.
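The recovery idea described above can be illustrated with a minimal sketch. It is not the DEMOS/MP implementation: the names Recorder, Network, Accumulator, recover, and demo are hypothetical, the "network" is an in-memory object, and the recorder is assumed never to miss a message. The sketch only shows why a deterministic process can be rebuilt by replaying the messages the recorder has stored for it.

```python
from collections import defaultdict


class Recorder:
    """Hypothetical passive recorder: stores every message it observes."""

    def __init__(self):
        # destination process id -> messages in the order they were sent
        self.log = defaultdict(list)

    def observe(self, dest, message):
        self.log[dest].append(message)


class Network:
    """Hypothetical broadcast medium: the recorder sees every message."""

    def __init__(self, recorder):
        self.recorder = recorder
        self.processes = {}

    def attach(self, pid, process):
        self.processes[pid] = process

    def send(self, dest, message):
        # The message is recorded ("published") before it is delivered.
        self.recorder.observe(dest, message)
        process = self.processes.get(dest)
        if process is not None:
            process.receive(message)


class Accumulator:
    """A deterministic process: its state depends only on the messages
    it has received and the order in which they arrived."""

    def __init__(self):
        self.total = 0

    def receive(self, message):
        # Order-sensitive update, so replay order matters.
        self.total = self.total * 2 + message


def recover(pid, recorder, factory):
    """Rebuild a crashed process by replaying its recorded messages
    into a fresh instance created by factory."""
    process = factory()
    for message in recorder.log[pid]:
        process.receive(message)
    return process


def demo():
    recorder = Recorder()
    net = Network(recorder)
    net.attach("p1", Accumulator())
    for message in (3, 1, 4):
        net.send("p1", message)
    before = net.processes["p1"].total

    del net.processes["p1"]  # simulate a crash: all process state is lost
    restored = recover("p1", recorder, Accumulator)
    net.attach("p1", restored)
    return before, restored.total
```

Calling demo() returns the state before the crash and the state after recovery; the two are equal because Accumulator is deterministic and the replay preserves message order. The process itself contains no recovery code, which is the sense in which recovery is transparent in this sketch.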
In order to evaluate the consistency of the model with message-based systems, an initial implementation of published communications has been added to an existing message-based system, DEMOS/MP. It was not necessary to change any programs already running on the system.
The performance of published communications was determined both by evaluating a queuing model of the system under different loads and by measuring the DEMOS/MP implementation. The simulation shows that a recorder, constructed from current technology, can support a system of up to 115 users. The measurements show that the steady-state cost of publishing messages is low.