Publishing is a model and mechanism for crash recovery in a distributed computing environment. Published communication works for systems connected via a broadcast medium by recording messages transmitted over the network. The recovery mechanism can be completely transparent to the failed process and all processes interacting with it. Although published communication is intended for a broadcast network such as a bus, a ring, or an Ethernet, it can be used in other environments.

A recorder reliably stores all messages that are transmitted, as well as checkpoint and recovery information. When it detects a failure, the recorder may restart affected processes from checkpoints. The recorder subsequently resends to each restarted process all messages that were sent to it since its checkpoint was taken, while ignoring the duplicate messages that the process sends as it re-executes.
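The recorder's behavior described above can be illustrated with a minimal sketch. This is not the DEMOS/MP implementation: the names `Message`, `Recorder`, `observe`, `checkpoint`, and `recover` are hypothetical, and the use of per-sender sequence numbers to recognize duplicates is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    seq: int       # per-sender sequence number (assumed, for duplicate detection)
    payload: str


class Recorder:
    """Hypothetical recorder: logs every message and replays after a crash."""

    def __init__(self):
        self.log = []          # every accepted message, in arrival order
        self.checkpoints = {}  # pid -> (saved state, log position at checkpoint)
        self.last_seq = {}     # pid -> highest sequence number seen from that sender

    def observe(self, msg):
        """Record a message seen on the network; return False for a duplicate."""
        if msg.seq <= self.last_seq.get(msg.sender, 0):
            return False       # already published, e.g. resent during re-execution
        self.last_seq[msg.sender] = msg.seq
        self.log.append(msg)
        return True

    def checkpoint(self, pid, state):
        """Store a checkpoint together with the current position in the log."""
        self.checkpoints[pid] = (state, len(self.log))

    def recover(self, pid):
        """Return the checkpointed state and the messages to resend to pid."""
        state, position = self.checkpoints[pid]
        replay = [m for m in self.log[position:] if m.receiver == pid]
        return state, replay


recorder = Recorder()
recorder.checkpoint("B", {"count": 0})
recorder.observe(Message("A", "B", 1, "x"))
recorder.observe(Message("B", "A", 1, "ack x"))
recorder.observe(Message("A", "B", 2, "y"))

# B crashes: restart it from its checkpoint and resend what it received since.
state, replay = recorder.recover("B")
```

In this sketch, when the restarted process B re-executes and sends "ack x" again, `observe` returns `False` and the message is not logged a second time, which is how the recovery stays invisible to the other processes.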

Message-based systems without shared memory can use published communication to recover groups of processes. Simulations show that a single recorder can support at least 5 multi-user minicomputers on a standard Ethernet. The prototype implemented in DEMOS/MP demonstrates that error recovery can be transparent to user processes and can be centralized in the network.



