In this report, we extend the concept of system-wide undo from self-contained services to collections of distributed, interacting services, thereby providing an undo-based recovery mechanism to the operators and administrators of distributed services. The extended undo mechanism targets human operator error and other state-affecting problems such as software bugs, misconfigurations, and external attacks, and provides retroactive repair of past problems. We achieve the distributed extension through the concept of spheres of undo: a sphere of undo surrounds each component of a distributed service and provides a structuring mechanism that helps identify when undoing one component can affect others. We propose two approaches for composing interacting spheres of undo. The first assumes coordination and cascades undo-based recovery from the first undone sphere to all others affected. The second assumes independence and handles interacting spheres by compensating for previous communications that become invalid after undo is invoked. We present criteria for deciding which approach is most appropriate, and apply them in an example to choose the undo approach for a distributed e-shopping service. Finally, we describe initial thoughts on how the compositions might be implemented and propose algorithms for interacting-sphere undo.




