Checkpointing and Error Recovery in Distributed Systems

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Publication details

Title of host publication: Proc. 2nd Int. Conf. on Distributed Computing Systems
Date: Published - 1981
Pages: 271-282
Number of pages: 12
Original language: English

Abstract

This paper discusses some of the problems of producing fault-tolerant distributed computer systems, in particular those of software error recovery. It shows how checkpoints may be used in error recovery, defines the information that checkpoints must contain, and discusses alternative strategies for checkpointing. It describes models of error recovery and extends an existing recovery protocol to cater for certain types of checkpoint inconsistencies. The paper defines protocols for systematically generating checkpoints so that they can be used by the recovery protocols. It also defines a protocol for discarding checkpoints when they are no longer 'of use', which prevents the set of checkpoints from growing indefinitely. The paper concludes by considering some of the problems of implementing the protocols.