Computer Science 441

Principles of Distributed Systems

Gregory M. Kapfhammer


Flickr photo shared by Patrick Brosset under a Creative Commons (BY-NC) license

Color Scheme

Key Concept

Corresponding Diagram

In-Class Discussion

In-Class Activity

Details in the Textbook

Fault tolerance in distributed systems

Dependable Systems

Availability: ready to be used immediately

Reliability: run continuously without failure

Safety: nothing catastrophic happens on failure

Maintainability: system can be easily repaired on failure

How are these terms related to each other?

Fault Tolerance

Failure: system cannot meet its promises

Error: incorrect system state that may lead to failure

Fault: the cause of an error

PIE Model: how a fault manifests itself as a failure

Fault tolerance: provide services in the presence of faults

How can systems provide fault tolerance?

Examples of fault-tolerant behavior?

Fault Classification

Transient: occurs once and then disappears

Intermittent: occurs, vanishes, and then reappears

Permanent: continues to exist until replacement

Which kinds of faults most frequently occur in software?
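
Because a transient fault disappears on its own, simply repeating the operation can often mask it. The following is a minimal sketch of that idea, not a definitive implementation. The names `TransientError`, `call_with_retry`, and `make_flaky` are hypothetical and exist only for illustration, and the simulated operation stands in for any call that may fail transiently.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for a fault that may vanish on retry."""


def call_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Call operation() up to `attempts` times, retrying on TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError as error:
            last_error = error
            if attempt < attempts - 1:
                # Wait longer before each new attempt (exponential backoff).
                time.sleep(delay_seconds * (2 ** attempt))
    # Every attempt failed, so the caller sees the fault as persistent.
    raise last_error


def make_flaky(failures_before_success):
    """Build a simulated operation that fails a fixed number of times."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated transient fault")
        return "ok"

    return operation
```

For example, `call_with_retry(make_flaky(2), attempts=3)` returns `"ok"` on the third attempt. Retrying does not help with a permanent fault: the same call with `make_flaky(5)` exhausts its attempts and raises `TransientError`.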

Failure Classification

Crash failure

Omission failure

Timing failure

Response failure

Arbitrary failure

See Figure 8-1 for more details!

Strategies for providing fault tolerance?

Fault masking by redundancy

Triple modular redundancy

Process resilience

Failure detection by pinging

Reliable unicasting

Reliable multicasting

Transaction processing

Recovery-oriented computing

Checkpointing

Backward recovery

Forward recovery
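
Checkpointing and backward recovery, listed above, can be sketched as saving a copy of the state before each update and restoring that copy when an update fails. This is a minimal in-memory sketch under stated assumptions: the class `CheckpointedCounter` and the function `run` are hypothetical, a negative amount simulates a fault, and a real system would write checkpoints to stable storage.

```python
import copy


class CheckpointedCounter:
    """Hypothetical in-memory state with checkpoint and rollback."""

    def __init__(self):
        self.state = {"total": 0, "applied": []}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a deep copy so later updates cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: return to the last state known to be correct.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, amount):
        # The update has two steps; a fault between them leaves an error
        # in the state (an amount is recorded but never added).
        self.state["applied"].append(amount)
        if amount < 0:
            raise ValueError("simulated fault in the middle of an update")
        self.state["total"] += amount


def run(amounts):
    """Apply each amount, rolling back to the checkpoint on failure."""
    counter = CheckpointedCounter()
    for amount in amounts:
        counter.checkpoint()
        try:
            counter.apply(amount)
        except ValueError:
            counter.rollback()
    return counter.state
```

Calling `run([5, -1, 7])` returns `{"total": 12, "applied": [5, 7]}`: the half-finished update for `-1` is undone. Forward recovery would instead try to move the system to a new correct state rather than return to an old one, and is not shown here.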

What are the trade-offs?
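
One way to make the trade-offs concrete is to look at triple modular redundancy, the fault-masking strategy listed above. The sketch below runs three replicas on the same input and returns the majority output, so a single wrong response is masked. The function names and the simulated faulty replica are hypothetical; the sketch assumes replica outputs are hashable and does not handle a replica that crashes.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority, or raise."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def tmr(replicas, argument):
    """Run three replicas on the same input and vote on their outputs."""
    if len(replicas) != 3:
        raise ValueError("triple modular redundancy uses three replicas")
    return majority_vote([replica(argument) for replica in replicas])


def square(x):
    return x * x


def faulty_square(x):
    # Hypothetical replica whose fault yields a wrong response.
    return x * x + 1
```

Here `tmr([square, faulty_square, square], 4)` returns `16`, because two correct replicas outvote the faulty one. The cost is visible in the code: every request does three times the work, and the voter itself becomes a component that can fail.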