13.4 Redundancy and Fault-Tolerant Architectures

Written by the Fiveable Content Team • Last updated September 2025
Redundancy and fault-tolerant architectures are crucial for keeping computer systems running when components fail or misbehave. They add extra hardware, information, time, and software so errors can be caught and corrected before they escalate into system failures.

These techniques act as built-in backup plans. They help systems detect faults, isolate them, and recover quickly. With careful design, engineers can build systems that keep working even when individual parts fail, ensuring reliability in critical applications.

Redundancy Types in Fault Tolerance

Hardware Redundancy Techniques

  • Hardware redundancy involves replicating critical components or subsystems to ensure continued operation in case of failures
  • Common hardware redundancy techniques include dual modular redundancy (DMR), triple modular redundancy (TMR), and N-modular redundancy (NMR)
    • DMR uses two identical components and compares their outputs to detect faults
    • TMR employs three identical components and uses majority voting to determine the correct output (a minimal voter sketch follows this list)
    • NMR extends the concept to N components, providing higher levels of fault tolerance
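
To make majority voting concrete, here is a minimal Python sketch of a TMR-style voter. The module outputs are hypothetical values; a real design would implement the voter in hardware and treat the no-majority case as an unmaskable fault.

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority-vote over the outputs of three redundant modules.

    Returns the value produced by at least two modules; raises if all
    three disagree (an unmaskable fault in plain TMR).
    """
    assert len(outputs) == 3, "TMR uses exactly three modules"
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three modules disagree")
    return value

# Example: module B suffers a transient fault, but the voter masks it.
a, b, c = 42, 17, 42
print(tmr_vote([a, b, c]))  # -> 42
```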

Information and Time Redundancy Techniques

  • Information redundancy adds extra bits or data to the original information to detect and correct errors
    • Examples include parity bits, error-correcting codes (ECC), and cyclic redundancy checks (CRC)
    • Parity bits detect single-bit errors by adding an extra bit that forces an even (or odd) number of 1s (see the sketch after this list)
    • ECC adds redundant check bits computed from the data; common memory ECC corrects single-bit errors and detects double-bit errors (SECDED), while stronger codes can correct multiple-bit errors
  • Time redundancy repeats computations or operations multiple times to detect and mitigate transient faults
    • Techniques include re-execution, checkpointing, and rollback recovery
    • Re-execution repeats the computation and compares the results to detect transient faults
    • Checkpointing periodically saves the system state to enable recovery from faults
    • Rollback recovery restores the system to a previous checkpoint when a fault is detected
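
The sketch below illustrates two of the ideas above under simplified assumptions: an even-parity bit for single-bit error detection (information redundancy) and a re-execution helper that runs a computation twice and compares the results (time redundancy). Names and values are illustrative only.

```python
def add_even_parity(data_bits):
    """Append one parity bit so the codeword has an even number of 1s."""
    return data_bits + [sum(data_bits) % 2]

def check_even_parity(codeword):
    """True if no single-bit error is detected (even number of 1s)."""
    return sum(codeword) % 2 == 0

def reexecute(computation, *args):
    """Time redundancy: run the computation twice and flag a mismatch."""
    first, second = computation(*args), computation(*args)
    if first != second:
        raise RuntimeError("results disagree: transient fault suspected")
    return first

word = add_even_parity([1, 0, 1, 1])   # -> [1, 0, 1, 1, 1]
print(check_even_parity(word))         # True: no error detected
word[2] ^= 1                           # simulate a single-bit upset
print(check_even_parity(word))         # False: error detected

print(reexecute(sum, [1, 2, 3]))       # 6 (both runs agree)
```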

Software Redundancy Approaches

  • Software redundancy employs multiple instances of software components or diverse implementations to detect and recover from software faults
  • Approaches include N-version programming, recovery blocks, and self-checking software
    • N-version programming uses independently developed software versions and compares their outputs
    • Recovery blocks execute alternate software versions when an acceptance test fails (this pattern is sketched after the list)
    • Self-checking software incorporates error detection and recovery mechanisms within the software itself
  • Software redundancy techniques aim to mitigate the impact of software bugs, design flaws, and other software-related faults
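
As an illustration of the recovery-block pattern, the sketch below tries each variant in order until one passes the acceptance test. The square-root routines and the tolerance are hypothetical stand-ins for independently developed implementations.

```python
import math

def run_with_recovery_blocks(variants, acceptance_test, *args):
    """Recovery-block pattern: try each variant in order until one result
    passes the acceptance test; an exception counts as a failed variant."""
    for variant in variants:
        try:
            result = variant(*args)
        except Exception:
            continue
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all variants failed the acceptance test")

# Hypothetical primary and alternate implementations of square root.
def primary_sqrt(x):
    return x ** 0.5

def alternate_sqrt(x):
    return math.exp(0.5 * math.log(x)) if x > 0 else 0.0

def close_enough(result, x, tol=1e-9):
    # Acceptance test: the result squared should reproduce the input.
    return abs(result * result - x) < tol

print(run_with_recovery_blocks([primary_sqrt, alternate_sqrt], close_enough, 2.0))
```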

Fault-Tolerant Architecture Principles

Fault Detection and Isolation Mechanisms

  • Fault-tolerant architectures aim to maintain system functionality and prevent failures in the presence of faults
  • Fault detection mechanisms identify the occurrence of faults in the system
    • Techniques include error detection codes, watchdog timers, and built-in self-tests (BIST)
    • Error detection codes (e.g., parity, ECC) detect data corruption during storage or transmission
    • Watchdog timers monitor the system's behavior and trigger an alarm if expected actions do not occur within a specified time (a watchdog sketch follows this list)
  • Fault isolation techniques prevent the propagation of faults to other parts of the system
    • Approaches include circuit-level isolation, module-level isolation, and system-level partitioning
    • Circuit-level isolation uses physical barriers or electrical isolation to contain faults within a specific circuit
    • Module-level isolation employs well-defined interfaces and error containment boundaries to prevent fault propagation between modules
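
A software watchdog can be sketched in a few lines: if the monitored task stops "kicking" the timer within the expected interval, a fault handler fires. This thread-based version is a simplified illustration; hardware watchdogs behave analogously but typically reset the processor directly.

```python
import threading
import time

class Watchdog:
    """Software watchdog: if the monitored task does not kick the timer
    within `timeout` seconds, the fault handler is invoked."""
    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None

    def kick(self):
        # Restart the countdown; called periodically by healthy code.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()

# Hypothetical usage: the handler would normally reset or isolate the faulty unit.
wd = Watchdog(timeout=0.5, on_expire=lambda: print("watchdog expired: recovery triggered"))
wd.kick()
time.sleep(0.2); wd.kick()   # task is healthy and keeps kicking
time.sleep(1.0)              # task hangs; the watchdog fires
wd.stop()
```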

Fault Recovery and Masking Techniques

  • Fault recovery mechanisms restore the system to a correct state after a fault occurs
    • Techniques include checkpointing, rollback recovery, and forward error correction
    • Checkpointing periodically saves the system state to enable recovery from faults
    • Rollback recovery restores the system to a previous checkpoint when a fault is detected (both ideas are sketched after this list)
    • Forward error correction uses redundant information to correct errors without requiring retransmission or rollback
  • Fault masking techniques hide the effects of faults from the system's outputs, ensuring uninterrupted operation
    • Examples include majority voting, redundant data storage, and error-correcting memory
    • Majority voting compares the outputs of redundant components and selects the majority result
    • Redundant data storage maintains multiple copies of data to ensure availability and integrity
    • Error-correcting memory automatically corrects bit errors in memory using ECC techniques
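
The checkpoint/rollback idea can be illustrated with a minimal sketch that snapshots a state dictionary and restores it when a fault is detected; the state and the injected fault are purely illustrative.

```python
import copy

class CheckpointedSystem:
    """Minimal checkpoint/rollback sketch: state is saved periodically and
    restored when a fault is detected."""
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a consistent snapshot of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the last known-good snapshot.
        self.state = copy.deepcopy(self._checkpoint)

system = CheckpointedSystem({"counter": 0})
system.state["counter"] = 10
system.checkpoint()               # save a known-good state
system.state["counter"] = -999    # a fault corrupts the state
system.rollback()                 # recover to the last checkpoint
print(system.state)               # {'counter': 10}
```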

Redundancy Effectiveness for Reliability

Reliability Metrics and Evaluation Tools

  • Reliability metrics, such as mean time between failures (MTBF), mean time to repair (MTTR), and availability, are used to assess the effectiveness of redundancy techniques in improving system reliability
    • MTBF represents the average time between failures in a system
    • MTTR indicates the average time required to repair a failed component or system
    • Availability is the proportion of time a system is operational, commonly computed as MTBF / (MTBF + MTTR) (see the sketch after this list)
  • Reliability block diagrams (RBDs) and Markov models are analytical tools used to evaluate the reliability of fault-tolerant systems with different redundancy configurations
    • RBDs represent the system as a series of blocks, each representing a component or subsystem, and analyze the overall system reliability based on the reliability of individual blocks
    • Markov models use state transitions to represent the system's behavior and calculate reliability metrics based on the probabilities of moving between states
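
A short sketch of how these metrics and basic RBD structures are typically computed; the MTBF and MTTR figures and the block reliabilities are hypothetical.

```python
from math import prod

def availability(mtbf, mttr):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def series_reliability(block_reliabilities):
    """RBD series structure: every block must work."""
    return prod(block_reliabilities)

def parallel_reliability(block_reliabilities):
    """RBD parallel (redundant) structure: at least one block must work."""
    return 1 - prod(1 - r for r in block_reliabilities)

print(f"{availability(10_000, 2):.4f}")             # 0.9998 (hypothetical MTBF/MTTR in hours)
print(f"{series_reliability([0.99, 0.99]):.4f}")    # 0.9801: series lowers reliability
print(f"{parallel_reliability([0.99, 0.99]):.4f}")  # 0.9999: redundancy raises it
```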

Factors Affecting Redundancy Effectiveness

  • The effectiveness of hardware redundancy techniques depends on factors such as the level of redundancy (e.g., DMR, TMR), the reliability of individual components, and the voting or comparison mechanisms employed
    • Higher levels of redundancy (e.g., TMR vs. DMR) provide better fault tolerance but increase cost and complexity (the sketch after this list compares TMR with a single module)
    • The reliability of individual components directly impacts the overall system reliability
    • Voting or comparison mechanisms must be reliable and correctly identify and handle faults
  • Information redundancy techniques' effectiveness is determined by the error detection and correction capabilities of the chosen codes (e.g., Hamming codes, Reed-Solomon codes) and the overhead introduced by the additional bits
    • More powerful error-correcting codes can handle a greater number of errors but may introduce more overhead
    • The trade-off between error correction capability and overhead must be considered based on the system's requirements
  • Time redundancy techniques' effectiveness depends on the number of repetitions, the detection and recovery mechanisms employed, and the trade-off between fault coverage and performance overhead
    • More repetitions increase fault coverage but may impact system performance
    • Detection and recovery mechanisms must be reliable and efficiently handle faults
  • Software redundancy techniques' effectiveness is influenced by the diversity of implementations, the error detection and recovery mechanisms, and the coordination among software versions
    • Greater diversity among software versions reduces the likelihood of common mode failures
    • Robust error detection and recovery mechanisms are essential for effective software redundancy
    • Coordination mechanisms must ensure consistent and correct behavior across software versions
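
To see the redundancy-level trade-off numerically, the sketch below compares a single module with TMR under an assumed perfect voter, using R_TMR = 3r^2 - 2r^3 (the probability that at least two of three modules work).

```python
def simplex_reliability(r):
    """A single, non-redundant module."""
    return r

def tmr_reliability(r):
    """TMR with an assumed perfect voter: the system works when at least
    2 of 3 modules work, so R_TMR = 3r^2 - 2r^3."""
    return 3 * r**2 - 2 * r**3

for r in (0.99, 0.9, 0.5, 0.4):
    print(f"r={r}: simplex={simplex_reliability(r):.4f}  TMR={tmr_reliability(r):.4f}")
# TMR wins only when module reliability exceeds 0.5; below that,
# the extra hardware actually makes things worse.
```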

Designing Fault-Tolerant Systems

Identifying Critical Components and Selecting Redundancy Techniques

  • Identifying the critical components and subsystems that require fault tolerance based on the system's reliability requirements and failure modes and effects analysis (FMEA)
    • FMEA systematically analyzes potential failure modes, their effects, and their criticality to prioritize fault tolerance efforts
    • Reliability requirements, such as target MTBF or availability, guide the selection of critical components for redundancy
  • Selecting appropriate hardware redundancy techniques (e.g., DMR, TMR) for critical components, considering factors such as reliability, cost, and power consumption
    • The chosen redundancy technique should provide the required level of fault tolerance while balancing cost and power constraints
    • Reliability analysis and trade-off studies help determine the most suitable redundancy technique for each critical component
  • Incorporating information redundancy techniques (e.g., ECC, CRC) for data storage, transmission, and processing to detect and correct errors
    • ECC is commonly used in memory systems to protect against bit errors
    • CRC is widely employed in data transmission to detect errors in the received data (a CRC framing sketch follows this list)
  • Applying time redundancy techniques (e.g., re-execution, checkpointing) for critical computations or operations to detect and recover from transient faults
    • Re-execution can be used for critical computations where the results can be quickly verified
    • Checkpointing is useful for long-running or complex operations to minimize the amount of lost work in case of a fault
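
As a small illustration of CRC-protected transmission, the sketch below appends a CRC-32 (via Python's standard zlib module) to a hypothetical frame and verifies it on receipt.

```python
import zlib

def append_crc(payload: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload before transmission."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify_crc(frame: bytes) -> bool:
    """Recompute the CRC over the received payload and compare."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received_crc

frame = append_crc(b"sensor reading: 42")
print(verify_crc(frame))                        # True: frame arrived intact

corrupted = bytearray(frame); corrupted[3] ^= 0x01
print(verify_crc(bytes(corrupted)))             # False: corruption detected
```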

Implementing Fault Detection, Recovery, and Masking Mechanisms

  • Employing software redundancy techniques (e.g., N-version programming, recovery blocks) for critical software components to improve fault tolerance
    • N-version programming is suitable for software components with well-defined inputs and outputs
    • Recovery blocks are useful for software components with clear acceptance criteria for the results
  • Designing fault detection and isolation mechanisms (e.g., watchdog timers, BIST) to identify and contain faults within specific components or subsystems
    • Watchdog timers can detect software or hardware faults that cause the system to become unresponsive
    • BIST mechanisms enable self-testing of components to identify faults during system startup or periodic checks (a simple memory self-test sketch follows this list)
  • Implementing fault recovery mechanisms (e.g., checkpointing, rollback recovery) to restore the system to a correct state after a fault occurs
    • Checkpointing saves the system state at regular intervals to enable recovery from faults
    • Rollback recovery uses the saved checkpoints to restore the system to a known good state
  • Incorporating fault masking techniques (e.g., majority voting, error-correcting memory) to maintain uninterrupted system operation in the presence of faults
    • Majority voting can be used in systems with redundant components to determine the correct output
    • Error-correcting memory automatically corrects bit errors, preventing them from affecting the system's operation
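
Finally, a simplified pattern-based memory self-test in the spirit of BIST: known patterns are written to each address and read back, and mismatches are reported. The toy memory model with a stuck-at-0 bit is purely illustrative.

```python
def memory_bist(write, read, addresses, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Pattern-based self-test sketch: write known patterns to each
    address, read them back, and report any mismatches."""
    failures = []
    for pattern in patterns:
        for addr in addresses:
            write(addr, pattern)
        for addr in addresses:
            if read(addr) != pattern:
                failures.append((addr, pattern))
    return failures

# Hypothetical memory model with one stuck-at-0 bit at address 2.
mem = {}
def write(addr, value): mem[addr] = value & (0xFE if addr == 2 else 0xFF)
def read(addr): return mem[addr]

print(memory_bist(write, read, range(4)))
# [(2, 255), (2, 85)] -- address 2 fails the patterns that set bit 0
```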