NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.

Error Detection, Correction, and Fault Tolerance

This is a new page and I'll be adding to it from around klabs.org.
Help is appreciated and feel free to send me links, both internal and external,
of things to add to this page. Thanks! -- rk

Common Mode Failures in Safety- and Mission-Critical Digital Electronics

Presented at the 2006 PMA 201 Fuze IPT System Safety Working Group

March 7-9, 2006
Oxnard, California
Abstract
A review of different classes of common mode failures in safety- and mission-critical digital electronics will be presented, discussed, and analyzed. Classes of these failures to be discussed range from IC design failures, manufacturing faults, testing errors, application design errors (internal and external to the device), handling errors, computer aided engineering (CAE) software faults, documentation errors, and environmental factors. The talk will use a series of case studies of actual faults to demonstrate each class of failure.

How Much Redundancy? - Some Cost Considerations, Including Examples for Spacecraft Systems

NASA Technical Memorandum 103197
Ronald C. Suich¹ and Richard L. Patterson²
¹California State University
²Lewis Research Center
AIChE Summer National Meeting Session on Space Power Systems Technology San Diego, California, August 19-22, 1990
Abstract
   How much redundancy should be built into a subsystem such as a space power subsystem? How does a reliability or design engineer choose between a power subsystem with .990 reliability and a more costly subsystem with .995 reliability? How does the engineer designing a power subsystem for a satellite decide between one power subsystem and a more reliable but heavier power subsystem?
   High reliability is not necessarily an end in itself. High reliability may be desirable in order to reduce the statistically expected loss due to a subsystem failure. However, this may not be the wisest use of funds since the expected loss due to subsystem failure is not the only cost involved. The subsystem itself may be very costly. We cannot consider either the cost of the subsystem or the expected loss due to subsystem failure separately. We therefore minimize the total of the two costs, i.e., the total of the cost of the subsystem plus the expected loss due to subsystem failure.
   We consider a specific type of redundant system, called a k-out-of-n: G subsystem. Such a subsystem has n modules, of which k are required to be good for the subsystem to be good. We discuss five models which can be applied in the design of a power subsystem to select the unique redundancy method which will minimize the total of the cost of the power subsystem plus the expected loss due to the power subsystem failure.
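
The trade-off described in this abstract can be illustrated with a minimal sketch. It is not one of the five models in the memorandum, which the abstract does not spell out. It assumes independent, identical modules that are each good with probability p, a subsystem cost that grows linearly with the number of modules, and a fixed loss if the subsystem fails. The names `module_cost`, `failure_loss`, and `best_n` are hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n modules are good.

    Assumes modules fail independently and each is good with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def total_cost(k, n, p, module_cost, failure_loss):
    """Subsystem cost plus expected loss due to subsystem failure.

    Assumption: cost is linear in n and the loss is a single fixed amount.
    """
    expected_loss = (1 - k_out_of_n_reliability(k, n, p)) * failure_loss
    return n * module_cost + expected_loss


def best_n(k, p, module_cost, failure_loss, n_max):
    """Return the n in [k, n_max] that minimizes the total cost."""
    return min(
        range(k, n_max + 1),
        key=lambda n: total_cost(k, n, p, module_cost, failure_loss),
    )


if __name__ == "__main__":
    # Hypothetical numbers: 2 modules required, each 0.9 reliable,
    # one cost unit per module, 100 cost units lost on subsystem failure.
    for n in range(2, 7):
        print(n, round(total_cost(2, n, 0.9, 1.0, 100.0), 3))
    print("best n:", best_n(2, 0.9, 1.0, 100.0, 6))
```

With these hypothetical numbers the minimum falls at n = 4: adding a fifth module costs more than the expected loss it removes.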

Survivable Algorithms and Redundancy Management in NASA's Distributed Computing Systems

NASA Grant NAG9-426 for the period of May 1, 1990 - April 30, 1992

Dr. Miroslaw Malek
The University of Texas at Austin
Introduction (excerpt)
The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. In this report, we outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design.

Redundancy Management for Efficient Fault Recovery in NASA's Distributed Computing System

NASA Grant NAG9-351
February 15, 1991
Miroslaw Malek, Mihir Pandya, and Kitty Yau
The University of Texas at Austin
Introduction (excerpt)
The proliferation of increasingly powerful and complex multiprocessor systems has made fault-tolerant design a necessity. Optimizing fault tolerance in multiprocessor systems is a very difficult task because it involves multi-dimensional tradeoffs. The system architecture, the computation structure, the implementation technology, the frequency, duration and location of faults, and many other factors all have a certain impact on the effectiveness of a particular fault-tolerant approach. In our research, we have attempted to look at different areas of fault tolerance and have tried to integrate them under one umbrella. A comprehensive approach to fault tolerance is perhaps the only solution that may succeed in the difficult task of redundancy management. Such an approach covers design for fault tolerance and testing. A comprehensive approach requires a proper perspective, especially in distributed systems. A four-layered view of fault tolerance in multiprocessor systems, as shown in Figure 1.1, may prove to be very useful.

Development of Techniques for Improving the Reliability of Digital Systems Through Logical Redundancy - Phase II

Jack Goldberg, Robert C. Minnick, and William H. Kautz
May 1962
Abstract
This report summarizes the second year of study of techniques for improving digital system reliability through application of logical redundancy. Three major topics are discussed: organization of a computer for efficient use of redundancy techniques, use of redundancy to mask faults in computer memories, and logical redundancy techniques for sequential networks. Each of these three topics is treated in a self-contained part of the report.

Redundancy Techniques to Improve the Reliability of Two Level and Three Level Logic Circuits

Kasivisvanathan Vairavan
The University of Madras, India
Preface
The object of this paper is to investigate redundancy techniques to improve the reliability of logic circuits and to present some new concepts.

Development of Techniques for Improving the Reliability of Digital Systems Through Logical Redundancy - Phase III

Jack Goldberg, J.A. Baer, and Robert C. Minnick
August 1963
Abstract (excerpt)
This is the final report on the third year of a study of techniques for improving the reliability of digital systems through the application of redundancy. The objectives of this year's research were (1) continuation of the previous year's study of the design of reliable memory read access switches and extension to over-all memory system considerations, (2) to study means for correcting faults in the amplifiers (known as transfer circuits) of the Inhibit-Core computer scheme by means of logical redundancy, and (3) to develop a breadboard version of a reliable Inhibit-Core

EDAC and Dynamic Faults

Conclusion
A combinational EDAC circuit can provide error detection and correction capabilities against static errors. Proper analysis must be conducted for dynamic errors such as signals that oscillate or have non-logic levels.
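
As a rough illustration of the static case, here is a minimal sketch of a single-error-correcting Hamming(7,4) code. The choice of this particular code is an assumption for illustration; the page does not say which code the EDAC circuit uses. The sketch models only a static single-bit flip. It says nothing about oscillating signals or non-logic levels, which is exactly the gap the conclusion warns about.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Bit positions 1..7 hold p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(codeword):
    """Return (data bits, error position); position 0 means no error seen.

    A nonzero syndrome is the 1-based position of the single flipped bit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1  # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1  # inject a static single-bit error at position 5
    print(correct(word))
```

The decoder assumes every input bit is a stable 0 or 1 when sampled. A bit that changes between the syndrome computation and the correction step, as it could in real combinational logic, is outside what this model can represent.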

Application of Redundancy in the Saturn V Guidance and Control System

F.B. Moore and J.B. White
NASA Marshall Space Flight Center
AIAA Paper No. 67-553
AIAA Guidance, Control, and Flight Dynamics Conference
Huntsville, Alabama
August 14-16, 1967
Abstract
The Saturn launch vehicle's guidance and control system is so complex that the reliability of a simplex system is not adequate to fulfill mission requirements. Thus, to achieve the desired reliability, redundancy encompassing a wide range of types and levels was employed. At one extreme, the lowest level, basic components (resistors, capacitors, relays, etc.) are employed in series, parallel, or quadruplex arrangements to insure continued system operation in the presence of possible failure conditions. At the other extreme, the highest level, complete subsystem duplication is provided so that a backup subsystem can be employed in case the primary system malfunctions. In between these two extremes, many other redundancy schemes and techniques are employed at various levels. Basic redundancy concepts are covered to gain insight into the advantages obtained with various techniques. Points and methods of application of these techniques are included. The theoretical gain in reliability resulting from redundancy is assessed and compared to a simplex system. Problems and limitations encountered in the practical application of redundancy are discussed as well as techniques verifying proper operation of the redundant channels. As background for the redundancy application discussion, a basic description of the guidance and control system is included.
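
The comparison against a simplex system can be sketched with textbook formulas. This is an illustration under stated assumptions, not the analysis in the paper: failures are taken to be independent, the voter and switching logic are taken to be perfect, and triple modular redundancy with majority voting is used as an example scheme that the abstract itself does not name. The function names are hypothetical.

```python
def parallel(r, copies):
    """Reliability of redundant copies where any one good copy suffices.

    Assumes independent failures and perfect switchover to a good copy.
    """
    return 1 - (1 - r) ** copies


def tmr(r):
    """Reliability of three copies with a majority voter.

    Assumes independent failures and a perfect voter: the system is good
    when at least two of the three copies are good.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    for r in (0.4, 0.5, 0.9, 0.99):
        print(r, parallel(r, 2), tmr(r))
```

Under these assumptions majority voting only improves on a simplex channel when the channel reliability exceeds 0.5; below that, voting makes matters worse, which is one reason the practical limits of redundancy deserve the attention the abstract gives them.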

Home - NASA Office of Logic Design
Last Revised: February 03, 2010
Web Grunt: Richard Katz
