NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.

4.5 Reliability and Fault Tolerance

One suggested technique for defining reliability requirements is to list the functions the computer must perform during each mission phase, together with the required probability of successful performance for each function. The reliability specifications should be reasonable and appropriate for the mission. For example, the hardware and software should be highly reliable for guidance and navigation, whereas lower reliability specifications may be acceptable for experiments. If possible, the extent of coverage or fault tolerance should be specified. The reliabilities of different systems should be compared only when they have been predicted for the same assumed conditions and those conditions are carefully described. An unqualified specification of MTBF (mean time between failures) should be considered inadequate.
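To illustrate why an unqualified MTBF says little by itself, the following minimal sketch converts an MTBF into a probability of success for a stated operating time. It assumes a constant failure rate (the exponential model), which is only one possible assumption; the function name, the MTBF value, and the phase durations are hypothetical and are not taken from any mission.

```python
import math


def success_probability(mtbf_hours: float, operating_hours: float) -> float:
    """Probability of no failure during operating_hours.

    Assumes a constant failure rate equal to 1 / mtbf_hours
    (exponential model). This is an illustrative assumption.
    """
    if mtbf_hours <= 0 or operating_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and operating_hours >= 0")
    return math.exp(-operating_hours / mtbf_hours)


# Hypothetical mission phases and durations in hours.
phases = {"boost": 0.2, "cruise": 2000.0, "encounter": 10.0}

# The same MTBF yields very different success probabilities per phase,
# so the operating time and conditions must accompany the MTBF figure.
for name, hours in phases.items():
    print(f"{name:10s} {success_probability(10_000.0, hours):.4f}")
```

The same MTBF figure corresponds to quite different probabilities of success depending on the operating time, which is why the conditions should be stated along with it.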

To establish a meaningful measure of reliability, the following items should be considered: 1) the required operational time before failure; 2) the repair time, if repairs are possible; 3) the mission phases; 4) the probability of operating during a critical time period; and 5) environmental factors. For multiple-channel organizations, it should be noted that reliability predictions are generally based on the assumptions that failures are independent and that the channels do not interfere with one another. For multiphase missions, a state-phase calculation may be used to determine how many computers or redundant modules should be operable at each critical phase. A thorough reliability analysis of all anticipated failure modes should be conducted. During initial testing and checkout, every failure should be attributed to one of these specific modes, or additional modes should be added, so that all experienced failures are completely understood. The failure modes and effects analysis should take into account the topology of the devices and similar hardware-related mechanisms. The reliability of the design is finally verified by diagnostic and functional tests of the system under expected environmental conditions.
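The independent-failure assumption can be made concrete with a small sketch. The code below is not a full state-phase analysis; it only computes the probability that at least k of n identical units are working, and searches for the smallest n that meets a target. It assumes independent failures, identical units, no repair, and no interference between channels. All function names and numbers are hypothetical.

```python
import math


def k_of_n_reliability(unit_reliability: float, k: int, n: int) -> float:
    """Probability that at least k of n independent, identical units work."""
    r = unit_reliability
    return sum(
        math.comb(n, i) * r**i * (1.0 - r) ** (n - i)
        for i in range(k, n + 1)
    )


def units_needed(unit_reliability: float, k: int, target: float,
                 max_units: int = 10):
    """Smallest n (k <= n <= max_units) meeting the target, or None."""
    for n in range(k, max_units + 1):
        if k_of_n_reliability(unit_reliability, k, n) >= target:
            return n
    return None


# Hypothetical case: one working computer is required at a critical phase,
# each unit has reliability 0.9 up to that phase, and the target is 0.995.
print(units_needed(0.9, k=1, target=0.995))
```

If failures are correlated, or if a failed channel can disturb a healthy one, the binomial sum above overstates the reliability, which is the caution the text raises.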

To achieve reliability in production of the computer, closely controlled manufacturing and assembly processes, together with quality workmanship and practices, should be required in all facets of the hardware development. Components should be selected for inherent reliability; proven methods should be used for their interconnection and packaging; and they should be subjected to rigid quality inspection and electrical checks. Highly recommended practices are: parts standardization, component derating with respect to both voltage levels and speed, and 100% screening of all parts and assemblies. Circuits should exhibit tolerance to transients. Adequate margins should be placed on the environmental and electrical characteristics, so that the chances of failure at later stages of assembly are noticeably reduced. Extensive research is recommended to establish: the types of semiconductors to be used; the techniques for preparing and processing wafers; mask characteristics; failure probabilities for specific IC, MSI, or LSI technologies; and the reliabilities of these devices in the space environment. Where more current data are unavailable, the use of MIL Handbook 217A or RADC Handbook, vol. 2, is recommended for failure-rate data.
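Component derating with respect to voltage and speed can be checked mechanically. The sketch below flags parts whose applied stress exceeds a chosen fraction of the rated value. The limits of 0.8 for voltage and 0.75 for speed, the part names, and the ratings are hypothetical placeholders; actual limits would come from the project's own derating policy.

```python
from dataclasses import dataclass


@dataclass
class PartUsage:
    name: str
    rated_volts: float
    applied_volts: float
    rated_mhz: float
    applied_mhz: float


def derating_violations(parts, max_voltage_ratio=0.8, max_speed_ratio=0.75):
    """Return (part name, stress type) pairs that exceed the derating limits.

    The default ratios are illustrative assumptions, not standard values.
    """
    violations = []
    for p in parts:
        if p.applied_volts / p.rated_volts > max_voltage_ratio:
            violations.append((p.name, "voltage"))
        if p.applied_mhz / p.rated_mhz > max_speed_ratio:
            violations.append((p.name, "speed"))
    return violations


# Hypothetical parts list.
parts = [
    PartUsage("gate_array", 7.0, 5.0, 10.0, 5.0),
    PartUsage("line_driver", 6.0, 5.5, 4.0, 3.5),
]

for name, stress in derating_violations(parts):
    print(f"{name}: {stress} derating limit exceeded")
```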

The use of redundancy to enhance reliability should be considered where it is needed to meet the mission specifications. The particular form of redundancy to be used will be a function of the time allowed for repair and may depend on the hardware technology with which the computer is to be mechanized. Both the method of redundancy and its level of application should be chosen carefully. The increased power supply and cooling requirements, and the effects of additional complexity on system checkout and personnel training, should also be evaluated when various schemes are considered. Power switching should be considered for removing failed units from the system and for incorporating spares into it. The software should provide for recovery from specified malfunctions and for subsequent restart. If neither degraded performance nor computer failure can be tolerated, a means of repair should be provided. Inflight maintenance, i.e., manual replacement of failed modules, should be considered for future manned missions.
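Why the method of redundancy should be chosen carefully can be seen from a small comparison. The sketch below evaluates three textbook cases under idealized assumptions: a single (simplex) unit, triple modular redundancy with a perfect voter, and one unpowered standby spare with perfect failure detection and switching. A constant failure rate is assumed throughout, and the rate and durations are hypothetical.

```python
import math


def simplex(rate_per_hour: float, hours: float) -> float:
    """Reliability of one unit with a constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def tmr(unit_reliability: float) -> float:
    """Two-of-three majority with a perfect voter: 3R^2 - 2R^3."""
    r = unit_reliability
    return 3.0 * r**2 - 2.0 * r**3


def cold_standby_one_spare(rate_per_hour: float, hours: float) -> float:
    """One active unit plus one unpowered spare that cannot fail while
    idle, with perfect detection and switching (idealized assumptions)."""
    x = rate_per_hour * hours
    return math.exp(-x) * (1.0 + x)


rate = 1e-4  # hypothetical failures per hour
for hours in (100.0, 1_000.0, 10_000.0):
    r = simplex(rate, hours)
    print(f"{hours:8.0f} h  simplex={r:.4f}  tmr={tmr(r):.4f}  "
          f"standby={cold_standby_one_spare(rate, hours):.4f}")
```

Under these assumptions the voted configuration helps for short operating times but falls below a single unit once the unit reliability drops under 0.5, whereas the standby spare depends entirely on the detection and switching that the text says should be considered.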

Self-check programs, error-detecting codes, duplication and comparison, and/or voting among multiple units should be used in redundant systems to detect errors and/or malfunctions. It may be necessary to specify the possible configurations which the computer may assume in case of inflight failure of external or peripheral equipment, as well as of internal subsystems. The additional reliability gained by such restructuring of the system should be studied both analytically and by means of simulation, and the software requirements for providing these alternatives should be considered early in the design. The response time for reconfiguring and restarting should be evaluated, since even a temporary outage of the computer could be unacceptable during critical mission phases. Complicated redundant systems should be extensively tested to verify that the reconfiguration or fault-masking technique operates correctly in all cases and does not introduce errors. It is good practice to introduce faults artificially to ensure that the system masks them or reconfigures correctly. Combinations of these artificially introduced faults should also be tried.
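Voting and artificial fault introduction can be sketched in a few lines. The example below models a majority voter over several channels and then injects every combination of up to a given number of faulty channels, recording whether the voted output is still correct. It is a software simulation only; the function names are hypothetical, and all faulty channels are assumed to produce the same wrong value, which is a pessimistic case for a voter.

```python
from collections import Counter
from itertools import combinations


def majority_vote(outputs):
    """Return the value held by a strict majority of channels, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def inject_faults(outputs, faulty_channels, bad_value):
    """Replace the outputs of the selected channels with bad_value."""
    return [bad_value if i in faulty_channels else v
            for i, v in enumerate(outputs)]


def fault_injection_campaign(n_channels, correct, bad_value, max_faults):
    """Try every combination of 0..max_faults faulty channels.

    Returns a dict mapping each tuple of faulty channel indices to True
    if the voter still delivered the correct value, False otherwise.
    """
    results = {}
    for k in range(max_faults + 1):
        for faulty in combinations(range(n_channels), k):
            outputs = inject_faults([correct] * n_channels,
                                    set(faulty), bad_value)
            results[faulty] = majority_vote(outputs) == correct
    return results


# Three channels, up to two simultaneous injected faults.
for faulty, ok in fault_injection_campaign(3, 42, -1, 2).items():
    print(faulty, "masked" if ok else "NOT masked")
```

In this simulation every single fault is masked and every double fault is not, which is the kind of boundary a fault-injection test campaign is meant to expose before flight.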

Last Revised: February 03, 2010
Digital Engineering Institute
Web Grunt: Richard Katz