NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.


2.5 Reliability and Fault Tolerance

The reliability of a system is defined as the probability of correct operation for a specified period of time under given conditions. Fault tolerance is a more recent and more general concept: the system's ability to continue to operate without errors after a fault has occurred. Until recently, reliability has been used almost exclusively to describe the operational lifetime of computers, as well as other devices. Conventional techniques which have been used for achieving ultra-high reliability in spaceborne digital computer systems include:

In the past, the measure of a computer's reliability has generally been expressed as a mean time between failures (MTBF), which is the reciprocal of the estimated failure rate of the computer. The reliability analysis typically begins with a study of the components and fabrication processes from which the computer is to be built, including tests to determine the failure rate of the individual components and examinations of failed and unfailed components to determine the causes of failures. From these data and estimates of the number of parts to be used in the design, the failure rate for the total assembly is estimated, based upon certain mathematical assumptions (often unstated) such as independence of failures. At the present state of the art in construction of equipment, representative MTBFs for a medium-sized digital computer constructed with various component reliabilities would be as tabulated below.

Reliability Level of Parts and Practices    Representative MTBF (hr)
Commercial                                  500
Military                                    2,000
High Reliability                            10,000
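
The parts-count estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used for any particular computer. It assumes independent failures and a constant failure rate per part, so that part failure rates add and the probability of correct operation for t hours is exp(-t/MTBF). The parts list, its failure rates, and the 1,000 hr mission duration are hypothetical values chosen for illustration; only the three MTBF figures come from the table.

```python
import math

def assembly_failure_rate(parts):
    """Total failure rate (failures/hr) of an assembly.

    parts maps a part type to (quantity, failure rate per part per hour).
    Assumes independent failures and that any single part failure
    fails the whole assembly, so the individual rates simply add.
    """
    return sum(quantity * rate for quantity, rate in parts.values())

def mtbf_hours(failure_rate_per_hour):
    """MTBF is the reciprocal of the estimated failure rate."""
    return 1.0 / failure_rate_per_hour

def reliability(t_hours, mtbf):
    """Probability of correct operation for t_hours (constant failure rate assumed)."""
    return math.exp(-t_hours / mtbf)

# Hypothetical parts list, for illustration only (not data from the report).
PARTS = {
    "integrated circuit": (5000, 1e-8),
    "discrete component": (1600, 2e-8),
    "connector pin": (400, 5e-8),
}

# Representative MTBFs taken from the table.
TABLE_MTBF_HR = {"Commercial": 500, "Military": 2000, "High Reliability": 10000}

MISSION_HR = 1000  # hypothetical mission duration

rate = assembly_failure_rate(PARTS)
print(f"parts-count estimate: {rate:.3g} failures/hr, MTBF {mtbf_hours(rate):.0f} hr")
for level, mtbf in TABLE_MTBF_HR.items():
    print(f"{level}: R({MISSION_HR} hr) = {reliability(MISSION_HR, mtbf):.3f}")
```

Under these assumptions, a 1,000 hr mission gives a reliability of about 0.14, 0.61, and 0.90 for the commercial, military, and high-reliability MTBFs respectively.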

The mean time to failure of the Apollo guidance computers in the field has been about 13,000 hr, and the failure rate of the integrated circuit gates has been less than 0.001% per 1,000 hr. This excellent performance is attributed to the fact that only one gate type was used, and that an unrelenting effort was made in screening, failure analysis, and the implementation of production, test, and flight-processing specifications.
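
The two figures quoted for the Apollo guidance computers use different units, and a short conversion puts them on the same footing. The sketch below converts the gate failure rate bound from percent per 1,000 hr to failures per hour (and to FIT, failures per 10^9 device-hours), and takes the reciprocal of the 13,000 hr mean time to failure. Treating that reciprocal as a failure rate assumes a constant failure rate, which the text does not state; the function name is also an illustrative choice.

```python
def percent_per_1000hr_to_per_hour(percent):
    """Convert a failure rate in % per 1,000 hr to failures per hour."""
    return percent / 100.0 / 1000.0

# Upper bound on the gate failure rate quoted in the text.
GATE_RATE_BOUND = percent_per_1000hr_to_per_hour(0.001)

# Reciprocal of the reported mean time to failure (constant rate assumed).
COMPUTER_RATE = 1.0 / 13000.0

print(f"gate failure rate bound: {GATE_RATE_BOUND:.1e} failures/hr "
      f"({GATE_RATE_BOUND * 1e9:.0f} FIT)")
print(f"computer failure rate:   {COMPUTER_RATE:.1e} failures/hr")
```

In these units the gate bound is 1e-8 failures per hour (10 FIT), and the computer-level figure is roughly 7.7e-5 failures per hour.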

2.5.1 Redundancy
2.5.2 Restarting


Last Revised: February 03, 2010
Digital Engineering Institute
Web Grunt: Richard Katz