NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.

2.5.1 Redundancy

Beyond these conventional reliability requirements, spaceborne computers have been designed to continue accurate operation, even after a transient or permanent failure, through the use of redundant elements. In this concept, two or more elements, each capable of performing the necessary function, are carried in the vehicle. In the event of a failure in one element, the others are able to carry on. The penalties in volume, weight, and power prevented extensive use of redundancy in early space projects, but the advent of integrated circuitry has made redundancy techniques practical. Two categories of redundancy have been utilized (ref. 71): 1) dynamic redundancy, in which the occurrence of a fault is detected, and the fault and/or its effect is subsequently corrected; and 2) static or masking redundancy, in which the effect of a faulty element (component, circuit or system) is immediately masked by permanently connected and continually operating replicas of that element.

The early studies of redundancy techniques concentrated on static redundancy at the component level. Thus, gates or flip-flops were triplicated and majority logic was used, diodes were quadded as on the OAO (ref. 21), etc. As the reliability of individual components improved, it was found that the failure rates of voters relative to those of gates or flip-flops made it preferable to introduce redundancy at the level of large modules, such as whole memories or arithmetic elements, thus reducing the complexity of the voting or checking circuitry relative to that of the circuitry being checked. At the present time, the reliability of equipment built with high reliability integrated circuit components is a function of the number of interconnections (welded joints, soldered joints, and especially connector pins), as well as the number of active and passive elements. Thus, there is a tendency to keep the number of interconnections to a minimum by applying redundancy at a very high level.

The simplest redundancy technique is to use two identical systems with a provision for switching from one to the other when a failure is discovered. This dynamic redundancy approach is used in the Apollo telescope mount (ATM) pointing computer now in development. In the ATM, two identical computers, an active and a spare, are connected to the workshop computer interface unit (WCIU), which is an executive device that monitors the active computer and switches operation to the backup in case of a malfunction (ref. 72). Alternatively, two different systems can be used, with the backup system having less capability than the primary system but enough to allow the mission to be aborted safely. This is the case with the Apollo lunar module's abort guidance system, in which the AEA computer provides a backup for the primary Apollo C&N system (ref. 16).

Dynamic redundancy and error-detection logic are also used to improve the reliability of the Saturn LVDC memory section (ref. 25). In this self-correcting duplex scheme, shown in figure 6, when both memories are operating properly, each memory is controlled by an independent buffer register and both are simultaneously read and updated. Initially, only one buffer register output is used, but both buffer register outputs are simultaneously parity-checked. When a parity error is detected in the memory being used, operation immediately transfers to the other memory. Both memories are then regenerated by the buffer register of the "good" memory, thus correcting transient errors. After the parity-checking and error-detection circuits have verified that the erroneous memory has been corrected, each memory is again controlled by its own buffer register. Operation is not transferred to the previously erroneous memory until the "good" memory develops its first error. Consequently, instantaneous switching from one memory output to another permits uninterrupted computer operation unless simultaneous failures at the same storage location in both memories cause complete system failure.
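The duplex read cycle described above can be sketched in a few lines of Python. This is only an illustrative model, not the LVDC design: the class name DuplexMemory, the helper functions, the use of one odd-parity bit per word, and the word-level granularity are all assumptions made for the example.

```python
def odd_parity_bit(word: int) -> int:
    """Return the bit that makes the total count of 1s (data + parity) odd."""
    return 0 if bin(word).count("1") % 2 == 1 else 1


def parity_ok(word: int, parity: int) -> bool:
    """Check the (assumed) odd-parity rule for a stored word."""
    return (bin(word).count("1") + parity) % 2 == 1


class DuplexMemory:
    """Hypothetical model of two memories that are read and rewritten together."""

    def __init__(self, size: int) -> None:
        blank = (0, odd_parity_bit(0))
        self.banks = [[blank] * size for _ in range(2)]
        self.active = 0  # index of the memory whose output is currently used

    def write(self, addr: int, word: int) -> None:
        cell = (word, odd_parity_bit(word))
        for bank in self.banks:
            bank[addr] = cell

    def read(self, addr: int) -> int:
        # Both memories are read and parity-checked on every access.
        cells = [bank[addr] for bank in self.banks]
        good = [parity_ok(w, p) for (w, p) in cells]
        if not good[self.active]:
            standby = 1 - self.active
            if not good[standby]:
                raise RuntimeError("both memories failed at the same location")
            # Transfer operation to the other memory; it stays active
            # until it develops its own first error.
            self.active = standby
        word, parity = cells[self.active]
        # Regenerate both memories from the good copy, which also
        # corrects a transient error in the other memory.
        for bank in self.banks:
            bank[addr] = (word, parity)
        return word


mem = DuplexMemory(size=4)
mem.write(2, 0b1011)
word, parity = mem.banks[0][2]
mem.banks[0][2] = (word ^ 0b0001, parity)  # inject a single-bit transient error
print(mem.read(2) == 0b1011, mem.active)
```

A single parity bit detects only an odd number of flipped bits per word, so the sketch shares that limitation; the failure case it raises on corresponds to the simultaneous same-location failure mentioned in the text.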

Some recently developed computers are designed with one or more spares of each subsystem (processors, memories, I/O units, etc.). The spares are held in an unpowered state and are used to replace operating units when a permanent fault is discovered. Examples are the Jet Propulsion Laboratory's self-test-and-repair (STAR) breadboard computer (refs. 73 to 75) and the onboard processor for OAO-B (refs. 22 and 23).

A popular static redundancy scheme is triple modular redundancy (TMR) with majority voting. This approach is used to improve the reliability of the logic section in the Saturn V LVDC. Figure 7 shows part of the LVDC logic divided into subsections called modules (M). Three identical copies of each module are used, with each channel receiving the same inputs at the same time. The outputs of the three channels are transmitted to majority voting circuits (V), which check the inputs for agreement. If one input differs, it is disregarded, so a single component failure will not cause a system malfunction (refs. 25, 76, 77). The LVDC is divided into seven modules with an average of ten voted outputs. Another circuit, called a disagreement detector (DD), monitors system performance by signaling the ground equipment whenever voter inputs are not identical. This is particularly important during prelaunch checkout to ensure that a failure is not present and masked by the voters. If a masked failure did exist, the overall reliability of the system would be less than that of a nonredundant (simplex) configuration, since a failure in either of the remaining good channels would cause the voter to give a faulty indication. The LVDC disagreement detector outputs are OR'd together so that malfunctions can be isolated to one, two, or three replaceable subassemblies.

Figure 7. Triple modular redundancy in Saturn V LVDC logic design.
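The voter and disagreement detector can be illustrated with a small Python sketch. It models one voted stage only, treats each channel as a function on integers, and votes bit by bit; the names majority_vote, disagreement, and run_tmr are hypothetical and are not taken from the LVDC design.

```python
from typing import Callable, Sequence, Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def disagreement(a: int, b: int, c: int) -> bool:
    """Disagreement detector: true whenever the voter inputs are not identical."""
    return not (a == b == c)


def run_tmr(channels: Sequence[Callable[[int], int]], x: int) -> Tuple[int, bool]:
    """Feed the same input to three channels, then vote and check agreement."""
    outputs = [channel(x) for channel in channels]
    return majority_vote(*outputs), disagreement(*outputs)


def good(x: int) -> int:
    """A healthy copy of a (made-up) module function."""
    return (x + 1) & 0xFF


def stuck_at_zero(x: int) -> int:
    """A failed copy whose output is stuck at zero."""
    return 0


# One failed channel: the voter masks it, but the detector flags it.
print(run_tmr([good, good, stuck_at_zero], 5))
# Two failed channels: the voter now follows the faulty majority.
print(run_tmr([good, stuck_at_zero, stuck_at_zero], 5))
```

The second call shows why a masked failure matters during checkout: once one channel has silently failed, a failure in either remaining channel can defeat the vote.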

TMR is also used to achieve the very high reliability required for the common section of the ATM WCIU, mentioned previously. However, instead of hardware disagreement detectors, the WCIU has a software capability for independently checking redundant channels. Over a short period, TMR can provide a substantial increase in reliability. For example, the TMR logic section in the LVDC is approximately twenty times more reliable for a 250-hr mission than an equivalent simplex system, with only about 3.5 times as many components (ref. 25). However, for longer duration missions, TMR with majority voting is less reliable than simplex, as shown for a single module in figure 8 (ref. 78).

Figure 8. Comparison of reliability techniques as functions of normalized mission time (ref. 78).
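The crossover between TMR and simplex can be reproduced with the standard textbook model of a single voted module. The sketch below assumes independent modules with a constant failure rate, a perfect voter, and normalized time equal to failure rate multiplied by mission time; these assumptions and the function names are illustrative and are not taken from reference 78.

```python
import math


def simplex_reliability(t_norm: float) -> float:
    """R = exp(-lambda * t) for one module; t_norm is lambda * t (assumed)."""
    return math.exp(-t_norm)


def tmr_reliability(t_norm: float) -> float:
    """At least two of three modules survive, with a perfect voter (assumed).

    R_tmr = R^3 + 3 R^2 (1 - R) = 3 R^2 - 2 R^3
    """
    r = simplex_reliability(t_norm)
    return 3 * r**2 - 2 * r**3


for t_norm in (0.1, math.log(2), 1.0):
    print(f"{t_norm:.3f}  simplex={simplex_reliability(t_norm):.4f}  "
          f"tmr={tmr_reliability(t_norm):.4f}")
```

Under these assumptions the two curves cross where the simplex reliability equals 0.5, that is, at a normalized time of ln 2. Before that point TMR is the more reliable arrangement, and beyond it simplex is.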

A substantial reliability increase over simplex operation can be achieved by reorganizing the TMR system to switch out one of the two remaining good modules, as well as the failed one, in the event of a malfunction (ref. 78). As a result, one or more modules may operate simplex while the remainder operate TMR (the "TMR/Simplex" curve in fig. 8). A somewhat greater improvement, shown by the "Switchable Spare" curve in figure 8, can be obtained by enabling the system to switch to the remaining good unit in the event of a second failure in a module. An extension of this concept, discussed in reference 79, consists of an N-tuply modular redundant (NMR) module with an associated bank of spare units. When one of the N active units fails, a spare unit replaces it and restores the NMR module to the all-perfect state. These approaches use a combination of static and dynamic redundancy to provide uninterrupted, long-duration reliability.
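As a rough check on the ordering of these schemes, the sketch below evaluates idealized closed-form reliabilities for a single module. It assumes independent units with a constant failure rate, perfect fault detection and switching, and units that age equally whether or not they are in use. The switchable-spare expression is simply "the module survives until its last unit fails" and may not be the model behind figure 8. All function names are hypothetical.

```python
import math


def tmr(r: float) -> float:
    """Plain TMR with a perfect voter: at least two of three units good."""
    return 3 * r**2 - 2 * r**3


def tmr_simplex(r: float) -> float:
    """TMR that drops to one unit after the first failure.

    Derived by integrating over the time of the first failure under a
    constant failure rate: R^3 + 1.5 (R - R^3) = 1.5 R - 0.5 R^3.
    """
    return 1.5 * r - 0.5 * r**3


def last_unit_standing(r: float) -> float:
    """Idealized switchable arrangement: fails only when all three units fail."""
    return 1 - (1 - r) ** 3


for t_norm in (0.25, 0.5, 1.0, 2.0):
    r = math.exp(-t_norm)  # simplex reliability at normalized time t_norm
    print(f"{t_norm:.2f}  simplex={r:.4f}  tmr={tmr(r):.4f}  "
          f"tmr/simplex={tmr_simplex(r):.4f}  switchable={last_unit_standing(r):.4f}")
```

In this model the TMR/Simplex value is never below either plain TMR or simplex, and the last-unit-standing value is never below TMR/Simplex, which agrees with the qualitative ordering described in the text.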

The ability to tolerate multiple failures may be implied by a long-duration high-reliability requirement, or directly stated by a fail-safe or fail-operational criterion. The latter case cannot strictly be called a reliability requirement, since no probabilities of failure are involved. Instead, this approach to fault tolerance requires that no single failure shall jeopardize the mission (fail-safe), or shall degrade system performance (fail-operational). These criteria are often stacked together, as in the fail-operational/fail-operational/fail-safe requirement for the proposed Space Station-Space Base computer system, which specifies normal operation after any two failures and safe operation after any three.

The use of redundancy is not always the answer to reliability improvement problems, however. Redundancy will improve the probability of success only if the simplex probability is high. The cost of a redundant system in terms of weight, volume, and power consumption can be quite severe, and a careful analysis might show that a more reliable system can be obtained by allocating the same resources to a simplex computer. Moreover, the checkout of a simplex system is often easier since there is no need to check each redundant element for failures that may be masked by some redundancy techniques.

Last Revised: February 03, 2010
Digital Engineering Institute
Web Grunt: Richard Katz