Melanie Berg
NASA Office of Logic Design
December 12, 2004
NASA Goddard Space Flight Center
Bldg. 11, AETD Conference Room
10 am to 12 noon
Abstract
As integrated circuit (IC) geometries shrink and core voltages scale down, the probability of incurring system faults increases significantly. Errors occur when a charged particle penetrates a memory cell and crosses a junction, creating an aberrant charge that changes the state of the bit. Depending on the speed of the specified FPGA or ASIC circuit and the geometry of the employed technology, transistor-level “Hardened by Design” techniques may not be sufficient to meet design requirements. To keep pace with IC advancements, FPGA and ASIC designers will have to carefully consider architectures that contain some degree of gate-level fault tolerance (mitigation). Fault tolerance is defined as masking or recovering from erroneous conditions in a system once they have been detected.
Due to the radiation effects in space, the aerospace industry has always had to design with SEU (Single Event Upset) mitigation. For gate-level protection of D flip-flops (DFFs), Triple Modular Redundancy (TMR, i.e. voting) logic is the most commonly used scheme to combat SEUs. However, TMR can be very area intensive and, in a turbulent environment, may not fully eliminate the probability of upsets. Many error-coding techniques have been proposed as a complement (or replacement) to TMR; however, due to their complexity, they are rarely implemented.
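To make the voting idea concrete, the following is a minimal behavioral sketch in Python, not an HDL implementation and not taken from the seminar material. The function name `majority` and the 4-bit register values are assumptions chosen for illustration. It shows a bitwise 2-of-3 vote masking an upset in one redundant copy, and failing to mask the same bit being upset in two copies, which is the residual risk mentioned above.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant register copies."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 4-bit register value held in three redundant copies.
good = 0b1010
upset = good ^ 0b0100  # the same value with bit 2 flipped by an upset

# One corrupted copy: the voter masks the error.
single_upset = majority(good, good, upset)

# Two copies corrupted in the same bit: the voter outputs the wrong value.
double_upset = majority(good, upset, upset)

print(f"single upset -> {single_upset:04b}")
print(f"double upset -> {double_upset:04b}")
```

In hardware the same Boolean expression would be applied per bit to the outputs of three DFFs; the sketch only models the logic function, not timing or routing effects.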
Interestingly, the theory of fault tolerance (or mitigation) is very extensive, yet it is seldom emphasized that errors are not only random but also asynchronous to the circuitry. Nor does the theory cover how to actually implement FPGA or ASIC designs that can correctly detect asynchronous errors without worsening the fault (turning an SEU into a SEFI, a single-event functional interrupt). The problem arises when errors occur near clock edges: because of differences in routing delays (and perhaps glitchy mitigation circuitry), the output of the detection logic may be seen by some DFFs but not by others. In the worst case, the asynchronous error event can set off a chain of metastability. Such a scenario has a very low probability of occurring in a slow and/or simple architecture such as a shift register. However, as the clock frequency and the DFF fan-out increase, as in counters or complicated finite state machines (FSMs), the probability of faulty transitions increases drastically.
This seminar will address Single Event Upsets (SEUs) within edge-triggered D flip-flops (DFFs) and assumes that the upsets are soft (correctable by the following clock edge). It will be shown that if the designer does not take into account the asynchronous nature of the SEU, the probability of incurring a SEFI increases. Additionally, an approach to fault-tolerant state machine design, from architectural development through synthesis, will be proposed. Examples of coding schemes that include additional logic for error detection and (in some cases) correction, such as One-Hot, Sequential, and Hamming, will be examined. Because users have run into roadblocks with synthesis tools “optimizing” away logic needed for error handling, special attention will be given to the techniques needed to make these tools produce the correct realization of functionality.
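As a rough illustration of the difference between detection and correction in state encodings, the sketch below models two of the schemes named above in Python. It is a software model under stated assumptions, not the seminar's implementation: the function names (`is_one_hot`, `hamming74_encode`, `hamming74_correct`, `hamming74_decode`) are hypothetical, and a standard Hamming(7,4) layout with parity bits at positions 1, 2, and 4 is assumed. A one-hot state register can detect a single flipped bit, because the result has zero or two bits set, while a Hamming-encoded state can also locate and correct it.

```python
def is_one_hot(state: int) -> bool:
    """True if exactly one bit is set (a legal one-hot state)."""
    return state != 0 and (state & (state - 1)) == 0


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = {3: d[0], 5: d[1], 6: d[2], 7: d[3]}   # data positions
    word[1] = word[3] ^ word[5] ^ word[7]         # parity over 1,3,5,7
    word[2] = word[3] ^ word[6] ^ word[7]         # parity over 2,3,6,7
    word[4] = word[5] ^ word[6] ^ word[7]         # parity over 4,5,6,7
    return sum(bit << (pos - 1) for pos, bit in word.items())


def syndrome(code: int) -> int:
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (code >> (pos - 1)) & 1:
            s ^= pos
    return s


def hamming74_correct(code: int) -> int:
    """Flip the bit named by the syndrome, correcting a single-bit error."""
    s = syndrome(code)
    return code ^ (1 << (s - 1)) if s else code


def hamming74_decode(code: int) -> int:
    """Extract the 4 data bits from positions 3, 5, 6, 7."""
    return sum(((code >> (pos - 1)) & 1) << i
               for i, pos in enumerate((3, 5, 6, 7)))


# One-hot: a single upset is detectable but the original state is unknown.
print(is_one_hot(0b0100), is_one_hot(0b0110), is_one_hot(0b0000))

# Hamming: a single upset in any of the 7 bits is corrected.
state = 0b1011
corrupted = hamming74_encode(state) ^ (1 << 4)
print(hamming74_decode(hamming74_correct(corrupted)) == state)
```

The model assumes the corrupted value is sampled cleanly; it does not capture the asynchronous timing hazards discussed above, which arise only in the hardware realization.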
Last Revised:
February 03, 2010
