A. Definitions: While normally associated with computers, many of the concepts in this section also apply to the “configuration memory” of FPGAs.
The contents of the memory are physically fixed by the structure of the memory element.
Examples: core rope memories (wire wound through or around a core), fusible link PROMs, and antifuse-based PROMs.
The contents of the memory are non-volatile, like the fixed memories, but the contents can be changed. In many cases this involves an erase operation and then a write.
Examples: core, plated wire, electrically erasable programmable read only memories (EEPROM), erasable programmable read only memories (EPROM), ferroelectric memories, and flash. The “ROM” in EPROM and EEPROM is a poor part of the name as it implies permanence, which is incorrect. Devices such as EEPROM may need “refreshing” over long missions, as many are rated with a 10-year storage lifetime, giving them volatile characteristics.
The contents of the memory are volatile; they do not retain contents either after the cycling of power or during “brown out” conditions. This class is subdivided into two subclasses: static, which will retain state indefinitely while powered, and dynamic, where the memory must be read and subsequently refreshed.
Examples include SRAM, DRAM, and SDRAM.
B. Protection During Power-Up/Down Transitions: This has been noted as a common problem for erasable non-volatile memories. The analysis and test must carefully examine all of the signals for proper and safe operation during power-up, power-down, and brown out transients. Note that the real power supply and its bounded characteristics must be used, not laboratory supplies, which most likely will have substantially different characteristics. Some devices have a reset pin to help protect against inadvertent writes. The design, analysis, and test/evaluation of this circuit under all conditions is critical for maintaining the integrity of the non-volatile memory's contents. Consider circuit operation if the power is shut down during a write cycle, whether planned or unexpected; the design should ensure the proper completion of write cycles so that the contents of the non-volatile memory are protected. The write cycle often includes not only the time for the bus operation to complete, but also the time for writing internal to the part, which can take on the order of 10 ms. Another related consideration is the unexpected application of a system reset signal. Shutdown states should be entered to help ensure that write cycles are fully completed and properly shut down, with the critical signals safed.
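As a rough illustration of the write-completion concern above, the following Python sketch estimates how long a bulk capacitor can hold a supply rail above a minimum voltage and compares that with the total write time (bus operation plus internal write). The constant-current droop model, the function names, the margin factor, and all numeric values are assumptions made for illustration only. A first-order estimate like this does not replace analysis and test with the real power supply.

```python
def holdup_time_s(capacitance_f: float, v_start: float, v_min: float,
                  load_current_a: float) -> float:
    """Time for a capacitor to droop from v_start to v_min.

    Assumes a constant-current load (hypothetical simplification):
    C * dV/dt = -I, so t = C * (v_start - v_min) / I.
    """
    if load_current_a <= 0:
        raise ValueError("load current must be positive")
    if v_start <= v_min:
        return 0.0
    return capacitance_f * (v_start - v_min) / load_current_a


def write_can_complete(holdup_s: float, bus_time_s: float,
                       internal_write_s: float, margin: float = 2.0) -> bool:
    """True if the hold-up time covers the whole write cycle with margin.

    The margin factor is an illustrative project choice, not a standard value.
    """
    return holdup_s >= margin * (bus_time_s + internal_write_s)


if __name__ == "__main__":
    # Illustrative numbers only: 1000 uF, rail drooping from 5.0 V to 4.5 V at 20 mA.
    t_hold = holdup_time_s(1000e-6, 5.0, 4.5, 0.02)
    print(t_hold, write_can_complete(t_hold, 1e-6, 10e-3))
```

The point of the sketch is that the internal write time (on the order of 10 ms), not the bus cycle, dominates the requirement.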
C. Analysis of Damage During Write Cycles: The technology of the non-volatile memory must be carefully considered if the memory is to be written in flight. Some of these devices, such as EEPROMs, use high voltage to write the cell. If struck by a heavy ion with high voltage applied, the result can be a hard fault. Thus, writing should be done with caution and the technology used for storage chosen wisely.
D. Cycle Count (i.e., # of write cycles for EEPROM, all cycles for FRAM, etc.): Many non-volatile erasable memories have a limited number of cycles. There is no hard and fast rule, with the number of cycles ranging from 10^4 to 10^5 or higher. Each device must be treated on a case-by-case basis with system lifetime and radiation factored in. There are some subtle specifications that will be noted here, as examples. The popular 128k x 8 Hitachi die, for example, has a lifetime write specification limit of 10^3 cycles in byte mode and 10^4 cycles in page mode. The write mechanism for this device utilizes an 8-byte subpage as the smallest unit that can be written. Hence, writing the same memory space one byte at a time is more stressful than page writes, since entire subpages must first be fetched and then re-written. Another subtlety is the operation of some FeRAMs (ferroelectric RAMs). In these devices, read cycles operate in destructive readout mode (DRO) and an internal write cycle is executed after every read. Hence, the number of read cycles must also be managed in addition to write cycles, since each read access generates a subsequent write cycle.
E. Transients and Noise: It is critical that the signals interfacing with non-volatile memories be clean, that system noise be kept to a minimum, and that all specifications always be met. In this case, “signals” includes not only logic signals but also power and ground connections; robust bypassing should be used. Noise glitches on EEPROMs, for example, can cause false write cycles to be generated, resulting in inadvertent altering of the device's contents. Illegal timing to a non-volatile memory, even with the write signal not asserted, can result in the corruption of the memory's contents.
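The cycle accounting described above can be sketched in Python. This is a minimal illustration, not a flight procedure: the class and function names, the derating factor, and the example limits are hypothetical, and real limits must come from the datasheet of the specific part.

```python
from dataclasses import dataclass


def rewrites_per_subpage(subpage_bytes: int, bytes_per_write: int) -> int:
    """How many times one subpage is rewritten when it is filled
    bytes_per_write bytes at a time (ceiling division)."""
    return -(-subpage_bytes // bytes_per_write)


def feram_cycles(reads: int, writes: int) -> int:
    """Destructive readout: each read is followed by an internal write-back,
    so reads and writes both count against the cycle limit."""
    return reads + writes


@dataclass
class EnduranceBudget:
    """Tracks consumed cycles against a derated limit (values are assumptions)."""
    rated_cycles: int        # datasheet limit for the write mode in use
    derating: float = 0.5    # hypothetical project margin, not a datasheet value
    used: int = 0

    @property
    def allowed(self) -> int:
        return int(self.rated_cycles * self.derating)

    @property
    def remaining(self) -> int:
        return self.allowed - self.used

    def consume(self, cycles: int = 1) -> None:
        """Record cycles; refuse to go past the derated limit."""
        if self.used + cycles > self.allowed:
            raise RuntimeError("endurance budget exceeded")
        self.used += cycles


if __name__ == "__main__":
    # Filling an 8-byte subpage one byte at a time rewrites it 8 times;
    # writing it in one operation rewrites it once.
    print(rewrites_per_subpage(8, 1), rewrites_per_subpage(8, 8))
    budget = EnduranceBudget(rated_cycles=10_000)
    budget.consume(rewrites_per_subpage(8, 1))
    print(budget.remaining)
```

Keeping the budget in units of subpage rewrites, rather than bus writes, is what exposes the extra stress of byte-mode writes.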
F. Reliability, Refreshing, and Reloading:
The required reliability of the non-volatile, erasable memory device is highly dependent on its application.
If the device operates as part of a large memory array, then some bit failures and even page failures can be tolerated either by error correction techniques or by error detection and mapping the failed segment out of service.
Applications such as boot ROM for a central processing unit or memory contents for an FPGA require perfect system performance. For single bit failures a Hamming code may suffice, although that may be awkward for serial PROMs. Note that some failure modes of non-volatile memory devices may result in a bit oscillating or not providing a valid logic level; in this case, an EDAC device may or may not correct the single bit error, depending on the logic design of the EDAC device being used and whether or not it is static hazard free. In any event, the devices employed, combined with the architecture of the particular system, must ensure that there are no lockup states from any credible failures. Credible failures include any single bit error and an inadvertent corruption of a non-permanent memory's contents.
Other forms of redundancy may be required, such as TMR with switchable spares. Some options include the ability to switch in alternate devices, the use of permanent memory such as PROM, or the use of storage buffers to replace erasable non-volatile memory functions, using operational overhead to manage the risk. For example, if a configuration memory device for an FPGA fails, a storage buffer and CPU may configure the FPGA using a different loading mode, assuming, of course, that the FPGA isn't needed to run the computer. In general, for critical applications, permanent memories such as PROM are to be used to ensure that the spacecraft or other system cannot be permanently lost. This can take the form of boot and safe-hold code for a processor or a basic operating configuration for an FPGA.
Another consideration is the guaranteed storage time of the device vs. mission length. There is no hard and fast rule and each device must be analyzed on a case-by-case basis. Ten years is a frequent specification for the retention of memory contents. However, system lifetimes of several decades are not uncommon. Refreshing can be risky and the usefulness of it should be verified with the manufacturer's assistance, to ensure a guarantee of storage integrity, particularly in the radiation environment. Obviously, when the device is refreshed, it may be susceptible to damage in the space environment by heavy ions, as noted above. Other errors can occur, damaging the contents, such as a computer crash, brown out, or the unexpected removal of power due to a bus fault or a spacecraft entering a safe mode. Also, each write cycle takes away from the operational lifetime of the component.
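To make the single-bit correction mentioned above concrete, here is a minimal Python sketch of a Hamming(7,4) code, chosen only because it is the smallest standard example; it is not the code used by any particular EDAC device, and the function names are hypothetical. It corrects one static bit error per 7-bit codeword. It does not address an oscillating or invalid-level bit, and a double-bit error would be miscorrected.

```python
from typing import Tuple


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword.

    Bit i of the result is codeword position i+1.
    Positions 1, 2, 4 hold parity; positions 3, 5, 6, 7 hold data.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # even parity over positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # even parity over positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # even parity over positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(code: int) -> Tuple[int, int]:
    """Return (nibble, syndrome). Syndrome 0 means no error was seen;
    otherwise it is the position (1..7) of the bit that was flipped back."""
    syndrome = 0
    for pos in range(1, 8):
        if (code >> (pos - 1)) & 1:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        code ^= 1 << (syndrome - 1)
    nibble = (((code >> 2) & 1)
              | (((code >> 4) & 1) << 1)
              | (((code >> 5) & 1) << 2)
              | (((code >> 6) & 1) << 3))
    return nibble, syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    corrupted = word ^ (1 << 4)  # flip the bit at position 5
    print(hamming74_decode(corrupted))
```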
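The voting half of TMR can be sketched as a bitwise majority over three copies of an image. The Python below is an illustrative model with hypothetical function names; in flight hardware the voter would normally be logic, and the spare-switching policy is left out. The second function shows one way a disagreeing copy could be flagged as a candidate for replacement by a spare.

```python
from typing import List


def tmr_vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise two-out-of-three majority vote across three copies."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("copies must be the same length")
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


def disagreeing_copies(a: bytes, b: bytes, c: bytes) -> List[int]:
    """Indices (0, 1, 2) of copies that differ from the voted result."""
    voted = tmr_vote(a, b, c)
    return [i for i, copy in enumerate((a, b, c)) if copy != voted]


if __name__ == "__main__":
    good = bytes([0x12, 0x34, 0xFF])
    bad = bytes([0x12, 0x35, 0xFF])  # one flipped bit in the second byte
    print(tmr_vote(good, bad, good) == good, disagreeing_copies(good, bad, good))
```

Voting masks a fault in any single copy, but it offers no protection when the same failure mechanism affects two copies.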
G. Recommendations and Tips
- Many designers use a simple RC timing circuit for the generation of a POR or “Power On Reset” signal. Looking closely at the acronym, it has the word “on” in it and the “O” does not stand for “Off.” Use of such a circuit will often protect memories during power up, but assertion of the protection signal will lag during a brown out or when power is removed.
- POR circuits are often best generated in the power supply module.
- Ensure that critical memory controls behave properly during power transient conditions. They are often incorrectly implemented by an FPGA that is not guaranteed to be under control during power-on, power-off, and periods when power is disrupted. FPGA and configuration memory device internal power-on reset circuits may be active along with initialization sequences, charge pumps must supply sufficient charge and voltage to turn on high-voltage isolation FETs, and so on.
- Erasable memory device protection is an analog function and digital components must be used with extreme care. Along with timing, many memory devices require non-standard voltage levels and currents for protection.
- Consider the likelihood of a software fault to be 100%.
- Device Protection: Many erasable devices implement “software write protection” to protect against inadvertent writes to the memory. JEDEC has published a standard on this type of protection. Do not keep the “keys” to unlock the memory on-board unless absolutely necessary.
- Subsystem Protection: System level write protection limits should be implemented in hardware, to protect against software faults. Some systems implement this in software, which is risky; see bullet #5 above. Use an external hardware discrete command as an additional barrier to prevent inadvertent writes.
- Analyze and test devices for lockup states. These can occur in many memory types from illegal loads into command registers, poor signal integrity, poor power quality, or an SEU. Some device lockup states require power cycling to clear. Lockup states in memory devices are often not considered either in memory controller designs (soft repairs) or system designs (power cycle required for clearing of faults).
- Critical switching between memory images for booting, when implemented as a software function, cannot be guaranteed to function under all credible faults, and a failure can result in system lockup. Use a discrete hardware signal to implement recovery from faults to prevent system lockups.
- Consider the likelihood of an EEPROM or flash device fault to be 100%. There are enough failures in the industry to justify such an approach.
- Boot and Safe-Hold Code: High-reliability, radiation-hardened, fixed memories should normally be employed for boot and safe-hold functions. For applications such as instruments, DMA functions, properly implemented, can load memories with boot code. In this case, the instrument should be safed by hardware logic. DMA functions should not require any operational software. A hardware discrete command to clamp a processor into reset is also recommended.
- “Refreshing” of critical code, such as boot code, that is stored in erasable memory should not be done to mitigate faulty devices. Instead, use reliable fixed memory technology.
- Verify Margins of All Protection Signals: DC voltage margin; AC voltage margins (e.g., cross talk); Timing (protection signals for power up, power down, and during glitches). The power down rate of voltage buses is often ignored or idealized.
- Third party device packaging houses: Verify that they fully understand the technology and the original manufacturer’s test procedures and screening criteria. Compare failure rates of third party houses with those reported by the original die manufacturer. Ensure that proper and complete testing for space missions is performed.
- Keeping multiple copies of the same code in the same technology is risky if the fundamental technology is not reliable. With the current rash of industry failures of EEPROM, for example, relying on multiple copies of the same device type, even with hardware selection, is a form of Russian Roulette. Storing redundant copies of code in separate blocks of one device can be subject to common mode failures.
- Treating bit, block, and device failures in software can be done in many instances, such as recorders. For critical boot code, however, repair of failures should not be treated as a software maintenance task that must be completed before the next reset; it should not be a function relegated to software. That would be a form of “foam logic.”
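The Device Protection and Subsystem Protection tips above can be illustrated with a small behavioral model in Python: a write reaches a cell only if an external hardware enable is asserted and a software unlock sequence has just been presented. The class name, the `hw_write_enable` attribute, and the one-write-per-unlock behavior are hypothetical modeling choices. The three address/data pairs follow a pattern seen in some parallel EEPROM datasheets and are an assumption here; the datasheet of the actual part governs.

```python
# Assumed unlock sequence (address, data); verify against the actual datasheet.
UNLOCK_SEQUENCE = [(0x5555, 0xAA), (0x2AAA, 0x55), (0x5555, 0xA0)]


class ProtectedEeprom:
    """Behavioral model only: hardware gate plus software unlock."""

    def __init__(self, size: int) -> None:
        self.cells = bytearray(size)
        self.hw_write_enable = False  # models an external hardware discrete
        self._progress = 0            # unlock steps matched so far
        self._unlocked = False

    def bus_write(self, addr: int, data: int) -> bool:
        """Present a write on the bus. Returns True only if a cell changed."""
        if not self.hw_write_enable:
            # The hardware barrier blocks everything, including unlock attempts.
            self._progress = 0
            self._unlocked = False
            return False
        if self._unlocked:
            self._unlocked = False    # in this model an unlock covers one write
            self.cells[addr] = data
            return True
        if (addr, data) == UNLOCK_SEQUENCE[self._progress]:
            self._progress += 1
            if self._progress == len(UNLOCK_SEQUENCE):
                self._progress = 0
                self._unlocked = True
        else:
            # A stray write restarts the sequence.
            self._progress = 1 if (addr, data) == UNLOCK_SEQUENCE[0] else 0
        return False


if __name__ == "__main__":
    dev = ProtectedEeprom(0x8000)
    dev.hw_write_enable = True
    for step in UNLOCK_SEQUENCE:
        dev.bus_write(*step)
    print(dev.bus_write(0x0010, 0x42), dev.cells[0x0010])
```

In this model, software that runs away and replays the unlock sequence still cannot alter the contents while the hardware enable is deasserted, which is the reason for keeping that barrier outside of software control.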
"Summary of Recent EEPROM Failures," OLD News #12, July 3, 2003.
"Maxwell EEPROM Bit and Page Failure Investigation Report," Y. Chen, June 3, 2003. e-mail for access
"EEPROM Bit and Page Failure Investigation," Yuan Chen, Rich Kemski, Duc Nguyen, Frank Stott, Ken Erickson, Leif Scheick, Richard Bennett, and Tien Nguyen, 2003 MAPLD International Conference, Washington, D.C., September 9-11, 2003.
Reliability Report: HN58C1001 Series CMOS 1M EEPROM
"EEPROM Evaluation and Reliability Analysis," Aerospace Report No. TOR-2000(3000)-01, June 28, 2000. e-mail for access.
"Usage of EEPROM in Digital Designs," Saab Ericsson Space, D-G-NOT-00385-SE, 2004
"Design of Memory Systems for Spaceborne Computers," 2004 MAPLD International Conference, Washington D.C., September 8-10, 2004.
"An Application Engineer's View," 2004 MAPLD International Conference, Washington D.C., September 8-10, 2004.
"Observations in Characterizing a Commercial MNOS EEPROM for Space," 2004 MAPLD International Conference, Washington D.C., September 8-10, 2004.
"Maintaining Data Integrity in EEPROM’s," 2004 MAPLD International Conference, Washington D.C., September 8-10, 2004.
TOP LEVEL: "Design Guidelines and Criteria for Space Flight Digital Electronics"
NASA Office of Logic Design
Last Revised: February 03, 2010
Digital Engineering Institute
Web Grunt: Richard Katz