NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.


OLD News #14

Testing and Application of Modern Microelectronic Devices:

Do's, Don'ts, and Failures

Update: March 26, 2004: Added reference 18, NASA Advisory.

Update: February 27, 2004: Added references 16 and 17.

Update: November 24, 2003: A NASA Advisory has been written based on OLD News #14 and is currently in the release cycle.

Date: November 19, 2003

This is the fourteenth in a series of OLD News articles.

Introduction

Modern microelectronics for space are rapidly progressing to ever smaller feature sizes and operating voltages: feature sizes are now 0.25 µm and below, with operating voltages of 2.5 V and below.  These small, low-voltage transistors give the designer higher performance and lower power; they also present engineers with challenges for the safe handling and operation of the devices.  OLD News #11 discussed ESD and interface components.  This note focuses on the challenges presented by modern FPGAs, discusses some recent failures both on boards and in upscreening tests, and gives "do's and don'ts" for the testing and application of these components.

 

Discussion

The reliability of modern VLSI components, based on published reliability data, exceeds the reliability of smaller and far simpler SSI, MSI, and LSI parts that were considered "hi-rel" a generation ago. A fundamental question arises: What testing, if any, should the user perform following receipt and programming? Will such testing improve system reliability by "screening out" defective devices that would escape board-level testing or will it decrease system reliability by overstressing the devices through the extra handling and testing in various test fixtures? Even though some improvement can theoretically be attained through additional tests and screens, it must be balanced against the potentially substantial decrease in reliability introduced through mishandling and faulty testing techniques and methods.

Field Reliability: In general, FPGAs in the field have been highly reliable in commercial, military, and aerospace applications.  Vendors of the devices used for space applications have shipped millions of parts, providing a good quantity of real-world data.  Recently, however, an aerospace contractor has reported clusters of unexplained failures which cannot be ignored.  Another contractor reported a cluster of failures during "upscreening" tests.  While the manufacturer's analysis concluded that these devices failed as a result of electrical overstress, the contractors dispute these claims and, regrettably, do not permit the necessary access to the test environments, data, analysis, or reports, thus excluding the possibility of an independent review.

The Reliability Objectives: In designing a testing and screening program, the driving issue is the reliability goal for a program in general and the FPGAs in particular.  What increase in reliability of a device as a result of the proposed additional testing is needed and expected?  Will the proposed testing and screening regime demonstrate the increased level of reliability and is the regime well designed?  Surprisingly, advocates of extensive third-party testing do not have quantitative or analytical answers for these questions.  The upscreening is "justified" by a "that's what we have always done" or a "we want to make it better" argument and is not based on a need for improvement or a defensible engineering analysis.
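
The questions above can be put in rough quantitative terms.  The following is a minimal sketch of one way to do so, not a method taken from the references.  It assumes a deliberately simple model: each device arrives latently defective with probability p_defect, the extra screen rejects a defective device with probability coverage, and the extra handling and testing leaves an otherwise good device damaged, but still passing, with probability p_damage.  The function names and the numbers in the usage lines are hypothetical.

```python
def escape_rate_without_screen(p_defect: float) -> float:
    """Fraction of installed devices that are bad when no extra screen is run."""
    return p_defect


def escape_rate_with_screen(p_defect: float, coverage: float, p_damage: float) -> float:
    """Fraction of devices that passed the screen but are bad.

    Assumed model (hypothetical):
      - a defective device passes the screen with probability (1 - coverage)
      - a good device always passes, but leaves the screen with latent
        damage with probability p_damage
    """
    defective_and_passed = p_defect * (1.0 - coverage)
    good_and_passed = 1.0 - p_defect
    bad_and_passed = defective_and_passed + good_and_passed * p_damage
    return bad_and_passed / (defective_and_passed + good_and_passed)


def breakeven_damage(p_defect: float, coverage: float) -> float:
    """Largest p_damage for which the screen does not make things worse.

    Setting escape_rate_with_screen equal to p_defect and solving for
    p_damage gives p_defect * coverage under the model above.
    """
    return p_defect * coverage


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the trade-off.
    p_defect, coverage = 0.001, 0.9
    print("no screen:          ", escape_rate_without_screen(p_defect))
    print("careful screen:     ", escape_rate_with_screen(p_defect, coverage, 0.0001))
    print("unsafe screen:      ", escape_rate_with_screen(p_defect, coverage, 0.005))
    print("break-even p_damage:", breakeven_damage(p_defect, coverage))
```

Under this model the screen only helps when p_damage is below p_defect times coverage.  The more reliable the incoming parts are, the smaller the damage rate a test facility can be allowed before the extra testing does more harm than good.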

The Vendor Position: Two vendors of aerospace FPGAs, Actel and Xilinx, both state unequivocally that upscreening or any other extra testing performed by a user or a third party, such as testing on a VLSI tester or in a burn-in chamber, voids their warranty.  They base this position on the complexity of the devices, the required knowledge of the device's internal circuits and architecture, the sensitivity of the devices to inappropriate testing and/or screening, their experience gained over the years in developing test programs and methods, and analysis of third-party testing.  Based on our examination of third-party facilities, techniques, and procedures, as documented below, this policy is well justified.

Examination of Test Facilities, Procedures, and Techniques: Over the past several years, test methods, equipment, and procedures have been examined at various third-party test facilities.  Not a single facility was able to test the part to flight standards; that is, had the test equipment been subject to a normal design review, it would have been rejected as unsafe for the device under test.  The personnel designing the test, writing the test procedures, and running the test facilities were not able to answer fundamental questions with regard to either the devices they were testing or the application of the electrical test equipment they were using, thus representing an unknown level of risk to the safety of the flight devices.

Examples of issues found during examination of test facilities, procedures, and techniques include:

 

Conclusion and Recommendations

Several clusters of failures of FPGAs at aerospace contractors have recently been reported.   While the manufacturer's analysis concluded that the failures were a result of electrical overstress, the contractors dispute this but will not permit a complete independent review of the data, analysis, test environment, and reports.  The limited data available shows significant problems including an unsafe environment as well as basic design and test errors that potentially compromise the integrity of flight hardware.  Modern devices should be handled carefully in a controlled environment and ESD rules should be followed religiously.  Additionally, shorting plugs on power lines and installing capacitors prior to microcircuit installation can help eliminate problems.

All failures, from all phases of test, should be reported and diagnosed, with the NASA Office of Logic Design providing a resource for this purpose.  This will permit full data sets and trends to be properly analyzed.

Flight electronics designs go through a series of reviews and qualification testing; hardware that is used to test modern flight microelectronics must meet the same design standards.  Any handling or testing of flight microelectronics must be justified, performed to flight standards, and meet all manufacturers' specifications and well-established good engineering practices.  If these conditions have not been met and the testing and/or screening have not been proven safe, then the devices should not be considered acceptable for flight.  As the examination of a number of test facilities has shown, these standards are often violated, presenting a "clear and present danger" to the integrity of the flight hardware.


References

  1. "Post Programming Burn In (PPBI) for RT54SX-S AND A54SX-A Actel FPGAs," Minal Sawant, Dan Elftmann, John McCollum, Werner van den Abeelen, Solomon Wolday and Jonathan Alexander, 2002 MAPLD International Conference, September 9-11, 2002, Laurel, MD.
  2. "OLD News #11: Interface Components and ESD," May 28, 2003.
  3. "How Burn-in can Reduce Quality and Reliability," J. Jordan, M. Pecht, and J. Fink, The International Journal of Microcircuits and Electronic Packaging, Vol. 20, No. 1, pp. 36-40, First Quarter, 1997.
  4. "A Physics-of-Failure Approach to IC Burn-In," M. Pecht and P. Lall, Proceedings 1992 Joint ASME/JSME Conference on Electronic Packaging: Advances in Electronic Packaging, April 9-12, 1992; also 21st Joint Hybrid Microelectronics Symposium, ISHM, Cherry Hill, NJ, May 27-28, 1992.
  5. "Reliability Report," Actel Corporation, Q2 CY2003, August 11, 2003.
  6. "Xilinx Reliability Report," January, 2002.
  7. Influence of Temperature on Microelectronics and System Reliability, Chapter 6, "A Physics-of-Failure Approach to IC Burn-In," Pradeep Lall, Michael G. Pecht, and Edward B. Hakim.
  8. "Actel Corporation COTS and Up-Screening Policy," Dr. Esmat Hamdy, Senior Vice President, Technology and Operations, Actel Corporation.
  9. Xilinx Upscreening Policy, Joseph J. Fabula, Director, Quality Assurance.
  10. "Summary of October 8, 2003 Meeting on Actel FPGA Failures," R. Katz, M. Fraeman, and J. Boldt to S. Scott (NASA) and E. Hoffman (JHU/APL).
  11. "Reliability," from Advanced Design: Designing for Reliability, presented at the 2001 MAPLD International Conference, Laurel, MD, September 10, 2001.
  12. "Conducting Filament of the Programmed Metal Electrode Amorphous Silicon Antifuse," R. Wong and K. Gordon, International Electron Devices Meeting, December 1993.
  13. "On-State Reliability of Amorphous Silicon Antifuses," G. Zhang, Y. King, S. Elfoukhy, E. Hamdy, T. Jing, P. Yu, and C. Hu, Electron Devices Meeting, 1995, Washington, DC, pp. 551-554.
  14. "Characterization and Modeling of a Highly Reliable Metal-to-Metal Antifuse for High-Performance and High-Density Field Programmable Gate Arrays,"
  15. "Time Dependent Reliability of the Programmed Metal Electrode Antifuse," R. Wong, K. Gordon, and A. Chan, International Reliability and Physics Symposium, April 1996.
  16. "The First Summary Report on the Independent Review of SX-S FPGA Reliability on NASA Space Flight Missions," February 11, 2004.
  17. "Brief Notes on Recent FPGA Failures," January 2004.
  18. "NASA Advisory: Actel RTSX-S and SX-A Programmed Antifuses," March 26, 2004.

Notes and Additional Recommendations

Actel Reliability Summary:

Xilinx Reliability Summary (FITs): [Xilinx 2002]
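
For readers less familiar with the unit: one FIT is one failure per 10^9 device-hours.  The sketch below shows how a FIT figure translates into a mean time between failures and into the chance that a set of devices completes a mission without a failure.  It assumes a constant failure rate (an exponential model), and the FIT value, device count, and mission length in the usage lines are hypothetical and are not taken from the Xilinx report.

```python
import math

HOURS_PER_YEAR = 8766.0  # 365.25 days x 24 hours


def failures_per_hour(fit: float) -> float:
    """Convert FIT (failures per 1e9 device-hours) to failures per device-hour."""
    return fit * 1e-9


def mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a constant failure rate."""
    return 1e9 / fit


def prob_no_failures(fit: float, n_devices: int, mission_years: float) -> float:
    """Probability that none of n_devices fails during the mission.

    Assumes independent devices with a constant failure rate, so the
    survival probability is exp(-rate * total device-hours).
    """
    device_hours = n_devices * mission_years * HOURS_PER_YEAR
    return math.exp(-failures_per_hour(fit) * device_hours)


if __name__ == "__main__":
    # Hypothetical example: 20 devices at 10 FIT each on a 5-year mission.
    print("MTBF (hours):      ", mtbf_hours(10.0))
    print("P(no failures):    ", prob_no_failures(10.0, 20, 5.0))
```

The constant-rate assumption ignores infant mortality and wear-out, so the result is only a first-order estimate of what a published FIT number implies for a given mission.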

As flight electronics designers we should remember to: [Design Guidelines and Criteria]

heater_spike.gif: Spikes on VCCA and VCCI at burn-in when the temperature chamber's heater turns on.
eos_example.jpg: Damage to input pin due to electrical overstress (EOS) during post-programming burn-in at a third-party test house.
temp_dependence.jpg: The accelerated disturb failure distribution for 25 °C and 250 °C at the same peak ac stress current density.  Peak ac current density was 64 MA/cm².  This is from [Wong 96] and they note a lack of ambient temperature dependence.  Similarly, from [Shih 97] "The experimental time-to-fail data are insensitive to the ambient temperature in high read currents.  This is because the change in ambient temperature is insignificant relative to the peak temperature at high read currents."

In my new OLD (Office of Logic Design) position, I am now making some of my informal e-mail lists semi-formal. These mailings will have pointers to technical tips that can [hopefully] proactively prevent errors from getting into flight designs or make things go faster and smoother. I have included an array of people from a number of organizations (different NASA Centers, ESA, etc.), as you all may distribute to people in your own organizations and other colleagues. Please let me know if you are on this list in error or if someone should be added to it. This list is targeted towards those who will either design or review space flight digital electronics. Feel free to suggest topics for discussion and research or to contribute news items.  [Note for this web-based release: to become a recipient on this mailing list, please send e-mail to: richard.b.katz@nasa.gov.]

All application notes are uploaded onto my www site. New additions are noted on the what's new page. I will give these mailings from time to time; too much and they will be filtered and ignored - too little and not enough information flows. So I'll try and hit a good balance.

whats_new.htm

Best regards,

-- rk


Last Revised: February 03, 2010
Digital Engineering Institute
Web Grunt: Richard Katz