Public Lessons Learned Entry: 2041
Adapted from: http://www.nasa.gov/offices/oce/llis/imported_content/lesson_2041.html
A few months into its mission, MRO began experiencing unexpected side swaps to the redundant flight computer that placed the spacecraft into safe mode. The problem was traced to subtle inconsistencies between the MRO design implementation of an ASIC device and a known limitation of that device. Users of the RAD750 spaceflight computer should assure that the "PPCI Erratum 24" ASIC defect cannot cause excessive accumulation of uncorrectable SDRAM memory errors, and that the system architecture has robust error recovery capabilities.
Description of Driving Event:
Mars Reconnaissance Orbiter (MRO) was launched in August 2005 with a mission to study the Martian climate, identify water-related landforms and aqueous deposits, characterize potential landing sites for Mars landers, and provide UHF relay for science data produced by these future missions. The MRO spacecraft is furnished with two redundant onboard computers (i.e., two Command & Data Handling Subsystems, or C&DHs), referred to as Side A and Side B, that share continuously updated state and sensor data. One computer remains active, while the second serves as a "cold backup" that can boot in tens of seconds.
In March 2007, four months after MRO began the science phase of its mission, telemetry alerted the operations team at the NASA/Caltech Jet Propulsion Laboratory (JPL) to two successive timeouts of the spacecraft's heartbeat watchdog timer (Reference (1)). The first timeout prompted onboard fault protection (FP) software to order a warm reset of Side A. The second timeout triggered an autonomous switch, or "side swap," to the Side B computer. After Side B booted, FP autonomously configured the vehicle into safe mode. The incident prompted an intensive investigation, which failed to determine the root cause or to rule out a permanent failure of Side A.
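The escalation described above (first heartbeat timeout leads to a warm reset, second leads to a side swap followed by safe mode) can be sketched as a small state machine. This is only an illustrative model, not MRO flight software: the class name `FaultProtection`, its fields, and the choice to clear the timeout counter after a swap are all assumptions made for the sketch. How the real FP logic counts or clears timeouts is not stated in this lesson.

```python
from dataclasses import dataclass, field
from enum import Enum


class Side(Enum):
    A = "A"
    B = "B"


@dataclass
class FaultProtection:
    """Toy model of the two-step watchdog response (hypothetical names)."""
    active: Side = Side.A
    timeouts_on_active: int = 0  # heartbeat timeouts seen since the last swap
    safe_mode: bool = False
    log: list = field(default_factory=list)

    def on_heartbeat_timeout(self) -> str:
        """First timeout: warm reset. Second timeout: side swap, then safe mode."""
        self.timeouts_on_active += 1
        if self.timeouts_on_active == 1:
            action = f"warm reset of Side {self.active.value}"
        else:
            # Swap to the redundant computer and restart the count
            # (restarting the count here is an assumption of this sketch).
            self.active = Side.B if self.active is Side.A else Side.A
            self.timeouts_on_active = 0
            self.safe_mode = True
            action = f"side swap to Side {self.active.value}, enter safe mode"
        self.log.append(action)
        return action


if __name__ == "__main__":
    fp = FaultProtection()
    print(fp.on_heartbeat_timeout())  # warm reset of Side A
    print(fp.on_heartbeat_timeout())  # side swap to Side B, enter safe mode
```

In this model, two timeouts starting from Side A reproduce the March 2007 sequence, and two further timeouts would swap back to Side A, as happened eleven months later.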
Eleven months later, MRO performed another unrequested warm reset followed by an unrequested side swap, this time back to Side A of the C&DH (Reference (2)). Because Side A was now functioning properly, it was clear to JPL investigators that the fault that had caused the first swap had been cleared by the power cycling of Side A, which allowed them to rule out a permanent hardware failure. JPL therefore re-opened the investigation. In the course of that investigation, they revisited information on a defect ("PPCI Erratum 24") in the Power Peripheral Component Interconnect (PPCI) bridge Application-Specific Integrated Circuit (ASIC) in the RAD750 Spaceflight Computer (SFC) that was first reported in 2006 by the RAD750 vendor (Reference (2)). Under very specific conditions, this ASIC defect can cause the memory controller (Figure 1) to halt operations, resulting nominally in a warm reset of the computer that clears the condition.
Figure 1. Block diagram of the RAD750 SFC with the memory controller highlighted
The reported defect had raised little concern at JPL in 2006 because the event was rare and was believed to result merely in a warm reboot of the computer. However, the MRO project did not fully understand the low-level details of RAD750 operation and its interaction with the MRO system design configuration. Specifically, ...
The remainder of this paragraph describes the failure mechanism experienced by the MRO project specific to its design implementation of the RAD750 Spaceflight Computer. The text has been redacted for International Traffic in Arms Regulations (ITAR) compliance. "U.S. Persons" may obtain a copy of the complete lesson learned by contacting the JPL Office of the Chief Engineer (David Oberhettinger at firstname.lastname@example.org).
Unintended C&DH side swaps and spacecraft placement into safe mode may interrupt telemetry downlink and, under some circumstances, threaten the mission. In October 2008, the MRO project implemented a vendor-recommended workaround that should prevent further MRO side swap incidents: commanding a change to a parameter and a setting within the PPCI bridge ASIC.
References:
(1) "MRO Side Swap to Side B," JPL Incident Surprise Anomaly (ISA) No. Z90507, March 14, 2007.
(2) PPCI Bridge ASIC Master Errata List, BAE Document # A13917 Revision (-) Version 1.3, Errata List for PPCI ASIC P/N 244A907 (Bridge Chip).
(3) "Excess Latency in SPS Safe Mode Predicts Delivery," JPL Incident Surprise Anomaly (ISA) No. Z90508, March 14, 2007.
(4) "Final Report on the Mars Reconnaissance Orbiter C&DH Side Swap #1 and #2 Anomalies," JPL Document No. D-37650 (MRO Report No. MRO-36-747), October 16, 2008.
Lesson Learned:
The "Erratum 24" defect in the PPCI bridge ASIC represents a subtle failure mechanism for spacecraft employing a RAD750 SFC architecture that can be overcome by an operational workaround, but is best prevented through flight system design measures.
Recommendations:
The full text of these recommendations has also been redacted for ITAR compliance. "U.S. Persons" may obtain a copy of the complete lesson learned by contacting the JPL Office of the Chief Engineer (David Oberhettinger at email@example.com).
For all missions employing a RAD750 SFC architecture:
- The U.S. version of this recommendation calls for analyzing the proposed C&DH design to assure that it is not vulnerable to the Erratum 24 defect. (The Erratum 24 defect is a known issue, but the subtleties may not be apparent from the vendor-published data.)
- The U.S. version of this recommendation refers to the need for robust error checking.
- The U.S. version of this recommendation suggests robust design measures for data salvage.
- The U.S. version of this recommendation advocates a "clear-everything" capability for power on resets (PORs).
Evidence of Recurrence Control Effectiveness:
JPL has referenced this lesson learned as additional rationale and guidance supporting Paragraph 9.4.2 ("Flight System Flight Operations Design: Prime/Redundant Hardware Usage - Swapping to Redundant Hardware") in the JPL standard "Design, Verification/Validation and Operations Principles for Flight Systems (Design Principles)," JPL Document D-17868, Rev. 3, December 11, 2006.
- Lesson Number: 2041
- Lesson Date: 2008-12-16
- Submitting Organization: JPL
- Submitted by: David Oberhettinger
- Project: Mars Reconnaissance Orbiter
- Approval Date: 2009-05-06
Last Revised: February 03, 2010