MER Spirit Flash Memory Anomaly (2004)
Abstract:
Shortly after the commencement of science activities on Mars, an MER rover lost the ability to execute any task that requested memory from the flight computer. The cause was incorrect configuration parameters in two operating system software modules that control the storage of files in system memory and flash memory. Seven recommendations cover enforcing design guidelines for COTS software, verifying assumptions about software behavior, maintaining a list of lower priority action items, testing flight software internal functions, creating a comprehensive suite of tests and automated analysis tools, providing downlinked data on system resources, and avoiding the problematic file system and complex directory structure.
Description of Driving Event:
Shortly after the commencement of science activities on Mars, the “Spirit” rover lost the ability to execute any task that requested memory from the flight computer. The rover operated in a degraded mode until 15 days later, when normal operations were restored and science activities resumed.
The root cause of the failure was traced to incorrect configuration parameters in two operating system software modules that control the storage of files in system memory (heap) and flash memory. A parameter in the dosFsLib module permitted the unlimited consumption of system memory as the flash memory space was exhausted. A parameter in the memPartLib module was incorrectly set to suspend the execution of any task employing memory when no additional memory was available. Task suspension forces a reset of the flight computer, and it is never supposed to occur.
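The interaction of the two parameters can be illustrated with a toy model. The Python sketch below is not VxWorks code and does not reproduce the dosFsLib or memPartLib interfaces. The class `ToyHeap`, its `on_exhaustion` policy, the `limit` argument, and all byte counts are hypothetical. It assumes only what the description above states: file storage that may consume memory without limit, and an allocator configured either to suspend the requesting task or to report an error when no memory is left.

```python
class TaskSuspended(Exception):
    """Stands in for a task being suspended; in the anomaly this forced a reset."""


class ToyHeap:
    """Hypothetical fixed-size heap with a configurable exhaustion policy."""

    def __init__(self, size, on_exhaustion="suspend"):
        if on_exhaustion not in ("suspend", "error"):
            raise ValueError("unknown policy")
        self.size = size
        self.used = 0
        self.on_exhaustion = on_exhaustion

    def alloc(self, nbytes):
        """Return True on success; on exhaustion follow the configured policy."""
        if self.used + nbytes > self.size:
            if self.on_exhaustion == "suspend":
                raise TaskSuspended(f"no memory for {nbytes} bytes")
            return False  # "error" policy: the caller can handle the failure
        self.used += nbytes
        return True


def grow_file_cache(heap, n_files, bytes_per_file, limit=None):
    """Allocate bookkeeping memory per file and return how many files fit.

    `limit` caps the total bytes the cache may take (None means unlimited,
    which is the assumed analogue of the misconfigured parameter).
    """
    cached = 0
    for i in range(n_files):
        if limit is not None and cached + bytes_per_file > limit:
            return i  # cap reached: stop consuming heap
        if not heap.alloc(bytes_per_file):
            return i  # heap exhausted under the "error" policy
        cached += bytes_per_file
    return n_files
```

With a 1000-byte `ToyHeap` and 100 bytes per file, an unlimited cache under the "suspend" policy raises `TaskSuspended` on the eleventh file, whereas passing `limit=500` stops after five files and leaves half the heap free. Either setting alone would have avoided the failure in this toy model; the actual flight configuration is described only in the referenced anomaly report.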
The initial reset was triggered by the creation of a large number of files associated with MER instrument calibration, which overburdened first flash memory and then system memory. The reset did not clear flash memory, which is non-volatile by design. Although the reset did delete the files in system memory, the total size of the file system structure is determined not by the number of files that currently exist but by the maximum number of files that has ever existed, so the structure did not shrink. Because the initial reset therefore relieved neither memory, a cycle of repetitive computer resets and flight software re-initializations ensued.
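The reset cycle described above can be sketched as a small simulation. This is a hypothetical model, not the flight file system: `ToyFileSystem`, `ENTRY_BYTES`, and `boot_cycles` are invented names, and the assumption that mounting needs RAM proportional to the high-water file count is a simplification of the behavior described in this paragraph.

```python
class ToyFileSystem:
    """Hypothetical model in which the directory table never shrinks."""

    ENTRY_BYTES = 32  # assumed size of one directory entry, for illustration only

    def __init__(self):
        self.files = set()
        self.high_water = 0  # largest number of files that has ever existed

    def create(self, name):
        self.files.add(name)
        self.high_water = max(self.high_water, len(self.files))

    def delete(self, name):
        # Removes the file but does not reduce the table size.
        self.files.discard(name)

    def table_bytes(self):
        """RAM needed to mount: set by the high-water mark, not the live count."""
        return self.high_water * self.ENTRY_BYTES


def boot_cycles(fs, heap_bytes, max_cycles=5):
    """Count consecutive resets, up to max_cycles.

    Each boot tries to mount the file system and resets if the table
    does not fit in the available heap.
    """
    resets = 0
    for _ in range(max_cycles):
        if fs.table_bytes() <= heap_bytes:
            return resets  # mount succeeded
        resets += 1  # a reset changes nothing persistent, so the next boot fails too
    return resets
```

In this model, creating 100 files and then deleting 90 of them still leaves `table_bytes()` at 3200, so `boot_cycles(fs, 1000)` keeps resetting until `max_cycles` is reached. Only a fresh `ToyFileSystem` brings the table size back down, which is consistent with the recovery step of commanding the rover to create a new file system.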
The effects of overburdened flash and system memory were neither recognized nor tested during system-level ground testing.
Mission Operations recovered the mission by manually reallocating system memory, deleting unnecessary directories and files, and commanding the rover to create a new file system. Because revision of flight software was considered too risky, operational changes were implemented for both MER vehicles to improve oversight of rover file management.
Lesson(s) Learned:
A severely compressed flight software development schedule may prevent the development team from fully understanding software functions. During the MER software development process, activities and focus were continuously reprioritized. One consequence was that only the highest priority flight software issues and problems could be addressed, and memory management problems were viewed as low risk.
Recommendation(s):
- Enforce the project-specific design guidelines for COTS software, as well as for NASA-developed software. Assure that the flight software development team reviews the basic logic and functions of commercial off-the-shelf (COTS) software, with briefings and participation by the vendor.
- Verify assumptions regarding the expected behavior of software modules. Do not use a module without detailed peer review, and assure that all design and test issues are addressed.
- Where the software development schedule forestalls completion of lower priority action items, maintain a list of incomplete items that require resolution before final configuration of the flight software.
- Place high priority on completing tests to verify the execution of flight software internal functions.
- Early in the software development process, create a comprehensive suite of tests and automated analysis tools. Ensure that the suite reports flight computer resource usage.
- Ensure that the flight software downlinks data on system resources (such as the free system memory) so that the actual and expected behavior of the system can be compared.
- For future missions, implement a more robust version of the dosFsLib module, and/or use a different type of file system and a less complex directory structure.
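The recommendation to downlink system resource data so that actual and expected behavior can be compared could be supported by a ground-side check along the following lines. This is a minimal sketch under stated assumptions: the function name, the `(sol, free_bytes)` sample format, the linear expectation model, and every number are made up for illustration and are not MER telemetry.

```python
def flag_memory_deviations(samples, expected_free, tolerance):
    """Return (sol, actual, expected) for each sample whose free memory
    falls below the expected value by more than `tolerance` bytes."""
    flagged = []
    for sol, actual_free in samples:
        expected = expected_free(sol)
        if expected - actual_free > tolerance:
            flagged.append((sol, actual_free, expected))
    return flagged


# Illustrative values only (not real telemetry): free memory in bytes by sol.
samples = [(1, 9800), (5, 9500), (10, 8000), (15, 5000)]


def expected_model(sol):
    """Assumed expectation: free memory declines by 50 bytes per sol."""
    return 10000 - 50 * sol


alerts = flag_memory_deviations(samples, expected_model, tolerance=500)
# alerts == [(10, 8000, 9500), (15, 5000, 9250)]
```

A check of this kind flags a downward trend several samples before memory is exhausted, which is the purpose of comparing downlinked resource data against an expectation.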
Documents Related to Lesson and References:
- JPL Incident Surprise Anomaly Report (ISA) No. Z83174, January 29, 2004.
- Glenn Reeves, Tracy Neilson & Todd Litwin, “Mars Exploration Rover Spirit Vehicle Anomaly Report,” Jet Propulsion Laboratory Document No. D-22919, May 12, 2004.
- Mars Exploration Rover Project Library, Collections 13788 and 13664.
- “Mars Exploration Rovers and the Spirit SOL-18 Anomaly: NASA IV&V Involvement,” Ken Costello, NASA Independent Verification and Validation (IV&V) Facility, 2004 MAPLD International Conference, September 8-10, 2004, Washington, D.C.
- “Design, Verification/Validation and Operations Principles for Flight Systems,” Rev. 2, Jet Propulsion Laboratory Document No. D-17868, Section 4.11: Flight Software System Design, March 3, 2003.
Lesson Info:
- Lesson Number: 1483, August 23, 2004
- Submitted by: Mark Boyles / David Oberhettinger, JPL
- Source: NASA Public Lessons Learned System (PLLS) Database: http://llis.nasa.gov/llis/cgi-plls/plls_lesson?num=1483&kw=1483
Last Revised: January 19, 2009
