Garman: "STS-1 Failure to Sync"

"STS-1 Failure to Sync"

John R. Garman
Technical Director
NASA Services Lockheed Martin Information Technology
Formerly NASA JSC and MSC

Abstract

The first orbital flight of a reusable spacecraft, NASA’s Shuttle spacecraft, was scheduled to launch on April 10, 1981. Instead, due to a “software glitch” in the onboard computers, it was delayed two days and did not launch until April 12. In fact, it wasn’t until after that launch that the “Bug Heard ‘Round the World” was fully understood.

The basis of the problem is technical, convoluted, and rooted deep in timing and in the various architectural approaches to handling time, processes, and scheduling within embedded flight systems. But in the end, it was “change” that kept the shuttle orbiter on the ground that day. A seemingly innocuous change, made about a year earlier in what appeared to be a totally unrelated part of the software system, introduced a “timing window” within which powering up the primary avionics system could cause some critical cyclic processing within that system to be phased one cycle later than expected. This in turn caused the backup flight software to see a “failure” in the primary and to “stop” processing that critical data (to prevent it from becoming polluted with possibly “bad data”) while awaiting switchover. Thus, while properly waiting on the pad for the crew to select it over the primary because of a detected critical failure, the backup refused to fully “sync up” with the primary, and the countdown had to be terminated.
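The phasing mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual flight software: the four-slot frame, the `Backup` class, and the `run` function are all hypothetical names and values chosen for illustration. The backup expects critical data in a fixed slot of each frame; if the primary's cyclic processing is phased one cycle late, the backup sees a mismatch, treats it as a primary failure, and stops taking the data.

```python
from dataclasses import dataclass

CYCLES_PER_FRAME = 4  # hypothetical: number of cycles in one repeating frame


@dataclass
class Backup:
    """Toy backup computer that listens to the primary's cyclic data."""
    expected_slot: int      # slot in which the critical data should appear
    listening: bool = True  # False once the backup stops taking the data

    def observe(self, cycle: int, primary_slot: int) -> None:
        slot = cycle % CYCLES_PER_FRAME
        data_present = slot == primary_slot        # what the primary actually does
        expected_now = slot == self.expected_slot  # what the backup expects
        if self.listening and data_present != expected_now:
            # A mismatch looks like a primary failure, so stop processing
            # the data rather than risk polluting the backup's state.
            self.listening = False


def run(primary_phase_offset: int, cycles: int = 8) -> bool:
    """Return True if the backup is still in sync after `cycles` cycles."""
    nominal_slot = 1  # hypothetical slot agreed on by both systems
    backup = Backup(expected_slot=nominal_slot)
    primary_slot = (nominal_slot + primary_phase_offset) % CYCLES_PER_FRAME
    for cycle in range(cycles):
        backup.observe(cycle, primary_slot)
    return backup.listening


if __name__ == "__main__":
    print("nominal phasing, in sync:", run(0))
    print("one cycle late, in sync:", run(1))
```

In this sketch the backup never resumes listening once it has flagged a mismatch, which mirrors the behavior described in the abstract: the backup waits for a switchover rather than trying to resynchronize on its own.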

In other words, some 30 hours before launch, when the first of four primary computers was powered up, a 1-in-67 probability window introduced 12 months earlier was “hit”, which prevented the backup computer, powered up some 20 minutes prior to launch, from properly initializing. Because none of this was known at the time, it was a software engineering nightmare of the first order. A number of us were “locked into a conference room” (to keep people out, not so much to keep us in) near mission control, and we “talked through” the symptoms and architecture of the two systems until, very slowly, a picture of what had happened emerged.

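To give a feel for why a 1-in-67 window could go unnoticed for a year, the following sketch computes how likely it is that a given number of power-ups never hits the window. It assumes each power-up is an independent trial with the same 1/67 chance, which is a simplification introduced here and not something the abstract states; the function names are hypothetical.

```python
import math

P_HIT = 1 / 67  # window probability quoted in the abstract


def prob_never_hit(power_ups: int, p: float = P_HIT) -> float:
    """Probability that none of `power_ups` independent power-ups hits the window."""
    return (1 - p) ** power_ups


def power_ups_for_even_odds(p: float = P_HIT) -> int:
    """Smallest number of power-ups at which a miss-free record is at most 50% likely."""
    return math.ceil(math.log(0.5) / math.log(1 - p))


if __name__ == "__main__":
    print(f"10 power-ups, no hit: {prob_never_hit(10):.2f}")
    print("power-ups for even odds:", power_ups_for_even_odds())
```

Under this assumption, it takes dozens of power-ups before seeing the problem at least once becomes more likely than not, so a modest number of clean power-ups says little about whether the window exists.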
The problem was otherwise benign; it was cleared by simply powering up again (albeit two days later). Following the flight and for years afterward, this “bug” became a premier example of what can go wrong with system complexity “creep” and software “change” in general, both within NASA and in college computer science curricula.

Reference

Jack Garman (NASA Johnson Space Center), "The 'Bug' Heard 'Round the World," ACM Software Engineering Notes, October 1981, pp. 3-10.

2006 MAPLD International Conference - Session G
"Digital Engineering and Computer Design: A Retrospective and Lessons Learned for Today's Engineers"
