“Checkpointing State Recovery for On-Board Computers”

Shazia Maqbool¹, Chris Jackson² and Craig Underwood¹
¹ Surrey Space Centre, University of Surrey
² Surrey Satellite Technology Ltd.

Abstract

On-board computers (OBCs) are prone to single event functional interrupts (SEFIs). A SEFI can cause the processor to lock up, to raise continuous exceptions, to enter standby mode, or to go into some unknown, unrecognizable state. In such situations, resetting or power cycling the OBC is likely to be required for recovery. The OBC program memory usually consists of volatile random-access memory (RAM). Thus, after each reset or power cycle of the OBC unit, the software tasks need to be uploaded to the OBC from the on-board non-volatile storage or from the ground. Traditionally for space missions, such a situation has resulted in the loss of any computation done prior to the fault event. State recovery gives an application or system the ability to save its state and to tolerate failures by enabling a failed process to recover to an earlier safe state.

The computer state can be recovered using roll-back error recovery through checkpointing or roll-forward error recovery through redundant hardware. Checkpointing state recovery uses an on-board non-volatile storage device to save recovery information periodically during failure-free execution. Upon failure, a failed process uses the saved information to restart the computation from an intermediate state, thereby reducing the amount of lost computation. The recovery information includes, at a minimum, the states of the participating tasks, called checkpoints. Upon a failure, checkpointing-based rollback recovery restores the system state to the most recent consistent set of checkpoints, i.e., the recovery line.
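The roll-back scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: a JSON file stands in for the on-board non-volatile storage, and the names save_checkpoint, load_checkpoint, run, interval, and fail_at are hypothetical, chosen only for illustration. The task sums a list of numbers, saves its state every few items, and after a simulated SEFI resumes from the last checkpoint rather than from the beginning.

```python
import json
import os


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint.

    The rename is atomic, so a failure during the write leaves the previous
    checkpoint intact. The file is an assumed stand-in for non-volatile storage.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def run(path, items, interval, fail_at=None):
    """Sum `items`, checkpointing after every `interval` items.

    If `fail_at` is given, a failure is simulated just before that index is
    processed; everything since the last checkpoint is then lost.
    """
    state = load_checkpoint(path)
    i = state["next_item"]
    while i < len(items):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated SEFI")
        state["total"] += items[i]
        i += 1
        state["next_item"] = i
        if i % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling run a second time with the same path after a failure restarts the computation from the saved intermediate state. The checkpoint interval trades storage traffic against the amount of work repeated after a reset.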

This paper explores the Surrey Satellite Technology Ltd. (SSTL) OBC software architecture for the implementation of checkpointing state recovery. This software architecture is based on a client-server model. Based on their role, the software tasks are divided into non-checkpointing server tasks, checkpointing server tasks, independent tasks, and cooperating tasks. A state recovery (SR) task is proposed to support the checkpointing process in the OBC.
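As a rough illustration of how an SR task might use the four task categories, the sketch below keeps checkpoints per task and computes a recovery line after a reset. The abstract does not define the behaviour of each category, so the rules here are assumptions: non-checkpointing server tasks restart without saved state, checkpointing server tasks and independent tasks each restore their own latest checkpoint, and cooperating tasks roll back together to the most recent checkpoint round that all of them have saved. The class StateRecoveryTask, its methods, and the round numbers are hypothetical and are not taken from the SSTL software.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskKind(Enum):
    NON_CHECKPOINTING_SERVER = "non-checkpointing server"
    CHECKPOINTING_SERVER = "checkpointing server"
    INDEPENDENT = "independent"
    COOPERATING = "cooperating"


@dataclass
class StateRecoveryTask:
    """Hypothetical SR task: stores checkpoints and picks the recovery line."""
    kinds: dict = field(default_factory=dict)   # task name -> TaskKind
    store: dict = field(default_factory=dict)   # task name -> {round: state}

    def register(self, name, kind):
        self.kinds[name] = kind
        self.store[name] = {}

    def save(self, name, round_no, state):
        if self.kinds[name] is TaskKind.NON_CHECKPOINTING_SERVER:
            raise ValueError(f"{name} does not take checkpoints")
        self.store[name][round_no] = state

    def recovery_line(self):
        """Return {task name: state to restore, or None for a fresh start}."""
        coop = [n for n, k in self.kinds.items() if k is TaskKind.COOPERATING]
        # Assumed consistency rule: cooperating tasks share the latest round
        # for which every one of them has a checkpoint.
        common = set.intersection(*(set(self.store[n]) for n in coop)) if coop else set()
        coop_round = max(common) if common else None

        line = {}
        for name, kind in self.kinds.items():
            saved = self.store[name]
            if kind is TaskKind.NON_CHECKPOINTING_SERVER:
                line[name] = None
            elif kind is TaskKind.COOPERATING:
                line[name] = saved[coop_round] if coop_round is not None else None
            else:
                line[name] = saved[max(saved)] if saved else None
        return line
```

In this sketch, a cooperating task that has checkpointed further ahead than its peers is rolled back to the shared round, so that the restored states are mutually consistent. A real system would also need to handle messages in flight between tasks, which is left out here.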

 

2006 MAPLD International Conference Home Page