"Layered Fault Management for HW/SW Co-design"

Jason Scott, Sandeep Neema, Dolores Black, and Ted Bapty
Vanderbilt University

Abstract

Field-Programmable Gate Arrays (FPGAs) with processor cores embedded in the FPGA fabric (both hard and soft cores) are quickly becoming a mainstream architecture for high-performance instrumentation, control, and signal processing systems. These devices offer significant advantages in flexibility, cost, and processing density. Unfortunately, their ever-increasing density also makes them more susceptible to radiation-induced failures, in the form of single-event upset (SEU), single-event latchup (SEL), and multiple-bit upset (MBU) failure modes. Because processor cores are integrated into the FPGA fabric, fault tolerance and fault mitigation facilities must also exist in the middleware/OS layer. These failures must be addressed for space-based systems, and must also be considered when FPGAs are used in avionics and in land-based systems in areas of elevated radiation exposure. Each of these applications may have different reliability requirements and different levels of radiation exposure and likelihood of failure. Overall system reliability is achieved by designing the system with some amount of fault tolerance and by providing fault mitigation strategies to deal with the faults that do occur. In this paper, we present an approach to system design in which fault tolerance and fault mitigation strategies are considered as an integral part of the evaluation of system tradeoffs. System reliability becomes another metric that must be satisfied in the evaluation of the “design space.”

There are opportunities for providing fault tolerance and fault mitigation at each layer of the system architecture. Within each layer, fault management is composed of many tightly interlocked components. For example, middleware uses several facilities, such as replication, which provides multiple copies of a component together with a voting or timeout mechanism for sensing failures. This mechanism will link closely with self-stabilization and fault-tolerant protocols; for instance, the replicated components will use fault-tolerant protocols to interact.
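The replication facility described above can be illustrated with a minimal sketch of a majority voter. It assumes that each replica either reports a value or misses its deadline, with a missed deadline represented as None. The names vote and suspects are hypothetical and are not taken from the paper or from any particular middleware.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def vote(replies: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the value reported by a strict majority of all replicas.

    A reply of None stands for a replica that missed its deadline
    (the timeout failure sensing case). Raises RuntimeError when no
    value is reported by more than half of the replicas.
    """
    answered = [r for r in replies if r is not None]
    if not answered:
        raise RuntimeError("all replicas timed out")
    value, count = Counter(answered).most_common(1)[0]
    if count * 2 <= len(replies):
        raise RuntimeError("no majority among replicas")
    return value


def suspects(replies: Sequence[Optional[Hashable]], agreed: Hashable) -> List[int]:
    """Indices of replicas that timed out or disagreed with the majority."""
    return [i for i, r in enumerate(replies) if r != agreed]


# Three replicas: one timed out, two agree.
replies = [7, None, 7]
agreed = vote(replies)
faulty = suspects(replies, agreed)
```

In this sketch the voter masks a single faulty or late replica out of three, and the list of suspects is the kind of information a higher-level mitigation strategy could act on.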

Vertical integration isolates the fault tolerance mechanisms of each layer: a layer provides a guaranteed level of reliability to the layers above it and, likewise, relies on a level of fault tolerance from the layers below it. This layered approach is common in many fault-tolerant infrastructures. Our approach to system design will provide tool support for optimizing the required levels of fault tolerance by balancing the burden of resilience across the vertical layers. Each application will result in a unique cross-section of fault tolerance mechanisms and parameters, taking resources and requirements into account. The goal is to achieve the desired system reliability at the lowest cost, where cost may be a combination of metrics such as monetary cost, physical size, weight, and power.

The design space within which these tradeoff decisions must be made becomes complex. A tool is needed to automate the decision process and select an overall system design, spanning both hardware and software, that satisfies the reliability requirements along with the other key system requirements (size, weight, power, and cost).
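As a rough illustration of the kind of search such a tool might automate, the sketch below enumerates one fault tolerance option per layer and keeps the cheapest combination that meets a reliability target. Everything in it is an assumption made for illustration: the layer names, the option names, the coverage and cost figures, and the simple model in which each layer independently masks a fixed fraction of the faults that reach it. It is not the authors' tool or reliability model.

```python
from itertools import product

# Hypothetical options per layer: (name, fault coverage, cost).
# All names and numbers are illustrative, not measured data.
LAYERS = {
    "hardware": [("none", 0.0, 0), ("TMR", 0.99, 30)],
    "middleware": [("none", 0.0, 0), ("replication", 0.9, 10)],
    "application": [("none", 0.0, 0), ("checkpointing", 0.5, 4)],
}


def cheapest_design(layers, raw_fault_prob, max_failure_prob):
    """Exhaustively search one option per layer.

    Assumed model: a fault that escapes one layer is passed to the next,
    so the residual failure probability is the raw fault probability
    multiplied by (1 - coverage) for every selected option.
    Returns (cost, residual, {layer: option}) or None if nothing fits.
    """
    best = None
    for combo in product(*layers.values()):
        residual = raw_fault_prob
        for _, coverage, _ in combo:
            residual *= 1.0 - coverage
        cost = sum(option_cost for _, _, option_cost in combo)
        if residual <= max_failure_prob and (best is None or cost < best[0]):
            names = dict(zip(layers, (name for name, _, _ in combo)))
            best = (cost, residual, names)
    return best


result = cheapest_design(LAYERS, raw_fault_prob=0.1, max_failure_prob=0.006)
```

Exhaustive enumeration is only workable for a handful of layers and options; the number of combinations grows multiplicatively with each added layer, which is one reason the decision process calls for tool support.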

2005 MAPLD International Conference Home Page