"An Evaluation of Software Fault Tolerance Techniques on a Tiled Architecture"

Karandeep Singh, Adnan Agbaria, Dong-In Kang, and Matthew French
USC Information Sciences Institute

Abstract

This work applies software fault tolerance techniques to the MIT Raw architecture, a single-chip parallel tiled architecture. Multi-core tiled computing architectures are emerging as the forerunners in high-performance COTS processor systems. Besides superior performance and power efficiency, multi-core architectures provide inherent redundancy (because of their multiple cores) that enables better fault tolerance and recovery. The goal of this research is to mitigate both single-event upset (SEU) and total ionizing dose (TID) faults with no (or minimal) VLSI or architectural modification to the Raw design. We expect the fault tolerance algorithms to be implemented in the compiler, transparently to the user, and designed to meet the run-time performance and throughput requirements of the system. The proposed techniques are also applicable to other tiled architectures (and to parallel and distributed systems in general).

The Raw architecture consists of 16 tiles connected in a mesh by two types of high-performance pipelined networks: static and dynamic. It has a peak throughput of 6.8 GFLOPS at 425 MHz. Each tile has two processors: a MIPS-type compute processor with 32 KB each of data and instruction cache, and a switch processor with 64 KB of instruction cache. Two of the interconnecting networks are static 2-D point-to-point mesh networks, optimized to route single-word quantities of data (without any headers) along routes determined at compile time. The other two networks are dynamically routed: the general dynamic network is used for data transfer among tiles, while the memory dynamic network is used to access off-chip memory.
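The quoted peak figure is consistent with each of the 16 tiles completing one floating-point operation per cycle at 425 MHz. The short Python check below works through that arithmetic. The one-operation-per-cycle-per-tile rate and the function name `peak_gflops` are assumptions made here for illustration; the abstract does not state how the figure was derived.

```python
def peak_gflops(tiles, clock_mhz, flops_per_cycle_per_tile=1):
    """Peak throughput in GFLOPS for a tiled chip.

    flops_per_cycle_per_tile=1 is an assumption, chosen because it
    reproduces the 6.8 GFLOPS figure quoted for Raw.
    """
    # tiles * MHz * FLOPs/cycle gives MFLOPS; divide by 1000 for GFLOPS.
    return tiles * clock_mhz * flops_per_cycle_per_tile / 1000.0


if __name__ == "__main__":
    # 16 tiles at 425 MHz
    print(peak_gflops(16, 425))
```

Under that assumption, the 16-tile, 425 MHz configuration yields the 6.8 GFLOPS peak stated above.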

We define reliability as the percentage of total area that is not susceptible to faults that lead to a system reset, and we use analytical methods to derive reliability numbers. We analyze the fault tolerance algorithms for three kernels (FIR, Corner Turn, and Matrix Multiply) on Raw. We determine the reliability and performance of the system with and without SEUs, and compare the performance of the algorithms to that of software-based triple modular redundancy (TMR) for the system. Worst-case reliability (mitigation of faults without system reset) is around 85%, and ranges between 89% and 96% for specific applications with software optimizations. The techniques we propose perform better than the software TMR technique. Detailed application mappings, analysis, and performance numbers are presented in the paper.
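The area-based reliability metric defined above can be expressed as a one-line formula. The following Python sketch is only an illustration of that definition: the function name `reliability_percent`, its parameters, and the area values in the example are hypothetical, chosen so that the result matches the roughly 85% worst-case figure. They are not the area breakdown used in the paper.

```python
def reliability_percent(total_area, reset_susceptible_area):
    """Percentage of total area NOT susceptible to faults that force a reset.

    Both arguments use the same (arbitrary) area unit. The names are
    hypothetical; only the definition itself comes from the abstract.
    """
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    if not 0 <= reset_susceptible_area <= total_area:
        raise ValueError("reset_susceptible_area must lie in [0, total_area]")
    return 100.0 * (total_area - reset_susceptible_area) / total_area


if __name__ == "__main__":
    # Hypothetical numbers: 15 of 100 area units can still trigger a reset.
    print(reliability_percent(100, 15))
```

With these illustrative inputs, 85 of 100 area units are covered by mitigation, which corresponds to a reliability of 85%.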

2006 MAPLD International Conference