"Petaflops YEars Effect: Towards Reliable Supercomputing"

N Venkateswaran and R. Prem Kumar
WAran Research FoundaTion (WARFT)

Abstract

The abstract submitted to MAPLD 06 [1] discussed the need for power awareness and for reconfigurable characteristics (to match the communication requirements of a wide class of applications) in evolving supercomputers for onboard space applications. fMRI brain imaging of astronauts can be performed on board a space station, and the real-time cerebral blood flow and oxygenation processes need to be modeled to investigate the effects of long years of stay in space stations. Further, space-borne high-end simulations may need to be performed using thousands of terabytes of collected data. A major requirement is a reduced cluster size without compromising performance. Besides exploiting nanotechnology to fabricate miniaturized, high-performance nodes for the cluster, the node architecture must include several types of functional units to cater to a large spread of application characteristics. Basically, the node architecture must be heterogeneous, as in the Memory In Processor SuperComputer On Chip (MIP SCOC) [2]. On-board (deep-space space-station) applications may demand petaflop-years of computing power. At the cluster level it is crucial to achieve exaflops performance while maintaining a high performance/power ratio [3]. In a billion-device node architecture the temperature will not be uniform across the different functional units, including the control units, and long years of running will generate hot spots leading to electromigration. From a reliability point of view, the long run time of these applications has detrimental effects on the nodes, which we call Petaflops YEars (PYE) effects [3]. The major impact is the aging of the devices, interconnects, and components present in the functional units. For this reason it becomes necessary to balance the temperature in the node and hence increase the node lifetime.
This paper presents a simulation model that tracks the individual units and interconnects present in the node (those involved during the execution of billions of instructions, i.e. Instruction Blocks) in the form of a spatio-temporal temperature profile library. The temperature is extrapolated from this profile library, leading to dynamic temperature-balanced mapping and thus overcoming the PYE effect.
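
The idea of a profile library driving temperature-balanced mapping can be illustrated with a small sketch. It is not the simulation model of this paper: the thermal constants, the first-order heating and cooling rule, the linear extrapolation, and all names (ProfileLibrary, run, peak, the fu0..fu3 unit labels) are assumptions made for illustration only. Each functional unit keeps a history of (time, temperature) samples, the next temperature is extrapolated from the last two samples, and each instruction block is mapped to the unit with the lowest extrapolated temperature, with ties broken by the latest recorded temperature.

```python
AMBIENT = 45.0        # assumed idle temperature of a unit, degrees C
HEAT_PER_BLOCK = 4.0  # assumed rise when a unit executes one instruction block
COOLING = 0.5         # assumed fraction of excess heat retained per time step


class ProfileLibrary:
    """Hypothetical profile library: unit name -> list of (time, temperature)."""

    def __init__(self, units):
        self.samples = {u: [(0, AMBIENT)] for u in units}

    def record(self, unit, time, temp):
        self.samples[unit].append((time, temp))

    def latest(self, unit):
        return self.samples[unit][-1][1]

    def extrapolate(self, unit, time):
        """Linear extrapolation from the last two samples of one unit."""
        hist = self.samples[unit]
        if len(hist) < 2:
            return hist[-1][1]
        (t0, y0), (t1, y1) = hist[-2], hist[-1]
        slope = (y1 - y0) / (t1 - t0)
        return y1 + slope * (time - t1)


def run(units, n_blocks, balanced):
    """Map one instruction block per time step and record every unit's temperature."""
    lib = ProfileLibrary(units)
    for step in range(1, n_blocks + 1):
        if balanced:
            # Lowest extrapolated temperature wins; ties go to the unit
            # that is currently coolest.
            chosen = min(units, key=lambda u: (lib.extrapolate(u, step), lib.latest(u)))
        else:
            chosen = units[0]  # naive mapping: always the same unit
        for u in units:
            temp = AMBIENT + COOLING * (lib.latest(u) - AMBIENT)
            if u == chosen:
                temp += HEAT_PER_BLOCK
            lib.record(u, step, temp)
    return lib


def peak(lib):
    """Highest temperature recorded for any unit at any time."""
    return max(temp for hist in lib.samples.values() for _, temp in hist)


if __name__ == "__main__":
    units = ["fu0", "fu1", "fu2", "fu3"]
    print("peak, naive mapping:   ", peak(run(units, 20, balanced=False)))
    print("peak, balanced mapping:", peak(run(units, 20, balanced=True)))
```

Under these assumed constants the balanced mapping spreads the blocks over all four units and keeps the peak temperature below that of the naive mapping, which is the qualitative behaviour the abstract aims for. A realistic model would also need to cover interconnects and spatial coupling between neighbouring units.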

References 

[1] N Venkateswaran, T P Ramnath Sai Sagar, G Shyamsundar, E Vinoth Krishnan and K Viswanath, "Fusion Cluster: Reconfigurable, High Performance and Scalable Cluster Architecture", abstract submitted to NASA MAPLD 2006, Washington DC.

[2] N Venkateswaran, Arrvindh Shriraman and S. Niranjan Kumar, "Memory in processor supercomputer on a chip processor design and execution semantics for massive single chip performance", Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), Workshop 14, Volume 15, p. 263.2, 2005.

[3] Prem Kumar Ramesh, "Towards Million Node Fusion Cluster Mapping Schedule: Employing Population Theory and Temperature Aware Load Balancing", thesis submitted to WAran Research FoundaTion (WARFT), Aug 2005.

 

2006 MAPLD International Conference Home Page