"A Hybrid Approach for Accelerating a Sparse Matrix Jacobi Solver using an FPGA-Augmented Reconfigurable Computer"

Gerald R. Morris and Viktor K. Prasanna
University of Southern California

Abstract

Reconfigurable computers (RCs) combining general-purpose processors (GPPs) and field programmable gate arrays (FPGAs) are now available. These exciting architectures allow end-users to create reconfigurable processors targeting the computationally intensive parts of each problem. RCs have already been used to accelerate integer and fixed-point applications. However, the extensive parallelism and deeply pipelined floating-point cores needed to make MHz-scale FPGAs competitive with GHz-scale GPPs have made it difficult to accelerate certain kinds of floating-point kernels. Kernels with variable-length nested loops, e.g., sparse matrix-vector multiply, have been problematic because of the loop-carried dependence associated with the pipelined floating-point units. While hardware description language (HDL)-based kernels have shown some success in addressing this problem, the use of a high-level language (HLL)-based approach to accelerate such applications has been rather elusive. If RCs are to be a part of mainstream embedded military and aerospace applications, we must move away from HDL-based hardware design and toward HLL-based programming.
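For context, the C sketch below of a compressed sparse row (CSR) matrix-vector multiply illustrates the difficulty: the inner loop length varies from row to row, and each iteration updates a running sum, a loop-carried dependence that prevents a deeply pipelined floating-point adder from accepting one operation per cycle. The data layout and names (row_ptr, col_idx, val) are illustrative only, not the paper's implementation.

    /* Minimal CSR sparse matrix-vector multiply sketch (illustrative names).
     * The inner loop length varies per row, and each iteration depends on the
     * previous value of `sum`, a loop-carried dependence that stalls a deeply
     * pipelined floating-point adder. */
    #include <stddef.h>

    void csr_spmv(size_t n, const size_t *row_ptr, const size_t *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {          /* one output element per row */
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];    /* depends on previous sum    */
            y[i] = sum;
        }
    }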

In this paper we describe the design and implementation of a sparse matrix Jacobi processor to solve systems of linear equations, Ax = b. Unlike previous research, this is not a simple FPGA-only kernel or paper design with estimated performance values. In addition to the design, this paper describes the implementation and performance of a parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point sparse matrix Jacobi iterative solver running on a contemporary FPGA-augmented RC. The FPGA-based elements are developed via a hybrid approach that uses an HLL-to-HDL compiler in conjunction with custom-built, VHDL-based, reusable floating-point components. A novel partial-subtraction matrix allows us to implement fully-pipelined loops and achieve wall clock runtime speedups over a software-only implementation. The current design can handle assembled sparse matrices of order n ≤ 2,048 having up to 1M non-zero entries. Actual runtime comparisons show that our FPGA-augmented Jacobi solver runs up to 2.2 times faster than the software-only version running on the same RC. Estimates for a second-generation RC show that this same design will run over 6 times faster than software.
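For reference, a minimal software-only Jacobi iteration over a CSR matrix is sketched below. It shows the standard update x_i = (b_i - sum of a_ij * x_j for j != i) / a_ii, but not the paper's partial-subtraction matrix or the FPGA pipelining; all names and parameters are illustrative assumptions.

    /* Reference (software-only) Jacobi iteration for Ax = b on a CSR matrix;
     * a minimal sketch for context, not the paper's FPGA design. */
    #include <math.h>
    #include <stddef.h>

    void jacobi_csr(size_t n, const size_t *row_ptr, const size_t *col_idx,
                    const double *val, const double *b,
                    double *x, double *x_new, size_t max_iter, double tol)
    {
        for (size_t it = 0; it < max_iter; it++) {
            double max_delta = 0.0;
            for (size_t i = 0; i < n; i++) {
                double sigma = 0.0, diag = 0.0;
                for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                    if (col_idx[k] == i) diag = val[k];      /* a_ii            */
                    else sigma += val[k] * x[col_idx[k]];    /* sum over j != i */
                }
                x_new[i] = (b[i] - sigma) / diag;            /* Jacobi update   */
                double d = fabs(x_new[i] - x[i]);
                if (d > max_delta) max_delta = d;
            }
            for (size_t i = 0; i < n; i++) x[i] = x_new[i];  /* x <- x_new      */
            if (max_delta < tol) break;                      /* converged       */
        }
    }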
