Bodnar: "Efficient FPGA Implementations of Floating-Point Reduction Operations"

Michael R. Bodnar¹, James P. Durbano², John R. Humphrey², Petersen F. Curt², and Dennis W. Prather¹

¹University of Delaware
²EM Photonics, Inc.

Abstract

Many scientific algorithms require floating-point reduction operations, or accumulations, including matrix-vector multiply (MVM), vector dot products, and the discrete cosine transform (DCT).  These computational kernels appear in numerous applications, such as automatic target recognition, large-object scattering, antenna design, encryption, and long-range imaging.  Many of these algorithms are being ported to custom hardware, such as field-programmable gate arrays (FPGAs), in order to achieve much greater computational power, so a high-performance, floating-point accumulation unit is clearly necessary.  However, this type of circuit is difficult to design in hardware because of the deep pipelining of the floating-point arithmetic units, which is needed to attain high performance.  A deep pipeline requires special handling in feedback circuits because of its long delay, a problem further complicated by a continuous input data stream.  To address this need, we have developed a high-performance accumulation circuit.  In this paper, we present the key aspects of our accumulation architecture, which lie in the storing and scheduling of intermediate results and allow near-perfect utilization of the underlying floating-point core.  Our design, a natural evolution of previous work, maintains buffers for partial-result storage that use significantly fewer embedded memory resources than other designs, while maintaining fixed size and speed characteristics regardless of input stream length.

We have demonstrated the power of our accumulation architecture in a matrix-matrix-multiply application on a Virtex-II 8000 FPGA clocked at over 150 MHz.  Coupled with efficient DRAM controllers, careful on-chip caching techniques, and fully pipelined arithmetic units, we were able to achieve a throughput of 27.8 GFLOPs.
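To make the scheduling idea concrete, the following is a minimal software sketch, not the authors' circuit: it models hiding the feedback latency of a deeply pipelined floating-point adder by interleaving several independent partial sums and combining them at the end of the stream.  The pipeline depth (PIPE_DEPTH), the round-robin schedule, and the function name reduce_stream are illustrative assumptions; the paper's architecture buffers and schedules intermediate results in hardware rather than software.

    /*
     * Illustrative sketch only: a software analogy of accumulating a
     * continuous input stream through a P-stage pipelined adder by
     * keeping P independent partial sums.  PIPE_DEPTH and the
     * round-robin schedule are assumptions for illustration.
     */
    #include <stdio.h>

    #define PIPE_DEPTH 8            /* assumed adder pipeline depth */

    /* Reduce a stream of n values to a single sum while keeping the
     * adder busy: element i is folded into partial[i % PIPE_DEPTH],
     * so each partial sum is updated only once every PIPE_DEPTH
     * cycles and the adder's feedback latency is hidden. */
    static float reduce_stream(const float *x, int n)
    {
        float partial[PIPE_DEPTH] = {0.0f};

        for (int i = 0; i < n; ++i)
            partial[i % PIPE_DEPTH] += x[i];

        /* Final combine of the PIPE_DEPTH intermediate results. */
        float sum = 0.0f;
        for (int i = 0; i < PIPE_DEPTH; ++i)
            sum += partial[i];
        return sum;
    }

    int main(void)
    {
        float v[16];
        for (int i = 0; i < 16; ++i)
            v[i] = (float)(i + 1);
        printf("sum = %f\n", reduce_stream(v, 16));   /* 136.0 */
        return 0;
    }

In this analogy, the array of partial sums plays the role of the partial-result buffers described above, and the round-robin index plays the role of the scheduler that keeps the floating-point core fully utilized.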

2006 MAPLD International Conference