Thumbnail Image

Scalable, Memory-Intensive Scientific Computing on Field Programmable Gate Arrays

Cache-based, general purpose CPUs perform at a small fraction of their maximum floating point performance when executing memory-intensive simulations, such as those required for many scientific computing problems. This is due to the memory bottleneck that is encountered with large arrays that must be stored in dynamic RAM. A system of FPGAs, with a large enough memory bandwidth, and clocked at only hundreds of MHz can outperform a CPU clocked at GHz in terms of floating point performance. An FPGA core designed for a target performance that does not unnecessarily exceed the memory imposed bottleneck can then be distributed, along with multiple memory interfaces, into a scalable architecture that overcomes the bandwidth limitation of a single interface. Interconnected cores can work together to solve a scientific computing problem and exploit a bandwidth that is the sum of the bandwidth available from all of their connected memory interfaces. The implementation demonstrates this concept of scalability with two memory interfaces through the use of available FPGA prototyping platforms. Even though the FPGAs operate at 133 MHz, which is twenty one times slower than an AMD Phenom X4 processor operating at 2.8 GHz, the system of two FPGAs performs eight times slower than the processor for the example problem of SMVM in heat transfer. However, the system is demonstrated to be scalable with a run-time that decreases linearly with respect to the available memory bandwidth. The floating point performance of a single board implementation is 12 GFlops which doubles to 24 GFlops for a two board implementation, for a gather or scatter operation on matrices of varying sizes.
Research Projects
Organizational Units
Journal Issue
Publisher Version
Embedded videos