Description
We compare the performance of our most optimized FFT algorithm, running on a simulated version of VIRAM, with that of eleven high-end fixed- and floating-point Digital Signal Processors (DSPs) and DSP-like architectures. VIRAM outperforms all of the fixed-point DSPs and all but two of the special-purpose floating-point FFT DSPs. On 1024-point FFTs, VIRAM achieves 1.3 GFLOP/s in floating-point mode and 1.9 GOP/s in fixed-point mode.
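To put the floating-point figure in perspective, the sketch below converts a sustained rate into a per-transform time. It assumes the widely used nominal count of 5 N log2 N floating-point operations for an N-point complex FFT; the text above does not say which operation count underlies its GFLOP/s figure, so the result is a rough estimate only. The function names are illustrative.

```python
import math


def fft_flops(n: int) -> float:
    """Nominal flop count for an n-point complex FFT.

    Assumption: the conventional 5 * n * log2(n) count, which may differ
    from the count used to derive the figures quoted in the text.
    """
    return 5.0 * n * math.log2(n)


def fft_time_seconds(n: int, gflops: float) -> float:
    """Time for one n-point FFT at a sustained rate of `gflops` GFLOP/s."""
    return fft_flops(n) / (gflops * 1e9)


nominal = fft_flops(1024)
t = fft_time_seconds(1024, 1.3)
print(f"{nominal:.0f} flops, about {t * 1e6:.1f} microseconds per transform")
```

Under that assumption, a 1024-point transform is 51,200 nominal operations, which at 1.3 GFLOP/s corresponds to roughly 39 microseconds.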
Despite this high performance relative to the DSPs, we find that the VIRAM architecture is underutilized by as much as two thirds while running the FFT algorithm. An architectural analysis traces this underutilization to two causes: bottlenecks in VIRAM's memory functional units, and access conflicts in VIRAM's memory system. For larger FFTs the impact of the memory system becomes more severe, and the number of memory banks and subbanks plays a crucial role in how well our algorithm's performance scales to large FFT sizes.
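The role of bank count can be illustrated with a toy model. The sketch below assumes a simple low-order interleaved memory in which word address a maps to bank a mod B; this is not VIRAM's actual address mapping, and subbanks are not modeled. It counts how many accesses of a strided stream fall on the most heavily loaded bank. Power-of-two strides, which are typical of radix-2 FFT butterflies, concentrate accesses on a few banks.

```python
from collections import Counter


def bank_load(stride: int, n_accesses: int, n_banks: int) -> int:
    """Largest number of accesses that land on any single bank.

    Assumption: word address a maps to bank a % n_banks (simple
    interleaving), with addresses 0, stride, 2*stride, ...
    """
    counts = Counter((i * stride) % n_banks for i in range(n_accesses))
    return max(counts.values())


# Compare two hypothetical bank counts across power-of-two strides.
for n_banks in (8, 16):
    for stride in (1, 2, 4, 8, 16):
        worst = bank_load(stride, 64, n_banks)
        print(f"banks={n_banks:2d} stride={stride:2d} worst-bank accesses={worst}")
```

In this model, 64 unit-stride accesses spread evenly over 8 banks (8 per bank), whereas a stride of 8 sends all 64 to a single bank. Doubling the bank count to 16 halves that worst case, which is one intuition for why more banks help at larger FFT sizes.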