Description
We implement and evaluate auto-tuners for two important kernels: Lattice Boltzmann Magnetohydrodynamics (LBMHD) and sparse matrix-vector multiplication (SpMV). They are representative of two of the computational motifs: structured grids and sparse linear algebra. To demonstrate the performance portability that our auto-tuners deliver, we selected an extremely wide range of architectures as an experimental test bed. These include conventional dual- and quad-core superscalar x86 processors, both with and without integrated memory controllers. We also include the rather unconventional chip multithreaded (CMT) Sun Niagara2 (Victoria Falls) and the heterogeneous, local store-based IBM Cell Broadband Engine. In some experiments we sacrifice the performance portability of a common C representation by creating ISA-specific auto-tuned versions of these kernels to gain architectural insight. To quantify our success, we created the Roofline model to perform a bound and bottleneck analysis for each kernel-architecture combination.
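As a rough illustration of the bound and bottleneck analysis mentioned above, the simplest form of a Roofline bound can be sketched as the minimum of peak compute throughput and the product of peak memory bandwidth and arithmetic intensity (flops per byte of memory traffic). The function names and the machine numbers below are hypothetical placeholders chosen for the example; they are not measurements of any architecture in this work, and the sketch omits the additional ceilings a fuller analysis would include.

```python
def roofline_gflops(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Upper bound on attainable GFLOP/s for a kernel.

    arithmetic_intensity is in flops per byte of memory traffic.
    """
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)


def bottleneck(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Classify a kernel as memory- or compute-bound under this simple bound."""
    # The ridge point is the intensity at which the two limits meet.
    ridge_point = peak_gflops / peak_bandwidth_gbs
    return "memory" if arithmetic_intensity < ridge_point else "compute"


if __name__ == "__main__":
    # Hypothetical machine: 64 GFLOP/s peak compute, 16 GB/s peak bandwidth.
    PEAK_GFLOPS = 64.0
    PEAK_BANDWIDTH_GBS = 16.0

    # Hypothetical kernel intensities (flops/byte), low to high.
    for intensity in (0.25, 1.0, 8.0):
        bound = roofline_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, intensity)
        limit = bottleneck(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, intensity)
        print(f"intensity={intensity:5.2f} flops/byte  "
              f"bound={bound:6.2f} GFLOP/s  limited by {limit}")
```

Under these assumed numbers, a kernel with low arithmetic intensity sits on the bandwidth-limited slope of the bound, which is the regime commonly attributed to kernels such as SpMV.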
Despite the common wisdom that LBMHD and SpMV are memory bandwidth-bound, and thus nothing can be done to improve performance, we show that auto-tuning consistently delivers speedups in excess of 3x across all multicore computers except the memory-bound Intel Clovertown, where the benefit was as little as 1.5x. The Cell processor, with its explicitly managed memory hierarchy, showed far more dramatic speedups of between 20x and 130x. The auto-tuners include both architecture-independent optimizations, based solely on source code transformations and high-level kernel knowledge, and architecture-specific optimizations such as the explicit use of single instruction, multiple data (SIMD) extensions or the use of Cell's DMA-based memory operations. We observe that these ISA-specific optimizations are becoming increasingly important as architectures evolve.