Auto-tuning Performance on Multicore Computers

Williams, Samuel Webb; EECS Department, University of California

PDF

Description

For the last decade, the exponential potential of Moore's Law has been squandered in the effort to increase single thread performance, which is now limited by the memory, instruction, and power walls. In response, the computing industry has boldly placed its hopes on the multicore gambit. That is, abandon instruction-level parallelism and frequency-scaling in favor of the exponential scaling of the number of compute cores per microprocessor. The massive thread-level parallelism results in tremendous potential performance, but demands efficient parallel programming --- a task existing software tools are ill-equipped for. We desire performance portability --- the ability to write a program once and not only have it deliver good performance on the development computer, but on all multicore computers today and tomorrow. This thesis accepts for fact that multicore is the basis for all future computers. Furthermore, we regiment our study by organizing it around the computational patterns and motifs as set forth in the Berkeley View. Although domain experts may be extremely knowledgeable on the mathematics and algorithms of their fields, they often lack the detailed computer architecture knowledge required to achieve high performance. Forthcoming heterogeneous architectures will exacerbate the problem for everyone. Thus, we extend the auto-tuning approach to program optimization and performance portability to the menagerie of multicore computers. In an automated fashion, an auto-tuner will explore the optimization space for a particular computational kernel of a motif on a particular computer. In doing so, it will determine the best combination of algorithm, implementation, and data structure for the combination of architecture and input data.

We implement and evaluate auto-tuners for two important kernels: Lattice Boltzmann Magnetohydrodynamics (LBMHD) and sparse matrix-vector multiplication (SpMV). They are representative of two of the computational motifs: structured grids and sparse linear algebra. To demonstrate the performance portability that our auto-tuners deliver, we selected an extremely wide range of architectures as an experimental test bed. These include conventional dual- and quad-core superscalar x86 processors both with and without integrated memory controllers. We also include the rather unconventional chip multithreaded (CMT) Sun Niagara2 (Victoria Falls) and the heterogeneous, local store-based IBM Cell Broadband Engine. In some experiments we sacrifice the performance portability of a common C representation, by creating ISA-specific auto-tuned versions of these kernels to gain architectural insight. To quantify our success, we created the Roofline model to perform a bound and bottleneck analysis for each kernel-architecture combination.

Despite the common wisdom that LBMHD and SpMV are memory bandwidth-bound, and thus nothing can be done to improve performance, we show that auto-tuning consistently delivers speedups in excess of 3x across all multicore computers except the memory-bound Intel Clovertown, where the benefit was as little as 1.5x. The Cell processor, with its explicitly managed memory hierarchy, showed far more dramatic speedups of between 20x and 130x. The auto-tuners includes both architecture-independent optimizations based solely on source code transformations and high-level kernel knowledge, as well as architecture-specific optimizations like the explicit use of single instruction, multiple data (SIMD) extensions or the use Cell's DMA-based memory operations. We observe that the these ISA-specific optimizations are becoming increasingly important as architectures evolve.