We consider the problem of building high-performance implementations of sparse matrix-vector multiply (SpM x V), or y = y + A * x, an important and ubiquitous computational kernel. Prior work indicates that cache blocking of SpM x V is extremely important for some matrix and machine combinations, with speedups as high as 3x. In this paper we present a new, more compact data structure for cache blocking for SpM x V and examine the general question of when and why performance improves. Cache blocking appears to be most effective when, simultaneously, 1) the vector x does not fit in cache, 2) the vector y fits in cache, 3) the nonzeros are distributed throughout the matrix, and 4) the nonzero density is sufficiently high. In particular, we find that cache blocking does not help with band matrices, no matter how large x and y are, since the matrix structure already lends itself to the optimal access pattern.
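To make the kernel concrete, the following is a minimal sketch of y = y + A * x on a matrix in compressed sparse row (CSR) form, together with a simple cache-blocked variant. It is not the compact data structure presented in the paper: the function names (`spmv_csr`, `cache_block`, `spmv_cache_blocked`) and the layout, in which each cache block is stored as its own small CSR matrix with block-local column indices, are assumptions made purely for illustration.

```python
def spmv_csr(row_ptr, col_idx, vals, x, y):
    """Compute y += A * x in place, with A stored in CSR form."""
    for i in range(len(row_ptr) - 1):
        acc = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc


def cache_block(row_ptr, col_idx, vals, n_cols, r_block, c_block):
    """Split a CSR matrix into r_block x c_block cache blocks.

    Each non-empty block is returned as (r0, c0, b_ptr, b_col, b_val):
    a small CSR matrix whose column indices are relative to c0.
    This layout is an illustrative assumption, not the paper's structure.
    """
    n_rows = len(row_ptr) - 1
    blocks = []
    for r0 in range(0, n_rows, r_block):
        r1 = min(r0 + r_block, n_rows)
        for c0 in range(0, n_cols, c_block):
            c1 = c0 + c_block
            b_ptr, b_col, b_val = [0], [], []
            for i in range(r0, r1):
                for k in range(row_ptr[i], row_ptr[i + 1]):
                    if c0 <= col_idx[k] < c1:
                        b_col.append(col_idx[k] - c0)
                        b_val.append(vals[k])
                b_ptr.append(len(b_col))
            if b_val:  # skip blocks that contain no nonzeros
                blocks.append((r0, c0, b_ptr, b_col, b_val))
    return blocks


def spmv_cache_blocked(blocks, x, y):
    """Compute y += A * x one cache block at a time.

    Within a block, only x[c0 : c0 + c_block] and y[r0 : r0 + r_block]
    are touched, so the working set of both vectors is bounded by the
    block dimensions.
    """
    for r0, c0, b_ptr, b_col, b_val in blocks:
        for i in range(len(b_ptr) - 1):
            acc = y[r0 + i]
            for k in range(b_ptr[i], b_ptr[i + 1]):
                acc += b_val[k] * x[c0 + b_col[k]]
            y[r0 + i] = acc
```

Both routines compute the same result; the blocked version only reorders the work so that each block reuses a bounded slice of x and y. In a tuned implementation, `r_block` and `c_block` would be chosen relative to the cache size, which is the block-size selection problem the performance models address.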

Prior work on performance modeling assumed that the matrices were small enough that x and y fit in the cache. When this is not the case, however, the optimal block sizes picked by these models may perform poorly, which motivates us to update the models. In contrast, the optimum block sizes predicted by the new performance models generally match the measured optimum block sizes, and the models can therefore be used as the basis for a heuristic to pick the block size.

We conclude with architectural suggestions that would make processor and memory systems more amenable to SpM x V.



