Description
This report presents uniprocessor automatic tuning techniques for the sparse matrix operation y = A^T Ax, where A is a sparse matrix and x and y are dense vectors. We describe an implementation of this computational kernel that brings A through the memory hierarchy only once, and that can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply (SpM×V). Extensive experiments, on a benchmark set of 44 matrices and 4 platforms, show that speedups of up to 4.2x are possible compared to a conventional implementation that computes t = Ax and y = A^T t as separate steps. In addition, we develop platform-specific upper bounds on the performance of our implementations. We analyze how closely our implementations approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant payoff. Finally, we present a hybrid off-line/run-time heuristic which in practice automatically selects optimal (or near-optimal) values of the key tuning parameters, the register block sizes.
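To make the single-pass idea concrete, the following is a minimal sketch (not the report's actual code) of the fused kernel, assuming A is stored in compressed sparse row (CSR) format; the function and variable names are illustrative. The key point is that each row a_i of A is read once: its nonzeros are used to form t_i = a_i · x and then immediately reused, while still in cache and registers, to accumulate y += t_i * a_i, rather than being streamed through memory a second time as in the two-step t = Ax, y = A^T t approach.

```c
#include <stddef.h>

/* Sketch: y = A^T (A x) in one pass over a CSR matrix A (m rows).
 * Uses the identity A^T A x = sum_i a_i (a_i^T x), where a_i is row i of A.
 * The caller must zero-initialize y (length = number of columns of A). */
void ata_times_x(size_t m,
                 const size_t *row_ptr,  /* CSR row pointers, length m+1 */
                 const size_t *col_ind,  /* CSR column indices           */
                 const double *val,      /* CSR nonzero values           */
                 const double *x,        /* input vector                 */
                 double *y)              /* output vector, pre-zeroed    */
{
    for (size_t i = 0; i < m; ++i) {
        double t = 0.0;
        /* First use of row i's nonzeros: t = a_i . x */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            t += val[k] * x[col_ind[k]];
        /* Second use, while the same nonzeros are still cached:
         * y += t * a_i */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[col_ind[k]] += t * val[k];
    }
}
```

In the register-blocked variant described in the report, the same interleaving is applied block-row by block-row, with the block dimensions as the tuning parameters selected by the heuristic.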
There are at least three implications of this work. First, sparse A^T Ax should be a basic primitive in sparse matrix libraries, based on its utility to applications and the potential payoff from automatically tuning it. Second, our upper bound analysis shows that there is an opportunity to apply automatic low-level tuning methods, in the spirit of tuning systems such as ATLAS and PHiPAC for dense linear algebra, to further improve the performance of this kernel. Third, the success of our heuristic provides additional validation of the Sparsity tuning methodology.