Description
In this paper we present just part of this new formulation, namely computing a basis of the Krylov subspace spanned by [x,Ax, A^2x, ... A^kx]; other papers will present the rest of the KSM formulation. We present theory, performance models, and computational results: Our parallel performance models of 2D and 3D model problems predict speedups over a conventional algorithm of up to 15x on a model Petaflop machine, and up to 22x on a model of the Grid. Our sequential performance models of the same model problems predict speedups over a conventional algorithm of up to 10x on an out-of-core implementation, and up to 2.5x on Intel Clovertown, where we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we measured a speedup of over 3x on an actual out-of-core implementation.
We also consider the related kernel [Ax,MAx,AMAx,...,A(MA)^{k-1}x,(MA)^kx], which arises in preconditioned KSMs. Under certain mathematical conditions on A and the preconditioner M, we show how to avoid latency and bandwidth for this kernel as well.