In the first part we discuss optimizations of a sequential algorithm that exploit the memory hierarchies found in most RISC-based superscalar computers. We begin with the left-looking supernode-column algorithm of Eisenstat, Gilbert and Liu, which includes Eisenstat and Liu's symmetric structural reduction for fast symbolic factorization. Our key contribution is a set of numeric and symbolic schemes for performing supernode-panel updates, which achieve better data reuse in cache and floating-point registers. A further refinement, a two-dimensional matrix partitioning scheme, enhances performance for large matrices or machines with small caches. We conduct extensive performance evaluations on several recent superscalar architectures, such as the IBM RS/6000-590, MIPS R8000 and DEC Alpha 21164, and show that our new algorithm is much faster than its predecessors; the advantage is particularly evident for large problems. In addition, we develop a detailed model for systematically choosing the blocking parameters of the algorithm.
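The data-reuse idea behind panel updates can be illustrated with a minimal sketch. It is not the algorithm described above: it assumes a dense matrix, no pivoting, and no supernodes or symbolic phase, and the names panel_lu, lu_product and the panel width w are hypothetical. It only shows the loop ordering in which each finished column of L is loaded once and then applied to all w columns of the current panel, rather than being reloaded for every destination column.

```python
def panel_lu(a, w):
    """In-place left-looking LU by panels of width w (dense, NO pivoting).

    Illustrative assumption: 'a' is a square list of lists whose leading
    minors are nonsingular (e.g. diagonally dominant). On return, the strict
    lower triangle holds L (unit diagonal implied) and the rest holds U.
    """
    n = len(a)
    for j0 in range(0, n, w):
        j1 = min(j0 + w, n)
        # Updates from finished columns k < j0: column k of L is read once
        # and reused for every column j of the panel.
        for k in range(j0):
            col_k = [a[i][k] for i in range(k + 1, n)]
            for j in range(j0, j1):
                ukj = a[k][j]
                if ukj != 0.0:
                    for off, lik in enumerate(col_k):
                        a[k + 1 + off][j] -= lik * ukj
        # Dependencies inside the panel are resolved column by column.
        for j in range(j0, j1):
            for k in range(j0, j):
                ukj = a[k][j]
                for i in range(k + 1, n):
                    a[i][j] -= a[i][k] * ukj
            pivot = a[j][j]
            for i in range(j + 1, n):
                a[i][j] /= pivot
    return a


def lu_product(lu):
    """Multiply the packed factors back together (used to check the result)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out
```

With w = 1 the sketch degenerates to a plain column-by-column left-looking factorization; larger w changes only the order in which columns of L are visited, not the computed factors.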
The second part focuses on the design, implementation and performance analysis of a shared memory parallel algorithm based on our new serial algorithm. We parallelize the computation along the column dimension of the matrix, assigning one block of columns (a panel) to a processor. The parallel algorithm retains the serial algorithm's ability to reuse cached data. We develop a dynamic scheduling mechanism that assigns tasks to available processors; one merit of this approach is that it balances the workload automatically. The algorithm attempts to schedule independent tasks on different processors. When this is not possible in the later stages of the factorization, a pipelined approach is used to coordinate dependent computations. We demonstrate that the new parallel algorithm is very efficient on shared memory machines with modest numbers of processors, such as the SGI Power Challenge, DEC AlphaServer 8400, and Cray C90/J90. We also develop performance models to study the available concurrency and identify performance bottlenecks.
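As a rough illustration of dynamic scheduling, the sketch below lets worker threads pull panel tasks from a shared ready queue as soon as their prerequisites have finished. It is an assumption-laden toy, not the scheduler described above: the function run_dynamic, the deps dictionary and the work callback are hypothetical, the dependency graph is assumed to be acyclic, and the pipelining of dependent panels is not modeled (a task simply waits until all of its prerequisites are complete).

```python
import threading
from collections import deque


def run_dynamic(deps, num_workers, work):
    """Run tasks on num_workers threads, respecting prerequisites.

    deps maps each task to the set of tasks it depends on (assumed acyclic).
    work(task) performs the task. Returns the order in which tasks finished.
    """
    remaining = {t: len(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            children[p].append(t)
    ready = deque(t for t, c in remaining.items() if c == 0)
    order = []
    cond = threading.Condition()
    unfinished = len(deps)

    def worker():
        nonlocal unfinished
        while True:
            with cond:
                # Sleep until a task is ready or everything has finished.
                while not ready and unfinished > 0:
                    cond.wait()
                if not ready:
                    return  # nothing ready and nothing unfinished
                task = ready.popleft()
            work(task)  # done outside the lock so workers can overlap
            with cond:
                order.append(task)
                unfinished -= 1
                for c in children[task]:
                    remaining[c] -= 1
                    if remaining[c] == 0:
                        ready.append(c)
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return order
```

Because idle workers take whatever task is ready next, no static assignment of panels to processors is needed, which is the source of the automatic load balancing. In CPython the global interpreter lock prevents real speedup for pure-Python work, so the sketch demonstrates only the scheduling logic.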