Communication-optimal Parallel and Sequential QR and LU Factorizations

Demmel, James; Hoemmen, Mark Frederick; Grigori, Laura; Langou, Julien; EECS Department, University of California

PDF

Description

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m x n matrices in a one-dimensional (1-D) block cyclic row layout, and is optimized for m >> n. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization.

The new algorithms are superior in both theory and practice. We have extended known lower bounds on communication for sequential and parallel matrix multiplication to provide latency lower bounds, and show these bounds apply to the LU and QR decompositions. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We also point out recent LU algorithms in the literature that attain at least some of these lower bounds.

Both TSQR and CAQR have asymptotically lower latency cost in the parallel case, and asymptotically lower latency and bandwidth costs in the sequential case. In practice, we have implemented parallel TSQR on several machines, with speedups of up to 6.7x on 16 processors of a Pentium III cluster, and up to 4x on 32 processors of a BlueGene/L. We have also implemented sequential TSQR on a laptop for matrices that do not fit in DRAM, so that slow memory is disk. Our out-of-DRAM implementation was as little as 2x slower than the predicted runtime as though DRAM were infinite.

We have also modeled the performance of our parallel CAQR algorithm, yielding predicted speedups over ScaLAPACK's PDGEQRF of up to 9.7x on an IBM Power5, up to 22.9x on a model Petascale machine, and up to 5.3x on a model of the Grid.

Details

Title

Communication-optimal Parallel and Sequential QR and LU Factorizations

Creator

Demmel, James, Author
Hoemmen, Mark Frederick, Author
Grigori, Laura, Author
Langou, Julien, Author
EECS Department, University of California, Publisher

Published

2008-08-04

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2008-89

Type

Text

Format

technical reports

Extent

136 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports

Files

Statistics

Download Full History

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Add to Basket