Communication-avoiding Parallel and Sequential QR Factorizations

Demmel, James; Hoemmen, Mark Frederick; Grigori, Laura; Langou, Julien; EECS Department, University of California


Description

We present parallel and sequential dense QR factorization algorithms that are optimized to avoid communication. Some of these are novel, and some extend earlier work. Communication includes both messages between processors (in the parallel case) and data movement between slow and fast memory (in either the sequential or parallel case).

Our first algorithm, Tall Skinny QR (TSQR), factors m x n matrices in a one-dimensional (1-D) block cyclic row layout, storing the Q factor (if desired) implicitly as a tree of blocks of Householder reflectors. TSQR is optimized for matrices with many more rows than columns (hence the name). In the parallel case, TSQR requires no more than the minimum number of messages, Theta(log P), between P processors. In the sequential case, TSQR transfers 2mn + o(mn) words between slow and fast memory, which is the theoretical lower bound, and performs Theta(mn/W) block reads and writes (as a function of the fast memory size W), which is within a constant factor of the theoretical lower bound. In contrast, the conventional parallel algorithm as implemented in ScaLAPACK requires Theta(n log P) messages, a factor of n times more, and the analogous sequential algorithm transfers Theta(mn^2) words between slow and fast memory, also a factor of n times more. TSQR uses only orthogonal transforms, so it is just as stable as standard Householder QR. Both parallel and sequential performance results show that TSQR outperforms competing methods.
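
The reduction-tree idea behind TSQR can be illustrated with a minimal sequential sketch in Python with NumPy. This is not the report's implementation: the function name tsqr_r and the parameter num_blocks are hypothetical, each row block stands in for the data owned by one processor, and only the R factor is computed (the implicit tree representation of Q is omitted). Each level of the tree stacks pairs of R factors and refactors them, so the number of levels grows as log2 of the number of blocks.

```python
import numpy as np


def tsqr_r(A, num_blocks=4):
    """Return an n x n R factor of a tall-skinny matrix A.

    Illustrative sketch only: row blocks play the role of processors,
    and a binary tree of small QR factorizations combines their R factors.
    """
    m, n = A.shape
    if m < n * num_blocks:
        raise ValueError("each row block should have at least n rows")

    # Leaf step: factor each row block independently (no communication).
    blocks = np.array_split(A, num_blocks, axis=0)
    rs = [np.linalg.qr(block, mode="r") for block in blocks]

    # Tree step: stack neighbouring R factors and refactor, halving the
    # number of R factors per level until a single one remains.
    while len(rs) > 1:
        next_level = []
        for i in range(0, len(rs), 2):
            pair = rs[i:i + 2]
            if len(pair) == 2:
                next_level.append(np.linalg.qr(np.vstack(pair), mode="r"))
            else:
                # Odd block out: pass it up to the next level unchanged.
                next_level.append(pair[0])
        rs = next_level
    return rs[0]
```

For a full-rank A, the result should agree with the R factor from a direct call to np.linalg.qr up to the signs of its rows, since R is unique only up to such sign choices.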

Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization, which removes both a latency bottleneck in ScaLAPACK's current parallel approach and the bandwidth and latency bottlenecks in ScaLAPACK's out-of-core QR factorization. CAQR achieves modeled speedups of 2.1x on an IBM POWER5 cluster, 3.0x on a future petascale machine, and 3.8x on the Grid.
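
The block-column structure described above can be sketched as a sequential blocked QR in Python with NumPy. This is a rough illustration under stated assumptions, not CAQR itself: the name blocked_qr_r and the parameter panel_width are hypothetical, there is no two-dimensional data distribution, and np.linalg.qr is used as a stand-in at the point where CAQR would invoke TSQR on the panel. It also forms each panel's Q explicitly, which a real implementation would avoid.

```python
import numpy as np


def blocked_qr_r(A, panel_width=2):
    """Reduce A (m x n, m >= n) to upper-triangular form panel by panel.

    Illustrative sketch only: each panel (block column) is factored, then
    the same orthogonal transform is applied to the trailing columns.
    """
    R = np.array(A, dtype=float)
    m, n = R.shape
    if m < n:
        raise ValueError("this sketch assumes m >= n")

    for j in range(0, n, panel_width):
        b = min(panel_width, n - j)
        # Panel factorization: CAQR would call TSQR here; np.linalg.qr
        # is only a stand-in for this sketch.
        Q, _ = np.linalg.qr(R[j:, j:j + b], mode="complete")
        # Apply Q^T to the panel and the trailing matrix in one product.
        R[j:, j:] = Q.T @ R[j:, j:]
    return R
```

The returned array has the same shape as A, with entries below the diagonal reduced to rounding-level noise rather than exact zeros.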

Details

Title

Communication-avoiding Parallel and Sequential QR Factorizations

Creator

Demmel, James, Author
Hoemmen, Mark Frederick, Author
Grigori, Laura, Author
Langou, Julien, Author
EECS Department, University of California, Publisher

Published

2008-05-29

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2008-74

Type

Text

Format

technical reports

Extent

99 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
