Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Demmel, James; Gittens, Alex; Devarakonda, Aditya; Racah, Evan; Ringenburg, Michael; Gerhardt, Lisa; Kottalam, Jey; Liu, Jialin; Maschhoff, Kristyn; Canon, Shane; Chhugani, Jatin; Sharma, Pramod; Yang, Jiyan; Harrell, Jim; Krishnamurthy, Venkat; Mahoney, Michael W.; Prabhat

Description

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
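
One common reason tall-and-skinny matrices suit a data-parallel model is that the small Gram matrix A^T A equals the sum of B_k^T B_k over blocks of rows B_k, so each block can be processed independently and the results combined with a single reduction. The sketch below illustrates that identity in plain Python. It is not the report's implementation: the function names, the block size, and the 6 x 2 example matrix are assumptions made for illustration only.

```python
from functools import reduce


def partial_gram(block):
    """Return B^T B for one block of rows (a list of equal-length rows)."""
    n = len(block[0])
    g = [[0.0] * n for _ in range(n)]
    for row in block:
        for i in range(n):
            for j in range(n):
                g[i][j] += row[i] * row[j]
    return g


def add_matrices(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def gram_by_blocks(rows, block_size):
    """Compute A^T A by mapping over row blocks and reducing with addition."""
    blocks = [rows[k:k + block_size] for k in range(0, len(rows), block_size)]
    return reduce(add_matrices, map(partial_gram, blocks))


# Hypothetical 6 x 2 tall-and-skinny matrix, used only for illustration.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
     [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]

G = gram_by_blocks(A, block_size=2)
print(G)
```

The result is an n x n matrix regardless of how many rows A has, so the reduction moves little data when n is small. In a distributed setting the map step would run on separate partitions, which is the pattern this sketch imitates on a single machine.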

Details

Title

Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Creator

Demmel, James, Author
Gittens, Alex, Author
Devarakonda, Aditya, Author
Racah, Evan, Author
Ringenburg, Michael, Author
Gerhardt, Lisa, Author
Kottalam, Jey, Author
Liu, Jialin, Author
Maschhoff, Kristyn, Author
Canon, Shane, Author
Chhugani, Jatin, Author
Sharma, Pramod, Author
Yang, Jiyan, Author
Harrell, Jim, Author
Krishnamurthy, Venkat, Author
Mahoney, Michael W., Author
Prabhat, Author

Published

2016-08-23

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2016-151

Type

Text

Format

technical reports

Extent

28 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
