Automatically Tuning Collective Communication for One-Sided Programming Models

Nishtala, Rajesh; EECS Department, University of California

PDF

Description

Technology trends suggest that future machines will rely on parallelism to meet increasing performance requirements. To aid in programmer productivity and application performance, many parallel programming models provide communication building blocks called collective communication. These operations, such as Broadcast, Scatter, Gather, and Reduce, abstract common global data movement patterns behind a simple library interface allowing the hardware and runtime system to optimize them for performance and scalability.

We consider the problem of optimizing collective communication in Partitioned Global Address Space (PGAS) languages. Rooted in traditional shared memory programming models, they deliver the benefits of sophisticated distributed data structures using language extensions and one-sided communication. One-sided communication allows one processor to directly read and write memory associated with another. Many popular PGAS language implementations share a common runtime system called GASNet for implementing such communication. To provide a highly scalable platform for our work, we present a new implementation of GASNet for the IBM BlueGene/P, allowing GASNet to scale to tens of thousands of processors.

We demonstrate that PGAS languages are highly scalable and that the one-sided communication within them is an efficient and convenient platform for collective communication. We show how to use one-sided communication to achieve 3x improvements in the latency and throughput of the collectives over standard message passing implementations. Using a 3D FFT as a representative communication bound benchmark, for example, we see a 17% increase in performance on 32,768 cores of the BlueGene/P and a 1.5x improvement on 1024 cores of the CrayXT4. We also show how the automatically tuned collectives can deliver more than an order of magnitude in performance over existing implementations on shared memory platforms.

There is no obvious best algorithm that serves all machines and usage patterns demonstrating the need for tuning and we thus build an automatic tuning system in GASNet that optimizes the collectives for a variety of large scale supercomputers and novel multicore architectures. To understand the large search space, we construct analytic performance models use them to minimize the overhead of autotuning. We demonstrate that autotuning is an effective approach to addressing performance optimizations on complex parallel systems.

Details

Title

Automatically Tuning Collective Communication for One-Sided Programming Models

Creator

Nishtala, Rajesh, Author
EECS Department, University of California, Publisher

Published

2009-12-15

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

EECS-2009-168

Type

Text

Format

technical reports

Extent

178 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports

Files

Statistics

Download Full History

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Add to Basket