We present compiler optimization techniques for explicitly parallel programs that communicate through a shared address space. The source programs are written in a single program multiple data (SPMD) style, and our machine target is a multiprocessor with physically distributed memory and hardware or software support for a single address space. The source language involves normal read and write operations on the address space, which correspond either to local memory operations or to communication over an interconnect network.

The remote operations result in high latencies, but much of the latency can be overlapped with local computation or initiation of further remote operations. Non-blocking memory operations allow this overlap to be expressed directly. However, overlap is difficult for programmers to do by hand; it can lead to subtle program errors, since the order in which operations complete is no longer obvious. Programmers writing explicitly parallel code expect reads and writes from a single thread to take effect in program order, a property called sequential consistency. The use of non-blocking memory operations might yield executions that violate sequential consistency.

We provide a new algorithm for static parallel program analysis to detect memory operations that can safely be made non-blocking. The analysis requires dependency information across and within threads, and builds on earlier work by Shasha and Snir. We improve their results by providing a more efficient algorithm for SPMD programs, and by improving the accuracy of the analysis through the use of synchronization information. Using the results of this analysis, we show how to optimize parallel programs by changing blocking operations into non-blocking ones, performing code motion to increase the time for communication overlap, and caching remote values to eliminate some read accesses entirely.

We show the potential payoff from each of our optimizations on real applications, using hand-transformed programs. The experiments are done on a CM-5 multiprocessor using the Split-C runtime system, which provides a software implementation of a global address space and both blocking and non-blocking memory operations.




Download Full History