Building portable parallel programs on distributed memory multiprocessors and workstation networks is a complex task that is greatly facilitated by powerful infrastructure. In this dissertation, we develop important components of that infrastructure, focusing on irregular applications such as unstructured mesh computations, search problems, and discrete event simulation. We use a library-based approach to building such applications. The library provides a uniform programming interface on multiple platforms and has highly tuned implementations developed by the library programmer. Therefore, applications built on the library can be portable both in functionality and in performance.

We describe the major components of our parallel data structure library called Multipol, including two of the more irregular data structures and one application. The two data structures are a task stealer for dynamic load balancing and an event graph for discrete event simulation. The application is a timing-level circuit simulator for combinational circuits. We analyze the workloads of several applications built by the Multipol group and quantitatively characterize their irregularities.

The Multipol library is built on a runtime layer consisting of threads as well as communication mechanisms. The thread layer supports a basic computational abstraction called fibers, which are code sequences that appear to execute atomically. The fiber abstraction enables a portable multithreading execution environment for latency hiding. The thread layer also allows the programmer to supply customized schedulers to enforce application-specific scheduling policies. The communication layer provides portable primitives for expressing irregular communication. It uses a technique called message aggregation to trade the excess parallelism in the application for better communication bandwidth.

We provide a new performance profiling toolkit called Mprof to help tune the performance of irregular parallel programs. Mprof identifies two major sources of performance inefficiency: overhead and insufficient parallelism. It uses statistical modeling to extract reusable cost models from benchmark executions. The cost models are combined with high-level statistics collected from an actual execution to provide low-overhead profiling information. Mprof also provides a performance interface for the library programmer to customize the profiling information and thereby preserve the library abstraction. Using information from Mprof, we optimize the performance of several irregular applications and demonstrate the performance portability of the Multipol library and runtime layer.




Download Full History