Description
We describe the major components of our parallel data structure library called Multipol, including two of the more irregular data structures and one application. The two data structures are a task stealer for dynamic load balancing and an event graph for discrete event simulation. The application is a timing-level circuit simulator for combinational circuits. We analyze the workloads of several applications built by the Multipol group and quantitatively characterize their irregularities.
The Multipol library is built on a runtime layer consisting of threads as well as communication mechanisms. The thread layer supports a basic computational abstraction called fibers, which are code sequences that appear to execute atomically. The fiber abstraction enables a portable multithreading execution environment for latency hiding. The thread layer also allows the programmer to supply customized schedulers to enforce application-specific scheduling policies. The communication layer provides portable primitives for expressing irregular communication. It uses a technique called message aggregation to trade the excess parallelism in the application for better communication bandwidth.
We provide a new performance profiling toolkit called Mprof to help tune the performance of irregular parallel programs. Mprof identifies two major sources of performance inefficiency: overhead and insufficient parallelism. It uses statistical modeling to extract reusable cost models from benchmark executions. The cost models are combined with high-level statistics collected from an actual execution to provide low-overhead profiling information. Mprof also provides a performance interface for the library programmer to customize the profiling information and thereby preserve the library abstraction. Using information from Mprof, we optimize the performance of several irregular applications and demonstrate the performance portability of the Multipol library and runtime layer.