This dissertation describes a methodology for compiling and executing irregular parallel programs. Such programs implement parallel operations whose size and work distribution depend on input data. Irregular operations pose a particularly difficult scheduling problem because the information necessary to execute these operations efficiently can not be known at the time the program is compiled. This dissertation describes a set of four run-time scheduling techniques that can execute many irregular parallel programs efficiently. A common thread among these techniques is that they gather information about the work distribution of a program during its execution and use this information to adjust the allocation of processing resources.

The most important contribution of this dissertation is its identification and exploitation of work distribution locality properties. Previous work on irregular parallel program scheduling unearthed the following dilemma: compilers can not predict work distribution accurately enough to schedule programs efficiently; however, runtime load balancing solutions, while more accurate, incur prohibitive overhead. This dissertation shows how to avoid this dilemma whenever irregular loops within parallel programs have work distribution locality, that is, when a loop retains a similar distribution of individual iteration execution times from one execution instance to the next. An execution instance is simply an execution of the entire loop, possibly in parallel.

Where this common case arises, we exploit it through work distribution caching: guessing the work distribution of a loop execution instance based on earlier measurements. We also exploit work distribution locality through deferred load balancing: reducing the communication overhead and thrashing potential of load balancing algorithms by applying them across multiple execution instances of a loop.

We evaluated these scheduling techniques using a set of application programs, including climate modeling, circuit simulation, and x-ray tomography, that contain irregular parallel operations. The results demonstrate that, for these applications, the techniques described in this dissertation achieve near-optimal efficiency on large numbers of processors. In addition, they perform significantly better, on these problems, than any previously proposed static or dynamic scheduling method.




Download Full History