The focus of this thesis lies between the theory and practice of scheduling: it spans modeling, performance analysis, and practical algorithmics. We present a variety of new techniques for scheduling problems relevant to parallel scientific computing. The thesis progresses from new compile-time algorithms for message scheduling, through new runtime algorithms for processor scheduling, to a unified framework for allocating multiprocessor resources to competing jobs while optimizing both individual application performance and system throughput.
The compiler algorithm schedules network communication for parallel programs accessing distributed arrays. By analyzing and optimizing communication patterns globally, rather than at the level of individual statements, we often reduce communication costs by factors of two to three in an implementation based on IBM's High Performance Fortran compiler.
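As a toy illustration (not the thesis's actual compiler algorithm), the sketch below shows why a global view of communication pays off: coalescing all values bound for the same processor into one message replaces many small, latency-bound transfers with a few large ones. The access pattern here is hypothetical.

```python
def naive_messages(accesses):
    """One message per remote array access (statement-level view)."""
    return len(accesses)

def coalesced_messages(accesses):
    """One message per destination processor (global view)."""
    return len({dest for dest, _ in accesses})

# Hypothetical access pattern: (destination processor, array index) pairs
# for 100 remote reads spread over 4 processors.
accesses = [(i % 4, i) for i in range(100)]
print(naive_messages(accesses))      # 100 small messages
print(coalesced_messages(accesses))  # 4 aggregated messages
```

With per-message startup costs dominating on real networks, reducing the message count by such a factor is where the two-to-threefold savings come from.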
The best parallelizing compilers at present support regular, static, array-based parallelism, but parallel programmers are outgrowing this model. Many scientific and commercial applications have a two-level structure: the outer level is a potentially irregular and dynamic task graph, and the inner level comprises relatively regular parallelism within each task. We give new runtime algorithms for allocating processors to such tasks. The result can be a twofold increase in effective megaflops, as demonstrated by an implementation based on ScaLAPACK, a library of scientific software for scalable parallel machines.
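A minimal sketch of the allocation idea, under a deliberately simplified assumption of perfect speedup within each task (the thesis's algorithms handle more realistic models): giving each task a processor share proportional to its work equalizes finishing times across tasks. The function name and inputs are illustrative only.

```python
def proportional_allocation(work, total_procs):
    """Allocate total_procs among tasks in proportion to their work.

    Returns integer processor counts, at least 1 per task.
    """
    total = sum(work)
    alloc = [max(1, round(total_procs * w / total)) for w in work]
    # Trim any rounding overshoot, taking processors from the largest shares.
    while sum(alloc) > total_procs:
        i = alloc.index(max(alloc))
        alloc[i] -= 1
    return alloc

# Three tasks with work 10, 30, 60 on a 10-processor machine.
print(proportional_allocation([10, 30, 60], 10))  # [1, 3, 6]
```

Under perfect speedup each task then finishes at time proportional to total work divided by total processors; with imperfect speedup the allocation must also account for each task's efficiency curve.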
Compilers and runtime systems target single programs. Other system software must schedule resources across multiple programs. For example, a database scheduler or a multiprocessor batch queuing system must allocate many kinds of resources among multiple programs. Some resources, such as processors, may be traded for time; others, such as memory, may not. Moreover, the goal is not to finish a fixed set of programs as fast as possible, but to minimize the average response time of the programs, perhaps weighted by priority. We present new algorithms for such problems.
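To make the weighted response-time objective concrete, here is a classical single-resource special case (Smith's rule, not the thesis's multi-resource algorithms): ordering jobs by increasing processing-time-to-weight ratio minimizes the weighted sum of completion times. The job data is hypothetical.

```python
def weighted_completion_time(jobs):
    """jobs: list of (processing_time, weight) pairs, run in the given order.

    Returns sum over jobs of weight * completion_time.
    """
    clock, total = 0, 0
    for p, w in jobs:
        clock += p
        total += w * clock
    return total

jobs = [(4, 1), (1, 3), (2, 2)]                    # hypothetical (time, weight)
smith = sorted(jobs, key=lambda j: j[0] / j[1])    # ascending p/w ratio
print(weighted_completion_time(smith))  # 16
print(weighted_completion_time(jobs))   # 33 in the original order
```

The multi-resource problems in this thesis generalize this objective to settings where some resources (processors) trade against time and others (memory) do not.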
Most of the above results assume a central scheduler with global knowledge. In a distributed setting, decentralized techniques are needed. We study how the local knowledge available to decentralized, per-processor schedulers affects load balance in fine-grained task-parallel applications. In particular, we give new protocols for distributed load balancing and bounds on the trade-off between locality and load balance. The analysis is supported by experience implementing task queues in Multipol, a library for coding dynamic, irregular parallel applications.
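A toy sketch of one decentralized technique, randomized work stealing, which is a standard approach to distributed task queues (offered here as an illustration, not as Multipol's actual protocol): each processor works from its own queue and, when idle, steals from a randomly chosen victim, so no scheduler needs global knowledge of the load.

```python
import random
from collections import deque

def run(queues, rng):
    """Simulate round-based work stealing until all tasks finish.

    queues: one deque of tasks per processor.
    Returns the number of tasks each processor executed.
    """
    done = [0] * len(queues)
    while any(queues):
        for p, q in enumerate(queues):
            if q:
                q.pop()                  # execute a task from the local end
                done[p] += 1
            else:
                victim = rng.randrange(len(queues))
                if queues[victim]:
                    # Steal the oldest task, preserving the victim's locality.
                    q.append(queues[victim].popleft())
    return done

rng = random.Random(0)
queues = [deque(range(16)), deque(), deque(), deque()]  # all work on proc 0
print(run(queues, rng))
```

Stealing from the old end of the victim's queue while executing from the local end is the usual compromise between locality and load balance: recently created tasks, whose data is likely still in the owner's cache, stay local, while the balance of work spreads out.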