Description
A trace-driven simulation study of dynamic load balancing in homogeneous distributed systems supporting broadcasting is presented. We use information about job CPU and I/O demands collected from a production system as input to a simulation model that includes a representative CPU scheduling policy and considers the message exchange and job transfer costs explicitly. Seven load balancing algorithms are simulated and their performances compared. We find that load balancing is capable of significantly reducing the mean and standard deviation of job response times, especially under heavy system load, and for jobs with high resource demands. The performances of all hosts, even those originally with light loads, are generally improved by load balancing. The reduction of the mean response time increases with the number of hosts, but levels off at around 30 hosts. Algorithms based on periodic or non-periodic load information exchange provide similar performance, and, among the periodic policies, the algorithms that use a distinguished agent to collect and distribute load information cut down the overhead and scale better. They are also the most appropriate algorithms for adaptive load balancing, which has the potential of offering near-optimal performance under a wide spectrum of system configurations and load conditions. System instability in the form of host overloading is possible when the load information is not up-to-date and the system is under heavy load; however, this undesirable phenomenon can be alleviated by simple measures. Load balancing is still very effective even when up to half of the eligible jobs have to be executed locally. The trace-driven simulation approach to the study of load balancing is found to be critical and effective, an d is recommended for use before implementation efforts.