Writing distributed programs is difficult for at least two reasons. First, distributed computing environments present new problems caused by asynchrony, independent time bases, and communication delays. Second, there is a lack of tools to help programmers understand the programs they have written. The tools used in single-machine environments do not easily generalize to a distributed environment, and previous systems that tried to help the programmer develop, debug, and measure distributed programs have had only limited success.

To better understand distributed programs, we have specified a model for distributed computations, developed a measurement methodology from this model, constructed tools to implement the measurements, and developed data analysis techniques to obtain useful results from the measurements. The most important feature of the model, methodology, and tools is consistency among the programmer's view, the computation model, the measurement methodology, and the analysis.

This consistency has resulted in several benefits. The first is a simplicity of structure throughout the measurement and analysis tools. The second benefit is the ease of obtaining useful information about a program's behavior.

The model of distributed programs defines the two basic actions of a program to be computation and communication. Our research focuses on the communications performed by a program. The measurement model is based on the monitoring of communications between the parts of a program. Given our definition of a program, monitoring communications completely encapsulates the behavior of a computation. From the measurement model, we have constructed tools to measure distributed programs for two working operating systems, UNIX and DEMOS/MP. These measurement tools provide data on the interactions between the parts of a distributed program.

We have developed a number of analysis techniques to provide information from the data collected. We can report communications statistics on message counts, queue lengths, and message waiting times. We can perform more complex analyses, such as measuring the amount of parallelism in the execution of a distributed program. The analyses also include detecting paths of causality through the parts of a distributed program. The measurement tools and analyses can be structured so that results can be fed back into the operating system to help with scheduling decisions.
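As an illustration only, the simplest of these statistics could be computed from a trace of communication events along the following lines. The sketch is in Python and is not the implementation used in the tools described here. The Event record, its field names, and the function communication_stats are hypothetical. It also assumes that all timestamps come from a single global clock and that a message joins the receiver's queue at the moment it is sent, which ignores both the independent time bases and the transit delays that a real distributed measurement must handle.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One traced communication event (hypothetical record layout)."""
    time: float   # timestamp; ASSUMPTION: a single global clock
    kind: str     # "send" or "receive"
    src: str      # sending process
    dst: str      # receiving process
    msg_id: int   # identifier pairing a send with its receive


def communication_stats(events):
    """Return (message counts, peak queue lengths, mean waiting time)."""
    counts = defaultdict(int)      # (src, dst) -> messages sent
    queue_len = defaultdict(int)   # dst -> messages sent but not yet received
    max_queue = defaultdict(int)   # dst -> peak value of queue_len
    sent_at = {}                   # msg_id -> time of the send event
    waits = []                     # receive time minus send time, per message

    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "send":
            counts[(ev.src, ev.dst)] += 1
            sent_at[ev.msg_id] = ev.time
            # ASSUMPTION: the message is queued at the receiver immediately.
            queue_len[ev.dst] += 1
            max_queue[ev.dst] = max(max_queue[ev.dst], queue_len[ev.dst])
        elif ev.kind == "receive":
            # Raises KeyError if the trace has a receive with no earlier send.
            waits.append(ev.time - sent_at.pop(ev.msg_id))
            queue_len[ev.dst] -= 1

    mean_wait = sum(waits) / len(waits) if waits else 0.0
    return dict(counts), dict(max_queue), mean_wait
```

Under these assumptions, the waiting time here combines transit and queueing delay; separating the two would need an additional event recording each message's arrival at the receiver.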




