Our methodology is to use analytical models combined with empirical measurements of important kernels in order to choose between algorithms and data layout schemes. We can then implement efficient and re-usable data structures to hide the complexities of distributed memory programming. Detailed measurements of the application and the machines are then used to determine machine-specific optimizations. Using this technique we achieved good speedups on all three machines. This paper describes the results of each stage and discusses their implications for multiprocessor designers.