A cache coherence scheme is a critical part of a shared memory multiprocessor because it relieves the programmer of the burden of moving shared data among local and remote memories. In this dissertation several new techniques are described for implementing and evaluating the performance of cache-coherent multiprocessors. The first contribution of this work is a generalization of shared bus stack simulation techniques that supports directory-based cache coherence schemes. This is desirable because stack simulation Permits the evaluation of multiple cache sizes in a single simulation run. Results are presented for three benchmark programs, three directory methods, and multiple cache, block and multiprocessor sizes. The results quantify the tradeoffs between network traffic and miss ratio that are possible by varying the number of updates in a competitive directory scheme. These results extend previous studies of shared bus architectures by accounting for point-to-point network traffic and larger numbers of processors. A second contribution is an ap- proximation technique for analyzing interconnection networks as open, acyclic networks of finite queues. The technique combines the algorithms used in the Bell Laboratories Queueing Network Analyzer with a known algorithm for approximating the effect of finite buffers. The combined algorithm permits analysis of queueing networks with hundreds of queues and is applicable to a broad class of interconnection network, including hypercubes, meshes, tori and Delta networks. Using traffic estimates from cache simulations, this analysis technique is applied to a number of alternative networks. Good cache and network performance requires good synchronization support. The latter part of this dissertation describes several efficient implementations of fetchtop synchronization primitives. The implementations are suitable for hardware or software, and can be easily modified to support multiprefix operations.




Download Full History