The major limitation on the performance of shared memory multiprocessors running parallel programs is the memory traffic due to sharing, i.e., the coherence or consistency induced memory traffic. Much of this traffic occurs due to false sharing (when two or more processors use disjoint portions of the same cache block) and dead sharing (the transfer of unreferenced words in a block when the block moves between caches). Dead and false sharing can be minimized or eliminated by the use of a small block size, but at the cost of substantially increased miss ratios due to true sharing and regular cache misses. Because shared memory multiprocessors are likely to be used most often for the multiprogramming of single thread programs and not for parallel programs, optimizing solely for the latter case is a poor idea.
In this paper, we present a new cache protocol, Minerva, which allows the effective cache block size to vary dynamically. Minerva works using sector caches (also known as block/subblock caches). Cache consistency attributes (from the MESI set of states) are associated with each 4-byte word in the cache, and consistency is maintained on a word basis. Each block (sector) in each cache can itself have one of the attributes of: invalid, exclusive or shared. Each block also has a current subblock (subsector) size, of 2^k words and a confidence value for that size. The subblock size is reevaluated every time there is an external access (read or invalidate) to the block, and is changed when the confidence level reaches zero. When a fetch miss occurs within a block, a subblock equal to the current subblock size is fetched. Note that the fetch may involve a gather operation, with various words coming from different sources; some of the words may already be present.
Despite the apparent complexity of the protocol, it can be implemented with fairly simple combinational logic. The nature of Minerva makes it easy and convenient to also implement optimizations such as bus read-sharing, read-broadcast (snarfing), and write-validate. For non-parallel workloads, Minerva converges to the Illinois protocol on 64-byte blocks, so performance for conventional workloads is also high.
Depending on the assumed cache sizes, block sizes, and bus timings, we find that Minerva reduces execution times by 19-40%, averaged over 12 test parallel programs. Our evaluation considers the utility of various other optimizations, compares the use of Minerva with restructuring the code, and considers the extra state bits required.
Title
Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance
Published
1999-12-10
Full Collection Name
Electrical Engineering & Computer Sciences Technical Reports
Other Identifiers
CSD-99-1087
Type
Text
Extent
27 p
Archive
The Engineering Library
Usage Statement
Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).