Averages can be deceiving, however, because the results sometimes vary widely. This phenomenon is especially pronounced for strided accesses, found in the vertical image access pattern, whose performance is highly dependent on the stride. Hardware-based data layout alternatives are examined for their effect on strided memory performance. An alternative layout modestly improves the mean performance of the vertical access pattern, but it increases the variance and decreases the performance of some particular cases. A simple address hashing scheme decreases the variance and increases the performance of some particular cases, but it decreases the mean performance of the vertical access pattern.
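The stride dependence described above can be illustrated with a toy model. The sketch below is not the VIRAM-1 address mapping: the bank count, the word-granularity interleaving, and the XOR-based hash are all assumptions chosen for illustration, and `banks_touched` is a hypothetical helper. It counts how many distinct banks a strided stream reaches, with and without hashing.

```python
def banks_touched(stride, num_banks=8, num_accesses=64, hashed=False):
    """Count the distinct banks hit by a strided stream of word addresses.

    Assumptions (not taken from VIRAM-1): num_banks is a power of two,
    banks are interleaved on the low address bits, and the optional hash
    XORs the bank index with the next-higher group of address bits.
    """
    bits = num_banks.bit_length() - 1  # log2(num_banks) for a power of two
    seen = set()
    for i in range(num_accesses):
        addr = i * stride
        bank = addr % num_banks  # plain low-order interleaving
        if hashed:
            bank ^= (addr >> bits) % num_banks  # fold in higher address bits
        seen.add(bank)
    return len(seen)


if __name__ == "__main__":
    for stride in (1, 2, 8, 9, 16):
        plain = banks_touched(stride)
        mixed = banks_touched(stride, hashed=True)
        print(f"stride {stride:2d}: {plain} banks plain, {mixed} banks hashed")
```

In this toy model a stride equal to the bank count sends every access to a single bank under plain interleaving, and the hash spreads the same stream over all eight banks. The hash is not free, however: the first eight accesses of a stride-9 stream all land in one bank once hashing is enabled, although they reach eight different banks without it. This mirrors, in miniature, the trade-off between improving particular cases and degrading others.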
The bottlenecks to performance within the memory system are sometimes bank conflicts, sometimes sub-bank conflicts, and sometimes a mixture of the two. When sub-bank conflicts are a significant factor, performance increases significantly if each bank within the DRAM is divided into sub-banks. In these cases, load bandwidth is higher than store bandwidth because stores keep a sub-bank busy for additional time. Other factors limiting the performance of the VIRAM memory system include short vectors, insufficient issue bandwidth, and the effects of a simplified pipeline control. Loop unrolling is necessary for maximizing performance when there is insufficient issue bandwidth to keep one or both memory units busy, as in the horizontal and blocked image access patterns. Data alignment is only significant on unit stride accesses when there is sufficient issue bandwidth to keep the vector memory unit(s) busy.
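The effect of sub-banks and of the longer store busy time can be sketched with a small timing model. Everything here is an assumption made for illustration: the in-order, one-access-per-cycle issue rule, the mapping of addresses to sub-banks, and the busy times of 4 cycles for loads and 6 for stores are hypothetical values, not VIRAM-1 parameters, and `cycles_for_stream` is a hypothetical helper.

```python
def cycles_for_stream(addresses, num_banks=8, sub_banks=1, busy=4):
    """Cycles needed to issue every access in order, one per cycle at most.

    Each (bank, sub-bank) pair stays busy for `busy` cycles after an access
    starts; an access to a busy pair stalls the whole stream until it frees.
    The address-to-sub-bank mapping is an assumption for illustration.
    """
    free_at = {}  # (bank, sub_bank) -> first cycle the pair is free again
    cycle = 0
    for addr in addresses:
        bank = addr % num_banks
        sub = (addr // num_banks) % sub_banks
        unit = (bank, sub)
        start = max(cycle, free_at.get(unit, 0))  # stall while the pair is busy
        free_at[unit] = start + busy
        cycle = start + 1  # the next access can issue one cycle later at best
    return cycle


if __name__ == "__main__":
    # Sixteen accesses with a stride equal to the bank count: all hit bank 0.
    stream = [i * 8 for i in range(16)]
    print("1 sub-bank, loads :", cycles_for_stream(stream, sub_banks=1, busy=4))
    print("4 sub-banks, loads:", cycles_for_stream(stream, sub_banks=4, busy=4))
    print("4 sub-banks, stores:", cycles_for_stream(stream, sub_banks=4, busy=6))
```

Under these assumed parameters the stream takes 61 cycles with a single sub-bank per bank, 16 cycles with four sub-banks and the load busy time, and 22 cycles with four sub-banks and the longer store busy time. The absolute numbers carry no meaning for the real hardware; the sketch only shows why dividing a bank into sub-banks relieves a conflict-bound stream and why the longer store busy time lowers store bandwidth relative to load bandwidth.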
The memory system is a limiting factor in the ability of the vector unit to scale effectively in both the number of lanes and the number of address generators. Scaling improves as the number of sub-banks increases for cases in which sub-bank conflicts are a significant factor.
Even though there are limitations to scaling, and all but the unit stride accesses of the horizontal image access pattern achieve less than the peak performance, the absolute performance of VIRAM-1 is impressive compared to conventional, cache-based machines. For comparison, the measured unit stride performance of a memory-to-memory copy on a PC running at twice the clock frequency of VIRAM-1 is only 304.0 MB/s, a small fraction of the sustainable unit stride bandwidth of VIRAM-1.