Description
Network-on-Chip provides an effective and scalable way to integrate hundreds of heterogeneous cores without forcing each to give up its own PVT-induced operating point for the chip-wide common worst case. As with asynchronous logic, a NoC of regular, redundant, many-CLK/Vdd cores can deliver the average rather than the worst case system performance with greater power efficiency and fault tolerance than its globally synchronous monolithic counterparts. This work shows that the Voltage-Frequency Island (VFI) architectures are also the key to tolerating and compensating for PVT variations.
The VFI advantages cannot be realized without run-time task-to-core mapping and adaptive network routing that optimally match application resource requirements with heterogeneous cores and communication fabric. These systematic techniques are more effective at mitigating a variety of faults and variations than layout and circuit DFM. Most importantly, the gains from these techniques can be translated into die yield improvements and smaller DFM guard-bands.
This work investigates core sparing and network routing. The developed models demonstrate that core sparing reduces the die cost asymptotically from O(A^3) to O(A^1/2), and it is more cost efficient than larger design guard-bands of layout and circuit redundancy. The analysis outcome favors a greater number of smaller unreliable cores as opposed to a fewer larger reliable cores given a fixed die area. This points to the limitations and ultimately the futility of DFM techniques in the future semiconductor process generations.
Adaptive network routing enables core sparing. More critically, it simultaneously combats the two sources of network load imbalance: on-die performance heterogeneity from PVT variations and application communication topology. With stochastic PVT variations, the developed Minimal Adaptive Total Congestion (MATC) router increases the expected network saturation bandwidth by 7-23% and reduces its variance by 2-10x as compared to the Dimension Order router. With systematic PVT variations, the improvements are 5-35%. These gains of the adaptive router can compensate for degradation due to performance variations and can thus be used to reduce design guard-bands.
By treating cores as units of fault and variation tolerance, these systematic techniques provide a simple and consistent way to deal with static and dynamic performance variations and faults. These techniques are more effective than isolated DFM solutions. Rather than fighting and minimizing the on-die parametric variations, our approach takes advantage of the platform heterogeneity and manages its net system performance impact.