Improving energy efficiency is critical to increasing computing capability, from mobile devices operating with limited battery capacity to servers operating under thermal constraints. The widely accepted solution to improving energy efficiency is dynamic voltage and frequency scaling (DVFS), where each block in a design operates at the minimum voltage required to meet performance constraints at a given time. However, variation-induced SRAM bitcell failures in caches at low voltage limit voltage scaling---and therefore energy-efficiency improvements---in advanced process nodes. Analyzing and modeling bitcell failures is necessary to develop resiliency techniques that prevent or tolerate SRAM failure to improve the minimum operating voltage of caches.

This work demonstrates a holistic approach that uses both circuit-level and architecture-level design techniques to improve low-voltage operation of SRAM. A simulation framework and experimental measurements from a 28nm testchip explore failure mechanisms of SRAM bitcells. The simulation framework is further utilized to evaluate the effectiveness of circuit-level SRAM assist techniques using dynamic failure metrics. New circuit-level techniques that use replica timing are developed to make SRAM macros more resilient to process variation. An architecture-level error model is developed to translate bitcell failure probability to yield, and to evaluate a variety of error-correcting code (ECC) and redundancy-based resiliency techniques. New resiliency schemes, named dynamic column redundancy (DCR) and bit bypass (BB), are proposed to tolerate a high bitcell failure rate with low overhead.

The methodology and proposed schemes were validated with five different 28nm chips. The RAVEN1 testchip measured in-situ threshold voltage variation of 30,000 bitcells and was used to analyze the effect of random telegraph noise on failures. The RAVEN2 testchip explored a single-p-well bitcell design that can compensate for global process variation. The RAVEN3 and RAVEN3.5 chips included processors with on-chip switched-capacitor voltage conversion, and resilient SRAM macros with circuit-level techniques that enable operation down to 0.45V. The SWERVE processor included architecture-level techniques to avoid failing cells, and decreases energy by over 30% with only 2% area overhead, and includes in-situ pipelined ECC to measure the contribution of intermittent SRAM error sources such as random telegraph noise and aging.

Overall, this dissertation describes a general methodology that is used to evaluate resilient design technique effectiveness for different process and architecture assumptions, and proposes a new set of resilient design techniques that lower the minimum operating voltage of caches with low overhead.





Download Full History