Getting a Handle on Variability
Laura Peters, Lead Technical Editor -- Semiconductor International, 11/1/2007
“The inability to scale the tolerance of multiple electrical parameters along with their nominal value has contributed to a virtual crisis in the ability to improve performance and power consumption in new processes,” said Sani Nassif and colleagues at IBM's Research Center (Austin, Texas) and the company's T.J. Watson Research Center (Yorktown Heights, N.Y.). In fact, CMOS variability is becoming a major design limiter that must be managed judiciously.
Variability occurs over widely divergent timescales, and several different mechanisms affect device delay (Table). The researchers, who will present a paper at the International Electron Devices Meeting (IEDM) next month, believe the response to variability needs to encompass the process, device and circuit levels, and must address both systematic and random components. An example of a random component is threshold voltage (Vt) variability caused by random dopant fluctuations. This source of variability can be reduced by removing most of the doping, as in finFETs, but interfaces still exhibit randomness. An example of a systematic component is the variation in interconnect thickness with layout density caused by chemical mechanical planarization (CMP). Systematic variability can be minimized by appropriately adapting the design. Unfortunately, the same does not apply to random variability.
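The distinction between the two components can be illustrated with a small numerical sketch. It is not taken from the IBM paper: the linear dependence of delay on layout density, the coefficient `k_sys`, the noise level `sigma_rand` and the function names are all assumptions chosen for illustration. The point is only that a known, layout-dependent term can be subtracted out by design, while the random term remains.

```python
import random
import statistics

def simulate_delays(densities, k_sys=0.10, sigma_rand=0.03, seed=0):
    """Normalized delay = 1 + systematic(density) + random noise.

    k_sys and sigma_rand are hypothetical values, not measured data.
    """
    rng = random.Random(seed)
    return [1.0 + k_sys * d + rng.gauss(0.0, sigma_rand) for d in densities]

def compensate(delays, densities, k_sys=0.10):
    """Remove the known, layout-dependent systematic term."""
    return [t - k_sys * d for t, d in zip(delays, densities)]

# 100 hypothetical sites with layout density spread evenly from 0 to 1.
densities = [i / 99 for i in range(100)]
raw = simulate_delays(densities)
fixed = compensate(raw, densities)

raw_sigma = statistics.pstdev(raw)      # systematic + random spread
fixed_sigma = statistics.pstdev(fixed)  # random spread only

print(f"spread before compensation: {raw_sigma:.4f}")
print(f"spread after compensation:  {fixed_sigma:.4f}")
```

Compensation narrows the spread but cannot bring it to zero, because the residual is the random component.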
In a fab environment, variability also has a spatial component, arising from within-die, die-to-die, within-wafer, wafer-to-wafer and lot-to-lot differences. Variability is reduced through yield learning, but economics dictate that variability not be reduced any more than necessary to maintain profitability.
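One common way to reason about these spatial levels is to treat them as independent contributions whose variances add. The sketch below works under that assumption, which the article does not state, and the sigma values are hypothetical placeholders rather than fab data. It shows how a total spread and each level's share of the variance could be computed.

```python
import math

def total_sigma(components):
    """Root-sum-square of independent standard deviations."""
    return math.sqrt(sum(s * s for s in components.values()))

def variance_shares(components):
    """Fraction of the total variance contributed by each level."""
    total_var = sum(s * s for s in components.values())
    return {name: s * s / total_var for name, s in components.items()}

# Hypothetical sigmas, in percent of nominal delay (illustrative only).
sigmas = {
    "within-die": 4.0,
    "die-to-die": 3.0,
    "within-wafer": 2.0,
    "wafer-to-wafer": 2.0,
    "lot-to-lot": 4.0,
}

print(f"total sigma: {total_sigma(sigmas):.2f}%")
for name, share in variance_shares(sigmas).items():
    print(f"{name:>15}: {share:.1%} of variance")
```

A breakdown of this kind indicates which level offers the largest return on further yield learning, which matters when the economics limit how much variability reduction is worthwhile.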
The circuit implementation and, more importantly, the circuit architecture also affect delay variability. For instance, a comparison of the normalized delay variability of eleven 16-bit adders implemented in a variety of styles shows that unit-level variability and gate-level variability are not necessarily the same. In addition, differences exist among physical contributors to variation, such as Vt and oxide thickness; dimensional contributors, such as length and width; and environmental contributors, such as Vdd.
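One reason unit-level and gate-level variability can differ is averaging along a path. As a rough sketch, and not the analysis used in the paper, assume a unit's critical path is a chain of identical gates with the same relative delay spread and a uniform pairwise correlation. The function name and the numbers below are illustrative assumptions.

```python
import math

def path_sigma_over_mean(n_gates, gate_sigma_over_mean, correlation=0.0):
    """Relative delay spread of a chain of n identical gates.

    Each gate has mean delay 1 and standard deviation gate_sigma_over_mean;
    'correlation' is the assumed pairwise correlation between gate delays.
    """
    s2 = gate_sigma_over_mean ** 2
    variance = n_gates * s2 + n_gates * (n_gates - 1) * correlation * s2
    return math.sqrt(variance) / n_gates

# Hypothetical 16-gate path with 8% spread per gate.
independent = path_sigma_over_mean(16, 0.08, correlation=0.0)
fully_correlated = path_sigma_over_mean(16, 0.08, correlation=1.0)

print(f"independent gates:      {independent:.3f}")
print(f"fully correlated gates: {fully_correlated:.3f}")
```

Under these assumptions, purely random gate-level variation averages out along the path, while fully correlated variation passes through to the unit level unchanged. The logic depth and correlation implied by an architecture therefore shape its delay spread.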
The effect of device variability at the circuit and system level is becoming amplified. In his paper, also scheduled to be presented at IEDM, Luca Benini of DEIS, Università di Bologna (Italy), will discuss how to design reliable systems with unreliable devices. He states, “Random variations at the finest granularity level [i.e., among elementary devices] are rapidly growing, and they falsify the assumption that identical replicas of a subsystem behave identically in fault-free conditions.” He notes that billion-device integration, combined with increased variations, failures and aging, drives the reliability of CMOS ICs so low that yield and mean time between failures (MTBF) are insufficient, even for the cost-constrained embedded market. Therefore, the challenge is to identify the most variability-tolerant methods for design. He proposes three approaches.
Parsimonious robustness — In very cost-constrained markets, the aim is to achieve robustness with more limited hardware than would be required by duplication or triple-module redundancy methods. This can be achieved by focusing on high-impact reliability threats, such as timing errors caused by slow signal propagation; selectively exploiting existing hardware redundancy, such as design for test scan resources; or providing protection only for structures with a high vulnerability factor.
Dynamic reliability management (DRM) — In the future, DRM will be integrated with other forms of resource management, such as power and quality of service. The approach involves correlating settings with failures. DRM is most effective for gradual variations and failure mechanisms that can be effectively monitored and predicted.
Multicore reliability — The move to multicore and polycore architectures to reduce clock frequencies, which relaxes design constraints on circuit speed and dynamic power density, has little effect on reliability. However, these architectures offer functional redundancy and the ability to allocate and schedule workloads in a reliable fashion.
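For context, the triple-module redundancy baseline that the first approach tries to undercut, and that spare cores could provide as functional redundancy, can be sketched as a majority vote over three replicas. This is a generic illustration, not Benini's method. The function names are hypothetical, and the failure formula assumes the three replicas fail independently with the same probability.

```python
from collections import Counter

def majority_vote(results):
    """Return the value reported by a strict majority, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

def tmr_failure_probability(p):
    """Probability that at least two of three independent replicas fail.

    p is the assumed failure probability of a single replica.
    """
    return 3 * p ** 2 * (1 - p) + p ** 3

# One replica returns a wrong value; the vote masks it.
print(majority_vote([42, 42, 7]))

# All three disagree; no majority exists.
print(majority_vote([1, 2, 3]))

# With a hypothetical 1% failure rate per replica:
print(f"{tmr_failure_probability(0.01):.6f}")
```

The vote masks any single faulty replica at the cost of tripling the hardware, which is why the cost-constrained approaches above protect only the most vulnerable structures or reuse resources that already exist.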