Title: “Circuit Timing Marginalities and Silent Data Corruption”
Bio: Adit Singh is Godbold Endowed Chair and Professor of Electrical and Computer Engineering at Auburn University, USA. He earlier served on the faculties of the University of Massachusetts in Amherst, and Virginia Tech in Blacksburg, and has held visiting positions at the University of Tokyo, Japan, the Universities of Freiburg and Potsdam in Germany, the Indian Institutes of Technology, and as a Fulbright scholar at the University Polytechnic of Catalonia in Barcelona, Spain. His technical interests span all aspects of VLSI technology, in particular integrated circuit test and reliability. He has published over three hundred research papers and holds international patents that have been licensed to industry. He has served as a consultant to several semiconductor and EDA companies, including as an expert witness for major patent litigation cases. He has had leadership roles as General Chair/Co-Chair/Program Chair for dozens of international VLSI design and test conferences. He served two terms (2007-11) as Chair of the IEEE Test Technology Technical Council (TTTC), and (2011-15) on the Board of Governors of the IEEE Council on Design Automation (CEDA). Singh received his B.Tech from IIT Kanpur, and the M.S. and Ph.D. from Virginia Tech, all in Electrical Engineering. He is a Life Fellow of IEEE.
Abstract: Recent presentations from Google and Facebook (Meta) have reported significant levels of silent data corruption in their large data centers. These transient errors, which can go undetected for long periods and are extremely difficult to diagnose and root-cause, have been correlated with specific processor cores in large processor networks, suggesting faulty or unstable hardware. We suggest that a possible cause of these failures is the presence of statistically rare outlier circuit paths that display marginal timing due to unavoidable random variations in the manufacturing process. Since switching delays depend on circuit state and environmental conditions, some marginal paths can escape detection during postproduction testing but still cause occasional failures under worst-case conditions in operation. We present analysis from ongoing research on the impact of random process variations on the timing of CMOS gates and circuit paths operating at significantly reduced voltages. Circuit delays are accentuated in the low-voltage, power-saving operational modes commonly employed for thermal management in advanced processors, increasing the likelihood of timing failures. Recent research, which has been validated on published volume production test data from Intel’s advanced 14 nm FinFET technology, suggests ways of leveraging the voltage and timing of the applied timing tests to enhance the detection of parts with marginal timing during scan and system-level testing. The goal is to reliably screen out these marginal parts during postproduction testing and thereby prevent them from causing errors in operation.
Title: “Methodologies to Evaluate Robustness of Modern Complex SoCs”
Bio: Surya Musunuri is a Silicon Architect working on Aurix microcontrollers at Infineon Technologies.
Before Infineon, Surya worked for Apple Inc. in Cupertino, USA, where he was responsible for hardware power management features of the iPhone. Prior to Apple, he worked at Intel Corporation, developing IPs such as voltage regulators, PLLs, clock control units, random number generators, and security units. Over the years at both Intel and Apple, he worked on several topics related to clocking, power, and high-speed I/Os for PC, server, ultramobile, and smartphone platforms.
Surya graduated with Master's and PhD degrees in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign, where his research focused on integrated voltage regulators. He also holds a Master's Certificate in Systems Design from the Massachusetts Institute of Technology.
Outside work, Surya’s interests include biking, hiking, table tennis, and tennis.
Abstract: Modern electronic systems use semiconductor components within tight specification limits. It is therefore important to understand how these components actually perform both within and outside those limits. This understanding can help optimize the overall system for better power, performance, and cost. Alternatively, it can be used to provide an additional guard band for a semiconductor component’s functionality. Understanding the robustness of semiconductors thus helps achieve lower ppm failure rates by ensuring a sufficient guard band between the operating range of the semiconductor and the points at which it fails.
Robustness Validation (RV) is a process for checking the functionality of a semiconductor IC against a given application profile. RV draws on an understanding of the IC's failure mechanisms (analog, digital, and other), thereby providing key feedback for improving the IC within and outside the datasheet limits. Understanding the timing, power delivery network, and physical implementation aspects of modern SoCs (Systems-on-Chip) can help improve the robustness margin of key semiconductor components (such as microprocessors, microcontrollers, and ASICs). The RV methodology can be applied to all electronic components within a system, thereby significantly increasing overall system robustness.
With the ZVEI robustness handbook as a reference, this talk covers three key components of RV:
1. Knowledge of the conditions of use (application profiles).
2. Knowledge of the failure mechanisms and failure modes, and of the possible interactions between different failure mechanisms.
3. Knowledge of the acceleration models for the failure mechanisms, needed to define and assess accelerated tests.
This talk outlines methods for evaluating the robustness of modern complex SoCs. It also offers pointers on analysis techniques for deriving meaningful lessons from silicon to improve SoC design.
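As a rough illustration of the third RV component (acceleration models), the sketch below uses the widely known Arrhenius model for temperature-activated failure mechanisms. The abstract does not say which models the talk covers, so the choice of Arrhenius, the function names, and all numeric values (activation energy, temperatures, stress hours) are assumptions made purely for illustration.

```python
import math

# Boltzmann constant in eV/K.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Acceleration factor of a stress test relative to use conditions.

    AF = exp((Ea / k) * (1 / T_use - 1 / T_stress)), temperatures in kelvin.
    """
    use_temp_k = use_temp_c + 273.15
    stress_temp_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / use_temp_k - 1.0 / stress_temp_k
    )
    return math.exp(exponent)


def equivalent_use_hours(stress_hours, acceleration_factor):
    """Hours at use conditions represented by a given number of stress hours."""
    return stress_hours * acceleration_factor


if __name__ == "__main__":
    # Hypothetical values: 0.7 eV activation energy, 55 C use, 125 C stress.
    af = arrhenius_acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=125.0)
    hours = equivalent_use_hours(1000.0, af)
    print(f"Acceleration factor: {af:.1f}")
    print(f"1000 stress hours correspond to about {hours:.0f} use hours")
```

In practice, the activation energy depends on the specific failure mechanism, and other stressors such as voltage or humidity call for different models, so a single factor like this is only a starting point for defining an accelerated test.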