Improvements in chip manufacturing technology have propelled an astonishing growth in embedded systems, which are now integrated into our daily lives. However, this trend is facing serious challenges at both the device and system levels. As the minimum feature size continues to shrink, a host of vulnerabilities, such as radiation-induced soft errors and intermittent errors from weak cells and process variations, will threaten the robustness, reliability, and availability of embedded and critical systems. The traditional design paradigm assumes no failure during the lifetime of a design. Even classical fault-tolerant approaches, such as duplication or Triple Modular Redundancy (TMR), which are used for high-end mainframes and safety-critical applications, are very costly and effective only for very small defect rates, and are therefore not applicable to embedded systems.
In this project, we will develop techniques and methodologies to ensure the robustness, reliability, availability, and recoverability of critical embedded systems at both the hardware and software levels in a very cost-effective way. We will design a set of techniques and tools for fast and accurate reliability modeling and estimation at various abstraction levels. At the hardware level, we will propose concurrent error detection and localization methods to ensure data integrity. We will also develop novel software-level recovery mechanisms that provide error localization from the design level, and carefully consider how these mechanisms interact with both the microarchitectural and architectural levels.