Off-campus UMass Amherst users: To download campus access theses, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this thesis through interlibrary loan.

Theses that have an embargo placed on them will not be available to anyone until the embargo expires.

Document Type

Open Access

Degree Program

Electrical & Computer Engineering

Degree Type

Master of Science (M.S.)

Year Degree Awarded

2009

Month Degree Awarded

September

Keywords

Computer architecture, Yield and reliability, Chip multiprocessors, Microarchitectural modification, Simulation, Redundance across cores

Abstract

Device reliability and manufacturability have emerged as dominant concerns in end-of-road CMOS devices. Today an increasing number of hardware failures are attributed to device reliability problems that cause partial system failure or shutdown. Also maintaining an acceptable manufacturing yield is seen as challenge because of smaller feature sizes, process variation, and reduced headroom for burn-in tests. In this project we investigate a hardware-based scheme for improving yield and reliability of a homogeneous chip multiprocessor (CMP). The proposed solution involves a hardware framework that enables us to utilize the redundancies inherent in a multi-core system to keep the system operational in face of partial failures due to hard faults (faults due to manufacturing defects or permanent faults developed during system lifetime). A micro-architectural modification allows a faulty core in a multiprocessor system to use another core as a coprocessor to service any instruction that the former cannot execute correctly by itself. This service is accessed to improve yield and reliability, but at the cost of some loss of performance. In order to quantify this loss we have used a cycle-accurate architectural simulator to simulate the performance of dual-core and quad-core systems with one or more cores sustaining partial failure. Simulation studies indicate that when a large and sparingly-used unit such as a floating point unit fails in a core, even for a floating point intensive benchmark, we can continue to run the faulty core with as little as 10% performance impact and minimal area overhead. Incorporating this recovery mechanism entails some modifications in the microprocessor micro-architecture. The modifications are also described here through a simplified model of a superscalar processor.

First Advisor

Sandip Kundu

Share

COinS