Low-cost schemes for fault tolerance

Nitin Hemant Vaidya, University of Massachusetts Amherst

Abstract

Two aspects of fault tolerance are fault diagnosis and fault recovery. This dissertation studies both aspects and presents low-cost schemes for achieving diagnosis and recovery. Two models for fault tolerance are studied: modular redundancy and system-level diagnosis. Modular redundant systems achieve fault detection and recovery by employing multiple replicas of each module. Such systems try to mask failures whenever possible. When high reliability is to be achieved with low redundancy, it is not always possible to mask failures without retrying the computation. Checkpointing with rollback recovery is a technique that tries to minimize the expense of retrying. Multiprocessor fault tolerance schemes using modular redundancy are proposed here to reduce this expense further by exploiting the inherent redundancy of modular redundant systems. The proposed schemes are shown to improve the performance of modular redundant systems in the presence of faults, as compared to rollback schemes. A trade-off exists between the cost and performance of any fault-tolerant system. In modular redundant systems, this trade-off can be exploited to achieve high reliability at low cost by sacrificing some performance. The cost-performance trade-off is governed by the reliability-safety trade-off for modular redundant systems. This trade-off is studied, and the effect of increasing the level of redundancy on the reliability and safety of a modular redundant system is analyzed. System-level diagnosis is a graph-theoretic approach for diagnosing the status of the modules in a system. A method for minimizing the cost of diagnosis, named safe diagnosis, is proposed. It is shown that a high level of diagnostic safety, in addition to existing diagnostic reliability, can be achieved with low overhead. Additionally, it is shown that achieving high safety does not increase the complexity of fault diagnosis algorithms.
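
As a rough illustration of the masking-versus-retry idea described in the abstract, the following Python sketch shows a generic replica voter. It is not taken from the dissertation: the names `vote` and `run_with_retry`, the agreement `threshold`, and the plain re-execution loop are all assumptions chosen for illustration, and the retry here is a simple re-run rather than the checkpointing or the schemes the dissertation proposes.

```python
from collections import Counter


def vote(outputs, threshold):
    """Compare replica outputs (hypothetical helper, not from the dissertation).

    Returns (value, True) when at least `threshold` replicas agree on the
    most common output, otherwise (None, False).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count >= threshold:
        return value, True
    return None, False


def run_with_retry(replicas, x, threshold, max_retries=3):
    """Run all replicas on input x and retry when they fail to agree.

    With three replicas and threshold=2, a single faulty output is masked
    without a retry. With two replicas and threshold=2, a disagreement can
    only be detected, so the computation is re-executed.
    """
    for _ in range(max_retries + 1):
        outputs = [replica(x) for replica in replicas]
        value, agreed = vote(outputs, threshold)
        if agreed:
            return value
    raise RuntimeError("replicas did not reach agreement")
```

Under these assumptions, raising `threshold` for a fixed number of replicas makes an incorrect result less likely to be accepted but makes retries more frequent, which loosely mirrors the reliability-safety trade-off the abstract mentions.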

Subject Area

Electrical engineering|Computer science

Recommended Citation

Vaidya, Nitin Hemant, "Low-cost schemes for fault tolerance" (1993). Doctoral Dissertations Available from Proquest. AAI9316722.
https://scholarworks.umass.edu/dissertations/AAI9316722
