Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

ORCID

N/A

Access Type

Open Access Thesis

Document Type

thesis

Degree Program

Electrical & Computer Engineering

Degree Type

Master of Science in Electrical and Computer Engineering (M.S.E.C.E.)

Year Degree Awarded

2017

Month Degree Awarded

September

Abstract

Amazon Spot Instances provide inexpensive service for high-performance computing. With spot instances, it is possible to get at most 90% off as discount in costs by bidding spare Amazon Elastic Computer Cloud (Amazon EC2) instances. In exchange for low cost, spot instances bring the reduced reliability onto the computing environment, because this kind of instance could be revoked abruptly by the providers due to supply and demand, and higher-priority customers are first served.

To achieve high performance on instances with compromised reliability, Spark is applied to run jobs. In this thesis, a wide set of spark experiments are conducted to study its performance on spot instances. Without stateful replicating, Spark suffers from cascad- ing rollback and is forced to regenerate these states for ad hoc practices repeatedly. Such downside leads to discussion on trade-off between compatible slow checkpointing and regenerating on rollback and inspires us to apply multiple fault tolerance schemes. And Spark is proven to finish a job only with proper revocation rate. To validate and evaluate our work, prototype and simulator are designed and implemented. And based on real history price records, we studied how various checkpoint write frequencies and bid level affect performance. In case study, experiments show that our presented techniques can lead to ~20% shorter completion time and ~25% lower costs than those cases without such techniques. And compared with running jobs on full-price instance, the absolute saving in costs can be ~70%.

DOI

https://doi.org/10.7275/10711113

First Advisor

David Irwin

Share

COinS