Abstract
Big data analytics systems in the cloud have revolutionized data-driven insight discovery for businesses and organizations. Yet intelligent cost-performance optimization for big data analytics remains fraught with complexities: (i) accurately modeling performance amidst multiple factors (job characteristics, data distributions, system states); (ii) balancing cost and latency in multi-objective optimization while ensuring good coverage, consistency, and efficiency; and (iii) dealing with fine-grained or adaptive controls (e.g., MaxCompute partition-level tuning, Spark’s adaptive query execution) that introduce hierarchical and runtime re-optimization challenges. Motivated by these challenges, this dissertation focuses on two key questions: (1) how to build accurate, robust models for analytical jobs in big data analytics systems, and (2) how to systematically recommend optimal configurations across diverse system granularities and objectives to achieve cost-performance optimization.
The first contribution of the dissertation is a model server capable of learning job properties from both black-box traces and white-box query plans. In black-box scenarios, it employs an autoencoder (guided by specialized losses) to extract job embeddings from runtime metrics; in white-box cases, it leverages graph-based neural networks over SQL query plans. Synthetic data generation further extends coverage to out-of-distribution workloads, enabling robust predictive performance.
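To make the black-box embedding idea concrete, here is a minimal sketch, not the dissertation's actual model: a linear autoencoder in NumPy that compresses synthetic per-job runtime metrics into a low-dimensional job embedding by minimizing reconstruction error. The specialized losses, data, and dimensions of the real model server are omitted; everything below is illustrative.

```python
import numpy as np

# Hypothetical stand-in for black-box runtime traces: one row per job,
# one column per runtime metric (e.g., CPU, memory, I/O counters).
rng = np.random.default_rng(0)
n_jobs, n_metrics, embed_dim = 200, 8, 3
X = rng.normal(size=(n_jobs, n_metrics))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize metrics

# Linear encoder/decoder weights, trained by plain gradient descent
# on mean-squared reconstruction error.
W_enc = rng.normal(scale=0.1, size=(n_metrics, embed_dim))
W_dec = rng.normal(scale=0.1, size=(embed_dim, n_metrics))
lr = 0.01

def mse(A, B):
    return float(np.mean((A - B) ** 2))

loss_start = mse(X, X @ W_enc @ W_dec)
for _ in range(500):
    Z = X @ W_enc          # job embeddings
    X_hat = Z @ W_dec      # reconstruction of the metrics
    err = X_hat - X
    # Gradients of the reconstruction loss w.r.t. each weight matrix.
    grad_dec = Z.T @ err / n_jobs
    grad_enc = X.T @ (err @ W_dec.T) / n_jobs
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
loss_end = mse(X, X @ W_enc @ W_dec)
print(loss_start, loss_end)  # reconstruction error should drop
```

The embeddings `Z` would then serve as compact job representations for downstream performance prediction; the actual system replaces this linear toy with a trained autoencoder guided by specialized losses.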
Next, the optimization suite uses a principled multi-objective optimization (MOO) process to balance monetary cost and job performance. A unified data analytics optimizer (UDAO) tackles high-dimensional parameters, employing a gradient-based solver to quickly locate Pareto-optimal solutions. Two specialized optimizers then refine control at finer granularities: (1) a stage-level resource optimizer for MaxCompute that schedules millions of partition instances in sub-seconds, reducing latency by 37–72% and cost by 43–78% in production; and (2) a hybrid compile-time/runtime optimizer for Spark SQL, fine-tuning parameters (query, subquery, stage) and dynamically re-optimizing queries with updated runtime statistics, yielding over 60% latency cuts on TPC-DS and TPC-H.
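As a toy illustration of the Pareto-optimality criterion these optimizers target (not UDAO's gradient-based solver), the following sketch filters candidate configurations scored on two objectives, keeping only those not dominated on both latency and cost. The candidate values are made up for illustration.

```python
def pareto_front(points):
    """Return the (latency, cost) points not dominated by any other point.

    A point q dominates p if q is no worse on both objectives and
    differs from p (weak Pareto dominance).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (latency, cost) scores for five candidate configurations.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (9.0, 9.0), (8.0, 7.0)]
front = pareto_front(candidates)
print(front)  # (9.0, 9.0) is dominated by (8.0, 7.0) and drops out
```

In practice the optimizers search high-dimensional parameter spaces rather than enumerating candidates, but the returned frontier plays the same role: a menu of cost-latency trade-offs for the user or policy to choose from.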
Evaluations show that this intelligent optimizer consistently outperforms existing heuristic or partially automated strategies, achieving faster solving times (under one second) and better Pareto-frontier coverage. In summary, this dissertation demonstrates that intelligent cost-performance optimization, anchored in advanced modeling and multi-objective parameter tuning, enables big data systems to meet stringent performance goals while effectively managing cloud expenses.
Type
Dissertation (Open Access)
Date
2025-05
License
Attribution-NonCommercial 4.0 International
http://creativecommons.org/licenses/by-nc/4.0/
Files
LyuDissertation2025.pdf
Adobe PDF, 7.38 MB