Open Access Dissertation
Doctor of Philosophy (PhD)
OS and Networks
Cloud wide-area networks (WANs) play a key role in enabling high-performance applications on the Internet. Cloud providers like Amazon, Google and Microsoft spend over a hundred million dollars annually to design, provision and operate their WANs to fulfill the low-latency, high-bandwidth communication demands of their clients. In the last decade, cloud providers have rapidly expanded their datacenter deployments, network equipment and backbone capacity, preparing their infrastructure to meet growing client demands. This dissertation re-examines the design and operation choices made by cloud providers in this phase of exponential growth along the axes of network performance, reliability and operational expenditure. I use empirical evidence from a large commercial cloud to develop software-defined traffic engineering systems that remedy the inefficiencies in the operation of cloud networks.

First, I demonstrate how knowledge of optical signal quality can lead to a 75% increase in capacity for 80% of the optical wavelengths in the cloud backbone. I show that optical wavelengths can potentially sustain capacities of 175 Gbps or higher, yet they were being operated at a conservative 100 Gbps, leaving 145 Tbps of network capacity on the table. This gain is available because operators have been conservative in utilizing the fiber out of concern for network reliability. My analysis shows that by dynamically adapting link capacities, it is possible to have the best of both worlds: gains in network capacity with fewer link failures. I develop a traffic engineering controller for the WAN that dynamically adapts link capacities in response to changing optical signal quality. The rate adaptive wide-area network (RADWAN) controller reclaims terabits of network capacity from the existing cloud WAN infrastructure while preventing 25% of link failures.

Second, I demonstrate cost inefficiencies in the design of cloud optical backbones. Cloud providers traditionally operate point-to-point inter-regional networks, where optical signals are converted to electrical signals and back at every geographical region. While this conventional point-to-point design is flexible enough to accommodate new traffic demands, it does not take into account the nature of those demands or the traffic flows they impose. My analysis shows that on average, 60% of the traffic traversing a WAN region is pass-through traffic, neither originating nor terminating at that region. This pass-through, or transit, traffic undergoes wasteful optical-to-electrical-to-optical (OEO) conversions at all intermediate regions in point-to-point networks, occupying scarce optical line ports and router ports. These ports account for the majority of the cost of provisioning capacity in cloud networks with existing fiber deployments. I design and implement Shoofly, a network design tool that minimizes the hardware cost of provisioning long-haul capacity by optically bypassing network hops where converting signals from the optical to the electrical domain is unnecessary and uneconomical. Shoofly provisions bypass-enabled topologies that meet 8x the present-day demands using existing network hardware. Even under aggressive stochastic and deterministic link failure scenarios, these topologies save 32% of the cost of long-haul capacity.

Finally, I develop Cascara, a cloud edge traffic engineering controller that minimizes the cost of inter-domain bandwidth incurred by the cloud provider. Traffic engineering systems at the cloud edge attempt to strike a fine balance between minimizing costs and maintaining the latency expected by clients. The nature of this tradeoff is complex due to the non-linear pricing schemes prevalent in the market for inter-domain bandwidth. I quantify this tradeoff and uncover several key insights from the link utilization between a large cloud provider and Internet service providers. Based on these insights, Cascara optimizes inter-domain bandwidth allocations under non-linear pricing schemes. Cascara exploits the abundance of latency-equivalent peer links on the cloud edge to minimize costs without significantly impacting latency. Extensive evaluation on production traffic demands shows that Cascara saves 11–50% in bandwidth costs per cloud point of presence, while bounding the increase in client latency by 3 milliseconds.
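To illustrate why non-linear pricing complicates the cost side of this tradeoff, the following is a minimal sketch that assumes 95th-percentile (burstable) billing, one scheme commonly used for inter-domain bandwidth. The abstract does not name the pricing scheme Cascara models, so the percentile, the sample values, and the names `billable_rate` and `link_cost` are illustrative assumptions, not the dissertation's implementation.

```python
def billable_rate(samples_mbps, percentile=95):
    """Return the percentile-th utilization sample (nearest-rank method).

    Assumption: the link is billed on this single value, so samples
    above it do not add to the bill.
    """
    if not samples_mbps:
        raise ValueError("need at least one utilization sample")
    ordered = sorted(samples_mbps)
    # Integer ceiling of percentile * n / 100, giving a 1-based rank.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]


def link_cost(samples_mbps, price_per_mbps, percentile=95):
    """Cost of one link for a billing period (hypothetical unit price)."""
    return billable_rate(samples_mbps, percentile) * price_per_mbps


# Hypothetical utilization samples (Mbps) for one billing period.
steady = [100] * 100
short_burst = [100] * 96 + [900] * 4    # bursts in 4% of samples
long_burst = [100] * 90 + [900] * 10    # bursts in 10% of samples

for name, samples in [("steady", steady),
                      ("short_burst", short_burst),
                      ("long_burst", long_burst)]:
    print(name, sum(samples), link_cost(samples, price_per_mbps=2))
```

Under this assumed scheme, the short bursts are billed the same as steady traffic, while the longer bursts carry roughly 1.4 times the volume but cost nine times as much. Cost therefore depends on when and where traffic peaks rather than on total volume, which is what leaves room for a controller to reduce cost by choosing among latency-equivalent links.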
Singh, Rachee, "Traffic engineering in planet-scale cloud networks" (2021). Doctoral Dissertations. 2220.
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.