Publications
Publications by category, in reverse chronological order. Generated by jekyll-scholar.
2024
- OPPerTune: Post-Deployment Configuration Tuning of Services Made Easy
  Gagan Somashekar*, Karan Tandon*, Anush Kini, and 6 more authors
  In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), Apr 2024
Real-world application deployments have hundreds of interdependent configuration parameters, many of which significantly influence performance and efficiency. With today’s complex and dynamic services, operators need to continuously monitor and set the right configuration values (configuration tuning) well after a service is widely deployed. This is challenging since experimenting with different configurations post-deployment may reduce application performance or cause disruptions. While state-of-the-art ML approaches do help to automate configuration tuning, they do not fully address the multiple challenges in end-to-end configuration tuning of deployed applications. This paper presents OPPerTune, a service that enables configuration tuning of applications in deployment at Microsoft. OPPerTune reduces application interruptions while maximizing the performance of deployed applications as and when the workload or the underlying infrastructure changes. It automates three essential processes that facilitate post-deployment configuration tuning: (a) determining which configurations to tune, (b) automatically managing the scope at which to tune the configurations, and (c) using a novel reinforcement learning algorithm to simultaneously and quickly tune numerical and categorical configurations, thereby keeping the overhead of configuration tuning low. We deploy OPPerTune on two enterprise applications in Microsoft Azure’s clusters. Our experiments show that OPPerTune reduces the end-to-end P95 latency of microservice applications by more than 50% over expert configuration choices made ahead of deployment. The code and datasets used are made available at https://aka.ms/OPPerTune.
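  The core idea of jointly tuning numerical and categorical configurations can be illustrated with a minimal random-search loop over a hypothetical latency model. This is a sketch only: the configuration names and the cost model below are invented for illustration, and OPPerTune's actual method is the novel reinforcement learning algorithm described in the paper, not this search.

  ```python
  import random

  def tune(evaluate, categories, num_range, iters=200, seed=0):
      """Jointly tune one categorical and one numerical configuration by
      sampling candidates and keeping the lowest-cost one (illustrative
      stand-in for a hybrid tuning algorithm)."""
      rng = random.Random(seed)
      lo, hi = num_range
      best = (rng.choice(categories), rng.uniform(lo, hi))
      best_cost = evaluate(*best)
      for _ in range(iters):
          cand = (rng.choice(categories), rng.uniform(lo, hi))
          cost = evaluate(*cand)
          if cost < best_cost:
              best, best_cost = cand, cost
      return best, best_cost

  # Hypothetical P95-latency model: the "lru" policy with a pool size
  # near 64 is optimal. Both the metric and the optimum are invented.
  def p95_latency(policy, pool_size):
      base = {"lru": 10.0, "fifo": 14.0, "random": 18.0}[policy]
      return base + 0.01 * (pool_size - 64) ** 2

  config, cost = tune(p95_latency, ["lru", "fifo", "random"], (1, 256))
  ```

  Even this naive loop recovers the right categorical choice here; the paper's contribution is doing this sample-efficiently and safely on live deployments.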
- Reward Copilot for RL-driven Systems Optimization
  Karan Tandon, Manav Mishra, Gagan Somashekar, and 2 more authors
  In NeurIPS 2024 Workshop on Machine Learning for Systems, Dec 2024
Systems optimization problems such as workload auto-scaling, kernel parameter tuning, and cluster management arising in large-scale enterprise infrastructure are becoming increasingly RL-driven. While effective, setting up the RL framework for such real-world problems is difficult: designing correct and useful reward functions or state spaces is highly challenging and requires substantial domain expertise. We propose Reward Copilot, a solution that helps design suitable and interpretable reward functions, guided by client-provided specifications, for any RL framework. Using experiments on standard benchmarks as well as systems-specific optimization problems, we show that our solution can return reward functions with a certain (informal) feasibility certificate in addition to Pareto-optimality.
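  To make the notion of a reward function built from client-provided specifications concrete, here is a minimal hand-written sketch. The spec format, metric names, and scoring scheme are all invented for illustration; Reward Copilot derives such functions automatically, which this snippet does not attempt.

  ```python
  def make_reward(specs):
      """Compose a scalar reward from client-provided objective specs.
      specs maps a metric name to (direction, target), where direction
      is "minimize" or "maximize". Each metric contributes a score
      capped at 1.0 (target met), and the reward is their average."""
      def reward(metrics):
          total = 0.0
          for name, (direction, target) in specs.items():
              value = metrics[name]
              score = target / value if direction == "minimize" else value / target
              total += min(score, 1.0)  # cap so no single metric dominates
          return total / len(specs)
      return reward

  # Hypothetical client spec: keep P95 latency under 100 ms while
  # sustaining 500 requests/s.
  r = make_reward({"p95_latency_ms": ("minimize", 100.0),
                   "throughput_rps": ("maximize", 500.0)})
  r({"p95_latency_ms": 80.0, "throughput_rps": 500.0})  # both targets met -> 1.0
  ```

  Averaging capped per-metric scores is one crude way to balance competing objectives; choosing a composition that is both interpretable and Pareto-respecting is exactly the design problem the paper automates.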
- Improving training time and GPU utilization in geo-distributed language model training
  Palak, Rohan Gandhi, Karan Tandon, and 2 more authors
  Nov 2024
The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them in the same datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a Wide-Area Network (WAN). We build ATLAS, which speeds up such training using novel temporal bandwidth sharing and many other design choices. While ATLAS improves the training time, it does not eliminate the bubbles (idle GPU cycles). We built BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, improving GPU utilization substantially without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.
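  The bubble-filling idea can be sketched as a simple packing problem: given idle GPU windows in the training timeline, greedily place prefill requests into them. The greedy longest-first policy and the data shapes below are assumptions for illustration, not BUBBLETEA's actual scheduler.

  ```python
  def fill_bubbles(bubbles, prefill_jobs):
      """Greedily pack prefill requests into training bubbles.
      bubbles: list of (start_time, duration) idle GPU windows.
      prefill_jobs: list of estimated prefill durations.
      Returns (schedule, utilized_time), where schedule is a list of
      (start_time, duration) placements."""
      schedule, used = [], 0.0
      jobs = sorted(prefill_jobs, reverse=True)  # longest-first packing
      for start, length in bubbles:
          t, remaining = start, length
          for i, d in enumerate(jobs):
              if d is not None and d <= remaining:
                  schedule.append((t, d))
                  t += d
                  remaining -= d
                  used += d
                  jobs[i] = None  # mark as placed
          jobs = [d for d in jobs if d is not None]
      return schedule, used

  # Two bubbles (10s and 5s of idle GPU time) and five prefill requests.
  schedule, used = fill_bubbles([(0, 10), (30, 5)], [4, 4, 3, 2, 6])
  ```

  In this toy instance the scheduler reclaims 14 of the 15 idle seconds; the real system additionally has to respect memory limits and avoid perturbing the training step, which is why BUBBLETEA restricts itself to prefill rather than full inference.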