
Anshul Gandhi
Research Assistant Professor · Stony Brook University · Computer Science
Active 2006–2026
About
Anshul Gandhi is an Associate Professor in the Department of Computer Science at Stony Brook University. He leads the PACE Lab and specializes in applying analytical tools such as Machine Learning, Optimization, Queueing Theory, and Control Theory to improve computer systems. His research uses these mathematical tools to analyze and optimize the behavior of systems including Data Centers, Cloud, and Systems for Machine Learning, with the aim of enhancing performance, reducing energy consumption, and lowering carbon footprints. He completed his Ph.D. in 2013 in the Computer Science Department at Carnegie Mellon University, advised by Prof. Mor Harchol-Balter, with a thesis on Dynamic Server Provisioning for Data Center Power Management. After his doctoral studies, he spent a year as a post-doctoral researcher at the IBM T.J. Watson Research Center in the Cloud Optimization and Analytics group. He completed his undergraduate studies in 2007 at the Indian Institute of Technology, Kanpur. His current research projects span sustainable computing, systems for machine learning, and the efficiency of distributed systems, including optimizing the carbon footprint of jobs, colocating workloads in cloud environments, increasing the throughput of inference serving systems, and analytically modeling system performance. He has received several awards, including the ACM SIGMETRICS Rising Star Award, NSF CAREER Award, Google Faculty Research Award, and IBM Faculty Award.
Research topics
- Computer Science
- Operating system
- Artificial Intelligence
- Telecommunications
- Computer network
- Distributed computing
- Engineering
- Computer hardware
- Operations management
- Computer architecture
- Parallel computing
Selected publications
Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
Open MIND · 2026-02-24
preprint · Senior author
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized AllReduce. We evaluate on three representative SSM-based LLMs spanning pure-SSM and hybrid architectures - Mamba, Falcon-Mamba, and Zamba - on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference, improving batch-request throughput by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and achieving a further ~10-18% throughput improvement from quantized all-reduce by lowering synchronization bandwidth overhead.
Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
arXiv (Cornell University) · 2026-02-24
article · Open access · Senior author
Energy-efficient GPU SM allocation
ACM SIGMETRICS Performance Evaluation Review · 2025-08-26
article
GPU sharing between workloads is an effective approach to increase GPU utilization and reduce idle power waste. To minimize resource contention under GPU sharing, current architectures allow users to allocate core GPU compute resources exclusively to workloads. However, identifying the most efficient GPU compute resource allocation for colocated workloads is challenging, as it requires balancing potential performance degradation and power savings. This paper presents a framework for finding the most energy-efficient compute allocation for colocated workload pairs under NVIDIA MPS using lightweight prediction models. Experimental results, using a range of training, inference, and general CUDA workloads, demonstrate that our solution outperforms the equal sharing strategy by 35%, on average, and is within 1.5% of the offline optimal strategy.
Foreword - Special Issue - MASCOTS 2023
Performance Evaluation · 2025-01-05
article · Senior author
2025-10-21
article
As the adoption of Virtualized Radio Access Networks (vRAN) is gaining momentum in 5G networks, Mobility Network Operators are considering a Centralized RAN (CRAN) architecture that moves the baseband functions to a far-edge cloud in order to gain dimensioning flexibility, resiliency and improved RAN performance. However, there have been limited studies on the benefits of centralization in improving RAN compute utilization, especially in the context of pooling the compute-intensive Distributed Unit (DU) resources. In this paper, we present the first study on the benefits of pooling in improving DU server utilization. Using longitudinal traces from a real-world 5G network, we show that significant Capex and Opex gains of 84% and 94%, respectively, can be obtained through fine-grained pooling at a granularity of 1 second. We also present an affinity-based and dynamic pooling algorithm that can reduce the pooling overheads while still achieving significant pooling gains.
GUIDE - GNN based Unified Incident Detection for Microservices Application Deployments
2025-07-21 · 1 citation
article · Senior author
Microservices deployments in the real world present significant challenges in detecting and localizing performance bottlenecks due to their scale, complexity, and dynamic interactions. This paper presents GUIDE, a GNN-based framework for unified incident detection and bottleneck localization, leveraging multi-source telemetry and a customizable incident trigger warning mechanism. Specifically, GUIDE employs a novel integration of Graph Attention Networks, temporal embeddings, and an expert classifier to predict and localize bottlenecks efficiently in practice. Evaluation results on real-world traces collected from Observe, a live, cloud-native platform, show that GUIDE achieves an F1-score of 87% for anomaly detection and 84% for bottleneck localization, outperforming existing baselines. Additionally, GUIDE's incident trigger warning mechanism achieves an F1-score of 85%, ensuring early and accurate detection of system failures.
Fine-Grained Energy Prediction For Parallelized LLM Inference With PIE-P
ArXiv.org · 2025-12-14
preprint · Open access
With the widespread adoption of Large Language Models (LLMs), the energy cost of running LLMs is quickly becoming a critical concern. However, precisely measuring the energy consumption of LLMs is often infeasible because hardware-based power monitors are not always accessible and software-based energy measurement tools are not accurate. While various prediction techniques have been developed to estimate LLM energy consumption, these approaches are limited to single-GPU environments and thus are not applicable to modern LLM inference, which is typically parallelized across multiple GPUs. In this work, we remedy this gap and introduce PIE-P, a fine-grained energy prediction framework for multi-GPU inference, including tensor, pipeline, and data parallelism. Predicting the energy under parallelized inference is complicated by the non-determinism in inter-GPU communication, additional communication overheads, and difficulties in isolating energy during the communication/synchronization phase. We develop a scalable prediction framework that addresses these issues via precise sampling, fine-grained modeling of inter-GPU communication, and careful accounting of parallelization overhead. Our evaluation results show that PIE-P yields accurate and fine-grained energy predictions across parallelism strategies, significantly outperforming baselines.
Kneeliverse: A universal knee-detection library for performance curves
SoftwareX · 2025-05-01 · 3 citations
article · Open access
Identifying knee and elbow points in performance curves is a critical task in various domains, including machine learning and system design. These points represent optimal trade-offs between cost and performance, facilitating efficient decision-making and resource allocation. However, accurately determining the knees and elbows in curves poses a significant challenge. To address this challenge, we introduce Kneeliverse, an open-source library dedicated to knee/elbow point detection. Kneeliverse incorporates a suite of well-established knee-detection algorithms, including Menger, L-method, Kneedle, and DFDT. Additionally, Kneeliverse extends these algorithms to detect multiple knees and elbows in complex curves, employing a recursive approach. Kneeliverse further includes Z-Method, a recently developed algorithm specifically designed for multi-knee detection.
Investigating WebRTC BBR as an alternative to GCC for live video streaming
2025-01-06
article · Senior author
Google Congestion Control (GCC) is the default congestion control algorithm for WebRTC, a popular web application used for live video streaming. BBR, also developed at Google, is commonly used for streaming pre-recorded video on services like YouTube. However, BBR has not been widely deployed for real-time applications like live video streaming. It was implemented for WebRTC in 2018, but it was later deprecated due to poor performance. While GCC performs well under most network conditions, it can be starved by a loss-based TCP flow using the same bottleneck link. In this work, we investigate the possibility of using BBR as an alternative to GCC for WebRTC congestion control. We test it under a variety of network conditions and find that it performs better than GCC when competing with TCP, and it achieves bitrates comparable to GCC's in isolation, except when bandwidth is restricted and the bottleneck buffer is deep. We find that this is because of bandwidth overestimation, a problem which also exists in TCP BBR. While modifying WebRTC BBR's bandwidth estimation fails to improve performance in our experiments, we do find that disabling its recovery state, a unique loss response, improves WebRTC BBR's performance in under-provisioned networks.
The case for accurate lifetime accounting in carbon metrics
ACM SIGMETRICS Performance Evaluation Review · 2025-08-26
article · Senior author
To represent the entire carbon footprint of computing devices, carbon metrics often include both an embodied cost (i.e., carbon cost to produce the device) and an operational cost (i.e., carbon cost to run the device). The embodied carbon cost is typically high, but it is amortized over the lifetime of the device. In this vision statement, we argue that for carbon metrics to be useful, we need (i) accurate metrics for lifetime, which are challenging for SSDs, and (ii) correct reasoning about carbon costs when using such metrics.
Recent grants
NSF · $258k · 2023–2026
EAGER: Elastic Multi-layer Memcached Tiers
NSF · $257k · 2016–2019
CSR: Small: Scalable, heterogeneity-aware load balancing
NSF · $395k · 2016–2020
CAREER: Enabling Predictable Performance in Cloud Computing
NSF · $436k · 2018–2025
CRII: CSR: Online Performance Modeling of Opaque Cloud Applications
NSF · $173k · 2015–2017
Frequent coauthors
- 23 shared
Mor Harchol‐Balter
Carnegie Mellon University
- 14 shared
Muhammad Wajahat
Pir Mehr Ali Shah Arid Agriculture University
- 11 shared
Seyyed Ahmad Javadi
Amirkabir University of Technology
- 11 shared
Aruna Balasubramanian
Stony Brook University
- 9 shared
Andrzej Kochut
- 9 shared
Michael A. Kozuch
Intel (United States)
- 9 shared
Sneha Shrivastava
Stony Brook University
- 9 shared
Abby L. Spencer
General Department of Preventive Medicine
Labs
PACE Lab · PI
Awards & honors
- ACM SIGMETRICS Rising Star Award
- NSF CAREER Award
- Google Faculty Research Award
- IBM Faculty Award