Anshul Gandhi

Research Assistant Professor · Verified

Stony Brook University · Computer Science

Active 2006–2026

h-index: 27
Citations: 2.9k
Papers: 111 (34 in the last 5 years)
Funding: $2.9M (2 active)
About

Anshul Gandhi is an Associate Professor in the Department of Computer Science at Stony Brook University. He leads the PACE Lab and specializes in applying analytical tools such as Machine Learning, Optimization, Queueing Theory, and Control Theory to improve computer systems. His research focuses on leveraging these mathematical tools to analyze and optimize the behavior of systems including Data Centers, Cloud, and Systems for Machine Learning, with the aim of enhancing performance, reducing energy consumption, and lowering carbon footprints. He completed his Ph.D. in 2013 in the Computer Science Department at Carnegie Mellon University under the supervision of Prof. Mor Harchol-Balter, with a thesis on Dynamic Server Provisioning for Data Center Power Management. Following his doctoral studies, he spent a year as a post-doctoral researcher at the IBM T.J. Watson Research Center in the Cloud Optimization and Analytics group. He completed his undergraduate studies in 2007 at the Indian Institute of Technology, Kanpur. His current research projects include sustainable computing, systems for machine learning, and improving the efficiency of distributed systems, such as optimizing the carbon footprint of jobs, colocating workloads in cloud environments, increasing throughput of inference serving systems, and analytically modeling system performance. He has received several awards, including the ACM SIGMETRICS Rising Star Award, NSF CAREER Award, Google Faculty Research Award, and IBM Faculty Award.

Research topics

  • Computer Science
  • Operating system
  • Artificial Intelligence
  • Telecommunications
  • Computer network
  • Distributed computing
  • Engineering
  • Computer hardware
  • Operations management
  • Computer architecture
  • Parallel computing

Selected publications

  • Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

    Open MIND · 2026-02-24

    preprint · Senior author

    Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized AllReduce. We evaluate on three representative SSM-based LLMs spanning pure-SSM and hybrid architectures - Mamba, Falcon-Mamba, and Zamba - on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference, improving batch-request throughput by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and achieving a further ~10-18% throughput improvement from quantized all-reduce by lowering synchronization bandwidth overhead.

  • Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

    arXiv (Cornell University) · 2026-02-24

    article · Open access · Senior author

  • Energy-efficient GPU SM allocation

    ACM SIGMETRICS Performance Evaluation Review · 2025-08-26

    article

    GPU sharing between workloads is an effective approach to increase GPU utilization and reduce idle power waste. To minimize resource contention under GPU sharing, current architectures allow users to allocate core GPU compute resources exclusively to workloads. However, identifying the most efficient GPU compute resource allocation for colocated workloads is challenging, as it requires balancing potential performance degradation and power savings. This paper presents a framework for finding the most energy-efficient compute allocation for colocated workload pairs under NVIDIA MPS using lightweight prediction models. Experimental results, using a range of training, inference, and general CUDA workloads, demonstrate that our solution outperforms the equal sharing strategy by 35%, on average, and is within 1.5% of the offline optimal strategy.

  • Foreword - Special Issue - MASCOTS 2023

    Performance Evaluation · 2025-01-05

    article · Senior author

  • Constellate: Establishing the opportunity for Distributed Unit pooling in real-world 5G Radio Access Networks

    2025-10-21

    article

    As the adoption of Virtualized Radio Access Networks (vRAN) is gaining momentum in 5G networks, Mobility Network Operators are considering a Centralized RAN (CRAN) architecture that moves the baseband functions to a far-edge cloud in order to gain dimensioning flexibility, resiliency and improved RAN performance. However, there have been limited studies on the benefits of centralization in improving RAN compute utilization, especially in the context of pooling the compute-intensive Distributed Unit (DU) resources. In this paper, we present the first study on the benefits of pooling in improving DU server utilization. Using longitudinal traces from a real-world 5G network, we show that significant Capex and Opex gains of 84% and 94%, respectively, can be obtained through fine-grained pooling at a granularity of 1 second. We also present an affinity-based and dynamic pooling algorithm that can reduce the pooling overheads while still achieving significant pooling gains.

  • GUIDE - GNN based Unified Incident Detection for Microservices Application Deployments

    2025-07-21 · 1 citation

    article · Senior author

    Microservices deployments in the real world present significant challenges in detecting and localizing performance bottlenecks due to their scale, complexity, and dynamic interactions. This paper presents GUIDE, a GNN-based framework for unified incident detection and bottleneck localization, leveraging multisource telemetry and a customizable incident trigger warning mechanism. Specifically, GUIDE employs a novel integration of Graph Attention Networks, temporal embeddings, and an expert classifier to predict and localize bottlenecks efficiently in practice. Evaluation results on real-world traces collected from Observe, a live, cloud-native platform, show that GUIDE achieves an F1-score of 87% for anomaly detection and 84% for bottleneck localization, outperforming existing baselines. Additionally, GUIDE’s incident trigger warning mechanism achieves an F1-score of 85%, ensuring early and accurate detection of system failures.

  • Fine-Grained Energy Prediction For Parallelized LLM Inference With PIE-P

    ArXiv.org · 2025-12-14

    preprint · Open access

    With the widespread adoption of Large Language Models (LLMs), the energy cost of running LLMs is quickly becoming a critical concern. However, precisely measuring the energy consumption of LLMs is often infeasible because hardware-based power monitors are not always accessible and software-based energy measurement tools are not accurate. While various prediction techniques have been developed to estimate LLM energy consumption, these approaches are limited to single-GPU environments and thus are not applicable to modern LLM inference, which is typically parallelized across multiple GPUs. In this work, we remedy this gap and introduce PIE-P, a fine-grained energy prediction framework for multi-GPU inference, including tensor, pipeline, and data parallelism. Predicting the energy under parallelized inference is complicated by the non-determinism in inter-GPU communication, additional communication overheads, and difficulties in isolating energy during the communication/synchronization phase. We develop a scalable prediction framework that addresses these issues via precise sampling, fine-grained modeling of inter-GPU communication, and careful accounting of parallelization overhead. Our evaluation results show that PIE-P yields accurate and fine-grained energy predictions across parallelism strategies, significantly outperforming baselines.

  • Kneeliverse: A universal knee-detection library for performance curves

    SoftwareX · 2025-05-01 · 3 citations

    article · Open access

    Identifying knee and elbow points in performance curves is a critical task in various domains, including machine learning and system design. These points represent optimal trade-offs between cost and performance, facilitating efficient decision-making and resource allocation. However, accurately determining the knees and elbows in curves poses a significant challenge. To address this challenge, we introduce Kneeliverse, an open-source library dedicated to knee/elbow point detection. Kneeliverse incorporates a suite of well-established knee-detection algorithms, including Menger, L-method, Kneedle, and DFDT. Additionally, Kneeliverse extends these algorithms to detect multiple knees and elbows in complex curves, employing a recursive approach. Kneeliverse further includes Z-Method, a recently developed algorithm specifically designed for multi-knee detection.

  • Investigating WebRTC BBR as an alternative to GCC for live video streaming

    2025-01-06

    article · Senior author

    Google Congestion Control (GCC) is the default congestion control algorithm for WebRTC, a popular web application used for live video streaming. BBR, also developed at Google, is commonly used for streaming pre-recorded video on services like YouTube. However, BBR has not been widely deployed for real-time applications like live video streaming. It was implemented for WebRTC in 2018, but it was later deprecated due to poor performance. While GCC performs well under most network conditions, it can be starved by a loss-based TCP flow using the same bottleneck link. In this work, we investigate the possibility of using BBR as an alternative to GCC for WebRTC congestion control. We test it under a variety of network conditions and find that it performs better than GCC when competing with TCP, and it achieves bitrates comparable to GCC’s in isolation, except when bandwidth is restricted and the bottleneck buffer is deep. We find that this is because of bandwidth overestimation, a problem which also exists in TCP BBR. While modifying WebRTC BBR’s bandwidth estimation fails to improve performance in our experiments, we do find that disabling its recovery state, a unique loss response, improves WebRTC BBR’s performance in under-provisioned networks.

  • The case for accurate lifetime accounting in carbon metrics

    ACM SIGMETRICS Performance Evaluation Review · 2025-08-26

    article · Senior author

    To represent the entire carbon footprint of computing devices, carbon metrics often include both an embodied cost (i.e., carbon cost to produce the device) and an operational cost (i.e., carbon cost to run the device). The embodied carbon cost is typically high, but it is amortized over the lifetime of the device. In this vision statement, we argue that for carbon metrics to be useful, we need (i) accurate metrics for lifetime, which are challenging for SSDs, and (ii) correct reasoning about carbon costs when using such metrics.
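The lifetime argument in the last abstract above can be illustrated with a small sketch. It assumes the simplest possible accounting, linear amortization of the embodied cost over the device lifetime plus a constant operational rate. The function name, parameters, and all numbers are hypothetical and are not taken from the paper.

```python
def attributed_carbon_kg(embodied_kg, lifetime_years, usage_years, operational_kg_per_year):
    """Carbon attributed to a usage window under linear amortization (an assumption).

    The embodied cost is spread evenly over the assumed lifetime, so the window is
    charged its proportional share plus the operational cost incurred while running.
    """
    if lifetime_years <= 0:
        raise ValueError("lifetime_years must be positive")
    embodied_share = embodied_kg * usage_years / lifetime_years
    operational = operational_kg_per_year * usage_years
    return embodied_share + operational


# Same hypothetical device and one year of use; only the assumed lifetime differs.
short_lifetime = attributed_carbon_kg(100.0, 2.0, 1.0, 20.0)
long_lifetime = attributed_carbon_kg(100.0, 5.0, 1.0, 20.0)
print(short_lifetime, long_lifetime)
```

Under these made-up inputs the attributed carbon for the same year of use changes substantially with the assumed lifetime, which is why an inaccurate lifetime estimate can distort the metric.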

Frequent coauthors

  • Mor Harchol‐Balter

    Carnegie Mellon University

    23 shared
  • Muhammad Wajahat

    Pir Mehr Ali Shah Arid Agriculture University

    14 shared
  • Seyyed Ahmad Javadi

    Amirkabir University of Technology

    11 shared
  • Aruna Balasubramanian

    Stony Brook University

    11 shared
  • Andrzej Kochut

    9 shared
  • Michael A. Kozuch

    Intel (United States)

    9 shared
  • Sneha Shrivastava

    Stony Brook University

    9 shared
  • Abby L. Spencer

    General Department of Preventive Medicine

    9 shared

Awards & honors

  • ACM SIGMETRICS Rising Star Award
  • NSF CAREER Award
  • Google Faculty Research Award
  • IBM Faculty Award
