
Rahul Mazumder
Nanyang Technological University Associate Professor of Operations Research and Statistics · Massachusetts Institute of Technology · Operations Research and Statistics
Active 1992–2026
About
Rahul Mazumder is the Nanyang Technological University Associate Professor of Operations Research and Statistics and an Associate Professor at the MIT Sloan School of Management. His research interests include data science, statistical machine learning, large-scale optimization, mathematical programming, and their interplay. He is particularly interested in 'big data' applications in environmental and climate studies, social science, and recommender systems. Mazumder has published in journals including the Journal of Machine Learning Research, Annals of Statistics, Journal of the American Statistical Association, and Annals of Applied Statistics. He completed his BS and MS in statistics at the Indian Statistical Institute, Kolkata in 2007, and earned his PhD in statistics from Stanford University in 2012.
Research topics
- Artificial Intelligence
- Machine Learning
- Computer Science
- Mathematics
- Data Mining
- Engineering
- Algorithm
- Mathematical optimization
- Geometry
Selected publications
MOONSHOT: A Framework for Multi-Objective Pruning of Vision and Large Language Models
arXiv (Cornell University) · 2026-04-14
article · Open access · Senior author
Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.
Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives
Journal of the American Statistical Association · 2026-04-28
article · Open access
Tree ensembles are non-parametric methods widely recognized for their accuracy and ability to capture complex interactions. While these models excel at prediction, they are difficult to interpret and may fail to uncover useful relationships in the data. We propose an estimator to extract compact sets of decision rules from tree ensembles. The extracted models are accurate and can be manually examined to reveal relationships between the predictors and the response. A key novelty of our estimator is the flexibility to jointly control the number of rules extracted and the interaction depth of each rule, which improves accuracy. We develop a tailored exact algorithm to efficiently solve optimization problems underlying our estimator and an approximate algorithm for computing regularization paths, sequences of solutions that correspond to varying model sizes. We also establish novel non-asymptotic prediction error bounds for our proposed approach, comparing it to an oracle that chooses the best data-dependent linear combination of the rules in the ensemble subject to the same complexity constraint as our estimator. The bounds illustrate that the large-sample predictive performance of our estimator is on par with that of the oracle. Through experiments, we demonstrate that our estimator outperforms existing algorithms for rule extraction.
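Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives
Operations Research · 2026-04-09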
preprint · Open access · Senior author
GraphL0: Sparse Gaussian Graphical Models with Discrete Optimization. Recovering sparse dependency graphs in undirected Gaussian graphical models is a well-known problem in statistical machine learning. Given samples from a p-dimensional Gaussian distribution, the task amounts to estimating the p × p precision (inverse covariance) matrix under the assumption that only a small fraction of its entries are nonzero. In “Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives,” the authors introduce GraphL0. GraphL0 is an estimator based on an ℓ0-penalized pseudo-likelihood, departing from the more common ℓ1 relaxation. The resulting formulation is a convex mixed-integer program, which becomes challenging for standard commercial solvers at moderate-to-large p. To make the approach practical, the authors develop a custom nonlinear branch-and-bound algorithm, alongside scalable approximate solvers. The paper also provides new statistical guarantees for estimation accuracy and support recovery, and experiments on synthetic and real data sets show substantial computational gains over off-the-shelf solvers and competitive runtime and accuracy versus leading alternatives.
Modelling with categorical features via exact fusion and sparsity regularization
Journal of the Royal Statistical Society Series B (Statistical Methodology) · 2026-03-28
article · Senior author
We study the high-dimensional linear regression problem with categorical predictors that have many levels. We propose a new estimation approach, which performs model compression via two mechanisms by simultaneously encouraging (a) clustering of the regression coefficients to collapse some of the categorical levels together; and (b) sparsity of the regression coefficients. We present novel mixed integer programming formulations for our estimator, and develop a custom row generation procedure to speed up the exact off-the-shelf solvers. We also propose a fast approximate algorithm for our method that obtains high-quality feasible solutions via block coordinate descent. As the main building block of our algorithm, we develop an exact algorithm for the univariate case based on dynamic programming, which can be of independent interest. We establish new theoretical guarantees for both the prediction and the cluster recovery performance of our estimator. Our numerical experiments on synthetic and real datasets demonstrate that our proposed estimator tends to outperform the state-of-the-art.
SPARTA: An Optimization Framework for Differentially Private Sparse Fine-Tuning
2025-08-03
article · Open access · KDD ’25, Toronto, ON, Canada
TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks
ArXiv.org · 2025-05-29
preprint · Open access · Senior author
Network pruning reduces the computational requirements of large neural networks, with N:M sparsity -- retaining only N out of every M consecutive weights -- offering a compelling balance between compressed model quality and hardware acceleration. However, N:M sparsity only accelerates forward-pass computations, as N:M patterns are not preserved during matrix transposition, limiting efficiency during training where both passes are computationally intensive. While transposable N:M sparsity has been proposed to address this limitation, existing methods for finding transposable N:M sparse masks either fail to scale to large models or are restricted to M=4, which results in a suboptimal compression-accuracy trade-off. We introduce an efficient solver for transposable N:M masks that scales to billion-parameter models. We formulate mask generation as optimal transport problems and solve them through entropy regularization and Dykstra's algorithm, followed by a rounding procedure. Our tensor-based implementation exploits GPU parallelism, achieving up to 100x speedup with only 1-10% error compared to existing methods. Our approach can be integrated with layer-wise N:M pruning frameworks including Wanda, SparseGPT and ALPS to produce transposable N:M sparse models with arbitrary N:M values. Experiments show that LLaMA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms the standard 2:4 sparse model, showing the practical value of our approach.
Efficient Algorithms for Leveraging LLMs for Generative and Predictive Recommender Systems
2025-05-08
article
Large language models (LLMs) have taken the world by storm, revolutionizing the use of AI in products. While scaling laws demonstrate that larger models yield better results, making them work in production is hard, often due to latency demands on inference. In this proposed tutorial, we will share optimizations - both algorithmic and systems-related - that help leverage LLMs (both small and large) for recommendation and generative AI use cases at planet scale for the world's largest professional network - LinkedIn. In the first part of the tutorial, we will discuss state-of-the-art (SOTA) model quantization and pruning techniques. This will be in conjunction with a discussion on GPU kernel-level optimizations including minimizing memory copying, effectively utilizing shared memory, optimizing thread scheduling, and maximizing parallel efficiency. We will discuss our own experience with inventing and leveraging such techniques, while also discussing the latest advancements from other enterprises and the open source world. Our discussions will cover models ranging in size from 1 billion to 100 billion+ parameters. In the second part of the tutorial, we will discuss the latest advancements in the world of LLM knowledge distillation which can result in training very powerful and performant small language models (SLMs). We will also discuss effective instruction tuning and preference alignment techniques that help with improving accuracy and quality of results for generative use cases. Finally, we will discuss actual production use cases that benefit from the aforementioned techniques at planet scale for LinkedIn.
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
ArXiv.org · 2025-09-15
preprint · Open access · Senior author
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
ArXiv.org · 2025-02-20 · 1 citation
preprint · Open access · Senior author
Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
Recent grants
III: Small: A New Perspective on Grouped Variable Selection via Modern Optimization
NSF · $318k · 2017–2022
Frequent coauthors
- Hussein Hazimeh · 26 shared
- Dimitris Bertsimas · 14 shared
- Robert M. Freund · Massachusetts Institute of Technology · 13 shared
- Paul Grigas · 13 shared
- Kayhan Behdin · 13 shared
- Shibal Ibrahim · 12 shared
- Wenyu Chen · Chinese University of Hong Kong · 11 shared
- Haoyue Wang · 11 shared
Labs
MIT Sloan School of Management · PI
Awards & honors
- 2024 Leo Breiman Junior Award from the Statistical Learning…
- 2023 International Indian Statistical Association (IISA) Ear…
- 2021 Donald P. Gaver, Jr. Early Career Award from INFORMS
- 2018 Young Investigator Program (YIP) Award from the Office…
- 2020 INFORMS Optimization Society Prize for Young Researcher…