
Romila Pradhan
Assistant Professor · Purdue University · Department of Computer and Information Technology
Active 2015–2026
About
Romila Pradhan is an Assistant Professor in the Department of Computer & Information Technology at Purdue University in West Lafayette, Indiana. Her research interests are in the areas of data management and data science, with a focus on building trustworthy and responsible decision-making systems. Recently, her work has been centered on developing systems that facilitate explainability, fairness, and accountability in data-driven decision-making systems. Dr. Pradhan holds a Ph.D. in Computer Science from Purdue University, and both a B.S. and M.S. in Mathematics and Computing from the Indian Institute of Technology (IIT) Kharagpur, India. Prior to her current position, she was a Postdoctoral Researcher at the Halıcıoğlu Data Science Institute at the University of California San Diego and served as a Visiting Assistant Professor in the Department of Computer Science at Purdue University. She has received notable awards including the NSF CAREER Award in 2023 and the Google Research Scholar Award in 2022.
Research topics
- Artificial Intelligence
- Computer Science
- Data Mining
- Machine Learning
- Data science
- Management science
- Engineering
- Knowledge management
- Cognitive science
- Psychology
- Mathematics
- Statistics
- Programming language
Selected publications
Selective Data Expansion for Model Performance
OpenProceedings · 2026-01-01
dataset · Open access · Senior author
Explanations for Machine Learning Pipelines under Data Drift
2025-06-22 · 1 citations
article · Open access · Senior author
Ensuring the robustness of data preprocessing pipelines is essential for maintaining the reliability of machine learning model performance in the face of real-world data shifts. Traditional methods optimize preprocessing sequences for specific datasets but often overlook their vulnerability to future data variations. This research introduces a vulnerability score to quantify the susceptibility of preprocessing components to data shift. We propose a linear regression approach to establish a predictive relationship between the vulnerability of the pipeline components and changes in the model's performance. The generated relationships act as explanations for practitioners of the system and help them quantify the robustness of the pipeline to data shift. For a given pipeline, we generate an explanation that highlights a tolerable threshold beyond which a component is considered shift-vulnerable and is likely to contribute to performance degradation. For the shift-vulnerable scenarios, we further suggest a new pipeline for system maintainers that preserves the model performance without retraining. The proposed framework delivers a risk-aware assessment, empowering practitioners to anticipate potential performance changes and adapt their pipeline strategies accordingly. Experiments on several real-world datasets yield valid explanations for pipeline robustness and demonstrate the opportunities in this field of research.
Explanations for Machine Learning Pipelines under Data Drift
2025-01-01
article · Senior author
SourceSplice: Source Selection for Machine Learning Tasks
ArXiv.org · 2025-07-29
preprint · Open access · Senior author
Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks, a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility, and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing, a core concept used in protein synthesis. We empirically evaluate our algorithms on three real-world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.
Example-based Explanations for Random Forests using Machine Unlearning
arXiv (Cornell University) · 2024-02-07 · 1 citations
preprint · Open access · Senior author
Tree-based machine learning models, such as decision trees and random forests, have been hugely successful in classification tasks primarily because of their predictive power in supervised learning tasks and ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory outcomes. Given their overwhelming success for most tasks, it is of interest to identify sources of their unexpected and discriminatory behavior. However, there has not been much work on understanding and debugging tree-based classifiers in the context of fairness. We introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to identify training data subsets responsible for instances of fairness violations in the outcomes of a random forest classifier. FairDebugger generates top-$k$ explanations (in the form of coherent training data subsets) for model unfairness. Toward this goal, FairDebugger first utilizes machine unlearning to estimate the change in the tree structures of the random forest when parts of the underlying training data are removed, and then leverages the Apriori algorithm from frequent itemset mining to reduce the subset search space. We empirically evaluate our approach on three real-world datasets, and demonstrate that the explanations generated by FairDebugger are consistent with insights from prior studies on these datasets.
Data Acquisition for Improving Model Fairness using Reinforcement Learning
arXiv (Cornell University) · 2024-12-04
preprint · Open access · Senior author
Machine learning systems are increasingly being used in critical decision-making domains such as healthcare, finance, and criminal justice. Concerns around their fairness have resulted in several bias mitigation techniques that emphasize the need for high-quality data to ensure fairer decisions. However, the role of earlier stages of machine learning pipelines in mitigating model bias has not been well explored. In this paper, we focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness. Since not all data points in a data pool are equally beneficial to the task of fairness, we generate an ordering in which data points should be acquired. We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire. Over several iterations, DataSift selects a partition and randomly samples a batch of data points from the selected partition, evaluates the benefit of acquiring the batch on model fairness, and updates the utility of partitions depending on the benefit. To further improve the effectiveness and efficiency of evaluating batches, we leverage influence functions that estimate the effect of acquiring a batch without retraining the model. We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even while acquiring only a few data points.
Studying Human Factors Aspects of Text Classification Task Using Eye Tracking
Lecture notes in computer science · 2023-01-01 · 1 citations
book-chapter
Generating Interpretable Data-Based Explanations for Fairness Debugging using Gopher
Proceedings of the 2022 International Conference on Management of Data · 2022-06-10 · 7 citations
article · Open access
Machine learning (ML) models, while increasingly being used to make life-altering decisions, are known to reinforce systemic bias and discrimination. Consequently, practitioners and model developers need tools to facilitate debugging for bias in ML models. We introduce Gopher, a system that generates compact, interpretable and causal explanations for ML model bias. Gopher identifies the top-k coherent subsets of the training data that are root causes for model bias by quantifying the extent to which removing or updating a subset can resolve the bias. We describe the architecture of Gopher and will walk the audience through real-world use cases to highlight how Gopher generates explanations that enable data scientists to understand how subsets of the training data contribute to the bias of an ML model. Gopher is available as open-source software; the code and the demonstration video are available at https://gopher-sys.github.io/.
Interpretable Data-Based Explanations for Fairness Debugging
Proceedings of the 2022 International Conference on Management of Data · 2022 · 47 citations
1st author · Corresponding
- Computer Science
- Machine Learning
A wide variety of fairness metrics and eXplainable Artificial Intelligence (XAI) approaches have been proposed in the literature to identify bias in machine learning models that are used in critical real-life contexts. However, merely reporting on a model's bias or generating explanations using existing XAI techniques is insufficient to locate and eventually mitigate sources of bias. We introduce Gopher, a system that produces compact, interpretable, and causal explanations for bias or unexpected model behavior by identifying coherent subsets of the training data that are root-causes for this behavior. Specifically, we introduce the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias. Building on this concept, we develop an efficient approach for generating the top-k patterns that explain model bias by utilizing techniques from the machine learning (ML) community to approximate causal responsibility, and using pruning rules to manage the large search space for patterns. Our experimental evaluation demonstrates the effectiveness of Gopher in generating interpretable explanations for identifying and debugging sources of bias.
Explainable AI: Foundations, Applications, Opportunities for Data Management Research
2022 IEEE 38th International Conference on Data Engineering (ICDE) · 2022 · 11 citations
1st author · Corresponding
- Computer Science
- Artificial Intelligence
Algorithmic decision-making systems are successfully being adopted in a wide range of domains for diverse tasks. While the potential benefits of algorithmic decision-making are many, the importance of trusting these systems has only recently attracted attention. There has been a recent resurgence of interest in explainable artificial intelligence (XAI) that aims to reduce the opacity of a model by explaining its behavior, its predictions or both, thus allowing humans to scrutinize and trust the model. A host of technical advances have been made and several explanation methods have been proposed in recent years that address the problem of model explainability. In this tutorial, we will present these novel explanation approaches, characterize their strengths and limitations, and enumerate opportunities for data management research in the context of XAI.
Frequent coauthors
- Babak Salimi · 9 shared
- Sainyam Galhotra · Cornell University · 7 shared
- Sunil Prabhakar · Purdue University System · 4 shared
- Siarhei Bykau · 3 shared
- Jiongli Zhu · 3 shared
- Boris Glavic · University of Illinois Chicago · 3 shared
- Aditya Lahiri · University of California, San Diego · 2 shared
- Gaurav Nanda · Purdue University West Lafayette · 1 shared
Awards & honors
- NSF CAREER Award (2023)
- Google Research Scholar Award (2022)