
Sainyam Galhotra
Verified · Cornell University · Computer Science
Active 2014–2026
About
Sainyam Galhotra is an assistant professor in the Department of Computer Science at Cornell University. His research aims to develop data science tools for effective and responsible analytics. His work leverages techniques from causal inference, data management, theoretical computer science, machine learning, and human-computer interaction to understand various aspects of trustworthy system design, including robustness, explainability, and fairness. Prior to his current position, he was a Computing Innovation Fellow pursuing postdoctoral research at the University of Chicago. He received his Ph.D. from the University of Massachusetts, Amherst under the supervision of Barna Saha. His undergraduate studies were completed at the Indian Institute of Technology Delhi in May 2014 under the guidance of Prof. Amitabha Bagchi. Before joining UMass, he worked as a budding scientist at Xerox Research Centre India, Bangalore for a year.
Research topics
- Artificial Intelligence
- Computer Science
- Data Mining
- Machine Learning
- Knowledge management
- Algorithm
- Management science
- Mathematics
- Data science
- Engineering
Selected publications
Metric $k$-clustering using only Weak Comparison Oracles
ArXiv.org · 2026-01-27
article · Open access
Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances -- an unrealistic requirement in many modern applications. We study clustering in the \emph{Rank-model (R-model)}, where access to distances is entirely replaced by a \emph{quadruplet oracle} that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \mathsf{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most a constant times the optimum $k$-clustering cost. Our method achieves a query complexity of $O(n\cdot k \cdot \mathsf{polylog}(n))$ for arbitrary metric spaces and improves to $O((n+k^2) \cdot \mathsf{polylog}(n))$ when the underlying metric has bounded doubling dimension. When the metric has bounded doubling dimension we can further improve the approximation from constant to $1+\varepsilon$, for any arbitrarily small constant $\varepsilon\in(0,1)$, while preserving the same asymptotic query complexity. Our framework demonstrates how noisy, low-cost oracles, such as those derived from large language models, can be systematically integrated into scalable clustering algorithms.
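To make the access model concrete, here is a minimal Python sketch of a quadruplet oracle and of assigning items to given centers using only relative comparisons. It is an illustration, not the paper's algorithm: the names `make_quadruplet_oracle` and `assign_to_centers`, the Euclidean ground-truth metric, and the independent random-flip noise model are all assumptions made here, and the paper's noise model may differ.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, ...]
Oracle = Callable[[Point, Point, Point, Point], bool]


def euclidean(a: Point, b: Point) -> float:
    """Hidden ground-truth metric; the algorithm never calls this directly."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def make_quadruplet_oracle(dist: Callable[[Point, Point], float],
                           noise: float = 0.0,
                           seed: int = 0) -> Oracle:
    """Return oracle(a, b, c, d) answering "is d(a, b) <= d(c, d)?".

    Assumption: each answer is flipped independently with probability `noise`.
    """
    rng = random.Random(seed)

    def oracle(a: Point, b: Point, c: Point, d: Point) -> bool:
        truth = dist(a, b) <= dist(c, d)
        return (not truth) if rng.random() < noise else truth

    return oracle


def assign_to_centers(oracle: Oracle,
                      points: Sequence[Point],
                      centers: Sequence[Point]) -> Dict[Point, Point]:
    """Map each point to a center using only comparison queries.

    For each point, a linear scan keeps the current best center and
    replaces it whenever the oracle reports another center as strictly
    closer. This costs len(centers) - 1 queries per point.
    """
    mapping: Dict[Point, Point] = {}
    for x in points:
        best = centers[0]
        for c in centers[1:]:
            # Query: is d(x, best) <= d(x, c)? If not, c becomes the best.
            if not oracle(x, best, x, c):
                best = c
        mapping[x] = best
    return mapping
```

With a noiseless oracle this reproduces nearest-center assignment without ever reading a distance value; with `noise > 0` the linear scan can be misled, which is the difficulty the abstract's algorithms are designed to handle.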
Metric $k$-clustering using only Weak Comparison Oracles
Open MIND · 2026-01-27
preprint
VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL
2025-01-01 · 1 citation
article · Open access · Senior author
Application systems using natural language interfaces to databases (NLIDBs) have democratized data analysis. This positive development has also brought forth an urgent challenge: helping users who might use these systems without a background in statistical analysis to formulate bias-free analytical questions. Although significant research has focused on text-to-SQL generation accuracy, addressing cognitive biases in analytical questions remains underexplored. We present VeriMinder, an interactive system for detecting and mitigating such analytical vulnerabilities. Our approach introduces three key innovations: (1) a contextual semantic mapping framework for biases relevant to specific analysis contexts; (2) an analytical framework that operationalizes the Hard-to-Vary principle and guides users in systematic data analysis; and (3) an optimized LLM-powered system that generates high-quality, task-specific prompts using a structured process involving multiple candidates, critic feedback, and self-reflection. User testing confirms the merits of our approach. In direct user experience evaluation, 82.5% of participants reported that the system positively impacted the quality of their analysis. In comparative evaluation, VeriMinder scored significantly higher than alternative approaches, at least 20% better on metrics of the analysis's concreteness, comprehensiveness, and accuracy. Our system, implemented as a web application, is set to help users avoid the "wrong question" vulnerability during data analysis. The VeriMinder code base with prompts is available as MIT-licensed open-source software to facilitate further research and adoption within the community.
Data Discovery in Data Lakes: Operations, Indexes, Systems
Proceedings of the VLDB Endowment · 2025-08-01
article · Senior author
Data discovery has gained significant traction in the database community, resulting in various discovery operations, index schemes, and discovery systems. This tutorial explores the architecture and components of data discovery systems, focusing on indexing structures and scalable algorithms for typical operations, such as join and union discovery. While giving insights into individual algorithms, we point out open challenges for holistic systems, data discovery evaluation, and discovery in federated setups.
Benchmarking Robust Aggregation in Decentralized Gradient Marketplaces
ArXiv.org · 2025-09-06
preprint · Open access
The rise of distributed and privacy-preserving machine learning has sparked interest in decentralized gradient marketplaces, where participants trade intermediate artifacts like gradients. However, existing Federated Learning (FL) benchmarks overlook critical economic and systemic factors unique to such marketplaces (cost-effectiveness, fairness to sellers, and market stability), especially when a buyer relies on a private baseline dataset for evaluation. We introduce a comprehensive benchmark framework to holistically evaluate robust gradient aggregation methods within these buyer-baseline-reliant marketplaces. Our contributions include: (1) a simulation environment modeling marketplace dynamics with a variable buyer baseline and diverse seller distributions; (2) an evaluation methodology augmenting standard FL metrics with marketplace-centric dimensions such as Economic Efficiency, Fairness, and Selection Dynamics; (3) an in-depth empirical analysis of the existing Distributed Gradient Marketplace framework, MartFL, including the integration and comparative evaluation of adapted FLTrust and SkyMask as alternative aggregation strategies within it, spanning diverse datasets, local attacks, and Sybil attacks targeting the marketplace selection process; and (4) actionable insights into the trade-offs between model performance, robustness, cost, fairness, and stability. This benchmark equips the community with essential tools and empirical evidence to evaluate and design more robust, equitable, and economically viable decentralized gradient marketplaces.
A Theoretical Framework for Distribution-Aware Dataset Search
ArXiv.org · 2025-03-27
preprint · Open access
Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristics. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. Let $\mathcal{P}=\{P_1,\ldots,P_N\}$ be a repository of $N$ datasets, where $P_i\subset \mathbb{R}^d$, for $d=O(1)$. We study the percentile indexing (Ptile) problem and the preference indexing (Pref) problem under the centralized and the federated setting. In the centralized setting we assume direct access to the datasets. In the federated setting we assume access to a synopsis of each dataset. The goal of Ptile is to construct a data structure such that given a predicate (rectangle $R$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $|P_j\cap R|/|P_j|\in\theta$. The goal of Pref is to construct a data structure such that given a predicate (vector $v$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $\omega(P_j,v)\in \theta$, where $\omega(P_j,v)$ is the inner product of the $k$-th largest projection of $P_j$ on $v$. We first show that we cannot hope for near-linear data structures with polylogarithmic query time in the centralized setting. Next we show $\tilde{O}(N)$ space data structures that answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. Each data structure returns a set of indexes $J$ such that i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error $\varepsilon+2\delta$, where $\varepsilon\in(0,1)$ and $\delta$ is the error of synopses.
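The two predicates can be pinned down with a brute-force reference implementation. The Python sketch below only spells out the query semantics in the centralized setting by scanning every dataset; it is not the paper's data structure and has none of its space or time guarantees. The function names, the closed-interval reading of the interval, the axis-aligned closed rectangle, and 0-based dataset indexes are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]        # (lower corner, upper corner), closed box
Interval = Tuple[float, float]    # closed interval [lo, hi] (assumption)


def in_rect(p: Point, rect: Rect) -> bool:
    """True if p lies inside the axis-aligned rectangle, boundary included."""
    lower, upper = rect
    return all(l <= x <= u for x, l, u in zip(p, lower, upper))


def ptile_query(repo: Sequence[Sequence[Point]],
                rect: Rect,
                theta: Interval) -> List[int]:
    """Indexes j with |P_j ∩ R| / |P_j| in theta. Datasets must be non-empty."""
    lo, hi = theta
    result = []
    for j, dataset in enumerate(repo):
        fraction = sum(in_rect(p, rect) for p in dataset) / len(dataset)
        if lo <= fraction <= hi:
            result.append(j)
    return result


def omega(dataset: Sequence[Point], v: Point, k: int) -> float:
    """Score of the k-th largest inner product <p, v> over p in the dataset.

    Assumes 1 <= k <= len(dataset).
    """
    scores = sorted((sum(a * b for a, b in zip(p, v)) for p in dataset),
                    reverse=True)
    return scores[k - 1]


def pref_query(repo: Sequence[Sequence[Point]],
               v: Point,
               k: int,
               theta: Interval) -> List[int]:
    """Indexes j whose k-th largest projection score on v lies in theta."""
    lo, hi = theta
    return [j for j, dataset in enumerate(repo)
            if lo <= omega(dataset, v, k) <= hi]
```

Both functions take time linear in the total repository size per query, which is the baseline the indexing results in the abstract aim to beat.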
VERIRAG: Healthcare Claim Verification via Statistical Audit in Retrieval-Augmented Generation
2025-10-12 · 2 citations
article · Open access
Retrieval-augmented generation (RAG) systems retrieve clinically relevant evidence but remain methodologically blind, unable to judge study quality (e.g., retractions, underpowered analyses). We introduce VERIRAG, which addresses this gap through a three-part framework: (i) an 11-point Veritable audit for methodological rigor; (ii) a quality- and novelty-weighted Hard-to-Vary (HV) score to aggregate evidence; and (iii) a Dynamic Acceptance Threshold calibrated to claim boldness. Across four corpora simulating the evolution of scientific evidence, from initial flawed findings (TY0) to settled science (TY5), VERIRAG consistently outperforms RAG baselines like COT-RAG [2], Self-RAG [1], FLARE [4], and CIBER [3].
A Theoretical Framework for Distribution-Aware Dataset Search
Proceedings of the ACM on Management of Data · 2025-06-09
article
Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristic methods, which often result in biased outcomes. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. More specifically, let $\mathcal{P}=\{P_1,\ldots,P_N\}$ be a repository of $N$ datasets, such that each $P_i\subset\mathbb{R}^d$, where $d$ is a constant. We study the percentile-aware indexing (Ptile) problem and the preference-aware indexing (Pref) problem under the centralized and the federated setting. In the centralized setting, we assume direct access to the datasets in $\mathcal{P}$. In the federated setting, we are given a synopsis $S_{P_i}$, which is a compressed representation of $P_i$ that captures the structure of $P_i$, for every $i\in[N]$. For the Ptile problem, the goal is to construct a data structure such that given a predicate (query rectangle $R$ and an interval $\theta$) report all indexes $J$ such that $j\in J$ if and only if $|P_j\cap R|/|P_j|\in\theta$. For the Pref problem, the goal is to construct a data structure such that given a predicate (query vector $\vec{v}$ and an interval $\theta$) report all indexes $J$ such that $j\in J$ if and only if $\omega_k(P_j,\vec{v})\in\theta$, where $\omega_k(P_j,\vec{v})$ is the score (inner product) of the $k$-th largest projection of $P_j$ on $\vec{v}$. We first show lower bounds for the Ptile and Pref problems in the centralized setting, showing that we cannot hope for near-linear data structures with polylogarithmic query time. Then we focus on approximate data structures for both problems in both settings. We show $\tilde{O}(N)$ space data structures with $\tilde{O}(N)$ preprocessing time that can answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. The data structures return a set of indexes $J$ such that: i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error of $\varepsilon+2\delta$, where $\varepsilon$ is an arbitrarily small constant and $\delta$ is the error of the synopses.
Fair and Actionable Causal Prescription Ruleset
ArXiv.org · 2025-02-27
preprint · Open access
Prescriptions, or actionable recommendations, are commonly generated across various fields to influence key outcomes such as improving public health, enhancing economic policies, or increasing business efficiency. While traditional association-based methods may identify correlations, they often fail to reveal the underlying causal factors needed for informed decision-making. On the other hand, in decision-making for tasks with significant societal or economic impact, it is crucial to provide recommendations that are justifiable and equitable in terms of the outcome for both the protected and non-protected groups. Motivated by these two goals, this paper introduces a fairness-aware framework leveraging causal reasoning for generating a set of actionable prescription rules (ruleset) that improve an outcome without exacerbating inequalities for protected groups. By considering group and individual fairness metrics from the literature, we ensure that both protected and non-protected groups benefit from these recommendations, providing a balanced and equitable approach to decision-making. We employ efficient optimizations to explore the vast and complex search space, considering both fairness and coverage of the ruleset. Empirical evaluation and a case study on real-world datasets demonstrate the utility of our framework for different use cases.
SeerCuts: Explainable Attribute Discretization
2025-06-17
article · Open access
This demonstration showcases SeerCuts, a tool that suggests useful and semantically meaningful discretization strategies (partitions) for numerical attributes. SeerCuts is a generic, interactive framework where users specify attributes to discretize and their utility measure for a downstream task of choice. It uses GPT-4o to assess the semantic meaningfulness of candidate partitions and employs an efficient search strategy to explore the vast space of discretization options. With hierarchical clustering to group related partitions and a multi-armed bandit policy to identify useful partitions with only a few samples, SeerCuts quickly finds meaningful and useful partitions. In the demo, we will provide an overview of SeerCuts and allow the audience to explore various datasets and tasks, including data visualization and comprehensive modeling. The users will be able to evaluate how SeerCuts identifies meaningful discretization strategies and compare the tradeoff between different discretization options.
Frequent coauthors
- Barna Saha · 28 shared
- Babak Salimi · 13 shared
- Raul Castro Fernandez · 12 shared
- Soumyabrata Pal · 9 shared
- Arya Mazumdar · University of California, San Diego · 9 shared
- Divesh Srivastava · 8 shared
- Donatella Firmani · Sapienza University of Rome · 8 shared
- Udayan Khurana · University of Science and Technology · 8 shared
Education
Ph.D., Computer Science
University of Massachusetts, Amherst
Other, Computer Science
University of Chicago