Cynthia D. Rudin
Gilbert, Louis, and Edward Lehrman Distinguished Professor · Duke University · Computer Science
Active 2003–2026
About
Cynthia D. Rudin is a professor of computer science, electrical and computer engineering, statistical science, and biostatistics & bioinformatics at Duke University. She directs the Interpretable Machine Learning Lab and holds the Gilbert, Louis, and Edward Lehrman Distinguished Professorship. She earned her undergraduate degree at the University at Buffalo and her PhD at Princeton University, and previously held positions at MIT, Columbia, and NYU. Her research focuses on artificial intelligence, machine learning, and data science, with an emphasis on interpretability and practical applications. Rudin's awards include the 2022 Squirrel AI Award for Artificial Intelligence for the Benefit of Humanity from the AAAI, which has been described as comparable to the Nobel Prize and the Turing Award. She is a three-time winner of the INFORMS Innovative Applications in Analytics Award and has been named one of the 'Top 40 Under 40' by Poets and Quants and one of the most impressive professors at MIT by Business Insider. She is a fellow of the American Statistical Association and the Institute of Mathematical Statistics, and has chaired sections within INFORMS and the American Statistical Association. Rudin has served on committees for DARPA, the National Institute of Justice, AAAI, ACM SIGKDD, and three committees for the National Academies of Sciences, Engineering, and Medicine. She has delivered keynote and invited talks at major conferences such as KDD, AISTATS, and the Nobel Conference. Her work has been featured in news outlets including the New York Times, Washington Post, Wall Street Journal, and NPR.
Research topics
- Machine Learning
- Artificial Intelligence
- Computer Science
- Mathematics
- Data Mining
- Theoretical computer science
Selected publications
AutoSchA: Automatic Hierarchical Music Representations via Multi-Relational Node Isolation
Proceedings of the AAAI Conference on Artificial Intelligence · 2026-03-14
article · Open access
Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive; the analysis of a single piece of music requires a great deal of time and effort from trained experts. The representation of hierarchical analyses in a computer-readable format is also a further challenge. Given recent developments in hierarchical deep learning and increasing quantities of computer-readable data, there is great promise in extending such work for an automatic hierarchical representation framework. This paper thus introduces a novel approach, AutoSchA, which extends recent developments in graph neural networks (GNNs) for hierarchical music analysis. AutoSchA features three key contributions: 1) a new graph learning framework for hierarchical music representation, 2) a new graph pooling mechanism based on node isolation that directly optimizes learned pooling assignments, and 3) a state-of-the-art architecture that integrates such developments for automatic hierarchical music analysis. We show, in a suite of experiments, that AutoSchA performs comparably to human experts when analyzing Baroque fugue subjects.
bioRxiv (Cold Spring Harbor Laboratory) · 2025-02-14
preprint · Open access
Despite antiretroviral therapy (ART), people with HIV (PWH) on ART experience higher rates of morbidity and mortality vs. age-matched HIV negative controls, which may be driven by chronic inflammation due to persistent virus. We performed bulk RNA sequencing (RNA-seq) on peripheral CD4+ T cells, as well as quantified plasma immune marker levels from 154 PWH on ART to identify host immune signatures associated with immune recovery (CD4:CD8) and HIV persistence (cell-associated HIV DNA and RNA). Using a novel dimension reduction tool, Pairwise Controlled Manifold Approximation (PaCMAP), we defined three distinct participant transcriptomic clusters. We found that these three clusters were largely defined by differential expression of genes regulated by the transcription factor NF-κB. While clustering was not associated with HIV reservoir size, we observed an association with CD4:CD8 ratio, a marker of immune recovery and prognostic factor for mortality in PWH on ART. Furthermore, distinct patterns of plasma IL-1β, TNF-α and GCSF were also strongly associated with the clusters, suggesting that these immune markers play a key role in CD4+ T cell transcriptomic diversity and immune recovery in PWH on ART. These findings reveal novel subgroups of PWH on ART with distinct immunological characteristics, and define a transcriptional signature associated with clinically significant immune parameters for PWH. A deeper understanding of these subgroups could advance clinical strategies to treat HIV-associated immune dysfunction.
Harvard Dataverse · 2025-10-22
dataset · Open access · Senior author
Many major works in social science employ matching to make causal conclusions, but different matches on the same data may produce different treatment effect estimates, even when they achieve similar balance or minimize the same loss function. We discuss reasons and consequences of this problem. We present evidence of this problem by replicating ten papers that use matching and we find that different popular matching algorithms produce inconsistent results. We introduce Matching Bounds: a finite-sample, nonstochastic method that allows analysts to know whether a matched sample that produces different results with the same levels of balance and overall match quality could be obtained from their data. We apply Matching Bounds to a replication of two studies and show that in one case results are robust to this issue and in another they are not.
Proceedings of the AAAI Conference on Artificial Intelligence · 2025-04-11
article · Open access · Senior author
Health outcomes depend on complex environmental and sociodemographic factors whose effects change over location and time. Only recently has fine-grained spatial and temporal data become available to study these effects, namely the MEDSAT dataset of English health, environmental, and sociodemographic information. Leveraging this new resource, we use a variety of variable importance techniques to robustly identify the most informative predictors across multiple health outcomes. We then develop an interpretable machine learning framework based on Generalized Additive Models (GAMs) and Multiscale Geographically Weighted Regression (MGWR) to analyze both local and global spatial dependencies of each variable on various health outcomes. Our findings identify NO2 as a global predictor for asthma, hypertension, and anxiety, alongside other outcome-specific predictors related to occupation, marriage, and vegetation. Regional analyses reveal local variations with air pollution and solar radiation, with notable shifts during COVID. This comprehensive approach provides actionable insights for addressing health disparities, and advocates for the integration of interpretable machine learning in public health.
The Journal of Politics · 2025-12-08
article · Open access · Senior author
NodMAISI: Nodule-Oriented Medical AI for Synthetic Imaging
ArXiv.org · 2025-12-19
article · Open access
Objective: Although medical imaging datasets are increasingly available, abnormal and annotation-intensive findings critical to lung cancer screening, particularly small pulmonary nodules, remain underrepresented and inconsistently curated. Methods: We introduce NodMAISI, an anatomically constrained, nodule-oriented CT synthesis and augmentation framework trained on a unified multi-source cohort (7,042 patients, 8,841 CTs, 14,444 nodules). The framework integrates: (i) a standardized curation and annotation pipeline linking each CT with organ masks and nodule-level annotations, (ii) a ControlNet-conditioned rectified-flow generator built on MAISI-v2's foundational blocks to enforce anatomy- and lesion-consistent synthesis, and (iii) lesion-aware augmentation that perturbs nodule masks (controlled shrinkage) while preserving surrounding anatomy to generate paired CT variants. Results: Across six public test datasets, NodMAISI improved distributional fidelity relative to MAISI-v2 (real-to-synthetic FID range 1.18 to 2.99 vs 1.69 to 5.21). In lesion detectability analysis using a MONAI nodule detector, NodMAISI substantially increased average sensitivity and more closely matched clinical scans (IMD-CT: 0.69 vs 0.39; DLCS24: 0.63 vs 0.20), with the largest gains for sub-centimeter nodules where MAISI-v2 frequently failed to reproduce the conditioned lesion. In downstream nodule-level malignancy classification trained on LUNA25 and externally evaluated on LUNA16, LNDbv4, and DLCS24, NodMAISI augmentation improved AUC by 0.07 to 0.11 at <=20% clinical data and by 0.12 to 0.21 at 10%, consistently narrowing the performance gap under data scarcity.
Graph-based design of irregular metamaterials
International Journal of Mechanical Sciences · 2025-04-13 · 4 citations
article
ArXiv.org · 2025-10-14
preprint · Open access · Senior author
Variable importance (VI) methods are often used for hypothesis generation, feature selection, and scientific validation. In the standard VI pipeline, an analyst estimates VI for a single predictive model with only the observed features. However, the importance of a feature depends heavily on which other variables are included in the model, and essential variables are often omitted from observational datasets. Moreover, the VI estimated for one model is often not the same as the VI estimated for another equally good model, a phenomenon known as the Rashomon Effect. We address these gaps by introducing UNobservables and Inference for Variable importancE using Rashomon SEts (UNIVERSE). Our approach adapts Rashomon sets, the sets of near-optimal models in a dataset, to produce bounds on the true VI even with missing features. We theoretically guarantee the robustness of our approach, show strong performance on semi-synthetic simulations, and demonstrate its utility in a credit risk task.
arXiv (Cornell University) · 2025-01-03
preprint · Open access · Senior author
Recent grants
CAREER: New Approaches for Ranking in Machine Learning
NSF · $480k · 2011–2017
NSF · $625k · 2022–2026
CAREER: New Approaches for Ranking in Machine Learning
NSF · $480k · 2016–2018
NSF · $1.8M · 2022–2026
Frequent coauthors
- Alina Jade Barnett · Duke University · 38 shared
- Alexander Volfovsky · Duke University · 32 shared
- M. Brandon Westover · Harvard University · 31 shared
- Margo Seltzer · 28 shared
- Wendong Ge · Beth Israel Deaconess Medical Center · 24 shared
- Edward P. Browne · University of North Carolina at Chapel Hill · 23 shared
- Chaofan Chen · Southeast University · 23 shared
- Lesia Semenova · Duke University · 22 shared
Awards & honors
- 2022 Squirrel AI Award for Artificial Intelligence for the B…
- Three-time winner of the INFORMS Innovative Applications in…
- Fellow of the American Statistical Association
- Fellow of the Institute of Mathematical Statistics
- Named as one of the "Top 40 Under 40" by Poets and Quants in…