
About
Giovanni Parmigiani is a Professor of Biostatistics at Harvard University. His research interests include Bayesian decision theory, multi-study statistical methods, machine learning for precision prevention and treatment in health care, and statistical techniques in cancer biology. He is affiliated with the Department of Statistics, where he teaches and conducts research.
Research topics
- Computer Science
- Mathematics
- Artificial Intelligence
- Machine Learning
- Statistics
- Immunology
- Biology
- Medicine
- Demography
- Internal medicine
- Environmental health
- Theoretical computer science
- Genetics
Selected publications
The Prostate · 2026-04-29
article · Open access
PURPOSE: A key clinical challenge in prostate cancer is the identification and validation of biomarkers with high specificity for indolent long-term outcomes. We applied a novel statistical method to identify tumor transcriptomic biomarkers that optimally predicted patients with low metastatic potential. METHODS: Using tumor whole-transcriptome data from the Health Professionals Follow-up Study (HPFS, discovery set) and Physicians' Health Study (PHS, validation set), we compared patients who died of prostate cancer or developed metastases ("lethal," n = 113) and patients with > 8 years of metastasis-free survival ("indolent," n = 291). Whole transcriptome tumor gene expression data were generated using an Affymetrix array. We applied a novel method for optimizing a partial area under the curve (pAUC) that up-weighted indolent cases with a predefined 80%-100% specificity. This method leverages weighted logistic lasso regression, with weights chosen via cross-validation to reduce overfitting. RESULTS: Median age at cancer diagnosis was 66 years; median follow-up for outcomes was 14 years. We identified a 40-gene transcriptome signature of indolent prostate cancer, which, compared to Gleason grade groups, improved the pAUC over the predefined 80%-100% specificity range by 1.72-fold (p < 0.001) and improved overall AUC from 0.85 to 0.93 (p < 0.001). The signature improved positive predictive value for indolent tumors > 2-fold with minimal decrease in negative predictive value. Importantly, the 40-gene signature showed high discrimination among intermediate Gleason 7 tumors (Grade groups 2 and 3, AUC 0.88, 95% CI: 0.79-0.95). CONCLUSION: Incorporating pAUC into prognostic signature development improved identification of prostate tumors with low risk of metastatic potential. Its clinical application may help reduce overtreatment and overdiagnosis of indolent prostate cancers, and the pAUC may be relevant beyond prostate cancer.
arXiv (Cornell University) · 2026-04-04
preprint · Open access · Senior author
Data-driven decision making frequently relies on predicting counterfactual outcomes. In practice, researchers commonly train counterfactual prediction models on a source dataset to inform decisions on a possibly separate target population. Conformal prediction has arisen as a popular method for producing assumption-lean prediction intervals for counterfactual outcomes that would arise under different treatment decisions in the target population of interest. However, existing methods require that every confounding factor of the treatment-outcome relationship used for training on the source data is additionally measured in the target population, risking miscoverage if important confounders are unmeasured in the target population. In this paper, we introduce a computationally efficient debiased machine learning framework that allows for valid prediction intervals when only a subset of confounders is measured in the target population, a common challenge referred to as runtime confounding. Grounded in semiparametric efficiency theory, we show the resulting prediction intervals achieve desired coverage rates with faster convergence compared to standard methods. Through numerous synthetic and semi-synthetic experiments, we demonstrate the utility of our proposed method.
bayesNMF: Fast Bayesian Poisson NMF with Automatically Learned Rank Applied to Mutational Signatures
Journal of Computational and Graphical Statistics · 2026-04-13
article · Open access · Senior author
Bayesian Poisson Non-Negative Matrix Factorization (NMF) is widely used to model count data, including in cancer mutational signature analysis. However, standard Gibbs samplers rely on computationally expensive Poisson augmentation, and current software implementations learn the latent rank either through slow and potentially subjective heuristic rank selection or with automatic approaches that do not report posterior uncertainty. In this paper, we introduce bayesNMF, an MH-within-Gibbs sampler to address both of these limitations. First, we define high-overlap proposals for Metropolis-Hastings sampling to remove the need for Poisson augmentation. Second, we define a BIC-based sparsity prior to learn rank automatically within the Bayesian formulation while allowing for posterior uncertainty quantification. We provide an open-source R software package with all of the models and plotting capabilities demonstrated in this paper on GitHub at jennalandy/bayesNMF. Although our applications focus on cancer mutational signatures, our software and results can be extended to any use of Bayesian Poisson NMF.
BreakLoops: A New Feature for the Multi-Gene, Multi-Cancer Family History-Based Model, Fam3Pro
ArXiv.org · 2025-05-02
preprint · Open access
Previously, we presented PanelPRO, now known as Fam3PRO, an open-source R package for multi-gene, multi-cancer risk modeling with pedigree data. The initial release could not handle pedigrees that contained cyclic structures called loops, which occur when relatives mate. Here, we present a graph-based function called breakloops that can detect and break loops in any pedigree. The core algorithm identifies the optimal set of loop breakers when individuals in a loop have exactly one parental mating, and extends to handle cases where individuals have multiple parental matings. The algorithm transforms complex pedigrees by strategically creating clones of key individuals to disrupt cycles while minimizing computational complexity. Our extensive testing demonstrates that this new feature can handle a wide variety of pedigree structures. The breakloops function is available in Fam3Pro version 2.0.0. This advancement enables Fam3Pro to assess cancer risk in a wider range of family structures, enhancing its applicability in clinical settings.
Causal Inference for Latent Outcomes Learned with Factor Models
ArXiv.org · 2025-06-25
preprint · Open access · Senior author
In many fields (including genomics, epidemiology, natural language processing, social and behavioral sciences, and economics), it is increasingly important to address causal questions in the context of factor models or representation learning. In this work, we investigate causal effects on latent outcomes derived from high-dimensional observed data using nonnegative matrix factorization. To the best of our knowledge, this is the first study to formally address causal inference in this setting. A central challenge is that estimating a latent factor model can cause an individual's learned latent outcome to depend on other individuals' treatments, thereby violating the standard causal inference assumption of no interference. We formalize this issue as learning-induced interference and distinguish it from interference present in a data-generating process. To address this, we propose a novel, intuitive, and theoretically grounded algorithm to estimate causal effects on latent outcomes while mitigating learning-induced interference and improving estimation efficiency. We establish theoretical guarantees for the consistency of our estimator and demonstrate its practical utility through simulation studies and an application to cancer mutational signature analysis. All baseline and proposed methods are available in our open-source R package, causalLFO.
Independent and Complementary Value of RNA Expression Signatures in High-Risk Multiple Myeloma
Clinical Lymphoma Myeloma & Leukemia · 2025-09-01
article
Bayesian Probit Multi-Study Non-negative Matrix Factorization for Mutational Signatures
ArXiv.org · 2025-02-03
preprint · Open access
Mutational signatures are patterns of somatic mutations in tumor genomes that provide insights into underlying mutagenic processes and cancer origin. Developing reliable methods for their estimation is of growing importance in cancer biology. Somatic mutation data are often collected for different cancer types, highlighting the need for multi-study approaches that enable joint analysis in a principled and integrative manner. Despite significant advancements, statistical models tailored for analyzing the genomes of multiple cancer types remain underexplored. In this work, we introduce a Bayesian Multi-Study Non-negative Matrix Factorization (NMF) approach that uses mixture modeling to incorporate sparsity in the exposure weights of each subject to mutational signatures, allowing for individual tumor profiles to be represented by a subset rather than all signatures, and making this subset depend on covariates. This allows for a) more precise ability to identify meaningful contributions of mutational signatures at the individual level; b) estimation of the prevalence of activity of signatures within a cancer type, defined by the proportion of tumor profiles where a certain signature is present; and c) de-novo identification of interpretable patient subtypes based on the mutational signatures present within their mutational profile. We apply our approach to the mutational profiles of tumors from seven different cancer types, demonstrating its ability to accurately estimate mutational signatures while uncovering both individual and tissue-specific differences. An R package implementing our method is available at https://github.com/blhansen/BAPmultiNMF.
Multivariate Causal Effects: a Bayesian Causal Regression Factor Model
arXiv (Cornell University) · 2025-04-04
preprint · Open access
The impact of wildfire smoke on air quality is a growing concern, contributing to air pollution through a complex mixture of chemical species with important implications for public health. While previous studies have primarily focused on its association with total particulate matter (PM2.5), the causal relationship between wildfire smoke and the chemical composition of PM2.5 remains largely unexplored. Exposure to these chemical mixtures plays a critical role in shaping public health, yet capturing their relationships requires advanced statistical methods capable of modeling the complex dependencies among chemical species. To fill this gap, we propose a Bayesian causal regression factor model that estimates the multivariate causal effects of wildfire smoke on the concentration of 27 chemical species in PM2.5 across the United States. Our approach introduces two key innovations: (i) a causal inference framework for multivariate potential outcomes, and (ii) a novel Bayesian factor model that employs a probit stick-breaking process as prior for treatment-specific factor scores. By focusing on factor scores, our method addresses the missing data challenge common in causal inference and enables a flexible, data-driven characterization of the latent factor structure, which is crucial to capture the complex correlation among multivariate outcomes. Through Monte Carlo simulations, we show the model's accuracy in estimating the causal effects in multivariate outcomes and characterizing the treatment-specific latent structure. Finally, we apply our method to US air quality data, estimating the causal effect of wildfire smoke on 27 chemical species in PM2.5, providing a deeper understanding of their interdependencies.
Bayesian multi-study non-negative matrix factorization for mutational signatures
Genome Biology · 2025-04-16 · 4 citations
article · Open access · Senior author
Mutational signatures are typically identified from tumor genome sequencing data using non-negative matrix factorization (NMF). However, existing NMF techniques only decompose a single dataset, limiting rigorous comparisons of signatures across conditions. We propose a Bayesian NMF method that jointly decomposes multiple datasets to identify signatures and their sharing pattern across conditions. We propose a fully unsupervised "discovery-only" model and a semi-supervised "recovery-discovery" model that simultaneously estimates known and novel signatures, and extend both to estimate covariate effects. We demonstrate our approach on extensive simulations, and apply our method to answer questions related to colorectal cancer and early-onset breast cancer.
Recent grants
NIH · $997k · 2010
Multi-study Genomic Data Analysis
NSF · $212k · 2009–2011
Training Grant in Quantitative Sciences for Cancer Research
NIH · $504k · 1979–2016
Statistical Methods for Multi-Study Predictions
NSF · $350k · 2018–2022
Advancing Statistical Methods for Multi-Study Predictions
NSF · $350k · 2021–2025
Frequent coauthors
- Bert Vogelstein · Howard Hughes Medical Institute · 422 shared
- Victor E. Velculescu · University of Baltimore · 413 shared
- Kenneth W. Kinzler · Johns Hopkins University · 411 shared
- Levi Waldron · City University of New York · 329 shared
- D. Williams Parsons · Altarum Institute · 320 shared
- Curtis Huttenhower · Harvard University · 278 shared
- Siân Jones · 272 shared
- Michael J. Birrer · Winthrop Rockefeller Foundation · 236 shared
Labs
Sports Analytics Laboratory at Harvard University · PI
Education
- 1990
PhD, Statistics
Carnegie Mellon University