
Irina Gaynanova
Associate Professor, Biostatistics · University of Michigan · Biostatistics
Active 2013–2026
About
Irina Gaynanova is an Associate Professor of Biostatistics at the University of Michigan School of Public Health. Her research focuses on the development of statistical methods for analyzing modern high-dimensional biomedical data. Her methodological interests include data integration, machine learning, and high-dimensional statistics, motivated by challenges arising in the analysis of multi-omics data such as RNASeq, metabolomics, microbiome data, and data from wearable devices like continuous glucose monitors, ambulatory blood pressure monitors, and activity trackers. Her work involves developing innovative statistical methods for the integration of high-dimensional disparate data sources and creating new machine learning techniques for extracting features from digital health data. Her research has been funded by the National Science Foundation and has received recognition through awards such as the David P. Byar Young Investigator Award and an NSF CAREER Award. Dr. Gaynanova holds a PhD in Statistics from Cornell University and a Master's degree in Statistics from the same institution, as well as a diploma in Applied Math and Computer Science from Lomonosov Moscow State University.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Data Mining
- Psychology
- Operating system
Selected publications
Fréchet regression of multivariate distributions with nonparanormal transport
Open MIND · 2026-03-07
preprint · Senior author
Regression with distribution-valued responses and Euclidean predictors has gained increasing scientific relevance. While methodology for univariate distributional data has advanced rapidly in recent years, multivariate distributions, which additionally encode dependence across univariate marginals, have received less attention and pose computational and statistical challenges. In this work, we address these challenges with a new regression approach for multivariate distributional responses, in which distributions are modeled within the semiparametric nonparanormal family. By incorporating the nonparanormal transport (NPT) metric -- an efficient closed-form surrogate for the Wasserstein distance -- into the Fréchet regression framework, our approach decomposes the problem into separate regressions of marginal distributions and their dependence structure, facilitating both efficient estimation and granular interpretation of predictor effects. We provide theoretical justification for NPT, establishing its topological equivalence to the Wasserstein distance and proving that it mitigates the curse of dimensionality. We further prove uniform convergence guarantees for regression estimators, both when distributional responses are fully observed and when they are estimated from empirical samples, attaining fast convergence rates comparable to the univariate case. The utility of our method is demonstrated via simulations and an application to continuous glucose monitoring data.
Statistical Methods in Medical Research · 2026-01-23
article · Senior author · Corresponding
Increasing epidemiologic evidence suggests that the diversity and composition of the gut microbiome can predict infection risk in cancer patients. Infections remain a major cause of morbidity and mortality during chemotherapy. Analyzing microbiome data to identify associations with infection pathogenesis for proactive treatment has become a critical research focus. However, the high-dimensional nature of the data necessitates the use of dimension-reduction methods to facilitate inference and interpretation. Traditional dimension reduction methods, which assume Gaussianity, perform poorly with skewed and zero-inflated microbiome data. To address these challenges, we propose a semiparametric principal component analysis method based on a truncated latent Gaussian copula model that accommodates both skewness and zero inflation. Simulation studies demonstrate that the proposed method outperforms existing approaches by providing more accurate estimates of scores and loadings across various copula transformation settings. We apply our method, along with competing approaches, to gut microbiome data from pediatric patients with acute lymphoblastic leukemia. The principal scores derived from the proposed method reveal the strongest associations between pre-chemotherapy microbiome composition and adverse events during subsequent chemotherapy, offering valuable insights for improving patient outcomes.
Fréchet regression of multivariate distributions with nonparanormal transport
arXiv (Cornell University) · 2026-03-07
article · Open access · Senior author
Regression with distribution-valued responses and Euclidean predictors has gained increasing scientific relevance. While methodology for univariate distributional data has advanced rapidly in recent years, multivariate distributions, which additionally encode dependence across univariate marginals, have received less attention and pose computational and statistical challenges. In this work, we address these challenges with a new regression approach for multivariate distributional responses, in which distributions are modeled within the semiparametric nonparanormal family. By incorporating the nonparanormal transport (NPT) metric -- an efficient closed-form surrogate for the Wasserstein distance -- into the Fréchet regression framework, our approach decomposes the problem into separate regressions of marginal distributions and their dependence structure, facilitating both efficient estimation and granular interpretation of predictor effects. We provide theoretical justification for NPT, establishing its topological equivalence to the Wasserstein distance and proving that it mitigates the curse of dimensionality. We further prove uniform convergence guarantees for regression estimators, both when distributional responses are fully observed and when they are estimated from empirical samples, attaining fast convergence rates comparable to the univariate case. The utility of our method is demonstrated via simulations and an application to continuous glucose monitoring data.
Fast distance computation of multivariate distributions via nonparanormal transport
arXiv (Cornell University) · 2026-02-27
preprint · Open access · Senior author
With the increasing availability of data objects in the form of probability distributions, there is a growing need for statistical methods tailored to distributional data. Distance measures, especially the pairwise distance matrix between data objects, provide the foundation for a wide range of modern data analysis methods, such as clustering, multidimensional scaling, and distance-based regression, among others. The Wasserstein distance is commonly used with distributional data due to its compelling optimal transport property. However, while the Wasserstein distance can be efficiently computed for univariate distributions, its application to multivariate distributions is limited due to high computational costs. To address these scalability issues, we introduce the Nonparanormal Transport (NPT) metric, a closed-form distance based on the flexible nonparanormal distribution family for modeling skewed and non-Gaussian multivariate data. Simulation studies demonstrate that NPT maintains a high level of agreement with the Wasserstein distance, while being at least 1000 times faster than its efficient variants when computing a 100-distribution pairwise distance matrix in both 2 and 5 dimensions. We illustrate the utility of NPT through a multidimensional scaling analysis of bivariate oxygen desaturation distributions of 723 individuals with sleep apnea in the Sleep Heart Health Study.
Diabetes Care · 2026-03-30
article · Senior author
OBJECTIVE: Consensus guidelines recommend at least 14 consecutive days of continuous glucose monitoring (CGM) with 70% completeness to represent 90-day glycemic exposure. This study quantifies bias and uncertainty introduced into downstream analyses by using CGM metrics from incomplete or reduced monitoring, relative to a 90-day complete profile. RESEARCH DESIGN AND METHODS: Using a type 1 diabetes cohort with 1,010 complete 90-day CGM profiles, we simulated incomplete profiles by varying monitoring duration and data completeness. Consensus CGM metrics were computed on incomplete and complete profiles to quantify measurement error, which was propagated into two downstream regression models: 1) CGM metric is an outcome for a binary treatment (clinical trial setting); 2) CGM metric is an explanatory variable (covariate) for another continuous outcome. Bias was quantified using observed-to-true effect size ratios and uncertainty by the sample size increase required to maintain precision. RESULTS: In the clinical trial setting, treatment effects remain unbiased but lose precision; for time in range (TIR), 14 days required ≥16% more participants versus 90 days; 30 days required ≥6.5%. When the CGM metric is a covariate, associations with outcomes are attenuated (biased toward zero up to 14% at 14 days and 6% at 30 days for TIR) and less precise. CONCLUSIONS: Representing 90 days of glycemic exposure with 14 days can lead to bias and loss of precision in downstream analyses. We recommend study protocols require at least 30 days of CGM with 70% completeness. If 30 days is not feasible, studies should plan for increased sample sizes.
2026-03-30
article · Senior author
Objective: Consensus guidelines recommend at least 14 consecutive days of CGM monitoring with 70% completeness to represent 90-day glycemic exposure. This study quantifies bias and uncertainty introduced into downstream analyses by using CGM metrics from incomplete or reduced monitoring, relative to a 90-day complete profile.
Research Design and Methods: Using a type 1 diabetes cohort with 1,010 complete 90-day CGM profiles, we simulated incomplete profiles by varying monitoring duration and data completeness. Consensus CGM metrics were computed on incomplete and complete profiles to quantify measurement error, which was propagated into two downstream regression models: (a) CGM metric is an outcome for a binary treatment (clinical trial setting); (b) CGM metric is an explanatory variable (covariate) for another continuous outcome. Bias was quantified using observed-to-true effect size ratios, and uncertainty by the sample size increase required to maintain precision.
Results: In the clinical trial setting, treatment effects remain unbiased but lose precision; for Time In Range (TIR), 14 days required ≥16% more participants versus 90 days; 30 days required ≥6.5%. When the CGM metric is a covariate, associations with outcomes are attenuated (biased towards zero up to 14% at 14 days and 6% at 30 days for TIR) and less precise.
Conclusions: Representing 90 days of glycemic exposure with 14 days can lead to bias and loss of precision in downstream analyses. We recommend study protocols require at least 30 days of CGM monitoring with 70% completeness. If 30 days is not feasible, studies should plan for increased sample sizes.
Do Immediate Perioperative Glucose Measurements Predict Outcomes in Non-Elective Pedal Amputation?
Foot & Ankle Specialist · 2026-02-25
article · Open access
Background: Approximately 20% of diabetic foot ulcers progress to amputation. While elevated glucose levels are known to increase infection risk in elective surgeries, their role in outcomes following non-elective amputation remains unclear.
Methods: We conducted a 2-year retrospective chart review of adult patients who underwent non-elective, diabetes-related lower-extremity amputations at a tertiary care health system. Of 185 charts reviewed, 108 patients with at least 6 months of follow-up were included. Preoperative and immediate postoperative glucose values were recorded. Primary and secondary outcomes included healing time, postoperative infection, emergency department visits, and hospital readmissions. Multivariable regression models were used to adjust for patient sex, amputation level, and relevant comorbidities.
Results: The mean healing time was 13.8 weeks (SD 12.9). Elevated perioperative glucose (>180 mg/dL) was associated with a 42% increase in healing time (P = .037). Postoperative infections occurred in 14.8% of patients and were associated with an almost two-fold increase in healing time (P = .001), as well as increased rates of emergency department visits and readmissions. Peripheral arterial disease and end-stage renal disease were independently associated with delayed healing and higher readmission rates.
Conclusion: Elevated perioperative glucose levels, postoperative infection, peripheral arterial disease, and end-stage renal disease are associated with prolonged wound healing and higher complication rates after non-elective lower-extremity amputations in patients with diabetes. These findings underscore the importance of perioperative glucose optimization, infection prevention, and comprehensive management of comorbidities to improve surgical outcomes in this high-risk population.
Bayesian Segmented Gaussian Copula Factor Model for Single-Cell Sequencing Data
Bayesian Analysis · 2025-01-01 · 1 citation
article · Open access
Single-cell sequencing technologies have revolutionized molecular and cellular biology, offering unprecedented insights into cellular heterogeneity by enabling gene expression profiling at the resolution of individual cells. However, analysis of such data is complicated by excessive low or zero counts due to dropout events and the skewed nature of gene expression distributions, which conventional Gaussian factor models struggle to handle effectively. To address these challenges, we propose a novel Bayesian segmented Gaussian copula factor model that explicitly accounts for the inflation of zero and near-zero counts while modeling the high skewness in single-cell data. By employing a Dirichlet-Laplace prior on each column of the factor loadings matrix, we shrink factor loadings towards zero, enabling automatic selection of the number of latent factors as well as resolving the identifiability issues of factor models stemming from the rotational invariance of factor loadings without structural constraints. Through simulations with characteristics typical of single-cell data, such as excessive low counts and high skewness, we demonstrate the superior performance of our method over existing approaches. Furthermore, we apply the proposed method to a single-cell RNA-sequencing dataset from a lymphoblastoid cell line, successfully identifying biologically meaningful latent factors and detecting previously uncharacterized cell subtypes.
Recent grants
NSF · $246k · 2021–2024
Scalable Methods for Classification of Heterogeneous High-Dimensional Data
NSF · $163k · 2017–2020
Frequent coauthors
- Christian L. Müller · University of Mannheim · 23 shared
- Naresh M. Punjabi · University of Miami · 16 shared
- Grace Yoon · 14 shared
- Elizabeth Chun · Texas A&M University · 9 shared
- Alexander F. Lapanowski · Texas A&M University · 9 shared
- Johannes Lederer · Universität Hamburg · 8 shared
- Benjamin B. Risk · Emory University · 7 shared
- Renat Sergazinov · 7 shared
Awards & honors
- David P. Byar Young Investigator Award
- NSF CAREER Award