
Rebecca A Hubbard
Verified · University of Pennsylvania · Rehabilitation Medicine
Active 1931–2026
About
Rebecca A Hubbard, Ph.D., is an Adjunct Professor of Biostatistics and Epidemiology at the University of Pennsylvania Perelman School of Medicine. She serves as a Senior Scholar at the Center for Clinical Epidemiology and Biostatistics, a Senior Fellow at the Institute for Biomedical Informatics, and is a member of the Abramson Cancer Center. She also holds the position of Vice Chair for Faculty Professional Development in the Department of Biostatistics, Epidemiology & Informatics. Her research focuses on the development and application of statistical methodology for studies using observational data from community medical practice. This includes evaluation of screening and diagnostic test performance, methods for comparative effectiveness studies, and health services research. Dr. Hubbard's methodological work emphasizes the development of statistical tools for valid inference from complex electronic medical record data, which she has applied to studies in cancer screening, aging and dementia, pharmacoepidemiology, women’s health, and behavioral health.
Research topics
- Medicine
- Internal medicine
- Political Science
- Cardiology
- Medical emergency
- Surgery
- Oncology
- Emergency medicine
Selected publications
Racial and Ethnic Differences in Prevalence and Incidence of Diabetic Retinal Disease
Retina · 2026-03-31
article
PURPOSE: To determine prevalence and incidence trends of diabetic retinal disease (DRD) and its vision-threatening forms over the last 20 years among patients with diabetes mellitus (DM) across US racial and ethnic groups. METHODS: A retrospective cohort study of members of commercial and Medicare Advantage health plans between 2000 and 2022 was conducted, with cohorts of White (W), Black/African American (B/AA), Hispanic (H), and Asian (A) DM patients identified using ICD-9/10 codes. Outcomes included annual prevalence and incidence of DRD, diabetic macular edema (DME), and proliferative diabetic retinopathy (PDR). Multivariable logistic and Poisson regression models analyzed trends in prevalence odds ratios and incidence rate ratios, respectively. RESULTS: B/AA patients had higher prevalence rates of DRD than White patients in every year analyzed (2021 B/AA: 23.1%; W: 19.0%; p<0.001). Both Hispanic (2001 H: 12.3%) and Asian (2001: 11.9%) patients initially had lower DRD prevalence than White patients (2001: 13.1%; p<0.001 for both); both are now higher, with Hispanic patients having the highest rates (2021 H: 26.0%; A: 21.2%; W: 19.0%; p<0.001). DME and PDR prevalence increased across all groups through 2015/2016, then decreased through 2021 (2021 DME: W: 4.5%, B/AA: 5.9%, H: 5.9%, A: 4.7%; 2021 PDR: W: 2.9%, B/AA: 4.3%, H: 5.0%, A: 2.9%). Since 2009, incidence rates for DRD, DME, and PDR in Hispanic and B/AA patients have been higher than for White patients (IRR=1.08-1.85; p<0.001 for all comparisons). Asian patients initially had higher DRD incidence rates than White patients, but that difference disappeared in 2021 before increasing again in 2022 (2022 IRR=1.07, 95% CI=1.01-1.14). CONCLUSION: Disparities in prevalence and incidence of DRD, DME, and PDR persist for B/AA patients and have worsened for Hispanic patients.
Performance of Statistical and Machine Learning Risk Prediction Models for Advanced Breast Cancers
Cancer Epidemiology Biomarkers & Prevention · 2026-05-14
article
BACKGROUND: Machine learning enables complex risk prediction models, but comparative performance with statistical approaches remains context-dependent. We compared statistical and machine learning models for predicting advanced breast cancer risk. METHODS: Using data from 968,178 women (40-74 years) undergoing 2,796,459 annual or 812,126 biennial screening mammograms (2005-2019) in the Breast Cancer Surveillance Consortium, we cross-validated models predicting advanced breast cancer within 12 months (annual) or 24 months (biennial) following screening. Models included conventional logistic regression, regularized regressions (LASSO, Elastic net), and machine learning methods (random forests, gradient boosting), considering a modest number of clinical and demographic predictors. Performance was assessed using calibration and area under the receiver operating characteristic curve (AUC). RESULTS: Discrimination was similar across models (AUC 0.677-0.690). Calibration differences were more pronounced. Regularized regressions achieved the most favorable calibration overall and across racial and ethnic groups, with AUC 0.689 (95% CI = 0.676-0.701). Gradient boosting showed comparable AUC but suboptimal calibration (calibration slope 1.12; 95% CI = 1.04-1.20). Conventional logistic regression had slightly lower AUC (0.683; 95% CI = 0.671-0.696) and calibration slope of 0.90 (95% CI = 0.83-0.96). Regression-based approaches were generally well calibrated across racial and ethnic groups (E/O ratio 0.96-1.03; calibration intercept -0.03 to 0.04), with some subgroup deviations in calibration slopes (<1). CONCLUSIONS: For predicting advanced breast cancers, regularized regression demonstrated similar discrimination and generally more favorable calibration than other approaches. IMPACT: In settings with rare outcomes and low dimensional features, regularized regression may offer a practical balance between performance and interpretability.
ICPSR Data Holdings · 2026-03-23
dataset · Open access · 1st author · Corresponding
Researchers can use data from electronic health records, or EHRs, in studies that compare two or more treatments. In these studies, researchers need to identify all patients with the same phenotype. Phenotypes are a person's known traits, like height and weight, or known health problems, like diabetes. However, in EHR data, some data on patient traits or health problems may be missing for some patients. Missing data in EHRs make it hard to correctly identify all patients with the same phenotype. It's even harder when data are missing due to a patient's health status. For example, patients with uncontrolled diabetes may need more lab tests than patients with controlled diabetes. As a result, researchers who are looking at lab tests may not identify patients with controlled diabetes as having diabetes. In this project, the research team developed and tested a new statistical method that accounts for missing EHR data to estimate patient phenotypes. To access the methods and software, please visit the bias_correction GitHub repository.
Epidemiology · 2026-03-18
article
Electronic health record (EHR) systems capture patient information inconsistently, with patients generally contributing more data when they are sick than healthy. This creates "informed presence," systematic differences between captured and non-captured data, potentially biasing association estimates. There is growing interest in methods that account for informed presence, but practical approaches for conceptualizing, identifying, and addressing this bias in applied EHR research have received limited attention. Focusing on longitudinal settings, we present a conceptual framework for informed presence bias, which arises when data capture depends on exposure and outcome and thus the visit process acts as a collider. We then illustrate methods that aim to reduce bias by reweighting or resampling observed data to approximate conditional independence between the visit process and the outcome. We illustrate these methods using longitudinal EHR data from pediatric solid organ transplant recipients (N=271) to examine the association between steroids and cytomegalovirus viremia, where the frequency of cytomegalovirus testing varies across patients and over time. Incidence rate ratios decreased from 1.83 (95% CI 1.02, 3.28) in a naïve analysis to 1.37 (0.73, 2.57) when accounting for informed presence using inverse intensity weighting. Incidence rate ratio estimates from bootstrapped inverse intensity weighting were 1.37 (0.71, 2.27) and 1.40 (0.73, 2.68) from multiple outputation. These results show the anticipated attenuation of effect estimates after accounting for informed presence bias. When analyzing irregularly measured EHR data, we recommend (1) identifying the expected observation process using conceptual diagrams, (2) assessing dependence in the observation process, and (3) accounting for outcome dependence in statistical analysis.
Missingness in Eligibility Criteria for Target Trial Emulation in EHR With Survival Outcomes
Statistics in Medicine · 2026-04-01
article · Open access · Senior author · Corresponding
In certain settings, when conducting a randomized trial would be infeasible, electronic health records (EHR) can be used to emulate a target trial and estimate causal effects of an intervention. This process involves specifying the elements of a hypothetical trial protocol and applying these to the design of an observational study conducted with EHR data (or other observational data source). One element of target trial specification includes defining eligibility criteria. However, defining the eligible population with EHR can be complicated by missingness in eligibility-defining variables. Multiple imputation (MI) is one common approach to missingness in EHR data, but it is unclear whether imputation of eligibility criteria should occur before or after excluding ineligible individuals. Motivated by a target trial emulation of two treatments for advanced breast cancer, we explore this question when estimating the average causal effect under a target trial framework with survival outcomes. We illustrate how alternative MI strategies perform using simulated data and in a real-world analysis of oncology EHR data. We found that in most settings with high proportions of missingness in eligibility-defining variables, imputing missing data using a flexible imputation model, such as a random forest, prior to excluding ineligible individuals resulted in lower bias than complete case analysis or imputation after excluding ineligible individuals. Choices about how to handle practical challenges such as this in the application of target trial emulation to messy, real-world data sources can have substantial effects on causal parameter estimation and should be carefully considered to ensure that the results of observational studies are as rigorous as possible.
An E-value-Informed Sensitivity Analysis Framework for Hybrid Controlled Trials
medRxiv · 2026-03-06
article · Open access · Senior author · Corresponding
Hybrid controlled trials (HCTs) incorporate real-world data into randomized controlled trials (RCTs) by augmenting the internal control arm with patients receiving the same treatment in routine care. Beyond increasing power, HCTs may improve recruitment by supporting unequal randomization ratios that increase patient access to experimental treatments. However, HCT validity is threatened by bias from unmeasured confounding due to lack of randomization of external controls, leading to outcome non-exchangeability between internal and external control patients. To address this challenge, we developed a sensitivity analysis framework to assess the robustness of HCT results to potential unmeasured confounding. We propose a tipping point analysis that adapts the E-value framework to the HCT setting where trial participation rather than treatment assignment is subject to confounding. To aid interpretation, we also introduce a data-driven benchmark representing the strength of unmeasured confounding reflected by the observed outcome non-exchangeability. We then propose an operational decision rule and evaluate its performance through simulation studies. Finally, we illustrate the approach using an asthma trial augmented by data from electronic health records. Simulation results demonstrate that our decision rule safeguards against Type I error inflation while preserving the power gains achieved by incorporating external data. In settings where moderate unmeasured confounding led to poorer outcomes for external controls, Type I error was controlled near the nominal 5% level, and power increased by 10-20% compared with analyses using RCT data alone. Our approach provides a practical, interpretable method to assess HCT robustness, supporting rigorous inference when integrating external real-world data.
Surgery · 2026-04-04
article
JNCI Cancer Spectrum · 2026-04-21
article · Open access
PURPOSE: We aimed to determine temporal trends and racial disparities in utilization and time to treatment initiation (TTI) of CDK4/6 inhibitors (CDK4/6i) and pertuzumab for first-line metastatic breast cancer (MBC). DESIGN: We extracted data from a nationwide electronic health record-derived deidentified database. Female patients ≥18 years old with ER+/HER2- or HER2+ MBC eligible for CDK4/6i (3/2015-10/2021) or pertuzumab (07/2012-09/2021) were included. Our outcomes were adjusted temporal trends in the proportion of patients receiving respective therapies using logistic regression with natural cubic splines for time trends and tested for changes in utilization over time within and between racial groups (non-Hispanic White (NHW) or non-Hispanic Black (NHB)). Similar models using linear regression estimated mean TTI. RESULTS: 5173 (NHW = 4478; NHB = 695) ER+/HER2- and 2321 (NHW = 1915; NHB = 406) HER2+ MBC patients were included. There were significant differences in the proportion initiating CDK4/6i over time within racial groups (NHW, 23.5% (95% CI: 20.1%-27.3%) in 2015 to 53.8% (95% CI: 48.6%-59.0%) in 2021; NHB, 20.6% (95% CI: 11.9%-33.0%) in 2015 to 73.6% (95% CI: 61.7%-83.0%) in 2021) and between groups (p = 0.009). There was a significant increase in utilization of pertuzumab within both racial groups over time (p < 0.001), but no significant difference between groups (p = 0.45). TTI decreased over time with no significant differences in TTI trends between the two groups. CONCLUSIONS: Utilization of targeted therapies increased over time; however, NHB patients were less likely to receive CDK4/6i compared to NHW patients. Approximately half of eligible patients did not receive pertuzumab. Further research is needed to understand mediators and design interventions to address underutilization of these therapies and those contributing to racial disparities in CDK4/6i utilization.
Repository@Nottingham (University of Nottingham) · 2026-02-23
article · Open access
Recent grants
NIH · $158k · 2013
NIH · $160k · 2016
Improving confounder control in EHR-based studies of cancer epidemiology
NIH · $474k · 2019–2021
Frequent coauthors
- 154 shared
Karla Kerlikowske
San Francisco VA Health Care System
- 138 shared
Diana L. Miglioretti
- 126 shared
Louise M. Henderson
- 125 shared
Diana S. M. Buist
Menlo School
- 122 shared
Tracy Onega
- 94 shared
Weiwei Zhu
Second Hospital of Anhui Medical University
- 86 shared
Janie M. Lee
Fred Hutch Cancer Center
- 85 shared
Laura Ichikawa
Kaiser Permanente Washington Health Research Institute
Education
- 2007
PhD, Biostatistics
University of Washington
- 2002
MSc, Statistics
University of Oxford
- 2001
MSc, Epidemiology
University of Edinburgh
- 1999
BS, Biology
University of Pittsburgh