
Raymond Carroll
Distinguished Professor · Texas A&M University · Statistics
Active 1972–2026
About
Dr. Raymond J. Carroll is a Distinguished Professor at Texas A&M University, where he holds positions in the Departments of Statistics, Nutrition, and Toxicology. He is also the Director of the Bioinformatics Training Program and of the Texas A&M Institute for Applied Mathematics and Computational Science. His research applies statistical methods to biological and health-related data, with a focus on bioinformatics, nutrition, and toxicology.
Research topics
- Data Mining
- Computer Science
- Biology
- Mathematics
- Statistics
- Medicine
- Environmental health
- Data science
- Gerontology
- Virology
- Demography
- Engineering
- Immunology
- Internal medicine
- Econometrics
- Genetics
Selected publications
medRxiv · 2026-04-20
Article · Open access
Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.
The association between virus-induced spinal cord pathology and the genetic background of the host
Journal of Neuropathology & Experimental Neurology · 2025-10-17
Article · Open access
Theiler's murine encephalomyelitis virus (TMEV) infection in mice has been used to study diverse neurological diseases, including multiple sclerosis and epilepsy. In this investigation, 5 strains of collaborative cross (CC) mice were infected with TMEV and examined clinically and histologically at days 4, 14, and 90 post-infection (dpi). All CC strains tested exhibited lumbar spinal cord and/or ventral peripheral nerve lesions by 14 dpi; CC027, CC023, and CC078 strains exhibited lesions at 4 dpi. At 90 dpi, lesions were remnants of the inflammatory responses associated with earlier infection; there was skeletal muscle atrophy in the CC023 strain. Increased microglial/macrophage reactivity was observed in all strains at 4 and 14 dpi, but not at 90 dpi. TMEV mRNA expression was greatest in the CC023 and CC078 strains at the acute timepoints; TMEV was completely cleared in all mice at 90 dpi. The neuropathological and clinical profiles in CC023 mice, mainly at 14 dpi, share some clinical and histologic features with those in amyotrophic lateral sclerosis patients. This work demonstrates how viral infection might interact with the genetic background of a susceptible individual to contribute to the onset, clinical presentation and persistence of lesions despite viral clearance.
Valid and efficient inference for nonparametric variable importance in two-phase studies
Biometrics · 2025-07-03 · 1 citation
Article
We consider a common nonparametric regression setting, where the data consist of a response variable $Y$, some easily obtainable covariates $\mathbf{X}$, and a set of costly covariates $\mathbf{Z}$. Before establishing predictive models for $Y$, a natural question arises: Is it worthwhile to include $\mathbf{Z}$ as predictors, given the additional cost of collecting data on $\mathbf{Z}$ for both training the models and predicting $Y$ for future individuals? Therefore, we aim to conduct preliminary investigations to infer importance of $\mathbf{Z}$ in predicting $Y$ in the presence of $\mathbf{X}$. To achieve this goal, we propose a nonparametric variable importance measure for $\mathbf{Z}$. It is defined as a parameter that aggregates maximum potential contributions of $\mathbf{Z}$ in single or multiple predictive models, with contributions quantified by general loss functions. Considering two-phase data that provide a large number of observations for $(Y,\mathbf{X})$ with the expensive $\mathbf{Z}$ measured only in a small subsample, we develop a novel approach to infer the proposed importance measure, accommodating missingness of $\mathbf{Z}$ in the sample by substituting functions of $(Y,\mathbf{X})$ for each individual's contribution to the predictive loss of models involving $\mathbf{Z}$. Our approach attains unified and efficient inference regardless of whether $\mathbf{Z}$ makes zero or positive contribution to predicting $Y$, a desirable yet surprising property owing to data incompleteness. As intermediate steps of our theoretical development, we establish novel results in two relevant research areas, semi-supervised inference and two-phase nonparametric estimation. Numerical results from both simulated and real data demonstrate superior performance of our approach.
American Journal of Epidemiology · 2025-02-13
Article · Open access
The restricted mean survival time (RMST) analysis has been used extensively in clinical research involving time-to-event endpoints. The threshold time up to which the restricted mean survival is calculated has a critical impact on the analysis results. However, how to identify an optimal threshold time for treatment comparison, which corresponds to the greatest restricted mean lifetime difference between groups, remains unclear in practice, and no analytical method has been developed on this topic. We present a novel method for determining the threshold time in the RMST analysis to compare two groups. Simulation studies indicate that this method leads to high statistical power and controlled type I error rate compared with existing methods. The proposed method is illustrated in two applications: (1) a clinical oncology study comparing treatments for non-small-cell lung cancer given a programmed death-ligand 1 biomarker measurement, and (2) a gerontology study of instrumental activities for care recipients with dementia.
American Journal of Clinical Nutrition · 2025-04-03 · 4 citations
Article · Open access
BACKGROUND: Valid dietary supplement (DS) assessment methods are critical for nutrition research and monitoring as DS contributes substantially toward micronutrient exposures for millions of Americans. Little is known about how DS assessment tools vary in estimating the prevalence of use and micronutrient amounts from DS. OBJECTIVES: We compared repeat collections over a year of 2 commonly used DS assessment methods: the diet history questionnaire-II (DHQII) and the automated self-administered 24-h dietary recall (ASA24), within the longitudinal Interactive Diet and Activity Tracking in American Association of Retired Persons (IDATA) study. METHODS: DS information was collected among IDATA participants (n = 795; 50–74 y) who completed 2–6 ASA24s and a second DHQII. Agreement [Kappa (κ)] at the individual level and group-level prevalence of DS use (McNemar's test) overall and by product type were compared among all participants. Mean calcium and vitamin D intakes, by source, and nutrient amounts per consumption day (i.e., dosages) from DS were compared between the DHQII and ASA24 among DS users. Calcium and vitamin D were chosen as priority nutrients, as they reflect vitamins and minerals and are ubiquitous in DS. RESULTS: Prevalence of DS use varied by product type [13 of 28 comparisons differed in prevalence (McNemar's test); Kappa agreement range: κ = -0.03 to 0.73]. Mean consumption day amounts of vitamin D (but not calcium) were remarkably different as assessed by the DHQII and ASA24 (mean ± standard error): vitamin D ranged from 24 ± 2.7 to 45 ± 9.5 μg/d on the ASA24 and from 12 ± 0.3 to 14 ± 0.3 μg/d on the DHQII (P < 0.0001). CONCLUSIONS: Within IDATA, the comparability of ASA24 and DHQII in estimating the prevalence of use of and nutrient intakes from DS fluctuates by nutrient and product type. DS approaches beyond a questionnaire may be warranted for estimating absolute nutrient amounts, and the choice of the DS assessment method depends on the nutrient/dietary component of interest.
International Journal of Behavioral Nutrition and Physical Activity · 2025-05-26 · 3 citations
Article · Open access
Background: Physical activity reduces morbidity and mortality risk in cancer survivors, but a meaningful proportion of this vulnerable population are physically inactive. Targeted interventions can help cancer survivors adopt a more active lifestyle, but the efficacy of these interventions must be rigorously evaluated in randomized controlled intervention trials. A major barrier to such trials involves the difficulty in obtaining unbiased estimates of physical activity in free-living conditions. Methods: We conducted a randomized controlled trial of a 3-month intervention designed to increase physical activity vs. usual care in breast cancer survivors (n = 316). The primary outcome was change in physical activity as estimated by hip-worn accelerometer (MTI/Actigraph, models GT1M and GT3X). The trial included a sub-study (n = 106) wherein unbiased measures of total energy expenditure (doubly labeled water), and resting energy expenditure (indirect calorimetry) were collected. A linear mixed measurement error model characterized the structure of measurement error in accelerometry-estimated physical activity energy expenditure (PAEE), and corrected for bias in the estimated intervention effect due to measurement error. Results: Bias in the accelerometer estimates was related to true PAEE (p < 0.001) and baseline body mass index (p < 0.001) but was not related to age (p = 0.13). After correcting for measurement error, the estimated intervention effect at 3 months (change from baseline in PAEE in the intervention arm minus change in the control arm) was 77 kcal/day (95% confidence interval (CI) = 31–125), compared to 48 kcal/day (95% CI = 22–75) when measurement error was ignored. These results indicate a 20% (21%) increase in PAEE kcal × d⁻¹ (kcal × kg⁻¹ × d⁻¹) at month 3 relative to baseline for the corrected model vs. 14% (15%) for the uncorrected model. There was no evidence that measurement error in accelerometry-estimated PAEE was differential (differed by treatment arm) in the trial (p = 0.86). Conclusions: Measurement error in accelerometer-estimated PAEE can attenuate the effect size related to intervention effects in randomized controlled trials of physical activity interventions. Sub-studies that collect unbiased measures of PAEE can be used to correct for this shortcoming. Trial registration: ClinicalTrials.gov; NCT00929617; registered 06/26/2009; https://clinicaltrials.gov/study/NCT00929617
American Journal of Epidemiology · 2024-05-25 · 4 citations
Article · Open access
Polygenic risk scores (PRSs) are rapidly emerging as a way to measure disease risk by aggregating multiple genetic variants. Understanding the interplay of the PRS with environmental factors is critical for interpreting and applying PRSs in a wide variety of settings. We develop an efficient method for simultaneously modeling gene-environment correlations and interactions using the PRS in case-control studies. We use a logistic-normal regression modeling framework to specify the disease risk and PRS distribution in the underlying population and propose joint inference across the 2 models using the retrospective likelihood of the case-control data. Extensive simulation studies demonstrate the flexibility of the method in trading off bias and efficiency for the estimation of various model parameters compared with standard logistic regression or a case-only analysis for gene-environment interactions, or a control-only analysis for gene-environment correlations. Finally, using simulated case-control data sets within the UK Biobank study, we demonstrate the ability of our method to recover results from the full prospective cohort for the detection of an interaction between long-term oral contraceptive use and the PRS on the risk of breast cancer. This method is computationally efficient and implemented in a user-friendly R package.
American Journal of Epidemiology · 2024-04-06 · 6 citations
Article · Open access
The objective of this study was to examine the impact of methodological changes to the 2018 World Cancer Research Fund/American Institute for Cancer Research (WCRF/AICR) Score on associations with risk for all-cause mortality, cancer mortality, and cancer risk jointly among older adults in the National Institutes of Health (NIH)-AARP Diet and Health Study. Weights were incorporated for each score component; a continuous point scale was developed in place of the score's fully discrete cut points; and cut-point values were changed for physical activity and red meat based on evidence-based recommendations. Exploratory aims also examined the impact of separating components with more than one subcomponent and, using a penalized scoring approach, whether all components were necessary to retain within this population. Findings suggested that weighting the original 2018 WCRF/AICR Score improved its predictive performance in association with all-cause mortality and provided more precise estimates in relation to cancer risk and mortality outcomes. The importance of healthy weight, physical activity, and plant-based foods in relation to cancer and overall mortality risk was highlighted in this population of older adults. Further studies are needed to better understand the consistency and generalizability of these findings across other populations.
American Journal of Clinical Nutrition · 2024-02-23
Article · Open access
Penalized Regression with Multiple Loss Functions and Variable Selection by Voting
Statistica Sinica · 2024-01-11
Article · Open access · Senior author
We consider a sparse linear model with a fixed design matrix in a high dimensional scenario. We introduce a new variable selection procedure called "voting", which combines the results from multiple regression models with different penalized loss functions to select the relevant predictors. A predictor is included in the final model if it receives enough votes, i.e. is selected by most of the individual models. By employing multiple different loss functions our method takes various properties of the error distribution into account. This is in contrast to the standard penalized regression approach, which typically relies on just one criterion. When that single criterion is not met the standard approach is likely to fail, whereas our method is still able to identify the underlying sparse model.
Recent grants
NIH · $2.4M · 2005
NIH · $7.6M · 2016
NIH · $261k · 1992
Nutrition, Biostatistics & Bioinformatics Training Grant
NIH · $712k · 2016–2021
NIH · $2.6M · 2021
Frequent coauthors
- David Ruppert · Cornell University · 136 shared
- Yanyuan Ma · 91 shared
- Donald F. Steiner · University of Chicago · 81 shared
- Victor Kipnis · National Cancer Institute · 78 shared
- Annamaria Guolo · University of Padua · 65 shared
- James M. Robins · Harvard University · 65 shared
- Bressen Christian · Park University · 64 shared
- Ann Arbor · Klinikum Saarbrücken · 64 shared
Education
Ph.D., Statistics
Purdue University System