Jiawen Hou
Math Fellow · Verified · University of Minnesota · Mathematics
Active 2010–2026
Research topics
- Medicine
- Computer science
- Chemistry
- Internal medicine
- Materials science
Selected publications
ArXiv.org · 2026-04-10
article · Open access
We present a theoretical model for the power spectrum and bispectrum of galaxy clustering that exploits the complementarity between small-scale power spectrum information and large-scale bispectrum measurements. We extend the FOLPS code by combining its one-loop EFT galaxy power spectrum with a tree-level galaxy bispectrum projected onto the tripolar spherical harmonics (Sugiyama) basis. To access additional small-scale information, we also consider a line-of-sight damping factor in both statistics, mirroring approaches commonly used in studies of redshift-space distortions. We test the model using DESI DR2 galaxy mocks. Even without damping, the joint analysis of the EFT power spectrum and bispectrum significantly improves constraints and reduces parameter degeneracies relative to power spectrum analyses alone. For LRG-like samples, including the damping further extends the range beyond $k\sim 0.3 \,h \text{Mpc}^{-1}$ in the power spectrum and $k \sim 0.24 \,h \text{Mpc}^{-1}$ in the bispectrum without introducing statistically significant parameter biases. This leads to up to $\sim 30\%$ tighter constraints on $A_s$ and $\omega_{cdm}$. For low signal-to-noise tracers such as QSOs, however, the damping parameters are weakly constrained and can absorb noise fluctuations, leading to shifts in inferred parameters. Similar limitations may arise in models where cosmological information is encoded in power-spectrum shape features degenerate with the damping, such as scenarios with massive neutrinos. In contrast, for $w_0w_a$CDM we obtain $15\%$ and $21\%$ tighter constraints on $w_0$ and $w_a$, respectively, yielding a deviation from constant dark energy at slightly more than the $1\sigma$ level using full-shape information alone. The code is publicly available at https://github.com/cosmodesi/FolpsD
Estimate renal cell carcinoma recurrence rates using electronic health records
ESMO Real World Data and Digital Oncology · 2026-04-14
article · Open access · 1st author · Corresponding
Background: Lack of readily available recurrence data has limited the use of electronic health records (EHR) for risk assessment of cancer recurrence and optimal patient management. This study aims to derive high-quality EHR recurrence data and estimate recurrence rates in the overall population and specific subgroups. Materials and methods: Using EHR data between 1 January 2000 and 1 September 2022, we developed a computational tool for automatically annotating the renal cell carcinoma (RCC) recurrence outcome and a natural language processing (NLP) tool for extracting key RCC characteristics. Using data constructed from stage I-III RCC patients who underwent nephrectomy at Mass General Brigham (2000-2022), we analyzed recurrence rates by TNM (tumor-node-metastasis) stage, grade, and histological subtype. Analyses were conducted from 1 September 2022 to 16 August 2024. Results: A total of 5603 patients whose EHR met the eligibility criteria were included in the study [3590 (64%) men, 2013 (36%) women; median age at baseline 62 years (range 36-87 years); 4225 (75%) non-Hispanic white, 1378 (25%) other race-ethnicity]. Tumor stage was as follows: 3324 (59%) stage I, 778 (14%) stage II, 128 (2%) stage III, and 1373 (25%) missing stage information. Among patients with TNM stage T1-3 N0M0 clear-cell RCC of any grade, EHR-derived recurrences were indicative of true recurrence, with an area under the receiver operating characteristic curve (AUC) of 0.914 for 5-year recurrence status cross-validated against expert-annotated gold standard recurrence times. The estimated overall 5-year recurrence rate was 11.1%. We observed a substantially higher recurrence risk for the T3 group (48.8%) versus T1 (2.8%) or T2 (14.2%), and for the G4 group (45.3%) versus G1 (3.7%), G2 (6.8%), or G3 (18.9%).
Conclusions: Our computational approach demonstrates that high-quality recurrence data can be reliably extracted from EHR systems, providing a scalable solution for real-world RCC risk determination. These tools enable health care systems to better identify high-risk patients and potentially guide personalized follow-up strategies and adjuvant treatment options.
Arthritis & Rheumatology · 2026-04-13
article · Open access
OBJECTIVE: Disease activity plays a central role in rheumatoid arthritis (RA) clinical studies. The inconsistent availability of data on disease activity in real-world electronic health record (EHR) data has limited the ability to generate real-world evidence (RWE). This study aimed to develop and validate scalable machine learning (ML) models to infer RA disease activity from EHR data. METHODS: We used EHR data from Mass General Brigham (MGB) and the Department of Veterans Affairs (VA) linked with RA registries that prospectively collected the Disease Activity Score with 28-joint counts (DAS28). Features for the algorithm were extracted from the EHRs, including structured data (eg, diagnosis codes) and narrative data using natural language processing (NLP). ML models were trained on the registry-collected DAS28. The performance of models trained within the same institution and across institutions was evaluated. To assess face validity, we estimated the association between inferred disease activity and major adverse cardiovascular events (MACEs) with stratified Cox models. RESULTS: We studied 1,105 MGB and 2,631 VA patients with RA. Models with structured data achieved an area under the receiver operating characteristic curve (AUC) of 0.68 to 0.70; models incorporating structured and NLP data achieved higher performance (MGB, AUC = 0.843; VA, AUC = 0.833). Cross-institution validation demonstrated limited transportability of algorithms across sites (MGB→VA, AUC = 0.679; VA→MGB, AUC = 0.718). Within the same institution, inferred disease activity was significantly associated with increased risk for incident MACEs (MGB, hazard ratio [HR] = 1.12; VA, HR = 1.14). CONCLUSION: RA disease activity can be inferred at scale from within-institution EHR data, though cross-institution performance is limited. The inferred disease activity replicated known associations with MACEs, and the results support its use in future studies to generate RWE.
JMIR Diabetes · 2026-04-15
article · Open access · Senior author
Background: Patients with type 2 diabetes mellitus (T2D) have a higher risk of cardiovascular disease, including heart failure (HF), leading to health care burden including hospitalization and mortality. Among multiple T2D therapies, there are inadequate head-to-head comparisons of their effects on HF in the real-world patient population. Objective: This study aims to compare the time-to-HF among patients treated with different T2D drugs following metformin. Methods: We conducted a retrospective data analysis on electronic health records of 5000 patients with T2D. The inclusion criteria were previous treatment with metformin and initiation of glucagon-like peptide-1 receptor agonists (GLP1 RAs), dipeptidyl peptidase-4 inhibitors (DPP4i), sulfonylureas, or insulin. We grouped patients by the mechanism of their subsequent therapies and focused on 2 pairs of comparisons classified by insulin resistance: sulfonylureas versus insulin (increased resistance) and GLP1 RA versus DPP4i (decreased resistance). The outcomes were 5-year HF status and the HF-free survival time, which was verified manually by examining clinical notes. We applied doubly robust causal estimation and accounted for confounding by adjusting for coded and natural language processing electronic health record features identified through medical knowledge networks. Results: The study included 939 patients, of whom 204 (21.7%) received insulin, 482 (51.3%) received sulfonylureas, 90 (9.6%) received GLP1 RA, and 163 (17.4%) received DPP4i. Patients in the sulfonylureas group had a significantly higher 5-year HF-free survival compared to the insulin group (survival ratio of insulin/sulfonylureas 0.902, 95% CI 0.840-0.976; P=.01). There was no significant difference between the DPP4i versus GLP1 RA group in 5-year HF-free survival (survival ratio of GLP1 RA/DPP4i was 0.953, 95% CI 0.849-1.067; P=.40).
For the occurrence of an HF-related hospitalization within 5 years, there were no significant differences between the sulfonylureas and insulin groups (risk difference 0.057, 95% CI -0.011 to 0.132; P=.11), or between the GLP1 RA and DPP4i groups (risk difference 0.010, 95% CI -0.096 to 0.129). Conclusions: We evaluated real-world evidence on 2 head-to-head comparisons of second-line T2D therapies on 5-year HF outcomes. Patients on sulfonylureas were associated with lower 5-year HF risks than those treated with insulin when measured by risk ratio, but no significant difference was detected when measured by the risk difference. Limitations of this study included potentially inadequate adjustment of confounding in the observational study and a limited sample size with validated HF outcomes.
62.6 GHz ScAlN solidly mounted acoustic resonators
Applied Physics Letters · 2026-01-26
article · Open access
We demonstrate a record-high 62.6 GHz solidly mounted acoustic resonator (SMR) incorporating a 67.6 nm scandium aluminum nitride (Sc0.3Al0.7N) piezoelectric layer on a 40 nm buried platinum (Pt) bottom electrode, positioned above an acoustic Bragg reflector composed of alternating SiO2 (28.2 nm) and Ta2O5 (24.3 nm) layers in 8.5 pairs. The Bragg reflector and piezoelectric stack above are designed to confine a third-order thickness-extensional bulk acoustic wave mode, while efficiently transducing with thickness-field excitation. The fabricated SMR exhibits an extracted piezoelectric coupling coefficient ($k^2$) of 0.8% and a maximum Bode quality factor (Q) of 51 at 63 GHz, representing the highest operating frequency reported for an SMR to date. These results establish a pathway toward mmWave SMR devices for filters and resonators in next-generation RF front ends.
Applied Economics · 2025-05-21 · 2 citations
article · 1st author
Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects
Journal of the American Statistical Association · 2025-01-21 · 13 citations
article · Open access
Federated learning of causal estimands may greatly improve estimation efficiency by leveraging data from multiple study sites, but robustness to heterogeneity and model misspecifications is vital for ensuring validity. We develop a Federated Adaptive Causal Estimation (FACE) framework to incorporate heterogeneous data from multiple sites to provide treatment effect estimation and inference for a flexibly specified target population of interest. FACE accounts for site-level heterogeneity in the distribution of covariates through density ratio weighting. To safely incorporate source sites and avoid negative transfer, we introduce an adaptive weighting procedure via a penalized regression, which achieves both consistency and optimal efficiency. Our strategy is communication-efficient and privacy-preserving, allowing participating sites to share summary statistics only once with other sites. We conduct both theoretical and numerical evaluations of FACE and apply it to conduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273 (Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic health records from five VA regional sites. We show that compared to traditional methods, FACE meaningfully increases the precision of treatment effect estimates, with reductions in standard errors ranging from 26% to 67%.
2025-01-28
preprint
Managing chronic diseases requires ongoing monitoring of disease activity and therapeutic responses to optimize treatment plans. With the growing availability of disease-modifying therapies, it is crucial to investigate comparative effectiveness and long-term outcomes beyond those available from randomized clinical trials. We introduce a comprehensive pipeline for generating reproducible and generalizable real-world evidence on disease outcomes by leveraging electronic health record data. The pipeline first generates scalable disease outcomes by linking electronic health record data with registry data containing a small sample of labeled outcomes. It then applies causal analysis using these scalable outcomes to evaluate therapies for chronic diseases. The implementation of the pipeline is illustrated in a case study based on multiple sclerosis. Our approach addresses challenges in real-world evidence generation for disease activity of chronic conditions, specifically the lack of direct observations on key outcomes and biases arising from imperfect or incomplete data. We present advanced machine learning techniques such as semisupervised and ensemble methods to impute missing outcome data, further incorporating steps for calibrated causal analyses and bias correction.
medRxiv · 2025-11-17
preprint · Open access
Objective: Disease activity plays a central role in rheumatoid arthritis (RA) clinical studies. However, RA disease activity is inconsistently recorded in real-world electronic health record (EHR) data, limiting the generation of real-world evidence (RWE). This study aimed to develop and validate scalable machine learning (ML) models to infer RA disease activity from EHR data. Methods: We used EHR data from Mass General Brigham (MGB) and the Department of Veterans Affairs (VA); both have RA registries with prospectively collected Disease Activity Score 28 (DAS28) measurements. Features for the algorithm were extracted from the EHR, including structured data (e.g., ICD codes) and narrative data using natural language processing (NLP). Machine learning models were trained on the registry-collected DAS28. We tested the performance of models trained within the same institution and their transportability across systems. To assess face validity, the association between inferred disease activity and major adverse cardiovascular events (MACE) was tested with stratified Cox models. Results: We studied 1105 MGB and 2631 VA RA patients. Models with structured data achieved an AUC of 0.68-0.70; models incorporating structured and NLP data achieved higher performance (AUC=0.843, MGB; 0.833, VA). Cross-site validation demonstrated reduced transportability (AUC=0.679, MGB→VA; 0.718, VA→MGB), due to differences in the important features. Within the same institution, inferred disease activity was significantly associated with increased risk for incident MACE (MGB: HR=1.12; VA: HR=1.14). Conclusion: RA disease activity can be inferred at scale from within-institution EHR data, though cross-institution performance is limited. The inferred disease activity replicated the association between RA and MACE, which supports its use in future studies to generate RWE.
medRxiv · 2025-12-02
preprint · Open access
BACKGROUND: Ocrelizumab and natalizumab are commonly prescribed high-effectiveness disease-modifying therapies (DMTs) for relapsing-remitting multiple sclerosis (RRMS). However, no randomized clinical trial and few real-world studies have directly compared their effectiveness in reducing disability progression. Subtype classification and disability status are critical for multiple sclerosis (MS) research, but these data are often missing in electronic health records (EHRs), limiting robust real-world evidence generation. OBJECTIVE: To compare the effectiveness of ocrelizumab and natalizumab in two-year rater-assessed disability progression among RRMS patients using longitudinal registry-linked EHR data. DESIGN: Retrospective cohort study. SETTING: A large healthcare system that includes both academic and community practices. PARTICIPANTS: Patients diagnosed with MS who initiated ocrelizumab or natalizumab between 2012 and 2020, with at least 6 months of EHR data before treatment initiation and no prior exposure to other high-effectiveness DMTs. EXPOSURES: Treatment with ocrelizumab vs natalizumab. MEASUREMENTS: We developed an ensemble machine learning model to impute RRMS subtype and disability outcomes using structured and narrative EHR data. The primary outcome was moderate/severe rater-assessed disability at 2 years (observed or imputed Expanded Disability Status Scale [EDSS] ≥ 4) after treatment initiation. We estimated the average treatment effects using a semi-supervised doubly robust approach with comprehensive confounder adjustment and calibration to mitigate imputation bias. Covariates included standard demographic and clinical features such as baseline disability as well as knowledge graph-selected features. Sensitivity analyses used observed EDSS scores in registry-derived RRMS patients. Exploratory analyses included rituximab, another B-cell-depleting therapy, with adjustments for differences in patient profiles.
RESULTS: Among RRMS patients, those treated with ocrelizumab (n=543) had a significantly lower two-year risk of moderate/severe disability compared with those treated with natalizumab (n=205) based on imputed outcomes (risk difference, -5.87%; 95% CI: -11.28% to -0.46%; p=0.033) after confounder adjustment. Sensitivity analyses yielded consistent findings using imputed or observed EDSS outcomes in registry-derived RRMS patients. CONCLUSION AND RELEVANCE: In this real-world comparative effectiveness study using a novel semi-supervised doubly-robust framework, ocrelizumab was associated with a lower risk of disability progression than natalizumab among RRMS patients. This approach provides a roadmap for generating robust large-scale real-world evidence in settings of missing key inclusion features and outcomes.
Frequent coauthors
- Charles P. Lin · Center for Systems Biology · 34 shared
- Eric O. Potma · University of California, Irvine · 32 shared
- Bruce J. Tromberg · 28 shared
- Tianxi Cai · Harvard University · 19 shared
- Jayaraj Rajagopal · Bharathidasan University · 15 shared
- Giuseppe Intini · McGowan Institute for Regenerative Medicine · 15 shared
- Mihaela Balu · University of California, Irvine · 13 shared
- Tianrun Cai · Brigham and Women's Hospital · 13 shared
Education
- 2019 · Doctor of Philosophy in Mathematics with specialization in Statistics, Mathematics · University of California San Diego
- 2013 · Master of Science in Statistics, Statistics · University of Illinois at Urbana-Champaign
- 2011 · Bachelor of Mathematics, Mathematics · Fudan University