
Yangruibo (Robin) Ding
Assistant Professor (Starting July 1, 2026) · University of California, Los Angeles · Computer Science
Active 2002–2026
About
Yangruibo (Robin) Ding is an assistant professor at UCLA Samueli School of Engineering, starting July 1, 2026. His research focuses on developing large language models (LLMs) and agentic systems for software engineering. His recent work involves training LLMs with advanced symbolic reasoning capabilities, such as debugging, testing, program analysis, and verification, as well as building efficient, collaborative agentic systems for complex software development and maintenance tasks. Ding holds a PhD in Computer Science from Columbia University, earned in 2025, and has received notable awards including the IBM Ph.D. Fellowship Award (2022-2024), the ACM SIGSOFT Distinguished Paper Award (2023), and the IEEE TSE Best Paper Award Runner-up (2022).
Research topics
- Medicine
- Genetics
- Biology
- Evolutionary biology
- Computer Science
- Computational biology
- Emergency medicine
- Pathology
- Virology
- Internal medicine
- Mathematics
- Environmental health
- Data science
- Demography
- Statistics
- Econometrics
Selected publications
Infection and Drug Resistance · 2026-04-01
Article · Open access · Senior author
Background and Aims: To explore nutritional heterogeneity among patients with nontuberculous mycobacterial (NTM) pulmonary disease by identifying nutritional phenotypes using K-means clustering, and to compare nutritional and inflammatory markers, Runyon classification, and comorbidities across phenotypes. Methods: A retrospective analysis of 457 patients diagnosed with NTM pulmonary disease was conducted. Nine nutritional and inflammatory indicators were collected: body mass index (BMI), hemoglobin (HGB), lymphocyte count (LY), C-reactive protein (CRP), prealbumin (PAB), albumin (ALB), total protein (TP), triglycerides (TG), and total cholesterol (TC). The candidate clusters were initially evaluated using the elbow method and silhouette scores. In the absence of a distinct inflection point in the elbow method, the solution with the highest silhouette coefficient was regarded as the most compact candidate solution. An exploratory final clustering scheme was then determined by further incorporating bootstrap internal stability analysis and clinical interpretability. Subsequently, nutritional and inflammatory profiles, Runyon classification (Groups I and III), and comorbidity distributions were compared among phenotypes. Ordinal logistic regression analysis was employed to identify factors associated with nutritional phenotype grade. Results: Four nutritional phenotypes were identified: Healthy–Low Inflammation (24.5%), Hyperlipidemic–Well-nourished (18.8%), Lean–Moderate Inflammation (42.5%), and Severely Emaciated–High Inflammation (14.2%). Significant differences existed in nutritional and inflammatory markers (all P < 0.001). The Severely Emaciated–High Inflammation type exhibited the lowest BMI, HGB, ALB, and TP levels; highest CRP level; highest proportion of Runyon Group III [mainly represented by the Mycobacterium avium complex (MAC)] infections (67.7%, P = 0.011); and more severe comorbidities (malignancy, renal insufficiency; P < 0.05). The Healthy–Low Inflammation type displayed optimal nutritional profiles, highest proportion of Runyon Group I (predominated by Mycobacterium kansasii and Mycobacterium marinum) infections (44.6%, P = 0.003), and few severe comorbidities. Ordinal logistic regression analysis demonstrated that age ≥ 60 years, malignant tumor, respiratory diseases, and renal diseases were significantly associated with increased cumulative odds of progressing to a poorer nutritional phenotype grade (all P < 0.05). Conclusion: Nutritional status in patients with NTM pulmonary disease shows significant heterogeneity closely associated with inflammation, bacterial strain type, and comorbidities. The Severely Emaciated–High Inflammation type exhibited the most unfavorable nutritional-inflammatory profile, suggesting that early identification, nutritional assessment, and comprehensive intervention should be strengthened clinically for such patients. Keywords: nontuberculous mycobacterial pulmonary disease, nutritional phenotype, K-means clustering, Runyon classification, malnutrition, inflammatory response
medRxiv · 2026-01-05
Article · Open access
Most current GWAS-eQTL approaches prioritize genes whose mediating effects on complex traits act through cis-regulation, while trans-acting genes remain largely underexplored. Recent perturbational screening technology provides a novel approach to quantifying trans-effects between gene pairs, but its integration with GWAS data remains largely unexamined. We introduce Mr. PEG, a novel framework that integrates perturbational screens, eQTL, and GWAS summary data to identify mediating genes of complex traits. Integrating gene-to-gene effects estimated from perturbational screens and GWAS data across 40 complex traits, Mr. PEG identifies a total of 546 significant mediating genes. These genes are more constrained than background genes and enriched for Gene Ontology terms related to immune response and cellular signaling. Compared to genes identified by GWAS and Mendelian randomization-based approaches, Mr. PEG genes exhibit longer average lengths across enhancers and stronger co-expression. Mr. PEG effects learned from common non-coding variants are associated with rare coding burden effects, highlighting its ability to capture disease-relevant mechanisms missed by approaches focused only on cis-eQTLs. We also highlight a case in which Mr. PEG uniquely identifies PTGS2 as a mediating gene for gout, suggesting potential opportunities for drug repurposing. Our findings demonstrate the value of integrating trans-effects informed by experimental perturbation screens and population-scale GWAS and eQTL data to identify disease-relevant mediating genes beyond individual GWAS loci.
medRxiv · 2026-04-22
Article · Open access
Genome-wide association studies (GWAS) have advanced the understanding of germline susceptibility in common cancers, yet rare malignancies remain underexplored due to limited sample sizes. To address this gap, we conducted large-scale GWAS across 20 rare cancer types and meta-analyzed results from three cohorts: two clinically sequenced cancer center cohorts and an independent population biobank, comprising over 480,000 individuals. We identified nine novel genome-wide significant susceptibility loci with moderate to large effect sizes that replicated across cohorts in eight rare malignancies, including myelodysplastic syndromes (MDS), germ cell tumors, gastrointestinal stromal tumor (GIST), gastrointestinal neuroendocrine tumors, anal cancer (ANSC), non-melanoma skin cancer, mesothelioma, and hepatobiliary cancer. Among the strongest associations were loci in MDS near API5 (OR = 2.21, p = 1.06 × 10⁻⁸), in GIST near SLC6A18 and TERT (OR = 1.91, p = 8.20 × 10⁻⁵⁰), and in ANSC near HLA-DQA2 (OR = 1.58, p = 5.50 × 10⁻¹⁸). The GIST risk variant was enriched in tumors harboring somatic KIT mutations (OR = 2.21, p = 6.5 × 10⁻⁴) and was associated with worse survival among carriers with KIT-mutant tumors (hazard ratio = 4.06, p = 0.015), implicating germline–somatic interplay in tumor initiation and progression. The ANSC risk variant was associated with HPV infection (OR = 1.44, p = 3.19 × 10⁻⁵), supporting a host–viral interaction in HPV-driven tumorigenesis. The MDS risk variant at the API5 locus was associated with altered neutrophil counts, suggesting a role in hematopoietic dysregulation in disease pathogenesis. We further identified novel, independent associations with mesothelioma, GIST, and hepatobiliary cancer at the 5p15.33 locus encompassing TERT, consistent with pleiotropic genetic effects at a core telomere-maintenance gene. Collectively, these findings demonstrate that integrating clinically ascertained sequencing cohorts with population biobanks substantially enhances germline discovery in rare cancers, enabling identification of high-confidence susceptibility loci and facilitating downstream biological interpretation through linked somatic, viral, and clinical data. This framework provides a scalable approach for characterizing inherited susceptibility across diverse rare malignancies.
ViTSP: A Vision Language Models Guided Framework for Solving Large-Scale Traveling Salesman Problems
ArXiv.org · 2025-09-27
Preprint · Open access
Solving the Traveling Salesman Problem (TSP) is NP-hard yet fundamental for a wide range of real-world applications. Classical exact methods face challenges in scaling, and heuristic methods often require domain-specific parameter calibration. While learning-based approaches have shown promise, they suffer from poor generalization and limited scalability due to fixed training data. This work proposes ViTSP, a novel framework that leverages pre-trained vision language models (VLMs) to visually guide the solution process for large-scale TSPs. The VLMs function to identify promising small-scale subproblems from a visualized TSP instance, which are then efficiently optimized using an off-the-shelf solver to improve the global solution. ViTSP bypasses the dedicated model training at the user end while maintaining effectiveness across diverse instances. Experiments on real-world TSP instances ranging from 1k to 88k nodes demonstrate that ViTSP consistently achieves solutions with average optimality gaps of 0.24%, outperforming existing learning-based methods. Under the same runtime budget, it surpasses the best-performing heuristic solver, LKH-3, by reducing its gaps by 3.57% to 100%, particularly on very-large-scale instances with more than 10k nodes. Our framework offers a new perspective in hybridizing pre-trained generative models and operations research solvers in solving combinatorial optimization problems. The framework holds potential for integration into more complex real-world logistics systems. The code is available at https://github.itap.purdue.edu/uSMART/ViTSP_ICLR2026.
Association between polygenic risk and survival in breast cancer patients
BMC Cancer · 2025-08-28 · 1 citation
Article · Open access
Polygenic risk scores (PRS) estimate an individual’s germline genetic predisposition to a quantitative trait and/or risk of disease. Several PRS have been developed for cancer risk with the goal of improved risk screening. Here, we sought to establish whether PRS for cancer risk and other common traits may influence survival for patients with cancer. We conducted a PRS survival analysis using 23,770 cancer patients of European ancestry from the Dana-Farber Cancer Institute Profile cohort. We identified an association between PRS for breast cancer risk and longer patient survival (HR = 0.89 (95% CI: 0.84–0.95), p = 1.50 × 10⁻⁴, < 5% FDR), implying that individuals at high genetic risk had better outcomes. High PRS individuals were also significantly less likely to harbor somatic TP53 mutations, consistent with having less aggressive tumors. This association persisted when including tumor grade and became more protective when restricting to ER-negative tumors (HR = 0.78 (95% CI: 0.68–0.89), p = 1.69 × 10⁻⁴). Potential confounders such as hormone receptor status, age, grade, stage, and ER-targeted therapy did not fully explain this association, nor was there statistical evidence of index event bias at individual variants. We did not observe significant associations between cancer risk and survival for other cancers, suggesting that this mechanism may be largely unique to breast cancer. However, we did observe associations between shorter survival and type 2 diabetes, bipolar, and pancreatitis PRS (1% FDR). These findings suggest that higher germline risk may predispose individuals to less aggressive breast cancer tumors and provide novel insights into breast cancer development and prognosis.
Exploring depression treatment response by using polygenic risk scoring across diverse populations
The American Journal of Human Genetics · 2025-06-27 · 6 citations
Article · Open access
Association between plausible genetic factors and weight loss from GLP1-RA and bariatric surgery
Nature Medicine · 2025-04-18 · 31 citations
Article · Open access
Obesity is a major public health challenge. Glucagon-like peptide-1 receptor agonists (GLP1-RA) and bariatric surgery (BS) are effective weight loss interventions; however, the genetic factors influencing treatment response remain largely unexplored. Moreover, most previous studies have focused on race and ethnicity rather than genetic ancestry. Here we analyzed 10,960 individuals from 9 multiancestry biobank studies across 6 countries to assess the impact of known genetic factors on weight loss. Between 6 and 12 months, GLP1-RA users had an average weight change of −3.93% or −6.00%, depending on the outcome definition, with modest ancestry-based differences. BS patients experienced −21.17% weight change between 6 and 48 months. We found no significant associations between GLP1-RA-induced weight loss and polygenic scores for body mass index or type 2 diabetes, nor with missense variants in GLP1R. A higher body mass index polygenic score was modestly linked to lower weight loss after BS (+0.7% per s.d., P = 1.24 × 10⁻⁴), but the effect attenuated in sensitivity analyses. Our findings suggest known genetic factors have limited impact on GLP1-RA effectiveness with respect to weight change and confirm treatment efficacy across ancestry groups.
Introducing Multimodal Paradigm for Learning Sleep Staging PSG via General-Purpose Model
ArXiv.org · 2025-09-26
Preprint · Open access
Sleep staging is essential for diagnosing sleep disorders and assessing neurological health. Existing automatic methods typically extract features from complex polysomnography (PSG) signals and train domain-specific models, which often lack intuitiveness and require large, specialized datasets. To overcome these limitations, we introduce a new paradigm for sleep staging that leverages large multimodal general-purpose models to emulate clinical diagnostic practices. Specifically, we convert raw one-dimensional PSG time-series into intuitive two-dimensional waveform images and then fine-tune a multimodal large model to learn from these representations. Experiments on three public datasets (ISRUC, MASS, SHHS) demonstrate that our approach enables general-purpose models, without prior exposure to sleep data, to acquire robust staging capabilities. Moreover, explanation analysis reveals our model learned to mimic the visual diagnostic workflow of human experts for sleep staging by PSG images. The proposed method consistently outperforms state-of-the-art baselines in accuracy and robustness, highlighting its efficiency and practical value for medical applications. The code for the signal-to-image pipeline and the PSG image dataset will be released.
2024-03-07
Article · Open access
Individuals of admixed ancestries (for example, African Americans) inherit a mosaic of ancestry segments (local ancestry) originating from multiple continental ancestral populations. This offers the unique opportunity of investigating the similarity of genetic effects on traits across ancestries within the same population. Here we introduce an approach to estimate correlation of causal genetic effects (r_admix) across local ancestries and analyze 38 complex traits in African-European admixed individuals (N = 53,001) to observe very high correlations (meta-analysis r_admix = 0.95, 95% credible interval 0.93–0.97), much higher than correlation of causal effects across continental ancestries. We replicate our results using regression-based methods from marginal genome-wide association study summary statistics. We also report realistic scenarios where regression-based methods yield inflated heterogeneity-by-ancestry due to ancestry-specific tagging of causal effects, and/or polygenicity. Our results motivate genetic analyses that assume minimal heterogeneity in causal effects by ancestry, with implications for the inclusion of ancestry-diverse individuals in studies.
Calibrated prediction intervals for polygenic scores across diverse contexts
Nature Genetics · 2024-06-17 · 53 citations
Article · Open access
Frequent coauthors
- Sergey Knyazev · CDC Foundation · 275 shared
- Malika Freund · Stanford University · 187 shared
- Brian Hill · University of California, Los Angeles · 185 shared
- Noah Zaitlen · University of California, Los Angeles · 184 shared
- Bogdan Paşaniuc · University of California, Los Angeles · 142 shared
- Daniel H. Geschwind · Center for Autism and Related Disorders · 114 shared
- Ruth Johnson · University of California, Los Angeles · 107 shared
- Kristin Boulier · University of California, Los Angeles · 106 shared
Awards & honors
- IBM Ph.D. Fellowship Award (2022-2024)
- ACM SIGSOFT Distinguished Paper Award (2023)
- IEEE TSE Best Paper Award Runner-up (2022)
- Ph.D. Service Excellence Award, Columbia CS (2025)