
James Zou
Assistant Professor of Biomedical Data Science · Faculty Director, AI for Health · Stanford University · Rheumatology
Active 2007–2026
About
James Zou is an Assistant Professor of Biomedical Data Science and, by courtesy, of Computer Science and of Electrical Engineering at Stanford University. He is affiliated with the Center for Artificial Intelligence in Medicine & Imaging (AIMI). His research focuses on artificial intelligence in healthcare, leveraging machine learning and data science to advance medical imaging and biomedical applications. Zou's work involves developing innovative algorithms and computational methods to improve diagnosis, treatment, and understanding of medical conditions, contributing to the integration of AI technologies into clinical practice.
Research topics
- Computer Science
- Artificial Intelligence
- Medicine
- Internal medicine
- Machine Learning
- Cardiology
- Data science
- Computer Security
- Political Science
- Biology
- Genetics
- Engineering
- Psychology
- Algorithm
- Software engineering
- Computational biology
- Ophthalmology
- Cell biology
- Database
- Evolutionary biology
- Law
- Risk analysis (engineering)
- Pharmacology
- Radiology
Selected publications
Evaluation-driven Scaling for Scientific Discovery
ArXiv.org · 2026-04-21
article · Open access
Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
Artificial intelligence agents in cancer research and oncology
Nature Reviews Cancer · 2026-01-12 · 4 citations
article
Reasoning or Knowledge: Stratified Evaluation of Biomedical LLMs
Underline Science Inc. · 2026-03-06
other · Open access · Senior author
Medical reasoning in large language models seeks to replicate clinicians' cognitive processes in interpreting patient data and making diagnostic decisions. However, widely used benchmarks, such as MedQA, MedMCQA, and PubMedQA, mix questions that require multi-step reasoning with those answerable through factual recall, complicating evaluation. We demonstrate this by training a PubMedBERT-based classifier on expert-curated labels and applying it to 11 widely used biomedical QA benchmarks, where we find that only 32.8% of the questions require multi-step reasoning, indicating that current evaluations largely measure recall. This stratified evaluation of biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3) reveals consistently lower performance on reasoning versus knowledge (e.g., HuatuoGPT-o1: 56.9% vs. 44.8%). Beyond accuracy, we assess robustness through adversarial evaluations in which models are prefixed with uncertainty-inducing statements; biomedical reasoning models degrade sharply in this setting (e.g., MedReason: 50.4% → 24.4%), with declines especially pronounced on reasoning-heavy questions. Finally, we show that fine-tuning on high-quality reasoning examples augmented with adversarial traces, followed by reinforcement learning with GRPO, improves both robustness and accuracy across knowledge and reasoning subsets.
The Virtual Biotech: A Multi-Agent AI Framework for Therapeutic Discovery and Development
bioRxiv (Cold Spring Harbor Laboratory) · 2026-02-23 · 1 citation
article · Open access · Senior author · Corresponding
Drug discovery and development requires integrating diverse evidence across biological scales and data modalities. However, relevant data, tools, and expertise remain fragmented across teams and organizations, making integration difficult. To address these challenges, we introduce the Virtual Biotech, a coordinated team of AI agents that mirrors the structure of human therapeutic research organizations to support end-to-end computational discovery. The Virtual Biotech is led by a Chief Scientific Officer agent that receives scientific queries, delegates them to domain-specialized scientist agents, and integrates their outputs through data-driven reasoning. Scientist agents leverage complementary tools and knowledge sources spanning statistical genetics, functional genomics, pathways and interactions, chemoinformatics, disease biology, and clinical data. We showcase the Virtual Biotech across three translational applications. First, the agents autonomously annotated and analyzed outcomes from 55,984 clinical trials to identify genomic features of drug targets associated with trial success. More than 37,000 clinical-trialist agents curated structured trial outcomes and linked targets to multi-omic annotations, including cell-type-specific features derived by the agents from single-cell RNA-sequencing atlases. The agents discovered that drugs targeting cell-type-specific genes were 40% more likely to progress from Phase I to Phase II and 48% more likely to reach market (Phase IV), while exhibiting 32% lower adverse event rates. Second, the Virtual Biotech evaluated B7-H3 as a lung cancer target, integrating statistical genetics, single-cell, spatial, and clinicogenomic evidence to propose an antibody–drug conjugate strategy while identifying key liabilities and differentiation opportunities. Third, the platform analyzed a terminated ulcerative colitis trial targeting OSMR β to infer potential failure mechanisms and proposed biomarker-guided enrollment strategies to address precision-medicine gaps. Together, these results illustrate how the Virtual Biotech can enable more transparent, efficient, and comprehensive multi-scale therapeutic analyses, helping to accelerate early-stage drug discovery workflows while keeping human scientists in the loop.
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
Underline Science Inc. · 2026-03-06
other · Open access · Senior author
Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method and show an average performance of 95.9%, indicating that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages. Interestingly, a third of the authors found many technical terms “overtranslated,” expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation.
Leveraging multi-modal foundation models for analysing spatial multi-omic and histopathology data
Nature Biomedical Engineering · 2026-02-05
article
Reimagining human-centric drug development with new approach methodologies
Science · 2026-04-16
article · Open access
Despite unprecedented technological progress, most drug candidates continue to fail in clinical trials, reflecting a persistent gap between preclinical models and human biology. New approach methodologies (NAMs), by spanning human-derived cellular systems, microphysiological platforms, and artificial intelligence, offer a paradigm shift in human-centric drug development and biomedical research. Recent regulatory reforms, such as the US Food and Drug Administration (FDA) Modernization Act 3.0, have begun to position NAMs as a complement to or replacement for animal testing. This Review synthesizes emerging biological and computational NAMs and examines how their integration is reshaping drug development. We also discuss regulatory and ethical frameworks enabling this transition and outline a roadmap for embedding human-based science in a predictive, efficient, and ethically grounded infrastructure of human-centered drug development.
AI Agents for Data Science: A Discussion of “LAMBDA: A Large Model Based Data Agent”
Journal of the American Statistical Association · 2026-01-02
article · 1st author · Corresponding
Molecular Systems Biology · 2026-04-23
article · Open access
The rise of antibiotic-resistant pathogens such as Staphylococcus aureus has created an urgent need for new antibiotics. Generative artificial intelligence (AI) has shown promise in drug discovery, but existing models often fail to propose compounds that are both effective and synthetically tractable. To address these challenges, we introduce SyntheMol-RL, a reinforcement learning-based generative model that can rapidly design synthetically accessible small-molecule drug candidates from a massive chemical space of 46 billion compounds. SyntheMol-RL improves upon our prior Monte Carlo tree search (MCTS)-based SyntheMol model by generalizing across chemically similar building blocks and enabling multi-parameter optimization. We applied SyntheMol-RL to generate candidate antibiotics against S. aureus by optimizing for both antibacterial activity and aqueous solubility, and we found that SyntheMol-RL generated molecules with improved predicted properties compared to both the previous MCTS version of SyntheMol as well as an AI-based virtual screening baseline. We synthesized 79 SyntheMol-RL compounds that were unique relative to the training dataset and found that 13 showed potent in vitro activity, of which seven passed our structural novelty filters that compared them to known antibiotics. Furthermore, one hit compound, synthecin, demonstrated efficacy in a murine wound infection model of methicillin-resistant S. aureus (MRSA). These results validate SyntheMol-RL's ability to generate synthetically accessible candidate antibiotics and position SyntheMol-RL as a powerful tool for drug design across therapeutic domains.
Improving LLM Group Fairness on Tabular Data via In-Context Learning
Proceedings of the AAAI/ACM Conference on AI Ethics and Society · 2025-10-15 · 1 citation
article · Open access · Senior author
Large language models (LLMs) have been shown to be effective on tabular prediction tasks in the low-data regime, leveraging their internal knowledge and ability to learn from instructions and examples. However, LLMs can fail to generate predictions that satisfy group fairness, that is, produce equitable outcomes across groups. Critically, conventional debiasing approaches for natural language tasks do not directly translate to mitigating group unfairness in tabular settings. In this work, we systematically investigate four empirical approaches to improve group fairness of LLM predictions on tabular datasets, including fair prompt optimization, soft prompt tuning, strategic selection of few-shot examples, and self-refining predictions via chain-of-thought reasoning. Through experiments on four tabular datasets using both open-source and proprietary LLMs, we show the effectiveness of these methods in enhancing demographic parity while maintaining high overall performance. Our analysis provides actionable insights for practitioners in selecting the most suitable approach based on their specific requirements and constraints.
Recent grants
CRII: III: Robust Machine Learning Methods for Messy Data
NSF · $175k · 2017–2019
Stanford Medicine Center for Longevity and Healthy Aging Analysis Core
NIH · $9.0M · 2018–2028
AF: MEDIUM: Collaborative Research: Foundations of Adaptive Data Analysis
NSF · $276k · 2018–2021
CAREER: Enabling data valuation and deletion in human-centered machine learning
NSF · $500k · 2020–2025
NIH · $191k · 2020
Frequent coauthors
- Zhenqin Wu · Enable Biosciences (United States) · 81 shared
- Eric Q. Wu · 79 shared
- Alexandro E. Trevino · 59 shared
- Aaron T. Mayer · Enable Biosciences (United States) · 59 shared
- Martin Jinye Zhang · Harvard University · 58 shared
- David Ouyang · Cedars-Sinai Smidt Heart Institute · 58 shared
- B Bernstein · Broad Institute · 51 shared
- Bryan He · 46 shared
Education
- 2018 · Ph.D., Biomedical Data Science · Stanford University
- 2013 · M.S., Computer Science · Stanford University
- 2011 · B.S., Electrical Engineering and Computer Science · University of California, Berkeley