Olivier Gevaert

Verified

Stanford University · Rheumatology

Active 2005–2026

h-index: 60

Citations: 15.8k

Papers: 490 (264 in the last 5 years)

Funding: $6.5M

About

Olivier Gevaert is an Assistant Professor of Medicine in Biomedical Informatics and of Biomedical Data Science at Stanford University, and is affiliated with the Center for Artificial Intelligence in Medicine & Imaging (AIMI). His research applies artificial intelligence and data science to medicine and imaging, including the development and integration of AI methods for medical image analysis and improved healthcare outcomes.

Research topics

Computer Science
Artificial Intelligence
Data science
Medicine
Machine Learning
Pathology
Data Mining
Geology
Internal medicine
Radiology
Biology
Nuclear medicine
Computational biology
Genetics
Medical physics
Simulation
Engineering
Structural engineering

Selected publications

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
ArXiv.org · 2026-05-19
article · Open access · Senior author
Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
Publisher OA PDF
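The contrastive image-text training described in the abstract can be illustrated with a CLIP-style symmetric loss. This is a minimal sketch, not the authors' implementation: the symmetric cross-entropy form, the temperature value of 0.07, and the names `clip_contrastive_loss` and `cross_entropy_row` are assumptions made for illustration, and the toy 2-D vectors stand in for real encoder outputs.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (log-sum-exp stabilised)."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of images and texts is the positive, all others are negatives.

    The temperature is an assumed hyperparameter, not a value taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy embeddings: matched pairs point in the same direction, swapped pairs do not.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = clip_contrastive_loss(aligned_images, aligned_texts)
loss_swapped = clip_contrastive_loss(aligned_images, swapped_texts)
```

In this toy case the aligned pairs produce a loss near zero and the swapped pairs a large one, which is the signal that pulls the vision encoder toward the caption content during pretraining.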
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
arXiv (Cornell University) · 2026-05-19
preprint · Open access · Senior author
Publisher DOI
Improving Medical VQA through Trajectory-Aware Process Supervision
ArXiv.org · 2026-04-10
article · Open access · Senior author
Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. We make our code and generated reasoning datasets publicly available at https://anonymous.4open.science/r/MICCAI-R1-MED-VQA-code-B14B/
Publisher OA PDF
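The trajectory-aware reward described in the abstract compares two sequences of step embeddings with Dynamic Time Warping. The sketch below shows the textbook DTW recurrence on toy vectors. The cosine step distance, the `1 / (1 + distance)` mapping to a reward, and the function names are assumptions for illustration; the abstract does not state which step distance or reward transform the authors use, and real inputs would come from a sentence-embedding model.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b, dist=cosine_distance):
    """Classic dynamic time warping cost between two sequences of vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i steps of seq_a with the first j of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def trajectory_reward(generated, reference):
    """Hypothetical reward in (0, 1]: identical trajectories score 1, distant ones approach 0."""
    return 1.0 / (1.0 + dtw_distance(generated, reference))


# Toy 2-D "step embeddings": the generated trace repeats its first step once.
reference = [[1.0, 0.0], [0.0, 1.0]]
generated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reward_same_path = trajectory_reward(generated, reference)
reward_wrong_path = trajectory_reward([[0.0, 1.0]], [[1.0, 0.0]])
```

Because DTW allows one reference step to align with several generated steps, a trace that restates a step is not penalised, while a trace that follows a different line of reasoning is.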
SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning
ArXiv.org · 2026-04-10
article · Open access · Senior author
Medical vision-language datasets are often limited in size and biased toward negative findings, because clinicians mostly report abnormalities and may omit positive/neutral findings that they consider irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. We then enrich the findings in the training-set medical reports by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (average gains of 5.63%, 3.04%, 7.40%, 5.30%, and 7.47% on COMET, BERTScore, Sentence-BLEU, CheXbert-F1, and RadGraph-F1, respectively). Ablation studies confirm that the improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (average gains of 2.78%, 3.14%, and 12.80% on COMET, BERTScore, and Sentence-BLEU, respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF
Publisher OA PDF
SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning
arXiv (Cornell University) · 2026-04-10
preprint · Open access · Senior author
Publisher DOI
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model
ArXiv.org · 2026-01-21
article · Open access · Senior author
Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.
Publisher OA PDF
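The masked central spot objective described in the abstract can be pictured with a single graph-convolution layer applied to a tiny spot graph. This is a rough sketch under stated assumptions: it uses the standard symmetric-normalised GCN propagation rule, a three-spot graph, one gene, and an identity weight matrix, none of which are taken from the paper, and the names `gcn_layer` and `mask_spot` are hypothetical.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adjacency)
    in_dim, out_dim = len(weights), len(weights[0])
    # Add self-loops, then take degrees of the augmented graph.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Aggregate neighbour features with symmetric normalisation.
    aggregated = [
        [
            sum(
                inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] * features[j][k]
                for j in range(n)
            )
            for k in range(in_dim)
        ]
        for i in range(n)
    ]
    # Linear map followed by ReLU.
    return [
        [
            max(0.0, sum(aggregated[i][k] * weights[k][o] for k in range(in_dim)))
            for o in range(out_dim)
        ]
        for i in range(n)
    ]


def mask_spot(features, index):
    """Return a copy of the feature matrix with one spot's expression zeroed out."""
    return [[0.0] * len(row) if i == index else list(row) for i, row in enumerate(features)]


# Spot 0 is the central spot; spots 1 and 2 are its spatial neighbours.
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
expression = [[5.0], [6.0], [6.0]]  # one gene per spot, toy values
identity_weights = [[1.0]]

hidden = gcn_layer(adjacency, mask_spot(expression, 0), identity_weights)
central_estimate = hidden[0][0]  # built only from the two neighbours
```

Training would compare an output such as `central_estimate` with the held-out true expression of the central spot and update the weights, so the model has to learn how a spot's expression depends on its spatial context.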
Abstract 2775: Deep-learning CT biomarker improves early efficacy detection in simulated randomized phase II NSCLC trials.
Cancer Research · 2026-04-03
article
Background: Early decision-making in advanced non-small cell lung cancer (NSCLC) phase II trials is limited by the modest ability of objective response and progression-free survival (PFS) to detect early biological activity or predict overall survival (OS). Quantitative deep-learning analysis of routine CT imaging may offer a more sensitive measure that better reflects long-term benefit. We evaluated whether Serial CTRS, a fully automated CT-based deep-learning imaging biomarker, could improve early efficacy detection in simulated randomized phase II NSCLC trials.
Methods: We evaluated the utility of Serial CTRS using data from the randomized phase III trial of cetuximab plus carboplatin/paclitaxel with or without bevacizumab in advanced NSCLC, which did not meet its co-primary endpoints of PFS in patients with EGFR FISH-positive cancer and OS in the entire study population (SWOG S0819; N=1275). Serial CTRS is a convolutional-neural-network pipeline, trained on a large real-world advanced NSCLC dataset, using paired baseline and follow-up thoracic CT scans to generate a continuous imaging score without manual annotation. To quantify OS surrogacy, we repeatedly sampled 1000 pairs of random 50-patient arms from the full cohort, and correlated Serial CTRS differences at 8, 16, and 24 weeks with final OS hazard ratios (HR), comparing results with best overall response (BOR) and PFS. To simulate a positive phase II trial, we constructed a balanced subset (target OS HR≈0.50) using stratified pruning matched on randomization factors. We then simulated 1000 two-arm phase II trials (n=50/arm) with realistic staggered enrollment (averaging 1 patient/day) and interim analyses (IA) at 12-48 weeks from study start. PFS was evaluated via log-rank tests and Serial CTRS differences via Wilcoxon rank-sum tests (α=0.05). False-positive rates were evaluated through null simulations using the full dataset.
Results: Serial CTRS differences showed increasing concordance with OS HR across timepoints (R²=0.10, 0.23, 0.35 at 8, 16, and 24 weeks), outperforming BOR (R²=0.08) and PFS (R²=0.09, 0.20, 0.28). In the simulated phase II trials, the biomarker achieved 60% (95% CI 58-62%) power and 66% (63-69%) power at 36 weeks to detect a long-term survival benefit while maintaining a 5-6% false-positive rate. BOR achieved 35% (33-37%) power, and PFS achieved 49% (46-51%) and 50% (48-52%) at the same timepoints.
Conclusions: A fully automated deep-learning CT biomarker provided earlier and more reliable efficacy readouts than BOR and PFS in simulated phase II NSCLC trials. These results suggest that quantitative CT biomarkers using the full thoracic scan can strengthen early drug-development decisions by improving power and reducing uncertainty around early activity signals. Ongoing work is focused on broader evaluation across tumor types, therapeutic modalities, and additional clinical datasets.
Citation Format: Chiharu Sako, Brenda F. Kurland, Taly G. Schmidt, Dwight H. Owen, Arpan A. Patel, Nicholas C. Love, Olivier Gevaert, George R. Simon, Ravi B. Parikh, Petr Jordan. Deep-learning CT biomarker improves early efficacy detection in simulated randomized phase II NSCLC trials [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2775.
Publisher DOI
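The power and false-positive figures in the abstract come from repeating a two-arm comparison many times and counting how often the test rejects. A minimal sketch of that loop follows, assuming synthetic Gaussian scores in place of real Serial CTRS values and a normal-approximation Wilcoxon rank-sum test without tie correction; the effect size, seed, and function names are illustrative and do not reproduce the study's resampling of trial data.

```python
import math
import random
from statistics import NormalDist


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation (no tie correction)."""
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values and give them their average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(effect, n_per_arm=50, n_trials=1000, alpha=0.05, seed=0):
    """Fraction of simulated two-arm trials in which the rank-sum test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_arm)]
        if rank_sum_p_value(control, treated) < alpha:
            rejections += 1
    return rejections / n_trials


false_positive_rate = simulated_power(effect=0.0)  # null simulation: no true difference
power_large_effect = simulated_power(effect=1.0)   # hypothetical one-standard-deviation shift
```

Running the same loop with `effect=0.0` estimates the false-positive rate, which should stay close to `alpha`, while a non-zero effect estimates power for that assumed effect size.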
The ADAPT learning cancer treatment system: ARPA-H’s initiative to revolutionize cancer therapy
Cancer Cell · 2026-01-08
article · Open access
Publisher OA PDF DOI
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model.
PubMed · 2026-01-21
article · Senior author
Publisher
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model
Europe PMC (PubMed Central) · 2026-01-21
preprint · Open access · Senior author
Publisher OA PDF DOI

Recent grants

Radiogenomics framework for non-invasive personalized medicine
NIH · $444k · 2015–2021
NIH Grant U01DE025188
NIH · $3.8M · 2019
NIH Grant R01EB020527
NIH · $2.3M · 2021

Frequent coauthors

Daniel T. Chang
1211 shared
Sylvia K. Plevritis
950 shared
Gary K. Steinberg
Stanford Medicine
901 shared
Erik P. Sulman
New York University
900 shared
Lih‐Shen Chin
Shanghai University of Traditional Chinese Medicine
900 shared
N. Saito
900 shared
Kelsey Hopkins
Purdue University West Lafayette
900 shared
Ivan Smirnov
University of California, San Francisco
900 shared

Education

Ph.D., Biomedical Informatics
Stanford University
2015
M.S., Biomedical Informatics
Stanford University
2011
B.S., Computer Science
University of Ghent
2007
