
Olivier Gevaert
Stanford University · Rheumatology
Active 2005–2026
About
Olivier Gevaert is an Assistant Professor of Medicine in Biomedical Informatics and of Biomedical Data Science at Stanford University. He is affiliated with the Center for Artificial Intelligence in Medicine & Imaging (AIMI). His research applies artificial intelligence and data science to medicine and imaging, including the development and integration of AI methods to improve healthcare outcomes and medical imaging analysis.
Research topics
- Computer Science
- Artificial Intelligence
- Data science
- Medicine
- Machine Learning
- Pathology
- Data Mining
- Geology
- Internal medicine
- Radiology
- Biology
- Nuclear medicine
- Computational biology
- Genetics
- Medical physics
- Simulation
- Engineering
- Structural engineering
Selected publications
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
ArXiv.org · 2026-05-19
article · Open access · Senior author
Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
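The abstract above says the model is trained on image-text pairs with contrastive learning but does not state the exact objective. As a rough illustration only, the sketch below assumes the symmetric CLIP-style InfoNCE loss; the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch is treated as a negative. The temperature
    value is an assumed default, not a value from the paper.
    """
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("need a non-empty batch of matched image/text pairs")
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy 2-D embeddings standing in for vision-encoder and text-encoder outputs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Correctly paired embeddings give a loss near zero, while swapping the captions drives it up, which is the signal that would push a vision encoder toward the caption content during pretraining.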
Improving Medical VQA through Trajectory-Aware Process Supervision
ArXiv.org · 2026-04-10
article · Open access · Senior author
Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. We make our code and generated reasoning datasets publicly available at https://anonymous.4open.science/r/MICCAI-R1-MED-VQA-code-B14B/
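The trajectory-aware reward described above compares two sequences of step embeddings with DTW. A minimal sketch of that idea follows, under several assumptions: toy 2-D vectors stand in for sentence-transformer embeddings, cosine distance is used as the per-step cost, and `process_reward` is a hypothetical mapping from distance to a reward in (0, 1]. None of these choices are confirmed by the abstract.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b, dist=cosine_distance):
    """Classic O(n*m) dynamic-programming DTW between two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i steps of seq_a
    # with the first j steps of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # seq_a advances, seq_b step is reused
                cost[i][j - 1],      # seq_b advances, seq_a step is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def process_reward(generated, reference):
    """Hypothetical reward: 1.0 for a perfect alignment, decaying with DTW distance."""
    d = dtw_distance(generated, reference)
    return 1.0 / (1.0 + d / max(len(generated), len(reference)))


# Toy step embeddings (assumed); real ones would come from a sentence encoder.
reference = [[1.0, 0.0], [0.0, 1.0]]
repeated_step = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same order, one step restated
reversed_steps = [[0.0, 1.0], [1.0, 0.0]]             # same steps, wrong order

print(process_reward(repeated_step, reference))
print(process_reward(reversed_steps, reference))
```

Because DTW allows one step to align with several, a trajectory that restates a step is not penalized, while one that presents the same steps in the wrong order is. That tolerance to differing step counts is a plausible reason to prefer DTW over a position-by-position comparison here.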
SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning
ArXiv.org · 2026-04-10
article · Open access · Senior author
Medical vision-language datasets are often limited in size and biased toward negative findings, as clinicians mostly report abnormalities and may omit positive or neutral findings that they consider irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. We then enrich the findings in the training-set medical reports by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (average gains of 5.63%, 3.04%, 7.40%, 5.30%, and 7.47% on COMET, BERTScore, Sentence-BLEU, CheXbert-F1, and RadGraph-F1, respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (average gains of 2.78%, 3.14%, and 12.80% on COMET, BERTScore, and Sentence-BLEU, respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model
ArXiv.org · 2026-01-21
article · Open access · Senior author
Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.
Deep-learning CT biomarker improves early efficacy detection in simulated randomized phase II NSCLC trials
Cancer Research · 2026-04-03
article
Background: Early decision-making in advanced non-small cell lung cancer (NSCLC) phase II trials is limited by the modest ability of objective response and progression-free survival (PFS) to detect early biological activity or predict overall survival (OS). Quantitative deep-learning analysis of routine CT imaging may offer a more sensitive measure that better reflects long-term benefit. We evaluated whether Serial CTRS, a fully automated CT-based deep-learning imaging biomarker, could improve early efficacy detection in simulated randomized phase II NSCLC trials.
Methods: We evaluated the utility of Serial CTRS using data from the randomized phase III trial of cetuximab plus carboplatin/paclitaxel with or without bevacizumab in advanced NSCLC, which did not meet its co-primary endpoints of PFS in patients with EGFR FISH-positive cancer and OS in the entire study population (SWOG S0819; N=1275). Serial CTRS is a convolutional-neural-network pipeline, trained on a large real-world advanced NSCLC dataset, using paired baseline and follow-up thoracic CT scans to generate a continuous imaging score without manual annotation. To quantify OS surrogacy, we repeatedly sampled 1000 pairs of random 50-patient arms from the full cohort, and correlated Serial CTRS differences at 8, 16, and 24 weeks with final OS hazard ratios (HR), comparing results with best overall response (BOR) and PFS. To simulate a positive phase II trial, we constructed a balanced subset (target OS HR≈0.50) using stratified pruning matched on randomization factors. We then simulated 1000 two-arm phase II trials (n=50/arm) with realistic staggered enrollment (averaging 1 patient/day) and interim analyses (IA) at 12-48 weeks from study start. PFS was evaluated via log-rank tests and Serial CTRS differences via Wilcoxon rank-sum tests (α=0.05). False-positive rates were evaluated through null simulations using the full dataset.
Results: Serial CTRS differences showed increasing concordance with OS HR across timepoints (R²=0.10, 0.23, 0.35 at 8, 16, and 24 weeks), outperforming BOR (R²=0.08) and PFS (R²=0.09, 0.20, 0.28). In the simulated phase II trials, the biomarker achieved 60% (95% CI 58-62%) power and 66% (63-69%) power at 36 weeks to detect a long-term survival benefit while maintaining a 5-6% false-positive rate. BOR achieved 35% (33-37%) power, and PFS achieved 49% (46-51%) and 50% (48-52%) at the same timepoints.
Conclusions: A fully automated deep-learning CT biomarker provided earlier and more reliable efficacy readouts than BOR and PFS in simulated phase II NSCLC trials. These results suggest that quantitative CT biomarkers using the full thoracic scan can strengthen early drug-development decisions by improving power and reducing uncertainty around early activity signals. Ongoing work is focused on broader evaluation across tumor types, therapeutic modalities, and additional clinical datasets.
Citation Format: Chiharu Sako, Brenda F. Kurland, Taly G. Schmidt, Dwight H. Owen, Arpan A. Patel, Nicholas C. Love, Olivier Gevaert, George R. Simon, Ravi B. Parikh, Petr Jordan. Deep-learning CT biomarker improves early efficacy detection in simulated randomized phase II NSCLC trials [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2775.
The ADAPT learning cancer treatment system: ARPA-H’s initiative to revolutionize cancer therapy
Cancer Cell · 2026-01-08
article · Open access
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model.
PubMed · 2026-01-21
article · Senior author
[…] perturbation experiments further show that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.
SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model
Europe PMC (PubMed Central) · 2026-01-21
preprint · Open access · Senior author
Recent grants
Radiogenomics framework for non-invasive personalized medicine
NIH · $444k · 2015–2021
NIH · $3.8M · 2019
NIH · $2.3M · 2021
Frequent coauthors
- Daniel T. Chang · 1211 shared
- Sylvia K. Plevritis · 950 shared
- Gary K. Steinberg · Stanford Medicine · 901 shared
- Erik P. Sulman · New York University · 900 shared
- Lih‐Shen Chin · Shanghai University of Traditional Chinese Medicine · 900 shared
- N. Saito · 900 shared
- Kelsey Hopkins · Purdue University West Lafayette · 900 shared
- Ivan Smirnov · University of California, San Francisco · 900 shared
Education
- 2015 · Ph.D., Biomedical Informatics · Stanford University
- 2011 · M.S., Biomedical Informatics · Stanford University
- 2007 · B.S., Computer Science · University of Ghent