
Yejin Choi
Verified · Stanford University · Learning, Design, and Technology
Active 2003–2025
About
Yejin Choi is a leading figure in AI research, focusing on AI for Science, Pluralistic Alignment & AI for humanity, and Alternative Training and Inference Algorithms. She works on molecular foundation models, protein reasoning, and molecular reasoning and retrosynthesis. She also explores data and algorithms for pluralistic norms and values, deliberate alignment processes, and civic discourse.
Research topics
- Artificial Intelligence
- Computer Science
- Natural Language Processing
- Data Mining
- Information Retrieval
- Programming language
- Computer vision
- Chemistry
- Linguistics
- Chromatography
- Data science
- Philosophy
- Cartography
- Mathematics
- Statistics
- Algorithm
- Mathematical optimization
- Psychology
- Geography
Selected publications
Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
ArXiv.org · 2025-11-07
preprint · Open access · Senior author
Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesize large-scale vision-centric datasets beyond visual math. We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts supporting SFT, offline and online RL. Our vision-centric synthesis framework uses a two-stage process focusing on: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, finetuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on Vstar Bench, CV-Bench and MMStar-V. Notably, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro, +3.7%) and audio reasoning (MMAU, +1.32%), demonstrating its effectiveness. Similarly, despite containing no embodied visual data, we observe notable gains (NiEH, +8.8%) when evaluating open-ended embodied QA. Lastly, we use our data to comprehensively analyze at scale (1M+) the entire VLM post-training pipeline showing that (i) SFT on high-quality data with cognitive behaviors on reasoning traces is essential to scale online RL, (ii) offline RL could match online RL's performance while disaggregating compute demands, and (iii) SFT on high-quality data also improves out-of-domain, cross-modality transfer.
PLoS Pathogens · 2025-07-07 · 9 citations
article · Open access
Protein glycosylation, a co- and post-translational modification that enhances the functional diversity of the proteome, contributes to various molecular and cellular functions by transferring different polysaccharides onto proteins. During the last decade, the role of glycosylation in plant pathogenic fungi has received significant attention, and glycoproteins are expected to play essential roles in various biological processes including pathogenicity. However, the comprehensive functional genetic analyses for protein glycosylation pathways and glycan structures of phytopathogenic fungi are still largely unknown. Here, we investigated the role of protein glycosylation in Fusarium graminearum by identifying 65 putative genes involved in protein glycosylation and characterizing their functions. Through cell wall component profiling and HPLC analysis, we characterized the overall N- and O-glycan structures in F. graminearum and found that deletion of ALG3 and ALG12 led to truncated core N-glycan structures. Quantitative proteomics analysis revealed that the truncated core N-glycans, generated by the loss of two key enzymes in the initial core N-glycosylation pathway, Alg3 and Alg12, affected a wide range of glycoproteins (including transcription factors, phosphatases, kinases, peroxidases, and other proteins involved in various biological processes), ultimately impacting the virulence of F. graminearum. This study elucidates the complex roles of glycosylation, highlighting the connections among genes involved in the protein glycosylation pathway, glycans, and glycoproteins in regulating the general biology and pathogenicity of F. graminearum. It also would be the fungal glycobiology study initiative.
Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
2025-10-19
preprint · Open access
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
ArXiv.org · 2025-06-09
preprint · Open access
Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generations remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
2025-01-01 · 5 citations
article · Open access · Senior author
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALOGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate 150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
ArXiv.org · 2025-04-09
preprint · Open access
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
ArXiv.org · 2025-06-05
preprint · Open access
Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, significantly speeding up (down by 7.5 seconds on average) with intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.
Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning
ArXiv.org · 2025-04-06
preprint · Open access · Senior author
Large reasoning models exhibit remarkable reasoning capabilities via long, elaborate reasoning trajectories. Supervised fine-tuning on such reasoning traces, also known as distillation, can be a cost-effective way to boost reasoning capabilities of student models. However, empirical observations reveal that these reasoning trajectories are often suboptimal, switching excessively between different lines of thought, resulting in under-thinking, over-thinking, and even degenerate responses. We introduce Retro-Search, an MCTS-inspired search algorithm, for distilling higher quality reasoning paths from large reasoning models. Retro-Search retrospectively revises reasoning paths to discover better, yet shorter traces, which can then lead to student models with enhanced reasoning capabilities with shorter, thus faster inference. Our approach can enable two use cases: self-improvement, where models are fine-tuned on their own Retro-Search-ed thought traces, and weak-to-strong improvement, where a weaker model revises a stronger model's thought traces via Retro-Search. For self-improving, R1-distill-7B, fine-tuned on its own Retro-Search-ed traces, reduces the average reasoning length by 31.2% while improving performance by 7.7% across seven math benchmarks. For weak-to-strong improvement, we retrospectively revise R1-671B's traces from the OpenThoughts dataset using R1-distill-32B as the Retro-Search-er, a model 20x smaller. Qwen2.5-32B, fine-tuned on this refined data, achieves performance comparable to R1-distill-32B, yielding an 11.3% reduction in reasoning length and a 2.4% performance improvement compared to fine-tuning on the original OpenThoughts data. Our work counters recently emergent viewpoints that question the relevance of search algorithms in the era of large reasoning models, by demonstrating that there are still opportunities for algorithmic advancements, even for frontier models.
A Roadmap for Alignable Algorithmic Decision-Makers in the Medical Triage Domain
2025-05-05
article
Artificial intelligence (AI) is increasingly being used in low- and high-stakes decision-making. However, safe and responsible use of AI decision-making systems must also consider human values and characteristics. A promising research direction is to develop novel methods and techniques to align AI systems with human values and intentions, potentially reducing undesirable or harmful behaviors while promoting greater human trust. In this paper, we highlight several promising approaches to this AI alignment problem, focusing on the use of large language models (LLMs) as alignable decision-makers. Specifically, these alignment approaches include several novel prompt-based techniques (using zero- or few-shot learning, persona narratives, or training on a large dataset of pluralistic values) and a technique based on transforming output word embeddings. We demonstrate the feasibility of these approaches for difficult decision-making in the medical triage domain, while also providing several promising future research directions to pursue.
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
ArXiv.org · 2025-06-13
preprint · Open access
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on infini-gram mini indexes.
Recent grants
RI: Small: ConnotationNet: Modeling Non-Literal Meaning in Context
NSF · $500k · 2017–2021
RI: Small: A Data-Driven Framework to Sketch-to-Text Generation
NSF · $450k · 2015–2019
Frequent coauthors
- Ronan Le Bras · 186 shared
- Maarten Sap · 153 shared
- Noah A. Smith · 137 shared
- Chandra Bhagavatula (Allen Institute) · 126 shared
- Ximing Lu · 117 shared
- Swabha Swayamdipta · 95 shared
- Antoine Bosselut · 86 shared
- Jack Hessel · 86 shared
Awards & honors
- MacArthur Fellow (class of 2022)
- ACL Fellow (2022)
- Brett Helsel Career Development Professorship (2020 - 2023)
- Borg Early Career Award (BECA) (2018)
- IEEE AI's 10 to Watch (2016)