Hannaneh Hajishirzi
Professor · Verified · University of Washington · Computer Science & Engineering
Active 2007–2025
About
Hannaneh Hajishirzi is the Torode Family Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington and a Senior Research Director at the Allen Institute for AI (AI2). She earned her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign and was a postdoctoral associate at Disney Research and Carnegie Mellon University.
Her research focuses on natural language processing (NLP) and artificial intelligence (AI), with an emphasis on understanding and advancing large language models. She leads the H2Lab, which publishes extensively in top-tier NLP, AI, and machine learning conferences. Her research goals include establishing the science of language modeling through the OLMo project, expanding the applicability of language models to benefit human lives via post-training efforts, and developing a new generation of retrieval-based language models that address fundamental challenges in current models.
Professor Hajishirzi has published over 140 scientific articles in leading journals and conferences across machine learning, AI, NLP, and computer vision. Her awards include the 2020 Alfred Sloan Fellowship, the 2021 NSF CAREER award, the 2019 Intel Rising Star award, the 2018 Allen Distinguished Investigator award, and the 2023 Academic Achievement UIUC Alumni award; she was also a finalist for GeekWire's 2024 Innovator of the Year award. Her lab's work has received best paper nominations and awards and has been featured in The New York Times, Forbes, NPR, MIT Technology Review, GeekWire, and Wired Magazine.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Natural Language Processing
- Archaeology
- Algorithm
- History
- Engineering
- Psychology
- Geology
- Data science
Selected publications
A Large-Scale Study of Reranker Relevance Feedback at Inference
2025-07-13 · 1 citation
article · Open access · Senior author
Neural IR systems often employ a retrieve-and-rerank framework: a bi-encoder retrieves a fixed number of candidates (e.g., 100), which a cross-encoder then reranks. Recent studies have indicated that relevance feedback from the reranker at inference time can improve the recall of the retriever. The approach works by updating the retriever's query representations via a distillation process that aligns it with the reranker's predictions. While a powerful idea, the arguably narrow scope of past studies focusing on a small number of specific domains such as English question answering and entity retrieval has left a gap in our understanding of how well it generalizes. In this paper, we study inference-time reranker relevance feedback extensively across multiple retrieval domains, languages, and modalities, while also investigating aspects such as the performance and latency implications of the number of distillation updates and feedback candidates.
RewardBench 2: Advancing Reward Model Evaluation
arXiv (Cornell University) · 2025-06-02
preprint · Open access
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
Spurious Rewards: Rethinking Training Signals in RLVR
arXiv (Cornell University) · 2025-06-12
preprint · Open access
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.
Generalizing Verifiable Instruction Following
ArXiv.org · 2025-07-03 · 1 citation
preprint · Open access · Senior author
A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
FlexOlmo: Open Language Models for Flexible Data Use
ArXiv.org · 2025-07-09
preprint · Open access
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners' preferences by keeping their data local and supporting fine-grained control of data access during inference.
Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
ArXiv.org · 2025-10-20
preprint · Open access
Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
2025-01-01 · 1 citation
article · Open access · Senior author
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
ArXiv.org · 2025-04-09
preprint · Open access
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
arXiv (Cornell University) · 2025-02-14
preprint · Open access
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
ArXiv.org · 2025-05-30
preprint · Open access · Senior author
Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models' understanding of previously learned papers are preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applying on specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.
Recent grants
RI: Small: Learning to Read, Ground, and Reason in Multimodal Text
NSF · $450k · 2016–2020
EAGER: Generating and Understanding Narratives for Dynamic Environments
NSF · $150k · 2013–2016
Frequent coauthors
- Noah A. Smith · 101 shared
- Luke Zettlemoyer · 90 shared
- Sewon Min · 89 shared
- Yejin Choi · 77 shared
- Daniel Khashabi · 72 shared
- Ali Farhadi · 68 shared
- Akari Asai · 67 shared
- Mari Ostendorf · 43 shared
Education
Ph.D.
University of Illinois at Urbana-Champaign
Other
Disney Research and CMU
Awards & honors
- 2020 Alfred Sloan Fellowship
- 2021 NSF CAREER award
- 2019 Intel Rising Star award
- 2018 Allen Distinguished Investigator award
- 2023 Academic Achievement UIUC Alumni award