Hannaneh Hajishirzi

Professor · Verified

University of Washington · Computer Science & Engineering

Active 2007–2025

h-index: 75

Citations: 21.6k

Papers: 435 (302 in the last 5 years)

Funding: $600k

About

Hannaneh Hajishirzi is the Torode Family Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington and a Senior Research Director at the Allen Institute for AI (AI2). She earned her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign and was a postdoctoral associate at Disney Research and Carnegie Mellon University. Her research focuses on natural language processing (NLP) and artificial intelligence (AI), with a particular emphasis on understanding and advancing large language models. She leads the H2Lab, which publishes extensively in top-tier NLP, AI, and machine learning conferences. Her research goals include establishing the science of language modeling through the OLMo project, expanding the applicability of language models to benefit human lives via post-training efforts, and developing a new generation of retrieval-based language models that address fundamental challenges in current models. Professor Hajishirzi has published over 140 scientific articles in leading journals and conferences across machine learning, AI, NLP, and computer vision. She has received numerous awards, including the 2020 Alfred Sloan Fellowship, the 2021 NSF CAREER award, the 2019 Intel Rising Star award, the 2018 Allen Distinguished Investigator award, and the 2023 Academic Achievement UIUC Alumni award, and she was a finalist for GeekWire's 2024 Innovator of the Year award. Her lab's work has been recognized with best paper nominations and awards and has been featured in media outlets such as The New York Times, Forbes, NPR, MIT Technology Review, GeekWire, and Wired Magazine.

Research topics

Computer Science
Artificial Intelligence
Machine Learning
Natural Language Processing
Archaeology
Algorithm
History
Engineering
Psychology
Geology
Data science

Selected publications

A Large-Scale Study of Reranker Relevance Feedback at Inference
2025-07-13 · 1 citation
article · Open access · Senior author
Neural IR systems often employ a retrieve-and-rerank framework: a bi-encoder retrieves a fixed number of candidates (e.g., 100), which a cross-encoder then reranks. Recent studies have indicated that relevance feedback from the reranker at inference time can improve the recall of the retriever. The approach works by updating the retriever's query representations via a distillation process that aligns them with the reranker's predictions. While this is a powerful idea, the arguably narrow scope of past studies, which focus on a small number of specific domains such as English question answering and entity retrieval, has left a gap in our understanding of how well it generalizes. In this paper, we study inference-time reranker relevance feedback extensively across multiple retrieval domains, languages, and modalities, while also investigating aspects such as the performance and latency implications of the number of distillation updates and feedback candidates.
RewardBench 2: Advancing Reward Model Evaluation
arXiv (Cornell University) · 2025-06-02
preprint · Open access
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and more domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points on average lower on RewardBench 2 compared to the first RewardBench -- while being highly correlated with downstream performance. Compared to most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms like proximal policy optimization.
Spurious Rewards: Rethinking Training Signals in RLVR
arXiv (Cornell University) · 2025-06-12
preprint · Open access
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR training with GRPO improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 percentage points using randomly assigned rewards, nearly matching the 29.1-point gain from ground-truth rewards. To explain this counterintuitive observation, we show that GRPO exhibits a clipping bias from the clip term, which can amplify high-prior behaviors learned during pretraining even without informative rewards. As a case study, we identify one such behavior in Qwen2.5-Math models, which we call code reasoning -- reasoning in code without actual code execution; code-reasoning frequency increases from 65 percent to over 90 percent with spurious rewards. However, the presence of such amplifiable behaviors is highly model-dependent. In practice, spurious rewards that are effective for Qwen models often fail to produce gains for other model families, such as Llama3 or OLMo2. Our results highlight the importance of validating RL methods across diverse models rather than relying on a single de facto choice: large gains can arise on Qwen models even from random rewards that do not reflect genuine capability improvements.
Generalizing Verifiable Instruction Following
ArXiv.org · 2025-07-03 · 1 citation
preprint · Open access · Senior author
A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. Instructions commonly include output constraints, such as "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times", that the user adds to craft a more useful answer. Even today's strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
FlexOlmo: Open Language Models for Flexible Data Use
ArXiv.org · 2025-07-09
preprint · Open access
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners' preferences by keeping their data local and supporting fine-grained control of data access during inference.
Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations
ArXiv.org · 2025-10-20
preprint · Open access
Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
2025-01-01 · 1 citation
article · Open access · Senior author
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
ArXiv.org · 2025-04-09
preprint · Open access
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
arXiv (Cornell University) · 2025-02-14
preprint · Open access
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
ArXiv.org · 2025-05-30
preprint · Open access · Senior author
Large Language Models (LLMs) are increasingly used to support scientific research, but their knowledge of scientific advancements can quickly become outdated. We introduce ScienceMeter, a new framework for evaluating scientific knowledge update methods over scientific knowledge spanning the past, present, and future. ScienceMeter defines three metrics: knowledge preservation, the extent to which models' understanding of previously learned papers is preserved; knowledge acquisition, how well scientific claims from newly introduced papers are acquired; and knowledge projection, the ability of the updated model to anticipate or generalize to related scientific claims that may emerge in the future. Using ScienceMeter, we examine the scientific knowledge of LLMs on claim judgment and generation tasks across a curated dataset of 15,444 scientific papers and 30,888 scientific claims from ten domains including medicine, biology, materials science, and computer science. We evaluate five representative knowledge update approaches including training- and inference-time methods. With extensive experiments, we find that the best-performing knowledge update methods can preserve only 85.9% of existing knowledge, acquire 71.7% of new knowledge, and project 37.7% of future knowledge. Inference-based methods work for larger models, whereas smaller models require training to achieve comparable performance. Cross-domain analysis reveals that performance on these objectives is correlated. Even when applied to specialized scientific LLMs, existing knowledge update methods fail to achieve these objectives collectively, underscoring that developing robust scientific knowledge update mechanisms is both crucial and challenging.

Recent grants

RI: Small: Learning to Read, Ground, and Reason in Multimodal Text
NSF · $450k · 2016–2020
EAGER: Generating and Understanding Narratives for Dynamic Environments
NSF · $150k · 2013–2016

Frequent coauthors

Noah A. Smith
101 shared
Luke Zettlemoyer
90 shared
Sewon Min
89 shared
Yejin Choi
77 shared
Daniel Khashabi
72 shared
Ali Farhadi
68 shared
Akari Asai
67 shared
Mari Ostendorf
43 shared

Education

Ph.D.
University of Illinois at Urbana-Champaign
Postdoctoral associate
Disney Research and Carnegie Mellon University

Awards & honors

2020 Alfred Sloan Fellowship
2021 NSF CAREER award
2019 Intel Rising Star award
2018 Allen Distinguished Investigator award
2023 Academic Achievement UIUC Alumni award
