Mohit Iyyer

Adjunct Associate Professor · Verified

University of Massachusetts Amherst · Computer Science

Active 2011–2025

h-index: 42

Citations: 23.6k

Papers: 190 (125 in the last 5 years)

Funding: $1.1M

About

Mohit Iyyer is an assistant professor in computer science at the Manning College of Information and Computer Sciences (CICS) at the University of Massachusetts Amherst. He is a member of UMass NLP, and his research interests broadly encompass natural language processing and machine learning. Much of his work uses deep learning techniques to model language at the discourse level. Previously, he was a young investigator at AI2. He completed his PhD at the University of Maryland, College Park, advised by Jordan Boyd-Graber and Hal Daumé III. He holds a master's degree in computer science from the University of Maryland, College Park, and a bachelor's degree from Washington University. His research centers on modeling and understanding human language with artificial intelligence.

Research topics

Machine Learning
Computer Science
Artificial Intelligence
Natural Language Processing
Philosophy
Geology
Linguistics

Selected publications

Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries
Transactions of the Association for Computational Linguistics · 2025-01-01 · 3 citations
article · Open access
Language model users often issue queries that lack specification, where the context under which a query was issued (such as the user’s identity, the query’s intent, and the criteria for a response to be useful) is not explicit. For instance, a good response to a subjective query like “What book should I read next?” would depend on the user’s preferences, and a good response to an open-ended query like “How do antibiotics work against bacteria?” would depend on the user’s expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping benchmark rankings between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts. Specifically, our procedure suggests a potential bias towards WEIRD (Western, Educated, Industrialized, Rich and Democratic) contexts in models’ “default” responses and we find that models are not equally sensitive to following different contexts, even when they are provided in prompts.
Literary Evidence Retrieval via Long-Context Language Models
ArXiv.org · 2025-06-03
preprint · Open access · Senior author
How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.
AI use in American newspapers is widespread, uneven, and rarely disclosed
ArXiv.org · 2025-10-21 · 1 citation
preprint · Open access · Senior author
AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.
OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
2025-01-01 · 1 citation
article · Open access · Senior author
Alisha Srivastava, Emir Kaan Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
Localizing and Mitigating Errors in Long-form Question Answering
2025-01-01
article · Open access
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, challenging their faithful evaluation. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-Informed Refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces errors and improves the quality of the answers across multiple models. Furthermore, humans find the answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers. Code and data available at: github.com/
Does quantization affect models' performance on long-context tasks?
ArXiv.org · 2025-05-26 · 1 citation
preprint · Open access · Senior author
Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and for languages other than English.
EditLens: Quantifying the Extent of AI Editing in Text
ArXiv.org · 2025-10-03
preprint · Open access · Senior author
A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
2025-01-01 · 6 citations
article · Senior author
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
ArXiv.org · 2025-01-26 · 3 citations
preprint · Open access · Senior author
In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.
Literary Evidence Retrieval via Long-Context Language Models
2025-01-01
article · Open access · Senior author
How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a gap in interpretive reasoning. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.

Recent grants

RI: Medium: Tree-Structured Self-Supervised Modeling for Natural Language
NSF · $1.1M · 2020–2024

Frequent coauthors

Kalpesh Krishna
70 shared
Luke Zettlemoyer
31 shared
John Wieting
29 shared
Jordan Boyd‐Graber
27 shared
Katherine Thai
25 shared
Eunsol Choi
23 shared
Yixiao Song
21 shared
Yejin Choi
19 shared
