Mohit Iyyer

Adjunct Associate Professor · Verified

University of Massachusetts Amherst · International Relations

Active 2011–2025

h-index 42
Citations 23.6k
Papers 190 · 125 last 5y
Funding $1.1M

About

Mohit Iyyer is an assistant professor in computer science at the Manning College of Information and Computer Sciences (CICS) at the University of Massachusetts Amherst. He is a member of UMass NLP and his research interests broadly encompass natural language processing and machine learning. Much of his work utilizes deep learning techniques to model language at the discourse level. Previously, he was a young investigator at AI2 and completed his PhD at the University of Maryland, College Park, advised by Jordan Boyd-Graber and Hal Daumé III. He holds a master's degree in computer science from the University of Maryland, College Park, and a bachelor's degree from Washington University. His academic background and research focus are centered on advancing understanding and modeling of human language through artificial intelligence.

Research topics

  • Machine Learning
  • Computer Science
  • Artificial Intelligence
  • Natural Language Processing
  • Philosophy
  • Geology
  • Linguistics

Selected publications

  • Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries

    Transactions of the Association for Computational Linguistics · 2025-01-01 · 3 citations

    article · Open access

    Language model users often issue queries that lack specification, where the context under which a query was issued (such as the user’s identity, the query’s intent, and the criteria for a response to be useful) is not explicit. For instance, a good response to a subjective query like “What book should I read next?” would depend on the user’s preferences, and a good response to an open-ended query like “How do antibiotics work against bacteria?” would depend on the user’s expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping benchmark rankings between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts. Specifically, our procedure suggests a potential bias towards WEIRD (Western, Educated, Industrialized, Rich and Democratic) contexts in models’ “default” responses and we find that models are not equally sensitive to following different contexts, even when they are provided in prompts.

  • Literary Evidence Retrieval via Long-Context Language Models

    ArXiv.org · 2025-06-03

    preprint · Open access · Senior author

    How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.

  • AI use in American newspapers is widespread, uneven, and rarely disclosed

    ArXiv.org · 2025-10-21 · 1 citation

    preprint · Open access · Senior author

    AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.

  • OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

    2025-01-01 · 1 citation

    article · Open access · Senior author

    Alisha Srivastava, Emir Kaan Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.

  • Localizing and Mitigating Errors in Long-form Question Answering

    2025-01-01

    article · Open access

    Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, challenging their faithful evaluation. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-Informed Refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces errors and improves the quality of the answers across multiple models. Furthermore, humans find the answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers. Code and data available at: github.com/

  • Does quantization affect models' performance on long-context tasks?

    ArXiv.org · 2025-05-26 · 1 citation

    preprint · Open access · Senior author

    Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and for languages other than English.

  • EditLens: Quantifying the Extent of AI Editing in Text

    ArXiv.org · 2025-10-03

    preprint · Open access · Senior author

    A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.

  • People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

    2025-01-01 · 6 citations

    article · Senior author

  • People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

    ArXiv.org · 2025-01-26 · 3 citations

    preprint · Open access · Senior author

    In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.

  • Literary Evidence Retrieval via Long-Context Language Models

    2025-01-01

    article · Open access · Senior author

    How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a gap in interpretive reasoning. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.

Frequent coauthors

  • Kalpesh Krishna

    70 shared
  • Luke Zettlemoyer

    31 shared
  • John Wieting

    29 shared
  • Jordan Boyd‐Graber

    27 shared
  • Katherine Thai

    25 shared
  • Eunsol Choi

    23 shared
  • Yixiao Song

    21 shared
  • Yejin Choi

    19 shared