Dan Roth

Professor · Verified

University of Pennsylvania · Computer and Information Science

Active 1992–2026

h-index: 88

Citations: 32.8k

Papers: 842 (306 in the last 5 years)

Funding: $1.6M


Research topics

Artificial Intelligence
Computer Science
Natural Language Processing
Linguistics
Information Retrieval
Machine Learning
Data Mining
Psychology
Medicine
Psychiatry
History
Theoretical computer science
Programming language
Archaeology
Mathematics
Data science
Engineering

Selected publications

MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents
Transactions of the Association for Computational Linguistics · 2026-01-01
article · Open access
Automated agents, powered by large language models (LLMs), are emerging as the go-to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap, we introduce MoNaCo, a benchmark of 1,315 natural and time-consuming questions that require dozens, and at times hundreds, of intermediate steps to solve, far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer real-world time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the limitations of LLM-powered agents in handling the complexity and sheer breadth of real-world information-seeking tasks, with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are all publicly available at: https://tomerwolgithub.github.io/monaco.
Program-of-Thought Reveals LLM Abstraction Ceilings
2026-01-01
article · Open access · Senior author
Large language models (LLMs) are often claimed to exhibit reasoning ability when supervised with chain-of-thought (CoT) traces. True reasoning, however, requires invariance: isomorphic problems should yield identical solutions regardless of superficial variation. We test this property by evaluating base and reasoning-optimized models, including LLaMA, Mistral, Qwen, GPT-OSS, and DeepSeek, on isomorphic variants from GSM8K and MATH. All models exhibit substantial accuracy drops under perturbation. To assess whether training can induce invariance, we fine-tune models with Program-of-Thought (PoT) supervision under concrete and masked formulations. PoT fine-tuning increases behavioral cross-variant consistency but does not significantly reduce the accuracy gap, and these gains fail to transfer across prompting formats and domains. Our central finding is that models converge toward stable but systematically incorrect behaviors: consistency without correctness. This dissociation suggests that current reasoning supervision teaches models to reproduce solution templates rather than to abstract mathematical structure.
OraPlan–SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning
Lecture notes in computer science · 2026-01-01
book chapter · Senior author · Corresponding author
Conflicts in Texts: Data, Implications and Challenges
ArXiv.org · 2025-04-28
preprint · Open access · Senior author
As NLP models become increasingly integrated into real-world applications, it becomes clear that there is a need to address the fact that models often rely on and generate conflicting information. Conflicts could reflect the complexity of situations, changes that need to be explained and dealt with, difficulties in data annotation, and mistakes in generated outputs. In all cases, disregarding the conflicts in data could result in undesired behaviors of models and undermine NLP models' reliability and trustworthiness. This survey categorizes these conflicts into three key areas: (1) natural texts on the web, where factual inconsistencies, subjective biases, and multiple perspectives introduce contradictions; (2) human-annotated data, where annotator disagreements, mistakes, and societal biases impact model training; and (3) model interactions, where hallucinations and knowledge conflicts emerge during deployment. While prior work has addressed some of these conflicts in isolation, we unify them under the broader concept of conflicting information, analyze their implications, and discuss mitigation strategies. We highlight key challenges and future directions for developing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.
SAKE: Structured Agentic Knowledge Extrapolation for Complex LLM Reasoning via Reinforcement Learning
ArXiv.org · 2025-05-21
preprint · Open access
Knowledge extrapolation is the process of inferring novel information by combining and extending existing knowledge that is explicitly available. It is essential for solving complex questions in specialized domains where retrieving comprehensive external knowledge is impractical. We propose SAKE (Structured Agentic Knowledge Extrapolation), an RL-powered agentic framework that trains LLMs to autonomously retrieve and extrapolate structured knowledge through tool-augmented reinforcement learning. SAKE defines two external KG tools: entity group construction and cross-group triplet retrieval. The model learns to interleave these two retrieval tools during a three-turn rollout: extracting key entities, filtering relevant concept groups, and associative reasoning by constructing new triplets through analogy. The entire pipeline is optimized end-to-end with GRPO using a curriculum reward, teaching the model what to retrieve and how to reason over it. Our experiments show that a SAKE fine-tuned Qwen2.5-7B model surpasses GPT-3.5-Turbo with state-of-the-art agentic KG reasoning on both biomedical (75.4% vs. 70.1%) and commonsense (81.3% vs. 74.7%) benchmarks, while reducing token usage by over 90%. These results demonstrate that associative reasoning over incomplete structured knowledge does not require large models with complex, multi-step prompting and can instead be learned end-to-end by small, open-weight models through reinforcement learning with the right tools and training signal. Our code is available at https://anonymous.4open.science/r/SAKE-7585.
REaR: Retrieve, Expand and Refine for Effective Multitable Retrieval
ArXiv.org · 2025-11-02
preprint · Open access
Answering natural language queries over relational data often requires retrieving and reasoning over multiple tables, yet most retrievers optimize only for query-table relevance and ignore table-table compatibility. We introduce REAR (Retrieve, Expand and Refine), a three-stage, LLM-free framework that separates semantic relevance from structural joinability for efficient, high-fidelity multi-table retrieval. REAR (i) retrieves query-aligned tables, (ii) expands these with structurally joinable tables via fast, precomputed column-embedding comparisons, and (iii) refines them by pruning noisy or weakly related candidates. Empirically, REAR is retriever-agnostic and consistently improves dense/sparse retrievers on complex table QA datasets (BIRD, MMQA, and Spider) by improving both multi-table retrieval quality and downstream SQL execution. Despite being LLM-free, it delivers performance competitive with state-of-the-art LLM-augmented retrieval systems (e.g., ARM) while achieving much lower latency and cost. Ablations confirm complementary gains from expansion and refinement, underscoring REAR as a practical, scalable building block for table-based downstream tasks (e.g., Text-to-SQL).
EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline
ArXiv.org · 2025-04-04
preprint · Open access · Senior author
Existing information retrieval systems excel in cases where the language of target documents closely matches that of the user query. However, real-world retrieval systems are often required to implicitly reason whether a document is relevant. For example, when retrieving technical texts or tables, their relevance to the user query may be implied through a particular jargon or structure, rather than explicitly expressed in their content. Large language models (LLMs) hold great potential in identifying such implied relevance by leveraging their reasoning skills. Nevertheless, current LLM-augmented retrieval is hindered by high latency and computation cost, as the LLM typically computes the query-document relevance online, for every query anew. To tackle this issue, we introduce EnrichIndex, a retrieval approach which instead uses the LLM offline to build semantically-enriched retrieval indices, by performing a single pass over all documents in the retrieval corpus at ingestion time. Furthermore, the semantically-enriched indices can complement existing online retrieval approaches, boosting the performance of LLM re-rankers. We evaluated EnrichIndex on five retrieval tasks, involving passages and tables, and found that it outperforms strong online LLM-based retrieval systems, with an average improvement of 11.7 points in recall@10 and 10.6 points in NDCG@10 compared to strong baselines. In terms of online calls to the LLM, it processes 293.3 times fewer tokens, which greatly reduces the online latency and cost. Overall, EnrichIndex is an effective way to build better retrieval indices offline by leveraging the strong reasoning skills of LLMs.
INTERCHART: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
2025-01-01
article · Open access
Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. 2025.
Towards Long Context Hallucination Detection
ArXiv.org · 2025-04-28
preprint · Open access · Senior author
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take an initial step toward solving this problem by constructing a dataset specifically designed for long-context hallucination detection. Furthermore, we propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations through a decomposition and aggregation mechanism. Our experimental results show that the proposed architecture significantly outperforms previous models of similar size as well as LLM-based models across various metrics, while providing substantially faster inference.
RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts
ArXiv.org · 2025-10-06
preprint · Open access · Senior author
With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested only bring minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.

Recent grants

Integrated Social History Environment for Research (ISHER)-Digging into Social Unrest
NSF · $125k · 2012–2014
ITR-(ASE+ECS)-(soc+sim+int)-Natural Language Processing Technology for Guided Study of Bioinformatics
NSF · $1.0M · 2004–2008
SoD-HCER: Learning Based Programming
NSF · $483k · 2006–2010

Frequent coauthors

Stephen Mayhew
49 shared
Yanai Elazar
48 shared
Deepak Ramachandran
44 shared
Mark Sammons
42 shared
Hongming Zhang
36 shared
Daniel Khashabi
36 shared
Yangqiu Song
35 shared
Ian Tenney
33 shared
