Dan Jurafsky

· Jackson Eli Reynolds Professor in Humanities, and Professor of Linguistics and Computer ScienceVerified

Stanford University · Linguistics

Active 1988–2026

h-index87

Citations34.3k

Papers372164 last 5y

Funding$1.6M

Faculty page Lab page Website

See your match with Dan Jurafsky — sign in to PhdFit.Sign in

About

Dan Jurafsky is Professor of Linguistics, Professor of Computer Science, and Reynolds Professor in Humanities at Stanford University. He is an award-winning teacher and a MacArthur Fellow, recognized for his significant contributions to the fields of linguistics and computer science. Jurafsky has received the Richard C. Atkinson Prize in Psychological and Cognitive Sciences from the National Academy of Sciences and is a member of both the National Academy of Sciences and the American Academy of Arts and Sciences. Additionally, he is a Fellow of the Association for Computational Linguistics, the American Association for the Advancement of Science, and the Linguistics Society of America. His research and teaching focus on language models and other natural language processing tools, emphasizing their applications to the cognitive and social sciences as well as to social good. Jurafsky is the author of the widely-used online textbook "Speech and Language Processing" and the 2015 international bestseller and James Beard Award-nominee, "The Language of Food." He earned his Ph.D. in Computer Science in 1992 from the University of California at Berkeley.

Research topics

Artificial Intelligence
Computer Science
Political Science
Natural Language Processing
Sociology
Linguistics
Machine Learning
Law
Mathematics
History
Gender studies
Social Science
Information Retrieval
Engineering
Psychology
Management science
Statistics
Speech recognition
Econometrics
Social psychology
Economics
Data science
Environmental economics
Programming language

Selected publications

The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?
arXiv (Cornell University) · 2026-01-12
articleOpen access
Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
Publisher OA PDF
Evaluating Commercial AI Chatbots as News Intermediaries
ArXiv.org · 2026-05-21
articleOpen access
AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere) and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval, not reasoning, failures drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to imperfect queries real users pose.
Publisher OA PDF
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
2026-01-01 · 1 citations
articleOpen access
Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026.
Publisher DOI
Labeling messages as AI-generated does not reduce their persuasive effects
PNAS Nexus · 2026-02-01 · 7 citations
articleOpen access
Abstract As generative AI enables the creation and dissemination of information at massive scale and speed, it is increasingly important to understand how people perceive AI-generated content. One prominent policy proposal requires explicitly labeling AI-generated content to increase transparency and encourage critical thinking about the information, but prior research has not yet tested the effects of such labels. To address this gap, we conducted a survey experiment (N=1,601) on a diverse sample of Americans, presenting participants with an AI-generated message about several public policies (e.g. allowing colleges to pay student-athletes), randomly assigning whether participants were told the message was generated by (i) an expert AI model, (ii) a human policy expert, or (iii) no label. We found that messages were generally persuasive, influencing participants’ views of the policies by 9.74 percentage points on average. However, while 92.0% of participants assigned to the AI and human label conditions believed the authorship labels, labels had no significant effects on participants’ attitude change toward the policies, judgments of message accuracy, nor intentions to share the message with others. These patterns were robust across a variety of participant characteristics, including prior knowledge of the policy, prior experience with AI, political party, education level, and age. Given current levels of trust in AI content, these results imply that, while authorship labels would likely enhance transparency, they are unlikely to substantially affect the persuasiveness of the labeled content, highlighting the need for alternative strategies to address challenges posed by AI-generated information.
Publisher DOI
PreFT: Prefill-only finetuning for efficient inference
arXiv (Cornell University) · 2026-05-14
preprintOpen access
Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
Publisher DOI
Beyond Tokens: Concept-Level Training Objectives for LLMs
ArXiv.org · 2026-01-16
articleOpen access
The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the \textit{token} level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., ``mom'' vs. ``mother''). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., ``mom,'' ``mommy,'' ``mother'' $\rightarrow$ \textit{MOTHER}). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests \textit{concept-level supervision} as an improved training signal that better aligns LLMs with human semantic abstractions.
Publisher OA PDF
Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition
ArXiv.org · 2026-01-11
articleOpen accessSenior author
In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.
Publisher OA PDF
PreFT: Prefill-only finetuning for efficient inference
ArXiv.org · 2026-05-14
articleOpen access
Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
Publisher OA PDF
Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs
ArXiv.org · 2026-01-07
articleOpen accessSenior author
Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.
Publisher OA PDF
Verbalizing LLMs' assumptions to explain and control sycophancy
arXiv (Cornell University) · 2026-04-03
articleOpen access
LLMs can be socially sycophantic, affirming users when they ask questions like "am I in the wrong?" rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs' assumptions on social sycophancy datasets is ``seeking validation.'' We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.
Publisher OA PDF

Recent grants

RI: Small: New tools for studying structural and inductive bias in NLP models
NSF · $500k · 2021–2024
RI: Medium: Deep Understanding: Integrating Neural and Symbolic Models of Meaning
NSF · $1.1M · 2015–2019

Frequent coauthors

Mirac Süzgün
University Hospital of Bern
30 shared
Wido van Peursen
25 shared
Sophia L. Pitcher
University of Cambridge
25 shared
Jacobus A. Naudé
25 shared
Elizabeth Robar
Vrije Universiteit Amsterdam
25 shared
William A. Ross
25 shared
William Jones
25 shared
Randall Buth
25 shared

Labs

Stanford Natural Language Processing GroupPI
Performing groundbreaking Natural Language Processing research since 1999.

Awards & honors

MacArthur Fellow

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Dan Jurafsky

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you