
Harvey Lederman
Professor, Philosophy · University of Texas at Austin
Active 2006–2026
About
Harvey Lederman is a professor of philosophy at the University of Texas at Austin. Prior to his current position, he served as an assistant professor and then a professor of philosophy at Princeton University until 2023. Beginning in July 2026, he will join New York University. His research encompasses a broad range of interests in contemporary philosophy as well as the history of philosophy, with a particular focus on Chinese neo-Confucianism. Recently, Lederman has concentrated on conceptual and empirical questions related to AI mentality and the implications of artificial intelligence for the meaning of human life. He is co-principal investigator of the AI and Human Objectives Initiative and is affiliated with the Population and Wellbeing Initiative and the School of Civic Leadership. His scholarly contributions have been recognized with awards such as the 2023 Sanders Prize in Epistemology for his paper "Of marbles and matchsticks," which addresses incomplete preferences in decision theory, and the Dao Best Essay Award in 2022 for his work on Wang Yangming titled "What is the 'Unity' in the 'Unity of Knowledge and Action'?"
Research topics
- Epistemology
- Philosophy
- Computer Science
- Linguistics
- Programming language
- Artificial Intelligence
- Natural Language Processing
- Theology
Selected publications
Emergent Introspection in AI is Content-Agnostic
Open MIND · 2026-03-05
preprint · 1st author · Corresponding
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study the mechanism of this introspection. We first extensively replicate Lindsey (2025)'s thought injection detection paradigm in large open-source models. We show that introspection in these models is content-agnostic: models can detect that an anomaly occurred even when they cannot reliably identify its content. The models confabulate injected concepts that are high-frequency and concrete (e.g., "apple"). They also require fewer tokens to detect an injection than to guess the correct concept (with wrong guesses coming earlier). We argue that a content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
Dissociating Direct Access from Inference in AI Introspection
arXiv (Cornell University) · 2026-03-05
article · Open access · 1st author · Corresponding
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating Lindsey et al. (2025)'s thought injection detection paradigm in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for them, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
Privileged Self-Access Matters for Introspection in AI
ArXiv.org · 2025-08-20
preprint · Open access
Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection is to be defined. Beginning from a recently proposed "lightweight" definition, we argue instead for a thicker one. According to our proposal, introspection in AI is any process which yields information about internal states through a process more reliable than one with equal or lower computational cost available to a third party. Using experiments where LLMs reason about their internal temperature parameters, we show they can appear to have lightweight introspection while failing to meaningfully introspect per our proposed definition.
A Dominance Argument Against Incompleteness
The Philosophical Review · 2025-10-01 · 1 citation
article
This article presents a new argument against many forms of moral and prudential value incompleteness. The argument relies on two central principles: (i) a weak “negative dominance” principle, to the effect that lottery 1 is better than lottery 2 only if some possible outcome of lottery 1 is better than some possible outcome of lottery 2, and (ii) a weak form of ex ante Pareto, to the effect that, if lottery 1 gives an unambiguously better (stochastically dominant) prospect to some individuals than lottery 2, and equally good prospects to everyone else, then lottery 1 is better than lottery 2. Given modest auxiliary assumptions, these two principles rule out incompleteness in the prudential ranking of individual lives, and many forms of incompleteness in the moral rankings of outcomes and lotteries.
On the Value of Irreplaceable Objects (forthcoming in The Journal of Philosophy)
Durham Research Online (Durham University) · 2025-04-23
article · Open access
Bradford (2023) calls attention to the fact that the strength of our reasons to preserve distinctively valuable objects increases as the number of such objects decreases. Bradford develops an account of this phenomenon in terms of 'irreplaceable value', and in particular in terms of a notion of the degree of such value, which is distinct from its amount. We present an alternative explanation of this pattern in our reasons, which appeals to the value of diversity: the world is better, other things equal, insofar as it contains more kinds of value. We develop this view in two connected ways: one appeals to evidential probability under conditions of uncertainty, and the other appeals to the value of diversity. We conclude by discussing some explanatory advantages of our view over Bradford's.
Transactions of the Association for Computational Linguistics · 2024-01-01 · 1 citation
article · Open access · 1st author · Corresponding
Are LLMs cultural technologies like photocopiers or printing presses, which transmit information but cannot create new content? A challenge for this idea, which we call bibliotechnism, is that LLMs generate novel text. We begin with a defense of bibliotechnism, showing how even novel text may inherit its meaning from original human-generated text. We then argue that bibliotechnism faces an independent challenge from examples in which LLMs generate novel reference, using new names to refer to new entities. Such examples could be explained if LLMs were not cultural technologies but had beliefs, desires, and intentions. According to interpretationism in the philosophy of mind, a system has such attitudes if and only if its behavior is well explained by the hypothesis that it does. Interpretationists may hold that LLMs have attitudes, and thus have a simple solution to the novel reference problem. We emphasize, however, that interpretationism is compatible with very simple creatures having attitudes and differs sharply from views that presuppose these attitudes require consciousness, sentience, or intelligence (topics about which we make no claims).
Maximal Social Welfare Relations on Infinite Populations Satisfying Permutation Invariance
arXiv (Cornell University) · 2024-08-11
preprint · Open access · Senior author
We study social welfare relations (SWRs) on an infinite population. Our main result is a new characterization of a utilitarian SWR as the largest SWR (in terms of subset when the weak relation is viewed as a set of pairs) which satisfies Strong Pareto, Permutation Invariance (elsewhere called "Relative Anonymity" and "Isomorphism Invariance"), and a further "Quasi-Independence" axiom.
A Dominance Argument Against Incompleteness
arXiv (Cornell University) · 2024-03-26
preprint · Open access
This article presents a new argument against many forms of moral and prudential value incompleteness. The argument relies on two central principles: (i) a weak "negative dominance" principle, to the effect that Lottery 1 is better than Lottery 2 only if some possible outcome of Lottery 1 is better than some possible outcome of Lottery 2, and (ii) a weak form of ex ante Pareto, to the effect that, if Lottery 1 gives an unambiguously better (stochastically dominant) prospect to some individuals than Lottery 2, and equally good prospects to everyone else, then Lottery 1 is better than Lottery 2. Given modest auxiliary assumptions, these two principles rule out incompleteness in the prudential ranking of individual lives, and many forms of incompleteness in the moral rankings of outcomes and lotteries.
arXiv (Cornell University) · 2024-01-10 · 9 citations
preprint · Open access · 1st author · Corresponding
Are LLMs cultural technologies like photocopiers or printing presses, which transmit information but cannot create new content? A challenge for this idea, which we call bibliotechnism, is that LLMs generate novel text. We begin with a defense of bibliotechnism, showing how even novel text may inherit its meaning from original human-generated text. We then argue that bibliotechnism faces an independent challenge from examples in which LLMs generate novel reference, using new names to refer to new entities. Such examples could be explained if LLMs were not cultural technologies but had beliefs, desires, and intentions. According to interpretationism in the philosophy of mind, a system has such attitudes if and only if its behavior is well explained by the hypothesis that it does. Interpretationists may hold that LLMs have attitudes, and thus have a simple solution to the novel reference problem. We emphasize, however, that interpretationism is compatible with very simple creatures having attitudes and differs sharply from views that presuppose these attitudes require consciousness, sentience, or intelligence (topics about which we make no claims).
Philosophical Studies · 2024-08-20 · 11 citations
article · Open access · Senior author · Corresponding
Frequent coauthors
- Hartry Field, New York University (9 shared)
- Tore Fjetland Øgaard, University of Bergen (9 shared)
- Jeremy Goodman (7 shared)
- P. Fritz, University of Oslo (6 shared)
- H. R. G. Greaves, University of Oxford (2 shared)
- Kyle Mahowald (1 shared)
- Christian Tarsney, The University of Texas at Austin (1 shared)
- Dean Spears, Indian Statistical Institute (1 shared)