
Talia Konkle
Professor of Psychology · Harvard University · Human Development and Psychology
Active 2006–2026
About
Professor Talia Konkle leads the Konkle Lab at Harvard University, where her research broadly aims to understand how humans see and represent the world around them. Her work focuses on the organization of the human visual system and the biological constraints that guide this organization. She investigates how vision interfaces with action demands to enable interaction with the environment, as well as with conceptual representation to facilitate learning through visual experience. The lab's approach is grounded in the premise that the brain's connections are shaped by powerful biological constraints, making the spatial distribution of different types of information in the brain meaningful and informative about the system's representational goals. This perspective emphasizes the experience and needs of an active observer, deepening understanding of how behavioral capacities are embedded in the brain's local and long-range architecture, and how neural networks internalize the statistics of visual experience and the consequences of actions to realize functional visual representations.
Research topics
- Artificial Intelligence
- Computer Science
- Machine Learning
- Psychology
- Cognitive psychology
- Neuroscience
- Cognitive science
- Biology
- Ecology
- Management science
- Epistemology
- Communication
- Mathematics
- Cartography
- Geography
- Data science
Selected publications
Principles of coarse-scale functional organization in occipitotemporal cortex
bioRxiv (Cold Spring Harbor Laboratory) · 2026-02-18
Article · Open access
Abstract: Occipitotemporal cortex is known to process visually-perceived objects, but identifying general principles underlying its coarse-scale functional organization has remained challenging for two reasons. First, much previous work has left open whether proposed organizational dimensions, such as animacy or real-world size, are useful for characterizing high-level vision and to what degree they would generalize to more diverse and naturalistic stimuli. Second, many natural object properties are highly intercorrelated, making it challenging to determine which dimensions, if any, dominate the coarse-scale representational organization of the visual system. To address these challenges, we carried out detailed analyses of the cortical topography underlying 15 object properties, using a densely-sampled fMRI dataset with responses to thousands of natural images. We focused our investigation around the properties animacy and real-world size, given their established prominence as candidate organizational dimensions. While our results confirm many characteristics of the purported animacy-size organization, they revealed distinctions that challenge dominant notions of this organization, highlighting the importance of generalizing to large-scale naturalistic data. Moving beyond animacy and size, our results demonstrate that many of the 15 candidate properties can alternatively serve as organizational dimensions underlying the coarse-scale functional topography of occipitotemporal cortex, but that none of them clearly stand out in the representational organization. This suggests that trying to reduce the coarse-scale functional organization of occipitotemporal cortex to individual conceptual dimensions may be the wrong goal. Rather, our results suggest that occipitotemporal cortex may provide a foundation for the flexible read-out of a wide range of object properties.
Significance: There has been a long tradition in cognitive neuroscience of reducing brain response profiles to a small number of unifying principles. The extent to which this approach also applies to the structure of representations remains largely untested. Using the widely discussed putative organizational principles of animacy and size as test cases, we examined their generality using densely-sampled fMRI data and behavioral ratings along 15 object properties. Our results demonstrate a more complex organization of animacy and size representations than assumed by prevailing accounts, and hierarchical partitioning revealed that none of the 15 properties dominated representations. These findings challenge the pivotal role often assigned to individual properties, indicating that the dominant approach oversimplifies the role of individual principles in visual representations.
Geometric Dynamics Across Recurrent Vision Models
2026-01-01
Article · Senior author
Bi-Orthogonal Factor Decomposition for Vision Transformers
ArXiv.org · 2026-01-08
Article · Open access
Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena. (i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2's superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors - positional or semantic - mediate their communication, yielding practical insights into vision transformer mechanisms.
FOVI: A biologically-inspired foveated interface for deep vision models
Open MIND · 2026-02-03
Preprint · Senior author
Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex, that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code and pre-trained models are available at https://github.com/nblauch/fovi and https://huggingface.co/fovi-pytorch.
A Unified Account of Lightness Illusions via Edge-Based Reconstruction of Natural Images
bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-10
Article · Open access
Abstract: The human visual system transforms patterns of light into rich perceptual experiences, where what we see is a construction that goes beyond simple measurement. Lightness illusions, where identical parts of an image can appear dramatically different depending on context, provide a window into these processes. Here we leverage a deep learning framework to investigate the constructive processes that give rise to lightness illusions, introducing the core computational goal of edge-based image reconstruction. Specifically, we demonstrate that autoencoder models trained to reconstruct natural images based only on an edge-based image representation naturally recapitulate a wide range of lightness illusions, which were previously assumed to require distinct mechanisms, inference over lighting sources, and explicit three-dimensional scene representation. These results offer a simpler, unified account of diverse lightness phenomena as emerging naturally from surface filling-in mechanisms, and broadly provide a framework for understanding the computational principles that underlie our perception of the visual world.
Significance: The human visual system shows remarkably stable perception of objects under different viewing conditions, but it uses strategies that can be thwarted by clever visual illusions; for instance, the exact same object can appear as either white or black in different contexts. The most complex of these lightness illusions have long been taken as evidence that perception involves explicit inference about 3D scene geometry and lighting conditions. However, here we show that these illusions also emerge in deep learning models, trained simply to reconstruct natural images from sparse edge signals. Thus, our perception of the lightness of surfaces in our world may instead arise from a much more primitive computation: reconstructing surface appearance from edge responses.
FOVI: A biologically-inspired foveated interface for deep vision models
ArXiv.org · 2026-02-03
Article · Open access · Senior author
2026-05-08
Article · Open access
bioRxiv (Cold Spring Harbor Laboratory) · 2025-03-10 · 6 citations
Preprint · Open access
Abstract: Recent progress in multimodal AI and ‘language-aligned’ visual representation learning has rekindled debates about the role of language in shaping the human visual system. In particular, the emergent ability of ‘language-aligned’ vision models (e.g. CLIP) – and even pure language models (e.g. BERT) – to predict image-evoked brain activity has led some to suggest that human visual cortex itself may be ‘language-aligned’ in comparable ways. But what would we make of this claim if the same procedures could model visual activity in a species without language? Here, we conducted controlled comparisons of pure-vision, pure-language, and multimodal vision-language models in their prediction of human (N=4) and rhesus macaque (N=6, 5:IT, 1:V1) ventral visual activity to the same set of 1000 captioned natural images (the ‘NSD1000’). The results revealed markedly similar patterns in model predictivity of early and late ventral visual cortex across both species. This suggests that language model predictivity of the human visual system is not necessarily due to the evolution or learning of language per se, but rather to the statistical structure of the visual world that is reflected in natural language.
Dissecting sparse circuits to high-level visual categories in deep neural networks
Journal of Vision · 2025-07-15
Article · Open access · Senior author
While humans easily recognize innumerable object categories, the underlying computational paths from retina to category-level representations are still being unraveled. Convolutional neural networks (CNNs) like AlexNet have remarkable competence in visual categorization, and thus offer a unique case study for understanding the hierarchical routing of visual information. Extending work from Hamblin et al., 2023, here we develop a method to extract the relevant connections involved in the computation of each output category, and assess the effectiveness of this sparser sub-network. The key idea is that not all connections are necessarily involved in the computation of any one category; thus, for each of the 1000 category-level output units in AlexNet, our algorithm assigns scores to connections based on their contribution to the category unit's outputs and prunes the lowest-scored connections to a specified sparsity. Our goal is to identify the sparsest circuit through the network that still maintains the original function. To evaluate how well the extracted circuits reflect the output unit's original functionality, we introduce a new metric, circuit substitution accuracy (CSA). We find that circuits need only 5.0% (median) of connections to achieve 85% of the unpruned CSA. Surprisingly, we observed that CSA initially increases with pruning and often actually exceeds the unpruned baseline at its peak (median peak CSA = 188.0% median unpruned CSA) with just 13.3% (median) of connections. We hypothesize that the full network must employ inhibition to negotiate between competing, interfering pathways. Finally, the “anatomical overlap” amongst these category circuits ranged from <1% to >99% shared circuitry, revealing a range of implicit modularization in the network's categorical processing routes.
Broadly, this work presents a novel method for gaining insight into the functional neuroanatomy of neural networks, and offers a foundation for understanding the hierarchical computations involved in the emergence of category-level information in visual systems.
Recent grants
The Role of Animacy and Size in the Neural Organization of Object Knowledge
NIH · $114k · 2013–2016
CAREER: The Tuning and Topography of the Ventral Visual Stream
NSF · $704k · 2020–2026
Frequent coauthors
- George A. Alvarez · 65 shared
- Bria Long · 30 shared
- Aude Oliva · Massachusetts Institute of Technology · 29 shared
- Emilie Josephs · Massachusetts Institute of Technology · 21 shared
- Jacob S. Prince · Harvard University · 21 shared
- Arturo Deza · Massachusetts Institute of Technology · 19 shared
- Timothy F. Brady · 18 shared
- Alfonso Caramazza · Harvard University · 16 shared
Education
- 2005 · B.A., Psychology · University of California, Berkeley
- 2007 · M.A., Psychology · University of California, Berkeley
- 2011 · Ph.D., Psychology · University of California, Berkeley