James DiCarlo

Director, Quest / Professor · Verified

Massachusetts Institute of Technology · Psychology

Active 1991–2025

h-index: 90

Citations: 44.0k

Papers: 329 (98 in the last 5 years)

Funding: $12.7M

About

James DiCarlo is the Peter de Florez Professor of Neuroscience in the Department of Brain and Cognitive Sciences at MIT. He is also the Director of the Siegel Family Quest for Intelligence and an Investigator at the McGovern Institute for Brain Research. He served as the department head of Brain and Cognitive Sciences from 2012 to 2021. Dr. DiCarlo earned his Ph.D. in biomedical engineering and M.D. from The Johns Hopkins University in 1998 and completed postdoctoral training in primate visual neurophysiology at Baylor College of Medicine. He joined the MIT faculty in 2002 and received tenure in 2009. His research group focuses on understanding the neuronal mechanisms and cortical computations underlying visual object recognition, utilizing techniques such as large-scale neurophysiology, brain imaging, optogenetics, and computational simulations. His work aims to inspire machine vision systems, develop neural prosthetics, and deepen understanding of how high-level visual representations are altered in conditions like agnosia, autism, and dyslexia.

Research topics

Computer Science
Artificial Intelligence
Machine Learning
Cognitive science
Psychology
Data science
Human–computer interaction
Neuroscience
Engineering
Political Science
Epistemology
Management
Programming language
Business
Mathematics
Law
Economics
Biology

Selected publications

Noninvasive precision modulation of high-level neural population activity via natural vision perturbations
arXiv.org · 2025-06-05
preprint · Open access · Senior author
Precise control of neural activity -- modulating target neurons deep in the brain while leaving nearby neurons unaffected -- is an outstanding challenge in neuroscience, generally approached using invasive techniques. This study investigates the possibility of precisely and noninvasively modulating neural activity in the high-level primate ventral visual stream via perturbations on one's natural visual feed. When tested on macaque inferior temporal (IT) neural populations, we found quantitative agreement between the model-predicted and biologically realized effect: strong modulation concentrated on targeted neural sites. We extended this to demonstrate accurate injection of experimenter-chosen neural population patterns via subtle perturbations applied on the background of typical natural visual feeds. These results highlight that current machine-executable models of the ventral stream can now design noninvasive, visually-delivered, possibly imperceptible neural interventions at the resolution of individual neurons.
Feature-based encoding of face identity by single neurons in the human amygdala and hippocampus
Nature Human Behaviour · 2025-06-06 · 6 citations
article · Open access
Assaying the effect of recent sensory history on object categorization via human psychophysics and computational modeling
Journal of Vision · 2025-07-15
article · Open access · Senior author
Sensory history is thought to strongly influence object perception. While the field has image-computable models of how images map to behavioral reports, we lack a comparable understanding of how dynamic sequences affect object perception. Here, we take initial steps to address this gap: we measure how sensory history affects human categorization performance (online psychophysics, N=500). Using 300 naturalistic videos and a binary object detection task, we compared pre-cued categorization reports based on video clips (200-1600ms) that end at a particular target frame with reports on the same frame shown statically for 200 ms. Surprisingly, single-frame-based reports explained substantial behavioral variance, even in the longest clips, challenging the notion that object recognition heavily depends on sensory history. Still, longer sensory history reports increasingly differed from frame-based recognition, yielding performance increases suggestive of evidence accumulation over time. Next, we focused on what mechanisms might explain these effects of sensory history. We hypothesized that frame-based encoding (e.g., via the ventral visual stream) combined with downstream temporal integration mechanisms may account for the emerging differences with longer sensory history. To test specific instantiations of this hypothesis, we augmented a pre-trained artificial neural network with diverse temporal decoders, including max-pooling, mean integration, leaky-integrators, and recurrent architectures (RNNs, GRUs, LSTMs), each optimized for categorization on a separate set of videos (40 repetitions). Interestingly, unlike simpler decoders, we found that non-linear temporal decoders increasingly captured the unique behavioral variance emerging with extended sensory history. 
Still, compared to human frame-based reports, frame-based ANN predictions (without temporal decoders) proved much less powerful at explaining human behavior overall, highlighting weaknesses of current image-based encoding models. Leveraging powerful, rapid, frame-based inferences as a foundation, our results demonstrate how sensory history could enrich object recognition through dynamic temporal integration of high-level visual representations.
Hierarchical Optimization predicts Plasticity in the Macaque Inferior Temporal Cortex following Object Training
bioRxiv (Cold Spring Harbor Laboratory) · 2024-12-28 · 3 citations
preprint · Open access
How does the primate brain coordinate plasticity when learning to discriminate new objects? We measured consequences of object learning on macaque inferior temporal (IT) cortex, a key waypoint supporting object recognition in the ventral visual stream. Neural activity in task-trained monkeys’ IT showed increased object selectivity, enhanced linear separability across objects, and more object-invariant representations compared to task-naïve monkeys. To model these differences, we developed a computational framework using anatomically-mapped artificial neural network (ANN) models of the ventral stream with various learning algorithms. Simulations revealed that gradient-based, performance-optimizing updates of ANN internal representations accurately approximated observed IT cortex changes. These models predict novel training-induced phenomena in IT cortex, including changes independent of object identity and IT’s alignment with behavior. This convergence between empirical measurements and model predictions suggests ventral stream plasticity follows task optimization principles well-approximated by gradient descent, enabling accurate predictions about visual plasticity and generalization to test images.
Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
arXiv (Cornell University) · 2024-12-12 · 1 citation
preprint · Open access · Senior author
Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring -- despite much prior evidence -- its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different -- if at all -- are representations learned from spatial latent estimation compared to categorization? To ask these questions, we leveraged synthetic image datasets generated by a 3D graphic engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those trained on hundreds of categories, and the spatial latent performance of models strongly correlates with their neural alignment. Spatial latent and category-trained models have very similar -- but not identical -- internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latents, can lead to similar models aligned neurally with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures of comparing models to brains to better understand the functional roles of the ventral stream.
How does the primate brain combine generative and discriminative computations in vision?
PubMed · 2024-01-11 · 2 citations
preprint · Open access
Vision is widely understood as an inference problem. However, two contrasting conceptions of the inference process have each been influential in research on biological vision as well as the engineering of machine vision. The first emphasizes bottom-up signal flow, describing vision as a largely feedforward, discriminative inference process that filters and transforms the visual information to remove irrelevant variation and represent behaviorally relevant information in a format suitable for downstream functions of cognition and behavioral control. In this conception, vision is driven by the sensory data, and perception is direct because the processing proceeds from the data to the latent variables of interest. The notion of "inference" in this conception is that of the engineering literature on neural networks, where feedforward convolutional neural networks processing images are said to perform inference. The alternative conception is that of vision as an inference process in Helmholtz's sense, where the sensory evidence is evaluated in the context of a generative model of the causal processes that give rise to it. In this conception, vision inverts a generative model through an interrogation of the sensory evidence in a process often thought to involve top-down predictions of sensory data to evaluate the likelihood of alternative hypotheses. The authors include scientists rooted in roughly equal numbers in each of the conceptions and motivated to overcome what might be a false dichotomy between them and engage the other perspective in the realm of theory and experiment. The primate brain employs an unknown algorithm that may combine the advantages of both conceptions. We explain and clarify the terminology, review the key empirical evidence, and propose an empirical research program that transcends the dichotomy and sets the stage for revealing the mysterious hybrid algorithm of primate vision.
Let's move forward: Image-computable models and a common model evaluation scheme are prerequisites for a scientific understanding of human vision – CORRIGENDUM
Behavioral and Brain Sciences · 2024-01-01
erratum · Open access · 1st author · Corresponding
L-WISE: Boosting Human Visual Category Learning Through Model-Based Image Selection and Enhancement
arXiv (Cornell University) · 2024-12-12
preprint · Open access
The currently leading artificial neural network models of the visual ventral stream - which are derived from a combination of performance optimization and robustification methods - have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. We show that image perturbations generated by these models can enhance the ability of humans to accurately report the ground truth class. Furthermore, we find that the same models can also be used out-of-the-box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) applying image perturbations that aid recognition for novice learners. We find that combining these model-based strategies leads to categorization accuracy gains of 33-72% relative to control subjects without these interventions, on unmodified, randomly selected held-out test images. Beyond the accuracy gain, the training time for the augmented learning group was also shortened by 20-23%, despite both groups completing the same number of training trials. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as two tasks in clinically relevant image domains - histology and dermoscopy - where visual learning is notoriously challenging. To the best of our knowledge, our work is the first application of artificial neural networks to increase visual learning performance in humans by enhancing category-specific image features.
The Quest for an Integrated Set of Neural Mechanisms Underlying Object Recognition in Primates
Annual Review of Vision Science · 2024-07-01 · 19 citations
review · Open access · Senior author
Inferences made about objects via vision, such as rapid and accurate categorization, are core to primate cognition despite the algorithmic challenge posed by varying viewpoints and scenes. Until recently, the brain mechanisms that support these capabilities were deeply mysterious. However, over the past decade, this scientific mystery has been illuminated by the discovery and development of brain-inspired, image-computable, artificial neural network (ANN) systems that rival primates in these behavioral feats. Apart from fundamentally changing the landscape of artificial intelligence, modified versions of these ANN systems are the current leading scientific hypotheses of an integrated set of mechanisms in the primate ventral visual stream that support core object recognition. What separates brain-mapped versions of these systems from prior conceptual models is that they are sensory computable, mechanistic, anatomically referenced, and testable (SMART). In this article, we review and provide perspective on the brain mechanisms addressed by the current leading SMART models. We review their empirical brain and behavioral alignment successes and failures, discuss the next frontiers for an even more accurate mechanistic understanding, and outline the likely applications.
Do Topographic Deep ANN Models of the Primate Ventral Stream Predict the Perceptual Effects of Direct IT Cortical Interventions?
bioRxiv (Cold Spring Harbor Laboratory) · 2024-01-09 · 5 citations
preprint · Open access · Senior author
Ever-advancing artificial neural network (ANN) models of the ventral visual stream capture core object recognition behavior and the neural mechanisms underlying it with increasing precision. These models take images as input, propagate through simulated neural representations that resemble biological neural representations at all stages of the primate ventral stream, and produce simulated behavioral choices that resemble primate behavioral choices. We here extend this modeling approach to make and test predictions of neural intervention experiments. Specifically, we enable a new prediction regime for topographic deep ANN (TDANN) models of primate visual processing through the development of perturbation modules that translate micro-stimulation, optogenetic suppression, and muscimol suppression into changes in model neural activity. This unlocks the ability to predict the behavioral effects from particular neural perturbations. We compare these predictions with the key results from the primate IT perturbation experimental literature via a suite of nine corresponding benchmarks. Without any fitting to the benchmarks, we find that TDANN models generated via co-training with both a spatial correlation loss and a standard categorization task qualitatively predict all nine behavioral results. In contrast, TDANN models generated via random topography or via topographic unit arrangement after classification training predict less than half of those results. However, the models’ quantitative predictions are consistently misaligned with experimental data, over-predicting the magnitude of some behavioral effects and under-predicting others. None of the TDANN models were built with separate model hemispheres and thus, unsurprisingly, all fail to predict hemispheric-dependent effects.
Taken together, these findings indicate that current topographic deep ANN models paired with perturbation modules are reasonable guides to predict the qualitative results of direct causal experiments in IT, but that improved TDANN models will be needed for precise quantitative predictions.

Recent grants

The role of inferior temporal cortex in core visual object recognition
NIH · $4.0M · 2004–2018
Imaging Core
NIH · $7.9M · 1997–2024
Time delimited neural silencing to dissect the basis of visual object perception
NIH · $403k · 2013–2015
Post-natal development of high-level visual representation in primates
NIH · $425k · 2017–2019

Frequent coauthors

Patrick Cavanagh
York University
121 shared
Danny Dilks
McGovern Institute for Brain Research
121 shared
Doug Crawford
York University
121 shared
Laurence R. Harris
York University
121 shared
Kohitij Kar
York University
86 shared
Daniel Yamins
84 shared
Ha Hong
57 shared

Labs

Brain and Cognitive Sciences · PI

Education

Ph.D., Biomedical Engineering, and M.D.
The Johns Hopkins University
1998
B.A., Psychology
Harvard University
1987

Awards & honors

Alfred Sloan Fellow
Pew Scholar in the Biomedical Sciences
McKnight Scholar in Neuroscience
