Adji Bousso Dieng

· Assistant Professor of Computer ScienceVerified

Princeton University · Philosophy

Active 2016–2026

h-index15

Citations1.6k

Papers4825 last 5y

Funding—

Faculty page

See your match with Adji Bousso Dieng — sign in to PhdFit.Sign in

About

Adji Bousso Dieng is an Assistant Professor at Princeton University in the Department of Computer Science. Her research areas include Machine Learning, Computational Biology, and Natural Language Processing. She focuses on developing advanced algorithms and models to address complex problems in these fields, contributing to the understanding and application of machine learning techniques across various domains.

Research signals

Five dimensions sourced from public faculty / publication signals. Sign in to compare against your own profile and see your match score.

Research topics

Optoelectronics
Physics
Chemistry
Materials science
Molecular physics
Optics

Selected publications

Vendi Novelty Scores for Out-of-Distribution Detection
ArXiv.org · 2026-02-10
articleOpen accessSenior author
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
Publisher OA PDF
Vendi Novelty Scores for Out-of-Distribution Detection
Open MIND · 2026-02-10
preprintSenior author
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
DOI
Building informative materials datasets beyond targeted objectives
arXiv (Cornell University) · 2026-05-06
preprintOpen access
Materials science data collection can be expensive, making the reuse and long-term utility of datasets critical important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns potentially generate datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10% . For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.
Publisher DOI
Vendi Information Gain for Active Learning and its Application to Ecology
2025-09-15
articleOpen accessSenior author
While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning—a machine learning paradigm that selects the most informative data to label and train a predictive model—offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. Applied to the Snapshot Serengeti dataset, VIG achieves impressive predictive accuracy close to full supervision using less than 10% of the labels. It consistently outperforms standard baselines across metrics and batch sizes, collecting more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.
Publisher OA PDF DOI
Are neural scaling laws leading quantum chemistry astray?
ArXiv.org · 2025-09-30
preprintOpen accessSenior author
Neural scaling laws are driving the machine learning community toward training ever-larger foundation models across domains, assuring high accuracy and transferable representations for extrapolative tasks. We test this promise in quantum chemistry by scaling model capacity and training data from quantum chemical calculations. As a generalization task, we evaluate the resulting models' predictions of the bond dissociation energy of neutral H$_2$, the simplest possible molecule. We find that, regardless of dataset size or model capacity, models trained only on stable structures fail dramatically to even qualitatively reproduce the H$_2$ energy curve. Only when compressed and stretched geometries are explicitly included in training do the predictions roughly resemble the correct shape. Nonetheless, the largest foundation models trained on the largest and most diverse datasets containing dissociating diatomics exhibit serious failures on simple diatomic molecules. Most strikingly, they cannot reproduce the trivial repulsive energy curve of two bare protons, revealing their failure to learn the basic Coulomb's law involved in electronic structure theory. These results suggest that scaling alone is insufficient for building reliable quantum chemical models.
Publisher OA PDF DOI
Building Machine Learning Challenges for Anomaly Detection in Science
ArXiv.org · 2025-03-03
preprintOpen access
Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.
Publisher OA PDF DOI
Alternators With Noise Models
ArXiv.org · 2025-05-18
preprintOpen accessSenior author
Alternators have recently been introduced as a framework for modeling time-dependent data. They often outperform other popular frameworks, such as state-space models and diffusion models, on challenging time-series tasks. This paper introduces a new Alternator model, called Alternator++, which enhances the flexibility of traditional Alternators by explicitly modeling the noise terms used to sample the latent and observed trajectories, drawing on the idea of noise models from the diffusion modeling literature. Alternator++ optimizes the sum of the Alternator loss and a noise-matching loss. The latter forces the noise trajectories generated by the two noise models to approximate the noise trajectories that produce the observed and latent trajectories. We demonstrate the effectiveness of Alternator++ in tasks such as density estimation, time series imputation, and forecasting, showing that it outperforms several strong baselines, including Mambas, ScoreGrad, and Dyffusion.
Publisher OA PDF DOI
Constraint-Aware Diffusion Models for Trajectory Optimization
Lecture notes in computer science · 2025-08-25 · 2 citations
book-chapter
Publisher DOI
LLM-Prop: predicting the properties of crystalline materials using large language models
npj Computational Materials · 2025-06-18 · 19 citations
articleOpen accessSenior author
Abstract The prediction of crystal properties plays a crucial role in materials science and applications. Current methods for predicting crystal properties focus on modeling crystal structures using graph neural networks (GNNs). However, accurately modeling the complex interactions between atoms and molecules within a crystal remains a challenge. Surprisingly, predicting crystal properties from crystal text descriptions is understudied, despite the rich information and expressiveness that text data offer. In this paper, we develop and make public a benchmark dataset (TextEdge) that contains crystal text descriptions with their properties. We then propose LLM-Prop, a method that leverages the general-purpose learning capabilities of large language models (LLMs) to predict properties of crystals from their text descriptions. LLM-Prop outperforms the current state-of-the-art GNN-based methods by approximately 8% on predicting band gap, 3% on classifying whether the band gap is direct or indirect, and 65% on predicting unit cell volume, and yields comparable performance on predicting formation energy per atom, energy per atom, and energy above hull. LLM-Prop also outperforms the fine-tuned MatBERT, a domain-specific pre-trained BERT model, despite having 3 times fewer parameters. We further fine-tune the LLM-Prop model directly on CIF files and condensed structure information generated by Robocrystallographer and found that LLM-Prop fine-tuned on text descriptions provides a better performance on average. Our empirical results highlight the importance of having a natural language input to LLMs to accurately predict crystal properties and the current inability of GNNs to capture information pertaining to space group symmetry and Wyckoff sites for accurate crystal property prediction.
Publisher OA PDF DOI
The Vendiscope: An Algorithmic Microscope For Data Collections
arXiv (Cornell University) · 2025-02-15
preprintOpen accessSenior author
The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the $250$ million protein sequences in the protein universe, discovering that over $200$ million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than $85\%$ of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from $13$ different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.
Publisher OA PDF DOI

Frequent coauthors

David M. Blei
22 shared
Francisco J. R. Ruiz
10 shared
John Paisley
7 shared
Rajesh Ranganath
Courant Institute of Mathematical Sciences
7 shared
Dustin Tran
5 shared
Michalis K. Titsias
4 shared
Amey P. Pasarkar
Princeton University
3 shared
Samarth Sinha
University of Toronto
3 shared

Education

PhD, Thesis title: "Deep Probabilistic Graphical Modeling", winner of the Savage Award 2020
Columbia University
2020
Master's, Statistics
Cornell University
2013
Diplome d'Ingenieur, Joined Telecom ParisTech after the intensive "Classes Preparatoires" exams covering Mathematics, Physics, and Chemistry from Lycee Henri IV in Paris, France
Télécom Paris
2013

Awards & honors

AI2050 Early Career Fellow by Schmidt Futures
Annie T. Randall Innovator of 2022 by the American Statistic…
Google PhD Fellowship in Machine Learning
Rising Star in Machine Learning nomination by the University…
Savage Award from the International Society for Bayesian Ana…

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Adji Bousso Dieng

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you