Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to start · No credit card · Cancel anytime

Sanjoy Dasgupta

Professor of Computer Science and Engineering · Verified

University of California, San Diego · Computer Science and Engineering

Active 1960–2026

h-index: 45
Citations: 9.7k
Papers: 195 (31 in the last 5 years)
Funding: $1.4M

Research topics

  • Artificial Intelligence
  • Computer Science
  • Programming language
  • Cognitive psychology
  • Theoretical computer science
  • Neuroscience
  • Data science
  • Computer engineering
  • Psychology
  • Telecommunications

Selected publications

  • Performance bounds for nearest neighbor search with k-d trees

    ArXiv.org · 2026-05-11

    article · Open access · Senior author

    The $k$-d tree is one of the oldest and most widely used data structures for nearest neighbor search. It partitions Euclidean space into axis-aligned rectangular cells. There are two standard ways to find the nearest neighbor to a query in a $k$-d tree. Defeatist search returns the closest data point in the query's cell, while comprehensive search also searches other cells as needed to guarantee it finds the nearest neighbor. Both strategies are commonly believed to perform poorly in high dimensions, but there have been few theoretical results explaining this. We prove non-asymptotic bounds on the runtime of comprehensive search and the accuracy of defeatist search. Under mild distributional assumptions, when the dimension $d$ is at least polylogarithmic in the number of data points, defeatist search is no more likely to return the nearest neighbor than random guessing, and comprehensive search visits every cell with high probability. We also show that on uniform data, with high probability, comprehensive search visits at most $2^{\mathcal{O}(d)}$ cells when each cell contains at least logarithmically many data points, and defeatist search returns the nearest neighbor when each cell additionally contains at least $2^{\mathcal{O}(d \log d)}$ data points. Finally, for arbitrary absolutely continuous distributions, we upper bound the expected distance between the query and the point returned by defeatist search.

  • Learnability with Partial Labels and Adaptive Nearest Neighbors

    arXiv (Cornell University) · 2026-03-16

    preprint · Open access · Senior author

    Prior work on partial labels learning (PLL) has shown that learning is possible even when each instance is associated with a bag of labels, rather than a single accurate but costly label. However, the necessary conditions for learning with partial labels remain unclear, and existing PLL methods are effective only in specific scenarios. In this work, we mathematically characterize the settings in which PLL is feasible. In addition, we present PL A-$k$NN, an adaptive nearest-neighbors algorithm for PLL that is effective in general scenarios and enjoys strong performance guarantees. Experimental results corroborate that PL A-$k$NN can outperform state-of-the-art methods in general PLL scenarios.

  • Reliable Programmatic Weak Supervision With Confidence Intervals for Label Probabilities

    IEEE Transactions on Pattern Analysis and Machine Intelligence · 2025-08-11

    article · Senior author

    The accurate labeling of datasets is often both costly and time-consuming. Given an unlabeled dataset, programmatic weak supervision obtains probabilistic predictions for the labels by leveraging multiple weak labeling functions (LFs) that provide rough guesses for labels. Weak LFs commonly provide guesses with assorted types and unknown interdependences that can result in unreliable predictions. Furthermore, existing techniques for programmatic weak supervision cannot provide assessments for the reliability of the probabilistic predictions for labels. This paper presents a methodology for programmatic weak supervision that can provide confidence intervals for label probabilities and obtain more reliable predictions. In particular, the methods proposed use uncertainty sets of distributions that encapsulate the information provided by LFs with unrestricted behavior and typology. Experiments on multiple benchmark datasets show the improvement of the presented methods over the state-of-the-art and the practicality of the confidence intervals presented.

  • Sparse, random sampling is sufficient for central tolerance

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-12-12

    article · Open access

    Negative selection in the thymus limits autoimmunity by eliminating T cells that react strongly to self. Individual T cells, however, are only exposed to a small fraction of all self peptides during their “training” in the thymus, and it is puzzling how tolerance can be generalized to the remaining “test” self peptides across peripheral tissues in the body. Using a machine learning perspective, we show that such generalization is possible because the immune system satisfies two conditions: first that peptide abundance levels in the human thymus and periphery are highly correlated (i.e., training distribution ≈ test distribution), and second that cross-reactivity allows T cells to effectively learn binding information of similar peptides without explicitly interacting with all of them. Together, we show that sparse, random sampling of only 10% of self peptides in the thymus is sufficient to avoid reactivity to 90% of peripheral self, and we support this result with diverse experimental data. We then validate two predictions by our model; the first is that only 200–250 antigen presenting cells need to be seen by a T cell to ensure its robust selection, and the second relates how peptides missing from the thymus can drive auto-immunity of peripheral tissues. Overall, we provide a plausible answer to a long-standing question underlying adaptive immunity, and we highlight how generalization, a fundamental challenge faced by nearly every learning algorithm, is uniquely tackled by the immune system.

  • New bounds on the cohesion of complete-link and other linkage methods for agglomeration clustering

    arXiv (Cornell University) · 2024-05-02

    preprint · Open access · 1st author · Corresponding

    Linkage methods are among the most popular algorithms for hierarchical clustering. Despite their relevance, the current knowledge regarding the quality of the clustering produced by these methods is limited. Here, we improve the currently available bounds on the maximum diameter of the clustering obtained by complete-link for metric spaces. One of our new bounds, in contrast to the existing ones, allows us to separate complete-link from single-link in terms of approximation for the diameter, which corroborates the common perception that the former is more suitable than the latter when the goal is producing compact clusters. We also show that our techniques can be employed to derive upper bounds on the cohesion of a class of linkage methods that includes the quite popular average-link.

  • Convergence Behavior of an Adversarial Weak Supervision Method

    arXiv (Cornell University) · 2024-05-25

    preprint · Open access · Senior author

    Labeling data via rules-of-thumb and minimal label supervision is central to Weak Supervision, a paradigm subsuming subareas of machine learning such as crowdsourced learning and semi-supervised ensemble learning. By using this labeled data to train modern machine learning methods, the cost of acquiring large amounts of hand labeled data can be ameliorated. Approaches to combining the rules-of-thumb fall into two camps, reflecting different ideologies of statistical estimation. The most common approach, exemplified by the Dawid-Skene model, is based on probabilistic modeling. The other, developed in the work of Balsubramani-Freund and others, is adversarial and game-theoretic. We provide a variety of statistical results for the adversarial approach under log-loss: we characterize the form of the solution, relate it to logistic regression, demonstrate consistency, and give rates of convergence. On the other hand, we find that probabilistic approaches for the same model class can fail to be consistent. Experimental results are provided to corroborate the theoretical results.

  • Online Consistency of the Nearest Neighbor Rule

    arXiv (Cornell University) · 2024-10-31

    preprint · Open access · 1st author · Corresponding

    In the realizable online setting, a learner is tasked with making predictions for a stream of instances, where the correct answer is revealed after each prediction. A learning rule is online consistent if its mistake rate eventually vanishes. The nearest neighbor rule (Fix and Hodges, 1951) is a fundamental prediction strategy, but it is only known to be consistent under strong statistical or geometric assumptions: the instances come i.i.d. or the label classes are well-separated. We prove online consistency for all measurable functions in doubling metric spaces under the mild assumption that the instances are generated by a process that is uniformly absolutely continuous with respect to a finite, upper doubling measure.

  • Learning Smooth Distance Functions via Queries

    arXiv (Cornell University) · 2024-12-02

    preprint · Open access · Senior author

    In this work, we investigate the problem of learning distance functions within the query-based learning framework, where a learner is able to pose triplet queries of the form: “Is $x_i$ closer to $x_j$ or $x_k$?” We establish formal guarantees on the query complexity required to learn smooth, but otherwise general, distance functions under two notions of approximation: $ω$-additive approximation and $(1 + ω)$-multiplicative approximation. For the additive approximation, we propose a global method whose query complexity is quadratic in the size of a finite cover of the sample space. For the (stronger) multiplicative approximation, we introduce a method that combines global and local approaches, utilizing multiple Mahalanobis distance functions to capture local geometry. This method has a query complexity that scales quadratically with both the size of the cover and the ambient space dimension of the sample space.
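
The k-d tree abstract in the list above contrasts defeatist search (look only in the query's own cell) with comprehensive search (also visit other cells when they could hold a closer point). The following is a minimal Python sketch of that distinction, not the paper's construction: the median split rule, the `leaf_size` parameter, and all function names (`build`, `defeatist`, `comprehensive`) are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    # Internal node: axis and threshold are used, points is None.
    # Leaf (a cell): points holds the data points that fell into it.
    axis: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    points: Optional[List[Point]] = None


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points: Sequence[Point], leaf_size: int = 4, depth: int = 0) -> Node:
    """Hypothetical builder: cycle through axes, split at the median.

    Assumes `points` is non-empty and leaf_size >= 1.
    """
    if len(points) <= leaf_size:
        return Node(points=list(points))
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(
        axis=axis,
        threshold=pts[mid][axis],
        left=build(pts[:mid], leaf_size, depth + 1),
        right=build(pts[mid:], leaf_size, depth + 1),
    )


def defeatist(node: Node, q: Point) -> Point:
    """Descend to the query's cell and return the closest point in it."""
    while node.points is None:
        node = node.left if q[node.axis] < node.threshold else node.right
    return min(node.points, key=lambda p: sq_dist(p, q))


def comprehensive(node: Node, q: Point, best: Optional[Point] = None) -> Point:
    """Exact search: visit the far side of a split only if it could hold
    a point strictly closer than the best one found so far."""
    if node.points is not None:
        cand = min(node.points, key=lambda p: sq_dist(p, q))
        if best is None or sq_dist(cand, q) < sq_dist(best, q):
            return cand
        return best
    go_left = q[node.axis] < node.threshold
    near, far = (node.left, node.right) if go_left else (node.right, node.left)
    best = comprehensive(near, q, best)
    if (q[node.axis] - node.threshold) ** 2 < sq_dist(best, q):
        best = comprehensive(far, q, best)
    return best


if __name__ == "__main__":
    random.seed(0)
    data = [tuple(random.random() for _ in range(3)) for _ in range(200)]
    tree = build(data)
    query = (0.5, 0.5, 0.5)
    print("defeatist:    ", defeatist(tree, query))
    print("comprehensive:", comprehensive(tree, query))
```

Comprehensive search always agrees with a brute-force scan, while defeatist search may return a farther point; the abstract's results concern how often that happens and how many cells the exact search must visit as the dimension grows.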

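The "Online Consistency of the Nearest Neighbor Rule" abstract above describes a learner that predicts a label for each instance in a stream and then sees the correct answer. As a rough illustration of that protocol only, here is a small Python sketch on one-dimensional instances. The function name `online_nn_mistakes`, the `default` prediction for the first instance, and the threshold target in the usage example are all assumptions, not taken from the paper.

```python
import random
from typing import Callable, Iterable, List, Tuple


def online_nn_mistakes(
    stream: Iterable[float],
    target: Callable[[float], int],
    default: int = 0,
) -> int:
    """Run the nearest neighbor rule online and count its mistakes.

    Realizable setting: the revealed label is always target(x).
    """
    memory: List[Tuple[float, int]] = []  # (instance, revealed label) pairs
    mistakes = 0
    for x in stream:
        if memory:
            # Predict the label of the closest previously seen instance.
            _, pred = min(memory, key=lambda m: abs(m[0] - x))
        else:
            pred = default  # nothing seen yet, so fall back to a guess
        y = target(x)  # the correct answer is revealed after predicting
        if pred != y:
            mistakes += 1
        memory.append((x, y))
    return mistakes


if __name__ == "__main__":
    random.seed(0)
    n = 1000
    xs = [random.random() for _ in range(n)]
    m = online_nn_mistakes(xs, lambda x: int(x > 0.5))
    print(f"mistake rate after {n} rounds: {m / n:.3f}")
```

With i.i.d. uniform instances and a single threshold, the mistake rate shrinks as the stream grows, which is the easy case the abstract mentions; the paper's contribution is proving consistency under much weaker assumptions than this toy setup uses.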
Frequent coauthors

  • Kamalika Chaudhuri

    12 shared
  • Samory Kpotufe

    11 shared
  • Yoav Freund

    10 shared
  • Christopher Tosh

    9 shared
  • Saket Navlakha

    Cold Spring Harbor Laboratory

    9 shared
  • Daniel Hsu

    8 shared
  • Tajana Rosing

    8 shared
  • Nakul Verma

    7 shared