Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Jason Lee

Associated Faculty · Verified

Princeton University · Computer Science

Active 2004–2025

h-index: 53
Citations: 11.9k
Papers: 296 (132 in the last 5 years)
Funding: $400k

About

Jason D. Lee is an associate professor of Electrical Engineering and Computer Science (EECS) and Statistics at UC Berkeley. His research focuses on machine learning theory, data and information science, and related areas. Prior to his current position, he was a research scientist at Google DeepMind, a member of the Institute for Advanced Study (IAS), and an associate professor at Princeton University. He also completed a postdoctoral fellowship at UC Berkeley working with Michael I. Jordan. Dr. Lee earned his Ph.D. in Computational and Mathematical Engineering from Stanford University in 2015, where he was advised by Trevor Hastie and Jonathan Taylor. He holds a B.Sc. in Mathematics from Duke University, where he was advised by Mauro Maggioni. His work has been recognized with funding from the Navy's Young Investigator Program.

Research topics

  • Computer science
  • Mathematics
  • Mathematical optimization
  • Algorithm
  • Artificial intelligence

Selected publications

  • Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

    ArXiv.org · 2025-09-21 · 1 citation

    preprint · Open access

    Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.

  • Internet of Things (IoT) trial in CSG upstream asset: a step towards more efficient operations

    Australian Energy Producers Journal · 2025-05-21

    article · Senior author

    The Upstream Wells Internet of Things (IoT) Trial Project aims to enhance operational efficiency and reduce costs in wellsite monitoring through the implementation of IoT technologies. This trial involves the deployment of IoT-enabled skids equipped with solar panels, cameras, ultrasonic level switches, and edge computing devices, facilitating remote monitoring. The project targets a 25% reduction in physical visits to wellsites, improved data acquisition for predictive maintenance and decreased kilometres driven by operators, contributing to improved safety outcomes. Initial results indicate significant potential for IoT solutions to transform wellsite operations, providing real-time data and enabling remote surveillance.

  • Multimodal Deep Learning-Based Intelligent Food Safety Detection and Traceability System

    International Journal of Management Science Research · 2025-03-31 · 2 citations

    article · Open access · Senior author

    Food safety has become a critical global issue, requiring effective solutions to reduce health risks and economic losses. The rapid advancement of artificial intelligence (AI) and deep learning (DL) provides new opportunities to address this challenge. This study presents a multimodal food safety detection system that integrates computer vision (CV), natural language processing (NLP), and sensor data analysis to comprehensively monitor food contamination, quality deterioration, and supply chain security. Specifically, the Swin Transformer model is employed for surface defect detection, while temporal convolutional networks (TCN) predict storage environment conditions. Additionally, blockchain and federated learning technologies are incorporated to establish a secure and efficient data-sharing framework, enabling cross-supply chain collaboration and enhancing traceability accuracy. Experimental results show that the system achieves an accuracy rate of over 98% in food contamination detection and supply chain anomaly monitoring, significantly improving food safety management. This study offers a practical and innovative approach to enhancing intelligent food safety regulation.

  • On the Statistical Query Complexity of Learning Semiautomata: a Random Walk Approach

    ArXiv.org · 2025-10-05

    preprint · Open access · Senior author

    Semiautomata form a rich class of sequence-processing algorithms with applications in natural language processing, robotics, computational biology, and data mining. We establish the first Statistical Query hardness result for semiautomata under the uniform distribution over input words and initial states. We show that Statistical Query hardness can be established when both the alphabet size and input length are polynomial in the number of states. Unlike the case of deterministic finite automata, where hardness typically arises through the hardness of the language they recognize (e.g., parity), our result is derived solely from the internal state-transition structure of semiautomata. Our analysis reduces the task of distinguishing the final states of two semiautomata to studying the behavior of a random walk on the group $S_{N} \times S_{N}$. By applying tools from Fourier analysis and the representation theory of the symmetric group, we obtain tight spectral gap bounds, demonstrating that after a polynomial number of steps in the number of states, distinct semiautomata become nearly uncorrelated, yielding the desired hardness result.

  • Settling the Sample Complexity of Online Reinforcement Learning

    Journal of the ACM · 2025-05-02 · 2 citations

    article · Open access

    A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large-sample” regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al. [82], achieves a regret on the order of (modulo log factors) $\min\{\sqrt{SAH^3K}, HK\}$, where S is the number of states, A is the number of actions, H is the horizon length, and K is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size K ≥ 1, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield ε-accuracy) of $SAH^3/\varepsilon^2$ up to log factors, which is minimax-optimal for the full ε-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm (based on a new concept called “profiles”) to decouple complicated statistical dependency across the sample trajectories, a long-standing challenge facing the analysis of online RL in the sample-starved regime.

  • The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

    ArXiv.org · 2025-06-05

    preprint · Open access

    In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the generative leap exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n=\Theta(d^{1 \vee k^\star/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).

  • Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

    ArXiv.org · 2025-02-02

    preprint · Open access

    A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time compute by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model.

  • Optimal Multi-Distribution Learning

    Journal of the ACM · 2025-08-25

    article · Open access · Senior author

    Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across k distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, and so on. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik–Chervonenkis (VC) dimension d, we propose a novel algorithm that yields an ε-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory are further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, accessing the hypothesis class solely through an empirical risk minimization oracle. Additionally, we establish the necessity of improper learning, revealing a large sample size barrier when only deterministic, proper hypotheses are permitted. These findings resolve three open problems presented in COLT 2023 (i.e., Awasthi et al. [4, Problems 1, 3, and 4]).

  • What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

    ArXiv.org · 2025-08-10

    preprint · Open access

    In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.

  • Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding

    arXiv (Cornell University) · 2025-01-01

    preprint · Open access

    Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
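
As a worked illustration of the regret bound quoted in the abstract of "Settling the Sample Complexity of Online Reinforcement Learning" above, the Python sketch below evaluates min{sqrt(SAH^3K), HK} and the PAC episode count SAH^3/ε^2 with constants and log factors dropped. The function names and the example problem sizes are hypothetical, chosen only for illustration; this is not code from the paper.

```python
import math


def regret_bound(S: int, A: int, H: int, K: int) -> float:
    """Order of the regret bound min{sqrt(S*A*H^3*K), H*K}.

    Constants and log factors are ignored, so only the scaling is meaningful.
    """
    return min(math.sqrt(S * A * H**3 * K), H * K)


def crossover_episodes(S: int, A: int, H: int) -> int:
    """Episode count where the two terms coincide.

    Setting sqrt(S*A*H^3*K) = H*K and solving for K gives K = S*A*H.
    """
    return S * A * H


def pac_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Order of the PAC sample complexity S*A*H^3 / eps^2 (log factors ignored)."""
    return S * A * H**3 / eps**2


if __name__ == "__main__":
    # Hypothetical problem sizes, for illustration only.
    S, A, H = 10, 4, 5
    print("crossover K:", crossover_episodes(S, A, H))
    for K in (1, 50, 200, 10_000):
        print(K, regret_bound(S, A, H, K))
```

Equating the two terms gives K = SAH, so for fewer episodes than that the trivial bound HK is the smaller of the two, and beyond it the sqrt(SAH^3K) term takes over.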

Labs

  • Jason D. Lee Lab · PI

Education

  • Postdoc, Computer Science

    University of California Berkeley

    2016
  • PhD, Computational Math

    Stanford University

    2015
  • BS, Mathematics

    Duke University

    2010