Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Percy Liang

Percy Liang

· Machine Learning & Natural Language ProcessingVerified

Stanford University · Learning, Design, and Technology

Active 2004–2025

h-index93
Citations45.3k
Papers461210 last 5y
Funding$550k
See your match with Percy Liang — sign in to PhdFit.Sign in

About

Percy Liang is a Professor of Computer Science with courtesy in Statistics. His research interests encompass fundamental questions around learning and intelligence, including how well we can learn in data-limited, infinite compute regimes, modeling interestingness and curiosity, and pushing the frontier of human knowledge. He leads Marin, an initiative to build models openly through experiments that are preregistered and accessible for public engagement, practicing a radical form of openness that extends beyond open-weight and open-source models. Liang is dedicated to enabling understanding, shaping, and contributing to foundation model development, exemplified by his teaching of CS336 (Language Models from Scratch). He advocates for efficient and reproducible research, having created CodaLab Worksheets, a platform that maintains full experiment provenance from raw data to results, with some papers on CodaLab as executable papers. His educational background includes a Ph.D. from Berkeley, advised by Michael Jordan and Dan Klein, an MEng from MIT, and a B.S. from MIT. His honors include the Presidential Early Career Award for Scientists and Engineers, the IJCAI Computers and Thought Award, an NSF CAREER Award, a Sloan Research Fellowship, and a Microsoft Research Faculty Fellowship. Liang has a diverse background with experience in programming contests, music competitions, and various academic and research roles, and he has mentored numerous students and post-docs who have gone on to prominent positions in academia and industry.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Natural Language Processing
  • Machine Learning
  • Engineering
  • Psychology
  • Data Mining
  • Mathematics
  • Linguistics
  • Human–computer interaction
  • Political Science
  • Data science
  • Information Retrieval
  • Cognitive science
  • Cognitive psychology
  • Physics
  • Geography
  • Engineering ethics
  • Biology
  • Software engineering
  • Law
  • Cartography
  • Philosophy
  • Management science

Selected publications

  • Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    2025-06-21 · 13 citations

    articleOpen accessSenior author

    Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization.Despite these successes, VLAs struggle with novel robot setups and require finetuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies.In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model.Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications.We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26.In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs (0 and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate.We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io.

  • MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

    ArXiv.org · 2025-05-12

    preprintOpen access

    We introduce MLE-Dojo, a Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.

  • Independence Tests for Language Models

    ArXiv.org · 2025-02-17

    preprintOpen accessSenior author

    We consider the following problem: given the weights of two models, can we test whether they were trained independently -- i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, can change model architecture, and allow for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture. The test can also do localized testing: identifying specific non-independent components of models. Though we no longer obtain exact p-values from this, empirically we find it behaves as one and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.

  • The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

    arXiv (Cornell University) · 2025-02-26 · 1 citations

    preprintOpen access

    Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.

  • Extracting memorized pieces of (copyrighted) books from open-weight language models

    ArXiv.org · 2025-05-18

    preprintOpen access

    Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) memorize protected expression from books in their training data. We show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we develop a technique to measure memorization of books, which we apply to 200 books and 14 open-weight LLMs. Through over 3000 experiments, we show that memorization varies both by model and book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part; however, there are notable exceptions. For instance, Llama 3.1 70B entirely memorizes some books, like Harry Potter and the Sorcerer's Stone; memorization is so extensive that one can deterministically extract the whole book almost verbatim using the book's first few words as an initial prompt. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.

  • Advancing science- and evidence-based AI policy

    Science · 2025-07-31 · 10 citations

    articleOpen access

    Policy must be informed by, but also facilitate the generation of, scientific evidence.

  • Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research

    SSRN Electronic Journal · 2025-01-01

    preprintOpen access
  • BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

    ArXiv.org · 2025-05-21

    preprintOpen accessSenior author

    AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from \$10 to \$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1. Given up to three attempts, the top-performing agents are Codex CLI: o3-high (12.5% on Detect, mapping to \$3,720; 90% on Patch, mapping to \$14,152), Custom Agent: Claude 3.7 Sonnet Thinking (67.5% on Exploit), and Codex CLI: o4-mini (90% on Patch, mapping to \$14,422). Codex CLI: o3-high, Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5-67.5% and Patch scores of 25-60%.

  • s1: Simple test-time scaling

    ArXiv.org · 2025-01-31 · 10 citations

    preprintOpen access

    Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

  • Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

    ArXiv.org · 2025-10-13

    preprintOpen access

    AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

Recent grants

Frequent coauthors

Education

  • B.S.

    MIT

    2004
  • Other

    MIT

    2005
  • Ph.D.

    Berkeley

    2011

Awards & honors

  • Presidential Early Career Award for Scientists and Engineers…
  • IJCAI Computers and Thought Award (2016)
  • NSF CAREER Award (2016)
  • Sloan Research Fellowship (2015)
  • Microsoft Research Faculty Fellowship (2014)
  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Percy Liang

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup