
Chelsea Finn
Machine Learning, Deep Learning & Robotics · Stanford University · Symbolic Systems
Active 1988–2026
About
Chelsea Finn is a researcher focused on developing scalable AI systems to provide personalized feedback in large online computer science courses. Her work addresses the challenge of delivering high-quality, individualized feedback to thousands of students, which is traditionally labor-intensive and difficult to scale. Finn and her collaborators proposed a meta-learning based AI system that trains neural networks to analyze student code and generate feedback with minimal instructor input. This system was tested on student solutions from Stanford's CS106A exams and demonstrated feedback quality comparable to human instructors. It was successfully deployed in the Code in Place 2021 course, an online computer science offering with over 12,000 students, where the AI-generated feedback achieved a 97.9% student agreement rate, surpassing the 96.7% agreement rate for human instructor feedback. Finn's research highlights the difficulty of providing feedback at scale due to the vast diversity of student solutions, which follow a Zipf distribution, and the complexity of reasoning about student misconceptions. Her work explores computational approaches, including supervised learning and generative grading, to automate feedback and overcome the limitations of traditional methods such as unit tests and crowdsourced instructor annotations. This research contributes to advancing online education by enabling effective, scalable feedback mechanisms for open-ended student work in programming courses.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Engineering
- Data Mining
- Political Science
- Human–computer interaction
- Management science
- Mathematics
- Computer vision
- Engineering ethics
- Biology
- Law
- Data science
- Geography
- Ecology
- Cartography
- Programming language
Selected publications
RoboReward: General-Purpose Vision-Language Reward Models for Robotics
ArXiv.org · 2026-01-02
article · Open access · Senior author
A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotics, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real robot tasks is poorly understood. In this work, we aim to close this gap by introducing (1) RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a negative examples data augmentation pipeline that generates calibrated negatives and near-misses via counterfactual relabeling of successful episodes and temporal clipping to create partial-progress outcomes from the same videos. Using this framework, we build a large training and evaluation dataset spanning diverse tasks and embodiments to test whether state-of-the-art VLMs can reliably provide rewards for robot learning. Our evaluation of open and proprietary VLMs finds that no model excels across tasks, highlighting substantial room for improvement. We then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs in assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B model in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5 while narrowing the gap to RL training with human-provided rewards. We release the full dataset, trained reward models, and evaluation suite on our website to advance the development of general-purpose reward models in robotics: https://crfm.stanford.edu/helm/robo-reward-bench (project website).
Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison
ArXiv.org · 2025-11-11
preprint · Open access · Senior author
We introduce Feedback Descent, a framework that optimizes text artifacts (prompts, code, and molecules) through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the 99.9th percentile of a database with more than 260,000 compounds across six protein targets.
Emergence of Human to Robot Transfer in Vision-Language-Action Models
arXiv (Cornell University) · 2025-12-27
preprint · Open access
Vision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and presents a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe, and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pretraining produces embodiment-agnostic representations for human and robot data. We validate these findings through a series of experiments probing human to robot skill transfer and find that with sufficiently diverse robot pre-training our method can nearly double the performance on generalization settings seen only in human data.
Value Flows
arXiv (Cornell University) · 2025-10-09
preprint · Open access
While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on 37 state-based and 25 image-based benchmark tasks demonstrate that Value Flows achieves a 1.3× improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows
Speedtuning: Speeding Up Policy Execution with Lightweight Reinforcement Learning
2025-05-19
article · Senior author
While learned robotic policies hold promise for advancing generalizable manipulation, their practical deployment is often hindered by suboptimal execution speeds. Imitation learning policies are inherently limited by hardware constraints and the speed of the operator during data collection. In addition, there are no established methods for accelerating policies learned via imitation, and the empirical relationship between execution speed and task success remains underexplored. To address these issues, we introduce SPEEDTUNING, a reinforcement learning framework specifically designed to enhance the speed of manipulation policies. SPEEDTUNING learns to predict the optimal execution speed for actions, thereby complementing a base policy without necessitating additional data collection. We provide empirical evidence that SPEEDTUNING achieves substantial improvements in execution speed, exceeding 2.4x speed-up, while preserving an adequate success rate compared to both the original task policy and straightforward speed-up methods such as linear interpolation at a fixed speed. We evaluate our approach across a diverse set of dynamic and precise tasks, including pouring, throwing, and picking, demonstrating its effectiveness and robustness in enhancing real-world robotic manipulation. Videos and code are available at https://daivdyuan.github.io/speed-tuning/
Reinforcement Learning via Implicit Imitation Guidance
ArXiv.org · 2025-06-09
preprint · Open access · Senior author
We study the problem of sample efficient reinforcement learning, where prior data such as demonstrations are provided for initialization in lieu of a dense reward signal. A natural approach is to incorporate an imitation learning objective, either as regularization during training or to acquire a reference policy. However, imitation learning objectives can ultimately degrade long-term performance, as they do not directly align with reward maximization. In this work, we propose to use prior data solely for guiding exploration via noise added to the policy, sidestepping the need for explicit behavior cloning constraints. The key insight in our framework, Data-Guided Noise (DGN), is that demonstrations are most useful for identifying which actions should be explored, rather than forcing the policy to take certain actions. Our approach achieves up to 2-3x improvement over prior reinforcement learning from offline data methods across seven simulated continuous control tasks.
Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning
ArXiv.org · 2025-06-05 · 1 citation
preprint · Open access
Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, including supervised finetuning on shorter traces, user-controlled budgets, or RL with uniform penalties, either require data curation, manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement learning objective tailoring generation length to per-prompt solve rate. During training, ALP monitors each prompt's online solve rate through multiple rollouts and adds a differentiable penalty whose magnitude scales inversely with that rate, so confident (easy) prompts incur a high cost for extra tokens while hard prompts remain unhindered. Posttraining DeepScaleR-1.5B with ALP cuts average token usage by 50% without significantly dropping performance. Relative to fixed-budget and uniform penalty baselines, ALP redistributes its reduced budget more intelligently by cutting compute on easy prompts and reallocating saved tokens to difficult ones, delivering higher accuracy on the hardest problems with higher cost.
Curating Demonstrations using Online Experience
2025-06-21 · 2 citations
article · Open access · Senior author
Many robot demonstration datasets contain heterogeneous demonstrations of varying quality. This heterogeneity may benefit policy pre-training, but can hinder robot performance when used with a final imitation learning objective. In particular, some strategies in the data may be less reliable than others or may be underrepresented in the data, leading to poor performance when such strategies are sampled at test time. Moreover, such unreliable or underrepresented strategies can be difficult even for people to discern, and sifting through demonstration datasets is time-consuming and costly. On the other hand, policy performance when trained on such demonstrations can reflect the reliability of different strategies. We thus propose for robots to self-curate based on online robot experience (Demo-SCORE). More specifically, we train and cross-validate a classifier to discern successful policy roll-outs from unsuccessful ones and use the classifier to filter heterogeneous demonstration datasets. Our experiments in simulation and the real world show that Demo-SCORE can effectively identify suboptimal demonstrations without manual curation. Notably, Demo-SCORE achieves over 15-35% higher absolute success rate in the resulting policy compared to the base policy trained with all original demonstrations.
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
2025-06-21 · 13 citations
article · Open access
Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs (π0 and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io.
Frequent coauthors
- 262 shared
Sergey Levine
- 53 shared
Tianhe Yu
- 49 shared
Pieter Abbeel
University of California, Berkeley
- 45 shared
Karol Hausman
Google (United States)
- 40 shared
Rafael Rafailov
- 35 shared
Archit Sharma
- 34 shared
Annie Xie
- 32 shared
Eric Mitchell
Neuroscience Institute
Labs
Meta-learning, reinforcement learning, and computer vision research
Education
- 2015
Ph.D., Computer Science
Stanford University
- 2011
M.S., Computer Science
Stanford University
- 2007
B.S., Electrical Engineering and Computer Science
Massachusetts Institute of Technology (MIT)
Awards & honors
- Presidential Early Career Award for Scientists and Engineers…
- Research Fellowship, Alfred P. Sloan Foundation (2023)
- Early Academic Career Award in Robotics and Automation, IEEE…
- Young Investigator Award, Office of Naval Research (2021)
- Microsoft Faculty Fellowship, Microsoft (2020)