Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Sanjiban Choudhury

Sanjiban Choudhury

· Assistant Professor of Computer ScienceVerified

Cornell University · Computer Science

Active 2012–2025

h-index21
Citations1.3k
Papers12256 last 5y
Funding
See your match with Sanjiban Choudhury — sign in to PhdFit.Sign in

About

Sanjiban Choudhury is an assistant professor of computer science at Cornell University. His research focuses on interactive AI agents that self-align through few-shot interactions with humans and their environment. His work emphasizes reinforcement learning (RLHF), imitation learning (IRL), and foundation models for planning, robotics, and code generation. He leads the PoRTaL group, which builds everyday robots for everyday users with the mission to make robots accessible, user-friendly, and practical for tasks such as cooking and cleaning. Choudhury completed his postdoctoral research at the University of Washington and earned his M.A. and Ph.D. at Carnegie Mellon University. He holds a B.S. and M.S. in electrical engineering from the Indian Institute of Technology, Kharagpur.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Machine Learning
  • Physics
  • Theoretical computer science
  • Psychology
  • Algorithm
  • Philosophy
  • Mathematics

Selected publications

  • Distilling Realizable Students from Unrealizable Teachers

    2025-10-19

    articleSenior author

    We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is information asymmetry: the student cannot directly access the teacher’s state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher —querying only when necessary and resetting from recovery states —to stay on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines when the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects where to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance. The project website is available here.

  • Process Reward Models for LLM Agents: Practical Framework and Directions

    arXiv (Cornell University) · 2025-02-14

    preprintOpen access1st authorCorresponding

    We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.

  • Efficient Imitation under Misspecification

    ArXiv.org · 2025-03-17

    preprintOpen access

    We consider the problem of imitation learning under misspecification: settings where the learner is fundamentally unable to replicate expert behavior everywhere. This is often true in practice due to differences in observation space and action space expressiveness (e.g. perceptual or morphological differences between robots and humans). Given the learner must make some mistakes in the misspecified setting, interaction with the environment is fundamentally required to figure out which mistakes are particularly costly and lead to compounding errors. However, given the computational cost and safety concerns inherent in interaction, we'd like to perform as little of it as possible while ensuring we've learned a strong policy. Accordingly, prior work has proposed a flavor of efficient inverse reinforcement learning algorithms that merely perform a computationally efficient local search procedure with strong guarantees in the realizable setting. We first prove that under a novel structural condition we term reward-agnostic policy completeness, these sorts of local-search based IRL algorithms are able to avoid compounding errors. We then consider the question of where we should perform local search in the first place, given the learner may not be able to "walk on a tightrope" as well as the expert in the misspecified setting. We prove that in the misspecified setting, it is beneficial to broaden the set of states on which local search is performed to include those reachable by good policies the learner can actually play. We then experimentally explore a variety of sources of misspecification and how offline data can be used to effectively broaden where we perform local search from.

  • X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real

    ArXiv.org · 2025-05-11

    preprintOpen accessSenior author

    Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.

  • All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

    ArXiv.org · 2025-03-03

    preprintOpen access

    From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, (1) it is relatively easy to learn the relatively simple RM (verifier) from the preference data. Then, (2) the downstream RL procedure only returns policies (generators) that are optimal for such relatively simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.

  • Multi-Turn Code Generation Through Single-Step Rewards

    ArXiv.org · 2025-02-27

    preprintOpen accessSenior author

    We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $μ$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $μ$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $μ$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.

  • One-Shot Imitation Under Mismatched Execution

    2025-05-19 · 1 citations

    articleSenior author

    Human demonstrations as prompts are a power-ful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods for human-robot translation either depend on paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving over 50% increase in task success compared to previous methods. We release our code and datasets at this website.

  • Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning

    2025-05-19 · 6 citations

    article

    Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for both human hands and robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT- <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\pi$</tex> ) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT- <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\pi$</tex> completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT- <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$\pi$</tex> achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website (https://portal-cornell.github.io/motion_track_policy/).

  • Robotouille: An Asynchronous Planning Benchmark for LLM Agents

    ArXiv.org · 2025-02-06

    preprintOpen accessSenior author

    Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt4-o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.

  • Imitation Learning via Focused Satisficing

    ArXiv.org · 2025-05-20

    preprintOpen access

    Imitation learning often assumes that demonstrations are close to optimal according to some fixed, but unknown, cost function. However, according to satisficing theory, humans often choose acceptable behavior based on their personal (and potentially dynamic) levels of aspiration, rather than achieving (near-) optimality. For example, a lunar lander demonstration that successfully lands without crashing might be acceptable to a novice despite being slow or jerky. Using a margin-based objective to guide deep reinforcement learning, our focused satisficing approach to imitation learning seeks a policy that surpasses the demonstrator's aspiration levels -- defined over trajectories or portions of trajectories -- on unseen demonstrations without explicitly learning those aspirations. We show experimentally that this focuses the policy to imitate the highest quality (portions of) demonstrations better than existing imitation learning methods, providing much higher rates of guaranteed acceptability to the demonstrator, and competitive true returns on a range of environments.

Frequent coauthors

  • Siddhartha S Srinivasa

    51 shared
  • Sebastian Scherer

    42 shared
  • Mohak Bhardwaj

    15 shared
  • J. Andrew Bagnell

    15 shared
  • Aditya Mandalika

    14 shared
  • Gilwoo Lee

    University of Washington

    13 shared
  • Gokul Swamy

    13 shared
  • Zhiwei Steven Wu

    12 shared

Education

  • Ph.D, Robotics Institute

    Carnegie Mellon University

    2018
  • M.S., Robotics Institute

    Carnegie Mellon University

    2013
  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Sanjiban Choudhury

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup