
Sanjeev Arora
Director of Princeton Language and Intelligence · Princeton University · Philosophy
Active 1974–2026
Research topics
- Computer Science
- Computer Security
- Artificial Intelligence
- Machine Learning
- Natural Language Processing
- Theoretical computer science
- Algorithm
Selected publications
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
arXiv (Cornell University) · 2026-04-13
preprint · Open access · Senior author
Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
American Society of Clinical Oncology Educational Book · 2026-04-24
article · Senior author
Cancer outcomes remain starkly unequal: 5-year survival rates for common malignancies in low- and middle-income countries (LMICs) often lag 20-40 percentage points behind high-income benchmarks, and similar disparities persist between well-resourced metropolitan centers and rural or safety-net settings within high-income countries. The gap is driven less by the absence of effective interventions than by workforce shortages, fragmented referral pathways, limited infrastructure, and loss to follow-up after abnormal screens. Technologies that scale must therefore function as deployable workflows integrating staffing, logistics, quality assurance (QA), governance, and monitoring, not merely as stand-alone algorithms or devices. This review synthesizes evidence across four complementary technology families that address these constraints across the continuum of care: artificial intelligence (AI)-supported screening and triage as the population entry-point layer; Project ECHO telementoring as the workforce-capacity layer; electronic patient-reported outcomes and remote symptom monitoring as the longitudinal continuity layer; and AI-powered clinical trial prescreening hubs as the access-to-innovation layer. The technologies do not carry equal evidentiary weight, and they should not be deployed identically in every setting. Our aim is to show how oncology leaders can sequence them pragmatically inside a common operating logic while adapting to local infrastructure and governance. We organize the framework in patient-journey order (entry into care through screening, workforce support through telementoring, continuity through remote monitoring, and access to innovation through trial prescreening) and draw examples from both LMIC programs and underserved US settings.
We therefore present a practical implementation playbook, including a 90-day launch checklist, staffing models, QA frameworks, equity and bias monitoring metrics, and a program outcomes dashboard designed for oncology leaders seeking to move beyond pilots toward durable, monitored deployment.
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
arXiv (Cornell University) · 2025-01-05
preprint · Open access · Senior author
Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning -- even compared to LLMs on the same tasks presented in text form -- giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought.
Indian Journal of Psychiatry · 2025-05-02 · 1 citation
article · Open access · Senior author
Background: Alcohol use disorder (AUD) is a public health problem. In India, about 5.2% of the population aged 10–75 years, that is, approximately 5.7 crore individuals, need help for their alcohol use problems, around 20 lakh of them in Karnataka. According to published studies, about 75% of people dependent on alcohol who tried quitting did not receive any treatment. One potential approach to reducing this gap is enhancing the knowledge and skills of the existing District Mental Health Programme (DMHP) Health Care Providers (HCPs) on a large scale by integrating a case-based tele-ECHO (Extension of Community Healthcare Outcome) mentoring model. Aim: This study evaluates the effectiveness of ECHO telementoring in improving knowledge and perceived skills related to AUD among nonmedical HCPs in Karnataka’s DMHP. Methods: A digital-driven curriculum of the Foundation of Alcohol Management was designed and implemented with 84 DMHP healthcare providers (44 ECHO group, 40 waitlist) from 26 districts in Karnataka. The ECHO intervention comprised 27 weekly telementoring sessions over 9 months, combining case-based learning with didactic presentations. Knowledge and perceived skills were assessed at baseline and at 3, 6, and 9 months using semistructured questionnaires, in addition to engagement and satisfaction. Results: At baseline, both groups showed comparable knowledge levels (ECHO: 5.84 ± 1.89, waitlist: 6.65 ± 2.67, P = .110). The ECHO group demonstrated significantly higher knowledge scores at 3 months (8.41 ± 2.84 vs 6.35 ± 2.07, P < .001) and 9 months (9.31 ± 2.48 vs 6.13 ± 1.82, P < .001). Self-perceived skills similarly improved in the ECHO group, showing significant enhancement from baseline (20.25 ± 5.62) to 9 months (25.10 ± 4.52, P = .004), with a large effect size (Cohen’s d = 1.040). Program engagement was high, with 42 participants attending more than 60% of sessions, and 104 cases were discussed during tele-ECHO sessions. The waitlist group showed no significant improvements in either domain. Conclusion: The ECHO telementoring program effectively improves the knowledge and perceived skills of the DMHP nonmedical HCPs in AUD. The ECHO model can be a valuable tool for exponential enhancement in capacity in addictive disorder, especially in low-resource settings, by leveraging technology.
The Dose Response Regarding Microbial Disease: A Mathematical View
2025-09-01
article · Open access · Senior author
Society sees both communicable and non-communicable diseases, including chronic diseases such as tuberculosis (TB), which take longer to cure than a cold, cough, or flu. The chance that an individual develops a disease depends largely on the growth rate of the micro-organism causing it. A drug given to an infected individual reaches the target organ(s), kills the pathogens, and helps cure the disease through pharmacokinetics and pharmacodynamics (PK/PD). The rate of cure in an infected individual therefore depends on the rate at which that drug kills the causative microbes. The question is how much drug should be given, since an overdose may cause harmful side effects or direct harm to the infected person and may lead to symptoms of other diseases. Microbial growth is also influenced by the body's resistance, a favorable or unfavorable environment, and nutrition, but the present paper neglects these factors and focuses on the relation between the growth rate of microbes in an individual and the dose given. The paper establishes this relation through mathematical modeling.
Weak-to-Strong Generalization Even in Random Feature Networks, Provably
ArXiv.org · 2025-03-04
preprint · Open access
Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
OCA: A Shiny web application for transparent overload compensation in higher education
SoftwareX · 2025-09-27
article · Open access · Senior author
OCA (Overload Compensation App) is an interactive Shiny web application that automates faculty overload pay calculations in accordance with institutional policy and enables users to visualize the results. Designed to promote transparency, reproducibility, and fairness, OCA allows academic administrators to filter, compute, and export overload data across instructors and departments. The app supports strategic blending between institution- and instructor-favoring approaches, offering both flexibility and clarity in compensation planning. OCA is open-source, released under the AGPL-3 license, and requires no programming expertise to use.
Advancing science- and evidence-based AI policy
Science · 2025-07-31 · 10 citations
article · Open access
Policy must be informed by, but also facilitate the generation of, scientific evidence.
Rethinking Thinking Tokens: LLMs as Improvement Operators
ArXiv.org · 2025-10-01
preprint · Open access
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own "thoughts" with a continuum of possible strategies. We identify an interesting inference family Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via degree of parallelism, and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting degree of parallelism to 1 yields an interesting subcase, Sequential Refinement (SR) (iteratively improve a single candidate answer) which provides performance superior to long CoT. Success of such model orchestrations raises the question whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025).
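The three PDR steps named in the abstract (parallel drafts, a bounded workspace, a refinement that seeds the next round) can be illustrated with a toy loop. This is only a sketch of the control flow, not the paper's implementation: `generate`, `distill`, `refine`, `score`, and the integer-search objective are hypothetical stand-ins for model calls, chosen so the example runs without any model.

```python
from typing import List

TARGET = 50  # toy objective (an assumption): find an integer whose square is near 50


def score(x: int) -> int:
    """Lower is better: distance of x*x from TARGET."""
    return abs(x * x - TARGET)


def generate(seed: int, i: int) -> int:
    """Hypothetical stand-in for one model draft: the i-th variation on the seed."""
    return seed + i + 1


def distill(drafts: List[int], keep: int = 2) -> str:
    """Compress drafts into a bounded textual workspace (at most `keep` entries)."""
    best = sorted(drafts, key=score)[:keep]
    return ",".join(str(d) for d in best)


def refine(seed: int, workspace: str) -> int:
    """Choose the best of the previous output and the workspace entries."""
    candidates = [seed] + [int(tok) for tok in workspace.split(",")]
    return min(candidates, key=score)


def pdr(seed: int, parallelism: int, rounds: int) -> int:
    """Run `rounds` of draft -> distill -> refine; each output seeds the next round."""
    for _ in range(rounds):
        # (i) drafts are independent, so a real system could produce them concurrently
        drafts = [generate(seed, i) for i in range(parallelism)]
        # (ii) the workspace size depends on `keep`, not on how many drafts were made
        workspace = distill(drafts)
        # (iii) the refined output becomes the seed for the next round
        seed = refine(seed, workspace)
    return seed


if __name__ == "__main__":
    print(pdr(0, parallelism=4, rounds=2))  # 7
    print(pdr(0, parallelism=1, rounds=2))  # 2 (sequential refinement, same round budget)
```

In this toy, four parallel drafts reach the best integer in two sequential rounds, while `parallelism=1` (the Sequential Refinement subcase) needs seven rounds to get there. The workspace stays the same size in both cases.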
How Does RL Post-training Induce Skill Composition? A Case Study on Countdown
ArXiv.org · 2025-12-01
preprint · Open access · Senior author
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability -- models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
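To make the Countdown task and the expression-tree view concrete, here is a minimal brute-force sketch. It assumes every number must be used exactly once and that the four arithmetic operations are allowed (the abstract does not spell out these rules). The function names and the nested-tuple tree encoding are illustrative choices, not the paper's tooling.

```python
import operator
from fractions import Fraction
from itertools import combinations

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def solve(numbers, target):
    """Return an expression tree using every number once that hits target, or None.

    A tree is either an int leaf or a tuple (op, left_subtree, right_subtree).
    """
    items = [(Fraction(n), n) for n in numbers]  # (exact value, tree) pairs
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, tree = items[0]
        return tree if value == target else None
    # Combine any two remaining subtrees into one, then recurse on the smaller list.
    for i, j in combinations(range(len(items)), 2):
        (a, ta), (b, tb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        merged = [(a + b, ("+", ta, tb)), (a * b, ("*", ta, tb)),
                  (a - b, ("-", ta, tb)), (b - a, ("-", tb, ta))]
        if b != 0:
            merged.append((a / b, ("/", ta, tb)))
        if a != 0:
            merged.append((b / a, ("/", tb, ta)))
        for value, tree in merged:
            found = _search(rest + [(value, tree)], target)
            if found is not None:
                return found
    return None


def evaluate(tree):
    """Evaluate a tree exactly with rational arithmetic."""
    if not isinstance(tree, tuple):
        return Fraction(tree)
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))


def leaves(tree):
    """List the numbers at the leaves, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])


def depth(tree):
    """Composition depth: 0 for a leaf, 1 + deeper child for an operation."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


if __name__ == "__main__":
    tree = solve([3, 5, 7], 26)
    print(tree, evaluate(tree), depth(tree))
```

Each subtree of the returned tuple is itself a smaller Countdown solution, which is the sense in which the abstract treats subtrees as reusable subtasks. Comparing `depth` of the left and right children is one simple way to tell balanced shapes from left-heavy or right-heavy ones.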
Recent grants
New directions in Approximation Algorithms for NP-hard problems
NSF · $200k · 2005–2007
Collaborative Research: Understanding, Coping with, and Benefiting from Intractability.
NSF · $6.9M · 2008–2014
AF: Small: Linear Algebra++ and applications to machine learning
NSF · $466k · 2015–2019
AF: Small: Expansion, Unique Games, and Efficient Algorithms
NSF · $458k · 2011–2015
NSF · $290k · 2005–2010
Frequent coauthors
- 48 shared
Summers Kalishman
University of New Mexico
- 44 shared
Karla Thornton
University of New Mexico
- 36 shared
Nishi Suryavanshi
Government Medical College
- 34 shared
Tengyu Ma
- 33 shared
Prabhat Chand
National Institute of Mental Health and Neurosciences
- 32 shared
Joanna G. Katzman
Community Initiatives
- 31 shared
Matthew F. Bouchonville
University of New Mexico
- 29 shared
Pratima Murthy
National Institute of Mental Health and Neurosciences
Labs
Arora Research Lab @ Princeton