Osbert Bastani

· Assistant ProfessorVerified

University of Pennsylvania · Computer and Information Science

Active 2011–2026

h-index24

Citations3.2k

Papers197133 last 5y

Funding$1.3M

Faculty page

See your match with Osbert Bastani — sign in to PhdFit.Sign in

Research topics

Artificial Intelligence
Computer Science
Mathematics
Theoretical computer science
Mathematical optimization
Human–computer interaction
Control engineering
Engineering
Data science

Selected publications

Improving access to essential medicines via decision-aware machine learning
Nature · 2026-04-29
articleSenior author
Publisher DOI
Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
arXiv (Cornell University) · 2026-04-09
articleOpen access
High-level specifications of code are inherently ambiguous, and prior systems have explored interactive techniques to help users clarify their intent and resolve such ambiguities. However, most existing approaches elicit supervision through labeled examples, which are often error-prone and may fail to capture user intent. This paper introduces a new active learning paradigm for program disambiguation based on multiple-choice queries. In this paradigm, the system presents a small set of high-level behaviors as multiple-choice options, and the user simply selects the intended one. Technically, each answer option corresponds to a Hoare triple that characterizes a cluster of semantically similar candidate programs. This formulation enables formal reasoning about the informativeness and interpretability of queries, and supports systematic construction of optimal queries. Building on this insight, we develop a new active learning algorithm and implement it in a tool called Socrates, which automatically synthesizes informative multiple-choice queries for program disambiguation. We evaluate Socrates across four domains spanning both symbolic and neurosymbolic settings and show that it produces intuitive, easy-to-answer queries and achieves efficient convergence. Most importantly, Socrates identifies the intended program more reliably than existing methods, while maintaining competitive runtime performance.
Publisher OA PDF DOI
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
arXiv (Cornell University) · 2026-05-18
preprintOpen accessSenior author
Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their end-to-end latency a critical bottleneck. In contrast to traditional applications, execution time is dominated by the external components, which cannot be handled by traditional language optimization systems, like optimizing compilers. To address this problem, we develop PopPy, a system that can uncover parallelization opportunities in Python applications that invoke these heavy external components, including those used in compound AI applications. PopPy supports a very expressive fragment of Python and requires minimal developer input to uncover parallelism. It combines an ahead-of-time compiler with a runtime, addressing three key challenges in extracting parallelism from Python applications: language complexity, dynamic dispatch, and variable mutation. On a set of real-world compound AI applications, PopPy achieves up to $6.4\times$ speedups in end-to-end execution time compared to standard Python execution while preserving the sequential program semantics.
Publisher DOI
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
arXiv (Cornell University) · 2026-05-07
preprintOpen access
Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
Publisher DOI
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
ArXiv.org · 2026-05-18
articleOpen accessSenior author
Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their end-to-end latency a critical bottleneck. In contrast to traditional applications, execution time is dominated by the external components, which cannot be handled by traditional language optimization systems, like optimizing compilers. To address this problem, we develop PopPy, a system that can uncover parallelization opportunities in Python applications that invoke these heavy external components, including those used in compound AI applications. PopPy supports a very expressive fragment of Python and requires minimal developer input to uncover parallelism. It combines an ahead-of-time compiler with a runtime, addressing three key challenges in extracting parallelism from Python applications: language complexity, dynamic dispatch, and variable mutation. On a set of real-world compound AI applications, PopPy achieves up to $6.4\times$ speedups in end-to-end execution time compared to standard Python execution while preserving the sequential program semantics.
Publisher OA PDF
Winner's Curse Drives False Promises in Data-Driven Decisions: A Case Study in Refugee Matching
SSRN Electronic Journal · 2026-01-01
preprintOpen access
Publisher DOI
Decaf: Improving Neural Decompilation with Automatic Feedback and Search
ArXiv.org · 2026-05-12
articleOpen access
Decompilers are useful tools used in reverse engineering to understand compiled source code. Reconstructing source code from compiled binaries is a challenging task, because high-level syntax, identifiers, and custom data types are generally lost as the compiler translates human-readable code to low-level machine code. Deterministic decompilers are useful tools for binary analysis, but can struggle to infer idiomatic syntax and identifier names. Generative AI models are a natural fit for reconstructing high-level syntax, identifiers, and types, but they can still suffer by hallucinating improper programming constructs and semantics. Instead of attempting to improve neural decompilers with more data and more training, we argue that compiler feedback can be used to dramatically improve the semantic correctness of neural decompiler outputs via search. Our system, Decaf (DECompilation with Automated Feedback), raises the neural decompilation rate from 26.0% on ExeBench to 83.9% on the Real -O2 split without sacrificing similarity to the original source code. We also find our automatic feedback methodology is highly effective for improving weaker neural decompilation models.
Publisher OA PDF
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
ArXiv.org · 2026-05-07
articleOpen access
Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
Publisher OA PDF
Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
arXiv (Cornell University) · 2026-04-09
preprintOpen access
High-level specifications of code are inherently ambiguous, and prior systems have explored interactive techniques to help users clarify their intent and resolve such ambiguities. However, most existing approaches elicit supervision through labeled examples, which are often error-prone and may fail to capture user intent. This paper introduces a new active learning paradigm for program disambiguation based on multiple-choice queries. In this paradigm, the system presents a small set of high-level behaviors as multiple-choice options, and the user simply selects the intended one. Technically, each answer option corresponds to a Hoare triple that characterizes a cluster of semantically similar candidate programs. This formulation enables formal reasoning about the informativeness and interpretability of queries, and supports systematic construction of optimal queries. Building on this insight, we develop a new active learning algorithm and implement it in a tool called Socrates, which automatically synthesizes informative multiple-choice queries for program disambiguation. We evaluate Socrates across four domains spanning both symbolic and neurosymbolic settings and show that it produces intuitive, easy-to-answer queries and achieves efficient convergence. Most importantly, Socrates identifies the intended program more reliably than existing methods, while maintaining competitive runtime performance.
Publisher DOI
Decaf: Improving Neural Decompilation with Automatic Feedback and Search
arXiv (Cornell University) · 2026-05-12
preprintOpen access
Decompilers are useful tools used in reverse engineering to understand compiled source code. Reconstructing source code from compiled binaries is a challenging task, because high-level syntax, identifiers, and custom data types are generally lost as the compiler translates human-readable code to low-level machine code. Deterministic decompilers are useful tools for binary analysis, but can struggle to infer idiomatic syntax and identifier names. Generative AI models are a natural fit for reconstructing high-level syntax, identifiers, and types, but they can still suffer by hallucinating improper programming constructs and semantics. Instead of attempting to improve neural decompilers with more data and more training, we argue that compiler feedback can be used to dramatically improve the semantic correctness of neural decompiler outputs via search. Our system, Decaf (DECompilation with Automated Feedback), raises the neural decompilation rate from 26.0% on ExeBench to 83.9% on the Real -O2 split without sacrificing similarity to the original source code. We also find our automatic feedback methodology is highly effective for improving weaker neural decompilation models.
Publisher DOI

Recent grants

Expeditions: Collaborative Research: Understanding the World Through Code
NSF · $808k · 2020–2025
SHF: Small: Inferring Specifications for Blackbox Code
NSF · $500k · 2019–2023

Frequent coauthors

Dinesh Jayaraman
University of Pennsylvania
23 shared
Rajeev Alur
University of Pennsylvania
22 shared
Kishor Jothimurugan
University of Pennsylvania
22 shared
Hamsa Bastani
University of Pennsylvania
20 shared
Armando Solar-Lezama
19 shared
Insup Lee
17 shared
Sangdon Park
17 shared
Yecheng Jason Ma
16 shared

Labs

Osbert Bastani LabPI

Awards & honors

Adobe Photoshop CC 2019

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Osbert Bastani

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you