Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to startNo credit cardCancel anytime
Top matches Balanced preset
Dr. Sarah Chen
Stanford · Interpretability · NLP
91
Dr. Marcus Holloway
MIT · Robotics · RL
84
Dr. Aisha Okonkwo
CMU · Fairness · HCI
82
Nova · Professor Researcher · re-ranking top 20…
Michael Carbin

Michael Carbin

Verified

Massachusetts Institute of Technology · Electrical Engineering & Computer Science

Active 2005–2025

h-index33
Citations5.8k
Papers12660 last 5y
Funding$525k
See your match with Michael Carbin — sign in to PhdFit.Sign in

About

Michael Carbin is an Associate Professor at MIT specializing in areas related to computer science and artificial intelligence, with a focus on programming languages and software engineering, security, and cryptography. His research involves developing techniques for analyzing and synthesizing systems that interact with the external environment through perception, communication, and action, while also learning, making decisions, and adapting to changing environments. His work contributes to advancing the understanding and capabilities of intelligent systems, emphasizing the integration of AI with system design and security.

Research topics

  • Computer Science
  • Programming language
  • Computer architecture
  • Artificial Intelligence
  • Machine Learning
  • Computer engineering
  • Parallel computing
  • Distributed computing
  • Embedded system

Selected publications

  • Inference Plans for Hybrid Particle Filtering

    Proceedings of the ACM on Programming Languages · 2025-01-07 · 2 citations

    articleOpen accessSenior author

    Advanced probabilistic programming languages (PPLs) using hybrid particle filtering combine symbolic exact inference and Monte Carlo methods to improve inference performance. These systems use heuristics to partition random variables within the program into variables that are encoded symbolically and variables that are encoded with sampled values, and the heuristics are not necessarily aligned with the developer’s performance evaluation metrics. In this work, we present inference plans , a programming interface that enables developers to control the partitioning of random variables during hybrid particle filtering. We further present Siren , a new PPL that enables developers to use annotations to specify inference plans the inference system must implement. To assist developers with statically reasoning about whether an inference plan can be implemented, we present an abstract-interpretation-based static analysis for Siren for determining inference plan satisfiability . We prove the analysis is sound with respect to Siren ’s semantics. Our evaluation applies inference plans to three different hybrid particle filtering algorithms on a suite of benchmarks. It shows that the control provided by inference plans enables speed ups of 1.76 x on average and up to 206 x to reach a target accuracy, compared to the inference plans implemented by default heuristics; the results also show that inference plans improve accuracy by 1.83 x on average and up to 595 x with less or equal runtime, compared to the default inference plans. We further show that our static analysis is precise in practice, identifying all satisfiable inference plans in 27 out of the 33 benchmark-algorithm evaluation settings.

  • FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

    ArXiv.org · 2025-04-17

    preprintOpen access

    We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics) and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.

  • Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

    ArXiv.org · 2025-02-17

    preprintOpen accessSenior author

    Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against sequential decoding baseline.

  • Planned Diffusion

    ArXiv.org · 2025-10-20

    preprintOpen accessSenior author

    Most large language models are autoregressive: they generate tokens one at a time. Discrete diffusion language models can generate multiple tokens in parallel, but sampling from them requires a denoising order: a strategy for deciding which tokens to decode at each step. Determining a good denoising order is difficult, and existing approaches use heuristics that create a steep trade-off between quality and latency. We propose planned diffusion, a system that trains the model to determine its own denoising order. Planned diffusion uses a single model that transitions between autoregressive and diffusion-based generation: first, the model autoregressively generates a plan that partitions the response into semantically independent chunks; second, the model denoises all chunks in parallel. The autoregressive plan enables the model to define the denoising order itself. On AlpacaEval, planned diffusion achieves 1.27x to 1.81x speedup over autoregressive generation with only 0.87% to 5.4% drop in win rate, establishing a new Pareto frontier for parallel generation with discrete diffusion. Additionally, planned diffusion's instruction following quality continues to improve with more finetuning compute, while the autoregressive baseline plateaus. Our implementation provides simple runtime knobs that offer tunable control over the quality-latency trade-off.

  • Long Context RAG Performance of Large Language Models

    arXiv (Cornell University) · 2024-11-05 · 6 citations

    preprintOpen accessSenior author

    Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly longer context lengths, there is a growing interest in understanding how these models perform in RAG scenarios. Can these new long context models improve RAG performance? This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open source and commercial LLMs. We ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state of the art LLMs can maintain consistent accuracy at long context above 64k tokens. We also identify distinct failure modes in long context scenarios, suggesting areas for future research.

  • Quantum Control Machine: The Limits of Control Flow in Quantum Programming

    Proceedings of the ACM on Programming Languages · 2024-04-29 · 9 citations

    articleOpen accessSenior authorCorresponding

    Quantum algorithms for tasks such as factorization, search, and simulation rely on control flow such as branching and iteration that depends on the value of data in superposition. High-level programming abstractions for control flow, such as switches, loops, higher-order functions, and continuations, are ubiquitous in classical languages. By contrast, many quantum languages do not provide high-level abstractions for control flow in superposition, and instead require the use of hardware-level logic gates to implement such control flow. The reason for this gap is that whereas a classical computer supports control flow abstractions using a program counter that can depend on data, the typical architecture of a quantum computer does not analogously provide a program counter that can depend on data in superposition. As a result, the complete set of control flow abstractions that can be correctly realized on a quantum computer has not yet been established. In this work, we provide a complete characterization of the properties of control flow abstractions that are correctly realizable on a quantum computer. First, we prove that even on a quantum computer whose program counter exists in superposition, one cannot correctly realize control flow in quantum algorithms by lifting the classical conditional jump instruction to work in superposition. This theorem denies the ability to directly lift general abstractions for control flow such as the λ-calculus from classical to quantum programming. In response, we present the necessary and sufficient conditions for control flow to be correctly realizable on a quantum computer. We introduce the quantum control machine, an instruction set architecture featuring a conditional jump that is restricted to satisfy these conditions. We show how this design enables a developer to correctly express control flow in quantum algorithms using a program counter in place of logic gates.

  • The T-Complexity Costs of Error Correction for Control Flow in Quantum Computation

    Proceedings of the ACM on Programming Languages · 2024-06-20 · 10 citations

    articleOpen accessSenior author

    Numerous quantum algorithms require the use of quantum error correction to overcome the intrinsic unreliability of physical qubits. However, quantum error correction imposes a unique performance bottleneck, known as <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity, that can make an implementation of an algorithm as a quantum program run more slowly than on idealized hardware. In this work, we identify that programming abstractions for control flow, such as the quantum if-statement, can introduce polynomial increases in the <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity of a program. If not mitigated, this slowdown can diminish the computational advantage of a quantum algorithm. To enable reasoning about the costs of control flow, we present a cost model that a developer can use to accurately analyze the <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity of a program under quantum error correction and pinpoint the sources of slowdown. To enable the mitigation of these costs, we present a set of program-level optimizations that a developer can use to rewrite a program to reduce its <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity, predict the <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity of the optimized program using the cost model, and then compile it to an efficient circuit via a straightforward strategy. We implement the program-level optimizations in Spire, an extension of the Tower quantum compiler. Using a set of 11 benchmark programs that use control flow, we empirically show that the cost model is accurate, and that Spire's optimizations recover programs that are asymptotically efficient, meaning their runtime <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity under error correction is equal to their time complexity on idealized hardware. Our results show that optimizing a program before it is compiled to a circuit can yield better results than compiling the program to an inefficient circuit and then invoking a quantum circuit optimizer found in prior work. For our benchmarks, only 2 of 8 tested quantum circuit optimizers recover circuits with asymptotically efficient <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mi>T</mml:mi> </mml:math> -complexity. Compared to these 2 optimizers, Spire uses <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:mn>54</mml:mn> <mml:mo>×</mml:mo> <mml:mo>−</mml:mo> <mml:mn>2400</mml:mn> <mml:mo>×</mml:mo> </mml:math> less compile time.

  • Learning to Compile Programs to Neural Networks

    arXiv (Cornell University) · 2024-07-21

    preprintOpen accessSenior author

    A $\textit{neural surrogate of a program}$ is a neural network that mimics the behavior of a program. Researchers have used these neural surrogates to automatically tune program inputs, adapt programs to new settings, and accelerate computations. Researchers traditionally develop neural surrogates by training on input-output examples from a single program. Alternatively, language models trained on a large dataset including many programs can consume program text, to act as a neural surrogate. Using a language model to both generate a surrogate and act as a surrogate, however, leading to a trade-off between resource consumption and accuracy. We present $\textit{neural surrogate compilation}$, a technique for producing neural surrogates directly from program text without coupling neural surrogate generation and execution. We implement neural surrogate compilers using hypernetworks trained on a dataset of C programs and find that they produce neural surrogates that are $1.9$-$9.5\times$ as data-efficient, produce visual results that are $1.0$-$1.3\times$ more similar to ground truth, and train in $4.3$-$7.3\times$ fewer epochs than neural surrogates trained from scratch.

  • Drowning in Documents: Consequences of Scaling Reranker Inference

    arXiv (Cornell University) · 2024-11-18

    preprintOpen access

    Rerankers, typically cross-encoders, are computationally intensive but are frequently used because they are widely assumed to outperform cheaper initial IR systems. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. To provide a more robust evaluation, we prioritize strong first-stage retrieval using modern dense embeddings and test rerankers on a variety of carefully chosen, challenging tasks, including internally curated datasets to avoid contamination, and out-of-domain ones. Our empirical results reveal a surprising trend: the best existing rerankers provide initial improvements when scoring progressively more documents, but their effectiveness gradually declines and can even degrade quality beyond a certain limit. We hope that our findings will spur future research to improve reranking.

  • BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

    arXiv (Cornell University) · 2024-03-27 · 34 citations

    preprintOpen access

    Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical and environmentally friendly foundations for particular NLP applications, such as in biomedicine. The model is available on the Hugging Face Hub: https://huggingface.co/stanford-crfm/BioMedLM.

Recent grants

Frequent coauthors

Labs

  • MIT EECS Communication LabPI

Awards & honors

  • Eleven MIT faculty receive Presidential Early Career Awards
  • Resume-aware match score
  • Save to shortlist
  • AI-drafted outreach

See your match with Michael Carbin

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

  • Free to start
  • No credit card
  • 30-second signup