
David Sontag
Massachusetts Institute of Technology · Electrical Engineering & Computer Science
Active 1981–2026
About
David Sontag is an Associate Professor at MIT in the EECS department, specializing in Artificial Intelligence and Decision-making. His research areas include AI for Healthcare and Life Sciences, Natural Language and Speech Processing, and developing techniques for systems that interact with the external world through perception, communication, and action. His work combines intellectual traditions from computer science and electrical engineering to analyze and synthesize systems that learn, make decisions, and adapt to changing environments. As a faculty member, he is involved in advancing the understanding and application of AI technologies, contributing to the department's focus on innovative research in these fields.
Research topics
- Artificial Intelligence
- Data Mining
- Computer Science
- Machine Learning
- Biology
- Bioinformatics
Selected publications
LLMs can construct powerful representations and streamline sample-efficient supervised learning
ArXiv.org · 2026-03-12
article · Open access · Senior author
As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer operational advantages such as being easy to audit, cost-effectiveness at scale, and facilitating tabular representations.
Use of large language models for clinical data abstraction from oncologic medical records
Cancer Research · 2025-04-21
article
Introduction: Abstraction of clinical data from unstructured medical records is a labor-intensive process. Large language model (LLM)-based systems can accelerate the availability of data for research use. Participants and Methods: The American Cancer Society’s Cancer Prevention Study-3 is a large, nationwide prospective cohort study of ∼300,000 cancer-free participants enrolled between 2006-2013. Participants who self-reported a diagnosis of breast, colorectal, ovarian, or any blood cancer on the 2021 follow-up survey were consented for the retrieval of medical records (available as a single PDF) from diagnostic and treatment facilities. Guidelines, e.g. ontologies and decision trees, were generated to precisely define abstraction tasks, and relevant research data were annotated via human abstraction. This project then used 300 breast cancer-related medical records to develop (n=200) and test (n=100) the performance of Layer Health’s LLM-based platform Distill to abstract seven data elements from the medical record: cancer behavior, laterality, neoadjuvant therapy status, and presence of key biopsy or surgery with associated procedure dates. Both an answer and corresponding evidence from the record were provided by their algorithms; this evidence was used to conduct quality control on a small fraction of data points that were automatically flagged by inter-variable consistency checks. An ACS oncology data specialist adjudicated disagreements between human and LLM-based abstraction to update ground truth labels. Accuracy and F1 statistics were then calculated between ground truth and LLM-derived results. Results: Results on the test set (n=100) were available on the same day; adjudication of 25 discrepancies between human and LLM-based abstraction found 12 LLM errors and 13 human errors. Cancer behavior and laterality were abstracted with 100% accuracy by the LLM system. Neoadjuvant therapy (yes/no) was abstracted with 99% accuracy (F1=0.99).
Surgical and biopsy procedures (yes/no) were abstracted with 98% (F1=0.99) and 96% (F1 = 0.98) accuracy, respectively. 94% of identified key biopsy dates were accurate within 1 day, and 96% were accurate within 30 days. 96% of identified key surgery dates were accurate within 1 day, and 97% were accurate within 30 days. Conclusion: LLM-based systems offer a time-efficient and accurate solution to accelerate abstraction of key data elements for use in oncologic research, including complex multi-step variables. Citation Format: Jillian Nelson, Monica Agrawal, Den E Bloodworth, Divya Gopinath, James Mullenbach, Peter J. Briggs, Dominique Connolly, David Sontag, Alpa V. Patel. Use of large language models for clinical data abstraction from oncologic medical records [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 7427.
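The abstract above reports accuracy and F1 between adjudicated ground truth and LLM-derived answers for yes/no data elements. A minimal sketch of that comparison follows; the function name `accuracy_and_f1` and the five example labels are hypothetical, are not data from the study, and do not represent the Distill platform's implementation.

```python
def accuracy_and_f1(truth, predicted, positive="yes"):
    """Return (accuracy, F1) for one binary data element."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("at least one record is required")
    pairs = list(zip(truth, predicted))
    correct = sum(t == p for t, p in pairs)
    # Confusion counts with respect to the positive class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels for five records (illustration only).
truth = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes"]
acc, f1 = accuracy_and_f1(truth, predicted)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Reporting both numbers matters when a field is imbalanced: accuracy can look high when nearly every record is "no", while F1 reflects how well the positive cases are recovered.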
A Diffusion-Based Autoencoder for Learning Patient-Level Representations from Single-Cell Data
bioRxiv (Cold Spring Harbor Laboratory) · 2025-08-25
preprint · Open access · Senior author · Corresponding
Single-cell RNA sequencing (scRNA-seq) offers insights into cellular heterogeneity and tissue composition, yet leveraging this data for patient-level clinical predictions remains challenging due to the set-structured nature of single-cell data, as well as the scarcity of labeled samples. To address these challenges, we introduce scSet, a diffusion-based autoencoder that learns patient-level representations from sets of single-cell transcriptomes. Our method uses a transformer-based encoder to process variably sized and unordered cell inputs, coupled with a conditional diffusion decoder for self-supervised learning on unlabeled data. By pre-training on large-scale unlabeled datasets, scSet generates robust patient representations that can be fine-tuned for downstream clinical prediction tasks. We demonstrate the effectiveness of scSet patient embeddings for clinical prediction across multiple real-world datasets, where they outperform existing patient representations, even with limited labeled data. This work represents an important step toward bridging the gap between single-cell resolution and patient-level insights. Code is available at https://github.com/clinicalml/scset.
CodingGenie: A Proactive LLM-Powered Programming Assistant
2025-06-23 · 6 citations
article · Open access
While developers increasingly adopt tools powered by large language models (LLMs) in day-to-day workflows, these tools still require explicit user invocation. To seamlessly integrate LLM capabilities to a developer's workflow, we introduce CodingGenie, a proactive assistant integrated into the code editor. CodingGenie autonomously provides suggestions, ranging from bug fixing to unit testing, based on the current code context and allows users to customize suggestions by providing a task description and selecting what suggestions are shown. We demonstrate multiple use cases to show how proactive suggestions from CodingGenie can improve developer experience, and also analyze the cost of adding proactivity. We believe this open-source tool will enable further research into proactive assistants. CodingGenie is open-sourced at https://github.com/sebzhao/CodingGenie/ and video demos are available at https://sebzhao.github.io/CodingGenie/.
Use of Machine Learning to Assess the Management of Uncomplicated Urinary Tract Infection
JAMA Network Open · 2025-01-31 · 3 citations
article · Open access
Importance: Uncomplicated urinary tract infection (UTI) is a common indication for outpatient antimicrobial therapy. National guidelines for the management of uncomplicated UTI were published in 2011, but the extent to which they align with current practices, patient diversity, and pathogen biology, all of which have evolved greatly in the time since their publication, is not fully known. Objective: To reevaluate the effectiveness and adverse event profile for first-line antibiotics, fluoroquinolones, and oral β-lactams for treating uncomplicated UTI in contemporary clinical practice. Design, Setting, and Participants: This retrospective, population-based cohort study used a claims dataset from Independence Blue Cross, which contains inpatient, outpatient, laboratory, and pharmacy claims that occurred between 2012 and 2021, formatted into the Observational Medical Outcomes Partnership (OMOP) common data model. Participants were nonpregnant female individuals aged 18 years or older with a diagnosis of uncomplicated, nonrecurrent UTI at an outpatient setting. Patients must also have been treated with first-line (nitrofurantoin or trimethoprim-sulfamethoxazole), fluoroquinolone (ciprofloxacin, levofloxacin, or ofloxacin), or oral β-lactam (amoxicillin-clavulanate, cefadroxil, or cefpodoxime) antibiotics. Data analysis was performed from November 2021 to August 2024. Exposures: Patients exposed to first-line antibiotics were assigned to the treatment group, and those exposed to fluoroquinolone or β-lactam treatments were assigned to control groups. Main Outcomes and Measures: The primary outcome was a composite end point for treatment failure, defined as outpatient or inpatient revisit within 30 days for UTI, pyelonephritis, or sepsis. Secondary outcomes were the risk of 4 common antibiotic-associated adverse events: gastrointestinal symptoms, rash, kidney injury, and Clostridium difficile infection.
Results: There were 57 585 episodes of UTI among 49 037 female patients (mean [SD] age, 51.7 [20.1] years), with prescriptions for first-line antibiotics in 35 018 episodes (61%), fluoroquinolones in 21 140 episodes (37%), and β-lactams in 1427 episodes (2%). After adjustment, receipt of first-line therapies was associated with an absolute risk difference of -1.78% (95% CI, -2.37% to -1.06%) for having a revisit for UTI within 30 days of diagnosis vs fluoroquinolones. First-line therapies were associated with an absolute risk difference of -6.40% (95% CI, -10.14% to -3.24%) for 30-day revisit compared with β-lactam antibiotics. Differences in adverse events were similar between all comparators. Results were identical for models built with an automated OMOP feature extraction package. Conclusions and Relevance: In this cohort study of patients with uncomplicated UTI derived from a large regional claims dataset, national treatment guidelines published almost 14 years ago continue to recommend optimal treatments. These results also provide proof-of-principle that automated feature extraction methods for OMOP formatted data can emulate manually curated models, thereby promoting reproducibility and generalizability.
Uncovering Bias Mechanisms in Observational Studies
ArXiv.org · 2025-06-01
preprint · Open access · Senior author
Observational studies are a key resource for causal inference but are often affected by systematic biases. Prior work has focused mainly on detecting these biases, via sensitivity analyses and comparisons with randomized controlled trials, or mitigating them through debiasing techniques. However, there remains a lack of methodology for uncovering the underlying mechanisms driving these biases, e.g., whether due to hidden confounding or selection of participants. In this work, we show that the relationship between bias magnitude and the predictive performance of nuisance function estimators (in the observational study) can help distinguish among common sources of causal bias. We validate our methodology through extensive synthetic experiments and a real-world case study, demonstrating its effectiveness in revealing the mechanisms behind observed biases. Our framework offers a new lens for understanding and characterizing bias in observational studies, with practical implications for improving causal inference.
Large Language Models are Powerful Electronic Health Record Encoders
ArXiv.org · 2025-02-24 · 3 citations
preprint · Open access
Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity challenge traditional machine learning. Domain-specific EHR foundation models trained on unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited data access and site-specific vocabularies. We convert EHR data into plain text by replacing medical codes with natural-language descriptions, enabling general-purpose Large Language Models (LLMs) to produce high-dimensional embeddings for downstream prediction tasks without access to private medical training data. LLM-based embeddings perform on par with a specialized EHR foundation model, CLMBR-T-Base, across 15 clinical tasks from the EHRSHOT benchmark. In an external validation using the UK Biobank, an LLM-based model shows statistically significant improvements for some tasks, which we attribute to higher vocabulary coverage and slightly better generalization. Overall, we reveal a trade-off between the computational efficiency of specialized EHR models and the portability and data independence of LLM-based embeddings.
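The abstract above describes converting EHR data to plain text by replacing medical codes with natural-language descriptions. The sketch below illustrates only that serialization step under simple assumptions: the `CODE_DESCRIPTIONS` lookup, the `ICD10/` code prefix, the `serialize_record` helper, and the example events are all hypothetical and are not taken from the paper. The embedding step is omitted.

```python
# Illustrative code-to-description lookup (assumption); a real pipeline
# would draw on a full medical vocabulary rather than two entries.
CODE_DESCRIPTIONS = {
    "ICD10/E11.9": "Type 2 diabetes mellitus without complications",
    "ICD10/I10": "Essential (primary) hypertension",
}


def serialize_record(events, descriptions=CODE_DESCRIPTIONS):
    """Turn (ISO date, code) events into plain text, one line per event, oldest first."""
    lines = []
    # ISO-formatted dates sort chronologically as plain strings.
    for date, code in sorted(events):
        # Codes missing from the lookup are kept visible rather than dropped.
        text = descriptions.get(code, f"unknown code {code}")
        lines.append(f"{date}: {text}")
    return "\n".join(lines)


# Hypothetical patient timeline (illustration only).
events = [("2021-03-02", "ICD10/I10"), ("2020-11-15", "ICD10/E11.9")]
record_text = serialize_record(events)
print(record_text)
```

The resulting string could then be passed to any general-purpose text embedding model; which model and prompt format to use is left open here.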
Need Help? Designing Proactive AI Assistants for Programming
2025-04-24 · 14 citations
article · Open access
UNC Libraries · 2025-03-18
article · Open access
OBJECTIVES: A novel longitudinal clustering technique was applied to comprehensive autoantibody data from a large, well-characterised, multinational inception systemic lupus erythematosus (SLE) cohort to determine profiles predictive of clinical outcomes. METHODS: Demographic, clinical and serological data from 805 patients with SLE obtained within 15 months of diagnosis and at 3-year and 5-year follow-up were included. For each visit, sera were assessed for 29 antinuclear antibodies (ANA) immunofluorescence patterns and 20 autoantibodies. K-means clustering on principal component analysis-transformed longitudinal autoantibody profiles identified discrete phenotypic clusters. One-way analysis of variance compared cluster enrolment demographics and clinical outcomes at 10-year follow-up. Cox proportional hazards model estimated the HR for survival adjusting for age of disease onset. RESULTS: Cluster 1 (n=137, high frequency of anti-Smith, anti-U1RNP, AC-5 (large nuclear speckled pattern) and high ANA titres) had the highest cumulative disease activity and immunosuppressants/biologics use at year 10. Cluster 2 (n=376, low anti-double stranded DNA (dsDNA) and ANA titres) had the lowest disease activity, frequency of lupus nephritis and immunosuppressants/biologics use. Cluster 3 (n=80, highest frequency of all five antiphospholipid antibodies) had the highest frequency of seizures and hypocomplementaemia. Cluster 4 (n=212) also had high disease activity and was characterised by multiple autoantibody reactivity including to antihistone, anti-dsDNA, antiribosomal P, anti-Sjögren syndrome antigen A or Ro60, anti-Sjögren syndrome antigen B or La, anti-Ro52/Tripartite Motif Protein 21, antiproliferating cell nuclear antigen and anticentromere B. Clusters 1 (adjusted HR 2.60 (95% CI 1.12 to 6.05), p=0.03) and 3 (adjusted HR 2.87 (95% CI 1.22 to 6.74), p=0.02) had lower survival compared with cluster 2.
CONCLUSION: Four discrete SLE patient longitudinal autoantibody clusters were predictive of long-term disease activity, organ involvement, treatment requirements and mortality risk.
Recent grants
CAREER: Exact Algorithms for Learning Latent Structure
NSF · $500k · 2014–2017
NSF · $600k · 2022–2026
CAREER: Exact Algorithms for Learning Latent Structure
NSF · $351k · 2017–2020
AitF: Collaborative Research: Algorithms for Probabilistic Inference in the Real World
NSF · $400k · 2017–2022
Frequent coauthors
- Fredrik Johansson · 30 shared
- Michael Oberst · 28 shared
- Monica Agrawal · Massachusetts Institute of Technology · 26 shared
- Steven Horng · Beth Israel Deaconess Medical Center · 25 shared
- Yoni Halpern · Google (United States) · 22 shared
- Hunter Lang · 21 shared
- Larry Nathanson · Beth Israel Deaconess Medical Center · 19 shared
- Irene Y. Chen · University of Rochester Medical Center · 18 shared
Labs
MIT EECS - David Sontag Lab · PI