David Sontag

Verified

Massachusetts Institute of Technology · Electrical Engineering & Computer Science

Active 1981–2026

h-index: 53
Citations: 13.5k
Papers: 292 (127 last 5y)
Funding: $1.9M (1 active)
About

David Sontag is an Associate Professor in MIT's EECS department, specializing in Artificial Intelligence and Decision-making. His research areas include AI for Healthcare and Life Sciences and Natural Language and Speech Processing, along with techniques for systems that interact with the external world through perception, communication, and action. His work draws on intellectual traditions from computer science and electrical engineering to analyze and synthesize systems that learn, make decisions, and adapt to changing environments.

Research topics

  • Artificial Intelligence
  • Data Mining
  • Computer Science
  • Machine Learning
  • Biology
  • Bioinformatics

Selected publications

  • LLMs can construct powerful representations and streamline sample-efficient supervised learning

    ArXiv.org · 2026-03-12

    article · Open access · Senior author

    As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer operational advantages such as being easy to audit, cost-effectiveness at scale, and facilitating tabular representations.

  • Abstract 7427: Use of large language models for clinical data abstraction from oncologic medical records

    Cancer Research · 2025-04-21

    article

    Introduction: Abstraction of clinical data from unstructured medical records is a labor-intensive process. Large language model (LLM)-based systems can accelerate the availability of data for research use. Participants and Methods: The American Cancer Society’s Cancer Prevention Study-3 is a large, nationwide prospective cohort study of ∼300,000 cancer-free participants enrolled between 2006 and 2013. Participants who self-reported a diagnosis of breast, colorectal, ovarian, or any blood cancer on the 2021 follow-up survey were consented for the retrieval of medical records (available as a single PDF) from diagnostic and treatment facilities. Guidelines, e.g. ontologies and decision trees, were generated to precisely define abstraction tasks, and relevant research data were annotated via human abstraction. This project then used 300 breast cancer-related medical records to develop (n=200) and test (n=100) the performance of Layer Health’s LLM-based platform Distill to abstract seven data elements from the medical record: cancer behavior, laterality, neoadjuvant therapy status, and presence of key biopsy or surgery with associated procedure dates. Both an answer and corresponding evidence from the record were provided by their algorithms; this evidence was used to conduct quality control on a small fraction of data points that were automatically flagged by inter-variable consistency checks. An ACS oncology data specialist adjudicated disagreements between human and LLM-based abstraction to update ground truth labels. Accuracy and F1 statistics were then calculated between ground truth and LLM-derived results. Results: Results on the test set (n=100) were available on the same day; adjudication of 25 discrepancies between human and LLM-based abstraction found 12 LLM errors and 13 human errors. Cancer behavior and laterality were abstracted with 100% accuracy by the LLM system. Neoadjuvant therapy (yes/no) was abstracted with 99% accuracy (F1=0.99). Surgical and biopsy procedures (yes/no) were abstracted with 98% (F1=0.99) and 96% (F1=0.98) accuracy, respectively. 94% of identified key biopsy dates were accurate within 1 day, and 96% were accurate within 30 days. 96% of identified key surgery dates were accurate within 1 day, and 97% were accurate within 30 days. Conclusion: LLM-based systems offer a time-efficient and accurate solution to accelerate abstraction of key data elements for use in oncologic research, including complex multi-step variables. Citation Format: Jillian Nelson, Monica Agrawal, Den E Bloodworth, Divya Gopinath, James Mullenbach, Peter J. Briggs, Dominique Connolly, David Sontag, Alpa V. Patel. Use of large language models for clinical data abstraction from oncologic medical records [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 7427.

  • A Diffusion-Based Autoencoder for Learning Patient-Level Representations from Single-Cell Data

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-08-25

    preprint · Open access · Senior author · Corresponding

    Single-cell RNA sequencing (scRNA-seq) offers insights into cellular heterogeneity and tissue composition, yet leveraging this data for patient-level clinical predictions remains challenging due to the set-structured nature of single-cell data, as well as the scarcity of labeled samples. To address these challenges, we introduce scSet, a diffusion-based autoencoder that learns patient-level representations from sets of single-cell transcriptomes. Our method uses a transformer-based encoder to process variably sized and unordered cell inputs, coupled with a conditional diffusion decoder for self-supervised learning on unlabeled data. By pre-training on large-scale unlabeled datasets, scSet generates robust patient representations that can be fine-tuned for downstream clinical prediction tasks. We demonstrate the effectiveness of scSet patient embeddings for clinical prediction across multiple real-world datasets, where they outperform existing patient representations, even with limited labeled data. This work represents an important step toward bridging the gap between single-cell resolution and patient-level insights. Code is available at https://github.com/clinicalml/scset.

  • CodingGenie: A Proactive LLM-Powered Programming Assistant

    2025-06-23 · 6 citations

    article · Open access

    While developers increasingly adopt tools powered by large language models (LLMs) in day-to-day workflows, these tools still require explicit user invocation. To seamlessly integrate LLM capabilities to a developer's workflow, we introduce CodingGenie, a proactive assistant integrated into the code editor. CodingGenie autonomously provides suggestions, ranging from bug fixing to unit testing, based on the current code context and allows users to customize suggestions by providing a task description and selecting what suggestions are shown. We demonstrate multiple use cases to show how proactive suggestions from CodingGenie can improve developer experience, and also analyze the cost of adding proactivity. We believe this open-source tool will enable further research into proactive assistants. CodingGenie is open-sourced at https://github.com/sebzhao/CodingGenie/ and video demos are available at https://sebzhao.github.io/CodingGenie/.

  • Use of Machine Learning to Assess the Management of Uncomplicated Urinary Tract Infection

    JAMA Network Open · 2025-01-31 · 3 citations

    article · Open access

    Importance: Uncomplicated urinary tract infection (UTI) is a common indication for outpatient antimicrobial therapy. National guidelines for the management of uncomplicated UTI were published in 2011, but the extent to which they align with current practices, patient diversity, and pathogen biology, all of which have evolved greatly in the time since their publication, is not fully known. Objective: To reevaluate the effectiveness and adverse event profile for first-line antibiotics, fluoroquinolones, and oral β-lactams for treating uncomplicated UTI in contemporary clinical practice. Design, Setting, and Participants: This retrospective, population-based cohort study used a claims dataset from Independence Blue Cross, which contains inpatient, outpatient, laboratory, and pharmacy claims that occurred between 2012 and 2021, formatted into the Observational Medical Outcomes Partnership (OMOP) common data model. Participants were nonpregnant female individuals aged 18 years or older with a diagnosis of uncomplicated, nonrecurrent UTI at an outpatient setting. Patients must also have been treated with first-line (nitrofurantoin or trimethoprim-sulfamethoxazole), fluoroquinolone (ciprofloxacin, levofloxacin, or ofloxacin), or oral β-lactam (amoxicillin-clavulanate, cefadroxil, or cefpodoxime) antibiotics. Data analysis was performed from November 2021 to August 2024. Exposures: Patients exposed to first-line antibiotics were assigned to the treatment group, and those exposed to fluoroquinolone or β-lactam treatments were assigned to control groups. Main Outcomes and Measures: The primary outcome was a composite end point for treatment failure, defined as outpatient or inpatient revisit within 30 days for UTI, pyelonephritis, or sepsis. Secondary outcomes were the risk of 4 common antibiotic-associated adverse events: gastrointestinal symptoms, rash, kidney injury, and Clostridium difficile infection. Results: There were 57 585 episodes of UTI among 49 037 female patients (mean [SD] age, 51.7 [20.1] years), with prescriptions for first-line antibiotics in 35 018 episodes (61%), fluoroquinolones in 21 140 episodes (37%), and β-lactams in 1427 episodes (2%). After adjustment, receipt of first-line therapies was associated with an absolute risk difference of -1.78% (95% CI, -2.37% to -1.06%) for having a revisit for UTI within 30 days of diagnosis vs fluoroquinolones. First-line therapies were associated with an absolute risk difference of -6.40% (95% CI, -10.14% to -3.24%) for 30-day revisit compared with β-lactam antibiotics. Differences in adverse events were similar between all comparators. Results were identical for models built with an automated OMOP feature extraction package. Conclusions and Relevance: In this cohort study of patients with uncomplicated UTI derived from a large regional claims dataset, national treatment guidelines published almost 14 years ago continue to recommend optimal treatments. These results also provide proof-of-principle that automated feature extraction methods for OMOP formatted data can emulate manually curated models, thereby promoting reproducibility and generalizability.

  • Uncovering Bias Mechanisms in Observational Studies

    ArXiv.org · 2025-06-01

    preprint · Open access · Senior author

    Observational studies are a key resource for causal inference but are often affected by systematic biases. Prior work has focused mainly on detecting these biases, via sensitivity analyses and comparisons with randomized controlled trials, or mitigating them through debiasing techniques. However, there remains a lack of methodology for uncovering the underlying mechanisms driving these biases, e.g., whether due to hidden confounding or selection of participants. In this work, we show that the relationship between bias magnitude and the predictive performance of nuisance function estimators (in the observational study) can help distinguish among common sources of causal bias. We validate our methodology through extensive synthetic experiments and a real-world case study, demonstrating its effectiveness in revealing the mechanisms behind observed biases. Our framework offers a new lens for understanding and characterizing bias in observational studies, with practical implications for improving causal inference.

  • Large Language Models are Powerful Electronic Health Record Encoders

    ArXiv.org · 2025-02-24 · 3 citations

    preprint · Open access

    Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity challenge traditional machine learning. Domain-specific EHR foundation models trained on unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited data access and site-specific vocabularies. We convert EHR data into plain text by replacing medical codes with natural-language descriptions, enabling general-purpose Large Language Models (LLMs) to produce high-dimensional embeddings for downstream prediction tasks without access to private medical training data. LLM-based embeddings perform on par with a specialized EHR foundation model, CLMBR-T-Base, across 15 clinical tasks from the EHRSHOT benchmark. In an external validation using the UK Biobank, an LLM-based model shows statistically significant improvements for some tasks, which we attribute to higher vocabulary coverage and slightly better generalization. Overall, we reveal a trade-off between the computational efficiency of specialized EHR models and the portability and data independence of LLM-based embeddings.

  • Need Help? Designing Proactive AI Assistants for Programming

    2025-04-24 · 14 citations

    article · Open access

  • Machine learning identifies clusters of longitudinal autoantibody profiles predictive of systemic lupus erythematosus disease outcomes

    UNC Libraries · 2025-03-18

    article · Open access

    OBJECTIVES: A novel longitudinal clustering technique was applied to comprehensive autoantibody data from a large, well-characterised, multinational inception systemic lupus erythematosus (SLE) cohort to determine profiles predictive of clinical outcomes. METHODS: Demographic, clinical and serological data from 805 patients with SLE obtained within 15 months of diagnosis and at 3-year and 5-year follow-up were included. For each visit, sera were assessed for 29 antinuclear antibodies (ANA) immunofluorescence patterns and 20 autoantibodies. K-means clustering on principal component analysis-transformed longitudinal autoantibody profiles identified discrete phenotypic clusters. One-way analysis of variance compared cluster enrolment demographics and clinical outcomes at 10-year follow-up. Cox proportional hazards model estimated the HR for survival adjusting for age of disease onset. RESULTS: Cluster 1 (n=137, high frequency of anti-Smith, anti-U1RNP, AC-5 (large nuclear speckled pattern) and high ANA titres) had the highest cumulative disease activity and immunosuppressants/biologics use at year 10. Cluster 2 (n=376, low anti-double stranded DNA (dsDNA) and ANA titres) had the lowest disease activity, frequency of lupus nephritis and immunosuppressants/biologics use. Cluster 3 (n=80, highest frequency of all five antiphospholipid antibodies) had the highest frequency of seizures and hypocomplementaemia. Cluster 4 (n=212) also had high disease activity and was characterised by multiple autoantibody reactivity including to antihistone, anti-dsDNA, antiribosomal P, anti-Sjögren syndrome antigen A or Ro60, anti-Sjögren syndrome antigen B or La, anti-Ro52/Tripartite Motif Protein 21, antiproliferating cell nuclear antigen and anticentromere B. Clusters 1 (adjusted HR 2.60 (95% CI 1.12 to 6.05), p=0.03) and 3 (adjusted HR 2.87 (95% CI 1.22 to 6.74), p=0.02) had lower survival compared with cluster 2. CONCLUSION: Four discrete SLE patient longitudinal autoantibody clusters were predictive of long-term disease activity, organ involvement, treatment requirements and mortality risk.

Frequent coauthors

  • Fredrik Johansson

    30 shared
  • Michael Oberst

    28 shared
  • Monica Agrawal

    Massachusetts Institute of Technology

    26 shared
  • Steven Horng

    Beth Israel Deaconess Medical Center

    25 shared
  • Yoni Halpern

    Google (United States)

    22 shared
  • Hunter Lang

    21 shared
  • Larry Nathanson

    Beth Israel Deaconess Medical Center

    19 shared
  • Irene Y. Chen

    University of Rochester Medical Center

    18 shared

Labs

  • MIT EECS - David Sontag Lab · PI
