Lei Cao · Assistant Professor · Verified

University of Arizona · Computer Science

Active 1932–2026

h-index: 24
Citations: 3.5k
Papers: 222 (102 in the last 5 years)

About

Lei Cao is an Assistant Professor in the Computer Science department at the University of Arizona. He holds a research affiliation at MIT CSAIL, where he spent several years as a Postdoctoral Associate and then as a Research Scientist. During his time at MIT, he actively collaborated with prominent researchers including Prof. Samuel Madden, Prof. Michael Stonebraker, Prof. Tim Kraska, and Dr. Michael Cafarella. Prior to his academic career, Lei Cao worked as a Research Staff Member at IBM T.J. Watson Research Center. His research spans broad areas of data systems and data science, covering topics from low-level core database performance optimization to the design of high-level, application-specific machine learning techniques. His recent work focuses on the emerging area of "Systems for AI and AI for Systems," aiming to build data management and analytics tools that satisfy the SAUL properties: Scalable, Automatic, and Human-in-the-loop. Lei Cao's group actively welcomes research interns to contribute to projects developing next-generation AI-powered data systems.

Research topics

  • Medicine
  • Internal medicine
  • Environmental health
  • Demography
  • Geography
  • Gerontology
  • Emergency medicine
  • Surgery
  • Medical emergency
  • Physical therapy

Selected publications

  • A Bitter Lesson for Retail Demand Forecasting: Evidence from Fine-Tuning Foundation Models

    SSRN Electronic Journal · 2026-01-01

    preprint · Open access · Senior author
  • BRIEF: Bi-Level Coreset Selection for Efficient Instruction Tuning in LLMs

    Proceedings of the VLDB Endowment · 2026-02-01

    article · Senior author

    Instruction tuning is a key step in adapting large language models (LLMs) to effectively understand and follow human instructions. It enables LLMs to transform general knowledge into task-specific responses that align with user intent. Although many high-quality instruction tuning datasets have been released, efficiently utilizing these data sources during supervised fine-tuning (SFT) is important, as training on the full high-quality corpus can be computationally expensive. To address this inefficiency, we explore whether a compact, high-quality subset of instruction data can achieve comparable performance to full-dataset SFT, thereby reducing training cost without sacrificing effectiveness. To this end, this work proposes to select such a subset (a.k.a., coreset) of instruction examples that maintains comparable downstream performance while improving training efficiency. The key idea is inspired by our discovered decomposition that in instruction tuning, the training loss can be decomposed into two components that effectively quantify the contribution of an instruction to the two fundamental capabilities of LLMs, namely knowledge-related capability and instruction following capability. We then revisit the objective of the classical coreset approaches to balance the two capabilities when selecting instruction examples. Based on a bi-level formulation and a composite gradient distance that makes the objective submodular, we design an effective algorithm to achieve a bounded approximation error. Experiments on 4 datasets across 9 downstream tasks demonstrate that BRIEF reduces computational costs by 3× while improving accuracy by 5% on Llama-3.1-8B, Qwen3-4B and Mistral-Nemo-12B.

  • High-Precision Digital Reconstruction and Conservation of Architectural Heritage Based on Virtual Reality

    Buildings · 2026-05-11

    article · Open access · Senior author · Corresponding

    The conservation and restoration of architectural heritage face dual challenges from natural erosion and human interference, necessitating the adoption of efficient and non-contact digital technologies to achieve sustainable preservation. Virtual reality (VR) technology, with its advantages of immersion, interactivity, and visualization, provides a novel technological pathway for digital documentation, conservation decision-making, and public presentation of architectural heritage. Taking the Fuliang Red Pagoda in Jingdezhen, Jiangxi Province, as the research object, this study constructs a high-precision digital reconstruction and VR interactive application workflow based on the integration of terrestrial laser scanning and close-range photogrammetry. Through point cloud denoising, Iterative Closest Point (ICP) registration, and Poisson surface reconstruction algorithms, a refined three-dimensional model of the pagoda is achieved, and an immersive VR system is developed with functions including component information query, virtual restoration scheme switching, and interactive exploration. The results demonstrate that this technical workflow not only enables non-contact digital archiving of the Fuliang Red Pagoda but also provides a visual decision-support tool for conservation interventions. Under full-scene operation, the system achieves an average rendering frame rate of 92 FPS and maintains motion-to-photon latency below 20 ms, ensuring good real-time performance and interaction stability. The findings indicate that VR-based digital technologies can enhance the scientific rigor of conservation planning and promote public engagement while adhering to the principles of authenticity and minimum intervention. This study provides a replicable technical pathway and practical reference for high-precision digital reconstruction and sustainable conservation of historic buildings.

  • KEN: An Execution Engine for Unstructured Database Systems

    Proceedings of the VLDB Endowment · 2026-01-01

    article

    Unstructured database management systems (UDBMSes) leverage machine learning to apply the relational model to modalities beyond tables, such as documents, images and videos. Queries in a UDBMS consist of logical operators for which the UDBMS chooses physical implementations (e.g., different models) with the goal to optimize both query latency and accuracy. However, many operators only expose a coarse-grained set of implementations, forcing the UDBMS to excessively sacrifice either accuracy or latency without middle-ground options. For example, an entity matching operator can either be implemented through small, specialized models or large, general-purpose models (e.g., Large Language Models); while the former struggles on challenging inputs, the latter is more accurate but incurs orders of magnitude more computation. In this work, we aim to address this issue with model cascades, which seek to process "easy" inputs with small models and only resort to large models when necessary. However, cascades incur higher memory usage and additional data transfer between GPU memory and arithmetic units, which often slows queries compared to single models. To address this issue, we introduce Ken, a dedicated UDBMS execution engine that dynamically adapts its use of cascades to the query load, and optimizes the GPU placement and invocation scheduling of the cascade models. Compared to baselines, Ken achieves 1.7×–3.3× latency reductions when combining similar models for a single operator, and 122× latency reductions when combining models with orders of magnitude size differences in a multi-operator query.

  • Not All Documents Are What You Need for Extracting Instruction Tuning Data

    ArXiv.org · 2025-05-18

    preprint · Open access · Senior author

    Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B.

  • UniCell: Towards a Unified Solution for Cell Annotation, Nomenclature Harmonization, Atlas Construction in Single-Cell Transcriptomics

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-05-11

    preprint · Open access

    Standardizing cell type annotations across single-cell RNA-seq datasets remains a major challenge due to inconsistencies in nomenclature, variation in annotation granularity, and the presence of rare or previously unseen populations. We present UniCell, a hierarchical annotation framework that combines Cell Ontology structure with transcriptomic data for scalable, interpretable, and ontology-aware cell identity inference. UniCell leverages a multi-task architecture that jointly optimizes local and global classifiers, yielding coherent predictions across multiple levels of the ontology-defined hierarchy. When benchmarked across 20 human and mouse datasets, UniCell consistently outperformed state-of-the-art tools, including CellTypist, scANVI, OnClass, and SingleR, in annotation performance and sensitivity to low-abundance populations. In disease settings, UniCell effectively identified previously unseen cell types through confidence-guided novelty detection. Applied to 45 human and 23 mouse tissue atlases, UniCell enabled cross-dataset and cross-species harmonization by embedding cells into a unified latent space aligned with Cell Ontology structure. Moreover, when used to supervise single-cell foundation models, UniCell substantially improved downstream annotation accuracy, rare cell detection, and hierarchical consistency. Together, these results establish UniCell as a generalizable framework that supports high-resolution annotation, nomenclature standardization, and atlas-level integration, providing a scalable and biologically grounded solution for single-cell transcriptomic analysis across diverse biological systems.

  • The power of peers: how common ownership networks shape corporate digitalization

    Chinese Management Studies · 2025-09-13 · 1 citation

    article

    Purpose: This study aims to investigate peer influence mechanisms in corporate digital transformation within common ownership networks, and extends social network theory in strategic management by examining how these interconnected ownership structures shape firms’ transformation strategies. Design/methodology/approach: Using panel data from Chinese A-share listed companies (2014–2023), this study uses social network analysis to construct common ownership networks and applies econometric models to test for peer effects. The research further examines network centrality as a moderator and the influence of industry leaders’ demonstration effects on follower firms. Findings: The results confirm robust peer effects on digital transformation decisions within common ownership networks. Network centrality enhances these effects, rendering centrally located firms more susceptible to peer influence. Industry leaders accelerate transformation among follower firms through demonstration effects. Information diffusion via network ties, competitive pressure and organizational learning are identified as key underlying mechanisms. The study also documents significant heterogeneity in these effects across ownership structures, geographical concentrations and industrial characteristics, and finds that innovation capability mediates the relationship between digital transformation and corporate productivity. Originality/value: This research contributes to network governance literature by empirically demonstrating the influence of common ownership networks on corporate digital transformation. It offers a framework identifying key peer effect mechanisms (organizational learning, information diffusion and competitive pressure) and clarifies the moderating role of network centrality. These findings deepen theoretical understanding and provide practical insights for the strategic management of digital transformation.

  • DGraFormer: Dynamic Graph Learning Guided Multi-Scale Transformer for Multivariate Time Series Forecasting

    2025-09-01 · 1 citation

    article

    Multivariate time series forecasting is a critical focus across many fields. Existing transformer-based models have overlooked the explicit modeling of inter-variable correlations. Similarly, graph-based methods have failed to address the dynamic nature of multivariate correlations and the noise in correlation modeling. To overcome these challenges, we propose a novel Dynamic Graph Learning Guided Multi-Scale Transformer (DGraFormer) for multivariate time series forecasting. Specifically, our method consists of two main components: dynamic correlation-aware graph learning (DCGL) and a multi-scale temporal transformer (MTT). The former captures dynamic correlations across different time windows, filters out noise, and selects key weights to guide the aggregation of relevant feature representations. The latter can effectively extract temporal patterns from patch data at varying scales. Finally, the proposed method can capture rich local correlation graph structures and multi-scale global temporal features. Experimental results demonstrate that DGraFormer significantly outperforms existing state-of-the-art models on ten real-world datasets, achieving the best performance across multiple evaluation metrics. The source code of our model is available at https://anonymous.4open.science/r/DGraFormer.

  • Two Birds with One Stone: Efficient Deep Learning over Mislabeled Data through Subset Selection

    Proceedings of the ACM on Management of Data · 2025-06-17

    article

    Using a large training dataset to train a big and powerful model, a typical practice in modern deep learning, often suffers from two major problems: the expensive and slow training process and the error-prone labels. The existing approaches, targeting either speeding up the training by selecting a subset of representative training instances (subset selection) or eliminating the negative effect of mislabels during training (mislabel detection), do not perform well in this scenario because each overlooks one of these two problems. To fill this gap, we propose Deem, a novel data-efficient framework that selects a subset of representative training instances under label uncertainty. The key idea is to leverage the metadata produced during deep learning training, e.g., training losses and gradients, to estimate the label uncertainty and select the representative instances. In particular, we model the problem of subset selection under uncertainty as a problem of finding a subset that closely approximates the gradient of the whole training data set derived on soft labels. We show that it is an NP-hard problem with submodular property and propose a low complexity algorithm to solve this problem with an approximation ratio. Training on this small subset thus improves the training efficiency while guaranteeing the model's accuracy. Moreover, we propose an efficient strategy to dynamically refine this subset during the iterative training process. Extensive experiments on 6 datasets and 10 baselines demonstrate that Deem accelerates the training process up to 10X without sacrificing the model accuracy.

  • Multiple cosmic strings in Chern–Simons–Higgs theory with gravity

    Nonlinear Analysis · 2025-07-08

    article · 1st author · Corresponding

Frequent coauthors

  • Samuel Madden

    34 shared
  • Elke A. Rundensteiner

    31 shared
  • Wen‐Jun Tu

    Capital Medical University

    20 shared
  • Nan Tang

    Hong Kong University of Science and Technology

    15 shared
  • Yizhou Yan

    Tsinghua University

    14 shared
  • Longde Wang

    National Health and Family Planning Commission

    12 shared
  • Wei Zhang

    KK Women's and Children's Hospital

    10 shared
  • Chongjing Lv

    Shandong Marine Resource and Environment Research Institute

    9 shared

Labs

  • Lei Cao's Lab · PI

    Research in data systems and data science, focusing on systems for AI and AI for systems.
