Luke Miratrix

Professor of Education (GSE) · Verified

Harvard University · Biostatistics

Active 2008–2026

h-index 23

Citations 2.4k

Papers 142 · 55 last 5y

Funding$330k

About

Luke W. Miratrix is a Professor of Education at the Harvard Graduate School of Education and is affiliated with the Department of Statistics. His research interests include causal inference, randomized experiments, text analysis, and simulation design. He is based at Harvard in Cambridge, Massachusetts.

Research topics

Sociology
Computer Science
Statistics
Demography
Geography
Medicine
Mathematics
Econometrics
Psychology

Selected publications

Stratified Sampling for Model-Assisted Estimation with Surrogate Outcomes
Open MIND · 2026-02-13
preprint · Senior author
In many randomized trials, outcomes such as essays or open-ended responses must be manually scored as a preliminary step to impact analysis, a process that is costly and limiting. Model-assisted estimation offers a way to combine surrogate outcomes generated by machine learning or large language models with a human-coded subset, yet typical implementations use simple random sampling and therefore overlook systematic variation in surrogate prediction error. We extend this framework by incorporating stratified sampling to more efficiently allocate human coding effort. We derive the exact variance of the stratified model-assisted estimator, characterize conditions under which stratification improves precision, and identify a Neyman-type optimal allocation rule that oversamples strata with larger residual variance. We evaluate our methods through a comprehensive simulation study to assess finite-sample performance. Overall, we find stratification consistently improves efficiency when surrogate prediction errors exhibit structured bias or heteroskedasticity. We also present two empirical applications, one using data from an education RCT and one using a large observational corpus, to illustrate how these methods can be implemented in practice using ChatGPT-generated surrogate outcomes. Overall, this framework provides a practical design-based approach for leveraging surrogate outcomes and strategically allocating human coding effort to obtain unbiased estimates with greater efficiency. While motivated by text-as-data applications, the methodology applies broadly to any setting where outcome measurement is costly or prohibitive, and can be applied to comparisons across groups or estimating the mean of a single group.
Stratified Sampling for Model-Assisted Estimation with Surrogate Outcomes
arXiv (Cornell University) · 2026-02-13
article · Open access · Senior author
Improving Estimation for Two-Dimensional Regression Discontinuity Designs in Education With Gaussian Process Regression
Journal of Educational and Behavioral Statistics · 2026-02-12
article · Senior author
Sometimes a treatment, such as receiving a high school diploma, is assigned to students if their scores on two inputs (e.g., math and English test scores) are above established cutoffs. This forms a multidimensional regression discontinuity design (RDD) where there are two running variables instead of one. Present methods for estimating such designs either collapse the two running variables into a single running variable, estimate two separate one-dimensional RDDs, or jointly model the entire response surface. The first two approaches may lose valuable information, while the third approach can be very sensitive to model misspecification. We examine an alternative approach, developed in the context of geographic RDDs, which uses Gaussian processes to flexibly model the response surfaces and estimate the impact of treatment along the full range of students who were on the margin of receiving treatment. We also discuss parametric and nonparametric surface response methods in general, which have been underexplored in multidimensional RDDs for education. We demonstrate theoretically, in simulation, and in an applied example, that the Gaussian process approach has several advantages over current approaches, including other surface response methods. In particular, using Gaussian process regression in two-dimensional RDDs shows strong coverage and standard error estimation and allows for easy examination of treatment effect variation for students with different patterns of running variables and outcomes. As nonparametric approaches are new in education-specific RDDs, we also provide an R package for users to estimate treatment effects using these methods.
Toward accounting for the effects of gender socialization in quantitative research in human–computer interaction
Interacting with Computers · 2025-06-12 · 1 citation
article · Open access
In quantitative HCI research, gender is typically represented as a single categorical variable and data from non-binary participants are frequently excluded from analyses. Meanwhile, many scholars argue that gender is a complex, multidimensional construct, and that overly simplistic operationalization of gender risks that our theories will generalize poorly, have limited explanatory power, and will exclude experiences of individuals whose gender identities are not included in our analyses. In this work, we modeled gender as inclusive of multiple dimensions of gender socialization and we operationalized gender socialization through a subset of the items from the Conformity to Masculine Norms Inventory (CMNI). We replicated three studies of basic cognitive abilities (theory of mind, mental rotation, spatial working memory) that previously showed gender differences. For two of the studies, adding CMNI variables significantly and substantially improved the explanatory power of regression models. Also, in those studies, more than half of the effect of binary gender was mediated through the CMNI variables. These results suggest that gender socialization rather than categorical gender explains a substantial part of the individual differences on some cognitive tasks. Consequently, differences in task performance associated with gender categories may not be universal, i.e., they may not generalize to people from other cultures or eras where people are socialized into their gender roles differently. Instead, including multidimensional representations of gender may produce more accurate and more generalizable models. Given that our results also showed that CMNI might not model non-binary participants the same way as men and women, it remains an open question what specific instruments should be used to represent gender in quantitative analyses.
More power to you: Using machine learning to augment human coding for more efficient inference in text-based randomized trials
The Annals of Applied Statistics · 2025-03-01 · 1 citation
article · Senior author
For randomized trials that use text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by trained human raters. This process, the current standard, is both time-consuming and limiting: even the largest human coding efforts are typically constrained to measure only a small set of dimensions across a subsample of available texts. In this work we present an inferential framework that can be used to increase the power of an impact assessment, given a fixed human-coding budget, by taking advantage of any “untapped” observations—those documents not manually scored due to time or resource constraints—as a supplementary resource. Our approach, a methodological combination of causal inference, survey sampling methods, and machine learning, has four steps: (1) select and code a sample of documents, (2) build a machine learning model to predict the human-coded outcomes from a set of automatically extracted text features, (3) generate machine-predicted scores for all documents and use these scores to estimate treatment impacts, and (4) adjust the final impact estimates using the residual differences between human-coded and machine-predicted outcomes. This final step ensures any biases in the modeling procedure do not propagate to biases in final estimated effects. Through an extensive simulation study and an application to a recent field trial in education, we show that our proposed approach can be used to reduce the scope of a human-coding effort while maintaining nominal power to detect a significant treatment impact.
Multilevel Metamodels: Enhancing Inference, Interpretability, and Generalizability in Monte Carlo Simulation Studies
Multivariate Behavioral Research · 2025-11-19
article · Senior author
Metamodels, or the regression analysis of Monte Carlo simulation results, provide a powerful tool to summarize simulation findings. However, an underutilized approach is the multilevel metamodel (MLMM) that accounts for the dependent data structure that arises from fitting multiple models to the same simulated data set. In this study, we articulate the theoretical rationale for the MLMM and illustrate how it can improve the interpretability of simulation results, better account for complex simulation designs, and provide new insights into the generalizability of simulation findings.
Item-Level Heterogeneity in Value Added Models: Implications for Reliability, Cross-Study Comparability, and Effect Sizes
Journal of Educational and Behavioral Statistics · 2025-12-28 · 1 citation
article
Value added models (VAMs) attempt to estimate the causal effects of teachers and schools on student test scores. We apply Generalizability Theory to show how estimated VA effects depend upon the selection of test items. Standard VAMs estimate causal effects on the items that are included on the test. Generalizability demands consideration of how estimates would differ had the test included alternative items. We introduce a model that estimates the magnitude of item-by-teacher/school variance accurately, revealing that standard VAMs can overstate reliability and overestimate differences between units. Using 16 academic outcomes from 8 studies with item-level data, we show how standard VAMs overstate reliability by a median of 0.04 on the 0 to 1 reliability scale (mean = 0.09, SD = 0.10) and provide standard deviations of teacher/school effects that are a median of 3% too large (mean = 12%, SD = 23% points). We discuss how imprecision due to heterogeneous VA effects across items attenuates effect sizes, complicates comparisons across studies, and contributes to temporal instability, though these effects are reduced when the number of items is high. Our results suggest that accurate estimation and interpretation of VAMs may be improved using item-level data, including qualitative data about how items represent the content domain.
Variance estimation after matching or re-weighting
ArXiv.org · 2025-06-12
preprint · Open access · Senior author
This paper develops a variance estimation framework for matching estimators that enables valid population inference for treatment effects. We provide theoretical analysis of a variance estimator that addresses key limitations in the existing literature. While Abadie and Imbens (2006) proposed a foundational variance estimator requiring matching for both treatment and control groups, this approach is computationally prohibitive and rarely used in practice. Our method provides a computationally feasible alternative that only requires matching treated units to controls while maintaining theoretical validity for population inference. We make three main contributions. First, we establish consistency and asymptotic normality for our variance estimator, proving its validity for average treatment effect on the treated (ATT) estimation in settings with small treated samples. Second, we develop a generalized theoretical framework with novel regularity conditions that significantly expand the class of matching procedures for which valid inference is available, including radius matching, M-nearest neighbor matching, and propensity score matching. Third, we demonstrate that our approach extends naturally to other causal inference estimators such as stable balancing weighting methods. Through simulation studies across different data generating processes, we show that our estimator maintains proper coverage rates while the state-of-the-art bootstrap method can exhibit substantial undercoverage (dropping from 95% to as low as 61%), particularly in settings with extensive control unit reuse. Our framework provides researchers with both theoretical guarantees and practical tools for conducting valid population inference across a wide range of causal inference applications. An R package implementing our method is available at https://github.com/jche/scmatch2.
Caliper Synthetic Matching: Generalized Radius Matching with Local Synthetic Controls
arXiv (Cornell University) · 2024-11-08
preprint · Open access · Senior author
Matching promises transparent causal inferences for observational data, making it an intuitive approach for many applications. In practice, however, standard matching methods often perform poorly compared to modern approaches such as response-surface modeling and optimizing balancing weights. We propose Caliper Synthetic Matching (CSM) to address these challenges while preserving simple and transparent matches and match diagnostics. CSM extends Coarsened Exact Matching by incorporating general distance metrics, adaptive calipers, and locally constructed synthetic controls. We show that CSM can be viewed as a monotonic imbalance bounding matching method, so that it inherits the usual bounds on imbalance and bias enjoyed by MIB methods. We further provide a bound on a measure of joint covariate imbalance. Using a simulation study, we illustrate how CSM can even outperform modern matching methods in certain settings, and finally illustrate its use in an empirical example. Overall, we find CSM allows for many of the benefits of matching while avoiding some of the costs.
Disentangling Person-Dependent and Item-Dependent Causal Effects: Applications of Item Response Theory to the Estimation of Treatment Effect Heterogeneity
Journal of Educational and Behavioral Statistics · 2024-04-05 · 11 citations
article
Analyzing heterogeneous treatment effects (HTEs) plays a crucial role in understanding the impacts of educational interventions. A standard practice for HTE analysis is to examine interactions between treatment status and preintervention participant characteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that the identical patterns of HTE on test score outcomes can emerge either from variation in treatment effects due to a preintervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished if analysis is based on summary scores alone. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade and show that any apparent HTE by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.
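
The four-step procedure summarized in the "More power to you" abstract above (code a sample, fit a predictor, score every document, correct with human-versus-machine residuals) can be illustrated with a minimal sketch. It assumes the machine predictions already exist and that the human-coded documents are a simple random sample within each arm. The function names and the toy numbers are hypothetical and are not taken from the authors' software.

```python
from statistics import mean


def model_assisted_mean(predicted, human_scores):
    """Residual-corrected mean outcome for one experimental arm.

    predicted    -- machine-predicted scores for ALL documents in the arm
    human_scores -- dict mapping document index -> human-coded score for the
                    coded subsample (assumed to be a simple random sample)
    """
    if not human_scores:
        raise ValueError("at least one human-coded document is required")
    # Step 3: average the machine predictions over every document.
    prediction_mean = mean(predicted)
    # Step 4: average the human-minus-machine residuals on the coded subsample
    # and add that correction, so systematic prediction bias is removed.
    residual_mean = mean(human_scores[i] - predicted[i] for i in human_scores)
    return prediction_mean + residual_mean


def model_assisted_impact(pred_treat, coded_treat, pred_ctrl, coded_ctrl):
    """Estimated treatment impact: difference of the corrected arm means."""
    return (model_assisted_mean(pred_treat, coded_treat)
            - model_assisted_mean(pred_ctrl, coded_ctrl))


# Toy data (hypothetical): the predictor runs 0.5 points low in both arms.
pred_ctrl = [1.0, 2.0, 3.0, 4.0]
coded_ctrl = {0: 1.5, 2: 3.5}
pred_treat = [2.0, 3.0, 4.0, 5.0]
coded_treat = {1: 3.5, 3: 5.5}

impact = model_assisted_impact(pred_treat, coded_treat, pred_ctrl, coded_ctrl)
print(impact)  # 1.0
```

In this toy case the uncorrected prediction means are 3.5 and 2.5; the residual correction shifts each arm up by 0.5, so the corrected arm means are 4.0 and 3.0.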

Recent grants

Statistical Methods for Causal Inference in Geographic Regression Discontinuity Designs
NSF · $330k · 2015–2020

Frequent coauthors

Avi Feller
46 shared
Todd Grindal
SRI International
24 shared
Joseph Lorenzo Hall
University of California, Berkeley
12 shared
Emily C. Hanno
Manpower Demonstration Research Corporation
12 shared
Lindsay C. Page
11 shared
Peng Ding
University of California, Berkeley
11 shared
Jorge Cuartas
Universidad de Los Andes
9 shared
Stephanie M. Jones
9 shared

Labs

Sports Analytics Laboratory at Harvard University · PI
The Sports Analytics Laboratory at Harvard University focuses on the application of statistical and data science methods to sports.

Education

Ph.D., Statistics
Harvard University
2015
B.A., Mathematics
Harvard University
2010
