
Vasilis Syrgkanis
· Assistant Professor of Management Science and Engineering and, by courtesy, of Computer Science and Electrical EngineeringVerifiedStanford University · Rheumatology
Active 2010–2025
About
Vasilis Syrgkanis is an Assistant Professor of Management Science and Engineering, with a courtesy appointment in Computer Science at Stanford University. He is affiliated with the Center for Artificial Intelligence in Medicine & Imaging (AIMI). His research focuses on artificial intelligence, management science, and their applications in medicine and imaging. Syrgkanis contributes to advancing AI methodologies and their integration into healthcare, leveraging his expertise to develop innovative solutions in these fields.
Research topics
- Computer Science
- Computer Security
- Machine Learning
- Statistics
- Mathematics
- Artificial Intelligence
- Mathematical optimization
- Immunology
- Operations research
- Applied mathematics
- Computational biology
- Biology
- Econometrics
- Cell biology
- Cancer research
- Economics
- Genetics
- Microeconomics
- Combinatorics
Selected publications
ArXiv.org · 2025-10-17
preprintOpen accessSenior authorReinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
Adversarial Estimation of Riesz Representers
Journal of the American Statistical Association · 2025-12-02 · 3 citations
preprintOpen accessSenior authorMany causal parameters are linear functionals of an underlying regression. The Riesz representer is a key component in the asymptotic variance of a semiparametrically estimated linear functional. We propose an adversarial framework to estimate the Riesz representer using general function spaces. We prove a nonasymptotic mean square rate in terms of an abstract quantity called the critical radius, then specialize it for neural networks, random forests, and reproducing kernel Hilbert spaces as leading cases. Our estimators are highly compatible with targeted and debiased machine learning with sample splitting; our guarantees directly verify general conditions for inference that allow mis-specification. We also use our guarantees to prove inference without sample splitting, based on stability or complexity. Our estimators achieve nominal coverage in highly nonlinear simulations where some previous methods break down. They shed new light on the heterogeneous effects of matching grants.
Sequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity
2024-01-01
articleSenior authorSequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity
arXiv (Cornell University) · 2024-04-10
preprintOpen accessSenior authorWe study the problem of online sequential decision-making given auxiliary demonstrations from experts who made their decisions based on unobserved contextual information. These demonstrations can be viewed as solving related but slightly different problems than what the learner faces. This setting arises in many application domains, such as self-driving cars, healthcare, and finance, where expert demonstrations are made using contextual information, which is not recorded in the data available to the learning agent. We model the problem as zero-shot meta-reinforcement learning with an unknown distribution over the unobserved contextual variables and a Bayesian regret minimization objective, where the unobserved variables are encoded as parameters with an unknown prior. We propose the Experts-as-Priors algorithm (ExPerior), an empirical Bayes approach that utilizes expert data to establish an informative prior distribution over the learner's decision-making problem. This prior distribution enables the application of any Bayesian approach for online decision-making, such as posterior sampling. We demonstrate that our strategy surpasses existing behaviour cloning, online, and online-offline baselines for multi-armed bandits, Markov decision processes (MDPs), and partially observable MDPs, showcasing the broad reach and utility of ExPerior in using expert demonstrations across different decision-making setups.
2024-01-01 · 2 citations
articleSenior authorarXiv (Cornell University) · 2024-05-23
preprintOpen accessSenior authorReinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation
arXiv (Cornell University) · 2024-02-22 · 1 citations
preprintOpen accessSenior authorAverage treatment effect estimation is the most central problem in causal inference with application to numerous disciplines. While many estimation strategies have been proposed in the literature, the statistical optimality of these methods has still remained an open area of investigation, especially in regimes where these methods do not achieve parametric rates. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that achieve some statistical estimation rate. This framework is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as black-box sub-processes. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT), as well as weighted variants of the former, which arise in policy evaluation.
Conditional Influence Functions
arXiv (Cornell University) · 2024-12-24
preprintOpen accessSenior authorThere are many nonparametric objects of interest that are a function of a conditional distribution. One important example is an average treatment effect conditional on a subset of covariates. Many of these objects have a conditional influence function that generalizes the classical influence function of a functional of a (unconditional) distribution. Conditional influence functions have important uses analogous to those of the classical influence function. They can be used to construct Neyman orthogonal estimating equations for conditional objects of interest that depend on high dimensional regressions. They can be used to formulate local policy effects and describe the effect of local misspecification on conditional objects of interest. We derive conditional influence functions for functionals of conditional means and other features of the conditional distribution of an outcome variable. We show how these can be used for locally linear estimation of conditional objects of interest. We give rate conditions for first step machine learners to have no effect on asymptotic distributions of locally linear estimators. We also give a general construction of Neyman orthogonal estimating equations for conditional objects of interest.
Predicting Long Term Sequential Policy Value Using Softer Surrogates
arXiv (Cornell University) · 2024-12-30
preprintOpen accessOff-policy policy evaluation (OPE) estimates the outcome of a new policy using historical data collected from a different policy. However, existing OPE methods cannot handle cases when the new policy introduces novel actions. This issue commonly occurs in real-world domains, like healthcare, as new drugs and treatments are continuously developed. Novel actions necessitate on-policy data collection, which can be burdensome and expensive if the outcome of interest takes a substantial amount of time to observe--for example, in multi-year clinical trials. This raises a key question of how to predict the long-term outcome of a policy after only observing its short-term effects? Though in general this problem is intractable, under some surrogacy conditions, the short-term on-policy data can be combined with the long-term historical data to make accurate predictions about the new policy's long-term value. In two simulated healthcare examples--HIV and sepsis management--we show that our estimators can provide accurate predictions about the policy value only after observing 10\% of the full horizon data. We also provide finite sample analysis of our doubly robust estimators.
Taking a Moment for Distributional Robustness
arXiv (Cornell University) · 2024-05-08
preprintOpen accessSenior authorA rich line of recent work has studied distributionally robust learning approaches that seek to learn a hypothesis that performs well, in the worst-case, on many different distributions over a population. We argue that although the most common approaches seek to minimize the worst-case loss over distributions, a more reasonable goal is to minimize the worst-case distance to the true conditional expectation of labels given each covariate. Focusing on the minmax loss objective can dramatically fail to output a solution minimizing the distance to the true conditional expectation when certain distributions contain high levels of label noise. We introduce a new min-max objective based on what is known as the adversarial moment violation and show that minimizing this objective is equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation if we take the adversary's strategy space to be sufficiently rich. Previous work has suggested minimizing the maximum regret over the worst-case distribution as a way to circumvent issues arising from differential noise levels. We show that in the case of square loss, minimizing the worst-case regret is also equivalent to minimizing the worst-case $\ell_2$-distance to the true conditional expectation. Although their objective and our objective both minimize the worst-case distance to the true conditional expectation, we show that our approach provides large empirical savings in computational cost in terms of the number of groups, while providing the same noise-oblivious worst-distribution guarantee as the minimax regret approach, thus making positive progress on an open question posed by Agarwal and Zhang (2022).
Frequent coauthors
- 38 shared
Éva Tardos
- 38 shared
Brendan Lucier
Microsoft Research (United Kingdom)
- 28 shared
Robert E. Schapire
- 23 shared
Nicole Immorlica
- 17 shared
Jennifer Wortman Vaughan
- 16 shared
Aleksandrs Slivkins
- 16 shared
Denis Nekipelov
- 14 shared
Haipeng Luo
Zhejiang University
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Vasilis Syrgkanis
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup