
Flavio P. Calmon
· Thomas D. Cabot Associate Professor of Electrical EngineeringVerifiedHarvard University · Electrical Engineering
Active 2008–2026
About
Flavio P. Calmon is an Thomas D. Cabot Associate Professor of Electrical Engineering at Harvard University, affiliated with the Harvard John A. Paulson School of Engineering and Applied Sciences. His research areas include applied mathematics, data science, artificial intelligence, machine learning, computer science, computational and data science, computation and society, and signal processing. He has been recognized for his contributions to mentoring and advising, receiving the Capers and Marion McDonald Award for Excellence in Mentoring and Advising. His work involves aligning AI with human values, and he has contributed to discussions on the future of AI and its safety, emphasizing transparency and accountability in AI behavior. He is actively involved in both undergraduate and graduate programs within engineering and applied sciences.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Mathematics
Selected publications
Optimal conversion from Rényi Differential Privacy to $f$-Differential Privacy
arXiv (Cornell University) · 2026-02-04
articleOpen accessWe prove the conjecture stated in Appendix F.3 of [Zhu et al. (2022)]: among all conversion rules that map a Rényi Differential Privacy (RDP) profile $τ\mapsto ρ(τ)$ to a valid hypothesis-testing trade-off $f$, the rule based on the intersection of single-order RDP privacy regions is optimal. This optimality holds simultaneously for all valid RDP profiles and for all Type I error levels $α$. Concretely, we show that in the space of trade-off functions, the tightest possible bound is $f_{ρ(\cdot)}(α) = \sup_{τ\geq 0.5} f_{τ,ρ(τ)}(α)$: the pointwise maximum of the single-order bounds for each RDP privacy region. Our proof unifies and sharpens the insights of [Balle et al. (2019)], [Asoodeh et al. (2021)], and [Zhu et al. (2022)]. Our analysis relies on a precise geometric characterization of the RDP privacy region, leveraging its convexity and the fact that its boundary is determined exclusively by Bernoulli mechanisms. Our results establish that the "intersection-of-RDP-privacy-regions" rule is not only valid, but optimal: no other black-box conversion can uniformly dominate it in the Blackwell sense, marking the fundamental limit of what can be inferred about a mechanism's privacy solely from its RDP guarantees.
Robust AI Evaluation through Maximal Lotteries
ArXiv.org · 2026-02-24
articleOpen accessThe standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. We introduce robust lotteries that optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries provide more reliable win rate guarantees across the annotator distribution and recover a stable set of top-performing models. By moving from rankings to pluralistic sets of winners, robust lotteries offer a principled step toward an ecosystem of complementary AI systems that serve the full spectrum of human preferences.
Inference-Time Machine Unlearning via Gated Activation Redirection
arXiv (Cornell University) · 2026-05-12
preprintOpen accessLarge Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
Robust AI Evaluation through Maximal Lotteries
Open MIND · 2026-02-24
preprintThe standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. We introduce robust lotteries that optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries provide more reliable win rate guarantees across the annotator distribution and recover a stable set of top-performing models. By moving from rankings to pluralistic sets of winners, robust lotteries offer a principled step toward an ecosystem of complementary AI systems that serve the full spectrum of human preferences.
Optimal conversion from Rényi Differential Privacy to $f$-Differential Privacy
Open MIND · 2026-02-04
preprintWe prove the conjecture stated in Appendix F.3 of [Zhu et al. (2022)]: among all conversion rules that map a Rényi Differential Privacy (RDP) profile $τ\mapsto ρ(τ)$ to a valid hypothesis-testing trade-off $f$, the rule based on the intersection of single-order RDP privacy regions is optimal. This optimality holds simultaneously for all valid RDP profiles and for all Type I error levels $α$. Concretely, we show that in the space of trade-off functions, the tightest possible bound is $f_{ρ(\cdot)}(α) = \sup_{τ\geq 0.5} f_{τ,ρ(τ)}(α)$: the pointwise maximum of the single-order bounds for each RDP privacy region. Our proof unifies and sharpens the insights of [Balle et al. (2019)], [Asoodeh et al. (2021)], and [Zhu et al. (2022)]. Our analysis relies on a precise geometric characterization of the RDP privacy region, leveraging its convexity and the fact that its boundary is determined exclusively by Bernoulli mechanisms. Our results establish that the "intersection-of-RDP-privacy-regions" rule is not only valid, but optimal: no other black-box conversion can uniformly dominate it in the Blackwell sense, marking the fundamental limit of what can be inferred about a mechanism's privacy solely from its RDP guarantees.
Inference-Time Machine Unlearning via Gated Activation Redirection
ArXiv.org · 2026-05-12
articleOpen accessLarge Language Models memorize vast amounts of training data, raising concerns regarding privacy, copyright infringement, and safety. Machine unlearning seeks to remove the influence of a targeted forget set while preserving model performance, ideally approximating a model retrained from scratch without the forget set. Existing approaches aim to achieve this by updating model parameters via gradient-based methods. However, these updates are computationally expensive, lead to irreversible weight changes, and degrade when the model is quantized for deployment. A recent alternative to changing model weights is activation engineering, where activations are changed during inference to steer model behavior. Despite circumventing weight editing, naive activation steering introduces its own failure modes, as a single global steering vector applies the same intervention to every input, leading to unintended changes in model behavior. We introduce Inference-Time Unlearning via Gated Activation Redirection (GUARD-IT), a training- and gradient-free method that unlearns via input-dependent activation steering at inference time. The resulting intervention is applied as a norm-preserving rotation in the residual stream, leaving model weights untouched. Experiments on TOFU and MUSE show that GUARD-IT matches or exceeds 12 gradient-based baselines across three model scales, while being the only method to simultaneously preserve utility, suppress memorization, and avoid catastrophic collapse across all settings. GUARD-IT further supports continual unlearning without retraining, and remains effective under quantization, a scenario in which parameter-editing methods degrade.
Soft Best-of-$n$ Sampling for Model Alignment
2025-06-22
articleSenior authorBest-of-<tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$n$</tex> (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$n$</tex> responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$n$</tex> yields a higher reward at a higher distortion cost. We introduce Soft Best-of-<tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$n$</tex> sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter A. We establish theoretical guarantees showing that Soft Best-of-n sampling converges sharply to the optimal tilted distribution at a rate of <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$O(1/n)$</tex> in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning
ArXiv.org · 2025-03-13
preprintOpen accessCurrent practices for reporting the level of differential privacy (DP) protection for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture of the privacy guarantees. For instance, if only a single $(\varepsilon,δ)$ is known about a mechanism, standard analyses show that there exist highly accurate inference attacks against training data records, when, in fact, such accurate attacks might not exist. In this position paper, we argue that using non-asymptotic Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides. Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how to provide non-asymptotic bounds on GDP using numerical accountants, and show that GDP can capture the entire privacy profile of DP-SGD and related algorithms with virtually no error, as quantified by the metric. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits their profiles remarkably well in all cases. We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
arXiv (Cornell University) · 2025-10-30
preprintOpen accessSenior authorTranslating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they often only recover token-specific, noisy, or highly local concepts. We argue that this limitation stems from neglecting the temporal structure of language, where semantic content typically evolves smoothly over sequences. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
Inference-Time Reward Hacking in Large Language Models
ArXiv.org · 2025-06-24
preprintOpen accessSenior authorA common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM's output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance, a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, we introduce HedgeTune, an efficient algorithm to find the optimal inference-time parameter. We demonstrate that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.
Recent grants
FAI: Foundations of Fair AI in Medicine: Ensuring the Fair Use of Patient Attributes
NSF · $625k · 2021–2025
NSF · $252k · 2019–2022
Collaborative Research: CIF: Medium: Fundamental Limits of Privacy-Enhancing Technologies
NSF · $225k · 2023–2028
CAREER: Information-Theoretic Foundations of Fairness in Machine Learning
NSF · $548k · 2019–2024
NSF · $383k · 2019–2024
Frequent coauthors
- 33 shared
Lalitha Sankar
Arizona State University
- 33 shared
Shahab Asoodeh
McMaster University
- 29 shared
Muriel Médard
Massachusetts Institute of Technology
- 27 shared
Hsiang Hsu
JPMorgan Chase & Co (United States)
- 22 shared
Jiachun Liao
- 21 shared
Oliver Kosut
Arizona State University
- 20 shared
Ken R. Duffy
Northeastern University
- 17 shared
Mark M. Christiansen
National University of Ireland, Maynooth
Awards & honors
- Capers and Marion McDonald Award for Excellence in Mentoring…
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Flavio P. Calmon
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup