
Jason Altschuler
· Assistant Professor of Statistics and Data Science, Assistant Professor of Computer and Information Science (secondary appointment), Assistant Professor of Electrical and Systems Engineering (secondary appointment) · University of Pennsylvania · Business Economics and Public Policy
Active 1968–2026
About
Jason Altschuler is an Assistant Professor of Statistics and Data Science at the University of Pennsylvania's Wharton School, with secondary appointments in Computer and Information Science and in Electrical and Systems Engineering. He received his Ph.D. in Electrical Engineering and Computer Science from MIT in 2022, following a master's degree from MIT in 2018 and a B.S. in Computer Science from Princeton University in 2016. He was a Faculty Fellow at New York University from 2022 to 2023. His research interests include optimization, probability, machine learning, the mathematics of data science, and optimal transport. His publications span information theory, algebra, geometry, and data science, with a focus on the theoretical foundations of probability, statistical inference, and optimization and on new methods in these areas.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Algorithm
Selected publications
Negative Momentum for Convex-Concave Optimization
arXiv (Cornell University) · 2026-04-18
preprint · Open access · Senior author
This paper revisits momentum in the context of min-max optimization. Momentum is a celebrated mechanism for accelerating gradient dynamics in settings like convex minimization, but its direct use in min-max optimization makes gradient dynamics diverge. Surprisingly, Gidel et al. 2019 showed that negative momentum can help fix convergence. However, despite these promising initial results and progress since, the power of momentum remains unclear for min-max optimization in two key ways. (1) Generality: is global convergence possible for the foundational setting of convex-concave optimization? This is the direct analog of convex minimization and is a standard testing ground for min-max algorithms. (2) Fast convergence: is accelerated convergence possible for strongly-convex-strongly-concave optimization (the only non-linear setting where global convergence is known)? Recent work has even argued that this is impossible. We answer both these questions in the affirmative. Together, these results put negative momentum on more equal footing with competitor algorithms, and show that negative momentum enables convergence significantly faster and more generally than was known possible.
Algorithmic warm starts for Hamiltonian Monte Carlo
arXiv (Cornell University) · 2026-03-24
preprint · Open access
Generating samples from a continuous probability density is a central algorithmic problem across statistics, engineering, and the sciences. For high-dimensional settings, Hamiltonian Monte Carlo (HMC) is the default algorithm across mainstream software packages. However, despite the extensive line of work on HMC and its widespread empirical success, it remains unclear how many iterations of HMC are required as a function of the dimension $d$. On one hand, a variety of results show that Metropolized HMC converges in $O(d^{1/4})$ iterations from a warm start close to stationarity. On the other hand, Metropolized HMC is significantly slower without a warm start, e.g., requiring $\Omega(d^{1/2})$ iterations even for simple target distributions such as isotropic Gaussians. Finding a warm start is therefore the computational bottleneck for HMC. We resolve this issue for the well-studied setting of sampling from a probability distribution satisfying strong log-concavity (or isoperimetry) and third-order derivative bounds. We prove that non-Metropolized HMC generates a warm start in $\tilde{O}(d^{1/4})$ iterations, after which we can exploit the warm start using Metropolized HMC. Our final complexity of $\tilde{O}(d^{1/4})$ is the fastest algorithm for high-accuracy sampling under these assumptions, improving over the prior best of $\tilde{O}(d^{1/2})$. This closes the long line of work on the dimensional complexity of MHMC for such settings, and also provides a simple warm-start prescription for practical implementations.
SIAM Journal on Mathematics of Data Science · 2025-07-09 · 1 citation
article · Open access · 1st author · Corresponding
Optimized methods for composite optimization: a reduction perspective
ArXiv.org · 2025-06-30
preprint · Open access · Senior author
Recent advances in convex optimization have leveraged computer-assisted proofs to develop optimized first-order methods that improve over classical algorithms. However, each optimized method is specially tailored for a particular problem setting, and it is a well-documented challenge to extend optimized methods to other settings due to their highly bespoke design and analysis. We provide a general framework that derives optimized methods for composite optimization directly from those for unconstrained smooth optimization. The derived methods naturally extend the original methods, generalizing how proximal gradient descent extends gradient descent. The key to our result is certain algebraic identities that provide a unified and straightforward way of extending convergence analyses from unconstrained to composite settings. As concrete examples, we apply our framework to establish (1) the phenomenon of stepsize acceleration for proximal gradient descent; (2) a convergence rate for the proximal optimized gradient method which is faster than FISTA; (3) a new method that improves the state-of-the-art rate for minimizing gradient norm in the composite setting.
Near-Linear Runtime for a Classical Matrix Preconditioning Algorithm
ArXiv.org · 2025-03-20
preprint · Open access
In 1960, Osborne proposed a simple iterative algorithm for matrix balancing with outstanding numerical performance. Today, it is the default preconditioning procedure before eigenvalue computation and other linear algebra subroutines in mainstream software packages such as Python, Julia, MATLAB, EISPACK, LAPACK, and more. Despite its widespread usage, Osborne's algorithm has long resisted theoretical guarantees for its runtime: the first polynomial-time guarantees were obtained only in the past decade, and recent near-linear runtimes remain confined to variants of Osborne's algorithm with important differences that make them simpler to analyze but empirically slower. In this paper, we address this longstanding gap between theory and practice by proving that Osborne's original algorithm -- the de facto preconditioner in practice -- in fact has a near-linear runtime. This runtime guarantee (1) is optimal in the input size up to at most a single logarithm, (2) is the first runtime for Osborne's algorithm that does not dominate the runtime of downstream tasks like eigenvalue computation, and (3) improves upon the theoretical runtimes for all other variants of Osborne's algorithm.
Shifted Composition II: Shift Harnack Inequalities and Curvature Upper Bounds
IEEE Transactions on Information Theory · 2025-11-11
article · 1st author · Corresponding
We apply the shifted composition rule, an information-theoretic principle introduced in our earlier work [Altschuler and Chewi 2024, IEEE Transactions on Information Theory], to establish shift Harnack inequalities for the Langevin diffusion. We obtain sharp constants for these inequalities for the first time, allowing us to investigate their relationship with other properties of the diffusion. Namely, we show that they are equivalent to a sharp “local gradient-entropy” bound, and that they imply curvature upper bounds in a suggestive reflection of the Bakry–Émery theory of curvature lower bounds. As a corollary, we show that the local gradient-entropy inequality implies optimal concentration of the score, a.k.a. the logarithmic gradient of the density. More broadly, our techniques apply to discrete-time Markov chains over $\mathbb{R}^d$ and also yield sharp shift Harnack inequalities for such processes.
Negative Stepsizes Make Gradient-Descent-Ascent Converge
ArXiv.org · 2025-05-02
preprint · Open access · Senior author
Efficient computation of min-max problems is a central question in optimization, learning, games, and controls. Arguably the most natural algorithm is gradient-descent-ascent (GDA). However, since the 1970s, conventional wisdom has argued that GDA fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with additional building blocks such as extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form by simply using a judicious choice of stepsizes. The key innovation is the proposal of unconventional stepsize schedules (dubbed slingshot stepsize schedules) that are time-varying, asymmetric, and periodically negative. We show that all three properties are necessary for convergence, and that altogether this enables GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). All of our results apply to the last iterate of GDA, as is typically desired in practice. The core algorithmic intuition is that although negative stepsizes make backward progress, they de-synchronize the min and max variables (overcoming the cycling issue of GDA), and lead to a slingshot phenomenon in which the forward progress in the other iterations is overwhelmingly larger. This results in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive/negative steps cancel to first order, yielding a second-order net movement in a new direction that leads to convergence and is otherwise impossible for GDA to move in. We interpret this as a second-order finite-differencing algorithm and show that, intriguingly, it approximately implements consensus optimization, an empirically popular algorithm for min-max problems involving deep neural networks (e.g., training GANs).
Shifted Interpolation for Differential Privacy
arXiv (Cornell University) · 2024-03-01 · 1 citation
preprint · Open access · Senior author
Noisy gradient descent and its variants are the predominant algorithms for differentially private machine learning. It is a fundamental question to quantify their privacy leakage, yet tight characterizations remain open even in the foundational setting of convex losses. This paper improves over previous analyses by establishing (and refining) the "privacy amplification by iteration" phenomenon in the unifying framework of $f$-differential privacy, which tightly captures all aspects of the privacy loss and immediately implies tighter privacy accounting in other notions of differential privacy, e.g., $(\varepsilon,\delta)$-DP and Rényi DP. Our key technical insight is the construction of shifted interpolated processes that unravel the popular shifted-divergences argument, enabling generalizations beyond divergence-based relaxations of DP. Notably, this leads to the first exact privacy analysis in the foundational setting of strongly convex optimization. Our techniques extend to many settings: convex/strongly convex, constrained/unconstrained, full/cyclic/stochastic batches, and all combinations thereof. As an immediate corollary, we recover the $f$-DP characterization of the exponential mechanism for strongly convex optimization in Gopi et al. (2022), and moreover extend this result to more general settings.
Frequent coauthors
- Satwik Rajaram · The University of Texas Southwestern Medical Center · 21 shared
- Robert J. Steininger · The University of Texas at Dallas · 21 shared
- Steven J. Altschuler · University of California, San Francisco · 21 shared
- Benjamin Pavie · VIB-KU Leuven Center for Brain & Disease Research · 21 shared
- Pablo A. Parrilo · 21 shared
- Lani F. Wu · University of California, San Francisco · 21 shared
- Enric Boix-Adserà · 16 shared
- Austin Ouyang · Advanced Imaging Research (United States) · 14 shared
Awards & honors
- SIAM Early Career Prize for Optimization, 2026
- AFOSR Young Investigator Program (YIP) Award, 2026
- ICS Paper Prize, 2025
- Sloan Research Fellowship in Mathematics, 2025
- A.W. Tucker Finalist Prize, 2024