
Mark Bun
· Assistant ProfessorVerifiedBoston University · Computer Science
Active 2013–2026
About
Mark Bun is an assistant professor in the Department of Computer Science at Boston University. He completed his Ph.D. in computer science at Harvard University in 2016 under the supervision of Salil Vadhan. Prior to joining Boston University, he was a Google Research Fellow at the Simons Institute for the Theory of Computing at UC Berkeley, where he worked on lower bounds and data privacy. He also held a postdoctoral research position in the Theory of Computation Group at Princeton University, hosted by Mark Zhandry. As an undergraduate, he studied mathematics and computer science at the University of Washington. Professor Bun's research broadly spans theoretical computer science, with particular interests in data privacy, computational complexity, cryptography, and the foundations of machine learning. His current research focuses on applying complexity theory methodologies to address practically motivated questions in algorithmic data privacy. Additionally, he investigates the power of real polynomial approximations to Boolean functions and their applications in quantum computation, communication complexity, and learning theory. Throughout his career, Mark Bun has contributed significantly to the understanding of differential privacy and private learning, exploring the computational and statistical aspects of privacy-preserving algorithms. His work bridges theoretical insights with practical challenges in privacy and learning, advancing the foundations of these fields.
Research topics
- Machine Learning
- Computer Science
- Artificial Intelligence
- Theoretical computer science
- Data Mining
- Algorithm
- Mathematics
- Discrete mathematics
Selected publications
Separating Oblivious and Adaptive Differential Privacy under Continual Observation
arXiv (Cornell University) · 2026-03-11
articleOpen access1st authorCorrespondingWe resolve an open question of Jain, Raskhodnikova, Sivakumar, and Smith (ICML 2023) by exhibiting a problem separating differential privacy under continual observation in the oblivious and adaptive settings. The continual observation (a.k.a. continual release) model formalizes privacy for streaming algorithms, where data is received over time and output is released at each time step. In the oblivious setting, privacy need only hold for data streams fixed in advance; in the adaptive setting, privacy is required even for streams that can be chosen adaptively based on the streaming algorithm's output. We describe the first explicit separation between the oblivious and adaptive settings. The problem showing this separation is based on the correlated vector queries problem of Bun, Steinke, and Ullman (SODA 2017). Specifically, we present an $(\varepsilon,0)$-DP algorithm for the oblivious setting that remains accurate for exponentially many time steps in the dimension of the input. On the other hand, we show that every $(\varepsilon,δ)$-DP adaptive algorithm fails to be accurate after releasing output for only a constant number of time steps.
Separating Oblivious and Adaptive Differential Privacy under Continual Observation
arXiv (Cornell University) · 2026-03-11
preprintOpen access1st authorCorrespondingWe resolve an open question of Jain, Raskhodnikova, Sivakumar, and Smith (ICML 2023) by exhibiting a problem separating differential privacy under continual observation in the oblivious and adaptive settings. The continual observation (a.k.a. continual release) model formalizes privacy for streaming algorithms, where data is received over time and output is released at each time step. In the oblivious setting, privacy need only hold for data streams fixed in advance; in the adaptive setting, privacy is required even for streams that can be chosen adaptively based on the streaming algorithm's output. We describe the first explicit separation between the oblivious and adaptive settings. The problem showing this separation is based on the correlated vector queries problem of Bun, Steinke, and Ullman (SODA 2017). Specifically, we present an $(\varepsilon,0)$-DP algorithm for the oblivious setting that remains accurate for exponentially many time steps in the dimension of the input. On the other hand, we show that every $(\varepsilon,δ)$-DP adaptive algorithm fails to be accurate after releasing output for only a constant number of time steps.
Privately Learning Decision Lists and a Differentially Private Winnow
arXiv (Cornell University) · 2026-02-07
articleOpen access1st authorCorrespondingWe give new differentially private algorithms for the classic problems of learning decision lists and large-margin halfspaces in the PAC and online models. In the PAC model, we give a computationally efficient algorithm for learning decision lists with minimal sample overhead over the best non-private algorithms. In the online model, we give a private analog of the influential Winnow algorithm for learning halfspaces with mistake bound polylogarithmic in the dimension and inverse polynomial in the margin. As an application, we describe how to privately learn decision lists in the online model, qualitatively matching state-of-the art non-private guarantees.
Privately Learning Decision Lists and a Differentially Private Winnow
Open MIND · 2026-02-07
preprint1st authorCorrespondingWe give new differentially private algorithms for the classic problems of learning decision lists and large-margin halfspaces in the PAC and online models. In the PAC model, we give a computationally efficient algorithm for learning decision lists with minimal sample overhead over the best non-private algorithms. In the online model, we give a private analog of the influential Winnow algorithm for learning halfspaces with mistake bound polylogarithmic in the dimension and inverse polynomial in the margin. As an application, we describe how to privately learn decision lists in the online model, qualitatively matching state-of-the art non-private guarantees.
Interactive Proofs For Distribution Testing With Conditional Oracles
ArXiv.org · 2025-11-27
articleOpen accessWe revisit the framework of interactive proofs for distribution testing, first introduced by Chiesa and Gur (ITCS 2018), which has recently experienced a surge in interest, accompanied by notable progress (e.g., Herman and Rothblum, STOC 2022, FOCS 2023; Herman, RANDOM~2024). In this model, a data-poor verifier determines whether a probability distribution has a property of interest by interacting with an all-powerful, data-rich but untrusted prover bent on convincing them that it has the property. While prior work gave sample-, time-, and communication-efficient protocols for testing and estimating a range of distribution properties, they all suffer from an inherent issue: for most interesting properties of distributions over a domain of size $N$, the verifier must draw at least $Ω(\sqrt{N})$ samples of its own. While sublinear in $N$, this is still prohibitive for large domains encountered in practice. In this work, we circumvent this limitation by augmenting the verifier with the ability to perform an exponentially smaller number of more powerful (but reasonable) \emph{pairwise conditional} queries, effectively enabling them to perform ``local comparison checks'' of the prover's claims. We systematically investigate the landscape of interactive proofs in this new setting, giving polylogarithmic query and sample protocols for (tolerantly) testing all \emph{label-invariant} properties, thus demonstrating exponential savings without compromising on communication, for this large and fundamental class of testing tasks.
Enforcing Demographic Coherence: A Harms Aware Framework for Reasoning about Private Data Release
ArXiv.org · 2025-02-04
preprintOpen access1st authorCorrespondingThe technical literature about data privacy largely consists of two complementary approaches: formal definitions of conditions sufficient for privacy preservation and attacks that demonstrate privacy breaches. Differential privacy is an accepted standard in the former sphere. However, differential privacy's powerful adversarial model and worst-case guarantees may make it too stringent in some situations, especially when achieving it comes at a significant cost to data utility. Meanwhile, privacy attacks aim to expose real and worrying privacy risks associated with existing data release processes but often face criticism for being unrealistic. Moreover, the literature on attacks generally does not identify what properties are necessary to defend against them. We address the gap between these approaches by introducing demographic coherence, a condition inspired by privacy attacks that we argue is necessary for data privacy. This condition captures privacy violations arising from inferences about individuals that are incoherent with respect to the demographic patterns in the data. Our framework focuses on confidence rated predictors, which can in turn be distilled from almost any data-informed process. Thus, we capture privacy threats that exist even when no attack is explicitly being carried out. Our framework not only provides a condition with respect to which data release algorithms can be analysed but suggests natural experimental evaluation methodologies that could be used to build practical intuition and make tangible assessment of risks. Finally, we argue that demographic coherence is weaker than differential privacy: we prove that every differentially private data release is also demographically coherent, and that there are demographically coherent algorithms which are not differentially private.
Differentially Private Wasserstein Barycenters
ArXiv.org · 2025-10-03
preprintOpen accessThe Wasserstein barycenter is defined as the mean of a set of probability measures under the optimal transport metric, and has numerous applications spanning machine learning, statistics, and computer graphics. In practice these input measures are empirical distributions built from sensitive datasets, motivating a differentially private (DP) treatment. We present, to our knowledge, the first algorithms for computing Wasserstein barycenters under differential privacy. Empirically, on synthetic data, MNIST, and large-scale U.S. population datasets, our methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs.
Oracle-Efficient Differentially Private Learning with Public Data
2024-01-01
articlePrivate PAC Learning May be Harder than Online Learning
arXiv (Cornell University) · 2024-02-16
preprintOpen access1st authorCorrespondingWe continue the study of the computational complexity of differentially private PAC learning and how it is situated within the foundations of machine learning. A recent line of work uncovered a qualitative equivalence between the private PAC model and Littlestone's mistake-bounded model of online learning, in particular, showing that any concept class of Littlestone dimension $d$ can be privately PAC learned using $\mathrm{poly}(d)$ samples. This raises the natural question of whether there might be a generic conversion from online learners to private PAC learners that also preserves computational efficiency. We give a negative answer to this question under reasonable cryptographic assumptions (roughly, those from which it is possible to build indistinguishability obfuscation for all circuits). We exhibit a concept class that admits an online learner running in polynomial time with a polynomial mistake bound, but for which there is no computationally-efficient differentially private PAC learner. Our construction and analysis strengthens and generalizes that of Bun and Zhandry (TCC 2016-A), who established such a separation between private and non-private PAC learner.
Differentially private confidence intervals for proportions under stratified random sampling
Electronic Journal of Statistics · 2024-01-01 · 6 citations
articleOpen accessConfidence intervals are a fundamental tool for quantifying the uncertainty of parameters of interest. With the increase of data privacy awareness, developing a private version of confidence intervals has gained growing attention from both statisticians and computer scientists. Differential privacy is a state-of-the-art framework for analyzing privacy loss when releasing statistics computed from sensitive data. Recent work has been done around differentially private confidence intervals, yet to the best of our knowledge, rigorous methodologies on differentially private confidence intervals in the context of survey sampling have not been studied. In this paper, we propose three differentially private algorithms for constructing confidence intervals for proportions under stratified random sampling. We articulate two variants of differential privacy that make sense for data from stratified sampling designs, analyzing each of our algorithms within one of these two variants. We establish analytical privacy guarantees and asymptotic properties of the estimators. In addition, we conduct simulation studies to evaluate the proposed private confidence intervals, and two applications to the 1940 Census data are provided.
Recent grants
CRII: AF: The Polynomial Method in Learning, Communication, and Quantum Computation
NSF · $175k · 2020–2023
Frequent coauthors
- 36 shared
Justin Thaler
- 26 shared
Thomas Steinke
- 15 shared
Marco Gaboardi
Boston University
- 10 shared
Salil Vadhan
Harvard University Press
- 9 shared
Kobbi Nissim
Georgetown University
- 7 shared
Jonathan Ullman
Northeastern University
- 7 shared
Uri Stemmer
- 6 shared
Robin Kothari
Labs
Research in theoretical computer science, including data privacy, computational complexity, cryptography, and the foundations of machine learning.
Education
Ph.D.
Harvard
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Mark Bun
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup