Daniel Kifer
Associate Professor of Computer Science & Engineering, Graduate Faculty, Social Data Analytics, C-SoDA Faculty Affiliate · Pennsylvania State University · Social Data Analytics
Active 2002–2026
About
Daniel Kifer is an Associate Professor of Computer Science & Engineering and a Graduate Faculty member at Pennsylvania State University. He is also a faculty affiliate of the Center for Social Data Analytics (C-SoDA). His research focuses on social data analytics, integrating computer science principles to analyze and interpret social data. Kifer is involved in advancing the understanding of social data through his academic and research activities, contributing to the development of methodologies and tools in this interdisciplinary field.
Research topics
- Computer Science
- Machine Learning
- Artificial Intelligence
- Data Mining
- Mathematics
- Data science
Selected publications
Accurate and Scalable Matrix Mechanisms via Divide and Conquer
ArXiv.org · 2026-04-01
article · Open access · Senior author
Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads. In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload. We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.
Fast Private Adaptive Query Answering for Large Data Domains
ArXiv.org · 2026-02-05
article · Open access
Privately releasing marginals of a tabular dataset is a foundational problem in differential privacy. However, state-of-the-art mechanisms suffer from a computational bottleneck when marginal estimates are reconstructed from noisy measurements. Recently, residual queries were introduced and shown to lead to highly efficient reconstruction in the batch query answering setting. We introduce new techniques to integrate residual queries into state-of-the-art adaptive mechanisms such as AIM. Our contributions include a novel conceptual framework for residual queries using multi-dimensional arrays, lazy updating strategies, and adaptive optimization of the per-round privacy budget allocation. Together these contributions reduce error, improve speed, and simplify residual query operations. We integrate these innovations into a new mechanism (AIM+GReM), which improves AIM by using fast residual-based reconstruction instead of a graphical model approach. Our mechanism is orders of magnitude faster than the original framework and demonstrates competitive error and greatly improved scalability.
Fast Private Adaptive Query Answering for Large Data Domains
Open MIND · 2026-02-05
preprint
Accurate and Scalable Matrix Mechanisms via Divide and Conquer
arXiv (Cornell University) · 2026-04-01
preprint · Open access · Senior author
Composition for Pufferfish Privacy
Open MIND · 2026-02-02
preprint
When creating public data products out of confidential datasets, inferential/posterior-based privacy definitions, such as Pufferfish, provide compelling privacy semantics for data with correlations. However, such privacy definitions are rarely used in practice because they do not always compose. For example, it is possible to design algorithms for these privacy definitions that have no leakage when run once but reveal the entire dataset when run more than once. We prove necessary and sufficient conditions that must be added to ensure linear composition for Pufferfish mechanisms, hence avoiding such privacy collapse. These extra conditions turn out to be differential privacy-style inequalities, indicating that achieving both the interpretable semantics of Pufferfish for correlated data and composition benefits requires adapting differentially private mechanisms to Pufferfish. We show that such translation is possible through a concept called the $(a,b)$-influence curve, and many existing differentially private algorithms can be translated with our framework into a composable Pufferfish algorithm. We illustrate the benefit of our new framework by designing composable Pufferfish algorithms for Markov chains that significantly outperform prior work.
Composition for Pufferfish Privacy
arXiv (Cornell University) · 2026-02-02
article · Open access
Journal of Geophysical Research Machine Learning and Computation · 2025-09-01 · 2 citations
article · Open access
Landslide risk is traditionally predicted by process‐based models with detailed assessments or point‐scale attribute‐based machine learning (ML) models with first‐ or second‐order features (e.g., slope and curvature) as inputs. One could hypothesize that terrain patterns might contain useful higher‐order information that could be extracted, via computer vision ML models, to elevate prediction performance beyond that achievable with attribute‐based models. We put this hypothesis to the test in the state of Oregon, where a large landslide data set is available. A Convolutional Neural Network (CNN) using 2D geospatial and terrain data (CNN2D) reached state‐of‐the‐art single‐model scores for Precision (0.90) and Recall (0.86), along with other metrics. CNN2D's Precision‐Recall Pareto front, formed by applying different hyperparameters, dominated attribute‐based models like Random Forest (RF1D) by a substantial margin, attesting to the value of fine‐scale terrain patterns. However, CNN2D's superiority required high‐resolution rainfall (∼800 m) and terrain (∼10 m) data sets: as the resolution coarsened, all models declined in performance but CNN2D's scores decreased more than RF1D's. Ensembling CNN2D and RF1D produced even better Recall (0.90), and this cross‐model‐type ensemble was also better than other ensembles. These models further showed robust results in cross‐regional validation. Rainfall, land cover, and elevation were the most important predictors, while prescribed Plan and Profile Curvature fields were also highly useful inputs (perhaps due to the size of the training data set). Based on the results of our analyses, we generated landslide susceptibility maps which provide insights into spatial patterns of landslide risk.
Correlating Cross-Iteration Noise for DP-SGD using Model Curvature
ArXiv.org · 2025-10-06
preprint · Open access
Differentially private stochastic gradient descent (DP-SGD) offers the promise of training deep learning models while mitigating many privacy risks. However, there is currently a large accuracy gap between DP-SGD and normal SGD training. This has resulted in different lines of research investigating orthogonal ways of improving privacy-preserving training. One such line of work, known as DP-MF, correlates the privacy noise across different iterations of stochastic gradient descent, allowing later iterations to cancel out some of the noise added to earlier iterations. In this paper, we study how to improve this noise correlation. We propose a technique called NoiseCurve that uses model curvature, estimated from public unlabeled data, to improve the quality of this cross-iteration noise correlation. Our experiments on various datasets, models, and privacy parameters show that the noise correlations computed by NoiseCurve offer consistent and significant improvements in accuracy over the correlation scheme used by DP-MF.
A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census
Harvard Data Science Review · 2025-07-21 · 1 citation
article · Open access
For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level: individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy.
Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics’ utility for the primary statutory use case: redrawing the boundaries of all of the nation’s legislative and voting districts in compliance with the 1965 Voting Rights Act.
NDT & E International · 2025-08-23 · 2 citations
article
Recent grants
CAREER: An Axiomatic Basis for Statistical Privacy
NSF · $438k · 2011–2017
TWC SBES: Medium: Utility for Private Data Sharing in Social Science
NSF · $1.1M · 2012–2018
SaTC: CORE: Medium: Developing for Differential Privacy with Formal Methods and Counterexamples
NSF · $1.2M · 2017–2023
SaTC: CORE: Small: New Techniques for Optimizing Accuracy in Differential Privacy Applications
NSF · $500k · 2019–2024
Frequent coauthors
- C. Lee Giles · 52 shared
- Alexander G. Ororbia · Rochester Institute of Technology · 34 shared
- Ankur Mali · 33 shared
- Ashwin Machanavajjhala · 27 shared
- Danfeng Zhang · 25 shared
- John M. Abowd · 25 shared
- Chaopeng Shen · Pennsylvania State University · 21 shared
- Johannes Gehrke · Microsoft (United States) · 19 shared