Daniel Kifer

Associate Professor of Computer Science & Engineering, Graduate Faculty, Social Data Analytics, C-SoDA Faculty Affiliate

Pennsylvania State University · Social Data Analytics

Active 2002–2026

h-index: 46

Citations: 14.1k

Papers: 220 (93 in the last 5 years)

Funding: $3.2M

About

Daniel Kifer is an Associate Professor of Computer Science & Engineering and a Graduate Faculty member at Pennsylvania State University. He is also a faculty affiliate of the Center for Social Data Analytics (C-SoDA). His research focuses on social data analytics, applying computer science principles to analyze and interpret social data and contributing methodologies and tools to this interdisciplinary field.

Research topics

Computer Science
Machine Learning
Artificial Intelligence
Data Mining
Mathematics
Data science

Selected publications

Accurate and Scalable Matrix Mechanisms via Divide and Conquer
ArXiv.org · 2026-04-01
article · Open access · Senior author
Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads. In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload. We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.
Fast Private Adaptive Query Answering for Large Data Domains
ArXiv.org · 2026-02-05
article · Open access
Privately releasing marginals of a tabular dataset is a foundational problem in differential privacy. However, state-of-the-art mechanisms suffer from a computational bottleneck when marginal estimates are reconstructed from noisy measurements. Recently, residual queries were introduced and shown to lead to highly efficient reconstruction in the batch query answering setting. We introduce new techniques to integrate residual queries into state-of-the-art adaptive mechanisms such as AIM. Our contributions include a novel conceptual framework for residual queries using multi-dimensional arrays, lazy updating strategies, and adaptive optimization of the per-round privacy budget allocation. Together these contributions reduce error, improve speed, and simplify residual query operations. We integrate these innovations into a new mechanism (AIM+GReM), which improves AIM by using fast residual-based reconstruction instead of a graphical model approach. Our mechanism is orders of magnitude faster than the original framework and demonstrates competitive error and greatly improved scalability.
Fast Private Adaptive Query Answering for Large Data Domains
Open MIND · 2026-02-05
preprint
Accurate and Scalable Matrix Mechanisms via Divide and Conquer
arXiv (Cornell University) · 2026-04-01
preprint · Open access · Senior author
Composition for Pufferfish Privacy
Open MIND · 2026-02-02
preprint
When creating public data products out of confidential datasets, inferential/posterior-based privacy definitions, such as Pufferfish, provide compelling privacy semantics for data with correlations. However, such privacy definitions are rarely used in practice because they do not always compose. For example, it is possible to design algorithms for these privacy definitions that have no leakage when run once but reveal the entire dataset when run more than once. We prove necessary and sufficient conditions that must be added to ensure linear composition for Pufferfish mechanisms, hence avoiding such privacy collapse. These extra conditions turn out to be differential privacy-style inequalities, indicating that achieving both the interpretable semantics of Pufferfish for correlated data and composition benefits requires adapting differentially private mechanisms to Pufferfish. We show that such translation is possible through a concept called the $(a,b)$-influence curve, and many existing differentially private algorithms can be translated with our framework into a composable Pufferfish algorithm. We illustrate the benefit of our new framework by designing composable Pufferfish algorithms for Markov chains that significantly outperform prior work.
Composition for Pufferfish Privacy
arXiv (Cornell University) · 2026-02-02
article · Open access
The Value of Terrain Pattern, High‐Resolution Data and Ensemble Modeling for Landslide Susceptibility Prediction
Journal of Geophysical Research Machine Learning and Computation · 2025-09-01 · 2 citations
article · Open access
Landslide risk is traditionally predicted by process‐based models with detailed assessments or point‐scale attribute‐based machine learning (ML) models with first‐ or second‐order features (e.g., slope and curvature) as inputs. One could hypothesize that terrain patterns might contain useful higher‐order information that could be extracted, via computer vision ML models, to elevate prediction performance beyond that achievable with attribute‐based models. We put this hypothesis to the test in the state of Oregon, where a large landslide data set is available. A Convolutional Neural Network (CNN) using 2D geospatial and terrain data (CNN2D) reached state‐of‐the‐art single‐model scores for Precision (0.90) and Recall (0.86), along with other metrics. CNN2D's Precision‐Recall Pareto front, formed by applying different hyperparameters, dominated attribute‐based models like Random Forest (RF1D) by a substantial margin, attesting to the value of fine‐scale terrain patterns. However, CNN2D's superiority required high‐resolution rainfall (∼800 m) and terrain (∼10 m) data sets: as the resolution coarsened, all models declined in performance but CNN2D's scores decreased more than RF1D's. Ensembling CNN2D and RF1D produced even better Recall (0.90), and this cross‐model‐type ensemble was also better than other ensembles. These models further showed robust results in cross‐regional validation. Rainfall, land cover, and elevation were the most important predictors, while prescribed Plan and Profile Curvature fields were also highly useful inputs (perhaps due to the size of the training data set). Based on the results of our analyses, we generated landslide susceptibility maps which provide insights into spatial patterns of landslide risk.
Correlating Cross-Iteration Noise for DP-SGD using Model Curvature
ArXiv.org · 2025-10-06
preprint · Open access
Differentially private stochastic gradient descent (DP-SGD) offers the promise of training deep learning models while mitigating many privacy risks. However, there is currently a large accuracy gap between DP-SGD and normal SGD training. This has resulted in different lines of research investigating orthogonal ways of improving privacy-preserving training. One such line of work, known as DP-MF, correlates the privacy noise across different iterations of stochastic gradient descent -- allowing later iterations to cancel out some of the noise added to earlier iterations. In this paper, we study how to improve this noise correlation. We propose a technique called NoiseCurve that uses model curvature, estimated from public unlabeled data, to improve the quality of this cross-iteration noise correlation. Our experiments on various datasets, models, and privacy parameters show that the noise correlations computed by NoiseCurve offer consistent and significant improvements in accuracy over the correlation scheme used by DP-MF.
A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census
Harvard Data Science Review · 2025-07-21 · 1 citation
article · Open access
For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level—individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy. 
Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics’ utility for the primary statutory use case: redrawing the boundaries of all of the nation’s legislative and voting districts in compliance with the 1965 Voting Rights Act.
A transfer learning approach to the prediction of porosity in additively manufactured metallic components
NDT & E International · 2025-08-23 · 2 citations
article

Recent grants

CAREER: An Axiomatic Basis for Statistical Privacy
NSF · $438k · 2011–2017
TWC SBES: Medium: Utility for Private Data Sharing in Social Science
NSF · $1.1M · 2012–2018
SaTC: CORE: Medium: Developing for Differential Privacy with Formal Methods and Counterexamples
NSF · $1.2M · 2017–2023
SaTC: CORE: Small: New Techniques for Optimizing Accuracy in Differential Privacy Applications
NSF · $500k · 2019–2024

Frequent coauthors

C. Lee Giles
52 shared
Alexander G. Ororbia
Rochester Institute of Technology
34 shared
Ankur Mali
33 shared
Ashwin Machanavajjhala
27 shared
Danfeng Zhang
25 shared
John M. Abowd
25 shared
Chaopeng Shen
Pennsylvania State University
21 shared
Johannes Gehrke
Microsoft (United States)
19 shared
