Resume-aware faculty matching

Find professors who actually fit you

Upload your resume. Four AI agents analyze your background, rank the faculty who fit, inspect their recent research, and help you draft outreach — grounded in their actual work, not templates.

Free to start · No credit card · Cancel anytime
Daniel Kifer

Associate Professor of Computer Science & Engineering, Graduate Faculty, Social Data Analytics, C-SoDA Faculty Affiliate · Verified

Pennsylvania State University · Social Data Analytics

Active 2002–2026

h-index: 46
Citations: 14.1k
Papers: 220 (93 in the last 5 years)
Funding: $3.2M
About

Daniel Kifer is an Associate Professor of Computer Science & Engineering and a Graduate Faculty member at Pennsylvania State University, and a faculty affiliate of the Center for Social Data Analytics (C-SoDA). His listed area is social data analytics: applying computer science methods to analyze and interpret social data. His recent publications concentrate on differential privacy, including private query answering, privacy-preserving model training, and disclosure limitation for census data.

Research topics

  • Computer Science
  • Machine Learning
  • Artificial Intelligence
  • Data Mining
  • Mathematics
  • Data Science

Selected publications

  • Accurate and Scalable Matrix Mechanisms via Divide and Conquer

    ArXiv.org · 2026-04-01

    article · Open access · Senior author

    Matrix mechanisms are often used to provide unbiased differentially private query answers when publishing statistics or creating synthetic data. Recent work has developed matrix mechanisms, such as ResidualPlanner and Weighted Fourier Factorizations, that scale to high dimensional datasets while providing optimality guarantees for workloads such as marginals and circular product queries. They operate by adding noise to a linearly independent set of queries that can compactly represent the desired workloads. In this paper, we present QuerySmasher, an alternative scalable approach based on a divide-and-conquer strategy. Given a workload that can be answered from various data marginals, QuerySmasher splits each query into sub-queries and re-assembles the pieces into mutually orthogonal sub-workloads. These sub-workloads represent small, low-dimensional problems that can be independently and optimally answered by existing low-dimensional matrix mechanisms. QuerySmasher then stitches these solutions together to answer queries in the original workload. We show that QuerySmasher subsumes prior work, like ResidualPlanner (RP), ResidualPlanner+ (RP+), and Weighted Fourier Factorizations (WFF). We prove that it can dominate those approaches, under sum squared error, for all workloads. We also experimentally demonstrate the scalability and accuracy of QuerySmasher.

  • Fast Private Adaptive Query Answering for Large Data Domains

    ArXiv.org · 2026-02-05

    article · Open access

    Privately releasing marginals of a tabular dataset is a foundational problem in differential privacy. However, state-of-the-art mechanisms suffer from a computational bottleneck when marginal estimates are reconstructed from noisy measurements. Recently, residual queries were introduced and shown to lead to highly efficient reconstruction in the batch query answering setting. We introduce new techniques to integrate residual queries into state-of-the-art adaptive mechanisms such as AIM. Our contributions include a novel conceptual framework for residual queries using multi-dimensional arrays, lazy updating strategies, and adaptive optimization of the per-round privacy budget allocation. Together these contributions reduce error, improve speed, and simplify residual query operations. We integrate these innovations into a new mechanism (AIM+GReM), which improves AIM by using fast residual-based reconstruction instead of a graphical model approach. Our mechanism is orders of magnitude faster than the original framework and demonstrates competitive error and greatly improved scalability.

  • Composition for Pufferfish Privacy

    Open MIND · 2026-02-02

    preprint

    When creating public data products out of confidential datasets, inferential/posterior-based privacy definitions, such as Pufferfish, provide compelling privacy semantics for data with correlations. However, such privacy definitions are rarely used in practice because they do not always compose. For example, it is possible to design algorithms for these privacy definitions that have no leakage when run once but reveal the entire dataset when run more than once. We prove necessary and sufficient conditions that must be added to ensure linear composition for Pufferfish mechanisms, hence avoiding such privacy collapse. These extra conditions turn out to be differential privacy-style inequalities, indicating that achieving both the interpretable semantics of Pufferfish for correlated data and composition benefits requires adopting differentially private mechanisms to Pufferfish. We show that such translation is possible through a concept called the $(a,b)$-influence curve, and many existing differentially private algorithms can be translated with our framework into a composable Pufferfish algorithm. We illustrate the benefit of our new framework by designing composable Pufferfish algorithms for Markov chains that significantly outperform prior work.

  • The Value of Terrain Pattern, High‐Resolution Data and Ensemble Modeling for Landslide Susceptibility Prediction

    Journal of Geophysical Research: Machine Learning and Computation · 2025-09-01 · 2 citations

    article · Open access

    Landslide risk is traditionally predicted by process‐based models with detailed assessments or point‐scale attribute‐based machine learning (ML) models with first‐ or second‐order features (e.g., slope and curvature) as inputs. One could hypothesize that terrain patterns might contain useful higher‐order information that could be extracted, via computer vision ML models, to elevate prediction performance beyond that achievable with attribute‐based models. We put this hypothesis to the test in the state of Oregon, where a large landslide data set is available. A Convolutional Neural Network (CNN) using 2D geospatial and terrain data (CNN2D) reached state‐of‐the‐art single‐model scores for Precision (0.90) and Recall (0.86), along with other metrics. CNN2D's Precision‐Recall Pareto front, formed by applying different hyperparameters, dominated attribute‐based models like Random Forest (RF1D) by a substantial margin, attesting to the value of fine‐scale terrain patterns. However, CNN2D's superiority required high‐resolution rainfall (∼800 m) and terrain (∼10 m) data sets: as the resolution coarsened, all models declined in performance but CNN2D's scores decreased more than RF1D's. Ensembling CNN2D and RF1D produced even better Recall (0.90), and this cross‐model‐type ensemble was also better than other ensembles. These models further showed robust results in cross‐regional validation. Rainfall, land cover, and elevation were the most important predictors, while prescribed Plan and Profile Curvature fields were also highly useful inputs (perhaps due to the size of the training data set). Based on the results of our analyses, we generated landslide susceptibility maps which provide insights into spatial patterns of landslide risk.

  • Correlating Cross-Iteration Noise for DP-SGD using Model Curvature

    ArXiv.org · 2025-10-06

    preprint · Open access

    Differentially private stochastic gradient descent (DP-SGD) offers the promise of training deep learning models while mitigating many privacy risks. However, there is currently a large accuracy gap between DP-SGD and normal SGD training. This has resulted in different lines of research investigating orthogonal ways of improving privacy-preserving training. One such line of work, known as DP-MF, correlates the privacy noise across different iterations of stochastic gradient descent -- allowing later iterations to cancel out some of the noise added to earlier iterations. In this paper, we study how to improve this noise correlation. We propose a technique called NoiseCurve that uses model curvature, estimated from public unlabeled data, to improve the quality of this cross-iteration noise correlation. Our experiments on various datasets, models, and privacy parameters show that the noise correlations computed by NoiseCurve offer consistent and significant improvements in accuracy over the correlation scheme used by DP-MF.

  • A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census

    Harvard Data Science Review · 2025-07-21 · 1 citation

    article · Open access

    For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the U.S. used different disclosure limitation rules for its tabular and microdata publications. This paper demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level—individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy. 
    Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics’ utility for the primary statutory use case: redrawing the boundaries of all of the nation’s legislative and voting districts in compliance with the 1965 Voting Rights Act.

  • A transfer learning approach to the prediction of porosity in additively manufactured metallic components

    NDT & E International · 2025-08-23 · 2 citations

    article

Frequent coauthors

  • C. Lee Giles · 52 shared
  • Alexander G. Ororbia · Rochester Institute of Technology · 34 shared
  • Ankur Mali · 33 shared
  • Ashwin Machanavajjhala · 27 shared
  • Danfeng Zhang · 25 shared
  • John M. Abowd · 25 shared
  • Chaopeng Shen · Pennsylvania State University · 21 shared
  • Johannes Gehrke · Microsoft (United States) · 19 shared