About
David Hunter is a Professor of Statistics and a Graduate Faculty member at the Pennsylvania State University. He is also a Faculty Affiliate of the Social Data Analytics (C-SoDA) program. His office is at 302 Pond Laboratory, University Park, PA 16802. Further information about his academic profile and work is available on his website at http://sites.stat.psu.edu/~dhunter/.
Research topics
- Data Mining
- Computer Science
- Machine Learning
- Artificial Intelligence
- Programming language
- Mathematics
- Theoretical computer science
- Statistics
- Data science
Selected publications
2026-03-07
book-chapter · 1st author · Corresponding
This chapter examines the promises and pitfalls of using statistics in discussions about inclusivity and bias. Through three illustrative vignettes (on college admissions modeling, blind orchestra auditions, and academic science hiring), the chapter explores the limits of statistical inference in socially complex settings. It highlights the importance of model interpretability in high-stakes decision-making, the dangers of conflating correlation with causation, and the challenges of drawing conclusions from incomplete or unrepresentative data. The author argues that, while statistical tools can enhance clarity and foster an informed debate, they are often misunderstood or misapplied in ways that obscure rather than illuminate truth. The chapter advocates for a common-sense, ethically grounded approach to data interpretation, especially when the societal stakes are high. It concludes that statistical reasoning must remain transparent and humble, mindful of its limitations in measuring nuanced social realities.
Wellcome Open Research · 2025-04-30 · 1 citations
preprint · Open access
The Southern Corroboree frog (Pseudophryne corroboree; Anura; Myobatrachidae) is a Critically Endangered amphibian, according to the IUCN, and is endemic to the Snowy Mountains region of Kosciuszko National Park in New South Wales, Australia. This species has been driven to functional extinction by the introduction of the fungal disease, chytridiomycosis. Here we provide the first reference genome for P. corroboree. Using PacBio HiFi sequencing, Arima Hi-C, and Bionano optical mapping, we produced a chromosome-level genome assembly. Additionally, we generated a reference transcriptome based on multiple tissues from both male and female individuals to support genome annotation. The resulting genome spans 8.87 Gb across 12 chromosomes, with a contig N50 of 6.8 Mb. This research provides a phased, annotated genome assembly along with transcriptomic resources to facilitate future conservation genomic studies of P. corroboree. Furthermore, the genome offers an invaluable resource for taxonomic and evolutionary research, particularly given the nearest available chromosome-level reference genome is from Mixophyes fleayi, a species that last shared a common ancestor with P. corroboree 80 million years ago.
A Regression Framework for Studying Relationships among Attributes under Network Interference
Journal of the American Statistical Association · 2025-10-01
article · Senior author
arXiv (Cornell University) · 2024-10-10
preprint · Open access · Senior author
To understand how the interconnected and interdependent world of the twenty-first century operates and make model-based predictions, joint probability models for networks and interdependent outcomes are needed. We propose a comprehensive regression framework for networks and interdependent outcomes with multiple advantages, including interpretability, scalability, and provable theoretical guarantees. The regression framework can be used for studying relationships among attributes of connected units and captures complex dependencies among connections and attributes, while retaining the virtues of linear regression, logistic regression, and other regression models by being interpretable and widely applicable. On the computational side, we show that the regression framework is amenable to scalable statistical computing based on convex optimization of pseudo-likelihoods using minorization-maximization methods. On the theoretical side, we establish convergence rates for pseudo-likelihood estimators based on a single observation of dependent connections and attributes. We demonstrate the regression framework using simulations and an application to hate speech on the social media platform X.
Modeling Homophily in Exponential-Family Random Graph Models for Bipartite Networks
arXiv (Cornell University) · 2023-12-09
preprint · Open access
Homophily, the tendency of individuals who are alike to form ties with one another, is an important concept in the study of social networks. Yet accounting for homophily effects is complicated in the context of bipartite networks, where ties connect individuals not with one another but rather with a separate set of nodes, which might also be individuals but which are often an entirely different type of object. As a result, much work on the effect of homophily in a bipartite network proceeds by first eliminating the bipartite structure, collapsing a two-mode network to a one-mode network and thereby ignoring potentially meaningful structure in the data. We introduce a set of methods to model homophily on bipartite networks without losing information in this way, then we demonstrate that these methods allow for substantively interesting findings in management science not possible using standard techniques. These methods are implemented in the widely used ergm package for R.
The Annals of Applied Statistics · 2023-10-31 · 3 citations
article · Open access
Motivated by a study of United Nations voting behaviors, we introduce a regression model for a series of networks that are correlated over time. Our model is a dynamic extension of the additive and multiplicative effects network model (AMEN) of Hoff (2021). In addition to incorporating a temporal structure, the model accommodates two types of missing data and thus allows the size of the network to vary over time. We demonstrate via simulations the necessity of various components of the model. We apply the model to the United Nations General Assembly voting data from 1983 to 2014 (Voeten, 2013) to answer interesting research questions regarding international voting behaviors. In addition to finding important factors that could explain the voting behaviors, the model-estimated additive effects, multiplicative effects, and their movements reveal meaningful foreign policy positions and alliances of various countries.
Computing Pseudolikelihood Estimators for Exponential-Family Random Graph Models
Journal of Data Science · 2023-01-01 · 5 citations
article · Open access · Senior author · Corresponding
The reputation of the maximum pseudolikelihood estimator (MPLE) for Exponential Random Graph Models (ERGM) has undergone a drastic change over the past 30 years. While first receiving broad support, mainly due to its computational feasibility and the lack of alternatives, general opinions started to change with the introduction of approximate maximum likelihood estimator (MLE) methods that became practicable due to increasing computing power and the introduction of MCMC methods. Previous comparison studies appear to yield contradictory results regarding the preference of these two point estimators; however, there is consensus that the prevailing method of obtaining an MPLE's standard error from the inverse Hessian matrix generally underestimates standard errors. We propose replacing the inverse Hessian matrix by an approximation of the Godambe matrix that results in confidence intervals with appropriate coverage rates and that, in addition, enables checking for model degeneracy. Our results also provide empirical evidence for the asymptotic normality of the MPLE under certain conditions.
ergm 4: New Features for Analyzing Exponential-Family Random Graph Models
Journal of Statistical Software · 2023 · 61 citations
- Computer Science
- Data Mining
The ergm package supports the statistical analysis and simulation of network data. It anchors the statnet suite of packages for network analysis in R introduced in a special issue of the Journal of Statistical Software in 2008. This article provides an overview of the new functionality in the 2021 release of ergm version 4. The new features include more flexible handling of nodal covariates, term operators that extend and simplify model specification, new models for networks with valued edges, improved handling of constraints on the sample space of networks, and estimation with missing edge data. We also identify the new packages in the statnet suite that extend ergm's functionality to other network data types and structural features and the robust set of online resources that support the statnet development process and applications.
Likelihood-based inference for exponential-family random graph models via linear programming
Electronic Journal of Statistics · 2023-01-01 · 3 citations
article · Open access · Senior author
The problem of determining whether a given point, or set of points, lies within the convex hull of another set of points in d dimensions arises naturally in the context of certain exponential family models in statistics. This article discusses the general convex hull problem and its application to the particular problem of modelling network data using an exponential-family random graph model (ERGM). While the convex hull question may be solved via a simple linear program, this approach is not well known in the statistical literature. The article also details several substantial improvements to the convex hull-testing algorithm currently implemented in the widely used ergm package for network modeling. It provides direct numerical comparisons of two linear programming packages for R that can be called by ergm and offers several illustrative examples.
Improving ERGM starting values using simulated annealing
Social Networks · 2023-11-07 · 4 citations
article · Open access · Senior author
Much of the theory of estimation for exponential family models, which include exponential-family random graph models (ERGMs) as a special case, is well-established and maximum likelihood estimates (MLEs) in particular enjoy many desirable properties. However, in the case of many ERGMs, direct calculation of MLEs is impossible and therefore methods for approximating MLEs and/or alternative estimation methods must be employed. Many MLE approximation algorithms require an alternative estimate as a starting point. The maximum pseudo-likelihood estimator (MPLE) is frequently taken as this starting point. Here, we discuss a potentially large class of such alternatives based on the fact that, unlike the MLE, the MPLE fails to satisfy the so-called “likelihood principle”. This means that different networks may have different MPLEs even if they have the same sufficient statistics. We exploit this fact here to search for improved starting values for approximation-based MLE methods. The method we propose has shown its merit in producing an MLE for a network dataset and model that had defied estimation using all other known methods.
Recent grants
NIH · $1.4M · 2013
Frequent coauthors
- 20 shared
Mark S. Handcock
Development Fund
- 18 shared
Steven M. Goodreau
University of Washington
- 17 shared
Martina Morris
The Royal Wolverhampton NHS Trust
- 17 shared
Pavel N. Krivitsky
UNSW Sydney
- 13 shared
Carter T. Butts
University of California, Irvine
- 13 shared
Didier Chauveau
Institut Denis Poisson
- 11 shared
David Welch
- 10 shared
Duy Vu
University of Melbourne