David S. Matteson

Verified

Cornell University · Computer Science

Active 1986–2026

h-index: 28
Citations: 3.3k
Papers: 222 (105 in the last 5 years)
Funding: $2.5M
About

David S. Matteson is a faculty member in the Department of Computer Science at Cornell University.

Research topics

  • Mathematics
  • Data Mining
  • Artificial Intelligence
  • Computer Science
  • Machine Learning
  • Econometrics
  • Geometry
  • Mathematical economics
  • Economics
  • Physics
  • Economic growth
  • Medicine
  • Geography
  • Data science
  • Keynesian economics

Selected publications

  • BASTION: A Bayesian Framework for Trend and Seasonality Decomposition

    arXiv (Cornell University) · 2026-01-26

    preprint · Open access · Senior author

    We introduce BASTION (Bayesian Adaptive Seasonality and Trend DecompositION), a flexible Bayesian framework for decomposing time series into trend and multiple seasonality components. We cast the decomposition as a penalized nonparametric regression and establish formal conditions under which the trend and seasonal components are uniquely identifiable, an issue only treated informally in the existing literature. BASTION offers three key advantages over existing decomposition methods: (1) accurate estimation of trend and seasonality amidst abrupt changes, (2) enhanced robustness against outliers and time-varying volatility, and (3) robust uncertainty quantification. We evaluate BASTION against established methods, including TBATS, STR, and MSTL, using both simulated and real-world datasets. By effectively capturing complex dynamics while accounting for irregular components such as outliers and heteroskedasticity, BASTION delivers a more nuanced and interpretable decomposition. To support further research and practical applications, BASTION is available as an R package at https://github.com/Jasoncho0914/BASTION

  • 65 Integrating pathology reports into multimodal machine learning models to predict thyroid cancer recurrence

    Journal of Clinical and Translational Science · 2026-04-01

    article · Open access

    Objectives/Goals: The objective of this study was to evaluate the performance of multimodal machine learning (ML) models trained to predict differentiated thyroid cancer (DTC) recurrence using clinical data combined with novel natural language processing (NLP) derived features extracted from patient cytopathology and surgical pathology reports.

    Methods/Study Population: This was a retrospective study of adult thyroid cancer patients treated at an academic medical center. Patients were classified as having cancer recurrence or no recurrence. NLP features were extracted from cytopathology and surgical pathology reports using Term Frequency–Inverse Document Frequency (TF-IDF), latent Dirichlet allocation (LDA), and zero-shot large language model (LLM) classification. Five multimodal ML models were trained to predict cancer recurrence using a combination of NLP and LLM features and clinical variables. Model performance was evaluated using area under the receiver operating characteristic curve (ROC-AUC) and precision-recall area under the curve (PR-AUC). The top-performing model was optimized with 5-fold cross-validation. Feature importance was calculated.

    Results/Anticipated Results: A total of 480 patients with differentiated thyroid cancer diagnosed on surgical pathology were included in this study. The baseline model (clinical variables only) had an F1-score of 0.52 and an AUC of 0.53. The optimized gradient boosting model using all features (EMR, LDA, TF-IDF, and LLM) had an F1-score of 0.87 and an AUC of 0.86. Topic words and themes from the patient cytopathology and surgical pathology reports were generated using LDA. Topic themes in cytopathology reports include malignancy, lymph node evaluation, and molecular testing. Topic themes in surgical pathology reports include histologic subtype, orientation of nodule, and intraoperative biopsy. The LDA themes of malignancy and histologic subtype ranked the highest in terms of feature importance.

    Discussion/Significance of Impact: Multimodal models using novel NLP features derived from unstructured pathology reports may enable improved prediction of recurrence in patients with DTC. In our optimized model, 4 of the 6 highest-ranked features by importance were LDA topics. Topic modeling may be a valuable tool to extract relevant information from unstructured clinical notes.

  • Simulating Freely Diffusing Single-Molecule FRET Data with Consideration of Protein Conformational Dynamics

    The Journal of Physical Chemistry B · 2026-01-15

    article · Open access

    Single-molecule Förster resonance energy transfer (smFRET) experiments have greatly contributed to the understanding of the conformational dynamics of proteins and other biomolecules. Generating high-fidelity simulated data for smFRET experiments is an important step toward developing and examining accurate and efficient smFRET data analysis techniques. Here, we use distributions of interdye distances generated using Langevin dynamics to simulate freely diffusing smFRET timestamp data for proteins and biomolecules that have conformational flexibility. We then compare analysis techniques for smFRET data to validate the new module. The Langevin dynamics is used here as an illustrative example to demonstrate how modeling conformational dynamics can be integrated with molecular diffusion and photon emission statistics, all of which are essential for realistic simulation of freely diffusing smFRET data. We also discuss different ways to generalize our approach to make the simulated data more realistic including the employment of molecular dynamics (MD) simulations that is illustrated with an example. The Langevin dynamics module provides a framework for generating timestamp data for systems with a known underlying conformational heterogeneity as a step toward the development of new analysis techniques for smFRET data dealing with flexible proteins or other biomolecular systems.

  • Spatial heterogeneity in machine learning-based poverty mapping: Where do models underperform?

    Geography and sustainability · 2026-01-15

    article · Open access

      • Machine learning-based poverty mapping underperforms due to spatial heterogeneity.
      • In interpolation, geographically weighted models reveal variation across regions.
      • In extrapolation, models overestimate welfare in poor, rural, single-sector regions.
      • Spatial models yield limited gains in underperforming areas.
      • Unbiased poverty maps require improved training data and remote sensing proxies.

    Accurately locating poor populations is increasingly urgent as global poverty reduction has stalled under the combined pressures of conflicts, climate shocks, rising food prices, pandemics, and growing inequality. Recent studies harnessing geospatial big data and machine learning (ML) have significantly advanced poverty mapping, enabling granular and timely welfare estimates in traditionally data-scarce regions. While much of the existing research has focused on overall out-of-sample predictive performance, there is a lack of understanding regarding where such models underperform and whether key spatial relationships might vary across places. This study investigates spatial heterogeneity in ML-based poverty mapping in East Africa, testing whether spatial regression and ML techniques produce more unbiased predictions. We find that extrapolation into unsurveyed areas suffers from biases that spatial methods do not resolve; welfare is overestimated in impoverished regions, rural areas, and single-sector-dominated economies, whereas it tends to be underestimated in wealthier, urbanized, and diversified economies. Even as spatial models improve overall predictive accuracy, enhancements in traditionally underperforming areas remain marginal. This underscores the need for more representative training datasets and better remotely sensed proxies, especially for poor and rural regions, in future research related to ML-based poverty mapping. For development agencies, the findings caution against treating ML-based outputs as neutral or universally reliable, highlighting instead the need to pair technical advances with investments in inclusive data collection, integration of spatial theory, and institutional strategies that address structural data inequalities.

  • Modeling Dynamic Correlation Matrices with Shrinkage Priors

    ArXiv.org · 2026-05-07

    article · Open access

    Estimating time-varying correlation matrices is challenging because existing methods may adapt slowly to structural changes, impose insufficient regularization, or produce diffuse posterior uncertainty. In moderate dimensions, an additional difficulty is summarizing the estimated evolving dependence structure for downstream decision-making tasks. We propose a Bayesian approach based on a low-rank factor representation, with latent states evolving under a dynamic shrinkage prior and observation errors following a multivariate factor stochastic volatility model. This specification allows locally adaptive regularization of the estimated correlation structure over time and informative uncertainty quantification. We establish, to our knowledge, a first-of-its-kind posterior contraction result for dynamically regularized Bayesian models, showing contraction around the true model parameters at an explicit rate under averaged Hellinger distance. To summarize the estimated correlation matrices, we build on the information-theoretic concept of total correlation to obtain a scalar measure of cross-sectional dependence. Simulation studies show improved accuracy and responsiveness relative to competing methods in a range of challenging scenarios. We then apply our method to monitoring the correlation evolution of equity portfolios during periods of financial market stress, providing an ex post framework for assessing the changing benefits of diversification in backtesting analyses.

  • Smoothing Variances Across Time: Adaptive Stochastic Volatility

    Figshare · 2025-12-29

    dataset · Open access · Senior author

    We introduce a novel Bayesian framework for estimating time-varying volatility by extending the Random Walk Stochastic Volatility (RWSV) model with Dynamic Shrinkage Processes (DSP) in log-variances. Unlike the classical Stochastic Volatility (SV) or GARCH-type models with restrictive parametric stationarity assumptions, our proposed Adaptive Stochastic Volatility (ASV) model provides smooth yet dynamically adaptive estimates of evolving volatility and its uncertainty. We further enhance the model by incorporating a nugget effect, allowing it to flexibly capture small-scale variability while preserving smoothness elsewhere. We derive the theoretical properties of the global-local shrinkage prior DSP. Simulation studies demonstrate that ASV is highly robust to misspecification, consistently recovering the latent volatility structure across a wide range of data-generating processes. Furthermore, ASV’s capacity to yield locally smooth and interpretable estimates facilitates a clearer understanding of the underlying patterns and trends in volatility. As an extension, we develop the Bayesian Trend Filter with ASV (BTF-ASV) which allows joint modeling of the mean and volatility with abrupt changes. Finally, our proposed models are applied to time series data from finance, econometrics, and environmental science, highlighting their flexibility and broad applicability.

  • dsp: Dynamic Shrinkage Process and Change Point Detection

    2025-08-19

    dataset · Open access · Senior author

    Provides efficient Markov chain Monte Carlo (MCMC) algorithms for dynamic shrinkage processes, which extend global-local shrinkage priors to the time series setting by allowing shrinkage to depend on its own past. These priors yield locally adaptive estimates, useful for time series and regression functions with irregular features. The package includes full MCMC implementations for trend filtering using dynamic shrinkage on signal differences, producing locally constant or linear fits with adaptive credible bands. Also included are models with static shrinkage and normal-inverse-Gamma priors for comparison. Additional tools cover dynamic regression with time-varying coefficients and B-spline models with shrinkage on basis differences, allowing for flexible curve-fitting with unequally spaced data. The package also offers some support for heteroscedastic errors, outlier detection, and change point estimation. Methods in this package are described in Kowal et al. (2019) <doi:10.1111/rssb.12325>, Wu et al. (2024) <doi:10.1080/07350015.2024.2362269>, Schafer and Matteson (2024) <doi:10.1080/00401706.2024.2407316>, and Cho and Matteson (2024) <doi:10.48550/arXiv.2408.11315>.

  • Spatial Heterogeneity in Machine Learning-Based Poverty Mapping: Where Do Models Underperform?

    SSRN Electronic Journal · 2025-01-01

    preprint · Open access

Frequent coauthors

  • David Ruppert

    Cornell University

    46 shared
  • Ines Wilms

    19 shared
  • Jacob Bien

    University of Southern California

    19 shared
  • Yuchen Xu

    Shandong First Medical University

    16 shared
  • Peter A. Crozier

    16 shared
  • Nicholas A. James

    Mount Sinai Beth Israel

    15 shared
  • Toryn L. J. Schafer

    Cornell University

    14 shared
  • Benjamin B. Risk

    Emory University

    14 shared

Education

  • PhD, Statistics

    University of Chicago

    2008

Awards & honors

  • CAREER Award from the National Science Foundation
  • Chancellor’s Award for Scholarship and Creative Activities f…
  • inaugural Ann S. Bowers Research Excellence Award
  • Faculty Research Awards from the Xerox/PARC Foundation and L…
  • Fellow of the American Statistical Association (ASA)