About
Patrick Flaherty is an Associate Professor in the Department of Mathematics and Statistics at the University of Massachusetts Amherst. His research focuses on developing statistical models and scalable algorithms to interpret massive biomedical data sets, particularly in the context of large-scale genomic data. His work aims to address the need for statistically rigorous and computationally efficient methods to analyze complex data generated by advances in DNA sequencing technology, with the goal of improving patient care. His research spans diverse fields including machine learning, bioinformatics, statistics, and genetics. Flaherty's long-term goal is to enable the interpretation of genomic changes that drive disease development through innovative statistical and computational approaches.
Research topics
- Computer Science
- Biology
- Chemistry
- Genetics
- Data Mining
- Internal medicine
- Bioinformatics
- Medicine
- Computational biology
- Biochemistry
- Surgery
- Programming language
Selected publications
Stress testing reveals selective vulnerabilities in protein homeostasis
bioRxiv (Cold Spring Harbor Laboratory) · 2025-06-16
preprint · Open access
Protein quality control (PQC) systems are essential for cellular resilience to proteotoxic stress. Despite intensive study for decades, functional redundancies in the system obscure the contributions of the collectively important individual genes. Here, we leverage transposon sequencing across bacteria strains lacking key chaperones and proteases to reveal hidden determinants of stress response in protein homeostasis. By profiling fitness under multiple proteotoxic stresses, we uncover stress-specific vulnerabilities and reveal how major players of PQC mask correlations between transcriptomic responses and gene fitness. As an illustration of unexpected connections, we identify a heat-specific synthetic lethality between the disaggregase ClpB and DNA Polymerase I (PolA) mediated by persistent aggregation of the RecA recombinase and toxic persistence of the heat shock regulon. Our findings reveal that stress-induced aggregation is not broadly toxic. Rather, it becomes lethal in specific genetic or environmental contexts due to the depletion of components only needed in those specific circumstances. This work presents a framework to reveal normally hidden fragility in stress responses using gene fitness scores adaptable to a variety of systems.
Annals of Intensive Care · 2024-01-01 · 3 citations
article · Open access
BACKGROUND: Multiple organ failure/dysfunction syndrome (MOF/MODS) is a major cause of mortality and morbidity among severe trauma patients. Current clinical practices entail monitoring physiological measurements and applying clinical score systems to diagnose its onset. Instead, we aimed to develop an early prediction model for MOF outcome evaluated soon after traumatic injury by performing machine learning analysis of genome-wide transcriptome data from blood samples drawn within 24 h of traumatic injury. We then compared its performance to baseline injury severity scores and detection of infections.
METHODS: Buffy coat transcriptome and linked clinical datasets from blunt trauma patients from the Inflammation and the Host Response to Injury Study ("Glue Grant") multi-center cohort were used. According to the inclusion/exclusion criteria, 141 adult (age ≥ 16 years old) blunt trauma patients (excluding penetrating) with early buffy coat (≤ 24 h since trauma injury) samples were analyzed, with 58 MOF-cases and 83 non-cases. We applied the Least Absolute Shrinkage and Selection Operator (LASSO) and eXtreme Gradient Boosting (XGBoost) algorithms to select features and develop models for MOF early outcome prediction.
RESULTS: The LASSO model included 18 transcripts (AUROC [95% CI]: 0.938 [0.890-0.987] (training) and 0.833 [0.699-0.967] (test)), and the XGBoost model included 41 transcripts (0.999 [0.997-1.000] (training) and 0.907 [0.816-0.998] (test)). There were 16 overlapping transcripts comparing the two panels (0.935 [0.884-0.985] (training) and 0.836 [0.703-0.968] (test)). The biomarker models notably outperformed models based on injury severity scores and sex, which we found to be significantly associated with MOF (APACHEII + sex: 0.649 [0.537-0.762] (training) and 0.493 [0.301-0.685] (test); ISS + sex: 0.630 [0.516-0.744] (training) and 0.482 [0.293-0.670] (test); NISS + sex: 0.651 [0.540-0.763] (training) and 0.525 [0.335-0.714] (test)).
CONCLUSIONS: The accurate assessment of MOF from blood samples immediately after trauma is expected to aid in improving clinical decision-making and may contribute to reduced morbidity, mortality and healthcare costs. Moreover, understanding the molecular mechanisms involving the transcripts identified as important for MOF prediction may eventually aid in developing novel interventions.
arXiv (Cornell University) · 2024-10-24
preprint · Open access
We consider the problem of developing interpretable and computationally efficient matrix decomposition methods for matrices whose entries have bounded support. Such matrices are found in large-scale DNA methylation studies and many other settings. Our approach decomposes the data matrix into a Tucker representation wherein the number of columns in the constituent factor matrices is not constrained. We derive a computationally efficient sampling algorithm to solve for the Tucker decomposition. We evaluate the performance of our method using three criteria: predictability, computability, and stability. Empirical results show that our method has similar performance as other state-of-the-art approaches in terms of held-out prediction and computational complexity, but has significantly better performance in terms of stability to changes in hyper-parameters. The improved stability results in higher confidence in the results in applications where the constituent factors are used to generate and test scientific hypotheses such as DNA methylation analysis of cancer samples.
Discovering Genetic Modulators of the Protein Homeostasis System through Multilevel Analysis
bioRxiv (Cold Spring Harbor Laboratory) · 2024-02-29 · 1 citation
preprint · Open access · Senior author · Corresponding
Every protein progresses through a natural lifecycle from birth to maturation to death; this process is coordinated by the protein homeostasis system. Environmental or physiological conditions trigger pathways that maintain the homeostasis of the proteome. An open question is how these pathways are modulated to respond to the many stresses that an organism encounters during its lifetime. To address this question, we tested how the fitness landscape changes in response to environmental and genetic perturbations using directed and massively parallel transposon mutagenesis in Caulobacter crescentus. We developed a general computational pipeline for the analysis of gene-by-environment interactions in transposon mutagenesis experiments. This pipeline uses a combination of general linear models (GLMs), statistical knockoffs, and a nonparametric Bayesian statistical model to identify essential genetic network components that are shared across environmental perturbations. This analysis allows us to quantify the similarity of proteotoxic environmental perturbations from the perspective of the fitness landscape. We find that essential genes vary more by genetic background than by environmental conditions, with limited overlap among mutant strains targeting different facets of the protein homeostasis system. We also identified 146 unique fitness determinants across different strains, with 19 genes common to at least two strains, showing varying resilience to proteotoxic stresses. Experiments exposing cells to a combination of genetic perturbations and dual environmental stressors show that perturbations that are quantitatively dissimilar from the perspective of the fitness landscape are likely to have a synergistic effect on the growth defect.
Significance Statement
This study provides critical insights into how cells adapt to environmental and genetic challenges affecting protein homeostasis. Using multilevel statistical analysis and transposon mutagenesis, we find that a model organism, Caulobacter crescentus, lacks a universal redundancy mechanism for coping with stress, as evidenced by the limited overlap in essential genes across different environmental and genetic perturbations. Our methods also pinpoint key fitness determinants and enable the prediction of perturbation combinations that synergistically affect cell growth.
Discovering genetic modulators of the protein homeostasis system through multilevel analysis
PNAS Nexus · 2024-12-23
article · Open access · Senior author
Every protein progresses through a natural lifecycle from birth to maturation to death; this process is coordinated by the protein homeostasis system. Environmental or physiological conditions trigger pathways that maintain the homeostasis of the proteome. An open question is how these pathways are modulated to respond to the many stresses that an organism encounters during its lifetime. To address this question, we tested how the fitness landscape changes in response to environmental and genetic perturbations using directed and massively parallel transposon mutagenesis in Caulobacter crescentus. We developed a general computational pipeline for the analysis of gene-by-environment interactions in transposon mutagenesis experiments. This pipeline uses a combination of general linear models, statistical knockoffs, and a nonparametric Bayesian statistical model to identify essential genetic network components that are shared across environmental perturbations. This analysis allows us to quantify the similarity of proteotoxic environmental perturbations from the perspective of the fitness landscape. We find that essential genes vary more by genetic background than by environmental conditions, with limited overlap among mutant strains targeting different facets of the protein homeostasis system. We also identified 146 unique fitness determinants across different strains, with 19 genes common to at least two strains, showing varying resilience to proteotoxic stresses. Experiments exposing cells to a combination of genetic perturbations and dual environmental stressors show that perturbations that are quantitatively dissimilar from the perspective of the fitness landscape are likely to have a synergistic effect on the growth defect.
A PREVENTIVE TOOL FOR PREDICTING BLOODSTREAM INFECTIONS IN CHILDREN WITH BURNS
Shock · 2023-01-04 · 14 citations
article · Open access
ABSTRACT: Introduction: Despite significant advances in pediatric burn care, bloodstream infections (BSIs) remain a compelling challenge during recovery. A personalized medicine approach for accurate prediction of BSIs before they occur would contribute to prevention efforts and improve patient outcomes. Methods: We analyzed the blood transcriptome of severely burned (total burn surface area [TBSA] ≥20%) patients in the multicenter Inflammation and Host Response to Injury ("Glue Grant") cohort. Our study included 82 pediatric (aged <16 years) patients, with blood samples at least 3 days before the observed BSI episode. We applied the least absolute shrinkage and selection operator (LASSO) machine-learning algorithm to select a panel of biomarkers predictive of BSI outcome. Results: We developed a panel of 10 probe sets corresponding to six annotated genes (ARG2 [arginase 2], CPT1A [carnitine palmitoyltransferase 1A], FYB [FYN binding protein], ITCH [itchy E3 ubiquitin protein ligase], MACF1 [microtubule actin crosslinking factor 1], and SSH2 [slingshot protein phosphatase 2]), two uncharacterized (LOC101928635, LOC101929599), and two unannotated regions. Our multibiomarker panel model yielded highly accurate prediction (area under the receiver operating characteristic curve, 0.938; 95% confidence interval [CI], 0.881-0.981) compared with models with TBSA (0.708; 95% CI, 0.588-0.824) or TBSA and inhalation injury status (0.792; 95% CI, 0.676-0.892). A model combining the multibiomarker panel with TBSA and inhalation injury status further improved prediction (0.978; 95% CI, 0.941-1.000). Conclusions: The multibiomarker panel model yielded a highly accurate prediction of BSIs before their onset. Knowing patients' risk profile early will guide clinicians to take rapid preventive measures for limiting infections, promote antibiotic stewardship that may aid in alleviating the current antibiotic resistance crisis, shorten hospital length of stay and burden on health care resources, reduce health care costs, and significantly improve patients' outcomes. In addition, the biomarkers' identity and molecular functions may contribute to developing novel preventive interventions.
Briefings in Bioinformatics · 2023-02-18 · 6 citations
article · Open access · Senior author · Corresponding
Large-scale multiple perturbation experiments have the potential to reveal a more detailed understanding of the molecular pathways that respond to genetic and environmental changes. A key question in these studies is which gene expression changes are important for the response to the perturbation. This problem is challenging because (i) the functional form of the nonlinear relationship between gene expression and the perturbation is unknown and (ii) identification of the most important genes is a high-dimensional variable selection problem. To deal with these challenges, we present here a method based on the model-X knockoffs framework and Deep Neural Networks to identify significant gene expression changes in multiple perturbation experiments. This approach makes no assumptions on the functional form of the dependence between the responses and the perturbations and it enjoys finite sample false discovery rate control for the selected set of important gene expression responses. We apply this approach to the Library of Integrated Network-Based Cellular Signature data sets which is a National Institutes of Health Common Fund program that catalogs how human cells globally respond to chemical, genetic and disease perturbations. We identified important genes whose expression is directly modulated in response to perturbation with anthracycline, vorinostat, trichostatin-a, geldanamycin and sirolimus. We compare the set of important genes that respond to these small molecules to identify co-responsive pathways. Identification of which genes respond to specific perturbation stressors can provide better understanding of the underlying mechanisms of disease and advance the identification of new drug targets.
PLoS Computational Biology · 2022-03-07 · 5 citations
article · Open access · Senior author · Corresponding
The understanding of bacterial gene function has been greatly enhanced by recent advancements in the deep sequencing of microbial genomes. Transposon insertion sequencing methods combines next-generation sequencing techniques with transposon mutagenesis for the exploration of the essentiality of genes under different environmental conditions. We propose a model-based method that uses regularized negative binomial regression to estimate the change in transposon insertions attributable to gene-environment changes in this genetic interaction study without transformations or uniform normalization. An empirical Bayes model for estimating the local false discovery rate combines unique and total count information to test for genes that show a statistically significant change in transposon counts. When applied to RB-TnSeq (randomized barcode transposon sequencing) and Tn-seq (transposon sequencing) libraries made in strains of Caulobacter crescentus using both total and unique count data the model was able to identify a set of conditionally beneficial or conditionally detrimental genes for each target condition that shed light on their functions and roles during various stress conditions.
The Annals of Applied Statistics · 2021-06-01 · 1 citation
article · Open access · Senior author
There are distinguishing features or "hallmarks" of cancer that are found across tumors, individuals, and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet, within a single tumor there is often extensive genetic heterogeneity as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important in the personalized design of targeted therapeutic combinations and monitoring for possible recurrence after treatment. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments with simulation data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analyses of real acute lymphoblastic leukemia cancer sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the results of our model on this real dataset reveals co-mutated loci across samples.
Cluster Trellis: Data Structures & Algorithms for Exact Inference in Hierarchical Clustering
International Conference on Artificial Intelligence and Statistics · 2021-03-18
article
Recent grants
Learning Conditionally Essential Genetic Networks in the Protein Homeostasis System
NIH · $693k · 2019–2024
Frequent coauthors
- Hanlee P. Ji · Palo Alto University · 58 shared
- James M. Ford · University of California, San Francisco · 51 shared
- Mei‐Yin C. Polley · NRG Oncology · 49 shared
- Mamie Yu · 49 shared
- David H. Gutmann · Washington University in St. Louis · 49 shared
- Joshua D. Schiffman · Huntsman Cancer Institute · 49 shared
- Paul G. Fisher · Stanford University · 49 shared
- Mitchel S. Berger · Neurological Surgery · 49 shared
Education
- 2012 · Postdoc, Biochemistry · Stanford University
- 2006 · PhD, Electrical Engineering and Computer Science · University of California Berkeley
- 2000 · BS, Electrical Engineering · Rochester Institute of Technology