Andrew Scott Allen

Professor of Biostatistics & Bioinformatics

Duke University · Biostatistics and Bioinformatics

Active 1977–2026

h-index: 53

Citations: 12.3k

Papers: 298 (70 in last 5 years)

Funding: $14.1M

About

Andrew Scott Allen is a Professor of Biostatistics and Bioinformatics at Duke University. He is Director of the Center for Statistical Genetics and Genomics and Chief of the Division of Integrative Genomics within the Duke Department of Biostatistics and Bioinformatics. In these roles he leads research initiatives in statistical genetics, genomics, and bioinformatics.

Research topics

Biology
Computational biology
Genetics
Neuroscience
Biochemistry
Cell biology

Selected publications

Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays
bioRxiv (Cold Spring Harbor Laboratory) · 2026-03-31
Article · Open access
Assessing likely variant effects on phenotypes is of critical importance in diagnostic settings, and while much progress has been made in interpreting genic mutations based on our understanding of coding sequence, noncoding variants can be much more challenging to reliably interpret based on DNA sequence alone. High-throughput reporter assays such as STARR-seq and MPRA have shown utility in experimentally measuring regulatory effects of noncoding variants present in samples but provide no readout for variants not present in the assay inputs. However, whole-genome reporter assays provide copious data that can be used to train predictive models for prioritizing variants not directly observed in the experiment. We describe a retrainable predictive modeling framework, BlueSTARR, for this task, and present results of training several models with this framework on whole-genome STARR-seq data from two cell lines and one drug treatment. Using these models, we uncover a global signature across the human genome consistent with purifying selection against both loss-of-function and gain-of-function regulatory variants, with the latter showing a significant bias consistent with selection against gains of cis regulatory function in closed chromatin proximal to genes. By testing the model on synthetic enhancers with binding motifs for transcription factors GR and AP-1, we find that when trained on drug perturbation data, the model is able to learn distance-dependent and treatment-dependent binding patterns and their resulting reporter gene activation. These results demonstrate that lightweight, easily retrainable models such as ours have utility in probing latent signals present in novel experimental data. Finally, we find only modest differences in performance between different deep-learning architectures when trained on this single data modality, and while somewhat greater predictive accuracy can be achieved with much larger models trained at great expense on many terabytes of data, there is still copious room for improvement even for industrial strength, state-of-the-art models.
Structured Pooling Improves Detection of Rare Regulatory Mutations in Population-Scale Reporter Assays
bioRxiv (Cold Spring Harbor Laboratory) · 2026-03-31
Article · Open access
Identifying genetic variants in noncoding DNA that impact gene expression and thereby contribute to disease risk remains a difficult but important challenge in genomic medicine. Modern reporter assays such as STARR-seq and MPRA provide an efficient and effective means of testing, in very high throughput, millions of variants captured directly from patient genomes. While these assays have previously been scaled to whole genomes and, separately, to populations, we report findings from the first whole-genome population-scale STARR-seq experiment performed on 100 individuals. In order to achieve that scale we devised a novel experimental design that partitions samples into pools so as to increase allele frequencies within pools and thereby reduce expected dropout and increase signal-to-noise ratio in experimental readouts. We show that this design produces more accurate estimates of variant effect sizes, and we provide a Bayesian model for robust estimation of those effect sizes that also reports full posterior distributions for assessment of confidence in estimates. Together, these methodological innovations facilitate the detection of functional regulatory variants, particularly rare variants, with much higher accuracy and at greater scale than previously possible. We demonstrate the utility of this approach on the task of functional annotation of quantitative trait loci such as eQTLs and caQTLs, and show concordance with patterns of constraint in transcription factor binding profiles.
Functional Annotation of the Major Histocompatibility Complex Locus
bioRxiv (Cold Spring Harbor Laboratory) · 2026-02-03
Article · Open access
The human major histocompatibility complex (MHC) locus has the greatest density of disease-associations in the human genome, including links to over 100 polygenic disorders. Its complex haplotype structure, rich gene density, and high degree of linkage disequilibrium combine to make deciphering the gene regulatory logic of the MHC locus extremely challenging. Employing complementary high-throughput CRISPR interference (CRISPRi) and activation (CRISPRa) epigenetic screens coupled with single-cell transcriptome profiling across three distinct human cell types, we identified hundreds of new connections between cis-regulatory elements (CREs) and their target genes in this locus. These CRE-gene links are largely cell type-specific and act as enhancers. Additionally, some CREs have complex features, including harboring both active and repressive histone marks, lacking chromatin accessibility, targeting multiple genes, or acting as silencers. Computational methods fail to predict a majority of these CRE-gene connections. These findings emphasize the potential for functional perturbation experiments to dissect complex loci and reveal shared and cell type-specific regulatory mechanisms relevant to genomics of complex diseases. Collectively, this study provides a unique resource for understanding the complex regulatory landscape within the MHC locus and supports the need for creating new models that encompass CRE-gene interactions, cell type-specific gene expression, and disease genetics in the noncoding genome.
Identifying Inheritance Patterns of Allelic Imbalance, using Integrative Modeling and Bayesian Inference
bioRxiv (Cold Spring Harbor Laboratory) · 2026-03-31
Article · Open access
Interpreting the effects of novel mutations on phenotypic traits remains challenging, particularly for cis-regulatory variants. For rare variants, individuals typically possess at most one affected copy of the causal allele, leading to allelic imbalance, and thus the ability to infer inheritance of allelic imbalance can inform genetic studies of phenotypic traits. While many methods for detection of allele-specific expression (ASE) exist, they largely focus on ASE in one individual. We show that performing joint inference across multiple individuals in a trio allows for simultaneously improving estimates of ASE and identifying its likely mode of inheritance. Our Bayesian approach has the benefit of being able to (1) aggregate information across individuals so as to improve statistical power, (2) estimate uncertainty in estimates, and (3) rank modes of inheritance by posterior probability. We demonstrate that this model is also applicable to other forms of imbalance such as allele-specific chromatin accessibility. Applying the model to ATAC-seq and RNA-seq from several trios, we uncover examples in which ASE can be linked to imbalance in chromatin state of cis-regulatory elements and to potential causal variants. As the cost of sequencing continues to decrease, we expect that powerful methodologies such as the one presented here will promote more routine collection of samples from related individuals and improve our understanding of genetic effects on gene regulation and their contribution to phenotypic traits.
Cell Modeling and Rescue of a Novel Non-coding Genetic Cause of Glycogen Storage Disease IX
bioRxiv (Cold Spring Harbor Laboratory) · 2025-05-17
Preprint · Open access
Delayed diagnosis of Mendelian disease substantially prevents early therapeutic intervention that could improve symptoms and prognosis. One major contributing challenge is the functional interpretation of non-coding variants that cause disease by altering splicing and/or gene expression. We identified two siblings with glycogen storage disease (GSD) type IX γ2, both of whom had a classic clinical presentation, enzyme deficiency, and a known pathogenic splice acceptor variant on one allele of PHKG2. Despite the autosomal recessive nature of the disease, no variant on the second allele was identified by gene panel sequencing. To identify a potential missing second pathogenic variant, we completed whole genome sequencing (WGS) and detected a putative deep intronic splicing variant in PHKG2 in both siblings. We confirmed the functional splicing effects of this variant using short-read and long-read RNA-seq on patient blood and a HEK293T cell model in which we installed the variant using CRISPR editing. Using the cell model, we demonstrated multiple biochemical and cellular impacts that are consistent with GSD IX γ2, and a reversal of aberrant splicing using antisense splice-switching oligonucleotides. In doing so, we demonstrate a novel and robust pathway for detecting, validating, and reversing the impacts of novel non-coding causes of rare disease.
Functional Category-Specific Intolerance Reflects Genic Function and Clinical Relevance
bioRxiv (Cold Spring Harbor Laboratory) · 2025-09-27
Preprint · Open access · Senior author · Corresponding author
A key problem in genetics is associating variants with disease phenotypes. In aid of this, much progress has been made in quantifying the functional impact of individual variants on the gene product they code for. However, the intolerance of the sequence in which those variants are found to functional variation is also a key determinant of whether a deleterious variant is pathogenic or not. Previous approaches to estimating genic intolerance have combined functional variant types, i.e., missense, loss-of-function, etc., or restricted analyses to only one type, i.e., pLI, missense-Z etc. Here we take a different approach and jointly model patterns of intolerance across multiple functional variant types. We refer to this approach as CATMINT. We show that CATMINT is competitive with previous gene level intolerance metrics in predicting disease relevant genes, with CATMINT ranking among the top performing scores across differing types of genes. However, perhaps more exciting is that CATMINT enables variant category specific intolerance estimation, revealing distinct functional profiles across genes/gene families. Analysis of ClinVar data shows that CATMINT intolerance patterns in disease genes recapitulate patterns of pathogenic variants within those genes, supporting the utility of category-specific intolerance in clinical variant interpretation. Further, we use the statistical framework utilized by CATMINT to conduct power analyses, allowing us to classify genes according to the power those genes have to detect intolerance. This allows us, for example, to identify genes that are underpowered and undetected, but may nevertheless be highly intolerant. Together, these results define a framework for understanding how selective pressures shape gene-specific sensitivity to different classes of mutation, improving the resolution of variant interpretation and gene prioritization in clinical and functional genomics.
Cell modeling and rescue of a novel noncoding genetic cause of glycogen storage disease IX
Genetics in Medicine Open · 2025-11-20
Article · Open access
Purpose: Delayed diagnosis of Mendelian disease prevents early therapeutic intervention that could improve symptoms and prognosis. One major contributing challenge is functional interpretation of noncoding variants that alter splicing. Here, we aimed to better understand both how splice altering variants contribute to Mendelian disease and how to identify such mechanisms via an instrumental case study of 2 siblings with glycogen storage disease (GSD) IX γ2. Methods: (HGNC:8931). Despite the autosomal recessive nature of the disease, no coding variant on the second allele was identified by targeted sequencing. We evaluated potential noncoding pathogenic variants using genome sequencing and RNA sequencing and created an isogenic model of the candidate variant using CRISPR/Cas9 genome editing. Results: . In a HEK293T cell model in which we installed that variant, we confirmed its effects on splicing in addition to multiple biochemical and cellular phenotypes consistent with GSD IX. We then reversed aberrant splicing using antisense oligonucleotide technology. Conclusion: c.556+1069T>G causes GSD IX γ2 and can be targeted using antisense oligonucleotides. This demonstrates a novel and robust pathway for detecting, validating, and reversing the impacts of noncoding causes of rare disease.
Bayesian estimation of allele-specific expression in the presence of phasing uncertainty
Bioinformatics · 2025-05-03 · 1 citation
Article · Open access
MOTIVATION: Allele-specific expression (ASE) analyses aim to detect imbalanced expression of maternal versus paternal copies of an autosomal gene. Such allelic imbalance can result from a variety of cis-acting causes, including disruptive mutations within one copy of a gene that impact the stability of transcripts, as well as regulatory variants outside the gene that impact transcription initiation. Current methods for ASE estimation suffer from a number of shortcomings, such as relying on only one variant within a gene, assuming perfect phasing information across multiple variants within a gene, or failing to account for alignment biases and possible genotyping errors. RESULTS: We developed BEASTIE, a Bayesian hierarchical model designed for precise ASE quantification at the gene level, based on given genotypes and RNA-Seq data. BEASTIE addresses the complexities of allelic mapping bias, genotyping error, and phasing errors by incorporating empirical phasing error rates derived from Genome-in-a-Bottle individual NA12878. BEASTIE surpasses existing methods in accuracy, especially in scenarios with high phasing errors. This improvement is critical for identifying rare genetic variants often obscured by such errors. Through rigorous validation on simulated data and application to real data from the 1000 Genomes Project, we establish the robustness of BEASTIE. These findings underscore the value of BEASTIE in revealing patterns of ASE across gene sets and pathways. AVAILABILITY AND IMPLEMENTATION: The software is freely available from GitHub (https://github.com/x811zou/BEASTIE) and Zenodo (DOI: 10.5281/zenodo.15062124).
Promoter Deletion Leading to Allele Specific Expression in a Genetically Unsolved Case of Primary Ciliary Dyskinesia
UNC Libraries · 2025-10-05
Article · Open access
Variation in the non-coding genome represents an understudied mechanism of disease and it remains challenging to predict if single nucleotide variants, small insertions and deletions, or structural variants in non-coding genomic regions will be detrimental. Our approach using complementary RNA-seq and targeted long-read DNA sequencing can prioritize identification of non-coding variants that lead to disease via alteration of gene splicing or expression. We have identified a patient with primary ciliary dyskinesia with a pathogenic coding variant on one allele of the SPAG1 gene, while the second allele appears normal by whole exome sequencing despite an autosomal recessive inheritance pattern. RNA sequencing revealed reduced SPAG1 transcript levels and exclusive allele specific expression of the known pathogenic allele, suggesting the presence of a non-coding variant on the second allele that impacts transcription. Targeted long-read DNA sequencing identified a heterozygous 3 kilobase deletion of the 5' untranslated region of SPAG1, overlapping the promoter and first non-coding exon. This non-coding deletion was missed by whole exome sequencing and gene-specific deletion/duplication analysis, highlighting the importance of investigating the non-coding genome in patients with "missing" disease-causing variation. This paradigm demonstrates the utility of both RNA and long-read DNA sequencing in identifying pathogenic non-coding variants in patients with unexplained genetic disease.
Abstract 4344136: The Genetic Basis of Early Mortality in Neonates with Single Ventricle Disease: An NC-DEFINE Prospective Observational Cohort Study
Circulation · 2025-11-03
Article
Background: Single ventricle disease (SVD) is the most severe form of congenital heart disease. Despite surgical advances improving survival, heart failure (HF) remains a key contributor to early morbidity and mortality, especially in the first months of life. Defining genetic drivers of HF in SVD could enable early risk stratification and guide precision therapies to improve outcomes. Hypothesis: Ultra-rare variants in genes associated with dilated cardiomyopathy (DCM) increase HF risk in neonates with SVD. Approach: Neonates ≤21 days old with SVD were prospectively enrolled and followed at Duke (n=40). An additional 69 individuals (0.2–65y) from Duke and UNC formed an ambispective cohort. Chromosomal abnormalities were excluded. Ultra-rare (MAF<e⁻⁴) variants in DCM genes were manually classified as likely pathogenic/pathogenic (LP/P) or of uncertain significance (VUS) per ACMG criteria. Multimodal deep clinical phenotyping was performed. HF was defined as severe (VAD, transplant, or death) or medically managed (MM; EF ≤40% and/or HF therapy escalation). Findings were validated in a blinded analysis of an independent SVD cohort from Nationwide Children’s Hospital (NCH; n=36) and compared to the AllofUs population. Results: In the prospective cohort, 8 (20%) neonates developed severe HF, 11 (28%) developed MM HF, and 21 (52%) remained HF free, with mean follow-up of 2 years. Hosting a DCM-associated LP/P variant was linked to an 11-fold increased risk of severe HF (P=0.0009), while VUSs increased MM HF risk 6-fold (P=0.01). Most HF occurred within the first month. Findings were independently validated in the NCH cohort, where LP/P variants reduced freedom from severe HF (P=0.0007). Associations were attenuated in the ambispective cohort, suggesting survivor bias and underscoring the importance of early detection and risk stratification. Compared with the AllofUs cohort, the prospective cohort had a higher prevalence of LP/P variants (P<0.0001) but similar VUS burden, supporting a model in which LP/P variants drive primary disease risk, while low-penetrant variants may modify susceptibility in the context of SVD. Conclusion: This study provides the first prospective evidence linking DCM-associated variants to significant risk for early-onset HF in SVD. These findings, independently validated in external cohorts, underscore the potential for genetic screening to inform early risk stratification and family counseling in this high-risk population.

Recent grants

NIH Grant R01MH084680
NIH · $912k · 2013
NIH Grant K25HL077663
NIH · $711k · 2010
The Duke FUNCTION Center: Pioneering the comprehensive identification of combinatorial noncoding causes of disease
NIH · $12.4M · 2020–2025

Frequent coauthors

David B. Goldstein · Columbia University · 179 shared
Gundula Povysil · Columbia University · 135 shared
Kate E. Stanley · KU Leuven · 125 shared
Vimla S. Aggarwal · Columbia University Irving Medical Center · 125 shared
Jessica L. Giordano · Columbia University Irving Medical Center · 124 shared
Uma M. Reddy · Cornell University · 124 shared
Joseph Hostyk · Columbia University · 124 shared
Ronald J. Wapner · New York Genome Center · 124 shared

Education

PhD · Emory University · 2001