Tiffany Amariuta
Ph.D. · Verified · University of California, San Diego · Medical Genetics
Active 2015–2026
About
Tiffany Amariuta, Ph.D., is the Principal Investigator of the Amariuta Lab at UCSD, which she started in July 2022. Her research explores the genetic basis of biological processes and diseases using computational and systems biology approaches. The lab develops methods to identify genome-wide contributions to gene expression regulation, such as evaluating trans-acting expression quantitative trait loci (trans-eQTLs), and applies data science techniques to understand health disparities among populations of different ancestries. Dr. Amariuta's work integrates genomic, transcriptomic, and computational data to advance the understanding of complex traits and diseases.
Research topics
- Genetics
- Biology
- Computational biology
Selected publications
EvoLen: Evolution-Guided Tokenization for DNA Language Model
arXiv (Cornell University) · 2026-04-09
preprint · Open access
Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs (short, recurring segments under evolutionary constraint and typically preserved across species). We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
Nature Communications · 2026-01-19
article · Open access
Fine-mapping causal tissues and genes at disease-associated loci
Nature Genetics · 2025-01-01 · 13 citations
article · Open access
Nature Communications · 2024-01-31 · 16 citations
review · Open access
Here we report the largest Asian genome-wide association study (GWAS) for systemic sclerosis performed to date, based on data from Japanese subjects and comprising 1,428 cases and 112,599 controls. The lead SNP is in the FCGR/FCRL region, which shows a penetrating association in the Asian population, while a complete linkage disequilibrium SNP, rs10917688, is found in a cis-regulatory element for IRF8. IRF8 is also a significant locus in European GWAS for systemic sclerosis, but rs10917688 only shows an association in the presence of the risk allele of IRF8 in the Japanese population. Further analysis shows that rs10917688 is marked with H3K4me1 in primary B cells. A meta-analysis with a European GWAS detects 30 additional significant loci. Polygenic risk scores constructed with the effect sizes of the meta-analysis suggest the potential portability of genetic associations beyond populations. Prioritizing the top 5% of SNPs of IRF8 binding sites in B cells improves the fitting of the polygenic risk scores, underscoring the roles of B cells and IRF8 in the development of systemic sclerosis. The results also suggest that systemic sclerosis shares a common genetic architecture across populations.
Research Square · 2024-12-04
preprint · Open access
High-dimensional phenotyping to define the genetic basis of cellular morphology
Nature Communications · 2024-01-06 · 52 citations
article · Open access
The morphology of cells is dynamic and mediated by genetic and environmental factors. Characterizing how genetic variation impacts cell morphology can provide an important link between disease association and cellular function. Here, we combine genomic sequencing and high-content imaging approaches on iPSCs from 297 unique donors to investigate the relationship between genetic variants and cellular morphology to map what we term cell morphological quantitative trait loci (cmQTLs). We identify novel associations between rare protein-altering variants in WASF2, TSPAN15, and PRLR with several morphological traits related to cell shape, nucleic granularity, and mitochondrial distribution. Knockdown of these genes by CRISPRi confirms their role in cell morphology. Analysis of common variants yields one significant association and nominates over 300 variants with suggestive evidence (P < 10⁻⁶) of association with one or more morphology traits. We then use these data to make predictions about sample size requirements for increasing discovery in cellular genetic studies. We conclude that, similar to molecular phenotypes, morphological profiling can yield insight about the function of genes and variants.
medRxiv · 2024-09-26 · 1 citation
preprint · Open access · Senior author · Corresponding
While disease-associated variants identified by genome-wide association studies (GWAS) most likely regulate gene expression levels, linking variants to target genes is critical to determining the functional mechanisms of these variants. Genetic effects on gene expression have been extensively characterized by expression quantitative trait loci (eQTL) studies, yet data from non-European populations is limited. This restricts our understanding of disease to genes whose regulatory variants are common in European populations. While previous work has leveraged data from multiple populations to improve GWAS power and polygenic risk score (PRS) accuracy, multi-ancestry data has not yet been used to better estimate cis-genetic effects on gene expression. Here, we present a new method, Multi-Ancestry Gene Expression Prediction Regularized Optimization (MAGEPRO), which constructs robust genetic models of gene expression in understudied populations or cell types by fitting a regularized linear combination of eQTL summary data across diverse cohorts. In simulations, our tool generates more accurate models of gene expression than widely-used LASSO and the state-of-the-art multi-ancestry PRS method, PRS-CSx, adapted to gene expression prediction. We attribute this improvement to MAGEPRO's ability to more accurately estimate causal eQTL effect sizes (p < 3.98 × 10⁻⁴, two-sided paired t-test). With real data, we applied MAGEPRO to 8 eQTL cohorts representing 3 ancestries (average n = 355) and consistently outperformed each of 6 competing methods in gene expression prediction tasks. Integration with GWAS summary statistics across 66 complex traits (representing 22 phenotypes and 3 ancestries) resulted in 2,331 new gene-trait associations, many of which replicate across multiple ancestries, including PHTF1 linked to white blood cell count, a gene which is overexpressed in leukemia patients. MAGEPRO also identified biologically plausible novel findings, such as PIGB, an essential component of GPI biosynthesis, associated with heart failure, which has been previously evidenced by clinical outcome data. Overall, MAGEPRO is a powerful tool to enhance inference of gene regulatory effects in underpowered datasets and has improved our understanding of population-specific and shared genetic effects on complex traits.
bioRxiv (Cold Spring Harbor Laboratory) · 2024-11-11 · 5 citations
preprint · Open access
Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity by providing gene expression data at single-cell resolution, uncovering insights into rare cell populations, cell-cell interactions, and gene regulation. Foundation models pretrained on large-scale scRNA-seq datasets have shown great promise in analyzing such data, but existing approaches are often limited to modeling a small subset of highly expressed genes and lack the integration of external gene-specific knowledge. To address these limitations, we present sc-Long, a billion-parameter foundation model pretrained on 48 million cells. sc-Long performs self-attention across the entire set of 28,000 genes in the human genome. This enables the model to capture long-range dependencies between all genes, including lowly expressed ones, which often play critical roles in cellular processes but are typically excluded by existing foundation models. Additionally, sc-Long integrates gene knowledge from the Gene Ontology using a graph convolutional network, enriching its contextual understanding of gene functions and relationships. In extensive evaluations, sc-Long surpasses both state-of-the-art scRNA-seq foundation models and task-specific models across diverse tasks, including predicting transcriptional responses to genetic and chemical perturbations, forecasting cancer drug responses, and inferring gene regulatory networks.
The power paradox of detecting disease-associated and gene-expression-associated variants
Nature Genetics · 2023-10-19 · 1 citation
article · 1st author · Corresponding
Frequent coauthors
- Soumya Raychaudhuri · Brigham and Women's Hospital · 327 shared
- Kazuyoshi Ishigaki · RIKEN Center for Integrative Medical Sciences · 111 shared
- Yang Luo · Brigham and Women's Hospital · 110 shared
- Harm-Jan Westra · University Medical Center Groningen · 77 shared
- Alkes L. Price · Broad Institute · 68 shared
- Emma E. Davenport · Wellcome Sanger Institute · 61 shared
- María Gutiérrez‐Arcelus · 49 shared
- Steven Gazal · 44 shared