Tiffany Amariuta

Ph.D. · Verified

University of California, San Diego · Medical Genetics

Active 2015–2026

h-index: 25
Citations: 3.6k
Papers: 91 (42 in the last 5 years)

About

Tiffany Amariuta, Ph.D., is the Principal Investigator of the Amariuta Lab at UCSD, which she started in July 2022. Her research explores the genetic foundations of biological processes and diseases using computational and systems biology approaches. The lab develops methods to identify genome-wide contributions to gene expression regulation, such as evaluating trans-acting expression quantitative trait loci (trans-eQTLs), and applies data science techniques to understand health disparities among populations of different ancestries. Her work integrates genomic, transcriptomic, and computational data to advance the understanding of complex traits and diseases.

Research topics

  • Genetics
  • Biology
  • Computational biology

Selected publications

  • EvoLen: Evolution-Guided Tokenization for DNA Language Model

    arXiv (Cornell University) · 2026-04-09

    preprint · Open access

    Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs, which are short, recurring segments under evolutionary constraint and typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
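
    The abstract above mentions length-aware decoding with dynamic programming but does not give the objective. The following is a minimal Python sketch of the general idea only, not the EvoLen implementation. The function name `segment`, the squared-length score, and the toy vocabulary are all assumptions made for illustration.

```python
from typing import List, Set


def segment(seq: str, vocab: Set[str]) -> List[str]:
    """Split seq into tokens, maximizing the sum of squared token lengths.

    The squared-length score is an assumed stand-in for a length-aware
    objective. Single bases are always allowed, so a split always exists.
    """
    max_len = max((len(t) for t in vocab), default=1)
    n = len(seq)
    best = [0] + [-1] * n   # best[i]: best score for the prefix seq[:i]
    back = [0] * (n + 1)    # back[i]: start of the last token in seq[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = seq[j:i]
            if len(piece) == 1 or piece in vocab:
                score = best[j] + len(piece) ** 2
                if score > best[i]:
                    best[i] = score
                    back[i] = j
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(seq[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical toy vocabulary; a greedy left-to-right longest match would
# take "AC" first and never recover the longer token "CGTA".
print(segment("ACGTA", {"AC", "CGTA"}))  # ['A', 'CGTA']
```

    Under this assumed score, the dynamic program can give up a short early match in order to keep a longer token intact, which a greedy longest-match pass cannot do.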

  • Author Correction: GWAS for systemic sclerosis identifies six novel susceptibility loci including one in the Fcγ receptor region

    Nature Communications · 2026-01-19

    article · Open access
  • Fine-mapping causal tissues and genes at disease-associated loci

    Nature Genetics · 2025-01-01 · 13 citations

    article · Open access
  • GWAS for systemic sclerosis identifies six novel susceptibility loci including one in the Fcγ receptor region

    Nature Communications · 2024-01-31 · 16 citations

    review · Open access

    Here we report the largest Asian genome-wide association study (GWAS) for systemic sclerosis performed to date, based on data from Japanese subjects and comprising 1428 cases and 112,599 controls. The lead SNP is in the FCGR/FCRL region, which shows a penetrating association in the Asian population, while a complete linkage disequilibrium SNP, rs10917688, is found in a cis-regulatory element for IRF8. IRF8 is also a significant locus in European GWAS for systemic sclerosis, but rs10917688 only shows an association in the presence of the risk allele of IRF8 in the Japanese population. Further analysis shows that rs10917688 is marked with H3K4me1 in primary B cells. A meta-analysis with a European GWAS detects 30 additional significant loci. Polygenic risk scores constructed with the effect sizes of the meta-analysis suggest the potential portability of genetic associations beyond populations. Prioritizing the top 5% of SNPs of IRF8 binding sites in B cells improves the fitting of the polygenic risk scores, underscoring the roles of B cells and IRF8 in the development of systemic sclerosis. The results also suggest that systemic sclerosis shares a common genetic architecture across populations.

  • scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics

    Research Square · 2024-12-04

    preprint · Open access
  • High-dimensional phenotyping to define the genetic basis of cellular morphology

    Nature Communications · 2024-01-06 · 52 citations

    article · Open access

    The morphology of cells is dynamic and mediated by genetic and environmental factors. Characterizing how genetic variation impacts cell morphology can provide an important link between disease association and cellular function. Here, we combine genomic sequencing and high-content imaging approaches on iPSCs from 297 unique donors to investigate the relationship between genetic variants and cellular morphology to map what we term cell morphological quantitative trait loci (cmQTLs). We identify novel associations between rare protein altering variants in WASF2, TSPAN15, and PRLR with several morphological traits related to cell shape, nucleic granularity, and mitochondrial distribution. Knockdown of these genes by CRISPRi confirms their role in cell morphology. Analysis of common variants yields one significant association and nominates over 300 variants with suggestive evidence (P < 10⁻⁶) of association with one or more morphology traits. We then use these data to make predictions about sample size requirements for increasing discovery in cellular genetic studies. We conclude that, similar to molecular phenotypes, morphological profiling can yield insight about the function of genes and variants.

  • Powerful mapping of cis-genetic effects on gene expression across diverse populations reveals novel disease-critical genes

    medRxiv · 2024-09-26 · 1 citation

    preprint · Open access · Senior author · Corresponding

    While disease-associated variants identified by genome-wide association studies (GWAS) most likely regulate gene expression levels, linking variants to target genes is critical to determining the functional mechanisms of these variants. Genetic effects on gene expression have been extensively characterized by expression quantitative trait loci (eQTL) studies, yet data from non-European populations is limited. This restricts our understanding of disease to genes whose regulatory variants are common in European populations. While previous work has leveraged data from multiple populations to improve GWAS power and polygenic risk score (PRS) accuracy, multi-ancestry data has not yet been used to better estimate cis-genetic effects on gene expression. Here, we present a new method, Multi-Ancestry Gene Expression Prediction Regularized Optimization (MAGEPRO), which constructs robust genetic models of gene expression in understudied populations or cell types by fitting a regularized linear combination of eQTL summary data across diverse cohorts. In simulations, our tool generates more accurate models of gene expression than widely-used LASSO and the state-of-the-art multi-ancestry PRS method, PRS-CSx, adapted to gene expression prediction. We attribute this improvement to MAGEPRO’s ability to more accurately estimate causal eQTL effect sizes (p < 3.98 × 10⁻⁴, two-sided paired t-test). With real data, we applied MAGEPRO to 8 eQTL cohorts representing 3 ancestries (average n = 355) and consistently outperformed each of 6 competing methods in gene expression prediction tasks. Integration with GWAS summary statistics across 66 complex traits (representing 22 phenotypes and 3 ancestries) resulted in 2,331 new gene-trait associations, many of which replicate across multiple ancestries, including PHTF1 linked to white blood cell count, a gene which is overexpressed in leukemia patients. MAGEPRO also identified biologically plausible novel findings, such as PIGB, an essential component of GPI biosynthesis, associated with heart failure, which has been previously evidenced by clinical outcome data. Overall, MAGEPRO is a powerful tool to enhance inference of gene regulatory effects in underpowered datasets and has improved our understanding of population-specific and shared genetic effects on complex traits.
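
    The abstract above describes fitting a regularized linear combination of eQTL summary data across cohorts. The following toy Python sketch illustrates that general idea with a ridge penalty; it is not the MAGEPRO implementation, and the abstract does not say which penalty the method uses. The names `solve` and `combine_cohorts`, the ridge formulation, and all numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def combine_cohorts(
    genotypes: List[List[float]],       # individuals x SNPs, target cohort
    expression: List[float],            # observed expression per individual
    cohort_weights: List[List[float]],  # one SNP-weight vector per external cohort
    ridge: float,
) -> Tuple[List[float], List[float]]:
    """Ridge-fit mixing coefficients for external SNP weights (assumed setup)."""
    # features[k][i]: expression predicted for individual i by cohort k's weights
    features = [[sum(g * w for g, w in zip(row, weights)) for row in genotypes]
                for weights in cohort_weights]
    k = len(features)
    gram = [[sum(x * y for x, y in zip(features[p], features[q]))
             + (ridge if p == q else 0.0) for q in range(k)] for p in range(k)]
    rhs = [sum(f * y for f, y in zip(features[p], expression)) for p in range(k)]
    alpha = solve(gram, rhs)
    n_snps = len(cohort_weights[0])
    combined = [sum(alpha[c] * cohort_weights[c][s] for c in range(k))
                for s in range(n_snps)]
    return alpha, combined


# Toy data: expression is built as 0.5 * cohort 1 + 2.0 * cohort 2 predictions.
genotypes = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [1, 2, 1]]
cohort_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
expression = [2.0, 0.5, 3.0, 4.5]
alpha, combined = combine_cohorts(genotypes, expression, cohort_weights, ridge=0.0)
print(alpha)
```

    With `ridge=0.0` the fit recovers the mixing coefficients used to build the toy data (up to floating-point error); a positive `ridge` shrinks them, which is the usual trade-off when the target cohort is small.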

  • scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics

    bioRxiv (Cold Spring Harbor Laboratory) · 2024-11-11 · 5 citations

    preprint · Open access

    Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity by providing gene expression data at single-cell resolution, uncovering insights into rare cell populations, cell-cell interactions, and gene regulation. Foundation models pretrained on large-scale scRNA-seq datasets have shown great promise in analyzing such data, but existing approaches are often limited to modeling a small subset of highly expressed genes and lack the integration of external gene-specific knowledge. To address these limitations, we present scLong, a billion-parameter foundation model pretrained on 48 million cells. scLong performs self-attention across the entire set of 28,000 genes in the human genome. This enables the model to capture long-range dependencies between all genes, including lowly expressed ones, which often play critical roles in cellular processes but are typically excluded by existing foundation models. Additionally, scLong integrates gene knowledge from the Gene Ontology using a graph convolutional network, enriching its contextual understanding of gene functions and relationships. In extensive evaluations, scLong surpasses both state-of-the-art scRNA-seq foundation models and task-specific models across diverse tasks, including predicting transcriptional responses to genetic and chemical perturbations, forecasting cancer drug responses, and inferring gene regulatory networks.

  • The power paradox of detecting disease-associated and gene-expression-associated variants

    Nature Genetics · 2023-10-19 · 1 citation

    article · 1st author · Corresponding

Frequent coauthors

  • Soumya Raychaudhuri

    Brigham and Women's Hospital

    327 shared
  • Kazuyoshi Ishigaki

    RIKEN Center for Integrative Medical Sciences

    111 shared
  • Yang Luo

    Brigham and Women's Hospital

    110 shared
  • Harm-Jan Westra

    University Medical Center Groningen

    77 shared
  • Alkes L. Price

    Broad Institute

    68 shared
  • Emma E. Davenport

    Wellcome Sanger Institute

    61 shared
  • María Gutiérrez‐Arcelus

    49 shared
  • Steven Gazal

    44 shared