Nathan Sheffield

Assistant Professor of Genome Sciences · Verified

University of Virginia · Genome Sciences

Active 2008–2026

h-index: 38

Citations: 14.2k

Papers: 157 (89 in the last 5 years)

Funding: $5.1M (2 active grants)

About

Nathan Sheffield is an Associate Professor in the Department of Genome Sciences at the University of Virginia School of Medicine. He holds a B.S. in Bioinformatics from Brigham Young University and a Ph.D. in Computational Biology from Duke University. His research is at the interface of computation and biology, drawing on techniques in computer science, data science, bioinformatics, and statistics, and applying them to biological questions in cancer, epigenetics, development, and genomics. His particular projects include computational cancer epigenomics, where he investigates how cancers rewire normal regulatory machinery, using Ewing sarcoma as a model system to examine genome-wide epigenetic profiles. He is also engaged in genome-scale analysis of gene regulation and chromatin structure, focusing on how different cell types fold their DNA to enable complex regulatory patterns, and how regulatory DNA governs gene expression during development. His work involves the use of machine learning, supercomputing, and software engineering to analyze high-throughput genomic data, aiming to understand cellular changes and disease mechanisms.

Research topics

Biology
Medicine
Genetics
Data Mining
Computer Security
Bioinformatics
Computational biology
Computer Science
Political Science
Pathology
Data science
Business
Internal medicine
World Wide Web
Knowledge management

Selected publications

Automated biomedical hypothesis generation with time-aware hypergraph contrastive learning
Knowledge and Information Systems · 2026-05-17
article · Open access
Research in scientific domains now generates more than a million articles annually, overwhelming researchers and hindering discovery. This surge has sparked interest in biomedical hypothesis generation (HG), which aims to uncover implicit patterns among biomedical concepts. Most existing methods focus on pairwise link prediction, overlooking the complex, multi-concept relationships underlying many breakthroughs. We introduce HyHG, a temporal Hypergraph contrastive learning framework for biomedical Hypothesis Generation, which redefines hypotheses as hyperedges: sets of co-mentioned concepts in an article. By representing articles as hyperedges and organizing them into a temporal hypergraph, HyHG captures the evolution of scientific ideas over time. A transformer-based architecture learns from historical hyperedge sequences to predict future hyperedges, that is, sets of concepts likely to co-occur in the future literature. To distinguish genuine hypotheses from misleading ones, HyHG employs a time-anchored contrastive loss and hard negative sampling based on minimal edits to real hyperedges. We demonstrate that HyHG achieves state-of-the-art performance on three biomedical datasets. Our code and data are available at: https://github.com/amir-hassan25/Temporal-Hypergraph-Contrastive-Learning.
Card Dealing Math
ArXiv.org · 2025-09-14
preprint · Open access
Various card tricks involve under-down dealing, where alternatively one card is placed under the deck and the next card is dealt. We study how the cards need to be prepared in the deck to be dealt in order. The order in which the $N$ cards are prepared defines a permutation. In this work, we analyze general dealing patterns, considering properties of the resulting permutations. We give recursive formulas for these permutations, their inverses, the final dealt card, and the dealing order of the first card. We discuss some particular examples of dealing patterns and conclude with an analysis of several existing and novel magic card tricks making use of dealing patterns. Our discussions involve 30 existing sequences in the OEIS, and we introduce 44 new sequences to that database.
Standards in the Preparation of Biomedical Research Metadata: A Bridge2AI Perspective
ArXiv.org · 2025-09-12
preprint · Open access
AI-readiness describes the degree to which data may be optimally and ethically used for subsequent AI and Machine Learning (AI/ML) methods, where those methods may involve some combination of model training, data classification, and ethical, explainable prediction. The Bridge2AI consortium has defined the particular criteria a biomedical dataset may possess to render it AI-ready: in brief, a dataset's readiness is related to its FAIRness, provenance, degree of characterization, explainability, sustainability, and computability, in addition to its accompaniment with documentation about ethical data practices. To ensure AI-readiness and to clarify data structure and relationships within Bridge2AI's Grand Challenges (GCs), particular types of metadata are necessary. The GCs within the Bridge2AI initiative include four data-generating projects focusing on generating AI/ML-ready datasets to tackle complex biomedical and behavioral research problems. These projects develop standardized, multimodal data, tools, and training resources to support AI integration, while addressing ethical data practices. Examples include using voice as a biomarker, building interpretable genomic tools, modeling disease trajectories with diverse multimodal data, and mapping cellular and molecular health indicators across the human body. This report assesses the state of metadata creation and standardization in the Bridge2AI GCs, provides guidelines where required, and identifies gaps and areas for improvement across the program. New projects, including those outside the Bridge2AI consortium, would benefit from what we have learned about creating metadata as part of efforts to promote AI readiness.
Fast, memory-efficient genomic interval tokenizers for modern machine learning
ArXiv.org · 2025-11-03
preprint · Open access · Senior author
Introduction: Epigenomic datasets from high-throughput sequencing experiments are commonly summarized as genomic intervals. As the volume of this data grows, so does interest in analyzing it through deep learning. However, the heterogeneity of genomic interval data, where each dataset defines its own regions, creates barriers for machine learning methods that require consistent, discrete vocabularies. Methods: We introduce gtars-tokenizers, a high-performance library that maps genomic intervals to a predefined universe or vocabulary of regions, analogous to text tokenization in natural language processing. Built in Rust with bindings for Python, R, CLI, and WebAssembly, gtars-tokenizers implements two overlap methods (BITS and AIList) and integrates seamlessly with modern ML frameworks through Hugging Face-compatible APIs. Results: The gtars-tokenizers package achieves top efficiency for large-scale datasets, while enabling genomic intervals to be processed using standard ML workflows in PyTorch and TensorFlow without ad hoc preprocessing. This token-based approach bridges genomics and machine learning, supporting scalable and standardized analysis of interval data across diverse computational environments. Availability: PyPI and GitHub: https://github.com/databio/gtars.
ConceptDrift: leveraging spatial, temporal and semantic evolution of biomedical concepts for hypothesis generation
Bioinformatics · 2025-10-28
article · Open access
MOTIVATION: Hypothesis generation is a fundamental problem in biomedical text mining that aims to generate ideas that are new, interesting, and plausible by discovering unexplored links between biomedical concepts. Despite significant advances made by existing approaches, they do not fully leverage the evolutionary properties of biomedical concepts. This is limiting because scientific knowledge continually evolves over time, with new facts being added and old ones becoming obsolete. Thus, it is crucial to capture the evolutionary properties of biomedical concepts from multiple perspectives (e.g. spatial, temporal, and semantic) to generate hypotheses that reflect the up-to-date information landscape of the biomedical domain. RESULTS: We introduce a novel framework, ConceptDrift, that models the hypothesis generation task as a sequence of temporal graphlets and simultaneously encodes spatial, temporal, and semantic change. Unlike existing approaches that treat these dimensions independently, ConceptDrift is the first to provide a holistic understanding of concept evolution by integrating them into a unified framework. Grounded in the theories of the Distributional Hypothesis and Conceptual Change, our method adapts these principles to the unique challenges of large-scale biomedical literature. We conduct extensive experiments across multiple datasets and demonstrate that ConceptDrift consistently outperforms state-of-the-art baselines in generating accurate and meaningful hypotheses. Our framework shows immediate practical benefits for web-based literature mining tools in life sciences and biomedicine, offering more robust and predictive feature representations. AVAILABILITY AND IMPLEMENTATION: https://github.com/amir-hassan25/ConceptDrift (DOI: 10.6084/m9.figshare.29975476).
HyHG: A Temporal Hypergraph Contrastive Learning Framework for Biomedical Hypothesis Generation
2025-11-12 · 1 citation
article
Biomedical research now generates more than a million articles annually, overwhelming researchers and hindering discovery. This surge has sparked interest in biomedical hypothesis generation (HG), which aims to uncover implicit patterns among biomedical concepts. Most existing methods focus on pairwise link prediction, overlooking the complex, multi-concept relationships underlying many breakthroughs. We introduce HyHG, a temporal Hypergraph contrastive learning framework for biomedical Hypothesis Generation, which redefines hypotheses as hyperedges: sets of co-mentioned concepts in an article. By representing articles as hyperedges and organizing them into a temporal hypergraph, HyHG captures the evolution of scientific ideas over time. A transformer-based architecture learns from historical hyperedge sequences to predict future hyperedges, that is, sets of concepts likely to co-occur in future literature. To distinguish genuine hypotheses from misleading ones, HyHG employs a time-anchored contrastive loss and hard negative sampling based on minimal edits to real hyperedges. We demonstrate state-of-the-art performance on three biomedical datasets. Our code and data are available at: https://github.com/amirhassan25/Temporal-Hypergraph-Contrastive-Learning.
Atacformer: A transformer-based foundation model for analysis and interpretation of ATAC-seq data
bioRxiv (Cold Spring Harbor Laboratory) · 2025-11-04 · 2 citations
preprint · Open access · Senior author · Corresponding
Introduction: Chromatin accessibility profiling is an important tool for understanding gene regulation and cellular function. While public repositories house nearly 10,000 scATAC-seq experiments, unifying this data for meaningful analysis remains challenging. Existing tools struggle with the scale and complexity of scATAC-seq datasets, limiting tasks like clustering, cell-type annotation, and reference mapping. A promising solution is using foundation models adapted to specific tasks via transfer learning. While transfer learning has been applied to scRNA-seq, its potential for scATAC-seq remains underexplored. Methods: We introduce Atacformer, a transformer-based foundation model for scATAC-seq data analysis. Unlike other models that only produce cell-level representations, Atacformer generates embeddings for individual cis-regulatory elements. Pre-trained on a large atlas of scATAC-seq experiments, Atacformer learns robust representations of genomic regulatory regions for downstream use. After pretraining, the model is fine-tuned for cell-type prediction and batch correction. We also integrated Atacformer with RNA-seq data to build a Contrastive RNA-ATAC Fine Tuning (CRAFT) model capable of cross-modal alignment and RNA imputation from ATAC data. Results: Atacformer matches or exceeds leading scATAC-seq clustering tools in adjusted Rand index and runtime, with fine-tuned models achieving top performance across datasets. It processes raw fragment files end-to-end 80% faster than existing tools while preserving biological structure. Fine-tuned on bulk BED files, it recovers cell type and assay labels with >80% accuracy. We show how the Atacformer architecture produces contextualized embeddings of individual genomic regions, which we use to identify unannotated, cell-type-specific promoter elements directly from chromatin accessibility data.
Taming the reference genome jungle: the refget sequence collection standard
bioRxiv (Cold Spring Harbor Laboratory) · 2025-10-06 · 1 citation
preprint · Open access · Senior author · Corresponding
Reference genomes are foundational to genomics but suffer from widespread ambiguity and incompatibility due to inconsistent naming, undocumented differences, and lack of formal mechanisms for comparison. To address this, we introduce the GA4GH refget Sequence Collections (seqcol) standard. Refget seqcol is a framework for unambiguous representation, retrieval, and comparison of sequence collections such as reference genomes and transcriptomes. The seqcol standard comprises four components: a structured data schema, a canonical encoding algorithm that produces content-based, globally unique identifiers, a retrieval API, and a comparison protocol. This standard enables precise identification of sequence collections, even across decentralized or private systems, and allows compatibility assessments beyond exact identity, such as order-relaxed matches or shared coordinate systems. We applied the refget seqcol standard to 60 human and 36 mouse reference genomes sourced from major providers. Using digest-based comparisons, we quantified levels of similarity across attributes including sequence names, lengths, coordinate systems, and actual sequence content. Our analysis revealed some consistent subsets of sequences or coordinate systems, as well as substantial incompatibility among references and duplicate references under different names. To support adoption of refget seqcol, we provide a Python package implementing the full standard, a web API, and a comparison interface allowing users to assess local references against a curated database. This work offers a scalable, reproducible solution to the reference genome compatibility crisis, enabling improved transparency, reuse, and integration in genomic analyses. Refget seqcol enhances interoperability across tools and datasets, making genomic research more robust and reproducible.
AI-readiness Criteria for Biomedical Data
bioRxiv (Cold Spring Harbor Laboratory) · 2024-10-25 · 23 citations
preprint · Open access
Biomedical research is rapidly adopting artificial intelligence (AI). Yet the inherent complexity of biomedical data preparation requires implementing actionable, robust criteria for ethical and explainable AI (XAI) at the “pre-model” stage, encompassing data acquisition, detailed transformations, and ethical governance. Simple conformance to FAIR (Findable, Accessible, Interoperable, Reusable) Principles is insufficient. Here, we define criteria and practices for reliable AI-readiness of biomedical data, developed by the NIH Bridge to Artificial Intelligence (Bridge2AI) Standards Working Group across seven core dimensions of dataset AI-readiness: FAIRness, Provenance, Characterization, Ethics, Pre-model Explainability, Sustainability, and Computability. Conformance to these criteria provides a basis for pre-model scientific rigor and ethical integrity, mitigating downstream risks of bias and error prior to AI modeling. We apply and evaluate these standards across all four Bridge2AI flagship datasets, spanning functional genomics to clinical medicine, and encode them in machine-actionable metadata bound to the datasets. This framework sets a benchmark for preparing ethical, reusable datasets in biomedical AI and provides standardized methods for reliable pre-model data evaluation.
Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets
Bioengineering · 2024-03-08 · 7 citations
article · Open access · Senior author · Corresponding
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.

Recent grants

Novel methods for large-scale genomic interval comparison
NIH · $1.8M · 2022–2026
A modular data analysis ecosystem using portable encapsulated projects
NIH · $1.6M · 2018–2023
A modular data analysis ecosystem using portable encapsulated projects
NIH · $360k · 2018–2023
Continually Adaptive Machine Learning Platform for Personalized Biomedical Literature Curation and Exploration
NIH · $1.3M · 2023–2027

Frequent coauthors

Christoph Bock
70 shared
Jason P. Smith
United States Department of Veterans Affairs · 43 shared
Terrence S. Furey
University of North Carolina at Chapel Hill · 37 shared
Gregory E. Crawford
Durham Technical Community College · 35 shared
Lingyun Song
33 shared
Yoichiro Shibata
30 shared
Alok K. Tewari
Dana-Farber Cancer Institute · 29 shared
Galip Gürkan Yardımcı
Oregon Health & Science University · 28 shared

Education

Postdoc
Stanford University · 2016
Postdoc
CeMM Research Center for Molecular Medicine · 2015
PhD, Computational Biology and Bioinformatics (Program in Computational Biology and Bioinformatics)
Duke University · 2013
B.S., Bioinformatics, Biology
Brigham Young University · 2008
