Michael Schatz

Bloomberg Distinguished Professor of Computer Science and Biology · Verified

Johns Hopkins University · Genetics and Molecular Biology

Active 1997–2026

h-index: 111

Citations: 64.3k

Papers: 430 (169 in the last 5 years)

Funding: $61.2M (2 active grants)



About

Michael Schatz is the Bloomberg Distinguished Professor of Computer Science and Biology at Johns Hopkins University. His research addresses computational problems in genomics, developing biotechnologies and computational tools to study the sequence and function of genomes. This work advances understanding of genome structure, evolution, and function, particularly in medicine (autism spectrum disorders, cancer, and other human diseases) and in agriculture. He has created many widely used methods and software tools for genome assembly and analysis, including NGMLR and Sniffles for long-read sequencing analysis, Scalpel for genetic variant discovery, and GECCO for studying complex genomic variations. His lab has identified numerous structural alterations in cancer genomes and developed tools such as Ginkgo for single-cell copy number profiling. His contributions also include computational methods for genome assembly and analysis across species using single-molecule sequencing technologies. He is a faculty member in the Department of Computer Science and the Department of Biology at Johns Hopkins and is involved with the Cancer Prevention and Control Program at the Sidney Kimmel Comprehensive Cancer Center. His awards include a 2015 Alfred P. Sloan Foundation Fellowship and an NSF CAREER Award (2014), and he has been recognized for his leadership in computational biology.

Research topics

Genetics
Biology
Computer Science
Computational biology
Evolutionary biology
Data Mining
Data science
Machine Learning
Artificial Intelligence
Biotechnology
Botany
Medicine
World Wide Web
Zoology

Selected publications

The Common Fund Data Ecosystem (CFDE)
bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-12
article · Open access
The NIH Common Fund Data Ecosystem (CFDE) integrates data resources from 18 NIH Common Fund programs for discovery and integrative analysis. These programs generate valuable but heterogeneous datasets that can be difficult to discover, access, and reuse. CFDE aims to provide a collaborative, community-built infrastructure that links and enriches Common Fund programs. We describe the evolution, structure, and core technologies of CFDE, including practical approaches that support submission, integration, visualization, and public release of multimodal data. Training programs and workforce initiatives lower barriers to adoption. CFDE has devised solutions to critical issues facing cross-program initiatives, including data scale and heterogeneity, dataset integration, and long-term sustainability. We demonstrate the utility of linking Common Fund resources through integrative tools and cross-dataset queries to yield insights that would otherwise be infeasible. Collectively, CFDE shows that a standards-driven, federated approach enhances and unifies cross-disciplinary resources, fostering collaboration and data-driven discovery.
Integration of transcriptomics and long-read genomics prioritizes structural variants in rare disease
Genome Research · 2025-03-20 · 9 citations
article · Open access
Rare structural variants (SVs)—insertions, deletions, and complex rearrangements—can cause Mendelian disease, yet they remain difficult to accurately detect and interpret. We sequenced and analyzed Oxford Nanopore Technologies long-read genomes of 68 individuals from the undiagnosed disease network (UDN) with no previously identified diagnostic mutations from short-read sequencing. Using our optimized SV detection pipelines and 571 control long-read genomes, we detected 716 long-read rare (MAF < 0.01) SV alleles per genome on average, achieving a 2.4× increase from short reads. To characterize the functional effects of rare SVs, we assessed their relationship with gene expression from blood or fibroblasts from the same individuals and found that rare SVs overlapping enhancers were enriched (LOR = 0.46) near expression outliers. We also evaluated tandem repeat expansions (TREs) and found 14 rare TREs per genome; notably, these TREs were also enriched near overexpression outliers. To prioritize candidate functional SVs, we developed Watershed-SV, a probabilistic model that integrates expression data with SV-specific genomic annotations, which significantly outperforms baseline models that do not incorporate expression data. Watershed-SV identified a median of eight high-confidence functional SVs per UDN genome. Notably, this included compound heterozygous deletions in FAM177A1 shared by two siblings, which were likely causal for a rare neurodevelopmental disorder. Our observations demonstrate the promise of integrating long-read sequencing with gene expression toward improving the prioritization of functional SVs and TREs in rare disease patients.
Complete sequencing of ape genomes
Nature · 2025-04-09 · 119 citations
article · Open access
[...] Consequently, our understanding of the evolution of our species is incomplete. Here we present haplotype-resolved reference genomes and comparative analyses of six ape species: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and siamang. We achieve chromosome-level contiguity with substantial sequence accuracy (<1 error in 2.7 megabases) and completely sequence 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, to provide in-depth evolutionary insights. Comparative analyses enabled investigations of the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference genome. Such regions include newly minted gene families in lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes and subterminal heterochromatin. This resource serves as a comprehensive baseline for future evolutionary studies of humans and our closest living ape relatives.
Engineering compact Physalis peruviana (goldenberry) to promote its potential as a global crop
Plants People Planet · 2025-12-04 · 2 citations
article · Open access · Corresponding author
Societal Impact Statement: Goldenberry (Physalis peruviana) produces sweet, nutritionally rich berries, yet like many minor crops, is cultivated in limited geographical regions and has not been a focus of breeding programs for trait enhancement. Leveraging knowledge of plant architecture-related traits from related species, we used CRISPR/Cas9-mediated gene editing to generate a compact ideotype to advance future breeding efforts and agricultural production. Goldenberry growers will benefit from these compact versions because they optimize per-plot yield, facilitating larger-scale production to meet rising consumer popularity and demand.
MEM-based pangenome indexing for k-mer queries
Algorithms for Molecular Biology · 2025-03-01 · 1 citation
article · Open access
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
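To make the two query types in this abstract concrete, here is a minimal brute-force sketch in Python. It is not MEMO's index and makes no attempt at its space or time efficiency: it scans each genome with plain substring search and ignores reverse complements. The function names and the three-sequence toy pangenome are hypothetical, chosen only for illustration.

```python
from typing import List


def window_kmers(window: str, k: int) -> List[str]:
    """Return the k-mers of `window` in left-to-right order."""
    return [window[i:i + k] for i in range(len(window) - k + 1)]


def membership_query(window: str, genomes: List[str], k: int) -> List[List[bool]]:
    """For each k-mer in the window, report presence/absence in every genome."""
    return [[kmer in genome for genome in genomes]
            for kmer in window_kmers(window, k)]


def conservation_query(window: str, genomes: List[str], k: int) -> List[int]:
    """For each k-mer in the window, count how many genomes contain it."""
    return [sum(row) for row in membership_query(window, genomes, k)]


# Hypothetical toy pangenome (assumption: forward strand only).
toy_genomes = ["ACGTACGT", "ACGTTCGT", "TTTTACGA"]

print(membership_query("ACGTA", toy_genomes, 4))
print(conservation_query("ACGTA", toy_genomes, 4))
```

In this naive form every query costs time proportional to the total pangenome length for each k-mer, which is the kind of cost a precomputed index is meant to avoid.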
A complete diploid human genome benchmark for personalized genomics
bioRxiv (Cold Spring Harbor Laboratory) · 2025-09-21 · 20 citations
preprint · Open access
Human genome resequencing typically involves mapping reads to a reference genome to call variants; however, this approach suffers from both technical and reference biases, leaving many duplicated and structurally polymorphic regions of the genome unmapped. Consequently, existing variant benchmarks, generated by the same methods, fail to assess these complex regions. To address this limitation, we present a telomere-to-telomere genome benchmark that achieves near-perfect accuracy (i.e. no detectable errors) across 99.4% of the complete, diploid HG002 genome. This benchmark adds 701.4 Mb of autosomal sequence and both sex chromosomes (216.8 Mb), totaling 15.3% of the genome that was absent from prior benchmarks. We also provide a diploid annotation of genes, transposable elements, segmental duplications, and satellite repeats, including 39,144 protein-coding genes across both haplotypes. To facilitate application of the benchmark, we developed tools for measuring the accuracy of sequencing reads, phased variant call sets, and genome assemblies against a diploid reference. Genome-wide analyses show that state-of-the-art de novo assembly methods resolve 2-7% more sequence and outperform variant calling accuracy by an order of magnitude, yielding just one error per 100 kb across 99.9% of the benchmark regions. Adoption of genome-based benchmarking is expected to accelerate the development of cost-effective methods for complete genome sequencing, expanding the reach of genomic medicine to the entire genome and enabling a new era of personalized genomics.
Uncalled4 improves nanopore DNA and RNA modification detection via fast and accurate signal alignment
Nature Methods · 2025-03-28 · 42 citations
article · Open access · Senior author
Nanopore signal analysis enables detection of nucleotide modifications from native DNA and RNA sequencing, providing both accurate genetic or transcriptomic and epigenetic information without additional library preparation. At present, only a limited set of modifications can be directly basecalled (for example, 5-methylcytosine), while most others require exploratory methods that often begin with alignment of nanopore signal to a nucleotide reference. We present Uncalled4, a toolkit for nanopore signal alignment, analysis and visualization. Uncalled4 features an efficient banded signal alignment algorithm, BAM signal alignment file format, statistics for comparing signal alignment methods and a reproducible de novo training method for k-mer-based pore models, revealing potential errors in Oxford Nanopore Technologies' state-of-the-art DNA model. We apply Uncalled4 to RNA 6-methyladenine (m6A) detection in seven human cell lines, identifying 26% more modifications than Nanopolish using m6Anet, including in several genes where m6A has known implications in cancer. Uncalled4 is available open source at github.com/skovaka/uncalled4.
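The abstract does not describe how Uncalled4's alignment works internally, so the following is only a generic illustration of what "banded" can mean when aligning a signal to a reference: dynamic time warping restricted to a band around the diagonal. The function name, the absolute-difference cost, and the toy values are assumptions for illustration and are not taken from Uncalled4.

```python
import math
from typing import List


def banded_dtw(signal: List[float], reference: List[float], band: int) -> float:
    """Dynamic time warping cost, evaluating only cells near the diagonal.

    `signal` stands in for measured current samples and `reference` for
    expected levels (both hypothetical here). Cells further than `band`
    columns from the scaled diagonal are never filled, which is what
    keeps the work well below the full len(signal) * len(reference) table.
    """
    n, m = len(signal), len(reference)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        centre = (i * m) // n            # diagonal position for this row
        lo = max(1, centre - band)
        hi = min(m, centre + band)
        for j in range(lo, hi + 1):
            step = abs(signal[i - 1] - reference[j - 1])
            best_prev = min(cost[i - 1][j],      # signal sample repeats a level
                            cost[i][j - 1],      # reference level skipped ahead
                            cost[i - 1][j - 1])  # both advance
            cost[i][j] = step + best_prev
    # math.inf means no warping path fits inside the band.
    return cost[n][m]


print(banded_dtw([1.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0], band=1))
print(banded_dtw([0.0, 0.0], [1.0, 1.0], band=1))
```

A narrower band is faster but can exclude the true path, so in this sketch the band width is a trade-off the caller must choose.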
Learning Explainable Imaging-Genetics Associations Related to a Neurological Disorder
Lecture Notes in Computer Science · 2025-09-19
book chapter · Open access
Unraveling the hidden complexity of cancer through long-read sequencing
Genome Research · 2025-03-20 · 8 citations
review · Open access · Senior author
Cancer is fundamentally a disease of the genome, characterized by extensive genomic, transcriptomic, and epigenomic alterations. Most current studies predominantly use short-read sequencing, gene panels, or microarrays to explore these alterations; however, these technologies can systematically miss or misrepresent certain types of alterations, especially structural variants, complex rearrangements, and alterations within repetitive regions. Long-read sequencing is rapidly emerging as a transformative technology for cancer research by providing a comprehensive view across the genome, transcriptome, and epigenome, including the ability to detect alterations that previous technologies have overlooked. In this Perspective, we explore the current applications of long-read sequencing for both germline and somatic cancer analysis. We provide an overview of the computational methodologies tailored to long-read data and highlight key discoveries and resources within cancer genomics that were previously inaccessible with prior technologies. We also address future opportunities and persistent challenges, including the experimental and computational requirements needed to scale to larger sample sizes, the hurdles in sequencing and analyzing complex cancer genomes, and opportunities for leveraging machine learning and artificial intelligence technologies for cancer informatics. We further discuss how the telomere-to-telomere genome and the emerging human pangenome could enhance the resolution of cancer genome analysis, potentially revolutionizing early detection and disease monitoring in patients. Finally, we outline strategies for transitioning long-read sequencing from research applications to routine clinical practice.
Genomic Next-Token Predictors are In-Context Learners
arXiv · 2025-11-16
preprint · Open access
In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.

Recent grants

A Federated Galaxy for user-friendly large-scale cancer genomics research
NIH · $3.9M · 2018–2024
Expanding the AnVIL (Analysis, Visualization, and Informatics Lab-space)
NIH · $20.8M · 2018–2028
Integrative genomic and epigenomic analysis of cancer using long read sequencing
NIH · $1.1M · 2021–2025
Tuning big data analysis infrastructure for HIV research
NIH · $4.2M · 2017–2024
NIH Grant R01AI020426
NIH · $361k · 1990

Frequent coauthors

Fritz J. Sedlazeck
Rice University
125 shared
Steven L. Salzberg
Johns Hopkins University
119 shared
Brian J. Haas
Broad Institute
81 shared
Mihaela Pertea
Johns Hopkins University
76 shared
Owen White
University of Maryland, Baltimore
76 shared
Claire M. Fraser
University of Maryland, Baltimore
76 shared
Adam M. Phillippy
National Human Genome Research Institute
76 shared
Jennifer R. Wortman
76 shared

Education

Ph.D., Computer Science
University of Maryland at College Park
2010
B.S., Computer Science
Carnegie Mellon University
2000

Awards & honors

2015 Alfred P. Sloan Foundation Fellowship for Computational…
NSF CAREER Award (2014)
Genome Technology’s Young Investigator of the Year (2010)
Winship Herr Award for Excellence in Teaching from the Watso…
