Michael Schatz

Bloomberg Distinguished Professor of Computer Science and Biology

Johns Hopkins University · Genetics and Molecular Biology

Active 1997–2026

h-index: 111
Citations: 64.3k
Papers: 430 (169 in the last 5 years)
Funding: $61.2M (2 active)

About

Michael Schatz is the Bloomberg Distinguished Professor of Computational Biology and Oncology at Johns Hopkins University. His research addresses computational problems in genomics, developing biotechnologies and computational tools to study the sequence and function of genomes. His work advances understanding of genome structure, evolution, and function, particularly in medicine (autism spectrum disorders, cancer, and other human diseases) and in agriculture. He has created many widely used methods and software for genome assembly and analysis, including NGMLR and Sniffles for long-read sequencing analysis, Scalpel for genetic variant discovery, and GECCO for studying complex genomic variations. His lab has identified numerous structural alterations in cancer genomes and developed tools such as Ginkgo for single-cell copy number profiling. His contributions extend to computational methods for genome assembly and analysis across species using single-molecule sequencing technologies. He is a faculty member in the Department of Computer Science and the Department of Biology at Johns Hopkins and is involved with the Cancer Prevention and Control Program at the Sidney Kimmel Comprehensive Cancer Center. His awards include the 2015 Sloan Foundation Fellowship and an NSF CAREER Award, and he has been recognized for his leadership in computational biology.

Research topics

  • Genetics
  • Biology
  • Computer Science
  • Computational biology
  • Evolutionary biology
  • Data Mining
  • Data science
  • Machine Learning
  • Artificial Intelligence
  • Biotechnology
  • Botany
  • Medicine
  • World Wide Web
  • Zoology

Selected publications

  • The Common Fund Data Ecosystem (CFDE)

    bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-12

    article · Open access

    The NIH Common Fund Data Ecosystem (CFDE) integrates data resources from 18 NIH Common Fund programs for discovery and integrative analysis. These programs generate valuable but heterogeneous datasets that can be difficult to discover, access, and reuse. CFDE aims to provide a collaborative, community-built infrastructure that links and enriches Common Fund programs. We describe the evolution, structure, and core technologies of CFDE, including practical approaches that support submission, integration, visualization, and public release of multimodal data. Training programs and workforce initiatives lower barriers to adoption. CFDE has devised solutions to critical issues facing cross-program initiatives, including data scale and heterogeneity, dataset integration, and long-term sustainability. We demonstrate the utility of linking Common Fund resources through integrative tools and cross-dataset queries to yield insights that would otherwise be infeasible. Collectively, CFDE shows that a standards-driven, federated approach enhances and unifies cross-disciplinary resources, fostering collaboration and data-driven discovery.

  • Integration of transcriptomics and long-read genomics prioritizes structural variants in rare disease

    Genome Research · 2025-03-20 · 9 citations

    article · Open access

    Rare structural variants (SVs)—insertions, deletions, and complex rearrangements—can cause Mendelian disease, yet they remain difficult to accurately detect and interpret. We sequenced and analyzed Oxford Nanopore Technologies long-read genomes of 68 individuals from the undiagnosed disease network (UDN) with no previously identified diagnostic mutations from short-read sequencing. Using our optimized SV detection pipelines and 571 control long-read genomes, we detected 716 long-read rare (MAF < 0.01) SV alleles per genome on average, achieving a 2.4× increase from short reads. To characterize the functional effects of rare SVs, we assessed their relationship with gene expression from blood or fibroblasts from the same individuals and found that rare SVs overlapping enhancers were enriched (LOR = 0.46) near expression outliers. We also evaluated tandem repeat expansions (TREs) and found 14 rare TREs per genome; notably, these TREs were also enriched near overexpression outliers. To prioritize candidate functional SVs, we developed Watershed-SV, a probabilistic model that integrates expression data with SV-specific genomic annotations, which significantly outperforms baseline models that do not incorporate expression data. Watershed-SV identified a median of eight high-confidence functional SVs per UDN genome. Notably, this included compound heterozygous deletions in FAM177A1 shared by two siblings, which were likely causal for a rare neurodevelopmental disorder. Our observations demonstrate the promise of integrating long-read sequencing with gene expression toward improving the prioritization of functional SVs and TREs in rare disease patients.

  • Complete sequencing of ape genomes

    Nature · 2025-04-09 · 119 citations

    article · Open access

    … Consequently, our understanding of the evolution of our species is incomplete. Here we present haplotype-resolved reference genomes and comparative analyses of six ape species: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and siamang. We achieve chromosome-level contiguity with substantial sequence accuracy (<1 error in 2.7 megabases) and completely sequence 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, to provide in-depth evolutionary insights. Comparative analyses enabled investigations of the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference genome. Such regions include newly minted gene families in lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes and subterminal heterochromatin. This resource serves as a comprehensive baseline for future evolutionary studies of humans and our closest living ape relatives.

  • Engineering compact Physalis peruviana (goldenberry) to promote its potential as a global crop

    Plants People Planet · 2025-12-04 · 2 citations

    article · Open access · Corresponding author

    Societal Impact Statement: Goldenberry (Physalis peruviana) produces sweet, nutritionally rich berries, yet like many minor crops, is cultivated in limited geographical regions and has not been a focus of breeding programs for trait enhancement. Leveraging knowledge of plant architecture‐related traits from related species, we used CRISPR/Cas9‐mediated gene editing to generate a compact ideotype to advance future breeding efforts and agricultural production. Goldenberry growers will benefit from these compact versions because it optimizes per plot yield, facilitating larger scale production to meet rising consumer popularity and demand.

  • MEM-based pangenome indexing for k-mer queries

    Algorithms for Molecular Biology · 2025-03-01 · 1 citation

    article · Open access

    Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.

  • A complete diploid human genome benchmark for personalized genomics

    bioRxiv (Cold Spring Harbor Laboratory) · 2025-09-21 · 20 citations

    preprint · Open access

    Human genome resequencing typically involves mapping reads to a reference genome to call variants; however, this approach suffers from both technical and reference biases, leaving many duplicated and structurally polymorphic regions of the genome unmapped. Consequently, existing variant benchmarks, generated by the same methods, fail to assess these complex regions. To address this limitation, we present a telomere-to-telomere genome benchmark that achieves near-perfect accuracy (i.e. no detectable errors) across 99.4% of the complete, diploid HG002 genome. This benchmark adds 701.4 Mb of autosomal sequence and both sex chromosomes (216.8 Mb), totaling 15.3% of the genome that was absent from prior benchmarks. We also provide a diploid annotation of genes, transposable elements, segmental duplications, and satellite repeats, including 39,144 protein-coding genes across both haplotypes. To facilitate application of the benchmark, we developed tools for measuring the accuracy of sequencing reads, phased variant call sets, and genome assemblies against a diploid reference. Genome-wide analyses show that state-of-the-art de novo assembly methods resolve 2-7% more sequence and outperform variant calling accuracy by an order of magnitude, yielding just one error per 100 kb across 99.9% of the benchmark regions. Adoption of genome-based benchmarking is expected to accelerate the development of cost-effective methods for complete genome sequencing, expanding the reach of genomic medicine to the entire genome and enabling a new era of personalized genomics.

  • Uncalled4 improves nanopore DNA and RNA modification detection via fast and accurate signal alignment

    Nature Methods · 2025-03-28 · 42 citations

    article · Open access · Senior author

    Nanopore signal analysis enables detection of nucleotide modifications from native DNA and RNA sequencing, providing both accurate genetic or transcriptomic and epigenetic information without additional library preparation. At present, only a limited set of modifications can be directly basecalled (for example, 5-methylcytosine), while most others require exploratory methods that often begin with alignment of nanopore signal to a nucleotide reference. We present Uncalled4, a toolkit for nanopore signal alignment, analysis and visualization. Uncalled4 features an efficient banded signal alignment algorithm, BAM signal alignment file format, statistics for comparing signal alignment methods and a reproducible de novo training method for k-mer-based pore models, revealing potential errors in Oxford Nanopore Technologies' state-of-the-art DNA model. We apply Uncalled4 to RNA 6-methyladenine (m6A) detection in seven human cell lines, identifying 26% more modifications than Nanopolish using m6Anet, including in several genes where m6A has known implications in cancer. Uncalled4 is available open source at github.com/skovaka/uncalled4.

  • Learning Explainable Imaging-Genetics Associations Related to a Neurological Disorder

    Lecture notes in computer science · 2025-09-19

    book chapter · Open access
  • Unraveling the hidden complexity of cancer through long-read sequencing

    Genome Research · 2025-03-20 · 8 citations

    review · Open access · Senior author

    Cancer is fundamentally a disease of the genome, characterized by extensive genomic, transcriptomic, and epigenomic alterations. Most current studies predominantly use short-read sequencing, gene panels, or microarrays to explore these alterations; however, these technologies can systematically miss or misrepresent certain types of alterations, especially structural variants, complex rearrangements, and alterations within repetitive regions. Long-read sequencing is rapidly emerging as a transformative technology for cancer research by providing a comprehensive view across the genome, transcriptome, and epigenome, including the ability to detect alterations that previous technologies have overlooked. In this Perspective, we explore the current applications of long-read sequencing for both germline and somatic cancer analysis. We provide an overview of the computational methodologies tailored to long-read data and highlight key discoveries and resources within cancer genomics that were previously inaccessible with prior technologies. We also address future opportunities and persistent challenges, including the experimental and computational requirements needed to scale to larger sample sizes, the hurdles in sequencing and analyzing complex cancer genomes, and opportunities for leveraging machine learning and artificial intelligence technologies for cancer informatics. We further discuss how the telomere-to-telomere genome and the emerging human pangenome could enhance the resolution of cancer genome analysis, potentially revolutionizing early detection and disease monitoring in patients. Finally, we outline strategies for transitioning long-read sequencing from research applications to routine clinical practice.

  • Genomic Next-Token Predictors are In-Context Learners

    ArXiv.org · 2025-11-16

    preprint · Open access

    In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training? To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.

Frequent coauthors

  • Fritz J. Sedlazeck

    Rice University

    125 shared
  • Steven L. Salzberg

    Johns Hopkins University

    119 shared
  • Brian J. Haas

    Broad Institute

    81 shared
  • Mihaela Pertea

    Johns Hopkins University

    76 shared
  • Owen White

    University of Maryland, Baltimore

    76 shared
  • Claire M. Fraser

    University of Maryland, Baltimore

    76 shared
  • Adam M. Phillippy

    National Human Genome Research Institute

    76 shared
  • Jennifer R. Wortman

    76 shared

Education

  • Ph.D., Computer Science

    University of Maryland at College Park

    2010
  • B.S., Computer Science

    Carnegie Mellon University

    2000

Awards & honors

  • 2015 Alfred P. Sloan Foundation Fellowship for Computational…
  • NSF CAREER Award (2014)
  • Genome Technology’s Young Investigator of the Year (2010)
  • Winship Herr Award for Excellence in Teaching from the Watso…