Steven L. Salzberg


Johns Hopkins University · Biochemistry and Molecular Biology

Active 1971–2026

h-index 191

Citations 397.3k

Papers 582 · 122 last 5y

Funding $31.1M · 2 active



About

Steven L. Salzberg, PhD, is a Bloomberg Distinguished Professor at Johns Hopkins University, holding appointments in Biomedical Engineering, Computer Science, and Biostatistics. He is the Director of the Center for Computational Biology. His research focuses on the development of new computational methods for analysis of DNA from the latest sequencing technologies. Over the years, he has developed and applied software to many problems in gene finding, genome assembly, comparative genomics, evolutionary genomics, and sequencing technology itself. His current work emphasizes analysis of DNA and RNA sequenced with next-generation technology. Salzberg's contributions to genomics software have been recognized through his inclusion on Clarivate’s Highly Cited Researchers list. He holds a PhD in Computer Science from Harvard University, an MPhil and MS in Computer Science from Yale University, and a BA in English from Yale University.

Research topics

Genetics
Biology
Computational biology
Computer Science
Botany
Evolutionary biology
Endocrinology
Medicine
Bioinformatics
Demography
Internal medicine

Selected publications

A reference genome sequence for the exceptionally long-lived Great Basin bristlecone pine, Pinus longaeva
G3 Genes Genomes Genetics · 2026-03-17
article · Open access · Senior author
Great Basin bristlecone pine (Pinus longaeva), one of two species of bristlecone pine, the other being Rocky Mountain bristlecone pine (P. aristata), is endemic to the high Great Basin mountains in eastern California, Nevada, and Utah. It is the upper treeline forest tree in this region, found mostly between 2900 m and 3600 m. The primary goal of this project was to generate a reference genome sequence for P. longaeva that, among its many possible applications, will serve as an important genetic resource to better understand the genetic mechanisms underlying its extreme longevity and its adaptation to the extreme environmental conditions where it is found. A combination of short-read and long-read sequences was generated from haploid megagametophyte and diploid needle tissues, respectively. A customized genome assembly approach was used to construct a highly contiguous 23.8-gigabase genome with a scaffold N50 size of 1.2 gigabases. The chloroplast and mitochondrial genomes were assembled separately into circular chromosomes with lengths of 120 kilobases and 8.68 megabases, respectively. While the number of disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs) and larger-than-average telomere lengths relative to other conifers have been suggested as genetic mechanisms for controlling longevity, we did not find strong evidence for their involvement. Clearly, further study is needed.
Setting higher standards for reports of microbial species in human cancers
Nature Cancer · 2026-02-19 · 1 citation
article · Open access · 1st author · Corresponding
Evidence of off-target probe binding affecting 10x Genomics Xenium gene panels compromise accuracy of spatial transcriptomic profiling
eLife · 2026-04-02
article · Open access
The accuracy of spatial gene expression profiles generated by probe-based in situ spatially-resolved transcriptomic technologies depends on the specificity with which probes bind to their intended target gene. Off-target binding, defined as a probe binding to something other than the target gene, can distort a gene’s true expression profile, making probe specificity essential for reliable transcriptomics. Here, we investigated off-target binding affecting the 10x Genomics Xenium technology. We developed a software tool, Off-target Probe Tracker (OPT), to identify putative off-target binding via alignment of probe sequences and assessing whether mapped loci corresponded to the intended target gene across multiple reference annotations. Applying OPT to a Xenium human breast gene panel, we identified at least 14 out of the 313 genes in the panel potentially impacted by off-target binding to protein-coding genes. To substantiate our predictions, we leveraged a Xenium breast cancer dataset generated using this gene panel and compared results to orthogonal spatial and single-cell transcriptomic profiles from Visium CytAssist and 3ʹ single-cell RNA-seq derived from the same tumor block. Our findings indicate that for some genes, the expression patterns detected by Xenium demonstrably reflect the aggregate expression of the target and predicted off-target genes based on Visium and single-cell RNA-seq rather than the target gene alone. We further applied OPT to identify potential off-target binding in custom gene panels and integrate tissue-specific RNA-seq data to assess effects. Overall, this work enhances the biological interpretability of spatial transcriptomics data and improves reproducibility in spatial transcriptomics research.
Comparison of unbiased metagenomic next generation sequencing to targeted multiplex diagnostic assays for the detection of respiratory viruses
PLoS ONE · 2026-05-07
article · Open access
OBJECTIVES: Accurate diagnosis of existing and emerging respiratory pathogens is important. We evaluated the capability of unbiased metagenomic next generation sequencing (mNGS) to identify pathogenic RNA viruses from two cohorts of nasopharyngeal (NP) swabs previously tested by commercial multiplex respiratory diagnostics. METHODS: NP swabs (N = 100) in viral transport media (VTM) were assessed using mNGS for this study. Cohort 1 (N = 52) consisted of symptomatic individuals who tested negative for SARS-CoV-2, influenza A/B, and RSV by the Xpert Xpress CoV-2/Flu/RSV Plus multiplex respiratory virus panel and were tested by mNGS for undetected pathogens. Cohort 2 (N = 48) included symptomatic individuals who were positive (N = 26) or negative (N = 22) by the ePlex RP2 multiplex respiratory pathogen panel. Samples were positive for influenza A (N = 8), rhinovirus/enterovirus (N = 5), RSV (N = 4), adenovirus (N = 3), parainfluenza (N = 2), seasonal coronaviruses (N = 2), and human metapneumovirus (N = 1), as well as a rhinovirus/enterovirus/human metapneumovirus co-infected sample (N = 1). mNGS results were compared with ePlex RP2 findings, and symptomatic negative samples were evaluated for additional pathogen detection. RESULTS: Cohort 1 contained 8% (4/52) viral and 19% (10/52) bacterial reads. In cohort 2, positive concordance between ePlex RP2 and mNGS was 31% (8/26). mNGS did not identify any viral reads in ePlex RP2-negative samples. However, it detected other microbial reads, such as Acanthamoeba castellanii, in 21% (10/48) of samples. CONCLUSION: In this study, targeted multiplex amplification methods demonstrated better overall sensitivity in NPs of symptomatic respiratory individuals than mNGS. Other mNGS approaches may produce different results. This study suggests that mNGS may offer adjunctive information, including the detection of rare pathogens, which may be helpful in some clinical contexts.
Efficient evidence-based genome annotation with EviAnn
bioRxiv (Cold Spring Harbor Laboratory) · 2025-05-12 · 7 citations
preprint · Open access · Senior author
For many years, machine learning-based ab initio gene finding approaches have been central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of gene expression data, allowing scientists to rely more heavily on this class of evidence. In addition, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotator), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely ab initio gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release and from Bioconda as “eviann”.
OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species
eLife · 2025-09-09
preprint · Open access
The SpliceAI deep learning system is currently one of the most accurate methods for identifying splicing signals directly from DNA sequences. However, its utility is limited by its reliance on older software frameworks and human-centric training data. Here we introduce OpenSpliceAI, a trainable, open-source version of SpliceAI implemented in PyTorch to address these challenges. OpenSpliceAI supports both training from scratch and transfer learning, enabling seamless retraining on species-specific datasets and mitigating human-centric biases. Our experiments show that it achieves faster processing speeds and lower memory usage than the original SpliceAI code, allowing large-scale analyses of extensive genomic regions on a single GPU. Additionally, OpenSpliceAI’s flexible architecture makes for easier integration with established machine learning ecosystems, simplifying the development of custom splicing models for different species and applications. We demonstrate that OpenSpliceAI’s output is highly concordant with SpliceAI. In silico mutagenesis (ISM) analyses confirm that both models rely on similar sequence features, and calibration experiments demonstrate similar score probability estimates.
Training interdisciplinary health teams to optimize the management and education of gestational diabetes
Diabetes Research and Clinical Practice · 2025-12-01
article · 1st author · Corresponding
Author response: OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species
2025-09-09
peer-review · Open access
Author response: OpenSpliceAI provides an efficient modular implementation of SpliceAI enabling easy retraining across nonhuman species
2025-10-30
peer-review · Open access
Phenotype to genotype: A new and rapid approach using whole-genome sequencing
PLoS Genetics · 2025-07-14
article · Open access · Corresponding
Forward genetic screening is a powerful approach to assign functions to genes and can be used to elucidate the many genes whose functions remain unknown. A key step in forward genetic screening is mapping: identification of the gene causing the phenotype. Existing mapping methods use a bioinformatic mapping-by-sequencing approach based on allelic frequency calculations that often identify large genomic regions which contain an intractable number of candidate genes for testing. Here, we describe WheresWalker, a modern mapping-by-sequencing algorithm that identifies a mutation-containing interval and then supports positional cloning to shrink the interval, which drastically reduces the number of potential candidates, allowing for extremely rapid mutation identification. We validated this method using mutants from a forward genetic mutagenesis screen in zebrafish for modifiers of ApoB-lipoprotein metabolism. WheresWalker correctly mapped and identified novel zebrafish mutations in mttp, apobb.1, and mia2 genes, as well as a previously published mutation in maize. Further, we used WheresWalker to identify a previously unappreciated ApoB-lipoprotein metabolism-modifying locus, slc3a2a.

Recent grants

NIH Grant K01HG000022
NIH · $335k · 2000
Bioinformatics Software for Analyzing Microbial Genomes
NIH · $2.1M · 2008–2019
A Software Framework for Exploring 1,000 Genomes of African Descent
NIH · $1.4M · 2015–2019
Computational Methods for Microbial and Microbiome Sequence Analysis
NIH · $2.9M · 2019–2030
The Terabase Search Engine
NIH · $1.1M · 2014–2018

Frequent coauthors

Claire M. Fraser
University of Maryland, Baltimore
215 shared
Mihaela Pertea
Johns Hopkins University
207 shared
Owen White
University of Maryland, Baltimore
200 shared
Jennifer R. Wortman
175 shared
Brian J. Haas
Broad Institute
175 shared
Arthur L. Delcher
Johns Hopkins University
145 shared
Daniela Puiu
Johns Hopkins University
136 shared
Tamara V. Feldblyum
Center for Devices and Radiological Health
120 shared

Education

Ph.D., Computer Science
Harvard University
1989
M. Phil., Computer Science
Yale University
1984
M.S., Computer Science
Yale University
1982
B.A., English
Yale University
1980

Awards & honors

Clarivate’s Highly Cited Researchers list (2025)
