Steven L. Salzberg
Verified · Johns Hopkins University · Biochemistry and Molecular Biology
Active 1971–2026
About
Steven L. Salzberg, PhD, is a Bloomberg Distinguished Professor at Johns Hopkins University, holding appointments in Biomedical Engineering, Computer Science, and Biostatistics. He is the Director of the Center for Computational Biology. His research focuses on the development of new computational methods for analysis of DNA from the latest sequencing technologies. Over the years, he has developed and applied software to many problems in gene finding, genome assembly, comparative genomics, evolutionary genomics, and sequencing technology itself. His current work emphasizes analysis of DNA and RNA sequenced with next-generation technology. Salzberg's contributions to genomics software have been recognized through his inclusion on Clarivate’s Highly Cited Researchers list. He holds a PhD in Computer Science from Harvard University, an MPhil and MS in Computer Science from Yale University, and a BA in English from Yale University.
Research topics
- Genetics
- Biology
- Computational biology
- Computer Science
- Botany
- Evolutionary biology
- Endocrinology
- Medicine
- Bioinformatics
- Demography
- Internal medicine
Selected publications
G3 Genes Genomes Genetics · 2026-03-17
article · Open access · Senior author
Great Basin bristlecone pine (Pinus longaeva), one of two species of bristlecone pine, the other being Rocky Mountain bristlecone pine (P. aristata), is endemic to the high Great Basin mountains in eastern California, Nevada, and Utah. It is the upper treeline forest tree in this region, found mostly between 2900 m and 3600 m. The primary goal of this project was to generate a reference genome sequence for P. longaeva that, among its many possible applications, will serve as an important genetic resource to better understand the genetic mechanisms underlying its extreme longevity and its adaptation to the extreme environmental conditions where it is found. A combination of short-read and long-read sequences was generated from haploid megagametophyte and diploid needle tissues, respectively. A customized genome assembly approach was used to construct a highly contiguous 23.8-gigabase genome with a scaffold N50 size of 1.2 gigabases. The chloroplast and mitochondrial genomes were assembled separately into circular chromosomes with lengths of 120 kilobases and 8.68 megabases, respectively. While the number of disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs) and larger-than-average telomere lengths relative to other conifers have been suggested as genetic mechanisms for controlling longevity, we did not find strong evidence for their involvement. Clearly, further study is needed.
Setting higher standards for reports of microbial species in human cancers
Nature Cancer · 2026-02-19 · 1 citations
article · Open access · 1st author · Corresponding
eLife · 2026-04-02
article · Open access
The accuracy of spatial gene expression profiles generated by probe-based in situ spatially-resolved transcriptomic technologies depends on the specificity with which probes bind to their intended target gene. Off-target binding, defined as a probe binding to something other than the target gene, can distort a gene’s true expression profile, making probe specificity essential for reliable transcriptomics. Here, we investigated off-target binding affecting the 10x Genomics Xenium technology. We developed a software tool, Off-target Probe Tracker (OPT), to identify putative off-target binding by aligning probe sequences and assessing whether mapped loci corresponded to the intended target gene across multiple reference annotations. Applying OPT to a Xenium human breast gene panel, we identified at least 14 out of the 313 genes in the panel potentially impacted by off-target binding to protein-coding genes. To substantiate our predictions, we leveraged a Xenium breast cancer dataset generated using this gene panel and compared results to orthogonal spatial and single-cell transcriptomic profiles from Visium CytAssist and 3ʹ single-cell RNA-seq derived from the same tumor block. Our findings indicate that for some genes, the expression patterns detected by Xenium demonstrably reflect the aggregate expression of the target and predicted off-target genes based on Visium and single-cell RNA-seq rather than the target gene alone. We further applied OPT to identify potential off-target binding in custom gene panels and integrate tissue-specific RNA-seq data to assess effects. Overall, this work enhances the biological interpretability of spatial transcriptomics data and improves reproducibility in spatial transcriptomics research.
PLoS ONE · 2026-05-07
article · Open access
OBJECTIVES: Accurate diagnosis of existing and emerging respiratory pathogens is important. We evaluated the capability of unbiased metagenomic next-generation sequencing (mNGS) to identify pathogenic RNA viruses from two cohorts of nasopharyngeal (NP) swabs previously tested by commercial multiplex respiratory diagnostics. METHODS: NP swabs (N = 100) in viral transport media (VTM) were assessed using mNGS for this study. Cohort 1 (N = 52) consisted of symptomatic individuals who tested negative for SARS-CoV-2, influenza A/B, and RSV by the Xpert Xpress CoV-2/Flu/RSV Plus multiplex respiratory virus panel and were tested by mNGS for undetected pathogens. Cohort 2 (N = 48) included symptomatic individuals who were positive (N = 26) or negative (N = 22) by the ePlex RP2 multiplex respiratory pathogen panel. Samples were positive for influenza A (N = 8), rhinovirus/enterovirus (N = 5), RSV (N = 4), adenovirus (N = 3), parainfluenza (N = 2), seasonal coronaviruses (N = 2), and human metapneumovirus (N = 1), as well as a rhinovirus/enterovirus/human metapneumovirus co-infected sample (N = 1). mNGS results were compared with ePlex RP2 findings, and symptomatic negative samples were evaluated for additional pathogen detection. RESULTS: Cohort 1 contained 8% (4/52) viral and 19% (10/52) bacterial reads. In cohort 2, positive concordance between ePlex RP2 and mNGS was 31% (8/26). mNGS did not identify any viral reads in ePlex RP2-negative samples. However, it detected other microbial reads, such as Acanthamoeba castellanii, in 21% (10/48) of samples. CONCLUSION: In this study, targeted multiplex amplification methods demonstrated better overall sensitivity than mNGS in NP swabs from individuals with respiratory symptoms. Other mNGS approaches may produce different results. This study suggests that mNGS may offer adjunctive information, including the detection of rare pathogens, which may be helpful in some clinical contexts.
Efficient evidence-based genome annotation with EviAnn
bioRxiv (Cold Spring Harbor Laboratory) · 2025-05-12 · 7 citations
preprint · Open access · Senior author
For many years, machine learning-based ab initio gene finding approaches have been central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of gene expression data, allowing scientists to rely more heavily on this class of evidence. In addition, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotator), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely ab initio gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release and from Bioconda as “eviann”.
eLife · 2025-09-09
preprint · Open access
The SpliceAI deep learning system is currently one of the most accurate methods for identifying splicing signals directly from DNA sequences. However, its utility is limited by its reliance on older software frameworks and human-centric training data. Here we introduce OpenSpliceAI, a trainable, open-source version of SpliceAI implemented in PyTorch to address these challenges. OpenSpliceAI supports both training from scratch and transfer learning, enabling seamless retraining on species-specific datasets and mitigating human-centric biases. Our experiments show that it achieves faster processing speeds and lower memory usage than the original SpliceAI code, allowing large-scale analyses of extensive genomic regions on a single GPU. Additionally, OpenSpliceAI’s flexible architecture makes for easier integration with established machine learning ecosystems, simplifying the development of custom splicing models for different species and applications. We demonstrate that OpenSpliceAI’s output is highly concordant with SpliceAI. In silico mutagenesis (ISM) analyses confirm that both models rely on similar sequence features, and calibration experiments demonstrate similar score probability estimates.
Diabetes Research and Clinical Practice · 2025-12-01
article · 1st author · Corresponding
2025-09-09
peer-review · Open access
2025-10-30
peer-review · Open access
Phenotype to genotype: A new and rapid approach using whole-genome sequencing
PLoS Genetics · 2025-07-14
article · Open access · Corresponding
Forward genetic screening is a powerful approach to assign functions to genes and can be used to elucidate the many genes whose functions remain unknown. A key step in forward genetic screening is mapping: identification of the gene causing the phenotype. Existing mapping methods use a bioinformatic mapping-by-sequencing approach based on allelic frequency calculations that often identify large genomic regions which contain an intractable number of candidate genes for testing. Here, we describe WheresWalker, a modern mapping-by-sequencing algorithm that identifies a mutation-containing interval and then supports positional cloning to shrink the interval, which drastically reduces the number of potential candidates, allowing for extremely rapid mutation identification. We validated this method using mutants from a forward genetic mutagenesis screen in zebrafish for modifiers of ApoB-lipoprotein metabolism. WheresWalker correctly mapped and identified novel zebrafish mutations in mttp, apobb.1, and mia2 genes, as well as a previously published mutation in maize. Further, we used WheresWalker to identify a previously unappreciated ApoB-lipoprotein metabolism-modifying locus, slc3a2a.
Recent grants
NIH · $335k · 2000
Bioinformatics Software for Analyzing Microbial Genomes
NIH · $2.1M · 2008–2019
A Software Framework for Exploring 1,000 Genomes of African Descent
NIH · $1.4M · 2015–2019
Computational Methods for Microbial and Microbiome Sequence Analysis
NIH · $2.9M · 2019–2030
NIH · $1.1M · 2014–2018
Frequent coauthors
- Claire M. Fraser · University of Maryland, Baltimore · 215 shared
- Mihaela Pertea · Johns Hopkins University · 207 shared
- Owen White · University of Maryland, Baltimore · 200 shared
- Jennifer R. Wortman · 175 shared
- Brian J. Haas · Broad Institute · 175 shared
- Arthur L. Delcher · Johns Hopkins University · 145 shared
- Daniela Puiu · Johns Hopkins University · 136 shared
- Tamara V. Feldblyum · Center for Devices and Radiological Health · 120 shared
Education
- 1989 · Ph.D., Computer Science · Harvard University
- 1984 · M. Phil., Computer Science · Yale University
- 1982 · M.S., Computer Science · Yale University
- 1980 · B.A., English · Yale University
Awards & honors
- Clarivate’s Highly Cited Researchers list (2025)