
Mark Borodovsky
Regents' Professor, Joint with Wallace H.… · Georgia Institute of Technology · Computer Science
Active 1989–2026
Research topics
- Computer Science
- Genetics
- Biology
- Artificial Intelligence
- Data Mining
- Machine Learning
- Computational biology
- Database
- Horticulture
- Botany
Selected publications
HPVarcall: Calling lineages and sublineages for partial DNA sequences of human papillomavirus
bioRxiv (Cold Spring Harbor Laboratory) · 2026-01-02
Article · Open access · Senior author
We describe a computational method, HPVarcall, that assigns DNA sequences of a human papillomavirus (HPV) variant of known type to lineages and sublineages. The algorithm relies on statistical models (positional frequency profiles) trained on multiple alignments of HPV genomic sequences that are known to belong to specific sublineages of a given HPV type. The workflow begins with multiple alignment of all available sequences for the HPV type, followed by construction of a phylogenetic tree and identification of branches containing sublineage-specific reference sequences. In the prediction phase, sublineage-specific statistical models are used to compute the posterior probabilities for each sublineage given a query sequence. The query is assigned to the sublineage with the highest posterior probability. Accuracy assessments performed for the nine HPV types included in the Gardasil 9 vaccine demonstrated a low error rate in assigning HPV genomic fragments of at least 1000 nucleotides to their correct sublineages and even higher accuracy for longer sequence fragments.
Translon: a single term for translated regions
Nature Methods · 2025-09-01 · 11 citations
Letter · Open access
Genome Research · 2024-05-01 · 531 citations
Article · Open access
Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement integrating all three data types was made by the recently released GeneMark-ETP. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS, and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperforms BRAKER1 and BRAKER2. The average transcript-level F1-score is increased by about 20 percentage points, and the difference is most pronounced for species with large and complex genomes. BRAKER3 also outperforms other existing tools, MAKER2, Funannotate, and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes
Genome Research · 2024-05-01 · 130 citations
Article · Open access · Senior author
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic-, and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for making gene predictions with "high confidence." The genes situated in the genomic space between the high-confidence genes are predicted in the next stage. The set of high-confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperforms gene finders using a single type of extrinsic evidence. Comparisons with gene finders MAKER2 and TSEBRA, those that use both transcript- and protein-derived extrinsic evidence, show that GeneMark-ETP delivers state-of-the-art gene-prediction accuracy, with the margin of outperforming existing approaches increasing in its application to larger and more complex eukaryotic genomes.
bioRxiv (Cold Spring Harbor Laboratory) · 2023-01-15 · 60 citations
Preprint · Open access · Senior author · Corresponding
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic- and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for making gene predictions with 'high confidence'. The genes situated in the genomic space between the high confidence genes are predicted in the next stage. The set of high confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperformed gene finders using a single type of extrinsic evidence. Comparisons with gene finders utilizing both transcript- and protein-derived extrinsic evidence, MAKER2 and TSEBRA, demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy, with the margin of outperforming existing approaches increasing in its applications to larger and more complex eukaryotic genomes.
bioRxiv (Cold Spring Harbor Laboratory) · 2023-06-12 · 234 citations
Preprint · Open access
Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene. User-friendly and accurate annotation pipelines that can cope with such data heterogeneity are needed. The previously developed annotation pipelines BRAKER1 and BRAKER2 use RNA-seq or protein data, respectively, but not both. A further significant performance improvement was made by the recently released GeneMark-ETP integrating all three data types. We here present the BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 annotates protein-coding genes in eukaryotic genomes using both short-read RNA-seq and a large protein database, along with statistical models learned iteratively and specifically for the target genome. We benchmarked the new pipeline on genomes of 11 species under an assumed level of relatedness of the target species proteome to available proteomes. BRAKER3 outperformed BRAKER1 and BRAKER2. The average transcript-level F1-score was increased by ~20 percentage points, and the difference was most pronounced for species with large and complex genomes. BRAKER3 also outperformed other existing tools, MAKER2, Funannotate and FINDER. The code of BRAKER3 is available on GitHub and as a ready-to-run Docker container for execution with Docker or Singularity. Overall, BRAKER3 is an accurate, easy-to-use tool for eukaryotic genome annotation.
MgCod: Gene Prediction in Phage Genomes with Multiple Genetic Codes
Journal of Molecular Biology · 2023-05-25 · 11 citations
Article · Open access · Senior author · Corresponding
Massive sequencing of microbiomes has led to the discovery of a large number of phage genomes with intermittent stop codon recoding. We have developed a computational tool, MgCod, that identifies genomic regions (blocks) with distinct stop codon recoding simultaneously with the prediction of protein-coding regions. When MgCod was used to scan a large volume of human metagenomic contigs, hundreds of viral contigs with intermittent stop codon recoding were revealed. Many of these contigs originated from genomes of known crAssphages. Further analyses showed that intermittent recoding was associated with subtle patterns in the organization of protein-coding genes, such as 'single-coding' and 'dual-coding'. The dual-coding genes, clustered into blocks, could be translated by two alternative codes producing nearly identical proteins. It was observed that the dual-coded blocks were enriched with the early-stage phage genes, while the late-stage genes resided in the single-coded blocks. MgCod can identify types of stop codon recoding in novel genomic sequences in parallel with gene prediction. It is available for download from https://github.com/gatech-genemark/MgCod.
Faculty of 1000 Research Ltd · 2023-01-01
Article · Open access · 1st author · Corresponding
bioRxiv (Cold Spring Harbor Laboratory) · 2022-04-30 · 3 citations
Preprint · Open access
Background: Blackberries (Rubus spp.) are the fourth most economically important berry crop worldwide. Genome assemblies and annotations have been developed for Rubus species in subgenus Idaeobatus, including black raspberry (R. occidentalis), red raspberry (R. idaeus), and R. chingii, but very few genomic resources exist for blackberries and their relatives in subgenus Rubus.
Findings: Here we present a chromosome-length assembly and annotation of the diploid blackberry germplasm accession ‘Hillquist’ (R. argutus). ‘Hillquist’ is the only known source of primocane-fruiting (annual-fruiting) in tetraploid fresh-market blackberry breeding programs and is represented in the pedigree of many important cultivars worldwide. The ‘Hillquist’ assembly, generated using PacBio long reads scaffolded with Hi-C sequencing, consisted of 298 Mb, of which 270 Mb (90%) was placed on seven chromosome-length scaffolds with an average length of 38.6 Mb. Approximately 52.8% of the genome was composed of repetitive elements. The genome sequence was highly collinear with a novel maternal haplotype-resolved linkage map of the tetraploid blackberry selection A-2551TN and genome assemblies of R. chingii and red raspberry. A total of 38,503 protein-coding genes were predicted using the assembly and Iso-Seq and RNA-seq data, of which 72% were functionally annotated.
Conclusions: The utility of the ‘Hillquist’ genome has been demonstrated here by the development of the first genotyping-by-sequencing based linkage map of tetraploid blackberry and the identification of several possible candidate genes for primocane-fruiting within the previously mapped locus. This chromosome-length assembly will facilitate future studies in Rubus biology, genetics, and genomics and strengthen applied breeding programs.
MetaGeneMark-2: Improved Gene Prediction in Metagenomes
bioRxiv (Cold Spring Harbor Laboratory) · 2022-07-27 · 46 citations
Preprint · Open access · Senior author · Corresponding
Accurate prediction of protein-coding genes in metagenomic contigs presents a well-known challenge. It is particularly difficult to identify short and incomplete genes as well as positions of translation initiation sites. It is frequently assumed that initiation of translation in prokaryotes is controlled by a ribosome binding site (RBS), a sequence with the Shine-Dalgarno (SD) consensus situated in the 5’ UTR. However, ~30% of the 5,007 genomes representing the RefSeq collection of prokaryotic genomes have either non-SD RBS sequences or no RBS site due to physical absence of the 5’ UTR (the case of leaderless transcription). Predictions of the gene 3’ ends are much more accurate; still, errors could occur due to the use of an incorrect genetic code. Hence, an effective gene finding algorithm would identify the true genetic code in the process of sequence analysis. In this work, prediction of gene starts was improved by inferring the GC content dependent generating functions for RBS sequences as well as for promoter sequences involved in leaderless transcription. An additional feature of the algorithm was the ability to identify an alternative genetic code defined by a reassignment of the TGA stop codon (the only stop codon reassignment type known in prokaryotes). It was demonstrated that MetaGeneMark-2 made more accurate gene predictions in metagenomic sequences than several existing state-of-the-art tools.
Recent grants
NIH · $5.5M · 2015
NIH · $81k · 2006
NIH · $100k · 1995
NIGMS Administrative Supplements to Support Undergraduate Summer Research
NIH · $1.1M · 2018–2023
Frequent coauthors
- Alexandre Lomsadze (Georgia Institute of Technology) · 102 shared
- Tomáš Brůna (Joint Genome Institute) · 28 shared
- Ivan Antonov · 24 shared
- Shiyuyun Tang · 21 shared
- Mario Stanke · 17 shared
- Paul Burns · 17 shared
- Svetlana Ekisheva · 16 shared
- Wenhan Zhu · 16 shared