
Research topics
- Computational biology
- Computer Science
- Political Science
- Biology
- Genetics
- Physics
- Environmental health
- Statistics
- Immunology
- Mathematics
- Nursing
- Quantum mechanics
- Theoretical computer science
- Biological system
- Medicine
- Statistical physics
- Intensive care medicine
- Evolutionary biology
Selected publications
Whither the protein landscape?
Structural Dynamics · 2025-01-01
editorial · Open access · 1st author · Corresponding
bioRxiv (Cold Spring Harbor Laboratory) · 2025-04-09 · 1 citation
preprint · Open access · Senior author · Corresponding
The Last Universal Common Ancestor (LUCA) lived about 4.2 billion years ago and had a nearly modern metabolism. Genes and proteins must therefore have achieved modern lengths during a time comparable to the error in that date. How did that happen? We show here that E. coli transformed with a double mutant full-length Leucyl-tRNA synthetase (LeuRS) produces discrete sets of shorter genes. These lack the anticodon-binding domain and large parts of the catalytic domain. They connect remote active site parts in two different ways. Large pre-steady-state bursts confirm that they are the active enzymes. These in vivo results validate earlier designs for ancient aminoacyl-tRNA synthetase enzymes, greatly expanding the sequence space of active synthetases. One construct joins the 56-residue protozyme directly to a 25-residue segment containing the second catalytic signature. It catalyzes both activation and minihelix acylation with amino acids. AlphaFold3 predicts that the mRNA sequence encoding the latter fragment is a long hairpin. Thus, 3D structures in the gene itself may promote the deletions. The deletions appear to reverse the modular evolution of full-length synthetases from simpler catalysts. The reverse evolution we describe could open broad access to primordial gene discovery.
Evolution is coupled with branching across many granularities of life
Proceedings of the Royal Society B Biological Sciences · 2025-05-01 · 8 citations
article · Open access
Across many scales of life, the rate of evolutionary change is often accelerated at the time when one lineage splits into two. The emergence of novel protein function can be facilitated by gene duplication (neofunctionalization); rapid morphological change is often accompanied by speciation (punctuated equilibrium); and the establishment of cultural identity is frequently driven by sociopolitical division (schismogenesis). In each case, the changes resist re-homogenization, promoting assortment into distinct lineages that are susceptible to different selective pressures, leading to rapid divergence. The traditional gradualistic view of evolution struggles to detect this phenomenon. We propose a probabilistic framework that constructs phylogenies, tests for saltative branching and improves divergence time estimation by estimating the independent contributions of gradual and abrupt change on each lineage. We provide evidence of saltative branching for proteins (aminoacyl transfer RNA (tRNA) synthetases), animal morphologies (cephalopods) and human languages (Indo-European). These three cases provide unique insights: for aminoacyl-tRNA synthetases, the trees are substantially different from those obtained under gradualist models; we estimate that 99% of cephalopod morphological changes coincided with speciation events; and Indo-European dispersal is estimated to have started around 6000 BCE, corroborating the recently proposed hybrid explanation. Our open-source code is available under a General Public License.
Genome Biology and Evolution · 2025-06-06 · 3 citations
article · Open access · 1st author · Corresponding
Translation of symbols in one chemical language into another defined genetics. Yet, the co-linearity of codons and amino acids is so commonplace an idea that few even ask how it arose. Readout is done by two distinct sets of proteins, called aminoacyl-tRNA synthetases. Aminoacyl-tRNA synthetases must enforce the rules first used to assemble themselves. To understand the roots of translation, we must experimentally test the structural codes that the earliest aminoacyl-tRNA synthetases used to recognize both amino acid and RNA substrates. We present here new results on five different facets of that problem. (i) The surfaces of structures coded by opposite strands of the same gene have opposite polarities. Core residues in proteins from one strand are surface residues in proteins from the other strand. The complementarity of base pairing thus projects into the proteome. That leads in turn to contrasting amino acid and RNA substrate binding modes. (ii) Escherichia coli reproduces in vivo a nested hierarchy of active excerpts, or "urzymes," similar to those we had designed as models for ancestral aminoacyl-tRNA synthetases. (iii) A third novel deletion produced in vivo and a new Class II urzyme suggest how to design bidirectional urzyme genes. (iv) Codon middle base pairing provides a basis to constrain Class I and II aminoacyl-tRNA synthetase family trees. (v) Aminoacyl-tRNA synthetase urzymes acylate class-specific subsets of an RNA library, showing urzyme RNA substrate specificity for the first time. Four new tree-building tools augment these results to compose a viable platform for experimental study of the origins of genetic coding.
bioRxiv (Cold Spring Harbor Laboratory) · 2025-02-21 · 1 citation
preprint · Open access
All known living systems make proteins from the same twenty canonically-coded amino acids, but this was not always the case. Early genetic coding systems likely operated with a restricted pool of amino acid types and limited means to distinguish between them. Despite this, amino acid substitution models like LG and WAG all assume a constant coding alphabet over time. That makes them especially inappropriate for the aminoacyl-tRNA synthetases (aaRS), the enzymes that govern translation. To address this limitation, we created a class of substitution models that account for evolutionary changes in the coding alphabet size by defining the transition from nineteen states in a past epoch to twenty now. We use a Bayesian phylogenetic framework to improve phylogeny estimation and testing of this two-alphabet hypothesis. The hypothesis was strongly rejected by datasets composed exclusively of “young” eukaryotic proteins. It was generally supported by “old” (aaRS and non-aaRS) proteins whose origins date from before the last universal common ancestor. Standard methods overestimate the divergence ages of proteins that originated under reduced coding alphabets in both simulated and aaRS alignments. The new model reduces this bias substantially. Our findings support the late incorporation of tryptophan into the genetic code (relative to tyrosine) and suggest that isoleucine and valine were once coded interchangeably, forming protein quasispecies. This work provides a robust, seamless framework for reconstructing phylogenies from ancient protein datasets and offers further insights into the dawn of molecular biology.
Aminoacyl-tRNA synthetase urzymes optimized by deep learning behave as a quasispecies
Structural Dynamics · 2025-03-01 · 2 citations
article · Open access · Senior author
Protein design plays a key role in our efforts to work out how genetic coding began. That effort entails urzymes. Urzymes are small, conserved excerpts from full-length aminoacyl-tRNA synthetases that remain active. Urzymes require design to connect disjoint pieces and repair naked nonpolar patches created by removing large domains. Rosetta allowed us to create the first urzymes, but those urzymes were only sparingly soluble. We could measure activity, but it was hard to concentrate those samples to levels required for structural biology. Here, we used the deep learning algorithms ProteinMPNN and AlphaFold2 to redesign a set of optimized LeuAC urzymes derived from leucyl-tRNA synthetase. We select a balanced, representative subset of eight variants for testing using principal component analysis. Most tested variants are much more soluble than the original LeuAC. They also span a range of catalytic proficiency and amino acid specificity. The data enable detailed statistical analyses of the sources of both solubility and specificity. In that way, we show how to begin to unwrap the elements of protein chemistry that were hidden within the neural networks. Deep learning networks have thus helped us surmount several vexing obstacles to further investigations into the nature of ancestral proteins. Finally, we discuss how the eight variants might resemble a sample drawn from a population similar to one subject to natural selection.
Molecular Biology and Evolution · 2025-08-08 · 2 citations
article · Open access
All known living systems make proteins from the same 20 canonically coded amino acids, but this was not always the case. Early genetic coding systems likely operated with a restricted pool of amino acid types and limited means to distinguish between them. Despite this, amino acid substitution models like LG and WAG all assume a constant coding alphabet over time. That makes them especially inappropriate for the aminoacyl-tRNA synthetases (aaRS), the enzymes that govern translation. To address this limitation, we created a class of substitution models that account for evolutionary changes in the coding alphabet size by defining the transition from 19 states in a past epoch to 20 now. We use a Bayesian phylogenetic framework to improve phylogeny estimation and testing of this two-alphabet hypothesis. The hypothesis was strongly rejected by datasets composed exclusively of "young" eukaryotic proteins. It was generally supported by "old" (aaRS and non-aaRS) proteins whose origins date from before the last universal common ancestor. Standard methods overestimate the divergence ages of proteins that originated under reduced coding alphabets in both simulated and aaRS alignments. The new model provides a timeline slightly more consistent with the Earth's history. Our findings suggest that aaRS functional bifurcation events can explain much of the genetic code's evolution, but there remain other unknown forces at play too. This work provides a robust, seamless framework for reconstructing phylogenies from ancient protein datasets and offers further insights into the dawn of molecular biology.
Artificial intelligence in structural biology: Preface
Structural Dynamics · 2025-11-01
article · Open access · Senior author
HetMM: A Michaelis-Menten model for non-homogeneous enzyme mixtures
iScience · 2024-01-21 · 11 citations
article · Open access
The Michaelis-Menten model requires its reaction velocities to come from a preparation of homogeneous enzymes, with identical or near-identical catalytic activities. However, this condition is not always met. We introduce a kinetic model that relaxes this requirement, by assuming there are an unknown number of enzyme species drawn from a probability distribution whose standard deviation is estimated. Through simulation studies, we demonstrate the method accurately discriminates between homogeneous and heterogeneous data, even with moderate levels of experimental error. We applied this model to three homogeneous and three heterogeneous biological systems, showing that the standard model performs better on the former and the heterogeneous model on the latter. Lastly, we show that heterogeneity is not readily distinguished from negatively cooperative binding under the Hill model. These two distinct attributes (inequality in catalytic ability and interference between binding sites) yield similar Michaelis-Menten curves that are not readily resolved without further experimentation. Our user-friendly software package allows homogeneity testing and parameter estimation.
bioRxiv (Cold Spring Harbor Laboratory) · 2024-05-15 · 6 citations
preprint · Open access
The aminoacyl-tRNA synthetases (aaRS) are a large group of enzymes that implement the genetic code in all known biological systems. They attach amino acids to their cognate tRNAs, moonlight in various non-translational activities, and are linked to many genetic disorders. The aaRS have a subtle ontology characterized by structural and functional idiosyncrasies that vary from organism to organism, and protein to protein. Across the tree of life, the twenty-two coded amino acids are handled by sixteen evolutionary Families of Class I aaRS and twenty-one Families of Class II aaRS. We introduce AARS Online, an interactive Wikipedia-like tool curated by an international consortium of field experts. This platform systematizes existing knowledge about the aaRS by showcasing a taxonomically diverse selection of aaRS sequences and structures. Through its graphical user interface, AARS Online facilitates a seamless exploration between protein sequence and structure, providing a friendly introduction to the material for non-experts and a useful resource for experts. Curated multiple sequence alignments can be extracted for downstream analyses. Accessible at www.aars.online, AARS Online is a free resource to delve into the world of the aaRS.
Recent grants
NIH · $430k · 1991
Storage and Recovery of ATP binding energy in Metal-Catalyzed Phosphoryl-Transfer
NIH · $1.4M · 2010–2015
NIH · $2.7M · 2006
Sense/Antisense Genetic Coding and the Origins of Translation
NIH · $3.9M · 2006–2020
Frequent coauthors
- 62 shared
Susan F. Leitman
National Institutes of Health Clinical Center
- 43 shared
Daniel H. Fowler
National Cancer Institute
- 42 shared
Ronald E. Gress
National Cancer Institute
- 38 shared
Richard F. Little
National Institutes of Health
- 37 shared
Elizabeth M. Kang
National Institute of Allergy and Infectious Diseases
- 37 shared
Richard A. Morgan
- 37 shared
John F. Tisdale
National Heart Lung and Blood Institute
- 37 shared
Moniek de Witte
University Medical Center Utrecht
Education
- 1972
PhD, Biology
University of California San Diego
- 1968
MS, Chemistry
University of California San Diego
- 1967
BA, Molecular Biophysics and Biochemistry
Yale University