Martin Jinye Zhang

Assistant Professor

Carnegie Mellon University · Ray and Stephanie Lane Computational Biology Department

Active 2016–2026

h-index: 21

Citations: 4.5k

Papers: 79 (58 in the last 5 years)

Funding: not listed



About

Martin Jinye Zhang is an Assistant Professor in the Ray and Stephanie Lane Computational Biology Department, part of the School of Computer Science at Carnegie Mellon University (5000 Forbes Avenue, Pittsburgh, PA). He conducts research and teaches in computational biology.

Research topics

Biology
Genetics
Computational biology
Evolutionary biology
Cell biology

Selected publications

SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
arXiv (Cornell University) · 2026-04-05
preprint · Open access
Modern scientific ecosystems are rich in procedural knowledge across repositories, APIs, scripts, notebooks, documentation, databases, and papers, yet much of this knowledge remains fragmented across heterogeneous artifacts that agents cannot readily operationalize. This gap between abundant scientific know-how and usable agent capabilities is a key bottleneck for building effective scientific agents. We present SkillFoundry, a self-evolving framework that converts such resources into validated agent skills, reusable packages that encode task scope, inputs and outputs, execution steps, environment assumptions, provenance, and tests. SkillFoundry organizes a target domain as a domain knowledge tree, mines resources from high-value branches, extracts operational contracts, compiles them into executable skill packages, and then iteratively expands, repairs, merges, or prunes the resulting library through a closed-loop validation process. SkillFoundry produces a substantially novel and internally valid skill library, with 71.1% of mined skills differing from existing skill libraries such as SkillHub and SkillSMP. We demonstrate that these mined skills improve coding agent performance on five of the six MoSciBench datasets. We further show that SkillFoundry can design new task-specific skills on demand for concrete scientific objectives, and that the resulting skills substantially improve performance on two challenging genomics tasks: cell type annotation and the scDRS workflow. Together, these results show that automatically mined skills improve agent performance on benchmarks and domain-specific tasks, expand coverage beyond hand-crafted skill libraries, and provide a practical foundation for more capable scientific agents.
MultiSuSiE improves multi-ancestry fine-mapping in All of Us whole-genome sequencing data
Nature Genetics · 2026-01-01 · 1 citation
article
martinjzhang/LDSPEC: ldspec paper
Open MIND · 2026-04-26
other · Open access · 1st author · Corresponding author
code corresponding to the publication
martinjzhang/LDSPEC: ldspec paper
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-26
other · Open access · 1st author · Corresponding author
code corresponding to the publication
TusoAI: Agentic Optimization for Scientific Methods
ArXiv.org · 2025-09-28 · 1 citation
preprint · Open access · Senior author
Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI, an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks, such as single-cell RNA-seq data denoising and satellite-based earth monitoring. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered novel biology, including 9 new associations between autoimmune diseases and T cell subtypes and 7 previously unreported links between disease variants linked to their target genes. Our code is publicly available at https://github.com/Alistair-Turcan/TusoAI.
Genetic and Cellular Architecture of Breast Cancer Risk Across Ancestries
medRxiv · 2025-08-24
preprint · Open access
Background: Breast cancer genome-wide association studies (GWAS) have identified more than 200 susceptibility loci, but most studies are dominated by European and East Asian populations. Methods: We analyzed breast cancer GWAS summary statistics from African (AFR), East Asian (EAS), European (EUR), and Hispanic/Latina (H/L) samples (159,297 cases and 212,102 controls). We estimated logit-scale SNP-based heritability, polygenicity, and cross-ancestry genetic correlation, partitioned heritability across functional annotations, and integrated GWAS results with the Tabula Sapiens single-cell atlas using scDRS+. Results: The logit-scale heritability of breast cancer ranged from h² = 0.47 (SE = 0.07) in EAS to h² = 0.61 (SE = 0.10) in AFR, with no significant differences across ancestries (p = 0.63). The estimated number of susceptibility markers in a sparse normal-mixture effects model also varied from 4,446 (SE = 3,100) in EAS to 8,308 (SE = 2,751) in AFR, but differences were not significant across ancestries (p = 0.55). Cross-sample genetic correlations varied, with the strongest correlation between EUR and EAS (ρ = 0.79, SE = 0.08) and the weakest between AFR and H/L (ρ = 0.26, SE = 0.24). Regulatory annotations were enriched for breast cancer heritability across samples. Integration with single-cell expression profiles implicated ancestry-shared associations with innate immune, secretory epithelial, and stromal cell types. Conclusion: These results indicate substantial cross-ancestry sharing of breast cancer polygenic architecture, highlight a consistent contribution of regulatory variation, and identify convergent cellular contexts that motivate functional follow-up and inform expectations for the transferability and attainable performance of common-variant risk prediction across populations.
Principal Components for Practice‐Oriented Measurement of Running Technique: A Proof‐Of‐Concept Study
European Journal of Sport Science · 2025-06-27
article · Open access
This study aims to construct valid and practically applicable running technique measures using principal component analysis (PCA). We hypothesized that data-driven principal movements (PMs), derived from deliberately instructed opposite technique variations, would significantly distinguish these variations and could serve as quantitative measures of running technique as described by practitioners. Twenty experienced runners were instructed to vary 14 distinct running technique elements in two opposing directions (e.g., forward and backward lean for a technique element representing horizontal movements). Elements and their variations were selected based on visual descriptions from practitioners found in running literature. Kinematic data were collected on a treadmill using optical motion capture and analyzed using a PCA-based approach to determine running-specific technique measures per technique element. By combining trials with opposing technique variations, variance in the data was purposefully produced, which in turn caused the resultant principal movements to align with the intended technique element. For all of the 14 technique elements, a valid measure (in the sense that the inputted opposite variations were significantly distinguishable within this measure) could be constructed. The measures could further be applied to the habitual running technique of the group of tested runners. The results of this study demonstrate the construct validity and applicability of the presented approach to measure running technique. This method can provide runners and coaches with valuable feedback and will enable future studies to investigate running technique, quantified through practice-informed measures, in the context of performance, injury risk, or adaptations to equipment.
Dissecting the genetic complexity of myalgic encephalomyelitis/chronic fatigue syndrome via deep learning-powered genome analysis
medRxiv · 2025-04-16 · 7 citations
preprint · Open access
Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a complex, heterogeneous, and systemic disease defined by a suite of symptoms, including unexplained persistent fatigue, post-exertional malaise (PEM), cognitive impairment, myalgia, orthostatic intolerance, and unrefreshing sleep. The disease mechanism of ME/CFS is unknown, with no effective curative treatments. In this study, we present a multi-site ME/CFS whole-genome analysis, which is powered by a novel deep learning framework, HEAL2. We show that HEAL2 not only has predictive value for ME/CFS based on personal rare variants, but also links genetic risk to various ME/CFS-associated symptoms. Model interpretation of HEAL2 identifies 115 ME/CFS-risk genes that exhibit significant intolerance to loss-of-function (LoF) mutations. Transcriptome and network analyses highlight the functional importance of these genes across a wide range of tissues and cell types, including the central nervous system (CNS) and immune cells. Patient-derived multi-omics data implicate reduced expression of ME/CFS risk genes within ME/CFS patients, including in the plasma proteome, and the transcriptomes of B and T cells, especially cytotoxic CD4 T cells, supporting their disease relevance. Pan-phenotype analysis of ME/CFS genes further reveals the genetic correlation between ME/CFS and other complex diseases and traits, including depression and long COVID-19. Overall, HEAL2 provides a candidate genetic-based diagnostic tool for ME/CFS, and our findings contribute to a comprehensive understanding of the genetic, molecular, and cellular basis of ME/CFS, yielding novel insights into therapeutic targets. Our deep learning model also offers a potent, broadly applicable framework for parallel rare variant analysis and genetic prediction for other complex diseases and traits.
Fine-mapping causal tissues and genes at disease-associated loci
Nature Genetics · 2025-01-01 · 13 citations
article · Open access

Frequent coauthors

James Zou
Stanford University
58 shared
Alkes L. Price
Broad Institute
32 shared
Soumya Raychaudhuri
Brigham and Women's Hospital
27 shared
Angela Oliveira Pisco
Chan Zuckerberg Initiative (United States)
24 shared
Kangcheng Hou
University of California, Los Angeles
23 shared
Xilin Jiang
21 shared
Michael Inouye
University of Cambridge
18 shared
Saori Sakaue
Harvard University
17 shared

Education

Ph.D., Computational Biology
Carnegie Mellon University
M.S., Computational Biology
Carnegie Mellon University
