
Marinka Zitnik
Verified · Harvard University · Biomedical Informatics
Active 2012–2026
About
Marinka Zitnik is an Associate Professor in the Department of Biomedical Informatics at Harvard Medical School and an Associate Faculty at the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. She is also an Associate Member at the Broad Institute of MIT and Harvard and an Affiliated Faculty at the Harvard Data Science Initiative. Her research investigates the foundations of artificial intelligence to enhance scientific discovery and to realize individualized diagnosis and treatment. Her lab aims to lay the foundations for AI that contribute to the scientific understanding of therapeutic design and genomic medicine or acquire such understanding autonomously. Her work focuses on using AI to describe the state of a person with increasing precision by incorporating modalities such as genetic code, cellular atlases, molecular datasets, and therapeutics. The challenge her research addresses is how to reason over these data to develop powerful disease diagnostics and empower new kinds of therapies. Her lab creates new avenues for fusing knowledge and patient data to give the right patient the right treatment at the right time, ensuring medicinal effects are consistent across individuals and laboratory results. Additionally, her research seeks to change the traditional method of scientific discovery by using AI to disentangle the complexity of interconnected biological systems, advancing drug design and developing new therapies. Dr. Zitnik has founded Therapeutics Data Commons and leads the International AI4Science initiative, organizing national symposia on drug repurposing and applying AI to medicine.
Research topics
- Data Mining
- Computer Science
- Artificial Intelligence
- Machine Learning
- Biology
- Cognitive science
- Bioinformatics
- Programming language
- Data science
Selected publications
PhylaFlow: Hybrid Flow Matching in Billera-Holmes-Vogtmann Tree Space for Phylogenetic Inference
arXiv (Cornell University) · 2026-05-21
preprint · Open access
Phylogenetic trees are hybrid objects: branch lengths vary continuously, while topologies change discretely through edge contractions and expansions. Billera-Holmes-Vogtmann (BHV) tree space provides a canonical geometry for this structure, representing each resolved topology as a Euclidean orthant and topological changes as motion across shared lower-dimensional boundaries. We introduce PhylaFlow, a hybrid flow-matching model that learns posterior-basin transport in BHV tree space. PhylaFlow is trained on BHV geodesic paths from random starting trees to short-run posterior samples, coupling continuous branch-length motion within orthants with learned boundary events and discrete topology transitions. We evaluate the learned geometry operationally: if the flow reaches posterior-relevant regions, finite-budget Bayesian refinement initialized from, or guided by, its terminal trees should recover posterior-supported topologies more efficiently. Across DS1-DS8 phylogenetic posterior benchmarks, PhylaFlow substantially reduces initial Tree-KL relative to classical initializers. After finite-budget MrBayes refinement, direct PhylaFlow improves early and intermediate topology-recovery trajectories on most datasets, while split-guided PhylaFlow-MCMC obtains the strongest hard-case results. The best PhylaFlow variant outperforms short-warmup on seven of eight datasets and PhyloGFN on five of eight under the same refinement budget. In a joint sequence-conditioned experiment, sequence embeddings steer posterior split recovery, although exact posterior topology recovery remains preliminary. These results show that hybrid flow matching can learn actionable transport in BHV tree space and provide a geometry-aware proposal mechanism for Bayesian phylogenetic inference.
Generative Artificial Intelligence for Biology: Toward Unifying Models, Algorithms, and Modalities
ChemRxiv · 2026-04-06
article
Rapid advances in generative artificial intelligence have revolutionized biological modeling across domains such as protein, genetics, and single-cell. However, existing works often organize applications by molecule types or specific research tasks, overlooking the methodological convergence and cross-modal innovations. This paper aims to present a unified methodological perspective that highlights the fundamental technical commonalities across biological modalities. We systematically organize recent advances in generative modeling for biology through the lens of core machine learning paradigms, from language models (LMs) and diffusion models to their emerging hybrid architectures. Our work reveals how techniques initially developed for one molecular type (e.g., protein design) can be effectively transferred to others (e.g., RNA engineering), and identifies the convergence trend where discrete diffusion models and iterative language models represent different facets of a unified generative framework. We cover the evolution from domain-specific models to multi-modal biological foundation models and agent-based systems. By emphasizing methodological connections rather than applications, this paper aims to accelerate cross-domain innovation and make the field more accessible to the broader machine learning community. We conclude by identifying promising research directions where successful techniques in one biological domain remain unexplored in others, offering a roadmap for future advances in generative biology.
Phenotypic prediction of missense variants via deep contrastive learning
Nature Biomedical Engineering · 2026-04-14
article
Quantum-machine-assisted drug discovery
npj Drug Discovery · 2026-01-07 · 5 citations
preprint · Open access
Drug discovery is lengthy and expensive, with traditional computer-aided design facing limits. This paper examines integrating quantum computing across the drug development cycle to accelerate and enhance workflows and rigorous decision-making. It highlights quantum approaches for molecular simulation, drug-target interaction prediction, and optimizing clinical trials. Leveraging quantum capabilities could accelerate timelines and costs for bringing therapies to market, improving efficiency and ultimately benefiting public health.
Toward AI-Powered Cancer Etiology Research
Cancer Discovery · 2026-04-13
article
Advances in multimodal longitudinal data and artificial intelligence (AI) create new opportunities for cancer etiology research. We envision an AI-powered discovery workflow integrating an interoperable epidemiologic data ecosystem and causal inference frameworks to accelerate the identification of both cancer causes and the converging biological states for prevention.
Multimodal AI predicts clinical outcomes of drug combinations from preclinical data.
PubMed · 2025-09-24 · 2 citations
preprint · Open access · Senior author
Predicting clinical outcomes from preclinical data is essential for identifying safe and effective drug combinations, reducing late-stage clinical failures, and accelerating the development of precision therapies. Current AI models rely on structural or target-based features but fail to incorporate the multimodal data necessary for accurate, clinically relevant predictions. Here, we introduce Madrigal, a multimodal AI model that learns from structural, pathway, cell viability, and transcriptomic data to predict drug-combination effects across 953 clinical outcomes and 21,842 compounds, including combinations of approved drugs and novel compounds in development. Madrigal uses an attention bottleneck module to unify preclinical drug data modalities while handling missing data during training and inference, a major challenge in multimodal learning. It outperforms single-modality methods and state-of-the-art models in predicting adverse drug interactions, and ablations show both modality alignment and multimodality are necessary. It captures transporter-mediated interactions and aligns with head-to-head clinical trial differences for neutropenia, anemia, alopecia, and hypoglycemia. In type 2 diabetes and MASH, Madrigal supports polypharmacy decisions and prioritizes resmetirom among safer candidates. Extending to personalization, Madrigal improves patient-level adverse-event prediction in a longitudinal EHR cohort and an independent oncology cohort, and predicts ex vivo efficacy in primary acute myeloid leukemia samples and patient-derived xenograft models. Madrigal links preclinical multimodal readouts to safety risks of drug combinations and offers a generalizable foundation for safer combination design.
CONCERT predicts niche-aware perturbation responses in spatial transcriptomics
bioRxiv (Cold Spring Harbor Laboratory) · 2025-11-10 · 1 citation
preprint · Open access · Senior author · Corresponding
Spatial perturbation transcriptomics measures how genetic or chemical edits alter gene expression while preserving tissue context. Perturbation outcomes depend on a cell's intrinsic state and also on how effects propagate across cellular microenvironments. We present CONCERT, a niche-aware generative model that embeds perturbation context and learns spatial kernels with a Gaussian process variational autoencoder to predict perturbation effects across tissue. We formalize three tasks: patch, border, and niche, predicting responses in nearby unperturbed regions, at tissue interfaces, and as a function of surrounding microenvironments. We evaluate CONCERT on Perturb-map lung datasets. CONCERT outperforms state-of-the-art models (dissociated counterfactuals, spatialized perturbation models, and kNN), reducing E-distance by up to 33.77% (patch), 26.05% (border), and 33.74% (niche) versus the next best, with mean absolute error down by up to 23.28% and Pearson correlation up by up to 9.10%. Two case studies go beyond benchmarking. In dextran sodium sulfate-induced colitis, CONCERT reconstructs spatial gene expression at unmeasured time points, produces longitudinal comparisons across unpaired mice, resolves inter-mouse heterogeneity, and recovers consistent temporal declines of inflammation-associated genes across regions. In ischemic stroke, CONCERT predicts responses under variable lesion sizes and in a 3D formulation across brain sections, capturing lesion-core and peri-lesion patterns. CONCERT performs niche-aware counterfactual prediction, reconstructs missing spatial data, and models perturbation responses across tissues.
Geranium: Multimodal Retrieval of Genomics Data Visualizations
IEEE Transactions on Visualization and Computer Graphics · 2025-09-21
article · Open access
Effective visualization is essential for interpreting genomics data, yet researchers often face challenges in finding relevant, reusable examples. Existing tools offer limited support for searching the vast landscape of genomics visualizations, making the process of authoring new visualizations time-consuming and inefficient. To address this gap, we introduce Geranium, a data visualization retrieval system for searching and authoring genomics visualizations. Geranium supports multimodal retrieval, enabling users to query with images, text, or grammar-based specifications. Retrieved examples serve as scaffolds for authoring, providing templates that researchers can adapt with their own data, thereby streamlining the mechanics of visualization construction. Geranium integrates three embedding methods to combine specialized and general knowledge: grammar-based embeddings tailored to genomics visualizations, multimodal embeddings from a biomedical vision-language foundation model, and text embeddings from a fine-tuned large language model. For each visualization, we construct a multimodal representation that includes a Gosling specification, a pixel-based rendering, and natural language descriptions. We evaluate embedding strategies to maximize top-k retrieval accuracy and conduct user studies with domain collaborators to gather feedback on usability. Our collection comprises 3,200 visualizations across 50 categories, ranging from single-view to coordinated multi-view designs and supporting applications from single-cell epigenomics to structural variation analysis.
Recent grants
RAPID: Collaborative Research: Computational Drug Repurposing for COVID-19
NSF · $100k · 2020–2021
Workshop on Drug Repurposing for Future Pandemics
NSF · $30k · 2020
Frequent coauthors
- Jure Leskovec · Stanford University · 64 shared
- Blaž Zupan · Baylor College of Medicine · 59 shared
- Xiang Zhang · Air Force Medical University · 52 shared
- Kexin Huang · 41 shared
- Xiang Zhang · 28 shared
- William C. Hahn · Dana-Farber Cancer Institute · 22 shared
- Theodoros Tsiligkaridis · 21 shared
- Payal Chandak · Harvard–MIT Division of Health Sciences and Technology · 21 shared
Labs
Zitnik Lab · PI