Jure Leskovec

Associate Professor of Computer Science (Verified)

Stanford University · Biomedical Data Science

Active 1977–2026

h-index: 143
Citations: 121.8k
Papers: 684 (249 in the last 5 years)
Funding: $3.5M (1 active)
About

Jure Leskovec is a Professor of Computer Science at Stanford University. His general research area is applied machine learning for large interconnected systems, with a focus on modeling complex, richly-labeled relational structures, graphs, and networks across systems at all scales. These scales range from interactions of proteins within a cell to interactions between humans in society. His research applications include commonsense reasoning, recommender systems, computational social science, and computational biology, with a particular emphasis on drug discovery.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Data Mining
  • Machine Learning
  • Biology
  • Sociology
  • Engineering
  • Mathematics
  • Geography
  • Cell biology
  • Economics
  • Demography
  • Data science
  • Demographic economics
  • Political Science
  • Economic growth
  • Anatomy
  • Socioeconomics
  • Psychology
  • Econometrics
  • Internet privacy
  • Law
  • Operating system
  • Economic geography

Selected publications

  • TxConformal: Controlling False Discoveries in AI-Driven Therapeutic Discovery

    bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-30

    article

    Artificial Intelligence (AI) is transforming therapeutic discovery by scoring a large set of promising candidates and prioritizing a shortlist for further investigation. Quantifying the reliability of AI scores and preventing false positives among selected candidates is key to the efficiency of the discovery process. Conformal prediction (CP) has emerged as a popular tool for guiding such prioritization, especially via the conformal selection framework to control false discovery rates (FDR) in selecting top-ranked candidates under distributional shift [1, 2]. However, deploying these advances in real-world therapeutic discovery remains challenging: distribution shifts are difficult to quantify and correct in high-dimensional biomedical data, and practical workflows often require flexible error metrics. Here, we present TxConformal, a general framework for trustworthy decision making when building shortlists using AI scores. TxConformal adjusts for distribution shift by balancing the hidden representations in AI models and then provides confidence measures for true discoveries of target biological properties. These confidence measures, interpretable as p-values, can be used in conjunction with statistical multiple testing procedures to derive selection decisions with limited false positives or to estimate the errors in given selection decisions. TxConformal controls the false positive rate in six real-world tasks spanning various therapeutic discovery stages, modalities, and AI models with realistic data splits. When selecting promising combinatorial genetic perturbations, TxConformal nearly halves false-positive selections compared to baseline methods, substantially reducing unnecessary experimental costs by tens of thousands of dollars. When selecting stable protein structures under mutant shifts, TxConformal identifies about 10 times more proteins than baseline methods at stringent thresholds when running at a target FDR level of 10%, recovering over 90% of valuable candidates that baseline methods miss due to unaccounted distribution shifts. Furthermore, we demonstrate that TxConformal robustly supports various alternative error metrics suitable for resource-constrained settings. Finally, in a prospective fixed-budget virtual screening campaign for novel antibiotic discovery, TxConformal predicted false positives in close agreement with experimental outcomes, with substantial improvements over simple baselines.

  • Bio-BLIP: A Multimodal Architecture for Transferable Reasoning in Genomic Variant Interpretation

    bioRxiv (Cold Spring Harbor Laboratory) · 2026-05-15

    article · Open access · Senior author

    Developing scientific hypotheses in biology requires integrating heterogeneous evidence across DNA sequence, gene context, protein function, and prior literature. Existing multimodal AI systems expose biological evidence to reasoning models through textification or by projecting biological embeddings into fine-tuned language models. However, these models are typically highly optimized for the specific set of tasks for which they are fine-tuned. Here we present Bio-BLIP, a multimodal Q-former based architecture which leverages biological embeddings and an LLM to generalize to complex reasoning tasks without task-specific fine-tuning. The key to Bio-BLIP is a new neural network architecture that integrates four data modalities – DNA, genes, proteins, and text – through a master Q-former model, which integrates the modality-specific information into a fixed-length prefix for the LLM backbone. Bio-BLIP is pretrained on the task of human genetic variant annotation and achieves a 29.8% increase in generating accurate variant features over frontier LLMs. We evaluate Bio-BLIP zero-shot on downstream genomic tasks of variant prioritization and target gene prediction. Bio-BLIP outperforms two alignment-free genomic language models on regulatory variant prioritization for Mendelian disease. Across the target gene prediction task, Bio-BLIP improves accuracy over LLMs by leveraging learned genomic variant knowledge in difficult cases. Our model produces rich, transparent reasoning traces. In biological domains characterized by multiple scales of data and varied downstream tasks, Bio-BLIP offers a step toward natively multimodal, generalizable reasoning.

  • Are Current AI Virtual Cell Models Useful for Scientific Discovery?

    bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-25

    article · Senior author

    AI models are increasingly developed to predict the effect of perturbations on gene expression, but current benchmarks fail to reliably measure model performance. Here, we argue that new benchmarks that directly measure the value of model predictions for specific scientific discovery outcomes are needed to address this gap. We present PerturbHD, an evaluation framework for AI-enabled hit discovery, to demonstrate the benefits of our proposed approach.

  • Data for Universal Cell Embeddings: A Foundation Model for Cell Biology

    Zenodo (CERN European Organization for Nuclear Research) · 2026-04-01

    dataset · Open access · Senior author

    Data for replicating the Universal Cell Embeddings paper code. See preprint here: https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1

  • Proteo-R1: Reasoning Foundation Models for De Novo Protein Design

    arXiv (Cornell University) · 2026-05-01

    preprint · Open access

    Deep learning in de novo protein design has achieved atomic-level fidelity. However, existing models remain largely non-deliberative: they directly synthesize molecular geometries without explicitly reasoning about which residues or interactions are functionally essential. As a result, design decisions are entangled with continuous sampling dynamics, limiting interpretability, controllability, and systematic reuse of biochemical knowledge. We introduce Proteo-R1, a reasoning-guided protein design framework that explicitly decouples molecular understanding from geometric generation. Proteo-R1 adopts a dual-expert architecture in which a multimodal large language model (MLLM) serves as an understanding expert, analyzing protein sequences, structures, and textual context to identify key functional residues that govern binding and specificity. These residue-level decisions are then passed as hard constraints to a separate diffusion-based generation expert, which performs conditional co-design while respecting the fixed interaction anchors. This factorization mirrors how human experts approach molecular engineering: first, reasoning about critical interactions, then optimizing geometry subject to those constraints. By operationalizing reasoning as explicit residue-level commitments rather than latent textual guidance, Proteo-R1 achieves stable, interpretable, and modular integration of LLM reasoning with state-of-the-art geometric generative models. Code, data, and demos are available at https://smiles724.github.io/r1/.

  • LLMs Generate Structurally Realistic Social Networks but Overestimate Political Homophily

    Proceedings of the International AAAI Conference on Web and Social Media · 2025-06-07 · 8 citations

    article · Open access · Senior author

    Generating social networks is essential for many applications, such as epidemic modeling and social simulations. The emergence of generative AI, especially large language models (LLMs), offers new possibilities for social network generation: LLMs can generate networks without additional training or the need to define network parameters, and users can flexibly define individuals in the network using natural language. However, this potential raises two critical questions: 1) are the social networks generated by LLMs realistic, and 2) what are the risks of bias, given the importance of demographics in forming social ties? To answer these questions, we develop three prompting methods for network generation and compare the generated networks to a suite of real social networks. We find that more realistic networks are generated with “local” methods, where the LLM constructs relations for one persona at a time, compared to “global” methods that construct the entire network at once. We also find that the generated networks match real networks on many characteristics, including density, clustering, connectivity, and degree distribution. However, we find that LLMs emphasize political homophily over all other types of homophily and significantly overestimate political homophily compared to real social networks.

  • Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

    ArXiv.org · 2025-07-03

    preprint · Open access · Senior author

    Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.

  • Surface-based Molecular Design with Multi-modal Flow Matching

    2025-08-03 · 1 citation

    article · Open access

    Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full-atom peptide co-design for specific protein receptors. However, the critical role of molecular surfaces in protein-protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni-design peptides generation paradigm, called SurfFlow, a novel surface-based generative algorithm that enables comprehensive co-design of sequence, structure, and surface for peptides. SurfFlow employs a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.

Education

  • Postdoc, Computer Science Department

    Cornell University

    2009

  • PhD, Machine Learning Department

    Carnegie Mellon University

    2008