John Mellor-Crummey

Professor of Computer Science and of Electrical and Computer Engineering · Member, Ken Kennedy Institute

Rice University · Computer Science

Active 1987–2025

h-index: 48
Citations: 9.4k
Papers: 218 (19 in the last 5 years)

About

John Mellor-Crummey is a Professor of Computer Science and of Electrical and Computer Engineering at Rice University. His research focuses on software technology for high performance parallel computing, including tools for measurement and analysis of application performance, dynamic data race detection, and network performance analysis and optimization. He leads the development of the HPCToolkit Performance Tools, supported by the Exascale Computing Project. His past work includes the development of data parallel compilers, runtime systems for scalable parallel computing, scalable software synchronization algorithms for shared-memory multiprocessors, and techniques for execution replay of parallel programs. Recognized for his contributions to the field, he was awarded the Dijkstra Prize in Distributed Computing in 2006 and was named an ACM Fellow in 2013 for his contributions to parallel and high performance computing.

Selected publications

  • Analyzing the Performance of Applications at Exascale

    2025-06-08

    article · Senior author
  • Software Tools Ecosystem Project (STEP) Midyear Report CY2025

    2025-07-01

    report · Open access

    This document provides a technical project report for the first six months of 2025 for the Software Tools Ecosystem Project (STEP). The mission of STEP is to enable critical software tools to proactively adapt to emerging platform technologies (such as new accelerators, storage devices, network technologies, and smart devices) and emerging application use cases (such as advanced machine learning and workflow frameworks) so that they continue to meet the needs of scientific computing and provide a strong foundation for future Advanced Scientific Computing Research activities. Our challenges include the wide breadth of our stakeholders and rapidly evolving platform technology dependencies.

  • Matrix-Free Finite Volume Kernels on a Dataflow Architecture

    arXiv (Cornell University) · 2024-08-06

    preprint · Open access

    Fast and accurate numerical simulations are crucial for designing large-scale geological carbon storage projects ensuring safe long-term CO2 containment as a climate change mitigation strategy. These simulations involve solving numerous large and complex linear systems arising from the implicit Finite Volume (FV) discretization of PDEs governing subsurface fluid flow. Compounded with highly detailed geomodels, solving linear systems is computationally and memory expensive, and accounts for the majority of the simulation time. Modern memory hierarchies are insufficient to meet the latency and bandwidth needs of large-scale numerical simulations. Therefore, exploring algorithms that can leverage alternative and balanced paradigms, such as dataflow and in-memory computing is crucial. This work introduces a matrix-free algorithm to solve FV-based linear systems using a dataflow architecture to significantly minimize memory latency and bandwidth bottlenecks. Our implementation achieves two orders of magnitude speedup compared to a GPGPU-based reference implementation, and up to 1.2 PFlops on a single dataflow device.

  • Priority Sampling of Large Language Models for Compilers

    2024-04-19 · 6 citations

    article

    Large Language Models show great potential in generating and optimizing code. Widely used sampling methods such as Nucleus Sampling increase the diversity of generation but often produce repeated samples for low temperatures and incoherent samples for high temperatures. We present Priority Sampling, a simple and deterministic sampling technique that produces unique samples ordered by the model's confidence. Additionally, Priority Sampling supports a controllable and structured exploration process using regular-expression-based generation. Priority Sampling outperforms Nucleus Sampling for any number of samples, boosting the performance of the original model from 2.87% to 5% improvement over -Oz. Moreover, it outperforms the autotuner used to generate training labels of the original model in just 30 samples.

  • Automated Code Generation of High-Order Stencils for a Dataflow Architecture

    2024-11-17 · 7 citations

    article

    Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, and computational fluid dynamics. Recently, multiple research groups have begun exploring the use of dataflow architectures, such as Cerebras’ wafer-scale engine, to accelerate stencil computations. However, implementations of stencil computations for dataflow architectures must address unique challenges, such as managing the routing of data communications and accommodating a significantly constrained memory footprint. These make hand-crafting code for a dataflow architecture difficult and time-consuming. This paper describes a framework for developing portable, high-performance implementations of stencil computations for modern node architectures. The paper focuses on code generation strategies for the Cerebras wafer-scale engine, including code generation of router configurations and sequencing of communication for high-order stencils. A 25-point star-shaped stencil written using our tool is 7× shorter than handcrafted code written in Cerebras Software Language (CSL), and it delivers comparable performance to manually written code.

  • Refining HPCToolkit for application performance analysis at exascale

    The International Journal of High Performance Computing Applications · 2024-08-30 · 5 citations

    article · Open access · Senior author

    As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently collect performance measurements of GPU-accelerated applications, HPCToolkit employs novel non-blocking data structures to communicate performance measurements between tool threads and application threads. To attribute performance information in detail to source lines, loop nests, and inlined call chains, HPCToolkit performs parallel analysis of large CPU and GPU binaries involved in the execution of an exascale application to rapidly recover mappings between machine instructions and source code. To analyze terabytes of performance measurements gathered during executions at exascale, HPCToolkit employs distributed-memory parallelism, multithreading, sparse data structures, and out-of-core streaming analysis algorithms. To support interactive exploration of profiles up to terabytes in size, HPCToolkit’s hpcviewer graphical user interface uses out-of-core methods to visualize performance data. The result of these efforts is that HPCToolkit now supports collection, analysis, and presentation of profiles and traces of GPU-accelerated applications at exascale. These improvements have enabled HPCToolkit to efficiently measure, analyze and explore terabytes of performance data for executions using as many as 64K MPI ranks and 64K GPU tiles on ORNL’s Frontier supercomputer. HPCToolkit’s support for measurement and analysis of GPU-accelerated applications has been employed to study a collection of open-science applications developed as part of ECP. This paper reports on these experiences, which provided insight into opportunities for tuning applications, strengths and weaknesses of HPCToolkit itself, as well as unexpected behaviors in executions at exascale.

  • Matrix-Free Finite Volume Kernels on a Dataflow Architecture

    2024-11-17 · 1 citation

    article

    Fast and accurate numerical simulations are crucial for designing large-scale geological carbon storage projects ensuring safe long-term CO2 containment as a climate change mitigation strategy. These simulations involve solving numerous large and complex linear systems arising from the implicit Finite Volume (FV) discretization of PDEs governing subsurface fluid flow. Compounded with highly detailed geomodels, solving linear systems is computationally and memory expensive, and accounts for the majority of the simulation time. Modern memory hierarchies are insufficient to meet the latency and bandwidth needs of large-scale numerical simulations. Therefore, exploring algorithms that can leverage alternative and balanced paradigms such as dataflow and in-memory computing is crucial. This work introduces a matrix-free algorithm to solve FV-based linear systems using a dataflow architecture to significantly minimize memory latency and bandwidth bottlenecks. Our implementation achieves two orders of magnitude speedup compared to a GPGPU-based reference implementation, and up to 1.2 PFlops on a single dataflow device.

  • LoopTune: Optimizing Tensor Computations with Reinforcement Learning

    arXiv (Cornell University) · 2023-09-04

    preprint · Open access

    Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times, and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code on the order of seconds.

  • A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

    arXiv (Cornell University) · 2023-09-09

    preprint · Open access

    Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one architecture is challenging, and porting to other architectures without sacrificing performance requires significant effort, especially in this golden age of many distinctive architectures. To help developers achieve performance, portability, and productivity with stencil computations, we developed StencilPy. With StencilPy, developers write stencil computations in a high-level domain-specific language, which promotes productivity, while its backends generate efficient code for existing and emerging architectures, including modern many-core CPUs (such as AMD Genoa-X, Fujitsu A64FX, and Intel Sapphire Rapids), latest generations of GPUs (including NVIDIA H100 and A100, AMD MI200, and Intel Ponte Vecchio), and accelerators (including Cerebras and STX). StencilPy demonstrates promising performance results on par with hand-written code, maintains cross-architectural performance portability, and enhances productivity. Its modular design enables easy configuration, customization, and extension. A 25-point star-shaped stencil written in StencilPy is one-quarter of the length of a hand-crafted CUDA code and achieves similar performance on an NVIDIA H100 GPU. In addition, the same kernel written using our tool is 7x shorter than hand-optimized code written in Cerebras Software Language (CSL), and it delivers comparable performance to that code on a Cerebras CS-2.

  • ValueExpert: exploring value patterns in GPU-accelerated applications

    2022-02-22 · 17 citations

    article

    General-purpose GPUs have become common in modern computing systems to accelerate applications in many domains, including machine learning, high-performance computing, and autonomous driving. However, inefficiencies abound in GPU-accelerated applications, which prevent them from obtaining bare-metal performance. Performance tools play an important role in understanding performance inefficiencies in complex code bases. Many GPU performance tools pinpoint time-consuming code and provide high-level performance insights but overlook one important performance issue: value-related inefficiencies, which exist in many GPU code bases. In this paper, we present ValueExpert, a novel tool to pinpoint value-related inefficiencies in GPU applications.

Awards & honors

  • Dijkstra Prize in Distributed Computing (2006)
  • ACM Fellow (2013)