About
Luis Ceze is a Professor at the Paul G. Allen School of Computer Science and Engineering at the University of Washington. He leads three research groups focusing on hardware/software systems (Sampa), machine learning systems and architecture (SAMPL), and the use of DNA for information technology applications (MISL). In addition to his academic roles, he is the co-founder and CEO of OctoML, where he leads a team working on machine learning systems. His research lies at the intersection of computer architecture, programming languages, machine learning, and biology, with the goal of exploring new and improved methods for building computing systems. Luis Ceze received his PhD in Computer Science from the University of Illinois at Urbana-Champaign and holds a BEng and MEng in Electrical Engineering from the University of São Paulo, Brazil. Born in São Paulo, Brazil, he balances his professional work with personal interests such as cooking and spending time with his family.
Research topics
- Computer Science
- Biology
- Computational biology
- Artificial Intelligence
- Genetics
- Theoretical computer science
- Data Mining
- Nanotechnology
- Machine Learning
- Mathematical optimization
- Computer network
- Combinatorial chemistry
- Materials science
- Biophysics
- Algorithm
- Biochemistry
- Data science
- Software engineering
- Programming language
- Biological system
- Parallel computing
- Computer engineering
- Chemistry
Selected publications
AVO: Agentic Variation Operators for Autonomous Evolutionary Search
arXiv (Cornell University) · 2026-03-25
articleOpen accessAgentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today's most advanced GPU hardware.
VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
arXiv (Cornell University) · 2026-01-21
preprintOpen accessVIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on agent-run builds, tests, and differential checks, without per-change manual diff review. It implements a PyTorch-style eager tensor library with a C++20 core (CPU+CUDA), a torch-like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema-lite dispatcher, reverse-mode autograd, CUDA runtime (streams/events/graphs), a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI-assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test-suite composition, and summarize reproducible microbenchmarks from an accompanying AI-generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end-to-end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs; multi-GPU results are Blackwell-only and use an optional CUTLASS-based ring-allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.
Nature Communications · 2026-02-09
articleOpen accessAVO: Agentic Variation Operators for Autonomous Evolutionary Search
arXiv (Cornell University) · 2026-03-25
preprintOpen accessAgentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today's most advanced GPU hardware.
VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
ArXiv.org · 2026-01-21
articleOpen accessVIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on agent-run builds, tests, and differential checks, without per-change manual diff review. It implements a PyTorch-style eager tensor library with a C++20 core (CPU+CUDA), a torch-like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema-lite dispatcher, reverse-mode autograd, CUDA runtime (streams/events/graphs), a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI-assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test-suite composition, and summarize reproducible microbenchmarks from an accompanying AI-generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end-to-end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs; multi-GPU results are Blackwell-only and use an optional CUTLASS-based ring-allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
arXiv (Cornell University) · 2025-01-02 · 3 citations
preprintOpen accessSenior authorTransformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to dynamism of user requests while maintaining compatibility with CUDAGraph which requires static configuration. FlashInfer have been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieve 29-69% inter-token-latency reduction compared to compiler backends for LLM serving benchmark, 28-30% latency reduction for long-context inference, and 13-17% speedup for LLM serving with parallel generation.
Random access and semantic search in DNA data storage enabled by Cas9 and machine-guided design
Nature Communications · 2025-07-10 · 4 citations
articleOpen accessDNA is a promising medium for digital data storage due to its exceptional data density and longevity. Practical DNA-based storage systems require selective data retrieval to minimize decoding time and costs. In this work, we introduce CRISPR-Cas9 as a user-friendly tool for multiplexed, low-latency molecular data extraction. We first present a one-pot, multiplexed random access method in which specific data files are selectively cleaved using a CRISPR-Cas9 addressing system and then sequenced via nanopore technology. This approach was validated on a pool of 1.6 million DNA sequences, comprising 25 unique data files. We then developed a molecular similarity-search approach combining machine learning with Cas9-based retrieval. Using a deep neural network, we mapped a database of 1.74 million images into a reduced-dimensional embedding, encoding each embedding as a Cas9 target sequence. These target sequences act as molecular addresses, capturing clusters of semantically related images. By leveraging Cas9's off-target cleavage activity, query sequences cleave both exact and closely related targets, enabling high-fidelity retrieval of molecular addresses corresponding to in silico image clusters similar to the query. These approaches move towards addressing key challenges in molecular data retrieval by offering simplified, rapid isothermal protocols and new DNA data access capabilities.
Palu: Compressing KV-Cache with Low-Rank Projection
arXiv (Cornell University) · 2024-07-30
preprintOpen accessPost-training KV-Cache compression methods typically either sample a subset of effectual tokens or quantize the data into lower numerical bit width. However, these methods cannot exploit redundancy in the hidden dimension of the KV tensors. This paper presents a hidden dimension compression approach called Palu, a KV-Cache compression framework that utilizes low-rank projection to reduce inference-time LLM memory usage. Palu decomposes the linear layers into low-rank matrices, caches compressed intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) low-rank-aware quantization compatibility enhancements, and (4) optimized GPU kernels with operators fusion. Extensive experiments with popular LLMs show that Palu compresses KV-Cache by 50% while maintaining strong accuracy and delivering up to 1.89x on the RoPE-based attention module. When combined with quantization, Palu's inherent quantization-friendly design yields small to negligible extra accuracy degradation while saving additional memory than quantization-only methods and achieving up to 2.91x speedup for the RoPE-based attention. Moreover, it maintains comparable or even better accuracy (up to 1.19 lower perplexity) compared to quantization-only methods. These results demonstrate Palu's superior capability to effectively address the efficiency and memory challenges of LLM inference posed by KV-Cache. Our code is publicly available at: https://github.com/shadowpa0327/Palu
Optimizing Convolution Neural Nets with a Unified Transformation Approach
Communications of the ACM · 2024-09-25
article1st authorCorrespondingvMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs
arXiv (Cornell University) · 2024-05-01 · 1 citations
preprintOpen accessIoT devices based on microcontroller units (MCU) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNN). However, the memory of MCU is usually 2-3 orders of magnitude smaller than mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management and kernel implementation for MCU and relies on coarse-grained memory management techniques such as inplace update to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because using the fine-grained segment-level memory control, we can overlap the memory footprint of different tensors without the need to materialize them at the same time. Following this idea, we implement \ours{} for DNN inference on MCU. Evaluation for single layers on ARM Cortex-M4 and Cortex-M7 processors shows that \ours{} can reduce from $12.0\%$ to $49.5\%$ RAM usage and from $20.6\%$ to $53.0\%$ energy consumption compared to state-of-the-art work. For full DNN evaluation, \ours{} can reduce the memory bottleneck by $61.5\%$, enabling more models to be deployed on low-end MCUs.
Recent grants
NSF · $230k · 2018–2020
SHF:Small:Disciplined Approximate Programming for Energy-Efficient Computing
NSF · $300k · 2012–2015
SHF: Large: General-Purpose Approximate Computing Across the System Stack
NSF · $2.4M · 2015–2024
SHF: Small: Precise Concurrency Exceptions: Architecture Support, Semantics and System Implications
NSF · $516k · 2010–2014
NSF · $559k · 2009–2016
Frequent coauthors
- 977 shared
Karin Strauß
Microsoft (United States)
- 854 shared
Jeff Nivala
University of Washington
- 843 shared
Karen Zhang
The University of Texas at Austin
- 842 shared
Kathryn J Doroschak
Adaptive Biotechnologies (United States)
- 840 shared
Melissa Queen
University of Washington
- 839 shared
Aishwarya Mandyam
University of Washington
- 152 shared
David A. Bader
- 150 shared
Guojing Cong
Oak Ridge National Laboratory
Education
Ph.D., Computer Science
University of Illinois at Urbana-Champaign
Other, Electrical Engineering
University of São Paulo, Brazil
Other, Electrical Engineering
University of São Paulo, Brazil
Awards & honors
- NSF CAREER Award
- Sloan Research Fellowship
- Microsoft Research Faculty Fellowship
- 2013 IEEE TCCA Young Computer Architect Award
- Resume-aware match score
- Save to shortlist
- AI-drafted outreach
See your match with Luis Ceze
PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.
- Free to start
- No credit card
- 30-second signup