Luis Ceze

Professor · Verified

University of Washington · Computer Science & Engineering

Active 2002–2026

h-index: 59

Citations: 14.2k

Papers: 404 (65 in the last 5 years)

Funding: $4.0M

About

Luis Ceze is a Professor at the Paul G. Allen School of Computer Science and Engineering at the University of Washington. He leads three research groups focusing on hardware/software systems (Sampa), machine learning systems and architecture (SAMPL), and the use of DNA for information technology applications (MISL). In addition to his academic roles, he is the co-founder and CEO of OctoML, where he leads a team working on machine learning systems. His research lies at the intersection of computer architecture, programming languages, machine learning, and biology, with the goal of exploring new and improved methods for building computing systems. Luis Ceze received his PhD in Computer Science from the University of Illinois at Urbana-Champaign and holds a BEng and MEng in Electrical Engineering from the University of São Paulo, Brazil. Born in São Paulo, Brazil, he balances his professional work with personal interests such as cooking and spending time with his family.

Research topics

Computer Science
Biology
Computational biology
Artificial Intelligence
Genetics
Theoretical computer science
Data Mining
Nanotechnology
Machine Learning
Mathematical optimization
Computer network
Combinatorial chemistry
Materials science
Biophysics
Algorithm
Biochemistry
Data science
Software engineering
Programming language
Biological system
Parallel computing
Computer engineering
Chemistry

Selected publications

AVO: Agentic Variation Operators for Autonomous Evolutionary Search
arXiv (Cornell University) · 2026-03-25
preprint · Open access
Agentic Variation Operators (AVO) are a new family of evolutionary variation operators that replace the fixed mutation, crossover, and hand-designed heuristics of classical evolutionary search with autonomous coding agents. Rather than confining a language model to candidate generation within a prescribed pipeline, AVO instantiates variation as a self-directed agent loop that can consult the current lineage, a domain-specific knowledge base, and execution feedback to propose, repair, critique, and verify implementation edits. We evaluate AVO on attention, among the most aggressively optimized kernel targets in AI, on NVIDIA Blackwell (B200) GPUs. Over 7 days of continuous autonomous evolution on multi-head attention, AVO discovers kernels that outperform cuDNN by up to 3.5% and FlashAttention-4 by up to 10.5% across the evaluated configurations. The discovered optimizations transfer readily to grouped-query attention, requiring only 30 minutes of additional autonomous adaptation and yielding gains of up to 7.0% over cuDNN and 9.3% over FlashAttention-4. Together, these results show that agentic variation operators move beyond prior LLM-in-the-loop evolutionary pipelines by elevating the agent from candidate generator to variation operator, and can discover performance-critical micro-architectural optimizations that produce kernels surpassing state-of-the-art expert-engineered attention implementations on today's most advanced GPU hardware.
VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
arXiv (Cornell University) · 2026-01-21
preprint · Open access
VIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, "fully generated" refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on agent-run builds, tests, and differential checks, without per-change manual diff review. It implements a PyTorch-style eager tensor library with a C++20 core (CPU+CUDA), a torch-like Python overlay via nanobind, and an experimental Node.js/TypeScript interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema-lite dispatcher, reverse-mode autograd, CUDA runtime (streams/events/graphs), a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI-assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test-suite composition, and summarize reproducible microbenchmarks from an accompanying AI-generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end-to-end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs; multi-GPU results are Blackwell-only and use an optional CUTLASS-based ring-allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a "Frankenstein" composition effect where locally correct subsystems interact to yield globally suboptimal performance.
Author Correction: Random access and semantic search in DNA data storage enabled by Cas9 and machine-guided design
Nature Communications · 2026-02-09
article · Open access
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
arXiv (Cornell University) · 2025-01-02 · 3 citations
preprint · Open access · Senior author
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction relative to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
Random access and semantic search in DNA data storage enabled by Cas9 and machine-guided design
Nature Communications · 2025-07-10 · 4 citations
article · Open access
DNA is a promising medium for digital data storage due to its exceptional data density and longevity. Practical DNA-based storage systems require selective data retrieval to minimize decoding time and costs. In this work, we introduce CRISPR-Cas9 as a user-friendly tool for multiplexed, low-latency molecular data extraction. We first present a one-pot, multiplexed random access method in which specific data files are selectively cleaved using a CRISPR-Cas9 addressing system and then sequenced via nanopore technology. This approach was validated on a pool of 1.6 million DNA sequences, comprising 25 unique data files. We then developed a molecular similarity-search approach combining machine learning with Cas9-based retrieval. Using a deep neural network, we mapped a database of 1.74 million images into a reduced-dimensional embedding, encoding each embedding as a Cas9 target sequence. These target sequences act as molecular addresses, capturing clusters of semantically related images. By leveraging Cas9's off-target cleavage activity, query sequences cleave both exact and closely related targets, enabling high-fidelity retrieval of molecular addresses corresponding to in silico image clusters similar to the query. These approaches move towards addressing key challenges in molecular data retrieval by offering simplified, rapid isothermal protocols and new DNA data access capabilities.
Palu: Compressing KV-Cache with Low-Rank Projection
arXiv (Cornell University) · 2024-07-30
preprint · Open access
Post-training KV-Cache compression methods typically either sample a subset of effectual tokens or quantize the data into a lower numerical bit width. However, these methods cannot exploit redundancy in the hidden dimension of the KV tensors. This paper presents a hidden-dimension compression approach called Palu, a KV-Cache compression framework that utilizes low-rank projection to reduce inference-time LLM memory usage. Palu decomposes the linear layers into low-rank matrices, caches compressed intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) low-rank-aware quantization compatibility enhancements, and (4) optimized GPU kernels with operator fusion. Extensive experiments with popular LLMs show that Palu compresses KV-Cache by 50% while maintaining strong accuracy and delivering up to 1.89x speedup on the RoPE-based attention module. When combined with quantization, Palu's inherent quantization-friendly design yields small to negligible extra accuracy degradation while saving more memory than quantization-only methods and achieving up to 2.91x speedup for the RoPE-based attention. Moreover, it maintains comparable or even better accuracy (up to 1.19 lower perplexity) compared to quantization-only methods. These results demonstrate Palu's superior capability to effectively address the efficiency and memory challenges of LLM inference posed by KV-Cache. Our code is publicly available at: https://github.com/shadowpa0327/Palu
Optimizing Convolution Neural Nets with a Unified Transformation Approach
Communications of the ACM · 2024-09-25
article · 1st author · Corresponding
vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs
arXiv (Cornell University) · 2024-05-01 · 1 citation
preprint · Open access
IoT devices based on microcontroller units (MCUs) provide ultra-low power consumption and ubiquitous computation for near-sensor deep neural network (DNN) models. However, MCU memory is usually 2-3 orders of magnitude smaller than that of mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management from kernel implementation and relies on coarse-grained memory management techniques such as in-place update to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of the MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because fine-grained, segment-level memory control lets the memory footprints of different tensors overlap without materializing them at the same time. Following this idea, we implement vMCU for DNN inference on MCUs. Evaluation of single layers on ARM Cortex-M4 and Cortex-M7 processors shows that vMCU reduces RAM usage by 12.0% to 49.5% and energy consumption by 20.6% to 53.0% compared to state-of-the-art work. For full DNN evaluation, vMCU reduces the memory bottleneck by 61.5%, enabling more models to be deployed on low-end MCUs.

Recent grants

EAGER: Closed-loop Silicon-biomolecular Systems with Integrated Synthesis-fluidics-nanopore Interfaces
NSF · $230k · 2018–2020
SHF: Small: Disciplined Approximate Programming for Energy-Efficient Computing
NSF · $300k · 2012–2015
SHF: Large: General-Purpose Approximate Computing Across the System Stack
NSF · $2.4M · 2015–2024
SHF: Small: Precise Concurrency Exceptions: Architecture Support, Semantics and System Implications
NSF · $516k · 2010–2014
CAREER: Deterministic Shared Memory Multiprocessing: Vision, Architecture, and Impact on Programmability
NSF · $559k · 2009–2016

Frequent coauthors

Karin Strauss
Microsoft (United States)
977 shared
Jeff Nivala
University of Washington
854 shared
Karen Zhang
The University of Texas at Austin
843 shared
Kathryn J Doroschak
Adaptive Biotechnologies (United States)
842 shared
Melissa Queen
University of Washington
840 shared
Aishwarya Mandyam
University of Washington
839 shared
David A. Bader
152 shared
Guojing Cong
Oak Ridge National Laboratory
150 shared

Education

Ph.D., Computer Science
University of Illinois at Urbana-Champaign
M.Eng., Electrical Engineering
University of São Paulo, Brazil
B.Eng., Electrical Engineering
University of São Paulo, Brazil

Awards & honors

NSF CAREER Award
Sloan Research Fellowship
Microsoft Research Faculty Fellowship
2013 IEEE TCCA Young Computer Architect Award
