
Tony Nowatzki
Professor · University of California, Los Angeles · Computer Science
Active 2012–2026
About
Tony Nowatzki is an Associate Professor in the Department of Computer Science at the UCLA Samueli School of Engineering. His research interests include hardware/software co-design, modeling, and optimization. He has contributed to accelerator efficiency, heterogeneous execution models, and architecture simulation, with publications in top conferences and journals. He received his PhD from the University of Wisconsin–Madison in 2016. His awards include the MICRO Best Paper Runner-Up Award (2022), the NSF CAREER Award (2018), and an IEEE Micro Top Picks selection (2016).
Research topics
- Computer Science
- Programming language
- Parallel computing
- Artificial Intelligence
- Computer architecture
- Computer hardware
- Embedded system
- Algorithm
- Computational science
- Computer engineering
Selected publications
PolyArch/SegFold-AE: SegFold Artifact v1.0.2 (AE Submission)
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-07
other · Open access · Senior author
This release contains the artifact for the SegFold paper, including the full implementation, benchmarks, and scripts required to reproduce the experimental results.
Contents:
- Cycle-accurate simulator (C++)
- SuiteSparse benchmark datasets (download script provided)
- Scripts to reproduce Figures 8–12 and Table IV
- Hardware synthesis reports (RTL)
- Expected results for verification
Reproducibility: All main results reported in the paper can be reproduced using the provided scripts. A full reproduction can be launched with: ./scripts/run_all.sh
The simulator is deterministic, and generated outputs are expected to match the reference results in expected_results/.
Requirements:
- Linux (tested on Ubuntu 22.04+)
- GCC 10+, CMake 3.15+
- Python 3.8+ (numpy, scipy, matplotlib, pandas, pyyaml)
No proprietary software or specialized hardware is required.
Notes: This release corresponds to the version submitted for artifact evaluation.
PolyArch/SegFold-AE: SegFold Artifact v1.0.1 (AE Submission)
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-07
other · Open access · Senior author
This release contains the artifact for the SegFold paper, including the full implementation, benchmarks, and scripts required to reproduce the experimental results.
Contents:
- Cycle-accurate simulator (C++)
- SuiteSparse benchmark datasets (download script provided)
- Scripts to reproduce Figures 8–12 and Table IV
- Hardware synthesis reports (RTL)
- Expected results for verification
Reproducibility: All main results reported in the paper can be reproduced using the provided scripts. A full reproduction can be launched with: ./scripts/run_all.sh
The simulator is deterministic, and generated outputs are expected to match the reference results in expected_results/.
Requirements:
- Linux (tested on Ubuntu 22.04+)
- GCC 10+, CMake 3.15+
- Python 3.8+ (numpy, scipy, matplotlib, pandas, pyyaml)
No proprietary software or specialized hardware is required.
Notes: This release corresponds to the version submitted for artifact evaluation.
Cache and Near-Data Co-Design for Chiplets
IEEE Computer Architecture Letters · 2025-01-01
article · Senior author
Vendors are increasingly adopting chiplet-based designs to manage cost for large-scale multi-cores. While near-data computing, a paradigm that offloads computation to where data is located in memory, has been studied in the context of monolithic chip designs, its application to chiplets remains unexplored. In this letter, we explore how the paradigm extends to chiplets in a system where computation is offloaded to accelerators collocated within the last-level-cache (LLC) structure. We explore both shared and private LLC designs across a variety of workloads, both large-scale graph computations and more regular-access workloads, in order to understand how to optimize the cache and topology design for near-data workloads. We find that with a mesh chiplet architecture with a shared LLC, near-data optimization can achieve an 8.70× speedup on graph workloads, providing an even greater benefit than in traditional systems.
LLM-DSE: Searching Accelerator Parameters with LLM Agents
ArXiv.org · 2025-05-18
preprint · Open access
Even though high-level synthesis (HLS) tools mitigate the challenges of programming domain-specific accelerators (DSAs) by raising the abstraction level, optimizing hardware directive parameters remains a significant hurdle. Existing heuristic and learning-based methods struggle with adaptability and sample efficiency. We present LLM-DSE, a multi-agent framework designed specifically for optimizing HLS directives. Combining LLM with design space exploration (DSE), our explorer coordinates four agents: Router, Specialists, Arbitrator, and Critic. These multi-agent components interact with various tools to accelerate the optimization process. LLM-DSE leverages essential domain knowledge to identify efficient parameter combinations while maintaining adaptability through verbal learning from online interactions. Evaluations on the HLSyn dataset demonstrate that LLM-DSE achieves substantial 2.55× performance gains over state-of-the-art methods, uncovering novel designs while reducing runtime. Ablation studies validate the effectiveness and necessity of the proposed agent interactions. Our code is open-sourced here: https://github.com/Nozidoali/LLM-DSE.
NoH: NoC Compilation in High-Level Synthesis
2025-05-04 · 2 citations
article
In FPGAs, high communication latency in multi-die chips has driven the integration of hardened networks-on-chip (NoCs) in commercial devices. However, for programming FPGAs with high-level synthesis (HLS), existing tools only provide low-level, cumbersome abstractions that only work for offloading memory accesses. Furthermore, these abstractions remain inaccessible to programmers due to their reliance on placement knowledge. While automatically leveraging the NoC without manual intervention is ideal, it poses several challenges:
1. Managing the trade-off in resource utilization between the hard NoC and the Programmable Logic (PL).
2. Allocating limited hard NoC resources among the different communications in a design.
3. Aligning hard NoC and PL placement even though the actual PL placement cannot be determined beforehand.
We address these challenges by developing NoH, the first HLS flow that automates hard NoC offloading. First, we develop a formal NoC-aware placement algorithm that leverages integer linear programming (ILP) and considers the first two challenges for offloading external memory accesses and latency-insensitive communication between modules. Then, we arrange the ports synergistically with PL modules via a port-affinity model that approximates the PL placement. Finally, NoH is integrated into an end-to-end HLS flow and evaluated on 4 workloads with diverse communication patterns. NoH gains 20% FPGA frequency over AMD tools by leveraging the hard NoC. Compared to AutoBridge [1], a recent high-level physical synthesis technique that optimizes frequency but does not consider the hard NoC, NoH never fails place-and-route because it offloads inter-die crossings (AutoBridge fails in 31% of workload configurations tested) and is faster (6%) for the rest.
2025-06-20
article · Open access · Senior author
Data movement is the dominant energy, performance, and scalability bottleneck in modern architectures. Systems have tackled data movement by distributing data, e.g., via non-uniform memory access (NUMA) architectures. However, to reduce data movement, these architectures must identify critical data and place it closer to compute. Clever data placement is complex and often ineffective. Spatial dataflow architectures (SDAs) present a new opportunity to tackle data movement. SDAs distribute program instructions across a spatial fabric of processing elements (PEs). On large SDAs, some PEs are necessarily closer to memory than others, giving rise to non-uniform processing-element access (NUPEA). Clever instruction placement can thus reduce data movement by, e.g., placing critical loads close to memory. This paper introduces NUPEA and contrasts it with prior data-centric approaches to scaling data movement. We find that it is often easier for the compiler to identify critical loads than the data they access, making NUPEA applicable where NUMA is not. We present simple architecture and compiler optimizations for NUPEA and implement them on the Monaco SDA architecture and effcc compiler, both industry products by Efficient Computer. On Monaco, across a range of important kernels, NUPEA yields an average 28% speedup over a uniform-PE-access (UPEA) SDA and an average 20% speedup over a UPEA SDA with NUMA.
Can Asymmetric Tile Buffering Be Beneficial?
ArXiv.org · 2025-11-20
preprint · Open access
General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input A along the dimension M matches the output tile size of C. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2 AI Engine (AIE), achieving up to a 4.54× speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16–BF16 GEMM, establishing a new performance record for XDNA2 AIE.
Demystifying FPGA Hard NoC Performance
ArXiv.org · 2025-03-13
preprint · Open access
With the advent of modern multi-chiplet FPGA architectures, vendors have begun integrating hardened NoCs to address the scalability, resource usage, and frequency disadvantages of soft NoCs. However, as this work shows, effectively harnessing these hardened NoCs is not trivial. It requires detailed knowledge of the microarchitecture and how it relates to the physical design of the FPGA. Existing literature has provided in-depth analyses of NoCs in MPSoC devices, but few studies have systematically evaluated hardened NoCs in FPGAs, which have several unique implications. This work aims to bridge this knowledge gap by demystifying the performance and design trade-offs of hardened NoCs on FPGAs. Our work performs detailed performance analysis of hard (and soft) NoCs under different settings, including diverse NoC topologies, routing strategies, traffic patterns, and different external memories under various NoC placements. In the context of Versal FPGAs, our results show that using hardened NoCs in multi-SLR designs can reduce expensive cross-SLR link usage by up to 30–40%, eliminate general-purpose logic overhead, and remove most critical paths caused by large on-chip crossbars. However, under certain aggressive traffic patterns, the frequency advantage of hardened NoCs is outweighed by the inefficiency in the network microarchitecture. We also observe suboptimal solutions from the NoC compiler and distinct performance variations between the vertical and horizontal interconnects, underscoring the need for careful design. These findings serve as practical guidelines for effectively integrating hardened NoCs and highlight important trade-offs for future FPGA-based systems.
SPGPU: Spatially Programmed GPU
IEEE Computer Architecture Letters · 2024-07-01
article · Senior author
Communication is a critical bottleneck for GPUs, manifesting as energy and performance overheads due to network-on-chip (NoC) delay and congestion. While many algorithms exhibit locality among thread blocks and accessed data, modern GPUs lack the interface to exploit this locality: GPU thread blocks are mapped to cores obliviously. In this work, we explore a simple extension to the conventional GPU programming interface to enable control over the spatial placement of data and threads, yielding new opportunities for aggressive locality optimizations within a GPU kernel. Across 7 workloads that can take advantage of these optimizations, on a 32-SM (or 128-SM) GPU we achieve a 1.28× (1.54×) speedup and a 35% (44%) reduction in NoC traffic compared to baseline non-spatial GPUs.
Frequent coauthors
- Karthikeyan Sankaralingam · 44 shared
- Vinay Gangadhar · 15 shared
- Cristian Estan · Google (United States) · 13 shared
- Jian Weng · King Abdullah University of Science and Technology · 13 shared
- Sihao Liu · Beijing Tian Tan Hospital · 12 shared
- Vidushi Dadu · Google (United States) · 10 shared
- Michael C. Ferris · ImaginAb (United States) · 9 shared
- Nilay Vaish · Google (United States) · 9 shared
Awards & honors
- MICRO Best Paper Runner-Up Award, 2022
- NSF Career Award, 2018
- IEEE Micro Top Picks, 2016
- Best of CAL, 2015