
Paul Gratz
Professor, Electrical & Computer Engineering · Texas A&M University
Active 2006–2025
About
Paul Gratz is a Professor in the Department of Electrical and Computer Engineering at Texas A&M University and is affiliated with the College of Engineering. He holds the Eugene E. Webb '43 Professorship and can be contacted by phone at 979-488-4551 or by email at pgratz@tamu.edu. He earned his Ph.D. in Electrical and Computer Engineering from The University of Texas at Austin in 2008. His research covers security, power, reliability, and performance in multicore and distributed computer architectures, as well as processor memory systems and on-chip interconnection networks.
Research topics
- Computer Science
- Algorithm
- Embedded system
- Computer hardware
- Computer architecture
- Distributed computing
- Computer network
- Engineering
- Parallel computing
- Telecommunications
- Operating system
Selected publications
Light-weight Cache Replacement for Instruction Heavy Workloads
2025-06-20 · 2 citations
Article · Open access
The last-level cache (LLC) is the last chance for memory accesses from the processor to avoid the costly latency of accessing the main memory. In recent years, an increasing number of instruction heavy workloads have put pressure on the last-level cache. We find that, for instruction heavy workloads, a simple replacement policy with minimal overhead provides at least the same benefit as a state-of-the-art, high-overhead replacement policy in the presence of aggressive prefetching. Our proposal is based on specifying insertion and promotion vectors (IPVs) as a generalization of re-reference interval prediction (RRIP) in such a way that the space of feasible policies may be searched exhaustively to find the best policy for the training set of workloads. The policies are formulated to deliver the best performance taking into account demand and prefetch accesses. We show that our technique, Prefetch Aware Coarse-grained Insertion and Promotion Vectors (PACIPV), improves performance over a state-of-the-art LLC replacement policy (Mockingjay) for instruction heavy workloads, and remains competitive for data heavy workloads with significantly less hardware overhead. We show that RRIP-based IPVs are very easy to implement but outperform far more complex replacement policies. PACIPV achieves a speedup of 3.3% over the baseline of LRU, outperforming SRRIP by 1.1% and the much more hardware intensive Mockingjay by 0.1%.
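To make the idea of an insertion and promotion vector concrete, here is a minimal Python sketch of a single cache set. It is an illustration only, not the paper's implementation: it assumes 2-bit re-reference prediction values (RRPVs), one five-entry vector shared by all accesses, and the usual RRIP victim search. The names `IPVSet`, `SRRIP_HP`, `hits`, and `best_ipv` are hypothetical, and the separate treatment of demand and prefetch accesses that PACIPV describes is left out.

```python
from dataclasses import dataclass, field
from itertools import product

MAX_RRPV = 3  # 2-bit RRPV; 3 means "re-reference predicted in the distant future"


@dataclass
class IPVSet:
    """One cache set whose RRIP behaviour is driven by a vector `ipv`.

    ipv[r] for r in 0..MAX_RRPV is the new RRPV of a block that hits at RRPV r
    (promotion); ipv[MAX_RRPV + 1] is the RRPV given to a newly inserted block.
    All entries are assumed to lie in 0..MAX_RRPV.
    """
    ways: int
    ipv: tuple
    tags: list = field(default_factory=list)
    rrpv: list = field(default_factory=list)

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then inserted)."""
        if tag in self.tags:
            i = self.tags.index(tag)
            self.rrpv[i] = self.ipv[self.rrpv[i]]  # promotion on hit
            return True
        if len(self.tags) < self.ways:  # free way available, no eviction
            self.tags.append(tag)
            self.rrpv.append(self.ipv[MAX_RRPV + 1])
            return False
        # RRIP victim search: age every block until one reaches MAX_RRPV.
        while MAX_RRPV not in self.rrpv:
            self.rrpv = [r + 1 for r in self.rrpv]
        victim = self.rrpv.index(MAX_RRPV)
        self.tags[victim] = tag
        self.rrpv[victim] = self.ipv[MAX_RRPV + 1]
        return False


# SRRIP with hit priority written as a vector: any hit promotes to 0, insert at 2.
SRRIP_HP = (0, 0, 0, 0, 2)


def hits(ipv, trace, ways=4):
    """Count hits for one set driven by `ipv` over a sequence of block tags."""
    cache_set = IPVSet(ways=ways, ipv=ipv)
    return sum(cache_set.access(tag) for tag in trace)


def best_ipv(trace, ways=4):
    """Exhaustively search all 4**5 = 1024 vectors for the most hits on `trace`."""
    space = product(range(MAX_RRPV + 1), repeat=MAX_RRPV + 2)
    return max(space, key=lambda v: hits(v, trace, ways))
```

With only 1024 candidate vectors in this toy setting, `best_ipv` can simply try them all, which mirrors the exhaustive search the abstract mentions at a much smaller scale.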
Skia: Exposing Shadow Branches
2025-03-27
Article
Modern processors implement a decoupled front-end, often using a form of Fetch Directed Instruction Prefetching (FDIP), to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1-I cache). As contemporary data center applications become more complex, their code footprints also grow, resulting in a high number of Branch Target Buffer (BTB) misses. These BTB missing branches typically have previously been decoded and placed in the BTB, but have since been evicted, leading to BTB misses now. FDIP can alleviate L1-I cache misses, but its reliance on the BPU's tracking structures means that when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1-I cache.
R-Max: A Method for Approximating the Benefit of Ideal Prefetching and Replacement Policy
IEEE Computer Architecture Letters · 2025-07-01
Article · Senior author
Memory performance continues lagging behind the demand of processing elements, a well known phenomenon known as the memory wall. Cache prefetching is a well studied and effective method to bridge this gap. Despite a long history of study and the existence of many prefetchers, an open question remains with respect to the upper bound of performance that might be had from prefetching. Here, we propose a system, R-Max, to approximate ideal prefetching and replacement policy with realistic constraints on bandwidth, cache structure, and capacity but oracular knowledge of future accesses. We compare R-Max's approximated ideal speedup against the speedup of current state of the art prefetchers to show how much remaining performance gain may be left on the table for cache prefetching techniques. We show that, for a set of workloads taken from SPEC CPU2017, CVP, GAP and XSBench, up to 299.6% maximum and 72.6% average gains are possible under realistic assumptions for a prefetcher that perfectly predicts future accesses, outperforming current state of the art prefetchers by 60.8%. Interestingly, we see that the workloads where R-Max shows the most potential have little relationship with those where existing prefetchers perform best. Taken together our results highlight the need for new research into prefetching techniques for these under-exploited workloads.
Benchmarking 3D Gaussian Splatting Rendering
2025-05-11
Article · Senior author
The growing demand for 3D modeling, particularly in applications like augmented reality (AR) and virtual reality (VR), has underscored the need for more efficient techniques. Manual 3D modeling requires extensive effort and a specialized skill set. Although traditional Photogrammetry is faster than manual 3D modeling, it is still compute-intensive and time-consuming. Emerging methods like 3D Gaussian Splatting (3DGS) offer a faster and more cost-effective solution for model generation (training), at the cost of requiring a different rendering framework. For their widespread adoption in applications such as AR and VR, it is crucial to meet rendering performance requirements, such as power and frame rate. Existing works mainly focus on 3DGS training performance and lack comprehensive analysis and comparison with traditional graphics (TG) rendering techniques that use a mesh representation. Hence, a thorough benchmarking of 3DGS rendering is necessary to identify its limitations and potential areas for improvement. In this paper, we conduct a comprehensive performance study of 3DGS rendering using state-of-the-art frameworks on three different hardware platforms. We evaluate 3DGS performance against TG rendering in terms of frames-per-second (FPS), power, GPU memory footprint, GPU utilization, frametime breakdown, FPS to Watt, several GPU performance counters, and rendered image quality. We observe that 3DGS generates high-quality images with an average PSNR of 37. Our analysis reveals that 3DGS rendering requires a 3x improvement in the FPS to Watt to achieve the performance of TG rendering. Our evaluation shows that 3DGS occupies, on average, a 2x lesser GPU memory footprint compared to TG rendering. Our results indicate that around 10% image quality can be a tradeoff for a 100% improvement in frame rate. We see that the rasterization step, on average, consumes 64.76% of frame time and is the main bottleneck to 3DGS. We identify that 3DGS has, on average, 25% higher GPU utilization than TG rendering. However, its performance is limited by stalls due to branching and synchronization, pointing to possible improvements in 3DGS algorithm and making it GPU friendly.
IEEE Computer Architecture Letters · 2025-07-01 · 1 citation
Article
Modern OOO CPUs have very deep pipelines with large branch misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on speculation levels. Architects often employ trace-driven simulation models in the design exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster than execution-driven models, reducing the often hundreds of thousands of simulation hours needed to explore new micro-architectural ideas. Despite the strong benefits of trace-driven simulation, it often fails to adequately model the consequences of wrong-path execution because obtaining such traces from real systems is nontrivial. Prior works exclusively consider either pollution or prefetching in the instruction stream/L1-I cache and often ignore the impact on the data stream. Here, we examine wrong path execution in simulation results and design a set of infrastructure for enabling wrong-path execution in a trace driven simulator. Our analysis shows the wrong path affects structures on both the instruction and data sides extensively, resulting in performance variations ranging from -3.05% to 20.9% versus ignoring wrong path. To benefit the research community and enhance the accuracy of simulators, we opened our traces and tracing utility in the hopes that industry can provide wrong-path traces generated by their internal simulators, enabling academic simulation without exposing industry IP.
Estimating CPI Stacks From Multiplexed Performance Counter Data Using Machine Learning
IEEE Computer Architecture Letters · 2025-01-01 · 1 citation
Article
Optimizing software at runtime is much easier with a clear understanding of the bottlenecks facing the software. CPI stacks are a common method of visualizing these bottlenecks. However, existing proposals to implement CPI stacks require hardware modifications. To compute CPI stacks without modifying the CPU, we demonstrate CPI stacks can be estimated from existing performance counters using machine learning.
Flow Correlator: A Flow Table Cache Management Strategy
2024-07-29
Article
Switching, routing, and security functions are the backbone of packet processing networks. Fast and efficient processing of packets requires maintaining the state information for many transient network connections. In particular, modern stateful firewalls, security monitoring devices, and Software-Defined Networking (SDN) dataplanes require maintaining stateful flow tables. These flow tables often grow much larger than can fit on-chip, requiring caching to maintain performance. This paper focuses on improving caching efficiency, an important architectural component of the packet processing data planes. We present a novel predictive approach (Flow Correlator) to network flow table cache management by adapting the Hashed Perceptron binary classifier to improve the reliability and performance of the data plane caching. We also discovered an iterative approach to feature selection and ranking while adapting the Hashed Perceptron mechanism to network flow table cache management. Through extensive experimentation, we demonstrate improved caching efficiency of the proposed Flow Correlator mechanism. We also rigorously validate the performance and generic applicability of our technique across real-world datasets.
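For readers unfamiliar with the Hashed Perceptron classifier named in this abstract, the following is a minimal, generic Python sketch of the technique. It is not the Flow Correlator design: the class name, the XOR-fold hash, the threshold and weight limits, and the example features (an IP protocol number and a destination port) are all assumptions chosen for illustration.

```python
class HashedPerceptron:
    """Generic hashed perceptron: one small weight table per feature.

    Each feature value is hashed to pick one weight from its own table; the
    prediction is the sign of the summed weights. Weights are updated when
    the prediction is wrong or the sum is weaker than `theta`.
    """

    def __init__(self, n_features, table_size=256, theta=8, w_max=31):
        self.tables = [[0] * table_size for _ in range(n_features)]
        self.table_size = table_size
        self.theta = theta      # training threshold (assumed value)
        self.w_max = w_max      # saturation limit for each weight (assumed)

    def _index(self, value):
        # Toy XOR-fold hash of an integer feature; a real design would pick
        # a better-mixing hash.
        return (value ^ (value >> 8)) % self.table_size

    def score(self, features):
        return sum(t[self._index(f)] for t, f in zip(self.tables, features))

    def predict(self, features):
        """True means 'predicted to be reused', e.g. worth keeping cached."""
        return self.score(features) >= 0

    def train(self, features, outcome):
        s = self.score(features)
        if (s >= 0) != outcome or abs(s) < self.theta:
            step = 1 if outcome else -1
            for t, f in zip(self.tables, features):
                j = self._index(f)
                t[j] = max(-self.w_max, min(self.w_max, t[j] + step))


# Hypothetical usage: features are (IP protocol number, destination port).
p = HashedPerceptron(n_features=2)
for _ in range(10):
    p.train((6, 443), True)    # pretend TCP/443 flow entries were reused
    p.train((17, 53), False)   # pretend UDP/53 flow entries were not
```

Keeping one table per feature means a hash collision only mixes two values of the same feature, and training stops for a pattern once its score is confidently on the correct side of zero.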
arXiv (Cornell University) · 2024-08-12
Preprint · Open access
Modern OOO CPUs have very deep pipelines with large branch misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on speculation levels. Architects often employ trace-driven simulation models in the design exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster than execution-driven models, reducing the often hundreds of thousands of simulation hours needed to explore new micro-architectural ideas. Despite this strong benefit of trace-driven simulation, these often fail to adequately model the consequences of wrong path because obtaining them is nontrivial. Prior works consider either a positive or negative impact of wrong path but not both. Here, we examine wrong path execution in simulation results and design a set of infrastructure for enabling wrong-path execution in a trace driven simulator. Our analysis shows the wrong path affects structures on both the instruction and data sides extensively, resulting in performance variations ranging from -3.05% to 20.9% when ignoring wrong path. To benefit the research community and enhance the accuracy of simulators, we opened our traces and tracing utility in the hopes that industry can provide wrong-path traces generated by their internal simulators, enabling academic simulation without exposing industry IP.
arXiv (Cornell University) · 2024-08-22
Preprint · Open access
Modern processors implement a decoupled front-end in the form of Fetch Directed Instruction Prefetching (FDIP) to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1I). As data center applications become more complex, their code footprints also grow, resulting in an increase in Branch Target Buffer (BTB) misses. FDIP can alleviate L1I cache misses, but when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1I cache. We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched, but these missing branches have not yet been decoded and inserted into the BTB. This is because the instruction line is decoded from an entry point (which is the target of the previous taken branch) to an exit point (the taken branch). We call branch instructions present in the ignored portion of the cache line "Shadow Branches". Here we present Skeia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting them into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss. With a minimal storage state of 12.25KB, Skeia delivers a geomean speedup of ~5.7% over an 8K-entry BTB (78KB) and ~2% versus adding an equal amount of state to the BTB across 16 front-end bound applications. Since many branches stored in the SBB are unique compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skeia across all examined sizes until saturation.
Aiding Microprocessor Performance Validation with Machine Learning
2024-05-05 · 1 citation
Article
Microprocessor validation is a complex task that consumes substantial engineering time. Degradation of the system performance that does not affect its functional correctness is particularly difficult to address given the lack of a golden reference for performance. This work introduces an automated methodology based on machine learning to assist in localizing performance faults, aiming to speed up the validation process. Our results show that, for the injected performance issues whose average IPC impact is > 1%, our technique is able to help localize the exact microarchitectural unit where the degradation occurs ~75% of the time while achieving a top-3 unit accuracy (out of 11 possible locations) of > 97%. The proposed setup requires a few seconds to perform a localization inference, leading to a reduced validation time.
Recent grants
NSF · $225k · 2018–2021
SHF: Small: Emerging Memory Architectures for Big Memory Applications
NSF · $440k · 2013–2017
Frequent coauthors
- 19 shared
Daniel A. Jiménez
- 16 shared
Mian Qin
Texas A&M University
- 15 shared
Narasimha Reddy
Texas A&M University
- 14 shared
Jinchun Kim
- 14 shared
A. L. Narasimha Reddy
- 14 shared
Fei Wen
Texas A&M University
- 14 shared
Jiang Hu
Massachusetts General Hospital
- 12 shared
Gino Chacon
Labs
Electrical & Computer Engineering, Texas A&M University · PI
Education
- 2008
PhD, Electrical and Computer Engineering
University of Texas at Austin
- 1997
MS, Electrical and Computer Engineering
University of Florida
- 1994
BS, Electrical and Computer Engineering
University of Florida