Glenn Reinman

Professor / Vice Chair of Undergraduate Programs

University of California, Los Angeles · Computer Science

Active 1998–2025

h-index: 40
Citations: 4.4k
Papers: 140 (4 in the last 5 years)
Funding: $362k

About

Glenn Reinman is a Professor and Vice Chair of Undergraduate Programs in the Department of Computer Science at the UCLA Samueli School of Engineering. His research interests include processor architecture design and optimization, speculative execution, profile-guided optimization, instruction-level parallelism, computer architecture, augmented reality, parallel programming, graphics processing, compilers, and systems. He holds a Ph.D. (2001) and an M.S. in Computer Science (1999), both from the University of California, San Diego. His honors include SIGMICRO's Test of Time Award (2021). His work centers on computer architecture and related systems.

Research topics

  • Computer Science
  • Embedded system
  • Operating system
  • Parallel computing
  • Computer hardware
  • Control engineering
  • Engineering
  • Telecommunications
  • Physics
  • Systems engineering

Selected publications

  • Fine Grain 3D Integration for Microarchitecture Design Through Cube Packing Exploration

    ArXiv.org · 2025-07-13

    article · Open access

    Most previous 3D IC research focused on stacking traditional 2D silicon layers, so the interconnect reduction is limited to inter-block delays. In this paper, we propose techniques that enable efficient exploration of the 3D design space where each logical block can span more than one silicon layer. Although further power and performance improvement is achievable through fine-grain 3D integration, the necessary modeling and tool infrastructure has been mostly missing. We develop a cube packing engine which can simultaneously optimize physical and architectural design for effective utilization of 3D in terms of performance, area, and temperature. Our experimental results using a design driver show 36% performance improvement (in BIPS) over 2D and 14% over 3D with single-layer blocks. Additionally, multi-layer blocks can provide up to 30% reduction in power dissipation compared to the single-layer alternatives. Peak temperature of the design is kept within limits as a result of thermal-aware floorplanning and thermal via insertion techniques.

  • BeaconGNN: Large-Scale GNN Acceleration with Out-of-Order Streaming In-Storage Computing

    2024 · 22 citations

    Senior author · Corresponding
    • Computer Science
    • Physics

    Prior in-storage computing (ISC) solutions show fundamental drawbacks when applied to GNN acceleration. First, they obey a strict ordering of GNN neighbor sampling. Such serialization fails to utilize flash internal parallelism. Second, the I/O sizes generated by GNN are much smaller than the minimum flash access granularity. The limited channel bandwidth is wasted when serving the requests. Third, the prior solutions rely on firmware-based request processing, making the backend I/O throughput constrained by the embedded core processing power. To address these challenges, we propose BeaconGNN, an in-storage computing (ISC) design for GNN that supports both large-scale graph structures and feature tables. First, it utilizes a novel graph format to enable out-of-order GNN neighbor sampling, improving flash resource utilization. Second, it deploys near-data processing engines across multiple levels of the flash hierarchy (i.e., controller, channel, and die). Specifically, flash-die-level samplers perform neighbor sampling while simultaneously reducing channel transfers. Flash-channel-level command routers communicate with backend dies without the involvement of flash firmware. Lastly, a spatial accelerator is attached to the device bus to accelerate GNN computation. With our software and hardware co-design, BeaconGNN achieves up to 11.6x higher throughput and 4x better energy efficiency than the state-of-the-art ISC design.

  • An Evaluation of Deeply Decoupled Cores

    SSRN Electronic Journal · 2024 · 3 citations

    Senior author · Corresponding
    • Computer Science
    • Embedded system

  • Reconfigurable Accelerator Compute Hierarchy: A Case Study using Content-Based Image Retrieval

    2020-10-01

    article · Senior author

    The recent adoption of reconfigurable hardware accelerators in data centers has significantly improved their computational power and energy efficiency for compute-intensive applications. However, for common communication-bound analytics workloads, these benefits are limited by the efficiency of data movement in the IO stack. For this reason, server architects are proposing a more data-centric acceleration scheme by moving the compute elements closer to the data. While prior studies focus on the benefits of Near Data Processing (NDP) solely on one level of the memory hierarchy (one of cache, main memory or storage), we focus on the collaboration of NDP accelerators at all levels and their collective benefits in accelerating an application pipeline. In this paper, we present a Reconfigurable Accelerator Compute Hierarchy (ReACH) that combines on-chip, near-memory, and near-storage accelerators. Each memory level has a reconfigurable accelerator chip attached to it, which provides distinct compute and memory capabilities and offers a broad spectrum of acceleration options. To enable effective acceleration on various application pipelines, we propose a holistic approach to coordinate between the compute levels, reducing inter-level data access interference and achieving asynchronous task flow control. To minimize the programming efforts of using the compute hierarchy, a uniform programming interface is designed to decouple the ReACH configuration from the user application source code and allow runtime adjustments without modifying the deployed application. We experimentally deploy a billion-scale Content-Based Image Retrieval (CBIR) system on ReACH. Simulation results demonstrate that a proper application mapping eliminates unnecessary data movement, and ReACH achieves 4.5x throughput gain while reducing energy consumption by 52% compared to conventional on-chip acceleration.

  • FPGA-based Near Data Processing Platform Selection Using Fast Performance Modeling (WiP Paper)

    2020 · 3 citations

    Senior author · Corresponding
    • Computer Science
    • Embedded system

    With the trend of adopting FPGAs in data centers, various FPGA acceleration platforms have been developed in recent years. Each server could incorporate one or many of these FPGAs at different compute hierarchy levels to match its workload intensity. FPGAs could either be used as IO-attached accelerators or be closely integrated with CPU as on-chip co-processors. For a more data-centric approach, an FPGA could be moved closer to the data medium (RAM or disk) and serve as a near-memory or near-storage accelerator.

  • Understanding Performance Gains of Accelerator-Rich Architectures

    2019-07-01 · 4 citations

    article · Senior author

    The power and utilization walls in today's processors have led to a recent focus on accelerator-rich architectures (ARAs), which include a sea of customized accelerators with orders-of-magnitude performance and energy gains. Meanwhile, some researchers wonder how the reported large gains are achieved, considering that ARAs use a similar memory hierarchy to conventional processors. In this paper, we conduct an in-depth analysis of ARAs with a key focus on the memory access component not studied in prior work. Based on our experimental results, we observe that ARAs achieve performance gains from both computation and memory access customization. For computation customization, ARAs not only exploit the coarse-grained parallelism as conventional processors do, but also uniquely customize a deep processing pipeline without instruction overhead. For memory access customization, ARAs exploit a tile-based read-compute-write execution model that both reduces the number of memory accesses and improves the memory-level parallelism (MLP). We quantitatively evaluate the performance impact of such factors and surprisingly find that 1) memory access customization plays a bigger role in the performance improvement than computation customization, and 2) the dominating contributor to the ARA memory access performance improvement is the improved MLP rather than the widely expected memory access reduction. Indeed, we find that existing GPU accelerators also benefit from the improved MLP through different techniques. The unique customized deep processing pipeline of ARAs further provides an average of 1.4x speedup over GPUs. Moreover, on average, ARAs are 18x more energy efficient than GPUs. We hope this understanding can help future ARA design and adoption.

  • In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms

    ACM Transactions on Reconfigurable Technology and Systems · 2019-02-17 · 43 citations

    article

    Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs that can be reconfigured to accelerate a broad class of applications with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and industry. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This article aims to address this challenge by determining which microarchitectural characteristics affect performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis on five state-of-the-art CPU-FPGA acceleration platforms: (1) the Alpha Data board and (2) the Amazon F1 instance that represent the traditional PCIe-based platform with private device memory; (3) the IBM CAPI that represents the PCIe-based system with coherent shared memory; (4) the first generation of the Intel Xeon+FPGA Accelerator Platform that represents the QPI-based system with coherent shared memory; and (5) the second generation of the Intel Xeon+FPGA Accelerator Platform that represents a hybrid PCIe-based (non-coherent) and QPI-based (coherent) system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for public use.

  • CHILL: a system for fine-grained mapping of chained high impact long-latency load phases on tightly coupled heterogeneous multi-cores

    International Journal of High Performance Systems Architecture · 2017-01-01

    article · Open access · Senior author

    With increasing power and application demands, heterogeneous multi-core processors are becoming more prevalent. However, the key to proper utilisation of heterogeneous multi-cores is assigning, or mapping, the right application to the right core type. Recent work has shown that fine-grained mapping takes advantage of short program phases with highly variant performance requirements, and can elicit greater benefits from tightly coupled heterogeneous multi-cores. In this paper, we show that bottlenecks in performance can occur in fine-grained program phases during chains of high impact long-latency loads. We design a system that detects these bottleneck phases, and propose accelerating these phases on the out-of-order core for better performance and energy efficiency. Our system operates within 10% of performance, and 2.6% of energy to an oracle resource mapper. This translates to a 44.4% performance gain, and 9.2% energy savings over existing fine-grained mapping techniques.

  • Supporting Address Translation for Accelerator-Centric Architectures

    2017-02-01 · 75 citations

    article

    While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety, which penalizes the overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design to provide low-latency caching of translations to each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design effectively reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to the ideal address translation.

Frequent coauthors

  • Jason Cong

    UCLA Health

    51 shared
  • Petros Faloutsos

    York University

    25 shared
  • Mubbasir Kapadia

    17 shared
  • Brad Calder

    16 shared
  • Shawn Singh

    University of Manitoba

    16 shared
  • Mau-Chung Frank Chang

    University of California, Los Angeles

    15 shared
  • Michael Gill

    University of New Brunswick

    14 shared
  • Beayna Grigorian

    University of California, Los Angeles

    11 shared

Awards & honors

  • SIGMICRO's Test of Time Award, 2021