Rajeev Balasubramonian

Professor & Associate Director · Verified

University of Utah · Computer Science

Active 2000–2025

h-index: 45
Citations: 9.0k
Papers: 165 (26 in the last 5 years)
Funding: $3.7M (1 active)

About

Rajeev Balasubramonian is a Professor and Associate Director at the Kahlert School of Computing at the University of Utah. His research centers on computer architecture, including accelerators and VLSI, memory systems, high-performance computing, and scalable machine learning.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Algorithm
  • Parallel computing
  • Embedded system
  • Database
  • Computer engineering
  • Computer hardware
  • Operating system
  • Physics
  • Electronic engineering
  • Distributed computing
  • Computer architecture
  • Theoretical computer science

Selected publications

  • FLEXPROF: Flexible, Side-Channel-Free Memory Access

    2025-03-27

    article · Open access · Senior author

    Secure processors must defend against a wide array of microarchitecture side-channels, including those induced by a shared memory controller. Multiple studies have proposed techniques that allocate "turns" (within the memory controller) to each co-scheduled virtual machine (VM), and introduce gaps between VM turns to prevent resource conflicts and side-channels. In spite of past advancements in secure memory scheduling, the elimination of side-channels imposes a performance slowdown of 2x. We observe that one of the causes of this slowdown is that the memory controller schedule accommodates the worst case, i.e., it is prepared to handle either reads or writes. The key insight in this work is that the schedule can be more efficient if we designate every turn to handle fixed patterns of reads and writes. In particular, we introduce a read-optimized turn and a write-optimized turn. Coarse-grain application profiling helps determine how often the two types of turns are invoked, without leaking sensitive information. We also add flexibility so that a read-optimized turn can opportunistically also issue writes, and vice versa. This provides a good balance between restrictions and flexibility; between throughput and utilization. The proposed FlexProf memory controller improves performance by up to 33% with a geometric mean gain of 8% on mixed workloads, relative to state-of-the-art methods. Over half the memory-intensive programs evaluated exhibit performance gains of over 10%.

  • PATHFINDER: Practical Real-Time Learning for Data Prefetching

    2024-04-24 · 9 citations

    article · Open access · Senior author

    Data prefetching is vital in high-performance processors and a large body of research has introduced a number of different approaches for accurate prefetching: stride detection, address correlating prefetchers, delta pattern detection, irregular pattern detection, etc. Most recently, a few works have leveraged advances in machine learning and deep neural networks to design prefetchers. These neural-inspired prefetchers observe data access patterns and develop a trained model that can then make accurate predictions for future accesses. A significant impediment to the success of these prefetchers is their high implementation cost, for both inference and training. These models cannot be trained in real-time, i.e., they have to be trained beforehand with a large benchmark suite. This results in a large model (that increases the overhead for inference), and the model can only successfully predict patterns that are similar to patterns in the training set.

  • Hyena: Balancing Packing, Reuse, and Rotations for Encrypted Inference

    2024-05-19 · 5 citations

    article · Senior author

    Deep neural networks are widely used in a range of commercial services. Many of these services are hosted on the cloud, requiring users to send their personal data to the cloud. This, in turn, exposes the user’s private and sensitive data to several third parties. To address this problem, Homomorphic Encryption (HE) has been introduced, where the user encrypts their data before sending it to the cloud; the cloud performs operations on encrypted data and returns a ciphertext that the user must then decrypt. While this approach keeps user data private, it demands orders of magnitude more computation and data movement. It is, therefore, imperative to design hardware/software techniques to lower the overheads when executing AI services under Homomorphic Encryption schemes. In this paper, we consider a range of HE implementations for AI inference and address the key bottlenecks in state-of-the-art frameworks. We start by making the case for a hybrid HE and Multi-Party Computation (MPC) scheme that is more practical than pure Fully HE. This paper introduces new techniques at various levels: (i) we introduce new data packing techniques that result in lower data movement, (ii) we introduce new dataflows that increase reuse and reduce other costly HE operations (rotations, key switching, NTT conversion), (iii) we evaluate Hyena on a balanced pipelined architecture that efficiently handles the above primitives. The resulting framework, Hyena (new packing + dataflow), achieves better performance and energy than several packing baselines. Compared to the widely used Channel-packing, Hyena is 38× faster and achieves 162× lower energy consumption, with an overall ResNet20 inference end-to-end latency of 11.4 ms, using a 163 mm² accelerator dissipating 16.75 W.

  • Multi-cluster processor operating only select number of clusters during each phase based on program statistic monitored at predetermined intervals

    OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information) · 2023-01-23

    article · Open access · 1st author · Corresponding

    In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.

  • Performance monitoring for new phase dynamic optimization of instruction dispatch cluster configuration

    OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information) · 2023-07-26

    article · Open access · 1st author · Corresponding

    In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.

  • XCRYPT: Accelerating Lattice Based Cryptography with Memristor Crossbar Arrays

    arXiv (Cornell University) · 2023-01-31

    preprint · Open access

    This paper makes a case for accelerating lattice-based post quantum cryptography (PQC) with memristor based crossbars, and shows that these inherently error-tolerant algorithms are a good fit for noisy analog MAC operations in crossbars. We compare different NIST round-3 lattice-based candidates for PQC, and identify that SABER is not only a front-runner when executing on traditional systems, but it is also amenable to acceleration with crossbars. SABER is a module-LWR based approach, which performs modular polynomial multiplications with rounding. We map the polynomial multiplications in SABER on crossbars and show that analog dot-products can yield a 1.7–32.5× performance and energy efficiency improvement, compared to recent hardware proposals. This initial design combines the innovations in multiple state-of-the-art works: the algorithm in SABER and the memristive acceleration principles proposed in ISAAC (for deep neural network acceleration). We then identify the bottlenecks in this initial design and introduce several additional techniques to improve its efficiency. These techniques are synergistic and especially benefit from SABER's power-of-two modulo operation. First, we show that some of the software techniques used in SABER, that are effective on CPU platforms, are unhelpful in crossbar-based accelerators. Relying on simpler algorithms further improves our efficiencies by 1.3–3.6×. Second, we exploit the nature of SABER's computations to stagger the operations in crossbars and share a few variable precision ADCs, resulting in up to 1.8× higher efficiency. Third, to further reduce ADC pressure, we propose a simple analog Shift-and-Add technique, which results in a 1.3–6.3× increase in the efficiency. Overall, our designs achieve 3–15× higher efficiency over initial design, and 3–51× higher than prior work.

  • XCRYPT: Accelerating Lattice-Based Cryptography With Memristor Crossbar Arrays

    IEEE Micro · 2023-02-24 · 7 citations

    article

    This article makes a case for accelerating lattice-based postquantum cryptography with memristor-based crossbars. We map the polynomial multiplications in a representative algorithm, SABER, and show that analog dot products can yield 1.7–32.5× performance and energy efficiency improvement compared to recent hardware proposals. We introduce several additional techniques to address the bottlenecks in this initial design. First, we show that software techniques used in SABER that are effective on central processing unit platforms are unhelpful in crossbars. Relying on simpler algorithms further improves our efficiency by 1.3–3.6×. Second, modular arithmetic in SABER offers an opportunity to drop most significant bits, enabling techniques that exploit a few variable-precision analog-to-digital converters (ADCs) and yielding up to 1.8× higher efficiency. Third, to further reduce ADC pressure, we propose a simple analog shift-and-add technique, demonstrating a 1.3–6.3× improvement. Overall, the Xbar-based accelerator for postquantum cryptography (called XCRYPT) achieves 3–15× higher efficiency over the initial design and highlights the importance of algorithm–accelerator co-design.

  • CANDLES: Channel-Aware Novel Dataflow-Microarchitecture Co-Design for Low Energy Sparse Neural Network Acceleration

    2022-04-01 · 21 citations

    article

    Several deep neural network (DNN) accelerators have been designed to exploit the sparsity exhibited by DNN activations and weights. State-of-the-art sparse accelerators can be described as either Pixel-first or Channel-first accelerators, each with its unique dataflow and compression format aiding its dataflow. The former expends significant energy updating neuron partial sums, while the latter expends significant energy in handling the index metadata. This work introduces a novel microarchitecture and dataflow that reconciles these trade-offs by adopting a Pixel-first compression and Channel-first dataflow. The proposed microarchitecture has a simpler index-generation logic combined with an accumulator buffer hierarchy and crossbar with low wiring overhead. The compression format and dataflow promote high temporal locality in neuron updates, further lowering energy. Finally, we introduce work partitions across processing elements that naturally lead to load balance without offline analysis. Compared to four state-of-the-art baselines, the proposed architecture, CANDLES, significantly outperforms three and matches the performance of the fourth. In terms of energy, CANDLES is between 2.5× and 5.6× more energy-efficient than these four baselines.

  • Efficient and Oblivious Query Processing for Range and kNN Queries (Extended Abstract)

    2022 IEEE 38th International Conference on Data Engineering (ICDE) · 2022-05-01 · 1 citation

    articleSenior author

    Oblivious RAMs (ORAMs) are proposed to completely hide access patterns. However, most ORAM constructions are expensive and not suitable to deploy in a database for supporting query processing over large data. In this work, we design a practical oblivious query processing framework to enable efficient query processing over a cloud database. In particular, we focus on processing multiple range and kNN queries asynchronously and concurrently with high throughput. The key idea is to integrate indices into ORAM which leverages a suite of optimization techniques (e.g., oblivious batch processing and caching). Our construction shows an order of magnitude speedup in comparison with other baselines over large datasets.

  • Interconnects for DNA, Quantum, In-Memory, and Optical Computing: Insights From a Panel Discussion

    IEEE Micro · 2022 · 17 citations

    • Computer Science
    • Computer architecture

    The computing world is witnessing a proverbial Cambrian explosion of emerging paradigms propelled by applications, such as artificial intelligence, big data, and cybersecurity. The recent advances in technology to store digital data inside a deoxyribonucleic acid (DNA) strand, manipulate quantum bits (qubits), perform logical operations with photons, and perform computations inside memory systems are ushering in the era of emerging paradigms of DNA computing, quantum computing, optical computing, and in-memory computing. In an orthogonal direction, research on interconnect design using advanced electro-optic, wireless, and microfluidic technologies has shown promising solutions to the architectural limitations of traditional von-Neumann computers. In this article, experts present their comments on the role of interconnects in the emerging computing paradigms, and discuss the potential use of chiplet-based architectures for the heterogeneous integration of such technologies.
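
The two cluster-configuration entries above state one concrete rule: start with a minimum instruction interval and double it until the stability measure is acceptable. The sketch below illustrates only that doubling search. The function name `choose_interval`, the `instability_at` callback, the threshold, and the default interval bounds are all hypothetical and are not taken from the listed publications.

```python
from typing import Callable


def choose_interval(
    instability_at: Callable[[int], float],
    min_interval: int = 10_000,
    max_interval: int = 40_960_000,
    threshold: float = 0.05,
) -> int:
    """Pick an interval length (in instructions) by repeated doubling.

    `instability_at(n)` is an assumed callback that reports how unstable
    program behavior looks when sampled every `n` instructions (lower is
    more stable). The search returns the first interval whose instability
    is at or below `threshold`, or the largest interval tried if none is.
    """
    if min_interval <= 0 or max_interval < min_interval:
        raise ValueError("need 0 < min_interval <= max_interval")

    interval = min_interval
    # Double the interval while it is still too unstable and the next
    # candidate would stay within the allowed range.
    while instability_at(interval) > threshold and interval * 2 <= max_interval:
        interval *= 2
    return interval


if __name__ == "__main__":
    # Toy model (assumption): instability falls inversely with interval length.
    toy = lambda n: 1000 / n
    print(choose_interval(toy, min_interval=1000, max_interval=1_000_000))
```

With the toy model, the search walks 1000, 2000, 4000, 8000, 16000 and stops at 32000, the first length whose modeled instability (0.03125) is under the 0.05 threshold. A real design would measure instability from hardware counters rather than a formula.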

Frequent coauthors

  • Sudeep Pasricha

    66 shared
  • Barış Taşkın

    Drexel University

    65 shared
  • Ishan Thakkar

    65 shared
  • Amlan Ganguly

    Rochester Institute of Technology

    65 shared
  • Masoud Babaie

    65 shared
  • Marc D. Riedel

    University of Minnesota

    65 shared
  • Naveen Muralimanohar

    35 shared
  • David H. Albonesi

    Cornell University

    31 shared

Labs

  • Rajeev Balasubramonian's Lab · PI

    Designing memory systems and accelerators for data-intensive workloads, including machine learning, genomic analysis, and security primitives.

Education

  • Ph.D., Computer Science

    University of California, Berkeley

    2000
  • M.S., Computer Science

    University of California, Berkeley

    1996
  • B.S., Electrical and Electronics Engineering

    University of Madras

    1994