Venkatesh Akella

Professor in Electrical Engineering

University of California, Davis · Electrical and Computer Engineering

Active 1988–2026

h-index: 29

Citations: 2.7k

Papers: 167 · 16 last 5y

Funding: $756k · 1 active

About

Venkatesh Akella is a Professor in the Department of Electrical & Computer Engineering at the University of California, Davis. He holds a PhD from the University of Utah, Salt Lake City, Utah, and an MS from the Indian Institute of Science, Bangalore, India. His research encompasses a variety of areas including multimedia algorithms and architectures, low power processors and tile based architectures, control plane design for optical networking, and asynchronous design and applications. Professor Akella is actively involved in teaching several courses such as Computer Architecture, Digital System Design, Embedded System Design, Hardware/Software Codesign, Computer Systems and Assembly Language Programming, Introduction to Computer Architecture, and Digital Logic Design. He has also contributed to tutorials like the ICME 2005 Tutorial on Design and Optimization of Adaptive Multimedia Systems and Getting Started with IXP-425 StarEast Boards.

Research topics

Computer Science
Parallel computing
Distributed computing
Operating system
Embedded system
Computer hardware
Artificial Intelligence
Mathematical optimization
Algorithm
Data science
Mathematics

Selected publications

Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory
Open MIND · 2026-03-06
preprint
Memory disaggregation via Compute Express Link (CXL) enables multiple hosts to share remote memory, improving utilization for data-intensive workloads. Today, virtual memory enables process-level isolation on a host and CXL enables host-level isolation. This creates a critical security gap: the absence of process-level memory isolation in shared disaggregated memory. We present Space-Control, a hardware-software co-design that provides fine-grained, process-level isolation for shared disaggregated memory. Space-Control authenticates execution context in the hardware and enforces access control on every memory access and amortizes lookup times with a small cache. Our design allows up to 127 processes Simulation Toolkit (SST) based CXL model, Space-Control incurs minimal performance overhead of 3.3%, making shared disaggregated memory isolation practical.
Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory
ArXiv.org · 2026-03-06
article · Open access
Precision Aware Bank Separated Data Placement
Proceedings of the International Symposium on Memory Systems · 2025-10-06
article · Open access · Senior author
Mixed-precision computing, which uses data with different bit-widths, is a promising way to improve the performance and energy efficiency of High-Performance Computing (HPC) workloads. While decreasing the precision of data representation saves storage, the benefits in other critical DRAM performance metrics, such as data movement, power consumption, and row buffer locality, do not always scale proportionally. This discrepancy is due to architectural constraints in DRAM subsystems, where fixed access granularities and bank contention limit the benefits of precision reduction. To address this, a DRAM optimization technique is proposed: precision-aware bank-separated data placement. This method involves dedicating specific DRAM banks to store data of a particular precision, which can improve data locality and reduce row buffer issues, leading to better performance and energy efficiency. Preliminary results suggest that this approach improves row-buffer hits by an average of 24%, reduces DRAM activation energy by 12%, and reduces the variability in DRAM access latency. Sensitivity studies on DDR4, LPDDR, and HBM show that internal DRAM microarchitectural characteristics influence the efficacy of the approach.
TEGRA - Scaling Up Graph Processing with Disaggregated Computing
2025-11-07 · 1 citation
article · Open access · Senior author
Graph processing workloads continue to grow in scale and complexity, demanding architectures that can adapt to diverse compute and memory requirements. Traditional scale-out accelerators couple compute and memory resources, resulting in resource underutilization when executing workloads with varying compute-to-memory intensities. In this paper, we present TEGRA, a composable, scale-up architecture for large-scale graph processing. TEGRA leverages disaggregated memory via CXL and a message-passing communication model to decouple compute and memory, enabling independent scaling of each. Through detailed evaluation using the gem5 simulator, we show that TEGRA improves memory bandwidth utilization by up to 15% over state-of-the-art accelerators by dynamically provisioning compute based on workload demands. Our results demonstrate that TEGRA provides a flexible and efficient foundation for supporting emerging graph analytics workloads across a wide range of arithmetic intensities.
An Optimal Implementation of Multiplier based Galois Field Transform on an FPGA
2025-03-14
preprint · Open access · Senior author
The discrete Fourier transform over a Galois field, referred to as the Galois field transform (GFT), is chiefly employed in the syndrome decoding phase of BCH and Reed-Solomon codes. Apart from coding theory, GFTs are used in cryptography for calculating the multiplicative inverse in the Advanced Encryption Standard (AES). The Galois field (GF) multiplier is an integral part of the computation of the Galois field transform, so optimizing the GF multipliers plays a vital role in the GFT's performance. Over the years, various architectures and algorithms for optimizing GF multipliers and GFTs have been developed, but there has not been any multiplier-based implementation of GFTs on FPGAs for lengths greater than 15. The implementation of higher-order GFTs is essential, as the general standard length of the GFT used in AES, RS, and BCH codes is 255. In our paper, we not only implement higher-order GFTs but also show better performance results for lower-order GFTs in comparison to the related work shown in Table V. The main focus of this paper is to exploit an optimal multiplier-based GFT architecture and implement GFTs of length 3 to 1023 on an FPGA. For implementation, we use the Gappmair algorithm which reduces one-fourth of the multiplicative complexity and the Good-Thomas fast Fourier transform (FFT) algorithm. In order to achieve better performance in terms of area, throughput, and power, the Gappmair algorithm architecture is pipelined and automated. In addition, we design an optimized Karatsuba Galois field multiplier with a Montgomery reduction array to reduce the latency of the overall design. The complexity of mapping the indices of GFTs in the Good-Thomas algorithm is handled by automating the mapping process, which makes implementation easier for higher-order GFTs. Finally, a performance comparison of various GFTs shows that the architecture proposed in this paper is an optimal solution for multiplier-based implementation of GFTs on FPGAs.
NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing
2025-03-01 · 2 citations
article · Senior author
We propose a scalable graph processing hardware accelerator called NOVA that is based on a novel vertex management architecture that decouples the execution of reduction and propagation operations in the popular vertex-centric graph processing paradigm. This allows us to store the working set in off-chip memory and utilize the available on-chip memory as a buffer to hide the latency of DRAM accesses instead of a traditional cache. This overcomes one of the key drawbacks of almost all prior works, which require temporal partitioning of graphs to scale to large graphs. We develop a cycle-accurate model of the architecture in gem5 and demonstrate that NOVA exhibits near-perfect weak and strong scaling while scaling to large graphs by spatially tiling multiple nodes. In addition, our simulations show that NOVA is 2.35× better than a state-of-the-art graph accelerator (PolyGraph) while using a fraction of the on-chip memory on a synthetic graph with 134M vertices and over 2.14B edges.
Leveraging Trusted Execution Environments For Data Security in Healthcare Workflows
2025-10-26
article · Senior author
Modern biomedical AI pipelines require robust data protection across heterogeneous environments, including edge devices, hospital servers, and cloud resources, each with distinct performance, trust, and regulatory considerations. While recent advancements in hardware-backed confidential computing (e.g., Intel SGX, AMD SEV, ARM TrustZone) offer promising solutions for data security, their differing threat models prevent seamless, end-to-end "capture-to-use" protection. To address this, we propose a novel, hardware-agnostic security monitor that extends the attestation and memory-encryption capabilities of these disparate Trusted Execution Environments (TEEs). This is complemented by a software-defined secure tunnel that enforces data-centric policy, provenance, and compliance. Our proof-of-concept prototype, integrating a TrustZone-enabled Raspberry Pi with an AMD SEV virtual machine in a cloud environment, demonstrates a deployable, data-centric enclave architecture that achieves end-to-end confidentiality, integrity, and compliance without compromising clinical throughput in biomedical AI workflows.
CachedArrays: Optimizing Data Movement for Heterogeneous Memory Systems
2024-05-27 · 1 citation
article · Senior author
We propose a new framework called CachedArrays and a set of APIs to address the data tiering problem in large-scale heterogeneous and disaggregated memory systems. The proposed framework operates at a variable-size object granularity and allows the programmer to specify semantic hints about future use of data via a Policy API. A Data Manager uses these hints to choose when and where to place a particular data object through a data management API, thus bridging the semantic gap between the programmer and the platform-specific hardware details and optimizing overall performance. We evaluate the proposed framework on a real hardware platform with terabytes of memory consisting of NVRAM and DRAM on large-scale ML training workloads such as CNNs that exhibit different data access and usage patterns. We show that CachedArrays outperforms hardware caches and can exploit many of the algorithm-specific optimizations of prior works.
TEGRA -- Scaling Up Terascale Graph Processing with Disaggregated Computing
arXiv (Cornell University) · 2024-04-04
preprint · Open access · Senior author
Graphs are essential for representing relationships in various domains, driving modern AI applications such as graph analytics and neural networks across science, engineering, cybersecurity, transportation, and economics. However, the size of modern graphs is rapidly expanding, posing challenges for traditional CPUs and GPUs in meeting real-time processing demands. As a result, hardware accelerators for graph processing have been proposed. However, the largest graphs that these systems can handle are still modest, often targeting the Twitter graph (approximately 1.4B edges). This paper aims to address this limitation by developing a graph accelerator capable of terascale graph processing. Scale-out architectures, in which nodes are replicated to expand to larger datasets, are natural for handling larger graphs. We argue that this approach is not appropriate for very large-scale graphs because it leads to underutilization of both memory and compute resources. Additionally, vertex and edge processing have different access patterns. Communication overheads also pose further challenges in designing scalable architectures. To overcome these issues, this paper proposes TEGRA, a scale-up architecture for terascale graph processing. TEGRA leverages a composable computing system with disaggregated resources and a communication architecture inspired by Active Messages. By employing direct communication between cores and optimizing memory interconnect utilization, TEGRA effectively reduces communication overhead and improves resource utilization, thereby enabling efficient processing of terascale graphs.
Scalable Hardware Acceleration of Graph Processing with Photonic Interconnects
2023-09-26
article · 1st author · Corresponding
We need computing systems that can keep up with the exponential growth of data to enable the artificial intelligence-driven transformation of the modern world. Scalable graph processing systems that can handle graphs with trillions of edges present new challenges that require new ways of thinking about architecting computing systems. We show that the memory and interconnect requirements of large-scale graph processing systems mesh very well with the unique strengths of photonic interconnects, such as bandwidth density and the ability to provide high bandwidth, low latency, and low energy per bit across very long distances. These advantages of photonics can be synergized with emerging 3D and chiplet-based integration technology to create rackscale or warehouse-scale systems for high-speed predictive data analytics that can enable new applications in many disciplines.

Recent grants

Programmable Architectures for Low Density Parity Check Codes
NSF · $156k · 2004–2009
CNS Core: Small: A HW/SW Codesign Framework For Dynamic Composition of Disaggregated Hardware Systems Securely
NSF · $600k · 2022–2026

Frequent coauthors

S. J. Ben Yoo
University of California, Davis
45 shared
Christopher Nitta
34 shared
Roberto Proietti
Polytechnic University of Turin
27 shared
Matthew Farrens
University of California, Davis
24 shared
Yawei Yin
Microsoft (United States)
24 shared
Jason Lowe-Power
17 shared
Rajeevan Amirtharajah
University of California, Davis
15 shared
John Oliver
Liberal Arts University
13 shared
