
Sandhya Dwarkadas
Walter N. Munster Professor and Chair, Department of Computer Science · University of Virginia · Computer Science
Active 1989–2025
About
Sandhya Dwarkadas is the Walter N. Munster Professor and Chair of the Department of Computer Science at the University of Virginia. Her research lies at the intersection of computer hardware and software, with a particular focus on concurrency. She has made fundamental contributions to the design and implementation of shared memory in both hardware and software, as well as to hardware and software energy- and resource-aware configurability. Her recent research has concentrated on data movement-aware design of accelerator-rich systems, often utilizing algorithm-runtime-hardware co-design for efficiency. She holds a B.Tech. in Electrical Engineering from the Indian Institute of Technology, obtained in 1986, and both an M.S. and Ph.D. in Electrical and Computer Engineering from Rice University, completed in 1989 and 1993 respectively. Her work has significantly impacted modern computing systems, making them faster, more energy-efficient, and easier to use. Dwarkadas has been recognized as an ACM Fellow in 2018 and an IEEE Fellow in 2017 for her contributions to shared memory and reconfigurability. She received the University of Rochester's Edmund A. Hajim Outstanding Faculty Award in 2020 and the Indian Institute of Technology Madras Distinguished Alumni Award in 2025. Additionally, she was elected as an AAAS Fellow in 2024.
Research topics
- Computer Science
- Parallel computing
- Operating system
- Computer Security
- Computer network
- Database
- Embedded system
- Computer architecture
- Distributed computing
Selected publications
Improving the Performance of Out-of-Core LLM Inference Using Heterogeneous Host Memory
2025-10-12
article · Senior author
The memory footprint of modern applications like large language models (LLMs) far exceeds the memory capacity of accelerators they run on and often spills over to host memory. As model sizes continue to grow, DRAM-based memory is no longer sufficient to contain these models, resulting in further spill-over to storage and necessitating the use of technologies like Intel Optane and CXL-enabled memory expansion. While such technologies provide more capacity, their higher latency and lower bandwidth have given rise to heterogeneous memory configurations that attempt to strike a balance between capacity and performance. This paper evaluates the impact of such memory configurations on a GPU running out-of-core LLMs. Starting with basic host/device bandwidth measurements using an Optane and Nvidia A100 equipped NUMA system, we present a comprehensive performance analysis of serving OPT-30B and OPT-175B models using FlexGen, a state-of-the-art serving framework. Our characterization shows that FlexGen’s weight placement algorithm is a key bottleneck limiting performance. Based on this observation, we evaluate two alternate weight placement strategies, one each optimizing for inference latency and throughput. When combined with model quantization, our strategies improve latency and throughput by 27% and 5x, respectively. These figures are within 9% and 6% of an all-DRAM system, demonstrating how careful data placement can effectively enable the substitution of DRAM with high-capacity but slower memory, improving overall system energy efficiency.
JSPIM: A Skew-Aware PIM Accelerator for High-Performance Databases Join and Select Operations
ArXiv.org · 2025-08-11
preprint · Open access · Senior author
Database applications are increasingly bottlenecked by memory bandwidth and latency due to the memory wall and the limited scalability of DRAM. Join queries, central to analytical workloads, require intensive memory access and are particularly vulnerable to inefficiencies in data movement. While Processing-in-Memory (PIM) offers a promising solution, existing designs typically reuse CPU-oriented join algorithms, limiting parallelism and incurring costly inter-chip communication. Additionally, data skew, a main challenge in CPU-based joins, remains unresolved in current PIM architectures. We introduce JSPIM, a PIM module that accelerates hash join and, by extension, corresponding select queries through algorithm-hardware co-design. JSPIM deploys parallel search engines within each subarray and redesigns hash tables to achieve O(1) lookups, fully exploiting PIM's fine-grained parallelism. To mitigate skew, our design integrates subarray-level parallelism with rank-level processing, eliminating redundant off-chip transfers. Evaluations show JSPIM delivers 400x to 1000x speedup on join queries versus DuckDB. When paired with DuckDB for the full SSB benchmark, JSPIM achieves an overall 2.5x throughput improvement (individual query gains of 1.1x to 28x), at just a 7% data overhead and 2.1% per-rank PIM-enabled chip area increase.
Concurrent PIM and Load/Store Servicing in PIM-Enabled Memory
2025-05-11 · 1 citations
article · Senior author
Processing in-memory (PIM) has emerged as a promising approach to address the increasingly memory-bound nature of modern applications like machine learning and genomics. While PIM-enabled memories offer significant performance and energy improvements over host-side execution, integration of such memories into existing systems remains an open challenge. In particular, naively replacing regular memory with a PIM-enabled one in a conventional processor could be detrimental to its performance. PIM applications are optimized to saturate the memory subsystem to maximize speedup. However, since modern processors, including CPUs and GPUs, support multi-tenancy to improve utilization, such saturation can lead to extreme unfairness and denial of service to other applications. In this paper, we characterize the performance of a PIM-enabled GPU system when co-executing regular GPU kernels with a PIM kernel. Our characterization shows that PIM kernels can easily overwhelm the interconnect and the memory controller and severely degrade the performance of the non-PIM kernel, hurting system-level fairness and throughput metrics. Based on this characterization, we propose changes to the interconnect that ease the flow of requests from the processor to the memory controller. At the memory controller, we propose a new scheduling policy, called F3FS, that optimizes for fairness and throughput. While F3FS benefits from changes to the interconnect, we show that it performs comparably to existing policies without them. We evaluate and compare the proposed changes to state-of-the-art memory controller scheduling policies under both competitive (two kernels from different applications) and collaborative (two kernels from same application) scenarios.
RollingCache: Using Runtime Behavior to Defend Against Cache Side Channel Attacks
arXiv (Cornell University) · 2024-08-16
preprint · Open access · Senior author
Shared caches are vulnerable to side channel attacks through contention in cache sets. Besides being a simple source of information leak, these side channels form useful gadgets for more sophisticated attacks that compromise the security of shared systems. The fundamental design aspect that contention attacks exploit is the deterministic nature of the set of addresses contending for a cache set. In this paper, we present RollingCache, a cache design that defends against contention attacks by dynamically changing the set of addresses contending for cache sets. Unlike prior defenses, RollingCache does not rely on address encryption/decryption, data relocation, or cache partitioning. We use one level of indirection to implement dynamic mapping controlled by the whole-cache runtime behavior. Our solution does not depend on having defined security domains, and can defend against an attacker running on the same or another core. We evaluate RollingCache on ChampSim using the SPEC-2017 benchmark suite. Our security evaluation shows that our dynamic mapping removes the deterministic ability to identify the source of contention. The performance evaluation shows an impact of 1.67% over a mix of workloads, with a corresponding
RELIEF: Relieving Memory Pressure In SoCs Via Data Movement-Aware Accelerator Scheduling
2024-03-02 · 4 citations
article · Senior author
Data movement latency when using on-chip accelerators in emerging heterogeneous architectures is a serious performance bottleneck. While hardware/software mechanisms such as peer-to-peer DMA between producer/consumer accelerators allow bypassing main memory and significantly reduce main memory contention, schedulers in both the hardware and software domains remain oblivious to their presence. Instead, most contemporary schedulers tend to be deadline-driven, with improved utilization and/or throughput serving as secondary or co-primary goals. This lack of focus on data communication will only worsen execution times as accelerator latencies reduce. In this paper, we present RELIEF (RElaxing Least-laxIty to Enable Forwarding), an online least laxity-driven accelerator scheduling policy that relieves memory pressure in accelerator-rich architectures via data movement-aware scheduling. RELIEF leverages laxity (time margin to a deadline) to opportunistically utilize available hardware data forwarding mechanisms while minimizing quality-of-service (QoS) degradation and unfairness. RELIEF achieves up to 50% more forwards compared to state-of-the-art policies, reducing main memory traffic and energy consumption by up to 32% and 18%, respectively. At the same time, RELIEF meets 14% more task deadlines on average and reduces worst-case deadline violation by 14%, highlighting QoS and fairness improvements.
Blast from the Past: Least Expected Use (LEU) Cache Replacement with Statistical History
2023-06-06 · 1 citations
article · Open access · Senior author
Cache replacement policies typically use some form of statistics on past access behavior. As a common limitation, however, the extent of the history being recorded is limited to either just the data in cache or, more recently, a larger but still finite-length window of accesses, because the cost of keeping a long history can easily outweigh its benefit.
The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead
ACM Transactions on Architecture and Code Optimization · 2023 · 2 citations
- Computer Science
- Parallel computing
As the volume of data processed by applications has increased, considerable attention has been paid to data address translation overheads, leading to the widespread use of larger page sizes (“superpages”) and multi-level translation lookaside buffers (TLBs). However, far less attention has been paid to instruction address translation and its relation to TLB and pipeline structure. In prior work, we quantified the impact of using code superpages on a variety of widely used applications, ranging from compilers to web user-interface frameworks, and the impact of sharing page table pages for executables and shared libraries. Within this article, we augment those results by first uncovering the effects that microarchitectural differences between Intel Skylake and AMD Zen+, particularly their different TLB organizations, have on instruction address translation overhead. This analysis provides some key insights into the microarchitectural design decisions that impact the cost of instruction address translation. First, a lower-level (level 2) TLB that has both instruction and data mappings competing for space within the same structure allows better overall performance and utilization when using code superpages. Code superpages not only reduce instruction address translation overhead but also indirectly reduce data address translation overhead. In fact, for a few applications, the use of just a few code superpages has a larger impact on overall performance than the use of a much larger number of data superpages. Second, a level 1 (L1) TLB with separate structures for different page sizes may require careful tuning of the superpage promotion policy for code, and a correspondingly suboptimal utilization of the level 2 TLB. In particular, increasing the number of superpages when the size of the L1 superpage structure is small may result in more L1 TLB misses for some applications. 
Moreover, on some microarchitectures, the cost of these misses can be highly variable, because replacement is delayed until all of the in-flight instructions mapped by the victim entry are retired. Hence, more superpage promotions can result in a performance regression. Finally, our findings also make a case for first-class OS support for superpages on ordinary files containing executables and shared libraries, as well as a more aggressive superpage policy for code.
MAPPER: Managing Application Performance via Parallel Efficiency Regulation
ACM Transactions on Architecture and Code Optimization · 2022 · 3 citations
Senior author · Corresponding author
- Computer Science
- Operating system
State-of-the-art systems, whether in servers or desktops, provide ample computational and storage resources to allow multiple simultaneously executing potentially parallel applications. However, performance tends to be unpredictable, being a function of algorithmic design, resource allocation choices, and hardware resource limitations. In this article, we introduce MAPPER, a manager of application performance via parallel efficiency regulation. MAPPER uses a privileged daemon to monitor (using hardware performance counters) and coordinate all participating applications by making two coupled decisions: the degree of parallelism to allow each application to improve system efficiency while guaranteeing quality of service (QoS), and which specific CPU cores to schedule applications on. The QoS metric may be chosen by the application and could be in terms of execution time, throughput, or tail latency, relative to the maximum performance achievable on the machine. We demonstrate that using a normalized parallel efficiency metric allows comparison across and cooperation among applications to guarantee their required QoS. While MAPPER may be used without application or runtime modification, use of a simple interface to communicate application-level knowledge improves MAPPER’s efficacy. Using a QoS guarantee of 85% of the IPC achieved with a fair share of resources on the machine, MAPPER achieves up to 3.3× speedup relative to unmodified Linux and runtime systems, with an average improvement of 17% in our test cases. At the same time, MAPPER violates QoS for only 2% of the applications (compared to 23% for Linux), while placing much tighter bounds on the worst case. MAPPER relieves hardware bottlenecks via task-to-CPU placement and allocates more CPU contexts to applications that exhibit higher parallel efficiency while guaranteeing QoS, resulting in both individual application performance predictability and overall system efficiency.
Interference and Need Aware Workload Colocation in Hyperscale Datacenters
arXiv (Cornell University) · 2022-07-25
preprint · Open access
Datacenters suffer from resource utilization inefficiencies due to the conflicting goals of service owners and platform providers. Service owners intending to maintain Service Level Objectives (SLO) for themselves typically request a conservative amount of resources. Platform providers want to increase operational efficiency to reduce capital and operating costs. Achieving both operational efficiency and SLO for individual services at the same time is challenging due to the diversity in service workload characteristics, resource usage patterns that are dependent on input load, heterogeneity in platform, memory, I/O, and network architecture, and resource bundling. This paper presents a tunable approach to resource allocation that accounts for both dynamic service resource needs and platform heterogeneity. In addition, an online K-Means-based service classification method is used in conjunction with an offline sensitivity component. Our tunable approach allows trading resource utilization efficiency for absolute SLO guarantees based on the service owners' sensitivity to its SLO. We evaluate our tunable resource allocator at scale in a private cloud environment with mostly latency-critical workloads. When tuning for operational efficiency, we demonstrate up to ~50% reduction in required machines; ~40% reduction in Total-Cost-of-Ownership (TCO); and ~60% reduction in CPU and memory fragmentation, but at the cost of increasing the number of tasks experiencing degradation of SLO by up to ~25% compared to the baseline. When tuning for SLO, by introducing interference-aware colocation, we can tune the solver to reduce tasks experiencing degradation of SLO by up to ~22% compared to the baseline, but at an additional cost of ~30% in terms of the number of hosts. We highlight this trade-off between TCO and SLO violations, and offer tuning based on the requirements of the platform owners.
Preventing Coherence State Side Channel Leaks Using TimeCache
IEEE Transactions on Computers · 2022-09-29
article · Senior author
Cache side channel attacks in the presence of shared memory have been used to extract cryptographic keys and enclave data, and are used by Spectre variants for leaking speculatively loaded data. Timing side channels exist in shared caches due to the difference in response latency of cached and uncached data. In prior work, we presented TimeCache, a cache design that prevents side channel exploits from reuse of shared memory. In this work, we extend TimeCache to also defend against attacks that exploit coherence states. TimeCache allows all running applications to use the entire cache, avoiding the need for partitioning in order to effect timing isolation. A per-process caching context prevents cache hits on data filled by another process. A novel bit-serial timestamp-parallel comparison logic allows low-overhead update of stale caching contexts. The defense is suited to all cache levels, and defends against an attacker running on any core. We evaluate TimeCache using the gem5 simulator to show that it is capable of preventing both reuse attacks and an attack based on coherence state leak. The average performance overhead for SPEC2006 is 1.13%, and for PARSEC and SPLASH is 0.46%.
Recent grants
SHF: Small: Scalable Support for Concurrency in Multicore Systems
NSF · $416k · 2012–2016
NSF · $450k · 2013–2017
CSR: Small: Collaborative Research: Instruction Address Translation Revisited
NSF · $250k · 2016–2019
Operating System Strategies for Energy- and Resource-Aware Adaptation
NSF · $499k · 2004–2008
CAREER: Enhanced Software Distributed Shared Memory as a Compiler Target
NSF · $205k · 1997–2001
Frequent coauthors
- 162 shared
David A. Bader
- 156 shared
Guojing Cong
Oak Ridge National Laboratory
- 150 shared
Srinivas Aluru
- 97 shared
Michael L. Scott
University of Rochester
- 82 shared
Josep Torrellas
University of Illinois Urbana-Champaign
- 82 shared
Felix Wolf
Technical University of Darmstadt
- 81 shared
Matthias Müller
- 81 shared
Hans–Joachim Bungartz
Technical University of Munich
Labs
Not provided
Education
- 1993
Ph.D., Electrical and Computer Engineering
Rice University
- 1989
M.S., Electrical and Computer Engineering
Rice University
- 1986
B.Tech., Electrical Engineering
Indian Institute of Technology
Awards & honors
- ACM Fellow 2018 for contributions to shared memory and recon…
- IEEE Fellow 2017 for contributions to shared memory and reco…
- University of Rochester 2020 Edmund A. Hajim Outstanding Fac…
- Indian Institute of Technology Madras 2025 Distinguished Alu…
- AAAS Fellow 2024