Jonathan Balkind
Verified · University of California, Santa Barbara · Computing
Active 2014–2026
Research topics
- Computer Science
- Embedded system
- Programming language
- Computer hardware
- Parallel computing
- Computer architecture
- History
Selected publications
REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton
arXiv (Cornell University) · 2026-05-06
preprint · Open access
The chip industry continues advancing and expanding modern computing systems, resulting in more complex multi-core processors. Conversely, academic projects face scalability challenges due to limited resources, highlighting the need for open-source frameworks that enable innovation and knowledge sharing. Recently, several open-source proposals have emerged, offering flexible and scalable designs, but they fail to meet the performance demands of modern High-Performance Computing (HPC) applications. In this project, we present REPTILES, an open-source RISC-V multicore framework based on OpenPiton. REPTILES interconnects multiple Sargantana cores with the memory hierarchy of OpenPiton. Moreover, we present the new features incorporated in the Sargantana and OpenPiton designs to improve the performance of HPC applications. We demonstrate that REPTILES presents suitable scalability, achieving a speedup of 3.1x on average with 4 cores. Additionally, we show that Sargantana's new features increase the performance of a vector addition benchmark by 9.3x.
Learning Architectural Cache Simulator Behaviour
2025-10-12
article · Senior author
Modern applications exhibit memory access patterns with complex spatial and temporal relationships. Traditional architectural simulators utilized to evaluate these applications are highly sequential in nature, particularly for stateful components like caches. In this paper, we present an innovative approach to cache simulation by reframing the problem from a deep learning perspective. We exploit the fact that memory access traces in any part of a processor design can be represented as two-dimensional heatmaps. Our key insight is that the behaviour of a cache acts as a filter on these heatmap images which can be learned as a function using deep learning techniques. Leveraging this observation, we introduce CacheBox, a framework that employs a Generative Adversarial Network (GAN) to learn and replicate the filtering behaviour of caches using memory access heatmaps. We demonstrate that CacheBox effectively generalises across multiple state-of-the-art benchmarks, various cache configurations, different cache hierarchy levels, and even alternative microarchitectural structures with high accuracy. We also show that CacheBox enables highly parallelized inference, allowing for simultaneous processing of multiple memory access heatmaps.
Depth-First: A Deterministic and Scalable NoC Routing Protocol for 3.5D Packaged Architectures
IEEE Journal on Emerging and Selected Topics in Circuits and Systems · 2025-07-17
article · Open access
New high-volume commercial products combine 2.5D silicon-interposer based assemblies with 3D monolithic stacks of chiplets. This combination is called 3.5D packaging and makes it possible to assemble dense compute solutions. Components communicate via a Network-On-Chip, but current solutions do not support 3.5D Network-On-Chip topologies. To this end, this work proposes Depth-First, the first Deterministic, Virtual Channel based, Network-On-Chip routing protocol supporting 3.5D network topologies. The protocol prevents deadlocks using additional Virtual Channels only in the upper chiplets, while imposing no VC constraints on the base interposer. Depth-First also features an efficient node naming scheme, enabling highly compact routing tables. Since vertical links must be assigned to routers, we present a Mixed-Integer Linear Programming formulation that greatly speeds up execution time compared to a reference implementation from prior work, which was based on an exhaustive search. We formally prove that the protocol is deadlock-free, study its performance using an open-source cycle-accurate simulator, and compare it with other protocols (on a comparable topology). A partial implementation of Depth-First in an open-source router results in a small 4.9% area impact (7nm process) compared to an implementation without our routing algorithm.
Empowering E-Waste Recycling with Intelligent PCB Component Detection
2025-07-18 · 1 citation
article
Stramash: A Fused-Kernel Operating System For Cache-Coherent, Heterogeneous-ISA Platforms
2025-03-27 · 4 citations
article · Open access
We live in the world of heterogeneous computing. With specialised elements reaching all aspects of our computer systems and their prevalence only growing, we must act to rein in their inherent complexity. One area that has seen significantly less investment in terms of development is heterogeneous-ISA systems, specifically because of complexity. To date, heterogeneous-ISA processors have required significant software overheads, workarounds, and coordination layers, making the development of more advanced software hard, and motivating little further development of more advanced hardware. In this paper, we take a fused approach to heterogeneity, and introduce a new operating system (OS) design, the fused-kernel OS, which goes beyond the multiple-kernel OS design, exploiting cache-coherent shared memory among heterogeneous-ISA CPUs as a first principle -- introducing a set of new OS kernel mechanisms. We built a prototype fused-kernel OS, Stramash-Linux, to demonstrate the applicability of our design to monolithic OS kernels. We profile Stramash OS components on real hardware but tested them on an architectural simulator -- Stramash-QEMU, which we design and build. Our evaluation begins by validating the accuracy of our simulator, achieving an average of less than 4% errors. We then perform a direct comparison between our fused-kernel OS and state-of-the-art multiple-kernel OS designs. Results demonstrate speedups of up to 2.1× on NPB benchmarks. Further, we provide an in-depth analysis of the differences and trade-offs between fused-kernel and multiple-kernel OS designs.
Oobleck: Low-Compromise Design for Fault Tolerant Accelerators
ArXiv.org · 2025-06-27
preprint · Open access · Senior author
Data center hardware refresh cycles are lengthening. However, increasing processor complexity is raising the potential for faults. To achieve longevity in the face of increasingly fault-prone datapaths, fault tolerance is needed, especially in on-chip accelerator datapaths. Previously researched methods for adding fault tolerance to accelerator designs require high area, lowering chip utilisation. We propose a novel architecture for accelerator fault tolerance, Oobleck, which leverages modular acceleration to enable fault tolerance without burdensome area requirements. In order to streamline the development and enforce modular conventions, we introduce the Viscosity language, an actor based approach to hardware-software co-design. Viscosity uses a single description of the accelerator's function and produces both hardware and software descriptions. Our high-level models of data centers indicate that our approach can decrease the number of failure-induced chip purchases inside data centers while not affecting aggregate throughput, thus reducing data center costs. To show the feasibility of our approach, we show three case-studies: FFT, AES, and DCT accelerators. We additionally profile the performance under the key parameters affecting latency. Under a single fault we can maintain speedups of between 1.7x-5.16x for accelerated applications over purely software implementations. We show further benefits can be achieved by adding hot-spare FPGAs into the chip.
Using SBPF to Accelerate Kernel Memory Access From Userspace
ArXiv.org · 2025-06-27
preprint · Open access · Senior author
The cost of communication between the operating system kernel and user applications has long blocked improvements in software performance. Traditionally, operating systems encourage software developers to use the system call interface to transfer (or initiate transfer of) data between user applications and the kernel. This approach not only hurts performance at the software level due to memory copies between user space address spaces and kernel space address spaces, it also hurts system performance at the microarchitectural level by flushing processor pipelines and other microarchitectural state. In this paper, we propose a new communication interface between user applications and the kernel by setting up a shared memory region between user space applications and the kernel's address space. We acknowledge the danger in breaking the golden law of user-kernel address space isolation, so we coupled a uBPF VM (user-space BPF Virtual Machine) with shared memory to control access to the kernel's memory from the user's application. In this case, user-space programs can access the shared memory under the supervision of the uBPF VM (and the kernel's blessing of its shared library) to gain non-blocking data transfer to and from the kernel's memory space. We test our implementation in several use cases and find this mechanism can bring speedups over traditional user-kernel information passing mechanisms.
There and Back Again: A Netlist's Tale with Much Egraphin'
arXiv (Cornell University) · 2024
Senior author · Corresponding
- Computer Science
- History
EDA toolchains are notoriously unpredictable, incomplete, and error-prone; the generally-accepted remedy has been to re-imagine EDA tasks as compilation problems. However, any compiler framework we apply must be prepared to handle the wide range of EDA tasks, including not only compilation tasks like technology mapping and optimization (the "there" in our title), but also decompilation tasks like loop rerolling (the "back again"). In this paper, we advocate for equality saturation -- a term rewriting framework -- as the framework of choice when building hardware toolchains. Through a series of case studies, we show how the needs of EDA tasks line up conspicuously well with the features equality saturation provides.
Exploiting HPC Techniques to Parallelise Simulation of 10B+ Transistor SoCs
2024-04-22
article · 1st author · Corresponding
RTL simulation has become a crucial bottleneck in the design of emerging SoCs for AI. To clear this bottleneck, design teams are leaning ever more heavily on emulation and other alternative tools. We find that the designer can instead exploit the natural boundaries of these emerging SoCs in order to parallelise their RTL simulations using HPC techniques. By distributing Verilog simulation across tens of HPC nodes (and thousands of physical cores), we can simulate a 10B+ transistor, 1024 core SoC with over 2.7MIPS of aggregate throughput for the simulated cores. This talk will describe the insight, HPC techniques, and efficiency results of our novel open-source approach, known as Metro-MPI.
Frequent coauthors
- 19 shared
David Wentzlaff
Princeton University
- 10 shared
Mohammad Shahrad
University of British Columbia
- 10 shared
Alexey Lavrov
Princeton University
- 9 shared
Tri Minh Nguyen
North Carolina State University
- 9 shared
Michael McKeown
- 9 shared
Yaosheng Fu
Nvidia (United States)
- 8 shared
Yanqi Zhou
- 8 shared
Guillem López-Paradís
Barcelona Supercomputing Center