Jonathan Balkind


University of California, Santa Barbara · Computing

Active 2014–2026

h-index 10

Citations 521

Papers 35 · 21 last 5y

Funding —



Research topics

Computer Science
Embedded system
Programming language
Computer hardware
Parallel computing
Computer architecture
History

Selected publications

REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton
arXiv (Cornell University) · 2026-05-06
preprint · Open access
The chip industry continues advancing and expanding modern computing systems, resulting in more complex multi-core processors. Conversely, academic projects face scalability challenges due to limited resources, highlighting the need for open-source frameworks that enable innovation and knowledge sharing. Recently, several open-source proposals have emerged, offering flexible and scalable designs, but they fail to meet the performance demands of modern High-Performance Computing (HPC) applications. In this project, we present REPTILES, an open-source RISC-V multicore framework based on OpenPiton. REPTILES interconnects multiple Sargantana cores with the memory hierarchy of OpenPiton. Moreover, we present the new features incorporated in the Sargantana and OpenPiton designs to improve the performance of HPC applications. We demonstrate that REPTILES presents suitable scalability, achieving a speedup of 3.1x on average with 4 cores. Additionally, we show that Sargantana's new features increase the performance of a vector addition benchmark by 9.3x.
Learning Architectural Cache Simulator Behaviour
2025-10-12
article · Senior author
Modern applications exhibit memory access patterns with complex spatial and temporal relationships. Traditional architectural simulators utilized to evaluate these applications are highly sequential in nature, particularly for stateful components like caches. In this paper, we present an innovative approach to cache simulation by reframing the problem from a deep learning perspective. We exploit the fact that memory access traces in any part of a processor design can be represented as two-dimensional heatmaps. Our key insight is that the behaviour of a cache acts as a filter on these heatmap images which can be learned as a function using deep learning techniques. Leveraging this observation, we introduce CacheBox, a framework that employs a Generative Adversarial Network (GAN) to learn and replicate the filtering behaviour of caches using memory access heatmaps. We demonstrate that CacheBox effectively generalises across multiple state-of-the-art benchmarks, various cache configurations, different cache hierarchy levels, and even alternative microarchitectural structures with high accuracy. We also show that CacheBox enables highly parallelized inference, allowing for simultaneous processing of multiple memory access heatmaps.
Depth-First: A Deterministic and Scalable NoC Routing Protocol for 3.5D Packaged Architectures
IEEE Journal on Emerging and Selected Topics in Circuits and Systems · 2025-07-17
article · Open access
New high-volume commercial products combine 2.5D silicon-interposer based assemblies with 3D monolithic stacks of chiplets. This combination is called 3.5D packaging and makes it possible to assemble dense compute solutions. Components communicate via a Network-On-Chip, but current solutions do not support 3.5D Network-On-Chip topologies. To this end, this work proposes Depth-First, the first Deterministic, Virtual Channel based, Network-On-Chip routing protocol supporting 3.5D network topologies. The protocol prevents deadlocks using additional Virtual Channels only in the upper chiplets, while imposing no VC constraints on the base interposer. Depth-First also features an efficient node naming scheme, enabling highly compact routing tables. Since vertical links must be assigned to routers, we present a Mixed-Integer Linear Programming formulation that greatly speeds up execution time compared to a reference implementation from prior work, which was based on an exhaustive search. We formally prove that the protocol is deadlock-free, study its performance using an open-source cycle-accurate simulator, and compare it with other protocols (on a comparable topology). A partial implementation of Depth-First in an open-source router results in a small 4.9% area impact (7nm process) compared to an implementation without our routing algorithm.
Empowering E-Waste Recycling with Intelligent PCB Component Detection
2025-07-18 · 1 citation
article
Stramash: A Fused-Kernel Operating System For Cache-Coherent, Heterogeneous-ISA Platforms
2025-03-27 · 4 citations
article · Open access
We live in the world of heterogeneous computing. With specialised elements reaching all aspects of our computer systems and their prevalence only growing, we must act to rein in their inherent complexity. One area that has seen significantly less investment in terms of development is heterogeneous-ISA systems, specifically because of complexity. To date, heterogeneous-ISA processors have required significant software overheads, workarounds, and coordination layers, making the development of more advanced software hard, and motivating little further development of more advanced hardware. In this paper, we take a fused approach to heterogeneity, and introduce a new operating system (OS) design, the fused-kernel OS, which goes beyond the multiple-kernel OS design, exploiting cache-coherent shared memory among heterogeneous-ISA CPUs as a first principle and introducing a set of new OS kernel mechanisms. We built a prototype fused-kernel OS, Stramash-Linux, to demonstrate the applicability of our design to monolithic OS kernels. We profile Stramash OS components on real hardware but test them on an architectural simulator, Stramash-QEMU, which we design and build. Our evaluation begins by validating the accuracy of our simulator, achieving an average error of less than 4%. We then perform a direct comparison between our fused-kernel OS and state-of-the-art multiple-kernel OS designs. Results demonstrate speedups of up to 2.1× on NPB benchmarks. Further, we provide an in-depth analysis of the differences and trade-offs between fused-kernel and multiple-kernel OS designs.
Oobleck: Low-Compromise Design for Fault Tolerant Accelerators
ArXiv.org · 2025-06-27
preprint · Open access · Senior author
Data center hardware refresh cycles are lengthening. However, increasing processor complexity is raising the potential for faults. To achieve longevity in the face of increasingly fault-prone datapaths, fault tolerance is needed, especially in on-chip accelerator datapaths. Previously researched methods for adding fault tolerance to accelerator designs require high area, lowering chip utilisation. We propose a novel architecture for accelerator fault tolerance, Oobleck, which leverages modular acceleration to enable fault tolerance without burdensome area requirements. In order to streamline the development and enforce modular conventions, we introduce the Viscosity language, an actor-based approach to hardware-software co-design. Viscosity uses a single description of the accelerator's function and produces both hardware and software descriptions. Our high-level models of data centers indicate that our approach can decrease the number of failure-induced chip purchases inside data centers while not affecting aggregate throughput, thus reducing data center costs. To show the feasibility of our approach, we show three case studies: FFT, AES, and DCT accelerators. We additionally profile the performance under the key parameters affecting latency. Under a single fault we can maintain speedups of between 1.7x and 5.16x for accelerated applications over purely software implementations. We show further benefits can be achieved by adding hot-spare FPGAs into the chip.
Using SBPF to Accelerate Kernel Memory Access From Userspace
ArXiv.org · 2025-06-27
preprint · Open access · Senior author
The cost of communication between the operating system kernel and user applications has long blocked improvements in software performance. Traditionally, operating systems encourage software developers to use the system call interface to transfer (or initiate transfer of) data between user applications and the kernel. This approach not only hurts performance at the software level due to memory copies between user and kernel address spaces, it also hurts system performance at the microarchitectural level by flushing processor pipelines and other microarchitectural state. In this paper, we propose a new communication interface between user applications and the kernel by setting up a shared memory region between user space applications and the kernel's address space. We acknowledge the danger in breaking the golden law of user-kernel address space isolation, so we couple a uBPF VM (user-space BPF Virtual Machine) with shared memory to control access to the kernel's memory from the user's application. In this case, user-space programs can access the shared memory under the supervision of the uBPF VM (and the kernel's blessing of its shared library) to gain non-blocking data transfer to and from the kernel's memory space. We test our implementation in several use cases and find this mechanism can bring speedups over traditional user-kernel information passing mechanisms.
There and Back Again: A Netlist's Tale with Much Egraphin'
arXiv (Cornell University) · 2024
Senior author · Corresponding
- Computer Science
- History
EDA toolchains are notoriously unpredictable, incomplete, and error-prone; the generally-accepted remedy has been to re-imagine EDA tasks as compilation problems. However, any compiler framework we apply must be prepared to handle the wide range of EDA tasks, including not only compilation tasks like technology mapping and optimization (the "there" in our title), but also decompilation tasks like loop rerolling (the "back again"). In this paper, we advocate for equality saturation -- a term rewriting framework -- as the framework of choice when building hardware toolchains. Through a series of case studies, we show how the needs of EDA tasks line up conspicuously well with the features equality saturation provides.
Exploiting HPC Techniques to Parallelise Simulation of 10B+ Transistor SoCs
2024-04-22
article · 1st author · Corresponding
RTL simulation has become a crucial bottleneck in the design of emerging SoCs for AI. To clear this bottleneck, design teams are leaning ever more heavily on emulation and other alternative tools. We find that the designer can instead exploit the natural boundaries of these emerging SoCs in order to parallelise their RTL simulations using HPC techniques. By distributing Verilog simulation across tens of HPC nodes (and thousands of physical cores), we can simulate a 10B+ transistor, 1024-core SoC with over 2.7 MIPS of aggregate throughput for the simulated cores. This talk will describe the insight, HPC techniques, and efficiency results of our novel open-source approach, known as Metro-MPI.

Frequent coauthors

David Wentzlaff
Princeton University
19 shared
Mohammad Shahrad
University of British Columbia
10 shared
Alexey Lavrov
Princeton University
10 shared
Tri Minh Nguyen
North Carolina State University
9 shared
Michael McKeown
9 shared
Yaosheng Fu
Nvidia (United States)
9 shared
Yanqi Zhou
8 shared
Guillem López-Paradís
Barcelona Supercomputing Center
8 shared
