Martin Herbordt

Professor (ECE) · Verified

Boston University · Electrical and Computer Engineering

Active 1990–2025

h-index: 29

Citations: 2.8k

Papers: 196 (56 in last 5 years)

Funding: $3.8M

About

Martin Herbordt is a Professor in the Department of Electrical & Computer Engineering at Boston University. His research interests focus on computer architecture and high performance computing, with a particular emphasis on accelerating applications that are not optimally served by commercial off-the-shelf solutions. This focus has led him to work in diverse domains including computer vision, weather and climate modeling, bioinformatics, computational biology, and computational electrodynamics. Early in his career, he concentrated on ASIC-based solutions, exemplified by his work with the IUA project. Subsequently, his research expanded to include switching fabrics for connecting off-the-shelf components or intellectual property (IP). More recently, he has explored the computational capabilities of Graphics Processors (GPUs) and configurable circuits (FPGAs), which he describes as incredible but largely untapped. A consistent theme throughout his research has been programmability, which has driven his involvement in developing portable programming languages. Currently, his interests include low-latency interconnects, particularly among accelerators themselves, thermal and power-aware application development, and high performance computing in the Cloud.

Research topics

Computer Science
Operating system
Theoretical computer science
Embedded system
Parallel computing
Distributed computing
Algorithm
World Wide Web
Software engineering
Computer architecture

Selected publications

Towards -OmL: A Deep Learning Based Approach for Outperforming Compiler Defaults
2025-09-15
article · Senior author
Compilers offer default optimization levels (e.g., -Oz, -O3) to generate high performance code based on developer goals such as size, speed, or energy efficiency. As prior work has shown, however, these default levels often generate code that leaves substantial room for improvement. While per-application tuning of compiler heuristics can yield performance benefits, it is time-consuming and lacks generality. A preferred solution is one where a single deep learning model (i) can surpass the performance of compiler defaults and (ii) is sufficiently practical to be integrated into the compiler as its own option, e.g., -OmL. In this work, we first train such a deep learning model for code size reduction (-OzmL). Our reinforcement learning (RL) based approach achieves an average reduction in code size of 1.40% over GCC 13’s -Oz and yields 1.37% better byte reduction as compared to state-of-the-art efforts. Across a large, diverse set of functions from standard benchmarks, this model optimizes 36.7% of the functions, achieving an average 8.45% code size reduction on top of -Oz for those functions. As expected with any model, however, there are still some functions whose performance degrades when compared to -Oz. We therefore propose a classifier, used in tandem with the model, that reduces such performance regressions by 99%. Results demonstrate the viability of a practical and learning-based -OmL optimization level in production compilers.
SmartNIC-GPU-CPU Heterogeneous System for Large Machine Learning Model with Software-Hardware Codesign
2025-06-08 · 2 citations
article · Open access · Senior author
ACiS: Complex Processing in the Switch Fabric
ArXiv.org · 2025-01-30
preprint · Open access · Senior author
For the last three decades a core use of FPGAs has been for processing communication: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs, however, has been less studied, but has numerous advantages built around the processing being moved from the edge of the network to the center. Communication switches have previously been augmented to process collectives, e.g., IBM BlueGene and Mellanox SHArP, but the support has been limited to a small set of predefined scalar operations and datatypes. Here we present ACiS, a framework and taxonomy for Advanced Computing in the Switch that unifies and expands our previous work in this area. In addition to fixed scalar collectives (Type 1), we propose three more types of in-switch application processing: (Type 2) User-defined operations and types, including data structures; (Type 3) Look-aside operations that have state within the operation and can have loops; and (Type 4) Fused collectives built by fusing multiple existing collectives or collectives with map computations. ACiS is supported in hardware with modular switch extensions including a CGRA architecture. Software support for ACiS includes evaluation and translation of relevant parts of user programs, compilation of user specifications into control flow graphs, and mapping the graphs into switch hardware. The overall goal is the transparent acceleration of HPC applications encapsulated within an MPI implementation.
Load Imbalance in HPC Applications: Improved Profiling and New Ways to Use Wasted Cycles
2025-09-15
article · Senior author
As the scale of High Performance Computing (HPC) applications increases, load imbalance and other inefficiencies become correspondingly more significant. In this work we extend a previously described system that reclaims idle computation time during MPI synchronization. We begin by applying a new profiler to a new target system (Stampede III) and observing application characteristics. We find that for miniFE there is immense potential for capturing wasted cycles (roughly 30%). Moreover, these cycles do not appear to be caused systematically by particular ranks, but rather by a number of ranks that vary without obvious pattern. We then propose a new runtime algorithm that improves CPU cycle utilization by an additional 10–15% over the previous best result.
AnnotationGym: A Generic Framework for Automatic Source Code Annotation
IEEE Access · 2025-01-01 · 1 citation
article · Open access · Senior author
A common approach to code optimization is to insert compiler hints in the source code using annotations. Two major challenges with using annotations effectively are their complexity and lack of portability. This means, first, that significant developer expertise is required, and, second, that the supported annotations, as well as their syntax and use, can vary substantially. Moreover, there is not currently any tool that can output performant annotation-inserted codes for different back-ends. To address these challenges, we present AnnotationGym, an easy-to-use, open-source, generic infrastructure that supplements or replaces the developer in annotating source code. It demonstrates a novel application of AI methods to code annotation. In addition to improving code performance, the flexibility of AnnotationGym enables easy comparisons of performance and optimization strategies among compilers and target architectures and thus provides an extensible platform to facilitate further progress in this field. AnnotationGym automatically extracts structured information about the target code and compiler to generate a list of possible annotations. AI-based optimization algorithms then traverse this space to determine the best set of annotations depending on the developer goals. To demonstrate its effectiveness, we run AnnotationGym on popular, representative workloads from the Polybench suite, and target different compilers (GCC, AMD HLS, Intel HLS), optimization algorithms (Reinforcement Learning, Bayesian Optimization), and architectures (CPU, FPGA). We also test our approach on FPGA codes derived, e.g., from the Rodinia and OpenDwarfs benchmarks and that are hand-optimized using standard best practices. An interesting finding is that the best overall performance obtained by AnnotationGym was generally with unoptimized codes.
Accelerating Multi-Party Computation Using Heterogeneous Systems
2025-09-15
article · Senior author
Multi-party computation (MPC) allows multiple parties to compute with private data without sharing it. We have previously found that MPC through Secret Sharing has some advantages in data center deployments. But while MPC provides strong privacy guarantees, in Secret Sharing MPC, the protocol’s communication between parties often takes a significant portion of the total execution time. That is, adding MPC to applications like neural network training can increase the execution time from minutes to days. A significant fraction of this increased execution time is due to the use of generic TCP/IP networks and other communication overhead. We present advances to a system that accelerates Secret Sharing MPC using FPGA network cards, which is especially applicable for deployments where all parties are in the same data center. This approach makes improvements over previous work in this area. First, we replace TCP/IP with RDMA over FPGA SmartNICs; this sends data directly among memories without CPU involvement. And second, computation is overlapped with communication to avoid device idling. We implement the SmartNICs using AMD FPGAs with Coyote’s RDMA stack. The proof-of-concept application is a simple machine learning model. We performed layer-by-layer experiments to analyze latency. For the convolutional layers, the system achieves a 2.1× speedup compared to existing MPC frameworks. Moreover, based on these optimizations and analysis, we estimate that the machine learning workflow can achieve a 1.7× speedup compared to unoptimized MPC, potentially making MPC machine learning significantly more attractive.
AutoAnnotate: Reinforcement Learning based Code Annotation for High Level Synthesis
2024-04-03 · 5 citations
article · Senior author
High Level Synthesis (HLS) allows custom hardware generation using developer-friendly programming languages. Often, however, the HLS compiler is unable to output high quality results. One approach is to pre-process the source code, e.g., to restructure the computational flow, or to insert compiler hints using annotations or pragmas. But while the latter approach appears to enhance programmability, it also requires developer expertise, both regarding hardware design patterns and even compiler internals: an incorrect annotation strategy can worsen performance or result in compilation deadlocks. To address these challenges, this work presents AutoAnnotate, an automatic code annotation framework for HLS. It demonstrates the efficacy, novelty, and benefit of applying ML methods to code annotation. AutoAnnotate replaces the need for developer expertise by using Reinforcement Learning (RL) to determine the best set of annotations for a given input code. To demonstrate the effectiveness of this approach, we ran AutoAnnotate on a number of common FPGA benchmarks derived, e.g., from Rodinia and OpenDwarfs, with state-of-the-art HLS tools (AMD Vitis and Intel HLS). We obtained a geometric mean of 42× performance improvement for Vitis HLS and 3.42× for Intel HLS. We then hand optimized these codes using standard best practices and again applied AutoAnnotate, this time still achieving 32.3× performance improvement for Vitis HLS and 3.1× for Intel HLS. Interestingly, the best overall performance obtained by AutoAnnotate was generally with unoptimized codes.
Cycle-Stealing in Load-Imbalanced HPC Applications
2024-09-23 · 1 citation
article · Senior author
It is practical to steal cycles when Message Passing Interface (MPI) programs are load imbalanced, either from natural algorithmic load, because of collective communication and/or process skew, and/or because of variable message loads and needs for message progress within the MPI implementation. We introduce TimeLord, a runtime library to provide fungibility for compute-cycles without significantly degrading the nominal performance of the parallel application. We propose three means to exploit wasted cycles in LAMMPS, miniFE, and miniAMR. The results show, on average, 40% of the runtime can be used to execute extra computations and that runtime overheads are less than 4%. We envisage that these extra computations be related to the target application; here, for simplicity and proof-of-concept, we assume arbitrary commodity applications. Additionally, TimeLord uncovers inefficiencies in the MPI progress engine and improves baseline performance. We explore some fundamental issues of load imbalance and its interaction with system software, including the predictability of extra time and the relative benefits of prediction versus preemption. TimeLord requires no changes to the MPI application, MPI implementation, or other system components.
Multi-Core Multi-Rule VeBPF Firewall for Secure FPGA IoT Device Deployments
2024-05-27 · 2 citations
article · Senior author
FPGAs are often deployed in IoT devices: sensor technology is advancing rapidly and microcontrollers may be unable to handle the needed throughput. But with their connections to the internet and only basic system support, IoT devices may be easy targets of cyberattacks. Current FPGA-based SmartNIC defenses against cyberattacks, however, are mostly applicable in cloud deployments. In order to mitigate cyberattacks on resource-limited IoT devices, we have developed a multi-core multi-rule VeBPF (Verilog extended Berkeley Packet Filter) firewall for FPGA-based IoT devices. This VeBPF firewall accepts standard eBPF bytecode as firewall rules; these rules are run by VeBPF CPU cores. Any number of VeBPF cores can be generated by specifying the N_VeBPF parameter.
A Neural Network Based GCC Cost Model for Faster Compiler Tuning
2024-09-23 · 2 citations
article · Senior author
Machine learning models have been found to be effective in predicting compiler heuristics, but are limited by their very long training times. This is because computing the impact of transformations on a code, e.g., through the performance values, involves invoking the downstream compiler. One way to circumvent the cost-computation bottleneck is to devise accurate cost models that can be trained to predict the target metric. In this paper, we develop a neural net based cost function that can accurately predict binary code size for GCC-based compilation. The input to the model is a comprehensive list of features that have been extracted offline from GCC's intermediate tree representation and the compiler flags that need to be evaluated by a compiler tuning workload. To extract the code features, we have built a GIMPLE analysis framework that can generate feature sets from intermediate representations at different stages of the compilation process. Our results show that the cost model has a mean absolute percentage error of just 8%, and a Spearman correlation of 0.98 between predicted and measured binary size of the test applications. We also demonstrate that compiler pass selection for feature extraction has a significant benefit on the accuracy of the model. Finally, we show that the cost model can reduce metric evaluation time by multiple orders of magnitude.

Recent grants

Exploration of Low-Cost Long-Timescale Free Energy Perturbations on FPGAs
NIH · $218k · 2018–2019
II-EN: Collaborative Research: Large-Scale FPGA-Centric Cluster with Direct and Programmable Communication
NSF · $350k · 2014–2018
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
NSF · $457k · 2019–2024
NIH Grant R21RR020209
NIH · $344k · 2007
CCRI: Grand: Developing a Testbed for the Research Community Exploring Next-Generation Cloud Platforms
NSF · $2.0M · 2019–2024

Frequent coauthors

Tong Geng
University of Rochester
45 shared
Chunshu Wu
University of Rochester
25 shared
Charles Weems
University of Massachusetts Amherst
23 shared
Ahmed Sanaullah
Red Hat (United States)
22 shared
Pouya Haghi
University of Rochester
22 shared
Ang Li
20 shared
Tianqi Wang
18 shared
Anqi Guo
Boston University
18 shared

Education

Ph.D., Electrical Engineering
Massachusetts Institute of Technology
1985
M.S., Electrical Engineering
Massachusetts Institute of Technology
1981
B.S., Electrical Engineering
University of California, Berkeley
1979

Awards & honors

2014 General Chair, 22nd IEEE International Symposium on Fie…
2013 General Chair, IEEE International Parallel and Distribu…
2011 Fellow, Hariri Institute for Computing and Computationa…
2009 FPL Outstanding Paper Award
2008 IBM Faculty Award
