Martin Herbordt

Professor (ECE) · Verified

Boston University · Electrical and Computer Engineering

Active 1990–2025

h-index: 29
Citations: 2.8k
Papers: 196 (56 in the last 5 years)
Funding: $3.8M

About

Martin Herbordt is a Professor in the Department of Electrical & Computer Engineering at Boston University. His research interests focus on computer architecture and high performance computing, with a particular emphasis on accelerating applications that are not optimally served by commercial off-the-shelf solutions. This focus has led him to work in diverse domains including computer vision, weather and climate modeling, bioinformatics, computational biology, and computational electrodynamics. Early in his career, he concentrated on ASIC-based solutions, exemplified by his work with the IUA project. Subsequently, his research expanded to include switching fabrics for connecting off-the-shelf components or intellectual property (IP). More recently, he has explored the computational capabilities of Graphics Processors (GPUs) and configurable circuits (FPGAs), which he describes as incredible but largely untapped. A consistent theme throughout his research has been programmability, which has driven his involvement in developing portable programming languages. Currently, his interests include low-latency interconnects, particularly among accelerators themselves, thermal and power-aware application development, and high performance computing in the Cloud.

Research topics

  • Computer Science
  • Operating system
  • Theoretical computer science
  • Embedded system
  • Parallel computing
  • Distributed computing
  • Algorithm
  • World Wide Web
  • Software engineering
  • Computer architecture

Selected publications

  • Towards -OmL: A Deep Learning Based Approach for Outperforming Compiler Defaults

    2025-09-15

    article · Senior author

    Compilers offer default optimization levels (e.g., -Oz, -O3) to generate high performance code based on developer goals such as size, speed, or energy efficiency. As prior work has shown, however, these default levels often generate code that leaves substantial room for improvement. While per-application tuning of compiler heuristics can yield performance benefits, it is time-consuming and lacks generality. A preferred solution is one where a single deep learning model can (i) surpass performance of compiler defaults and (ii) is sufficiently practical to be integrated into the compiler as its own option, e.g., -OmL. In this work, we first train such a deep learning model for code size reduction (-OzmL). Our reinforcement learning (RL) based approach achieves an average reduction in code size of 1.40% over GCC 13’s -Oz and yields 1.37% better byte reduction as compared to state-of-the-art efforts. Across a large, diverse set of functions from standard benchmarks, this model optimizes 36.7% of the functions achieving an average 8.45% code size reduction on top of -Oz for those functions. As expected with any model, however, there are still some functions whose performance degrades when compared to -Oz. We therefore propose a classifier to be used in tandem with the model to reduce such performance regressions, reducing them by 99%. Results demonstrate the viability of a practical and learning-based -OmL optimization level in production compilers.

  • SmartNIC-GPU-CPU Heterogeneous System for Large Machine Learning Model with Software-Hardware Codesign

    2025-06-08 · 2 citations

    article · Open access · Senior author
  • ACiS: Complex Processing in the Switch Fabric

    ArXiv.org · 2025-01-30

    preprint · Open access · Senior author

    For the last three decades a core use of FPGAs has been for processing communication: FPGA-based SmartNICs are in widespread use from the datacenter to IoT. Augmenting switches with FPGAs, however, has been less studied, but has numerous advantages built around the processing being moved from the edge of the network to the center. Communication switches have previously been augmented to process collectives, e.g., IBM BlueGene and Mellanox SHArP, but the support has been limited to a small set of predefined scalar operations and datatypes. Here we present ACiS, a framework and taxonomy for Advanced Computing in the Switch that unifies and expands our previous work in this area. In addition to fixed scalar collectives (Type 1), we propose three more types of in-switch application processing: (Type 2) User-defined operations and types, including data structures; (Type 3) Look-aside operations that have state within the operation and can have loops; and (Type 4) Fused collectives built by fusing multiple existing collectives or collectives with map computations. ACiS is supported in hardware with modular switch extensions including a CGRA architecture. Software support for ACiS includes evaluation and translation of relevant parts of user programs, compilation of user specifications into control flow graphs, and mapping the graphs into switch hardware. The overall goal is the transparent acceleration of HPC applications encapsulated within an MPI implementation.

  • Load Imbalance in HPC Applications: Improved Profiling and New Ways to Use Wasted Cycles

    2025-09-15

    article · Senior author

    As the scale of High Performance Computing (HPC) applications increases, load imbalance and other inefficiencies become correspondingly more significant. In this work we extend a previously described system that reclaims idle computation time during MPI synchronization. We begin by applying a new profiler to a new target system (Stampede III) and observing application characteristics. We find that for miniFE there is immense potential for capturing wasted cycles (roughly 30%). Moreover these cycles do not appear to be caused systematically by particular ranks, but rather by a number of ranks that vary without obvious pattern. We then propose a new runtime algorithm that improves CPU cycle utilization by an additional 10–15% over the previous best result.

  • AnnotationGym: A Generic Framework for Automatic Source Code Annotation

    IEEE Access · 2025-01-01 · 1 citation

    article · Open access · Senior author

    A common approach to code optimization is to insert compiler hints in the source code using annotations. Two major challenges with using annotations effectively are their complexity and lack of portability. This means, first, that significant developer expertise is required, and, second, that the supported annotations, as well as their syntax and use, can vary substantially. Moreover, there is not currently any tool that can output performant annotation-inserted codes for different back-ends. To address these challenges, we present AnnotationGym, an easy-to-use, open-source, generic infrastructure that supplements or replaces the developer in annotating source code. It demonstrates a novel application of AI methods to code annotation. In addition to improving code performance, the flexibility of AnnotationGym enables easy comparisons of performance and optimization strategies among compilers and target architectures and thus provides an extensible platform to facilitate further progress in this field. AnnotationGym automatically extracts structured information about the target code and compiler to generate a list of possible annotations. AI-based optimization algorithms then traverse this space to determine the best set of annotations depending on the developer goals. To demonstrate its effectiveness, we run AnnotationGym on popular, representative workloads from the Polybench suite, and target different compilers (GCC, AMD HLS, Intel HLS), optimization algorithms (Reinforcement Learning, Bayesian Optimization), and architectures (CPU, FPGA). We also test our approach on FPGA codes derived, e.g., from the Rodinia and OpenDwarfs benchmarks and that are hand-optimized using standard best practices. An interesting finding is that the best overall performance obtained by AnnotationGym was generally with unoptimized codes.

  • Accelerating Multi-Party Computation Using Heterogeneous Systems

    2025-09-15

    article · Senior author

    Multi-party computation (MPC) allows multiple parties to compute with private data without sharing it. We have previously found that MPC through Secret Sharing has some advantages in data center deployments. But while MPC provides strong privacy guarantees, in Secret Sharing MPC, the protocol’s communication between parties often takes a significant portion of the total execution time. That is, adding MPC to applications like neural network training can increase the execution time from minutes to days. A significant fraction of this increased execution time is due to the use of generic TCP/IP networks and other communication overhead. We present advances to a system that accelerates Secret Sharing MPC using FPGA network cards, which is especially applicable for deployments where all parties are in the same data center. This approach makes improvements over previous work in this area. First, we replace TCP/IP with RDMA over FPGA SmartNICs; this sends data directly among memories without CPU involvement. And second, computation is overlapped with communication to avoid device idling. We implement the SmartNICs using AMD FPGAs with Coyote’s RDMA stack. The proof-of-concept application is a simple machine learning model. We performed layer-by-layer experiments to analyze latency. For the convolutional layers, the system achieves a 2.1× speedup compared to existing MPC frameworks. Moreover, based on these optimizations and analysis, we estimate that the machine learning workflow can achieve a 1.7× speedup compared to unoptimized MPC, potentially making MPC machine learning significantly more attractive.

  • AutoAnnotate: Reinforcement Learning based Code Annotation for High Level Synthesis

    2024-04-03 · 5 citations

    article · Senior author

    High Level Synthesis (HLS) allows custom hardware generation using developer-friendly programming languages. Often, however, the HLS compiler is unable to output high quality results. One approach is to pre-process the source code, e.g., to restructure the computational flow, or to insert compiler hints using annotations or pragmas. But while the latter approach appears to enhance programmability, it also requires developer expertise, both regarding hardware design patterns and even compiler internals: an incorrect annotation strategy can worsen performance or result in compilation deadlocks. To address these challenges, this work presents AutoAnnotate, an automatic code annotation framework for HLS. It demonstrates the efficacy, novelty, and benefit of applying ML methods to code annotation. AutoAnnotate replaces the need for developer expertise by using Reinforcement Learning (RL) to determine the best set of annotations for a given input code. To demonstrate the effectiveness of this approach, we ran AutoAnnotate on a number of common FPGA benchmarks derived, e.g., from Rodinia and OpenDwarfs, with state-of-art HLS tools (AMD Vitis and Intel HLS). We obtained a geometric mean of 42× performance improvement for Vitis HLS and 3.42× for Intel HLS. We then hand optimized these codes using standard best practices and again applied AutoAnnotate, this time still achieving 32.3× performance improvement for Vitis HLS and 3.1× for Intel HLS. Interestingly, the best overall performance obtained by AutoAnnotate was generally with unoptimized codes.

  • Cycle-Stealing in Load-Imbalanced HPC Applications

    2024-09-23 · 1 citation

    article · Senior author

    It is practical to steal cycles when Message Passing Interface (MPI) programs are load imbalanced, either from natural algorithmic load, because of collective communication and/or process skew, and/or because of variable message loads and needs for message progress within the MPI implementation. We introduce TimeLord, a runtime library to provide fungibility for compute-cycles without significantly degrading the nominal performance of the parallel application. We propose three means to exploit wasted cycles in LAMMPS, miniFE, and miniAMR. The results show, on average, 40% of the runtime can be used to execute extra computations and that runtime overheads are less than 4%. We envisage that these extra computations be related to the target application; here, for simplicity and proof-of-concept, we assume arbitrary commodity applications. Additionally, TimeLord uncovers inefficiencies in the MPI progress engine and improves baseline performance. We explore some fundamental issues of load imbalance and its interaction with system software, including the predictability of extra time and the relative benefits of prediction versus preemption. TimeLord requires no changes to the MPI application, MPI implementation, or other system components.

  • Multi-Core Multi-Rule VeBPF Firewall for Secure FPGA IoT Device Deployments

    2024-05-27 · 2 citations

    article · Senior author

    FPGAs are often deployed in IoT devices: sensor technology is advancing rapidly and microcontrollers may be unable to handle the needed throughput. But with their connections to the internet and only basic system support, IoT devices may be easy targets of cyberattacks. Current FPGA-based SmartNIC defenses against cyberattacks, however, are mostly applicable in cloud deployments. In order to mitigate cyberattacks on resource-limited IoT devices, we have developed a multi-core multi-rule VeBPF (Verilog extended Berkeley Packet Filter) firewall for FPGA-based IoT devices. This VeBPF firewall accepts standard eBPF bytecode as firewall rules; these rules are run by VeBPF CPU cores. Any number of VeBPF cores can be generated by specifying the N_VeBPF parameter.

  • A Neural Network Based GCC Cost Model for Faster Compiler Tuning

    2024-09-23 · 2 citations

    article · Senior author

    Machine learning models have been found to be effective in predicting compiler heuristics, but are limited by their very long training times. This is because computing the impact of transformations on a code, e.g., through the performance values, involves invoking the downstream compiler. One way to circumvent the cost-computation bottleneck is to devise accurate cost models that can be trained to predict the target metric. In this paper, we develop a neural net based cost function that can accurately predict binary code size for GCC-based compilation. The input to the model is a comprehensive list of features that have been extracted offline from GCC's intermediate tree representation and the compiler flags that need to be evaluated by a compiler tuning workload. To extract the code features, we have built a GIMPLE analysis framework that can generate feature sets from intermediate representations at different stages of the compilation process. Our results show that the cost model has a mean absolute percentage error of just 8%, and a Spearman correlation of 0.98 between predicted and measured binary size of the test applications. We also demonstrate that compiler pass selection for feature extraction has a significant benefit on the accuracy of the model. Finally, we show that the cost model can reduce metric evaluation time by multiple orders of magnitude.

Frequent coauthors

  • Tong Geng

    University of Rochester

    45 shared
  • Chunshu Wu

    University of Rochester

    25 shared
  • Charles Weems

    University of Massachusetts Amherst

    23 shared
  • Ahmed Sanaullah

    Red Hat (United States)

    22 shared
  • Pouya Haghi

    University of Rochester

    22 shared
  • Ang Li

    20 shared
  • Tianqi Wang

    18 shared
  • Anqi Guo

    Boston University

    18 shared

Education

  • Ph.D., Electrical Engineering

    Massachusetts Institute of Technology

    1985
  • M.S., Electrical Engineering

    Massachusetts Institute of Technology

    1981
  • B.S., Electrical Engineering

    University of California, Berkeley

    1979

Awards & honors

  • 2014 General Chair, 22nd IEEE International Symposium on Fie…
  • 2013 General Chair, IEEE International Parallel and Distribu…
  • 2011 Fellow, Hariri Institute for Computing and Computationa…
  • 2009 FPL Outstanding Paper Award
  • 2008 IBM Faculty Award