
Saman Amarasinghe
Professor of Electrical Engineering and Computer Science · Massachusetts Institute of Technology · Electrical Engineering and Computer Science
Active 1993–2026
About
Saman Amarasinghe is the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science at MIT. His research areas include Artificial Intelligence and Machine Learning, Computer Architecture, Programming Languages and Software Engineering, Security and Cryptography, and Systems and Networking. He is involved in developing systems that sense, process, and transmit energy and information, leveraging computational, theoretical, and experimental tools to create groundbreaking sensors, energy transducers, and physical substrates for computation. His work addresses shared challenges facing humanity through innovative system design and analysis.
Research topics
- Computer Science
- Programming language
- Parallel computing
- Algorithm
- Mathematics
- Theoretical computer science
- Computational science
- Computer architecture
- Physics
- Operating system
- Pure mathematics
Selected publications
Backwards Data-Flow Analysis using Prophecy Variables in the BuildIt System
arXiv (Cornell University) · 2026-01-06
preprint · Open access
Many program transformations and optimizations require information about the future behavior of the program. A standard way to obtain this information is to build an intermediate program representation, then use a backwards program analysis to propagate relevant information against the flow of control back to the transformation/optimization site. We instead propose to use prophecy variables, which predict information about the future execution of the program, to enable such transformations and optimizations. We implement prophecy variables in BuildIt, a lightweight domain specific language implementation system. BuildIt uses staged compilation to implement high performance domain specific languages embedded within a standard general purpose programming language (C++). The BuildIt first phase uses standard C++ program execution to generate optimized C, C++, and CUDA second phase code. This approach enables BuildIt to eliminate programming language implementation components such as parsers and intermediate representations, delivering a dramatic decrease in the engineering effort required to implement domain specific languages. The combination of prophecy variables and repeated forward program execution enables BuildIt to extend this approach to include transformations and optimizations that require information about the future execution of the program without backwards analyses and without the engineering overhead associated with implementing these analyses. We formalize the use of prophecy variables for this purpose, discuss the implementation of prophecy variables and repeated execution in BuildIt, and present experimental results for BuildIt computations that benefit from optimizations enabled by the information that prophecy variables provide.
SPAC: Automating FPGA-based Network Switches with Protocol Adaptive Customization
arXiv (Cornell University) · 2026-04-23
preprint · Open access
With network requirements diverging across emerging applications, latency-critical services demand minimal logic delay, while hyperscale training and collectives require sustained line-rate throughput for synchronized bulk transfers. This divergence creates an urgent need for custom network switches tailored to specialized protocols and application-specific traffic patterns. This paper presents SPAC (Switch and Protocol Adaptive Customization), a novel approach that automates the generation of FPGA-based network switches co-optimized for custom protocols and application-specific traffic patterns. SPAC introduces a unified workflow with a domain-specific language (DSL) for protocol-architecture co-design, a library of modular HLS-based adaptive switch components, and a trace-aware Design Space Exploration (DSE) engine. By providing a multi-fidelity simulation stack, SPAC enables rapid identification of Pareto-optimal designs prior to deployment. We demonstrate the efficacy of the domain-specific adaptation of SPAC across a spectrum of real-world scenarios, spanning from latency-sensitive sensor and HFT networks to hyperscale datacenter fabrics. Experimental results show that by tailoring the micro-architecture and protocol to the specific workload, SPAC-generated designs reduce LUT and BRAM usage by 55% and 53%, respectively. Compared to fixed-architecture counterparts, SPAC delivers latency reductions ranging from 7.8% to 38.4% across various tasks while maintaining adequate resource consumption and packet drop rate.
Insum: Sparse GPU Kernels Simplified and Optimized with Indirect Einsums
2026-03-10
article · Open access
Programming high-performance sparse GPU kernels is notoriously difficult, requiring both substantial effort and deep expertise. Sparse compilers aim to simplify this process, but existing systems fall short in two key ways. First, they are primarily designed for CPUs and rarely produce high-performance GPU code. Second, when computations involve both sparse and dense regions, these compilers often fail to optimize the dense portions effectively. In this paper, we propose a new approach for expressing sparse computations. We start from format-agnostic Einsums over sparse tensors and rewrite them into format-conscious indirect Einsums, which explicitly encode format information by mapping sparse data and metadata onto dense tensor operations through indirect indexing. To execute indirect Einsums, we introduce the Insum compiler, which generates efficient GPU code for these Einsums by lowering to the PyTorch compiler, extended to better support Tensor Core–enabled indirect Einsums. We also present two fixed-length sparse formats, GroupCOO and BlockGroupCOO, designed to fit naturally with indirect Einsums. Our approach achieves 1.14×–3.81× speedups across a range of sparse GPU applications while reducing lines of code by 202×–4491× compared to hand-written implementations. The source code for Insum is publicly available at https://github.com/nullplay/IndirectEinsum.
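As a rough illustration of the indirect-indexing idea in the abstract above, the sketch below writes a sparse matrix-vector product over dense coordinate arrays, so that the format metadata (row and column indices) appears as ordinary index arrays. It assumes plain COO storage, not the GroupCOO or BlockGroupCOO formats the paper presents, and it does not use the Insum compiler or its API. The function name `spmv_coo` and the sample matrix are hypothetical.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """Compute y = A @ x for a matrix A stored as COO triples.

    Each stored nonzero p contributes y[rows[p]] += vals[p] * x[cols[p]],
    i.e. the sparse computation becomes a dense loop over p with
    indirect reads (x[cols[p]]) and indirect writes (y[rows[p]]).
    """
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


# Hypothetical 3x3 matrix [[2, 0, 0], [0, 0, 3], [1, 0, 4]] in COO form.
rows = [0, 1, 2, 2]
cols = [0, 2, 0, 2]
vals = [2.0, 3.0, 1.0, 4.0]
x = [1.0, 2.0, 3.0]

y = spmv_coo(3, rows, cols, vals, x)
print(y)  # [2.0, 9.0, 13.0]
```

The loop body is the whole kernel; the storage format is visible only through the `rows` and `cols` arrays, which is the property the abstract attributes to format-conscious indirect Einsums.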
UniTe: A Universal Tensor Abstraction for Capturing Spatial Relationships
ACM Transactions on Architecture and Code Optimization · 2026-01-10
article · Open access · Senior author
Tensors are an integral part of numerous domains, and while significant effort has been put into the design of tensor data structures in isolation, little attention has been paid to the relationships that exist across tensors and how this affects their representation and use. In this article, we focus on spatial relationships across tensors in a program, where such tensors are defined relative to a common reference coordinate system. These relationships are complicated by the fact that the tensors may differ in their representations, such as having variations in their axes, spacings, origins, and overall shape. Due to the lack of existing abstractions and language support for these types of tensor semantics, users are currently forced to manually perform the bookkeeping necessary to account for these varying relationships and representations. Unfortunately, we cannot rely on a simple library to capture these relationships, as computations on these types of tensors often happen at the innermost levels of programs; we find that the overheads associated with an unoptimized implementation quickly accumulate, leading to performance up to nearly 65x slower than a reference C implementation on a series of image and video compression benchmarks. In this article, we introduce the novel UniTe abstraction, which captures spatial relationships across all such tensors in a program. We also introduce two domain-specific languages and optimizing compilers, CoLa for Python and SHiM for C/C++, built off of UniTe. Both CoLa and SHiM provide users with an intuitive set of tensor primitives based on spatial relationships, hiding the complexity that goes into maintaining the tensors and computing accesses across them. In addition, we discuss the optimizations necessary to remove the associated abstraction overhead and describe their implementations. On the benchmarks, we show that both CoLa and SHiM successfully remove the overheads, achieving performance parity with existing C implementations.
Journal of Global Agriculture and Ecology · 2025-10-22
article · Open access · 1st author · Corresponding
Background: The food security crisis exacerbated by climate change, population growth, and urbanization requires resilient and sustainable agriculture systems. Hydroponics, as a central technology in Controlled Environment Agriculture (CEA), has the potential to be disruptively different from legacy farming with efficient use of water, nutrients, and space. Hydroponic systems can increase the yield by 30-200% depending on the crop or system used, especially in leafy items like lettuce and spinach. Hydroponics use 90-95% less water, in comparison with soil based agriculture because of recirculating the nutrient solution and evaporation differential. Aims: This review of the scientific literature covers the development and technological advancement of hydroponic systems with respect to initially developed strategies like Nutrient Film Technique (NFT), Deep Water Culture (DWC), and Aeroponics, to futuristic use cases with AI, IoT, and automation. Methodology: This study synthesizes information from more than fifty peer-reviewed articles published from 2015 - 2025 focused on specifically, thematic innovations in system design, vertical farming, recycling of resources, and harvesting-encoding environmental control. Discussion: Comparative analyses of hydroponics show evidence of advantages with increased yield and resource use efficiency; however, limits remain in terms of scalability, energy needs, and dependence of know-how. The socioeconomic analysis shows promise in employment opportunities and urban food sovereignty, meanwhile suggests a need to dismantle the rural-urban divide and distributional fairness. Policy, and weak institutional backer remain major limits to implementation. Conclusion: Ultimately, this review shows that hydroponics, with the help of an inclusive policy approach, and interdisciplinary advancements in innovation can be used as a revolutionary strategy for climate-smart sustainable, food systems in local communities on a global scale.
The Continuous Tensor Abstraction: Where Indices Are Real
Proceedings of the ACM on Programming Languages · 2025-10-09 · 1 citation
article · Open access · Senior author
This paper introduces the continuous tensor abstraction, allowing indices to take real-number values (e.g., A[3.14]). It also presents continuous tensor algebra expressions, such as $C_{x,y} = A_{x,y} * B_{x,y}$, where indices are defined over a continuous domain. This work expands the traditional tensor model to include continuous tensors. Our implementation supports piecewise-constant tensors, on which infinite domains can be processed in finite time. We also introduce a new tensor format for efficient storage and a code generation technique for automatic kernel generation. For the first time, our abstraction expresses domains like computational geometry and computer graphics in the language of tensor programming. Our approach demonstrates competitive or better performance to hand-optimized kernels in leading libraries across diverse applications. Compared to hand-implemented libraries on a CPU, our compiler-based implementation achieves an average speedup of 9.20× on 2D radius search with ∼60× fewer lines of code (LoC), 1.22× on genomic interval overlapping queries (with ∼18× LoC saving), and 1.69× on trilinear interpolation in Neural Radiance Field (with ∼6× LoC saving).
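To make the notion of a piecewise-constant tensor with real-valued indices concrete, here is a minimal one-dimensional sketch. It is an assumption-laden toy, not the paper's storage format or code generator: the class name `PiecewiseConstant`, its breakpoint representation, and the sample values are all hypothetical, and the abstract's example expression is two-dimensional whereas this sketch covers a single axis.

```python
from bisect import bisect_right


class PiecewiseConstant:
    """1-D tensor indexed by real numbers (hypothetical illustration).

    values[i] holds on the half-open interval [starts[i], starts[i + 1]);
    the last value holds on [starts[-1], +inf), and `default` holds for
    every index below starts[0].
    """

    def __init__(self, starts, values, default=0.0):
        assert len(starts) == len(values)
        assert all(a < b for a, b in zip(starts, starts[1:]))
        self.starts = list(starts)
        self.values = list(values)
        self.default = default

    def __getitem__(self, x):
        # Locate the interval containing x by binary search.
        i = bisect_right(self.starts, x) - 1
        return self.default if i < 0 else self.values[i]

    def __mul__(self, other):
        # The product can only change value at a breakpoint of either
        # operand, so one evaluation per merged breakpoint covers the
        # whole real line in a finite number of steps.
        starts = sorted(set(self.starts) | set(other.starts))
        values = [self[s] * other[s] for s in starts]
        return PiecewiseConstant(starts, values, self.default * other.default)


# A is 1 on [0, 2), 4 on [2, 5), 0 from 5 onward, and 0 below 0.
A = PiecewiseConstant([0.0, 2.0, 5.0], [1.0, 4.0, 0.0])
# B is 2 on [1, 3), 10 from 3 onward, and 0 below 1.
B = PiecewiseConstant([1.0, 3.0], [2.0, 10.0])

C = A * B          # C[x] = A[x] * B[x] for every real x
print(A[3.14])     # 4.0
print(C[3.14])     # 40.0
print(C[1.5])      # 2.0
```

Multiplying `A` and `B` touches only five merged breakpoints even though both tensors are defined over an unbounded domain, which is the sense in which infinite domains can be handled in finite time for piecewise-constant data.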
Finch: Sparse and Structured Tensor Programming with Control Flow
Proceedings of the ACM on Programming Languages · 2025-04-09 · 4 citations
article · Open access · Senior author
From FORTRAN to NumPy, tensors have revolutionized how we express computation. However, tensors in these, and almost all prominent systems, can only handle dense rectilinear integer grids. Real world tensors often contain underlying structure, such as sparsity, runs of repeated values, or symmetry. Support for structured data is fragmented and incomplete. Existing frameworks limit the tensor structures and program control flow they support to better simplify the problem. In this work, we propose a new programming language, Finch, which supports both flexible control flow and diverse data structures. Finch facilitates a programming model which resolves the challenges of computing over structured tensors by combining control flow and data structures into a common representation where they can be co-optimized. Finch automatically specializes control flow to data so that performance engineers can focus on experimenting with many algorithms. Finch supports a familiar programming language of loops, statements, ifs, breaks, etc., over a wide variety of tensor structures, such as sparsity, run-length-encoding, symmetry, triangles, padding, or blocks. Finch reliably utilizes the key properties of structure, such as structural zeros, repeated values, or clustered non-zeros. We show that this leads to dramatic speedups in operations such as SpMV and SpGEMM, image processing, and graph analytics.
Modular GPU Programming with Typed Perspectives
ArXiv.org · 2025-11-14
preprint · Open access
To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to perform collective operations like Tensor Core instructions. The tension between these two mindsets makes modular programming error prone. Functions that encapsulate collective operations, despite being called per-thread, must be executed cooperatively by groups of threads. In this work, we introduce Prism, a new GPU language that restores modularity while still giving programmers the low-level control over collective operations necessary for high performance. Our core idea is typed perspectives, which materialize, at the type level, the granularity at which the programmer is controlling the behavior of threads. We describe the design of Prism, implement a compiler for it, and lay its theoretical foundations in a core calculus called Bundl. We implement state-of-the-art GPU kernels in Prism and find that it offers programmers the safety guarantees needed to confidently write modular code without sacrificing performance.
Recent grants
NSF · $2.1M · 2022–2027
XPS: FULL: DSD: Scalable High Performance with Halide and Simit Domain Specific Languages
NSF · $845k · 2015–2020
NGS: StreamIt: A Language and a Compiler for Streaming Applications
NSF · $500k · 2004–2007
CISE Experimental Partnerships: MIT Raw Machine
NSF · $2.1M · 2000–2005
Collaborative Research: Programmable Microfluidics: A Universal Substrate for Biological Computing
NSF · $375k · 2006–2009
Frequent coauthors
- 45 shared
William Thies
- 33 shared
Shoaib Kamil
Adobe Systems (United States)
- 26 shared
Anant Agarwal
The Ohio State University
- 23 shared
Fredrik Kjølstad
Stanford University
- 23 shared
Qin Zhao
Third Xiangya Hospital
- 21 shared
Monica S. Lam
- 21 shared
Rodric Rabbah
- 20 shared
Jason Ansel
Alpha Omega Alpha Medical Honor Society
Education
- 1993
Ph.D., Electrical Engineering and Computer Science
Massachusetts Institute of Technology
- 1989
M.S., Electrical Engineering and Computer Science
Massachusetts Institute of Technology
- 1986
B.S., Electrical Engineering
University of Moratuwa
Awards & honors
- 2025-26 EECS Faculty Award Roundup