
David Brooks
Haley Family Professor of Computer Science · Harvard University · Computer Science
Active 1891–2025
About
David Brooks is the Haley Family Professor of Computer Science at Harvard University, affiliated with the Harvard John A. Paulson School of Engineering and Applied Sciences. His primary teaching area is Computer Science. His research areas include applied mathematics, science and engineering for ClimateTech, applied physics, bioengineering, computer engineering and architecture, electrical engineering, environmental science and engineering, materials science, and mechanical engineering. His work addresses the environmental impacts of computation, with a focus on sustainable computing and reducing the carbon footprint of computing technologies. He is involved in multi-institution research initiatives aimed at advancing green computing solutions.
Research topics
- Computer Science
- Artificial Intelligence
- Machine Learning
- Computer hardware
- Parallel computing
- Data science
- Operating system
- Computer architecture
- Distributed computing
Selected publications
DreamRAM: A Fine-Grained Configurable Design Space Modeling Tool for Custom 3D Die-Stacked DRAM
ArXiv.org · 2025-12-13
preprint · Open access
3D die-stacked DRAM has emerged as a key technology for delivering high bandwidth and high density for applications such as high-performance computing, graphics, and machine learning. However, different applications place diverse and sometimes diverging demands on power, performance, and area that cannot be universally satisfied with fixed commodity DRAM designs. Die stacking creates the opportunity for a large DRAM design space through 3D integration and expanded total die area. To open and navigate this expansive design space of customized memory architectures that cater to application-specific needs, we introduce DreamRAM, a configurable bandwidth, capacity, energy, latency, and area modeling tool for custom 3D die-stacked DRAM designs. DreamRAM exposes fine-grained design customization parameters at the MAT, subarray, bank, and inter-bank levels, including extensions of partial page and subarray parallelism proposals found in the literature, to open a large previously-unexplored design space. DreamRAM analytically models wire pitch, width, length, capacitance, and scaling parameters to capture the performance tradeoffs of physical layout and routing design choices. Routing awareness enables DreamRAM to model a custom MAT-level routing scheme, Dataline-Over-MAT (DLOMAT), to facilitate better bandwidth tradeoffs. DreamRAM is calibrated and validated against published industry HBM3 and HBM2E designs. Within DreamRAM's rich design space, we identify designs that achieve each of 66% higher bandwidth, 100% higher capacity, and 45% lower power and energy per bit compared to the baseline design, each on an iso-bandwidth, iso-capacity, and iso-power basis.
Democratizing Customization for ML at the Edge Through Hetero-Chiplet SiP Architectures
IEEE Journal on Emerging and Selected Topics in Circuits and Systems · 2025-07-25
article · Open access · Senior author
The demand for efficient machine learning in edge devices is challenging the capabilities of general-purpose computing systems. While domain-specific Systems on Chip (SoCs) are efficient, they are often prohibitively expensive due to long design times and high design costs. To address these limitations, the community has begun to explore System in Package (SiP) designs for low-cost assembly of reusable accelerators, available as chiplets, to democratize customization. This presents a new challenge of macro-architecture design space exploration (DSE). Prior works do not address this problem, having only investigated micro-architecture design and optimization of homogeneous SiPs. To address this need, and unlock the potential of assembling custom SiPs comprising heterogeneous chiplets, we introduce an early DSE framework, CASCADE – A. CASCADE employs fast, first-order performance models to capture the tradeoffs of composable compute chiplets, leveraging tool-generated traces to comprehend dataflow patterns in the context of state-of-the-art machine learning tasks. Using CASCADE, we assess the performance benefits of composable SiPs comprising hetero-chiplets for single-tenant and two-tenant scenarios. Notably, we demonstrate that hetero-chiplet systems can deliver speedups in the range of 3-5x, depending on the application, compared to a baseline GPU chiplet system.
Wafer-Scale Systems: A Carbon Perspective
ACM SIGEnergy Energy Informatics Review · 2025-07-01 · 2 citations
article · Senior author
The rapid rise of Large Language Models (LLMs) has prompted a re-evaluation of system architecture design, making energy efficiency and sustainability more crucial than ever. Recently, wafer-scale architectures have emerged as a viable alternative for LLM training and inference, as evidenced by the success of Cerebras Systems. In this work, we examine the carbon implications of wafer-scale architectures as compared to traditional GPUs. As a case study, we examine LLMs on a Cerebras CS-3 system in order to quantify power and total carbon. Then, we analyze total carbon delay product (tCDP) to evaluate the carbon efficiency and performance potential of these systems. We take the first step towards exploring this trade-off for wafer-scale versus traditional GPU architectures - and ultimately find there exists a rich design space, depending on workload and hardware configuration.
2025-11-03
article
This paper presents the deployment of Agentic AI within ADNOC's Artificial Intelligence Production System Optimization (AiPSO), a strategic initiative aimed at transforming upstream oilfield operations through intelligent automation. At its core, AiPSO embeds domain-specific generative agents into engineering and optimization workflows, enabling autonomous diagnostics, scenario modeling, and decision support. These agents interact with a field-wide digital twin powered by hybrid physics/ML models and a knowledge graph that contextualizes data from IT, OT, and ET domains. The system goes beyond rule-based automation by introducing agents that reason over constraints, simulate outcomes, and proactively recommend actions, delivering conversational intelligence grounded in engineering logic and real-time operational data. These capabilities are tightly integrated with foundational workflows including artificial lift diagnostics, MPFM validation, and injection optimization, ensuring adoption and value from day one. Intended for deployment across 25 fields, AiPSO will enable uplift in production capacity with minimal CAPEX while reducing decision latency and enhancing operational transparency. The phased architecture of AiPSO ensures scalability, trust, and explainability, which are key to industrial AI. This paper outlines how Agentic AI transforms traditional workflows into intelligent systems, positioning ADNOC's upstream assets for autonomous operations and redefining how human-machine collaboration evolves in energy production.
ArXiv.org · 2025-10-17
preprint · Open access
Large model training beyond tens of thousands of GPUs is an uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly more frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally-stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM -- a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.
2025-03-31
article
PFAS (per- and polyfluoroalkyl substances), also known as forever chemicals, are widely used in electronics and semiconductor manufacturing. PFAS are environmentally persistent and bioaccumulative synthetic chemicals, which have recently received considerable regulatory attention. Manufacturing semiconductors and electronics, including integrated circuits (IC), batteries, displays, etc., currently accounts for a staggering 10% of the total PFAS-containing fluoropolymers used in Europe alone. Now, computer system designers have an opportunity to reduce the use of PFAS in semiconductors and electronics at the design phase. In this work, we quantify the environmental impact of PFAS in computing systems, and outline how designers can optimize their designs to use less PFAS. We show that manufacturing an IC design at a 7 nm technology node using Extreme Ultraviolet (EUV) lithography uses 20% less volume of PFAS-containing chemicals than manufacturing the same design at a 7 nm node using Deep Ultraviolet (DUV) immersion lithography. We also show that manufacturing an IC design at a 16 nm technology node results in 15% less volume of PFAS than manufacturing the same design at a 28 nm node due to its smaller area.
IEEE Micro · 2025-05-30
article
The increase in both the number and the types of accelerators in modern SoCs necessitates a rethinking of power-management strategies. To overcome the scalability shortcomings of current methods, we propose BlitzCoin, a fully decentralized, hardware-based power-management scheme coupled with optimized unified voltage and frequency regulation. We evaluated BlitzCoin through RTL simulations of multiple SoCs targeted toward different application domains. The results are further validated through silicon measurements of a fabricated 12 nm many-accelerator SoC that includes BlitzCoin. Our evaluations show that BlitzCoin is markedly faster than state-of-the-art centralized power-management strategies, with 8× to 12× lower response times. This results in 25%-34% throughput improvement and allows for scaling to 7× to 13× larger SoCs, all with a small area overhead of <1%. BlitzCoin is an addition to the open-source ESP SoC platform, offering a foundation for further exploration of power-management strategies.
CORDOBA: Carbon-Efficient Optimization Framework for Computing Systems
2025-03-01 · 7 citations
article
The world’s push toward an environmentally sustainable society is highly dependent on the semiconductor industry. Despite existing carbon modeling efforts to quantify carbon footprint of computing systems, optimizing carbon footprint in large design spaces, while also considering trade-offs in power, performance, and area, is especially challenging. To address this need, we present CORDOBA, a carbon-aware optimization framework that optimizes carbon efficiency. We quantify carbon efficiency using the total Carbon Delay Product metric (tCDP): the product of total carbon and application execution time. We justify why tCDP is an effective metric for quantifying carbon efficiency. We use CORDOBA to explore the large design space for carbon-efficient specialized hardware, and identify distinct carbon-efficient optimal designs across operational use (eliminating up to 98% of the design space) despite uncertainty in carbon footprint parameters. We quantify opportunities to improve tCDP for real system case studies: (a) optimizing hardware provisioning from 8 to 4 cores in real system CPUs improves tCDP by 1.25×; and (b) leveraging advanced three-dimensional (3D) integration techniques (3D stacking of separately-fabricated logic and memory chips) improves tCDP by 6.9× versus conventional systems.
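The tCDP metric named in the CORDOBA abstract (total carbon multiplied by application execution time) can be illustrated with a minimal sketch. The split of total carbon into embodied plus operational carbon is an assumption made here for illustration and is not stated in the abstract; the function names and all numbers are hypothetical and are not taken from the paper.

```python
def total_carbon_kg(embodied_kg: float, operational_kg: float) -> float:
    """Total carbon, assumed here to be embodied plus operational carbon.

    This decomposition is an assumption for illustration only.
    """
    return embodied_kg + operational_kg


def tcdp(total_carbon: float, exec_time_s: float) -> float:
    """Total Carbon Delay Product: total carbon times execution time.

    Lower values indicate better carbon efficiency.
    """
    return total_carbon * exec_time_s


# Hypothetical designs (made-up numbers, not results from the paper).
baseline = tcdp(total_carbon_kg(10.0, 30.0), exec_time_s=2.0)
candidate = tcdp(total_carbon_kg(5.0, 20.0), exec_time_s=2.5)

# Improvement factor: how many times lower the candidate's tCDP is.
improvement = baseline / candidate
print(f"baseline={baseline}, candidate={candidate}, improvement={improvement:.2f}x")
```

In this made-up case the candidate runs slower but still has the lower tCDP, because its reduction in total carbon outweighs the added execution time.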
2025-11-03
article · Senior author
Acquiring and interpreting real-time data from both surface facilities and subsurface environments provides an accurate representation of wellbore and reservoir conditions. By capturing high-resolution and high-frequency environmental measurements, sensors enable continuous surveillance of the entire production system from reservoir inflow to surface processing. High and low frequency production data are ingested into digital engineering workflows that incorporate physics laws, analytics and domain expertise. Despite significant advancements in data acquisition and analytics, many digital oil fields (DOF) continue to operate engineering workflows in functional silos, with surface production, reservoir management, and equipment diagnostics often managed independently. While these compartmentalized workflows are individually robust and capable of generating valuable insights within their domains, the lack of integration limits the holistic understanding of the production system. This fragmentation constrains the ability to fully integrate cross-disciplinary data insights, thereby limiting opportunities for comprehensive production optimization and proactive operational decision-making across the asset lifecycle. Addressing these silos is fundamental to realizing the full value of integrated DOF. This paper presents the design and large-scale deployment of a pioneering, engineering-driven feedback loop platform for Well Surveillance and Diagnostics, developed under ADNOC’s Intelligent Production System Optimization (AiPSO) initiative. Engineered to support high-well-count operations across both greenfield and brownfield assets, the platform tackles critical challenges including delayed issue detection, alarm fatigue, inconsistent diagnostics, and fragmented decision-making.
Integrated and orchestrated through an innovative framework, the system synchronizes inputs and outputs, leveraging engineering expertise alongside automated data and physics model analysis to continuously validate insights, prioritize exceptions, and guide decision-making. It does so by embedding human-centered logic within a feedback loop that incorporates modular workflows covering virtual flow meters (VFM) for rate estimation, lift performance diagnostics, and smart ticketing to quantify added value, all aimed at accelerating cross-disciplinary collaboration and action-field response. The deployment of this pioneering production ecosystem marks a shift away from the fragmented, siloed workflows that have long limited the full potential of DOF. By creating an innovative, engineering-driven feedback loop, the system successfully integrates data and expertise to provide a holistic view of the production system. The result is a transformative leap from passive monitoring to a proactive engine of production optimization. This integrated approach not only accelerates decision-making and cross-disciplinary collaboration but also quantifies value, demonstrating that a unified and intelligent framework is the key to unlocking true operational excellence and maximizing asset performance across the entire asset lifecycle.
IEEE Journal of Solid-State Circuits · 2025-09-26
article
We present EPOCHS-1, a 12 nm, 64 mm² system-on-chip (SoC) with a high degree of heterogeneity. It features four Linux-SMP-capable RISC-V cores, 14 different types of accelerators, a distributed memory hierarchy, and various peripherals. EPOCHS-1’s memory hierarchy has the flexibility to support a diverse set of accelerators and can scale to support complex applications with 34% and 25% reduction in latency and energy, respectively. A subset of the SoC’s 23 power and 35 clock domains is regulated with a fully-decentralized power-allocation scheme and hybrid unified voltage and frequency scaling (HUVFS) that combines an in-package switched regulator with a per-tile low dropout (LDO). Combined, these techniques achieve up to a 1.57× speedup versus a centralized power management baseline. Designed with an agile methodology, EPOCHS-1 is based on an open-source SoC architecture and features only open-source components, either third-party or newly designed, thus enabling design reuse for future research projects.
Recent grants
CSR: SMALL: Virtualized Accelerators for Scalable, Composable Architectures
NSF · $450k · 2017–2022
NSF CCF-CPA: Reliability in the Face of Variability under Nanoscale Technology Scaling
NSF · $500k · 2007–2012
CAREER: A Framework for Early-Stage Computer Architecture Design Space Exploration and Optimization
NSF · $400k · 2005–2011
NSF · $402k · 2007–2012
An adaptive alarm-based approach to high-performance/low-cost computing
NSF · $375k · 2004–2008
Frequent coauthors
- Gu-Yeon Wei · 210 shared
- Udit Gupta · Harvard University · 45 shared
- Brandon Reagen · New York University · 43 shared
- Carole-Jean Wu · 38 shared
- Pradip Bose · IBM (United States) · 38 shared
- Paul N. Whatmough · 28 shared
- Vijay Janapa Reddi · 27 shared
- Mark Hempstead · 24 shared