David Brooks

Haley Family Professor of Computer Science

Harvard University · Computer Science

Active 1891–2025

h-index: 64

Citations: 18.1k

Papers: 322 (89 in the last 5 years)

Funding: $3.7M


About

David Brooks is the Haley Family Professor of Computer Science at Harvard University, affiliated with the Harvard John A. Paulson School of Engineering and Applied Sciences. His primary teaching area is Computer Science. His research areas include applied mathematics, science and engineering for ClimateTech, applied physics, bioengineering, computer engineering and architecture, electrical engineering, environmental science and engineering, materials science, and mechanical engineering. His work addresses the environmental impact of computation, focusing on sustainable computing and on reducing the carbon footprint of computing technologies. He participates in multi-institution research initiatives to advance green computing.

Research topics

Computer Science
Artificial Intelligence
Machine Learning
Computer hardware
Parallel computing
Data science
Operating system
Computer architecture
Distributed computing

Selected publications

DreamRAM: A Fine-Grained Configurable Design Space Modeling Tool for Custom 3D Die-Stacked DRAM
ArXiv.org · 2025-12-13
preprint · Open access
3D die-stacked DRAM has emerged as a key technology for delivering high bandwidth and high density for applications such as high-performance computing, graphics, and machine learning. However, different applications place diverse and sometimes diverging demands on power, performance, and area that cannot be universally satisfied with fixed commodity DRAM designs. Die stacking creates the opportunity for a large DRAM design space through 3D integration and expanded total die area. To open and navigate this expansive design space of customized memory architectures that cater to application-specific needs, we introduce DreamRAM, a configurable bandwidth, capacity, energy, latency, and area modeling tool for custom 3D die-stacked DRAM designs. DreamRAM exposes fine-grained design customization parameters at the MAT, subarray, bank, and inter-bank levels, including extensions of partial page and subarray parallelism proposals found in the literature, to open a large previously-unexplored design space. DreamRAM analytically models wire pitch, width, length, capacitance, and scaling parameters to capture the performance tradeoffs of physical layout and routing design choices. Routing awareness enables DreamRAM to model a custom MAT-level routing scheme, Dataline-Over-MAT (DLOMAT), to facilitate better bandwidth tradeoffs. DreamRAM is calibrated and validated against published industry HBM3 and HBM2E designs. Within DreamRAM's rich design space, we identify designs that achieve each of 66% higher bandwidth, 100% higher capacity, and 45% lower power and energy per bit compared to the baseline design, each on an iso-bandwidth, iso-capacity, and iso-power basis.
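DreamRAM models wire pitch, width, length, and capacitance analytically. To illustrate why those physical parameters matter, here is a minimal first-order sketch, assuming a uniform distributed RC wire and the Elmore delay approximation (R·C/2). This is not DreamRAM's model: the `Wire` class, the resistivity, and the capacitance-per-length values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wire:
    """Hypothetical uniform on-die wire segment; all quantities in SI units."""
    length_m: float
    width_m: float
    thickness_m: float
    resistivity_ohm_m: float = 1.7e-8   # placeholder, close to bulk copper
    cap_per_m: float = 2.0e-10          # placeholder, 0.2 fF per micrometer


def resistance(w: Wire) -> float:
    # R = rho * L / (W * T): longer or narrower wires are more resistive.
    return w.resistivity_ohm_m * w.length_m / (w.width_m * w.thickness_m)


def capacitance(w: Wire) -> float:
    # First-order simplification: capacitance scales linearly with length only.
    return w.cap_per_m * w.length_m


def elmore_delay(w: Wire) -> float:
    # Elmore delay of a distributed RC line is R*C/2, so it grows with length squared.
    return 0.5 * resistance(w) * capacitance(w)


if __name__ == "__main__":
    variants = {
        "1 mm, 50 nm wide": Wire(1e-3, 50e-9, 100e-9),
        "2 mm, 50 nm wide": Wire(2e-3, 50e-9, 100e-9),
        "1 mm, 100 nm wide": Wire(1e-3, 100e-9, 100e-9),
    }
    for label, wire in variants.items():
        print(f"{label}: {elmore_delay(wire) * 1e12:.0f} ps")
```

The constant `cap_per_m` hides the fact that widening a wire at a fixed pitch moves it closer to its neighbors and adds coupling capacitance; capturing that interaction is what a pitch-aware model has to do beyond this sketch.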
Democratizing Customization for ML at the Edge Through Hetero-Chiplet SiP Architectures
IEEE Journal on Emerging and Selected Topics in Circuits and Systems · 2025-07-25
article · Open access · Senior author
The demand for efficient machine learning in edge devices is challenging the capabilities of general-purpose computing systems. While domain-specific Systems on Chip (SoCs) are efficient, they are often prohibitively expensive due to long design times and high design costs. To address these limitations, the community has begun to explore System in Package (SiP) designs for low-cost assembly of reusable accelerators, available as chiplets, to democratize customization. This presents a new challenge of macro-architecture design space exploration (DSE). Prior works do not address this problem, having only investigated micro-architecture design and optimization of homogeneous SiPs. To address this need and unlock the potential of assembling custom SiPs comprising heterogeneous chiplets, we introduce an early DSE framework, CASCADE – A. CASCADE employs fast, first-order performance models to capture the tradeoffs of composable compute chiplets, leveraging tool-generated traces to characterize dataflow patterns in state-of-the-art machine learning tasks. Using CASCADE, we assess the performance benefits of composable SiPs comprising hetero-chiplets for single-tenant and two-tenant scenarios. Notably, we demonstrate that hetero-chiplet systems can deliver speedups of 3–5×, depending on the application, compared with a baseline GPU chiplet system.
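CASCADE relies on fast, first-order performance models. One common first-order model is the roofline bound; the sketch below applies it to a hypothetical macro-architecture question: map each layer to whichever chiplet type runs it fastest, charge a link transfer whenever consecutive layers land on different chiplets, and compare against a GPU-only baseline. The chiplet types, layer numbers, link bandwidth, and greedy mapping are all assumptions for illustration, not CASCADE's models or search strategy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chiplet:
    name: str
    peak_flops: float   # FLOP/s
    mem_bw: float       # bytes/s to the chiplet's local memory


@dataclass(frozen=True)
class Layer:
    name: str
    flops: float         # FLOPs in the layer
    bytes_moved: float   # local memory traffic in bytes
    out_bytes: float     # activation bytes handed to the next layer


def roofline_time(layer: Layer, chip: Chiplet) -> float:
    # First-order bound: a layer is limited by either compute or memory bandwidth.
    return max(layer.flops / chip.peak_flops, layer.bytes_moved / chip.mem_bw)


def single_chiplet_time(layers: List[Layer], chip: Chiplet) -> float:
    # Baseline: every layer runs on one chiplet type, so no inter-chiplet transfers.
    return sum(roofline_time(layer, chip) for layer in layers)


def greedy_hetero_time(layers: List[Layer], chips: List[Chiplet],
                       link_bw: float) -> Tuple[float, List[str]]:
    # Map each layer to its fastest chiplet; pay a link transfer when the chiplet changes.
    total, mapping, prev = 0.0, [], None
    for i, layer in enumerate(layers):
        best = min(chips, key=lambda c: roofline_time(layer, c))
        total += roofline_time(layer, best)
        if prev is not None and best != prev:
            total += layers[i - 1].out_bytes / link_bw
        mapping.append(best.name)
        prev = best
    return total, mapping


# Hypothetical chiplet catalog and workload.
gpu = Chiplet("gpu", peak_flops=100e12, mem_bw=1e12)
gemm = Chiplet("gemm", peak_flops=400e12, mem_bw=0.5e12)
stream = Chiplet("stream", peak_flops=20e12, mem_bw=4e12)
layers = [
    Layer("conv", flops=2e12, bytes_moved=1e9, out_bytes=1e8),
    Layer("softmax", flops=1e10, bytes_moved=8e9, out_bytes=1e8),
    Layer("fc", flops=4e12, bytes_moved=2e9, out_bytes=1e7),
]

if __name__ == "__main__":
    base = single_chiplet_time(layers, gpu)
    hetero, mapping = greedy_hetero_time(layers, [gpu, gemm, stream], link_bw=1e11)
    print(mapping, f"speedup vs GPU-only: {base / hetero:.2f}x")
```

The greedy choice ignores transfer cost when picking a chiplet, so it can be beaten by a joint search over mappings; that gap is one reason macro-architecture DSE needs a dedicated framework.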
Wafer-Scale Systems: A Carbon Perspective
ACM SIGEnergy Energy Informatics Review · 2025-07-01 · 2 citations
article · Senior author
The rapid rise of Large Language Models (LLMs) has prompted a re-evaluation of system architecture design, making energy efficiency and sustainability more crucial than ever. Recently, wafer-scale architectures have emerged as a viable alternative for LLM training and inference, as evidenced by the success of Cerebras Systems. In this work, we examine the carbon implications of wafer-scale architectures compared with traditional GPUs. As a case study, we run LLMs on a Cerebras CS-3 system to quantify power and total carbon. We then analyze the total carbon delay product (tCDP) to evaluate the carbon efficiency and performance potential of these systems. We take the first step toward exploring this trade-off for wafer-scale versus traditional GPU architectures and find that a rich design space exists, depending on workload and hardware configuration.
Agentic AI for Oilfield Optimization: Bridging Expert Workflows and Generative Intelligence in Multi-Field Facilities
2025-11-03
article
This paper presents the deployment of Agentic AI within ADNOC's Artificial Intelligence Production System Optimization (AiPSO), a strategic initiative aimed at transforming upstream oilfield operations through intelligent automation. At its core, AiPSO embeds domain-specific generative agents into engineering and optimization workflows, enabling autonomous diagnostics, scenario modeling, and decision support. These agents interact with a field-wide digital twin powered by hybrid physics/ML models and a knowledge graph that contextualizes data from the IT, OT, and ET domains. The system goes beyond rule-based automation by introducing agents that reason over constraints, simulate outcomes, and proactively recommend actions, delivering conversational intelligence grounded in engineering logic and real-time operational data. These capabilities are tightly integrated with foundational workflows, including artificial lift diagnostics, MPFM validation, and injection optimization, ensuring adoption and value from day one. Planned for deployment across 25 fields, AiPSO will enable higher production capacity with minimal CAPEX while reducing decision latency and enhancing operational transparency. The phased architecture of AiPSO ensures scalability, trust, and explainability, which are key to industrial AI. This paper outlines how Agentic AI transforms traditional workflows into intelligent systems, positioning ADNOC's upstream assets for autonomous operations and redefining how human-machine collaboration evolves in energy production.
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
ArXiv.org · 2025-10-17
preprint · Open access
Large-model training beyond tens of thousands of GPUs is uncharted territory. At such scales, disruptions to the training process are not a matter of if but of when: a stochastic process that degrades training productivity. Dynamic runtime variation will become more frequent as training scales up and as GPUs operate in increasingly power-limited and thermally stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM, a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.
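The abstract does not spell out PRISM's statistical method, so the sketch below swaps in plain Monte Carlo sampling to show the underlying idea of a probabilistic bound on training time. It assumes synchronous data-parallel steps (each step waits for the slowest GPU) and independent lognormal per-GPU jitter; the functions, the noise model, and all parameters are hypothetical.

```python
import random
import statistics
from typing import Tuple


def step_time(num_gpus: int, base_s: float, sigma: float, rng: random.Random) -> float:
    # Synchronous step: every GPU waits for the slowest one (straggler effect).
    return max(base_s * rng.lognormvariate(0.0, sigma) for _ in range(num_gpus))


def training_time_quantiles(num_gpus: int, steps: int, base_s: float, sigma: float,
                            trials: int = 100, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo estimate of the (p50, p99) end-to-end training time."""
    rng = random.Random(seed)
    totals = [sum(step_time(num_gpus, base_s, sigma, rng) for _ in range(steps))
              for _ in range(trials)]
    cuts = statistics.quantiles(totals, n=100)  # 99 cut points; index 49 = p50, 98 = p99
    return cuts[49], cuts[98]


if __name__ == "__main__":
    for gpus in (8, 256):
        p50, p99 = training_time_quantiles(gpus, steps=50, base_s=1.0, sigma=0.05)
        print(f"{gpus:>4} GPUs: p50={p50:.1f}s  p99={p99:.1f}s (ideal 50.0s)")
```

Even with identical per-GPU jitter, the expected maximum grows with GPU count, so larger jobs lose more time per step. This sketch only captures independent per-step noise; correlated slowdowns such as thermal throttling or failures would need a richer model.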
PFASware: Quantifying the Environmental Impact of Per- and Polyfluoroalkyl Substances (PFAS) in Computing Systems
2025-03-31
article
PFAS (per- and polyfluoroalkyl substances), also known as forever chemicals, are widely used in electronics and semiconductor manufacturing. PFAS are environmentally persistent and bioaccumulative synthetic chemicals that have recently received considerable regulatory attention. Manufacturing semiconductors and electronics, including integrated circuits (ICs), batteries, and displays, currently accounts for 10% of the total PFAS-containing fluoropolymers used in Europe alone. Computer system designers now have an opportunity to reduce the use of PFAS in semiconductors and electronics at the design phase. In this work, we quantify the environmental impact of PFAS in computing systems and outline how designers can optimize their designs to use less PFAS. We show that manufacturing an IC design at a 7 nm technology node using Extreme Ultraviolet (EUV) lithography uses 20% less volume of PFAS-containing chemicals than manufacturing the same design at a 7 nm node using Deep Ultraviolet (DUV) immersion lithography. We also show that manufacturing an IC design at a 16 nm technology node uses 15% less PFAS by volume than manufacturing the same design at a 28 nm node, due to its smaller area.
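The 16 nm versus 28 nm result is attributed to the smaller area of the 16 nm design. A minimal sketch of that amortization, assuming fab chemicals are consumed per processed wafer and ignoring edge loss, scribe lines, and yield: a newer node can use more PFAS per wafer and still use less per die if the die shrinks enough. The functions and all numbers are hypothetical placeholders, not data from PFASware.

```python
import math

WAFER_DIAMETER_MM = 300.0


def dies_per_wafer(die_area_mm2: float) -> float:
    # Crude estimate: wafer area / die area, ignoring edge loss, scribe lines, and yield.
    wafer_area_mm2 = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    return wafer_area_mm2 / die_area_mm2


def pfas_per_die(pfas_ml_per_wafer: float, die_area_mm2: float) -> float:
    # Chemicals are consumed per processed wafer, so each die carries its share.
    return pfas_ml_per_wafer / dies_per_wafer(die_area_mm2)


if __name__ == "__main__":
    # Hypothetical placeholders: the newer node uses more PFAS per wafer,
    # but the same design occupies half the area.
    old = pfas_per_die(pfas_ml_per_wafer=100.0, die_area_mm2=100.0)
    new = pfas_per_die(pfas_ml_per_wafer=150.0, die_area_mm2=50.0)
    print(f"per-die PFAS, newer node vs older node: {new / old:.2f}x")
```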
BlitzCoin: A Decentralized Hardware Solution for Power Management of Highly Heterogeneous Systems on Chip
IEEE Micro · 2025-05-30
article
The increase in both the number and the types of accelerators in modern SoCs necessitates a rethinking of power-management strategies. To overcome the scalability shortcomings of current methods, we propose BlitzCoin, a fully decentralized hardware-based power-management scheme coupled with optimized unified voltage and frequency regulation. We evaluated BlitzCoin through RTL simulations of multiple SoCs targeted toward different application domains. The results are further validated through silicon measurements of a fabricated 12 nm many-accelerator SoC that includes BlitzCoin. Our evaluations show that BlitzCoin is markedly faster than state-of-the-art centralized power-management strategies, with 8× to 12× lower response times. This results in a 25%–34% throughput improvement and allows scaling to 7× to 13× larger SoCs, all with a small area overhead of <1%. BlitzCoin has been added to the open-source ESP SoC platform, offering a foundation for further exploration of power-management strategies.
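BlitzCoin's name, and the "distributed coin-based power management" in the EPOCHS-1 title, suggest a power budget held as tokens and moved without a central controller. The sketch below illustrates that general idea under assumed rules only: tiles sit on a ring, each holds coins, and neighbors repeatedly move one coin toward whichever tile has the larger unmet demand, so the total budget is conserved. The ring topology, the exchange rule, and `allocate` are hypothetical and are not BlitzCoin's hardware protocol.

```python
from typing import List


def exchange_round(coins: List[int], demand: List[int]) -> bool:
    """One pass of pairwise exchanges between ring neighbors; True if any coin moved."""
    moved = False
    n = len(coins)
    for i in range(n):
        j = (i + 1) % n
        gap = (demand[i] - coins[i]) - (demand[j] - coins[j])  # deficit_i - deficit_j
        # Move a single coin toward the tile with the larger deficit. A gap of 1
        # is left alone because moving one coin would just flip the imbalance.
        if gap >= 2 and coins[j] > 0:
            coins[j] -= 1
            coins[i] += 1
            moved = True
        elif gap <= -2 and coins[i] > 0:
            coins[i] -= 1
            coins[j] += 1
            moved = True
    return moved


def allocate(coins: List[int], demand: List[int], max_rounds: int = 1000) -> List[int]:
    # Only local exchanges happen; the total number of coins (the power budget) never changes.
    coins = list(coins)
    for _ in range(max_rounds):
        if not exchange_round(coins, demand):
            break
    return coins


if __name__ == "__main__":
    budget = [4, 4, 4, 4]   # hypothetical: 16 coins split evenly at boot
    demand = [10, 2, 2, 2]  # tile 0 runs a power-hungry accelerator
    print(allocate(budget, demand))  # -> [9, 2, 3, 2]
```

Every move strictly reduces the sum of squared deficits, so the exchanges always stop. The result is a local equilibrium where neighboring deficits differ by at most one coin, which is why tile 2 keeps a spare coin here.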
CORDOBA: Carbon-Efficient Optimization Framework for Computing Systems
2025-03-01 · 7 citations
article
The world’s push toward an environmentally sustainable society is highly dependent on the semiconductor industry. Despite existing carbon modeling efforts to quantify the carbon footprint of computing systems, optimizing carbon footprint in large design spaces, while also considering trade-offs in power, performance, and area, is especially challenging. To address this need, we present CORDOBA, a carbon-aware optimization framework that optimizes carbon efficiency. We quantify carbon efficiency using the total Carbon Delay Product metric (tCDP): the product of total carbon and application execution time. We explain why tCDP is an effective metric for quantifying carbon efficiency. We use CORDOBA to explore the large design space for carbon-efficient specialized hardware and identify distinct carbon-efficient optimal designs across operational use (eliminating up to 98% of the design space) despite uncertainty in carbon footprint parameters. We quantify opportunities to improve tCDP for real system case studies: (a) reducing hardware provisioning from 8 to 4 cores in real system CPUs improves tCDP by 1.25×; and (b) leveraging advanced three-dimensional (3D) integration techniques (3D stacking of separately fabricated logic and memory chips) improves tCDP by 6.9× versus conventional systems.
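CORDOBA defines tCDP as total carbon multiplied by application execution time. A minimal sketch of that metric, assuming total carbon is embodied carbon amortized over the hardware's lifetime plus operational carbon (energy times grid carbon intensity). The amortization rule, the `Design` fields, and all numbers are assumptions for illustration, not CORDOBA's carbon model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    embodied_kg: float   # manufacturing (embodied) carbon of the hardware, kgCO2e
    lifetime_s: float    # service lifetime over which embodied carbon is amortized
    power_w: float       # average power while running the application
    exec_time_s: float   # application execution time


def total_carbon_kg(d: Design, grid_kg_per_kwh: float) -> float:
    # Embodied share: proportional to the fraction of the lifetime this run occupies.
    embodied = d.embodied_kg * d.exec_time_s / d.lifetime_s
    # Operational: energy (J -> kWh) times grid carbon intensity.
    operational = d.power_w * d.exec_time_s / 3.6e6 * grid_kg_per_kwh
    return embodied + operational


def tcdp(d: Design, grid_kg_per_kwh: float) -> float:
    # Total carbon-delay product: total carbon x execution time (lower is better).
    return total_carbon_kg(d, grid_kg_per_kwh) * d.exec_time_s


if __name__ == "__main__":
    year_s = 365 * 24 * 3600
    a = Design("design A", embodied_kg=40.0, lifetime_s=4 * year_s, power_w=100.0, exec_time_s=100.0)
    b = Design("design B", embodied_kg=25.0, lifetime_s=4 * year_s, power_w=60.0, exec_time_s=130.0)
    for grid in (0.05, 0.5):  # hypothetical low- and high-carbon grids, kgCO2e/kWh
        print(grid, {d.name: f"{tcdp(d, grid):.3e}" for d in (a, b)})
```

Because execution time appears both inside total carbon and as the multiplier, tCDP penalizes slow designs more heavily than total carbon alone; and because embodied and operational terms scale differently, the ranking of two designs can depend on grid carbon intensity.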
Accelerating Field Interventions And Value Realization Through Intelligent Well Surveillance And Diagnostics
2025-11-03
article · Senior author
Acquiring and interpreting real-time data from both surface facilities and subsurface environments provides an accurate representation of wellbore and reservoir conditions. By capturing high-resolution, high-frequency environmental measurements, sensors enable continuous surveillance of the entire production system, from reservoir inflow to surface processing. High- and low-frequency production data are ingested into digital engineering workflows that incorporate physical laws, analytics, and domain expertise. Despite significant advances in data acquisition and analytics, many digital oil fields (DOF) continue to operate engineering workflows in functional silos, with surface production, reservoir management, and equipment diagnostics often managed independently. While these compartmentalized workflows are individually robust and capable of generating valuable insights within their domains, the lack of integration limits a holistic understanding of the production system. This fragmentation constrains the ability to integrate cross-disciplinary insights, limiting opportunities for comprehensive production optimization and proactive operational decision-making across the asset lifecycle. Addressing these silos is fundamental to realizing the full value of an integrated DOF. This paper presents the design and large-scale deployment of a pioneering, engineering-driven feedback-loop platform for Well Surveillance and Diagnostics, developed under ADNOC’s Intelligent Production System Optimization (AiPSO) initiative. Engineered to support high-well-count operations across both greenfield and brownfield assets, the platform tackles critical challenges including delayed issue detection, alarm fatigue, inconsistent diagnostics, and fragmented decision-making.
The system is integrated and orchestrated through a framework that synchronizes inputs and outputs, combining engineering expertise with automated data and physics-model analysis to continuously validate insights, prioritize exceptions, and guide decision-making. It embeds human-centered logic within a feedback loop of modular workflows covering virtual flow meters (VFM) for rate estimation, lift performance diagnostics, and smart ticketing to quantify added value, all aimed at accelerating cross-disciplinary collaboration and field response. The deployment of this production ecosystem marks a shift away from the fragmented, siloed workflows that have long limited the full potential of DOF. By creating an engineering-driven feedback loop, the system integrates data and expertise to provide a holistic view of the production system, moving from passive monitoring to proactive production optimization. This integrated approach not only accelerates decision-making and cross-disciplinary collaboration but also quantifies value, demonstrating that a unified, intelligent framework is key to unlocking operational excellence and maximizing asset performance across the entire asset lifecycle.
EPOCHS-1: A 12 nm Highly Heterogeneous Open-Source SoC With Distributed Coin-Based Power Management and Integrated Hybrid Voltage Regulation
IEEE Journal of Solid-State Circuits · 2025-09-26
article
We present EPOCHS-1, a 12 nm, 64 mm² system-on-chip (SoC) with a high degree of heterogeneity. It features four Linux-SMP-capable RISC-V cores, 14 different types of accelerators, a distributed memory hierarchy, and various peripherals. EPOCHS-1’s memory hierarchy has the flexibility to support a diverse set of accelerators and can scale to support complex applications, with 34% and 25% reductions in latency and energy, respectively. A subset of the SoC’s 23 power and 35 clock domains is regulated with a fully decentralized power-allocation scheme and hybrid unified voltage and frequency scaling (HUVFS), which combines an in-package switched regulator with a per-tile low-dropout (LDO) regulator. Combined, these techniques achieve up to a 1.57× speedup versus a centralized power-management baseline. Designed with an agile methodology, EPOCHS-1 is based on an open-source SoC architecture and uses only open-source components, either third-party or newly designed, thus enabling design reuse in future research projects.
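HUVFS pairs an in-package switched regulator with per-tile LDOs. A first-order sketch of the efficiency trade-off, assuming an ideal LDO (efficiency V_out/V_in, quiescent current ignored), one shared rail set to the highest tile voltage, and a fixed switched-regulator efficiency. The functions, the 0.9 efficiency, and the voltages are hypothetical, not EPOCHS-1 measurements.

```python
from typing import List


def ldo_efficiency(v_in: float, v_out: float) -> float:
    # An LDO passes the load current straight through, so efficiency ~ V_out / V_in.
    if not 0.0 < v_out <= v_in:
        raise ValueError("LDO output must be positive and no higher than its input")
    return v_out / v_in


def hybrid_efficiency(tile_v: List[float], tile_i: List[float],
                      switcher_eff: float = 0.9) -> float:
    """End-to-end efficiency of a shared switched rail feeding per-tile LDOs (powered tiles only)."""
    v_rail = max(tile_v)  # shared rail must satisfy the highest-voltage tile
    p_out = sum(v * i for v, i in zip(tile_v, tile_i))
    p_ldo_in = sum(v * i / ldo_efficiency(v_rail, v) for v, i in zip(tile_v, tile_i))
    return p_out / (p_ldo_in / switcher_eff)


if __name__ == "__main__":
    print(hybrid_efficiency([0.8, 0.8], [1.0, 1.0]))  # tiles agree: only switcher loss
    print(hybrid_efficiency([0.8, 0.6], [1.0, 1.0]))  # tiles diverge: LDO dropout loss
```

The per-tile LDO buys fast, independent per-tile voltage control, and the cost shows up as dropout loss whenever tile voltages diverge from the shared rail.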

Recent grants

CSR: SMALL: Virtualized Accelerators for Scalable, Composable Architectures
NSF · $450k · 2017–2022
NSF CCF-CPA: Reliability in the Face of Variability under Nanoscale Technology Scaling
NSF · $500k · 2007–2012
CAREER: A Framework for Early-Stage Computer Architecture Design Space Exploration and Optimization
NSF · $400k · 2005–2011
COLLABORATIVE RESEARCH -- CSR-EHS: Integrated Power Delivery - Hardware-Software Techniques to Eliminate Off-Chip Regulation from Embedded Systems
NSF · $402k · 2007–2012
An adaptive alarm-based approach to high-performance/low-cost computing
NSF · $375k · 2004–2008

Frequent coauthors

Gu-Yeon Wei
210 shared
Udit Gupta
Harvard University
45 shared
Brandon Reagen
New York University
43 shared
Carole-Jean Wu
38 shared
Pradip Bose
IBM (United States)
38 shared
Paul N. Whatmough
28 shared
Vijay Janapa Reddi
27 shared
Mark Hempstead
24 shared
