Ian Foster

Professor of Computer Science · Verified

University of Chicago · Computer Science

Active 1980–2026

h-index: 150

Citations: 139.5k

Papers: 1.7k (516 in the last 5 years)

Funding: $25.5M (2 active)

About

Ian Foster is the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. His research centers on scientific computing, data science, and high-performance computing, with an emphasis on new computational paradigms and the foundations of data-driven scientific discovery. He is recognized for contributions to data-intensive scientific applications, and his work has influenced how complex scientific computations are performed and how data is managed and analyzed across academia, industry, and government.

Research topics

Computer Science
Artificial Intelligence
Political Science
Biology
Data science
Economics
Database
Engineering
Data Mining
Knowledge management
Engineering ethics
Ecology
Machine Learning
Computational biology
Biochemistry
Business
World Wide Web
Forestry
Algorithm
Genetics
Chemistry
Agroforestry
Medicine
Law and economics

Selected publications

Icicle: Scalable Metadata Indexing and Real-Time Monitoring for HPC File Systems
arXiv (Cornell University) · 2026-04-11
article · Open access · Senior author
Modern HPC file systems can contain billions of files and hundreds of petabytes of data, making even simple questions increasingly intractable to answer. Traditional file system utilities such as find and du fail to scale to these sizes. While external indexing tools like GUFI and Brindexer improve query performance, they remain batch-oriented and unsuitable for heterogeneous, rapidly evolving environments. We present Icicle, a scalable framework for continuous file system metadata indexing and monitoring. Icicle maintains a unified, up-to-date, and queryable view of file system state while supporting both periodic snapshot-based ingestion for bulk metadata updates and event-based ingestion for real-time synchronization from production systems such as Lustre and IBM Storage Scale. Built on Apache Kafka and Apache Flink, Icicle provides high-throughput, fault-tolerant, and horizontally scalable ingestion of metadata events into two complementary search indexes, enabling both individual file discovery and aggregate summary statistics by user, group, and directory. This architecture enables efficient support for both coarse-grained administrative queries and interactive analytics over billions of objects. Our experimental evaluation on production-scale HPC datasets demonstrates order-of-magnitude throughput improvements over existing monitoring and indexing approaches, with tunable options for balancing consistency, latency, and metadata freshness.
Automated, reliable, and efficient continental-scale replication of 7.3 petabytes of computational simulation data: A case study
The International Journal of High Performance Computing Applications · 2026-04-25
article · Senior author
We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) computational simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee, a task motivated by a need for increased reliability, capacity, and performance. This task presented significant challenges: the need to move 29 million files twice under time pressure from aging storage hardware; a source file system bottleneck limiting throughput to 1.5 GB/s; frequent site maintenance windows; and the need for complete reliability at scale. We addressed these challenges using a simple replication tool that invoked Globus to transfer large bundles of files while tracking progress in a database, dynamically rerouting transfers to work around maintenance periods and file system limitations. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences Network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure. The replication tool is available at https://github.com/esgf2-us/data-replication-tools.
FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access
2025-11-07 · 1 citation
article
We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments. This cluster-agnostic API allows requests to be distributed across federated clusters, targeting numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
Radio Afterglow Detection and AI-driven Response (RADAR): A Federated Framework for Gravitational-wave Event Follow-up
The Astrophysical Journal Supplement Series · 2025-10-01 · 2 citations
article · Open access
The landmark detection of both gravitational waves (GWs) and electromagnetic (EM) radiation from the binary neutron star merger GW170817 has spurred efforts to streamline the follow-up of GW alerts in current and future observing runs of ground-based GW detectors. Within this context, the radio band of the EM spectrum presents unique challenges. Sensitive radio facilities capable of detecting the faint radio afterglow seen in GW170817, and with sufficient angular resolution, have small fields of view compared to typical GW localization areas. Additionally, theoretical models predict that the radio emission from binary neutron star mergers can evolve over weeks to years, necessitating long-term monitoring to probe the physics of the various postmerger ejecta components. These constraints, combined with limited radio observing resources, make the development of more coordinated follow-up strategies essential, especially as the next generation of GW detectors promises a dramatic increase in detection rates. Here, we present RADAR, a framework designed to address these challenges by promoting community-driven information sharing, federated data analysis, and system resilience, while integrating AI methods for both GW signal identification and radio data aggregation. We show that it is possible to preserve data rights while sharing models that can help design and/or update follow-up strategies. We demonstrate our approach through a case study of GW170817, and discuss future directions for refinement and broader application.
Addressing Reproducibility Challenges in HPC with Continuous Integration
2025-11-12 · 1 citation
article · Open access
The high-performance computing (HPC) community has adopted incentive structures to motivate reproducible research, with major conferences awarding badges to papers that meet reproducibility requirements. Yet, many papers do not meet such requirements. The uniqueness of HPC infrastructure and software, coupled with strict access requirements, may limit opportunities for reproducibility. In the absence of resource access, we believe that regular documented testing, through continuous integration (CI), coupled with complete provenance information, can be used as a substitute. Here, we argue that better HPC-compliant CI solutions will improve reproducibility of applications. We present a survey of reproducibility initiatives and describe the barriers to reproducibility in HPC. To address existing limitations, we present a GitHub Action, CORRECT, that enables secure execution of tests on remote HPC resources. We evaluate CORRECT’s usability across three different types of HPC applications, demonstrating the effectiveness of using CORRECT for automating and documenting reproducibility evaluations.
Diamond: Harnessing GPU Resources for Scientific Deep Learning
2025-09-15
article
Modern research computing cyberinfrastructure, such as ACCESS-CI and NAIRR Pilot, offers GPU resources across geographically distributed clusters to accommodate the increasing needs of scientific deep learning (DL) workloads. Even for high-performance computing (HPC) experts, configuring environments and managing DL workloads across supercomputers remain significant barriers. To address these obstacles, we present Diamond, an open-source platform to simplify and streamline the DL lifecycle on HPC. Diamond provides an intuitive graphical interface that abstracts system-level complexity, enabling users to develop, debug, and deploy DL models with minimal overhead. We identify several challenges in building such a platform, including portability, security, and usability, and propose effective architectural solutions to each. Notably, Diamond enables users to share and reuse DL workload environments across systems and collaborators, reducing redundant setup efforts. Experimental results demonstrate that Diamond reduces the time to first successful deployment by an average of 68%, compared to manual configuration with command lines. The Diamond service is available at https://diamondhpc.ai.
FragmentGPT: A Unified GPT Model for Fragment Growing, Linking, and Merging in Molecular Design
ArXiv.org · 2025-09-14
preprint · Open access
Fragment-Based Drug Discovery (FBDD) is a popular approach in early drug development, but designing effective linkers to combine disconnected molecular fragments into chemically and pharmacologically viable candidates remains challenging. Further complexity arises when fragments contain structural redundancies, like duplicate rings, which cannot be addressed by simply adding or removing atoms or bonds. To address these challenges in a unified framework, we introduce FragmentGPT, which integrates two core components: (1) a novel chemically-aware, energy-based bond cleavage pre-training strategy that equips the GPT-based model with fragment growing, linking, and merging capabilities, and (2) a novel Reward Ranked Alignment with Expert Exploration (RAE) algorithm that combines expert imitation learning for diversity enhancement, data selection and augmentation for Pareto and composite score optimality, and Supervised Fine-Tuning (SFT) to align the learner policy with multi-objective goals. Conditioned on fragment pairs, FragmentGPT generates linkers that connect diverse molecular subunits while simultaneously optimizing for multiple pharmaceutical goals. It also learns to resolve structural redundancies, such as duplicated fragments, through intelligent merging, enabling the synthesis of optimized molecules. FragmentGPT facilitates controlled, goal-driven molecular assembly. Experiments and ablation studies on real-world cancer datasets demonstrate its ability to generate chemically valid, high-quality molecules tailored for downstream drug discovery tasks.
Strategic investments in data democratization for scientific innovation
The International Journal of High Performance Computing Applications · 2025-09-30
article · Senior author
The urgent need for data democratization in scientific research was the focal point of a panel discussion at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), held in Denver, Colorado, from November 12 to 17, 2023. This article summarizes the outcomes of that discussion and subsequent conversations. The panelists advocated for strategic investments in financial, human, and technological resources to achieve sustainable data democratization. Emphasizing that data is central to scientific discovery and AI deployment, the panel highlighted barriers such as limited access, inadequate financial incentives for cross-domain collaboration, and a shortage of workforce development initiatives. The recommendations in this article aim to guide decision-makers in fostering an inclusive research community, breaking down research silos, and developing a skilled workforce to advance scientific discovery through data democratization.
LangChain-Parsl: Connect Large Language Model Agents to High Performance Computing Resource
2025-11-07 · 2 citations
article · Open access
Large Language Models (LLMs) can improve performance in answering questions beyond their contextual understanding by running external tools, such as a calculator for arithmetic or an online query for real-time weather. For scientific applications, this enables the LLM to perform and analyze simulation runs for more accurate answers. However, the increasing scale of scientific computing requires high-performance computers (HPCs), which are managed by job schedulers. In this work, we integrated Parsl into LangChain tool calling to bridge the gap between the LLM agent and the HPC resource. Two implementations were set up and tested on a local Nvidia GPU workstation and the Polaris/ALCF HPC system. The first modifies LangChain tool calling so that LangChain tool calls are converted to Parsl functions and queued to the Parsl workers for parallel execution. The second provides a Parsl ensemble function as an LLM tool that performs parallel tasks. With these implementations, the LLM agent workflow was prompted to run molecular dynamics simulations with different protein structures and simulation conditions. The results show that our Parsl implementations enable parallel execution of scientific tools that are invoked by LLM agents on both local GPU workstations and HPC platforms.

Recent grants

Collaborative Research: SI2-SSI: SciDaaS - Data Management as a Service
NSF · $2.8M · 2012–2018
NIH Grant R01LM010132
NIH · $1.2M · 2012
Frameworks: Garden: A FAIR Framework for Publishing and Applying AI Models for Translational Research in Science, Engineering, Education, and Industry
NSF · $3.5M · 2022–2027
Building protected data sharing networks to advance cancer risk assessment and treatment
NIH · $4.0M · 2017–2022
BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate
NSF · $723k · 2017–2022

Frequent coauthors

Kyle Chard
University of Chicago
544 shared
Ben Blaiszik
Argonne National Laboratory
265 shared
Carl Kesselman
University of Southern California
188 shared
Ryan Chard
174 shared
Steven Tuecke
University of Chicago
171 shared
Rajkumar Kettimuthu
University of Chicago
167 shared
Ravi Madduri
Argonne National Laboratory
165 shared
Logan Ward
Argonne National Laboratory
151 shared

Labs

Research Labs & Groups · PI

Education

B.S., Computer Science
University of Canterbury
Ph.D., Computer Science
Imperial College

Awards & honors

2024 HPCwire 35 Legends List
2023 IEEE Internet Award
2022 ACM/IEEE Ken Kennedy Award
2020 DOE Office of Science Distinguished Scientist Fellow
2019 IEEE-CS Charles Babbage Award
