
About
Kyle Chard is a Research Associate Professor in the Department of Computer Science at the University of Chicago and a researcher at Argonne National Laboratory. His research focuses on developing new systems to address computational and data-intensive problems. Together with Ian Foster, he co-leads the Globus Labs research group, which investigates a broad range of research problems in distributed systems, data-intensive computing, learning systems, and research data management. The group emphasizes exploring theoretical concepts in systems and developing implementations that are usable by a wide range of people. His active research projects include Parsl, a parallel computing framework in Python; funcX, a distributed function-as-a-service platform; DLHub, a machine learning model publication and serving system; and Whole Tale, a multi-user platform for reproducible research. He received his Ph.D. from the Department of Engineering and Computer Science at Victoria University of Wellington in March 2011 and holds a BSc. (Hons) in Computer Science as well as a BSc. in Mathematics and Computer & Electronic Systems.
Research topics
- Computer Science
- Biology
- Computational biology
- Physics
- Database
- Geology
- Medicine
- Chemistry
- Geodesy
- Genetics
- Computer graphics (images)
- Astronomy
- Biochemistry
- Remote sensing
Selected publications
Icicle: Scalable Metadata Indexing and Real-Time Monitoring for HPC File Systems
arXiv (Cornell University) · 2026-04-11
preprint · Open access
Modern HPC file systems can contain billions of files and hundreds of petabytes of data, making even simple questions increasingly intractable to answer. Traditional file system utilities such as find and du fail to scale to these sizes. While external indexing tools like GUFI and Brindexer improve query performance, they remain batch-oriented and unsuitable for heterogeneous, rapidly evolving environments. We present Icicle, a scalable framework for continuous file system metadata indexing and monitoring. Icicle maintains a unified, up-to-date, and queryable view of file system state while supporting both periodic snapshot-based ingestion for bulk metadata updates and event-based ingestion for real-time synchronization from production systems such as Lustre and IBM Storage Scale. Built on Apache Kafka and Apache Flink, Icicle provides high-throughput, fault-tolerant, and horizontally scalable ingestion of metadata events into two complementary search indexes, enabling both individual file discovery and aggregate summary statistics by user, group, and directory. This architecture enables efficient support for both coarse-grained administrative queries and interactive analytics over billions of objects. Our experimental evaluation on production-scale HPC datasets demonstrates order-of-magnitude throughput improvements over existing monitoring and indexing approaches, with tunable options for balancing consistency, latency, and metadata freshness.
Vaccine Insights · 2026-03-26
article
The global vaccine manufacturing ecosystem is bifurcating between hyper-connected ‘Industry 4.0’ facilities and brownfield environments in low- and middle-income countries (LMICs) constrained by legacy infrastructure, unreliable connectivity and high bandwidth costs. This Commentary argues that achieving vaccine independence in the Global South requires an offline-first, sovereign edge architecture that enables ‘collective intelligence without data pooling’. By combining TinyML, hierarchical federated learning and stateful agentic middleware (software agents that retain persistent memory of local system states across network disruptions), LMIC manufacturers can deploy AI-driven process optimization and surveillance while retaining full data custody. Drawing on established governance blueprints such as MELLODDY and OpenSAFELY, and on emerging initiatives including CEPI’s Pandemic Preparedness Engine for Disease X, we outline how data product management units, versioned federation and byzantine-robust aggregation can align regulatory requirements with continuously learning systems. We conclude with practical next steps and highlight technical, financial and political challenges that must be addressed for sovereign, AI-enabled vaccine systems to move from pilots to field-ready infrastructure.
The International Journal of High Performance Computing Applications · 2026-04-25
article
We report on our experiences replicating 7.3 petabytes (PB) of Earth System Grid Federation (ESGF) computational simulation data from Lawrence Livermore National Laboratory (LLNL) in California to Argonne National Laboratory (ANL) in Illinois and Oak Ridge National Laboratory (ORNL) in Tennessee, a task motivated by a need for increased reliability, capacity, and performance. This task presented significant challenges: the need to move 29 million files twice under time pressure from aging storage hardware; a source file system bottleneck limiting throughput to 1.5 GB/s; frequent site maintenance windows; and the need for complete reliability at scale. We addressed these challenges using a simple replication tool that invoked Globus to transfer large bundles of files while tracking progress in a database, dynamically rerouting transfers to work around maintenance periods and file system limitations. Under the covers, Globus organized transfers to make efficient use of the high-speed Energy Sciences Network (ESnet) and the data transfer nodes deployed at participating sites, and also addressed security, integrity checking, and recovery from a variety of transient failures. This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure. The replication tool is available at https://github.com/esgf2-us/data-replication-tools.
arXiv (Cornell University) · 2026-04-17
preprint · Open access
High-performance computing (HPC) has evolved over decades through multiple architectural transitions, from vector supercomputers to massively parallel CPU clusters and GPU-accelerated systems, continuously expanding the frontier of scientific discovery. With the emergence of quantum processing units (QPUs) as practical computational accelerators, a new opportunity arises to further extend this trajectory by integrating quantum and classical computing paradigms. This paper presents Quantum Integrated High-Performance Computing (QHPC), a visionary architectural framework that unifies CPUs, GPUs, FPGAs, and QPUs as first-class heterogeneous resources. We propose a layered system design comprising unified resource management, quantum-aware scheduling, hybrid workflow orchestration, middleware and programming abstraction, interconnect technologies, and a tiered execution model enabling seamless workload partitioning across classical and quantum backends. A central aspect of our vision is a strong user requests abstraction layer that exposes heterogeneous resources through a unified job submission interface, similar in spirit to existing schedulers such as Slurm, allowing users to describe workloads in a consistent template independent of underlying compute type or location. Drawing insights from prior accelerator integration eras, we outline how QHPC can support emerging workloads in quantum chemistry, materials discovery, combinatorial optimization, and climate modeling. We conclude by highlighting open challenges in building scalable, reliable, and programmable quantum-classical infrastructures that seamlessly connect global users to heterogeneous compute resources for future quantum-classical HPC ecosystems.
arXiv (Cornell University) · 2026-03-20
preprint · Open access
Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize key sources of heterogeneity impacting the training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.
Managing distributed scientific workflows with Globus
DepositOnce · 2026-01-01
article · Open access · 1st author · Corresponding
Scientific workflows increasingly span remote computing resources, from local desktops and scientific instruments to supercomputers, clouds, and AI accelerators. This distribution is driven by the nature of modern data-driven research and the availability of specialized computing hardware. Distribution creates new opportunities to improve performance and efficiency by exploiting resource heterogeneity and locality; however, it also creates new challenges related to portability and security. In this chapter, we describe Globus, a platform designed to tackle these challenges via a hybrid model in which cloud services securely manage the remote execution of arbitrary research activities. We describe how Globus Flows, a cloud-hosted workflow platform, combined with Globus Compute and Globus Transfer, enables researchers to define and execute workflows across diverse distributed computing resources. We present several example applications in real-time instrument analysis, simulation campaigns, and distributed model training that demonstrate how Globus addresses challenges in real-world scenarios.
Addressing Reproducibility Challenges in HPC with Continuous Integration
2025-11-12 · 1 citation
article · Open access · Senior author
The high-performance computing (HPC) community has adopted incentive structures to motivate reproducible research, with major conferences awarding badges to papers that meet reproducibility requirements. Yet, many papers do not meet such requirements. The uniqueness of HPC infrastructure and software, coupled with strict access requirements, may limit opportunities for reproducibility. In the absence of resource access, we believe that regular documented testing, through continuous integration (CI), coupled with complete provenance information, can be used as a substitute. Here, we argue that better HPC-compliant CI solutions will improve reproducibility of applications. We present a survey of reproducibility initiatives and describe the barriers to reproducibility in HPC. To address existing limitations, we present a GitHub Action, CORRECT, that enables secure execution of tests on remote HPC resources. We evaluate CORRECT’s usability across three different types of HPC applications, demonstrating the effectiveness of using CORRECT for automating and documenting reproducibility evaluations.
Diamond: Harnessing GPU Resources for Scientific Deep Learning
2025-09-15
article
Modern research computing cyberinfrastructure, such as ACCESS-CI and NAIRR Pilot, offers GPU resources across geographically distributed clusters to accommodate the increasing needs of scientific deep learning (DL) workloads. Even for high-performance computing (HPC) experts, configuring environments and managing DL workloads across supercomputers remain significant barriers. To address these obstacles, we present Diamond, an open-source platform to simplify and streamline the DL lifecycle on HPC. Diamond provides an intuitive graphical interface that abstracts system-level complexity, enabling users to develop, debug, and deploy DL models with minimal overhead. We identify several challenges in building such a platform, including portability, security, and usability, and propose effective architectural solutions to each. Notably, Diamond enables users to share and reuse DL workload environments across systems and collaborators, reducing redundant setup efforts. Experimental results demonstrate that Diamond reduces the time to first successful deployment by an average of 68%, compared to manual configuration with command lines. The Diamond service is available at https://diamondhpc.ai.
Recent grants
NSF · $2.8M · 2016–2022
CSR: Small: Cost-Aware Cloud Profiling, Prediction, and Provisioning as a Service
NSF · $500k · 2018–2023
NSF · $15k · 2020–2021
Frequent coauthors
- 544 shared
Ian Foster
University of Chicago
- 150 shared
Ryan Chard
- 141 shared
Yadu Babuji
- 94 shared
Ben Blaiszik
Argonne National Laboratory
- 78 shared
Steven Tuecke
University of Chicago
- 67 shared
Ravi Madduri
Argonne National Laboratory
- 60 shared
Zhuozhao Li
Southern University of Science and Technology
- 59 shared
Logan Ward
Argonne National Laboratory
Education
- 2011
Ph.D., Computer Science
Victoria University of Wellington
Awards & honors
- IEEE TCHPC Award for Excellence for Early Career Researchers…
- Globus team R&D100 award
- New Zealand Top Achiever Doctoral Scholarship