Yue Cheng

Assistant Professor, Computer Science · Assistant Professor, Data Science · Verified

University of Virginia · Computer Science

Active 1999–2026

h-index: 20

Citations: 1.2k

Papers: 114 (70 in the last 5 years)

Funding: $1.3M (1 active)

About

Yue Cheng is an Assistant Professor at the University of Virginia, holding a dual appointment in the School of Data Science and the Department of Computer Science. Prior to joining UVA in 2022, he served as an Assistant Professor of Computer Science at George Mason University. His research interests include distributed systems, cloud and serverless computing, high-performance computing, and operating systems. Cheng's work is driven by the complexities of modern data-intensive computer systems and aims to develop more efficient and user-friendly approaches to manage these complexities. His current research focuses on designing efficient data systems for data science, including the development of efficient stateful serverless computing systems through a full-stack approach that spans applications, platforms, and hardware, as well as building improved computing and storage systems for distributed machine learning.

Research topics

Computer Science
Computer network
Distributed computing
Artificial Intelligence
Machine Learning
Computer Security
Embedded system

Selected publications

Dual-axis myelination covariance drives the functional connectivity emergence during infancy
Nature Communications · 2026-03-19
article · Open access
The mechanisms linking structural maturation to the emergence of functional networks in the perinatal brain remain unresolved. While prevailing models attribute functional connectivity to white matter myelination, neonates paradoxically exhibit adult-like resting-state networks despite profoundly immature white matter tracts. Here, we proposed gray matter myelination covariance as a critical basis of early functional connectivity emergence. We introduced a dual-axis myelination covariance framework and derived a myelination-function coupling (MFC) index specific to the newborn brain. Results revealed that the MFC exhibited distinct spatial patterns dominated by primary sensory and motor cortices, increased with age, and showed a distance-dependent strength. Crucially, neonatal MFC patterns showed a strong spatial correlation with gene expression profiles implicated in neurovascular coupling and specifically predicted later behaviors. These findings suggest that during infancy, the integration of brain function is not initially dominated by only the white matter connections but is also shaped by the synchrony of intracortical microstructure that reflects shared developmental trajectories, which offers a framework for understanding the formation of the developmental connectome.
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
arXiv.org · 2025-02-14
preprint · Open access
Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
arXiv.org · 2025-05-18
preprint · Open access · Senior author
Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs
arXiv (Cornell University) · 2025-03-26
preprint · Open access · Senior author
Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high interarrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
Centralization in the Decentralized Web: Challenges and Opportunities in IPFS Data Management
2025-04-22 · 6 citations
article · Open access
The InterPlanetary File System (IPFS) is a pioneering effort for Web 3.0, well-known for its decentralized infrastructure. However, some recent studies have shown that IPFS exhibits a high degree of centralization and has integrated centralized components for improved performance. While this change contradicts the core decentralized ethos of IPFS and introduces risks of hurting the data replication level and thus availability, it also opens some opportunities for better data management and cost savings through deduplication.
Fine-tuning and electronic modulation of AuPdCu nanoflowers assembled with nanowires for robust ethanol oxidation reaction performance
Nanoscale · 2025-12-15
article
CO* intermediate on AuPdCu NPs is enhanced, thereby promoting the EOR process along the C1 pathway. This ternary metal fine-tuning alloying approach presents a viable route for fabricating highly active and durable EOR materials.
Strong electronic interactions stem from lattice strain control in PdSnCu nanochains for robust electrocatalytic ethanol oxidation
Materials Research Bulletin · 2025-11-25 · 1 citation
article
The Decentralization Dilemma: Performance Trade-Offs in IPFS and Breakpoints
2025-10-28
article · Open access
Web 3.0 is redefining the current Web (Web 2.0) with a focus on data and governance decentralization. The InterPlanetary File System (IPFS) exemplifies this shift. However, it faces a trade-off between decentralization and performance: prior studies have shown IPFS's performance degradations but fail to diagnose root causes or deliver actionable fixes.
NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs
2025-12-11
article · Open access · Senior author
Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high inter-arrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression
arXiv (Cornell University) · 2025-04-30
preprint · Open access · Senior author
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and compression -- are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication is better aligned with model storage workloads, achieving high data reduction with low metadata overhead. Building on these insights, we design BitX, an effective, fast, lossless delta compression algorithm that compresses XORed difference between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 54%, over 20% higher than state-of-the-art deduplication and compression approaches.

Recent grants

SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications
NSF · $121k · 2022–2024
CAREER: Harnessing Serverless Functions to Build Highly Elastic Cloud Storage Infrastructure
NSF · $349k · 2021–2023
SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications
NSF · $321k · 2019–2023
CAREER: Harnessing Serverless Functions to Build Highly Elastic Cloud Storage Infrastructure
NSF · $480k · 2022–2027

Frequent coauthors

Ali Anwar
37 shared
Ali R. Butt
Virginia Tech
32 shared
Gaoyan Zhang
Tianjin University
16 shared
Lixiang Huang
Fujian Women and Children Hospital
16 shared
Xiaodong Zhang
University of Electronic Science and Technology of China
16 shared
Jia-Min Zhou
Tianjin Medical University
16 shared
Shen Wen
Tianjin First Center Hospital
16 shared
Yuexuan Li
University of Minnesota Medical Center
16 shared
