
Yue Cheng
Assistant Professor, Computer Science · Assistant Professor, Data Science · University of Virginia · Computer Science
Active 1999–2026
About
Yue Cheng is an Assistant Professor at the University of Virginia, holding a dual appointment in the School of Data Science and the Department of Computer Science. Prior to joining UVA in 2022, he served as an Assistant Professor of Computer Science at George Mason University. His research interests include distributed systems, cloud and serverless computing, high-performance computing, and operating systems. Cheng's work is driven by the complexities of modern data-intensive computer systems and aims to develop more efficient and user-friendly approaches to manage these complexities. His current research focuses on designing efficient data systems for data science, including the development of efficient stateful serverless computing systems through a full-stack approach that spans applications, platforms, and hardware, as well as building improved computing and storage systems for distributed machine learning.
Research topics
- Computer Science
- Computer network
- Distributed computing
- Artificial Intelligence
- Machine Learning
- Computer Security
- Embedded system
Selected publications
Dual-axis myelination covariance drives the functional connectivity emergence during infancy
Nature Communications · 2026-03-19
Article · Open access
The mechanisms linking structural maturation to the emergence of functional networks in the perinatal brain remain unresolved. While prevailing models attribute functional connectivity to white matter myelination, neonates paradoxically exhibit adult-like resting-state networks despite profoundly immature white matter tracts. Here, we proposed gray matter myelination covariance as a critical basis of early functional connectivity emergence. We introduced a dual-axis myelination covariance framework and derived a myelination-function coupling (MFC) index specific to the newborn brain. Results revealed that the MFC exhibited distinct spatial patterns dominated by primary sensory and motor cortices, increased with age, and showed a distance-dependent strength. Crucially, neonatal MFC patterns showed a strong spatial correlation with gene expression profiles implicated in neurovascular coupling and specifically predicted later behaviors. These findings suggest that during infancy, the integration of brain function is not initially dominated by only the white matter connections but is also shaped by the synchrony of intracortical microstructure that reflects shared developmental trajectories, which offers a framework for understanding the formation of the developmental connectome.
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference
ArXiv.org · 2025-02-14
Preprint · Open access
Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
ArXiv.org · 2025-05-18
Preprint · Open access · Senior author
Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs
arXiv (Cornell University) · 2025-03-26
Preprint · Open access · Senior author
Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high interarrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
Centralization in the Decentralized Web: Challenges and Opportunities in IPFS Data Management
2025-04-22 · 6 citations
Article · Open access
The InterPlanetary File System (IPFS) is a pioneering effort for Web 3.0, well-known for its decentralized infrastructure. However, some recent studies have shown that IPFS exhibits a high degree of centralization and has integrated centralized components for improved performance. While this change contradicts the core decentralized ethos of IPFS and introduces risks of hurting the data replication level and thus availability, it also opens some opportunities for better data management and cost savings through deduplication.
Nanoscale · 2025-12-15
Article
CO* intermediate on AuPdCu NPs is enhanced, thereby promoting the EOR process along the C1 pathway. This ternary metal fine-tuning alloying approach presents a viable route for fabricating highly active and durable EOR materials.
Materials Research Bulletin · 2025-11-25 · 1 citation
Article
The Decentralization Dilemma: Performance Trade-Offs in IPFS and Breakpoints
2025-10-28
Article · Open access
Web 3.0 is redefining the current Web (Web 2.0) with a focus on data and governance decentralization. The InterPlanetary File System (IPFS) exemplifies this shift. However, it faces a trade-off between decentralization and performance: prior studies have shown IPFS's performance degradations but fail to diagnose root causes or deliver actionable fixes.
NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs
2025-12-11
Article · Open access · Senior author
Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high inter-arrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression
arXiv (Cornell University) · 2025-04-30
Preprint · Open access · Senior author
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and compression -- are either LLM-oblivious or not compatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication is better aligned with model storage workloads, achieving high data reduction with low metadata overhead. Building on these insights, we design BitX, an effective, fast, lossless delta compression algorithm that compresses XORed difference between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 54%, over 20% higher than state-of-the-art deduplication and compression approaches.
Recent grants
NSF · $121k · 2022–2024
CAREER: Harnessing Serverless Functions to Build Highly Elastic Cloud Storage Infrastructure
NSF · $349k · 2021–2023
NSF · $321k · 2019–2023
CAREER: Harnessing Serverless Functions to Build Highly Elastic Cloud Storage Infrastructure
NSF · $480k · 2022–2027
Frequent coauthors
- Ali Anwar · 37 shared
- Ali R. Butt · Virginia Tech · 32 shared
- Gaoyan Zhang · Tianjin University · 16 shared
- Lixiang Huang · Fujian Women and Children Hospital · 16 shared
- Xiaodong Zhang · University of Electronic Science and Technology of China · 16 shared
- Jia-Min Zhou · Tianjin Medical University · 16 shared
- Shen Wen · Tianjin First Center Hospital · 16 shared
- Yuexuan Li · University of Minnesota Medical Center · 16 shared