Tingyi "Leo" Liu

Professor · Verified

University of Massachusetts Amherst · Materials Science and Engineering

Active 2008–2026

h-index: 15

Citations: 1.1k

Papers: 47 (20 in the last 5 years)

Funding: $1.1M

About

Tingyi "Leo" Liu is an Associate Professor in the Department of Mechanical and Industrial Engineering at UMass Amherst, affiliated with the Riccio College of Engineering. His research focuses on bio-inspired soft robotics, soft electronics, and medical devices, as well as advanced manufacturing, heterogeneous integration, and roll-to-roll (R2R) manufacturing. He also works on super-repellent, superomniphobic, and superhydrophobic surfaces and on liquid-metal-based micro/nano-devices. Dr. Liu holds a PhD and MS in Mechanical Engineering from the University of California, Los Angeles (UCLA), and a Bachelor's degree in Electrical Engineering from Zhejiang University in China. His work involves interdisciplinary interface engineering, contributing to healthcare and biomedicine, advanced manufacturing, and supply chain management.

Research topics

Computer Science
Data Mining
Theoretical computer science
Artificial Intelligence
Algorithm
Programming language
Computer architecture
Distributed computing
Mathematics
Parallel computing
Operating system

Selected publications

Long-term Monitoring of Kernel and Hardware Events to Understand Latency Variance
ArXiv.org · 2026-01-15
article · Open access
This paper presents our experience to understand latency variance caused by kernel and hardware events, which are often invisible at the application level. For this purpose, we have built VarMRI, a tool chain to monitor and analyze those events in the long term. To mitigate the "big data" problem caused by long-term monitoring, VarMRI selectively records a subset of events following two principles: it only records events that are affecting the requests recorded by the application; it records coarse-grained information first and records additional information only when necessary. Furthermore, VarMRI introduces an analysis method that is efficient on large amount of data, robust on different data set and against missing data, and informative to the user. VarMRI has helped us to carry out a 3,000-hour study of six applications and benchmarks on CloudLab. It reveals a wide variety of events causing latency variance, including interrupt preemption, Java GC, pipeline stall, NUMA balancing etc.; simple optimization or tuning can reduce tail latencies by up to 31%. Furthermore, the impacts of some of these events vary significantly across different experiments, which confirms the necessity of long-term monitoring.
AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure
ArXiv.org · 2025-02-22
preprint · Open access
We introduce AIBrix, a cloud-native, open-source framework designed to optimize and simplify large-scale LLM deployment in cloud environments. Unlike traditional cloud-native stacks, AIBrix follows a co-design philosophy, ensuring every layer of the infrastructure is purpose-built for seamless integration with inference engines like vLLM. AIBrix introduces several key innovations to reduce inference costs and enhance performance including high-density LoRA management for dynamic adapter scheduling, LLM-specific autoscalers, and prefix-aware, load-aware routing. To further improve efficiency, AIBrix incorporates a distributed KV cache, boosting token reuse across nodes, leading to a 50% increase in throughput and a 70% reduction in inference latency. AIBrix also supports unified AI runtime which streamlines model management while maintaining vendor-agnostic engine compatibility. For large-scale multi-node inference, AIBrix employs hybrid orchestration -- leveraging Kubernetes for coarse-grained scheduling and Ray for fine-grained execution -- to balance efficiency and flexibility. Additionally, an SLO-driven GPU optimizer dynamically adjusts resource allocations, optimizing heterogeneous serving to maximize cost efficiency while maintaining service guarantees. Finally, AIBrix enhances system reliability with AI accelerator diagnostic tools, enabling automated failure detection and mock-up testing to improve fault resilience. AIBrix is available at https://github.com/vllm-project/aibrix.
An Empirical Study of Microscaling Formats for Low-Precision LLM Training
2025-05-04 · 3 citations
article
This paper presents a comprehensive evaluation of microscaling (MX) quantization in the pre-training of large language models (LLMs), investigating its potential to enhance the computation and memory efficiencies. We systematically examine the effects of key design parameters - including data types, rounding modes, scaling strategies, granularity, and organization - on numerical accuracy and training stability. Our extensive experimental study on Llama3 models reveals critical insights into the challenges of 4-bit training for LLMs and identifies optimal configurations with mixed precisions of 4-bit and 6-bit MX formats that significantly enhance training quality, bridging the gap with higher-precision formats. This research provides valuable guidance on the benefits and limitations of MX quantization, laying the groundwork for future innovations in low-precision LLM training.
Exploring Performance and Cost Optimization with ASIC-Based CXL Memory
2024-04-18 · 31 citations
article · Open access
As memory-intensive applications continue to drive the need for advanced architectural solutions, Compute Express Link (CXL) has risen as a promising interconnect technology that enables seamless high-speed, low-latency communication between host processors and various peripheral devices. In this study, we explore the application performance of ASIC CXL memory in various data-center scenarios. We then further explore multiple potential impacts (e.g., throughput, latency, and cost reduction) of employing CXL memory via carefully designed policies and strategies. Our empirical results show the high potential of CXL memory, reveal multiple intriguing observations of CXL memory and contribute to the wide adoption of CXL memory in real-world deployment environments. Based on our benchmarks, we also develop an Abstract Cost Model that can estimate the cost benefit from using CXL memory.
AdapMTL: Adaptive Pruning Framework for Multitask Learning Model
2024-10-26 · 3 citations
preprint · Open access · Senior author
In the domain of multimedia and multimodal processing, the efficient handling of diverse data streams such as images, video, and sensor data is paramount. Model compression and multitask learning (MTL) are crucial in this field, offering the potential to address the resource-intensive demands of processing and interpreting multiple forms of media simultaneously. However, effectively compressing a multitask model presents significant challenges due to the complexities of balancing sparsity allocation and accuracy performance across multiple tasks. To tackle these challenges, we propose AdapMTL, an adaptive pruning framework for MTL models. AdapMTL leverages multiple learnable soft thresholds independently assigned to the shared backbone and the task-specific heads to capture the nuances in different components' sensitivity to pruning. During training, it co-optimizes the soft thresholds and MTL model weights to automatically determine the suitable sparsity level at each component to achieve both high task accuracy and high overall sparsity. It further incorporates an adaptive weighting mechanism that dynamically adjusts the importance of task-specific losses based on each task's robustness to pruning. We demonstrate the effectiveness of AdapMTL through comprehensive experiments on popular multitask datasets, namely NYU-v2 and Tiny-Taskonomy, with different architectures, showcasing superior performance compared to state-of-the-art pruning methods.
Scaler: Efficient and Effective Cross Flow Analysis
2024-10-18
article · Open access · Senior author
Performance analysis is challenging as different components (e.g., different libraries, and applications) of a complex system can interact with each other. However, few existing tools focus on understanding such interactions. To bridge this gap, we propose a novel analysis method, "Cross Flow Analysis (XFA)", that monitors the interactions/flows across these components. We also built the Scaler profiler that provides a holistic view of the time spent on each component (e.g., library or application) and every API inside each component. This paper proposes multiple new techniques, such as Universal Shadow Table, and Relation-Aware Data Folding. These techniques enable Scaler to achieve low runtime overhead, low memory overhead, and high profiling accuracy. Based on our extensive experimental results, Scaler detects multiple unknown performance issues inside widely-used applications, and therefore will be a useful complement to existing work.
Understanding and Alleviating Memory Consumption in RLHF for LLMs
arXiv (Cornell University) · 2024-10-21
preprint · Open access · Senior author
Fine-tuning with Reinforcement Learning with Human Feedback (RLHF) is essential for aligning large language models (LLMs). However, RLHF often encounters significant memory challenges. This study is the first to examine memory usage in the RLHF context, exploring various memory management strategies and unveiling the reasons behind excessive memory consumption. Additionally, we introduce a simple yet effective approach that substantially reduces the memory required for RLHF fine-tuning.
Improving Resource and Energy Efficiency for Cloud 3D through Excessive Rendering Reduction
2024-04-18 · 1 citation
article · Open access
The rise of cloud gaming makes interactive 3D applications an emerging type of data center workload. However, the excessive rendering in current cloud 3D systems leads to large gaps between the cloud and client frame rates (FPS, frames per second), thus wasting resources and power. Although FPS regulation can remove excessive rendering, due to the highly-varying frame processing time and the use of rendering delays, existing cloud FPS regulation solutions have low FPS and slow motion-to-photon (MtP) latency, causing violations of Quality-of-Service (QoS) requirements.
Profile Dynamic Memory Allocation in Autonomous Driving Software
2023-08-10 · 1 citation
article · Senior author
The software-defined vehicle has driven the autonomy and electrification of the automotive industry. A technical challenge for software designers is how to leverage existing software from AI research and autonomous driving (AD) development and make it useful, reliable, and efficient for customer requirements and functional safety standards. However, the software is critical in autonomous driving (AD) systems, where it should ensure reliability and real-time guarantee simultaneously. Further, the AD industry may re-utilize existing mature software implementation (e.g., C++ STL libraries) in order to accelerate development. However, the jeopardy of reliability and real-time guarantee caused by dynamic memory management inside remains a major concern for practitioners in the field. This paper presents a software tool (called MemTrace) to conveniently analyze the dynamic memory management behavior of AD software and provide important analytical results for software designers to make judgments on software quality, run-time efficiency, and safety with high confidence. MemTrace relies on interception and instrumentation for profiling the explicit allocation behavior of general software, as well as the implicit memory allocations of using C++ STL containers and smart pointers. The profiling data will be analyzed for the behavior that could jeopardize software safety, for example, memory leak and memory external fragmentation. Our experiment results show that MemTrace can effectively provide detailed per-iteration results for AD software modules, and identify potential memory-related hazards. Through the profiling of prototype AD software with MemTrace, we have gained 6 insightful observations, including the potential risks associated with prolonged usage, and suggestions for effective utilization of STL containers and smart pointers in AD software, which can assist AD software developers during the development process.

Recent grants

SPX: Collaborative Research: Pinpointing and Resolving Scalability Culprits Hidden in Different Components of the Whole System Stack
NSF · $431k · 2019–2023
CRII: SHF: EVID: Evidence-Assisted Detection and Elimination of Memory Errors in Single and Multi-threaded Programs
NSF · $207k · 2016–2019
SPX: Collaborative Research: Pinpointing and Resolving Scalability Culprits Hidden in Different Components of the Whole System Stack
NSF · $500k · 2018–2020

Frequent coauthors

Steven Tang
The Ohio State University
13 shared
Emery D. Berger
12 shared
Sam Silvestro
The University of Texas at San Antonio
11 shared
Jianjun Chen
10 shared
Mingcan Xiang
University of Massachusetts Amherst
10 shared
Bo Wu
Yibin University
10 shared
Yang Wang
The Ohio State University
9 shared
Hongyu Liu
8 shared

Labs

Interdisciplinary Interface Engineering Laboratory · PI

Awards & honors

UMass Board of Trustees Awards Tenure and Promotion to Six C…
UMass Amherst ADVANCE Fellows
NIH Trailblazer Awards
