
T. S. Eugene Ng
· Professor of Computer Science and of Electrical and Computer Engineering; Chair, CS Grad Committee; Member, Ken Kennedy Institute
Rice University · Computer Science
Active 1989–2025
About
T. S. Eugene Ng is a Professor of Computer Science and Electrical & Computer Engineering at Rice University. He is recognized as an IEEE Fellow, an Alfred P. Sloan Research Fellow, a Distinguished Member of the ACM, and a Kavli Fellow. His accolades include receiving an IBM Faculty Award in 2009 and a National Science Foundation CAREER Award in 2005. He earned his B.S. in Computer Engineering with distinction and magna cum laude from the University of Washington, followed by an M.S. and a Ph.D. in Computer Science from Carnegie Mellon University. He holds six U.S. patents. His current research focuses on developing new network models, network architectures, and holistic networked systems aimed at enabling a robust and manageable global networked infrastructure for the future. Professor Ng's research spans several key areas including BOLD (Big data and Optical Lightpaths Driven) networking, telemetry and congestion control, and efficient deep neural network training. His work in BOLD networking includes advances such as an efficient optical network core and innovations for rack topology flexibility and low-power multicast. In telemetry and congestion control, he has contributed to weighted bandwidth allocation, data center network sharing, SmartNIC performance, closed-loop performance monitoring, max-min fair congestion control, and deadlock management. His research on efficient deep neural networks includes developments in sparse tensor training, fast training failure recovery, gradient compression optimization, and optimal gradient communication strategies. Throughout his career, Professor Ng has also led numerous projects addressing network control and management, cloud and data center networking, network data plane bug detection, and Internet geometry. 
His work aims to create dependable, secure, and scalable network blueprints and operating platforms, as well as to develop geometric models of the Internet's structural properties to enable scalable performance-aware protocols and applications. He has contributed to the design of multicast systems, clean slate Internet architecture redesigns, and network services facilitating IPv6 transition and multicast routing. His extensive research portfolio reflects a commitment to advancing the understanding and capabilities of networked systems and infrastructures.
Research topics
- Computer Science
- Distributed computing
- Computer Security
- Embedded system
- Computer network
- Operating system
Selected publications
Söze: One Network Telemetry Is All You Need for Per-flow Weighted Bandwidth Allocation at Scale
ArXiv.org · 2025-06-01
preprint · Open access · Senior author
Weighted bandwidth allocation is a powerful abstraction that has a wide range of use cases in modern data center networks. However, realizing highly agile and precise weighted bandwidth allocation for large-scale cloud environments is fundamentally challenging. In this paper, we propose Söze, a lightweight decentralized weighted bandwidth allocation system that leverages simple network telemetry features of commodity Ethernet switches. Given the flow weights, Söze can effectively use the telemetry information to compute and enforce the weighted bandwidth allocations without per-flow, topology, or routing knowledge. We demonstrate the effectiveness of Söze through simulations and testbed experiments, improving TPC-H jobs completion time by up to $0.59\times$ and $0.79\times$ on average.
SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling
ArXiv.org · 2025-08-25
preprint · Open access
Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SUPERGEN, an efficient tile-based framework for ultra-high-resolution video generation. SUPERGEN features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SUPERGEN incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SUPERGEN also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations show that SUPERGEN maximizes performance gains while achieving high output quality across various benchmarks.
NSX: Large-Scale Network Simulation on an AI Server
2025-09-02
article · Open access
Network innovation is key to supporting AI workloads. Packet-level simulation is indispensable for testing new network features as it enables high-fidelity experimentation. However, today's simulators struggle to scale to large topologies that are typical to AI clusters. To scale the simulation, we have built NSX which is a new simulator that takes advantage of AI servers themselves (e.g., NVIDIA's DGX) to experiment with AI networks. Network simulation has unique workload characteristics that make AI servers an ideal fit: relatively simple, parallelizable compute, with high memory bandwidth pressure. Yet, in order to fully leverage this platform, we need new techniques to rearchitect network simulators for GPU execution. We describe the design decisions that have gone into NSX, and report evaluation results from our current prototype: NSX can scale simulation to networks of 524k nodes, and it finishes 0.1 ms simulation in less than 2 seconds on a DGX-H100 box. NSX is being used by NVIDIA's networking team on a daily basis for AI cluster design, and new features are added to it on a regular basis.
Rearchitecting Datacenter Networks: A New Paradigm with Optical Core and Optical Edge
2024-05-20 · 3 citations
article · Senior author
All-optical circuit-switching (OCS) technology is the key to design energy-efficient and high-performance datacenter network (DCN) architectures for the future. However, existing round-robin based OCS cores perform poorly under realistic workloads having high traffic skewness and high volume of inter-rack traffic. To address this issue, we propose a novel DCN architecture OSSV: a combination of OCS-based core (between ToR switches) and OCS-based reconfigurable edge (between servers and ToR switches). On one hand, the OCS core is traffic agnostic and realizes reconfigurably non-blocking ToR-level connectivity. On the other hand, OCS-based edge reconfigures itself to reshape the incoming traffic in order to jointly minimize traffic skewness and inter-rack traffic volume. Our novel optimization framework can obtain the right balance between these intertwined objectives. Our extensive simulations and testbed evaluation show that OSSV can achieve high performance under diverse DCN traffic while consuming low power and incurring low cost.
Unleashing SmartNIC Packet Processing Performance in P4
2023-09-01 · 21 citations
article
SmartNICs are on the rise as a packet processing platform, with the trend towards a uniform P4 programming model. However, unleashing SmartNIC packet processing performance in P4 is a formidable task. Traditional SmartNIC optimizations rely on low-level program tuning, but P4 abstractions operate at one level above. At the same time, today's P4 optimizations primarily focus on resource packing rather than performance tuning. We develop Pipeleon, an automated performance optimization framework for P4 programmable SmartNICs. We introduce techniques that are tailored to the performance characteristics of SmartNICs, and further leverage dynamic workload patterns for profile-guided optimization. Pipeleon pinpoints program hotspots at the P4 level and computes runtime optimization plans to specialize the program layout based on the latest profile. We have prototyped Pipeleon and applied it to optimize two popular P4 SmartNICs (Nvidia BlueField2 and Netronome Agilio CX) as well as a software SmartNIC emulator extended based on BMv2. Our results show that Pipeleon significantly improves SmartNIC packet processing performance in realistic scenarios.
Augmented Queue: A Scalable In-Network Abstraction for Data Center Network Sharing
2023-09-01 · 6 citations
article · Senior author
Traffic aggregates in cloud data center networks are by and large buffered and transmitted by simple physical FIFO queues. Despite the crucial role they play, a well-known problem of physical FIFO queues is that they are unable to provide precise bandwidth guarantees. This leads to a range of negative impacts spanning the application layer, the transport layer, and the data link layer.
Empowering Distributed Training with Sparsity-driven Data Synchronization
arXiv (Cornell University) · 2023-09-23
preprint · Open access
Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs. Its performance bottleneck lies in communications for gradient synchronization. Although high tensor sparsity is widely observed, the optimal communication scheme to fully leverage sparsity is still missing. This paper aims to bridge this gap. We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones. These findings give a new understanding and inspire us to develop a holistic gradient synchronization system called Zen for sparse tensors. We demonstrate that Zen can achieve up to $5.09\times$ speedup in communication time and up to $2.48\times$ speedup in training throughput compared to the state-of-the-art methods.
2023-05-05 · 23 citations
article · Senior author
Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). It saves the communication time, but also incurs additional computation overheads. The training throughput of compression-enabled DDL is determined by the compression strategy, including whether to compress each tensor, the type of compute resources (e.g., CPUs or GPUs) for compression, the communication schemes for compressed tensor, and so on. However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express any compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy?
Poster: Near Non-blocking Performance with All-optical Circuit-switched Core
2023-09-01 · 3 citations
article · Senior author
All-optical circuit-switched (OCS) core is the holy grail for the future generation datacenter architectures. However, such proposals consist of a common operational abstraction termed as round-robin circuit scheduling, which heavily suffers from a) high traffic skewness, and b) high volume of inter-rack traffic. To address this issue, we propose a novel architecture: round-robin OCS-core equipped with OCS-based reconfigurable edge for joint Skewness and Inter-rack traffic Volume (SV) minimization. Our architecture significantly improves the performance of all-optical cores, making it very close to a non-blocking network.
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
2023 · 61 citations
- Computer Science
- Distributed computing
Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to large-scale resources involved and extended training time. Existing solutions have significant failure recovery costs due to the severe restriction imposed by the bandwidth of remote storage in which they store checkpoints.
Recent grants
NeTS-FIND: An Architecture for Network Control Management
NSF · $520k · 2007–2012
NeTS: Small: Convertible Data Center Networks
NSF · $500k · 2017–2022
NSF · $412k · 2005–2013
II-NEW: BOLD: Big Data and Optical Lightpaths-Driven Networked Systems Research Infrastructure
NSF · $900k · 2013–2017
NSF · $500k · 2018–2023
Frequent coauthors
- Yiting Xia · 13 shared
- Ion Stoica · 11 shared
- Florin Dinu · Huawei Technologies (China) · 11 shared
- Kunwadee Sripanidkulchai · Chulalongkorn University · 10 shared
- Xiaoye Steven Sun · Rice University · 10 shared
- Guohui Wang · Shenyang Ligong University · 10 shared
- Ang Chen · Jiangsu University · 9 shared
- Weitao Wang · 8 shared
Education
- 2003
Ph.D., Computer Science
Carnegie Mellon University
Awards & honors
- IEEE Fellow (2023)
- Alfred P. Sloan Research Fellow (2009)
- Kavli Fellow
- IBM Faculty Award (2009)
- National Science Foundation CAREER Award (2005)