
Wu-chun Feng
Professor · Virginia Tech · Computer Science
Active 1996–2026
About
Wu-chun Feng is a professor in the Department of Computer Science at Virginia Tech. His research interests include high-performance computing and computational science, data analytics systems, computational biology and bioinformatics, and machine learning. He holds a Ph.D. in computer science from the University of Illinois at Urbana-Champaign (1996), as well as a master's degree in computer engineering, a bachelor's degree in electrical and computer engineering, and a B.A. in music from Penn State University. His office is in Torgersen Hall, Room 2050, at Virginia Tech, and he is involved with multiple research institutes and centers. He can be reached by email (feng@cs.vt.edu) or phone (540-231-1192).
Research topics
- Computer Science
- Artificial Intelligence
- Parallel computing
- Machine Learning
- Programming language
- Embedded system
- Theoretical computer science
- Computer architecture
- Distributed computing
- Operating system
- Algorithm
Selected publications
On the Efficacy of PyTorch for High-Performance Computing: A Case Study in Computational Physics
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-14
other · Open access · Senior author
This artifact accompanies the paper “On the Efficacy of PyTorch for High-Performance Computing: A Case Study in Computational Physics,” published in Proceedings of the 23rd ACM International Conference on Computing Frontiers (CF ’26), May 19–21, 2026, Catania, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3801487.3801838
It provides a complete, reproducible evaluation framework for comparing PyTorch against conventional HPC implementations (C++, OpenMP, and SYCL) using a computational physics workload based on the Local Orthogonal Inverse Transform Sampler (LOITS).
The artifact is designed with a two-tier philosophy:
- Interpretability-first: Users can immediately regenerate all figures from precomputed results without requiring recompilation or access to specialized hardware.
- Reproducibility (optional rerun): Full experimental pipelines can be re-executed to collect fresh performance data across CPU and GPU backends (CUDA, HIP, and XPU), subject to hardware availability.
Key Features
- Multi-implementation comparison
  - PyTorch (CPU, CUDA, HIP, XPU)
  - Native C++ (baseline)
  - OpenMP (CPU parallelism)
  - SYCL (DPC++, AdaptiveCpp across CPU/GPU backends)
- Heterogeneous evaluation
  - Multicore CPUs
  - NVIDIA, AMD, and Intel GPUs
- Unified benchmarking harness
  - Consistent workload (LOITS sampler)
  - Standardized CSV outputs
  - Comparable performance metrics (runtime, scaling, GFLOP/s)
- Reproducible plotting pipeline
  - Strong scaling
  - Weak scaling
  - Fixed-resource scaling and breakdowns
Artifact Workflow
The artifact uses a simple `make`-based interface:
- Quick validation (recommended): `make`
  - Regenerates all paper figures from `results/`
  - Automatically bootstraps from `archived-results/` if needed
- Full rerun (optional): `make rerun-strong`, `make rerun-weak`, `make rerun-frs`
  - Rebuilds implementations
  - Executes experiments
  - Produces fresh results
Repository Structure
- cpp/ – Native C++ implementation (pybind interface)
- omp/ – OpenMP implementation
- sycl/ – SYCL implementations (DPC++, AdaptiveCpp, UniSYCL)
- pytorch_2dloits.py – PyTorch implementation
- archived-results/ – Precomputed results (used by default)
- results/ – Generated or rerun results
- images/ – Final figures used in the paper
Reproducibility Notes
- The artifact prioritizes rapid validation by shipping archived results.
- Re-running experiments requires:
  - Appropriate toolchains (e.g., DPC++, ROCm, CUDA)
  - Compatible hardware for GPU backends
- Performance results may vary depending on system configuration.
Main Findings
- PyTorch achieves 4–5× reduction in source lines of code compared to HPC C++ implementations
- On CPUs, PyTorch reaches ~50–72% of optimized performance
- On accelerators, PyTorch significantly outperforms SYCL:
  - ~5–6× on CUDA
  - ~15× on HIP
  - up to ~16× on Intel XPU
Intended Use
This artifact is intended for:
- Researchers evaluating programming productivity vs performance
- Practitioners exploring Python-based HPC workflows
- Developers studying performance portability across heterogeneous systems
Mapping Sparse Triangular Solves to GPUs via Fine-grained Domain Decomposition
Society for Industrial and Applied Mathematics eBooks · 2026-02-13
book-chapter · Senior author
Solving sparse linear systems typically uses preconditioned iterative methods, but applying preconditioners via sparse triangular solves introduces bottlenecks due to irregular memory accesses and data dependencies. This work leverages fine-grained domain decomposition to adapt triangular solves to the GPU architecture. We develop a fine-grained domain decomposition strategy that generates non-overlapping subdomains, increasing parallelism in the application of the preconditioner at the expense of a modest increase in the iteration count for convergence. Each subdomain is assigned to a thread block and is sized such that the subdomain vector fits in the GPU shared memory, eliminating the need for inter-block synchronization and reducing irregular global memory accesses.
On the Landscape of Scientific Computing Libraries in Python
2025-09-15
article · Senior author
Python has seen large-scale adoption as a highly productive language for scientific computing, primarily due to its rich ecosystem of libraries such as NumPy, PyTorch, and TensorFlow. These libraries claim to deliver scalable and portable performance without the low-level complexities associated with traditional high-performance compiled languages such as C, C++, and Fortran. However, they are predominantly designed and optimized for machine learning workloads. This work quantifies and characterizes the performance, productivity, and memory efficiency of these libraries for four scientific computing workloads from the field of computational physics. In addition, we examine the influence of features offered by these libraries, including auto-parallelization, vectorization, and just-in-time (JIT) compilation. Using NumPy as the baseline and C++ as the performance upper bound, we analyze the compute and memory characteristics of each library, along with the associated runtime overhead within the Python ecosystem for solving these problems. Such a characterization enables an accurate quantification of the productivity-performance tradeoffs among the different libraries, both relative to each other and to C++. Finally, we leverage these insights to propose guidelines that can assist programmers and scientists in selecting the most suitable libraries for their work.
A 3D Deep Learning Architecture for Denoising Low-Dose CT Scans
Lecture notes in computer science · 2025-01-01 · 2 citations
book-chapter · Open access · Senior author
Scalable and Maintainable Distributed Sequence Alignment Using Spark
IEEE Transactions on Computational Biology and Bioinformatics · 2025-05-20
article · Open access · Senior author
The exponential growth of genomic data presents a challenge to bioinformatics research. NCBI BLAST, a popular pairwise sequence alignment tool, does not scale with the hundreds of gigabytes (GB) of sequenced data. Therefore, mpiBLAST was widely adopted and scaled up to 65,536 processors. However, mpiBLAST is tightly coupled with an obsolete NCBI BLAST version, creating a challenge to upgrading mpiBLAST with the ever-changing NCBI BLAST code. Recent parallel BLAST implementations, like SparkBLAST, use parallelism wrappers separate from NCBI BLAST to overcome this issue. However, query partitioning, a parallel method that duplicates the genome database on each compute node, makes SparkBLAST scale poorly with databases larger than a single node's memory. Thus, no parallel BLAST utility simultaneously addresses performance, scalability, and software maintainability. To fill this gap, we introduce SparkLeBLAST, a parallel BLAST tool that uses the Spark framework and efficient data partitioning to combine mpiBLAST's performance and scalability with SparkBLAST's simplicity and maintainability. SparkLeBLAST democratizes scalable genomic analysis for domain scientists without extensive distributed computing experience. SparkLeBLAST runs up to 6.68× faster than SparkBLAST. SparkLeBLAST also accelerates taxonomic assignment of COVID-19 genomic diversity analysis by 20.9× as it speeds up the BLAST search component by 88.6× using 128 compute nodes.
Looking Back to Look Forward: 15 Years of the Green500
Computer · 2025-01-01
article · Open access · Senior author
We revisit a Computer article from 15 years ago that introduced the Green500, a list ranking the most energy-efficient supercomputers. Our exploration centers on the advancements achieved during this time, highlighting a notable trend: the energy efficiency of supercomputers has approximately doubled every two years.
Characterization of Sparsity-aware Parallelization of Jaccard Similarity in Graph Datasets
2025-09-15
article · Senior author
The unpredictability of real-world graph workloads complicates the realization of high-performance graph analytics on the GPU. For example, variance in sparsity and neighbor commonality can dramatically alter computing costs and memory-access patterns, even within different regions of the same graph. In this paper, we investigate Jaccard similarity, a metric that measures the similarity of two sets. We compare and contrast the performance of an edge-centric parallelization of Jaccard similarity with respect to the vertex-centric approach from NVIDIA’s cuGraph library. We then characterize the impact of graph metrics (e.g., average degree, maximum degree, Gini index) on the performance of edge-centric and vertex-centric kernels. By combining the above graph metrics with performance metrics (e.g., bandwidth utilization and thread activity), we deliver insight into why certain graphs benefit from edge-centric over vertex-centric parallelization, while other graphs benefit conversely. Finally, based on these results, we make a case for sparsity-aware parallelization, i.e., choosing between an edge-centric or a vertex-centric parallelized kernel, for improved performance by showing that selecting the best-performing parallelized kernel can deliver a geometric mean speedup of 3.2× over the reference cuGraph kernel on an NVIDIA A100 GPU.
Top-Down SBP: Turning Graph Clustering Upside Down
2025-07-20
article · Open access · Senior author
Stochastic block partitioning (SBP) is a statistical inference-based algorithm for clustering vertices within a graph. It has been shown to be statistically robust and highly accurate even on graphs with a complex structure, but its poor scalability limits its usability to smaller-sized graphs. In this manuscript we argue that one reason for its poor scalability is the agglomerative, or bottom-up, nature of SBP's algorithmic design; the agglomerative computations cause high memory usage and create a large search space that slows down statistical inference, particularly in the algorithm's initial iterations. To address this bottleneck, we propose Top-Down SBP, a novel algorithm that replaces the agglomerative (bottom-up) block merges in SBP with a block-splitting operation. This enables the algorithm to start with all vertices in one cluster and subdivide them over time into smaller clusters. We show that Top-Down SBP is up to 7.7× faster than Bottom-Up SBP without sacrificing accuracy and can process larger graphs than Bottom-Up SBP on the same hardware due to an up to 4.1× decrease in memory usage. Additionally, we adapt existing methods for accelerating Bottom-Up SBP to the Top-Down approach, leading to up to 13.2× speedup over accelerated Bottom-Up SBP and up to 403× speedup over sequential Bottom-Up SBP on 64 compute nodes. Thus, Top-Down SBP represents substantial improvements to the scalability of SBP, enabling the analysis of larger datasets on the same hardware.
Balancing Performance and Productivity: A Comparative Study of Apache Arrow vs. MPI
2025-09-15
article · Senior author
As large-scale data processing becomes increasingly essential in today’s world, balancing developer productivity and computational performance in high-performance computing (HPC) environments remains a persistent challenge. Conventional HPC workloads rely heavily on MPI-based solutions, often written in C for better performance. Meanwhile, Apache Arrow, specifically leveraging Arrow Flight for node-to-node data transfers, has gained traction as a flexible, in-memory columnar data approach that promises efficient, language-agnostic memory usage, although its usage in HPC environments is less explored. In this paper, we perform a preliminary study that consists of three different implementations of the same data-parallel workload, the "Monte Carlo Simulation for Financial Risk Assessment," in (1) MPI in C, (2) MPI in Python, and (3) Python-based Arrow. Specifically, we measure and compare the trade-offs between performance and lines of code for each implementation. In addition, we demonstrate the scalability of each implementation through a strong-scaling analysis, highlighting potential communication bottlenecks as we add more communicators. The results of this paper are intended to surface early trends for balancing developer productivity, raw performance, and scalability when selecting tools for large-scale computational tasks.
Recent grants
EAGER: Collaborative Research: Democratizing the Teaching of Parallel Computing Concepts
NSF · $260k · 2013–2016
NSF · $375k · 2013–2017
Phase-I IUCRC Virginia Tech: Center for Space, High-performance, and Resilient Computing (SHREC)
NSF · $2.2M · 2018–2025
NSF · $2.0M · 2010–2013
NSF · $350k · 2013–2016
Frequent coauthors
- 44 shared
Ümit V. Çatalyürek
- 44 shared
David A. Bader
- 44 shared
Quincey Koziol
Lawrence Berkeley National Laboratory
- 44 shared
Bora Uçar
- 44 shared
Yale N. Patt
The University of Texas at Austin
- 44 shared
Federico Silla
Universitat Politècnica de València
- 44 shared
Thomas M. Stricker
- 42 shared
Heshan Lin
Institute of Oceanography
Education
- 1996
Ph.D., Computer Science
University of Illinois at Urbana-Champaign
- 1990
M.S., Computer Engineering
The Pennsylvania State University
- 1988
B.S., Computer Engineering
The Pennsylvania State University
Awards & honors
- Elizabeth and James E. Turner Jr. '56 Faculty Fellow