Wu-chun Feng

Professor · Verified

Virginia Tech · Computer Science

Active 1996–2026

h-index: 46
Citations: 8.6k
Papers: 426 (39 in the last 5 years)
Funding: $5.2M
About

Wu-chun Feng is a professor in the Department of Computer Science at Virginia Tech. His research interests include high-performance computing and computational science, data analytics systems, computational biology and bioinformatics, and machine learning. He received his Ph.D. in computer science from the University of Illinois at Urbana-Champaign in 1996. He also holds a master's degree in computer engineering, a bachelor's degree in electrical and computer engineering, and a B.A. in music from Penn State University. His office is in Torgersen Hall, Room 2050, at Virginia Tech, and he is affiliated with multiple research institutes and centers. Contact: feng@cs.vt.edu, 540-231-1192.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Parallel computing
  • Machine Learning
  • Programming language
  • Embedded system
  • Theoretical computer science
  • Computer architecture
  • Distributed computing
  • Operating system
  • Algorithm

Selected publications

  • On the Efficacy of PyTorch for High-Performance Computing: A Case Study in Computational Physics

    Zenodo (CERN European Organization for Nuclear Research) · 2026-04-14

    other · Open access · Senior author

    This artifact accompanies the paper “On the Efficacy of PyTorch for High-Performance Computing: A Case Study in Computational Physics,” published in Proceedings of the 23rd ACM International Conference on Computing Frontiers (CF ’26), May 19–21, 2026, Catania, Italy. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3801487.3801838

    It provides a complete, reproducible evaluation framework for comparing PyTorch against conventional HPC implementations (C++, OpenMP, and SYCL) using a computational physics workload based on the Local Orthogonal Inverse Transform Sampler (LOITS). The artifact is designed with a two-tier philosophy:

    - Interpretability-first: Users can immediately regenerate all figures from precomputed results without requiring recompilation or access to specialized hardware.
    - Reproducibility (optional rerun): Full experimental pipelines can be re-executed to collect fresh performance data across CPU and GPU backends (CUDA, HIP, and XPU), subject to hardware availability.

    Key Features

    - Multi-implementation comparison
      - PyTorch (CPU, CUDA, HIP, XPU)
      - Native C++ (baseline)
      - OpenMP (CPU parallelism)
      - SYCL (DPC++, AdaptiveCpp across CPU/GPU backends)
    - Heterogeneous evaluation
      - Multicore CPUs
      - NVIDIA, AMD, and Intel GPUs
    - Unified benchmarking harness
      - Consistent workload (LOITS sampler)
      - Standardized CSV outputs
      - Comparable performance metrics (runtime, scaling, GFLOP/s)
    - Reproducible plotting pipeline
      - Strong scaling
      - Weak scaling
      - Fixed-resource scaling and breakdowns

    Artifact Workflow

    The artifact uses a simple `make`-based interface.

    - Quick validation (recommended): `make`
      - Regenerates all paper figures from `results/`
      - Automatically bootstraps from `archived-results/` if needed
    - Full rerun (optional): `make rerun-strong`, `make rerun-weak`, `make rerun-frs`
      - Rebuilds implementations
      - Executes experiments
      - Produces fresh results

    Repository Structure

    - cpp/ – Native C++ implementation (pybind interface)
    - omp/ – OpenMP implementation
    - sycl/ – SYCL implementations (DPC++, AdaptiveCpp, UniSYCL)
    - pytorch_2dloits.py – PyTorch implementation
    - archived-results/ – Precomputed results (used by default)
    - results/ – Generated or rerun results
    - images/ – Final figures used in the paper

    Reproducibility Notes

    - The artifact prioritizes rapid validation by shipping archived results.
    - Re-running experiments requires:
      - Appropriate toolchains (e.g., DPC++, ROCm, CUDA)
      - Compatible hardware for GPU backends
    - Performance results may vary depending on system configuration.

    Main Findings

    - PyTorch achieves a 4–5× reduction in source lines of code compared to HPC C++ implementations.
    - On CPUs, PyTorch reaches ~50–72% of optimized performance.
    - On accelerators, PyTorch significantly outperforms SYCL:
      - ~5–6× on CUDA
      - ~15× on HIP
      - up to ~16× on Intel XPU

    Intended Use

    This artifact is intended for:

    - Researchers evaluating programming productivity vs. performance
    - Practitioners exploring Python-based HPC workflows
    - Developers studying performance portability across heterogeneous systems

  • Mapping Sparse Triangular Solves to GPUs via Fine-grained Domain Decomposition

    Society for Industrial and Applied Mathematics eBooks · 2026-02-13

    book-chapter · Senior author

    Solving sparse linear systems typically uses preconditioned iterative methods, but applying preconditioners via sparse triangular solves introduces bottlenecks due to irregular memory accesses and data dependencies. This work leverages fine-grained domain decomposition to adapt triangular solves to the GPU architecture. We develop a fine-grained domain decomposition strategy that generates non-overlapping subdomains, increasing parallelism in the application of preconditioner at the expense of a modest increase in the iteration count for convergence. Each subdomain is assigned to a thread block and is sized such that the subdomain vector fits in the GPU shared memory, eliminating the need for inter-block synchronization and reducing irregular global memory accesses.
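
    To see why sparse triangular solves are hard to parallelize, here is a minimal sequential sketch of forward substitution on a lower-triangular matrix in CSR form. It is not the chapter's GPU method; the function name `sparse_lower_solve` and the assumption that each row stores its diagonal entry last are illustrative choices. Note how row `i` reads earlier entries of `x` at data-dependent positions, which is the source of the dependencies and irregular memory accesses the abstract mentions.

```python
def sparse_lower_solve(indptr, indices, data, b):
    """Solve L x = b by forward substitution, with L lower triangular in CSR form.

    Assumption (for this sketch only): column indices in each row are sorted,
    so the diagonal entry is the last stored entry of the row.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        start, end = indptr[i], indptr[i + 1]
        acc = b[i]
        # Off-diagonal entries: every x[j] with j < i must already be computed,
        # and the positions indices[k] are irregular (data-dependent).
        for k in range(start, end - 1):
            acc -= data[k] * x[indices[k]]
        x[i] = acc / data[end - 1]  # divide by the diagonal entry
    return x


# L = [[2, 0, 0],
#      [1, 3, 0],
#      [0, 4, 5]]
indptr = [0, 1, 3, 5]
indices = [0, 0, 1, 1, 2]
data = [2.0, 1.0, 3.0, 4.0, 5.0]
print(sparse_lower_solve(indptr, indices, data, [2.0, 7.0, 23.0]))
```

    Because each row waits on earlier rows, the outer loop cannot simply be split across threads. The abstract describes splitting the problem into non-overlapping subdomains so that each thread block can work on its own subdomain independently.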

  • On the Landscape of Scientific Computing Libraries in Python

    2025-09-15

    article · Senior author

    Python has seen large-scale adoption as a highly productive language for scientific computing, primarily due to its rich ecosystem of libraries such as NumPy, PyTorch, and TensorFlow. These libraries claim to deliver scalable and portable performance without the low-level complexities associated with traditional high-performance compiled languages such as C, C++, and Fortran. However, they are predominantly designed and optimized for machine learning workloads. This work quantifies and characterizes the performance, productivity, and memory efficiency of these libraries using four scientific computing workloads from the field of computational physics. In addition, we examine the influence of features offered by these libraries, including auto-parallelization, vectorization, and just-in-time (JIT) compilation. Using NumPy as the baseline and C++ as the performance upper bound, we analyze the compute and memory characteristics of each library, along with the associated runtime overhead within the Python ecosystem for solving these problems. Such a characterization enables an accurate quantification of the productivity-performance tradeoffs among the different libraries, both relative to each other and to C++. Finally, we leverage these insights to propose guidelines that can assist programmers and scientists in selecting the most suitable libraries for their work.

  • A 3D Deep Learning Architecture for Denoising Low-Dose CT Scans

    Lecture notes in computer science · 2025-01-01 · 2 citations

    book-chapter · Open access · Senior author

  • Scalable and Maintainable Distributed Sequence Alignment Using Spark

    IEEE Transactions on Computational Biology and Bioinformatics · 2025-05-20

    article · Open access · Senior author

    The exponential growth of genomic data presents a challenge to bioinformatics research. NCBI BLAST, a popular pairwise sequence alignment tool, does not scale with the hundreds of gigabytes (GB) of sequenced data. Therefore, mpiBLAST was widely adopted and scaled up to 65,536 processors. However, mpiBLAST is tightly coupled with an obsolete NCBI BLAST version, creating a challenge to upgrading mpiBLAST with the ever-changing NCBI BLAST code. Recent parallel BLAST implementations, like SparkBLAST, use parallelism wrappers separate from NCBI BLAST to overcome this issue. However, query partitioning, a parallel method that duplicates the genome database on each compute node, makes SparkBLAST scale poorly with databases larger than a single node's memory. Thus, no parallel BLAST utility simultaneously addresses performance, scalability, and software maintainability. To fill this gap, we introduce SparkLeBLAST, a parallel BLAST tool that uses the Spark framework and efficient data partitioning to combine mpiBLAST's performance and scalability with SparkBLAST's simplicity and maintainability. SparkLeBLAST democratizes scalable genomic analysis for domain scientists without extensive distributed computing experience. SparkLeBLAST runs up to 6.68× faster than SparkBLAST. SparkLeBLAST also accelerates taxonomic assignment of COVID-19 genomic diversity analysis by 20.9× as it speeds up the BLAST search component by 88.6× using 128 compute nodes.

  • Looking Back to Look Forward: 15 Years of the Green500

    Computer · 2025-01-01

    article · Open access · Senior author

    We revisit a Computer article from 15 years ago that introduced the Green500—a list ranking the most energy-efficient supercomputers. Our exploration centers on the advancements achieved during this time, highlighting a notable trend: the energy efficiency of supercomputers has approximately doubled every two years.
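
    As a rough worked example of what "doubling every two years" implies, the sketch below computes the cumulative growth factor under an idealized, exactly exponential trend. The function name and the exact two-year period are assumptions for illustration; the abstract says only "approximately," and the 15-year figure below is plain arithmetic, not a number reported in the article.

```python
def efficiency_multiplier(years, doubling_period_years=2.0):
    """Growth factor after `years` if efficiency doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# Over 15 years at an exact two-year doubling period: 2 ** 7.5, roughly 181x.
print(round(efficiency_multiplier(15), 1))
```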

  • Characterization of Sparsity-aware Parallelization of Jaccard Similarity in Graph Datasets

    2025-09-15

    article · Senior author

    The unpredictability of real-world graph workloads complicates the realization of high-performance graph analytics on the GPU. For example, variance in sparsity and neighbor commonality can dramatically alter computing costs and memory-access patterns, even within different regions of the same graph. In this paper, we investigate Jaccard similarity, a metric that measures the similarity of two sets. We compare and contrast the performance of an edge-centric parallelization of Jaccard similarity with respect to the vertex-centric approach from NVIDIA’s cuGraph library. We then characterize the impact of graph metrics (e.g., average degree, maximum degree, Gini index) on the performance of edge-centric and vertex-centric kernels. By combining the above graph metrics with performance metrics (e.g., bandwidth utilization and thread activity), we deliver insight into why certain graphs benefit from edge-centric over vertex-centric parallelization, while other graphs benefit conversely. Finally, based on these results, we make a case for sparsity-aware parallelization, i.e., choosing between an edge-centric or a vertex-centric parallelized kernel, for improved performance by showing that selecting the best-performing parallelized kernel can deliver a geometric mean speedup of 3.2× over the reference cuGraph kernel on an NVIDIA A100 GPU.
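
    For readers unfamiliar with the metric, the sketch below computes Jaccard similarity between the neighbor sets of adjacent vertices and shows, sequentially, how the two work decompositions differ. It does not reproduce cuGraph's kernels or the paper's implementation; the function names and the dict-of-sets graph representation are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def vertex_centric(adj):
    """One work item per vertex: the inner loop length equals the vertex degree,
    so high-degree vertices carry far more work than low-degree ones."""
    scores = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # score each undirected edge once
                scores[(u, v)] = jaccard(adj[u], adj[v])
    return scores


def edge_centric(adj):
    """One work item per edge: flatten to an edge list first, then score each edge."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    return {(u, v): jaccard(adj[u], adj[v]) for (u, v) in edges}


# Small undirected graph stored as vertex -> set of neighbors.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(edge_centric(adj))
```

    Both functions return the same scores; they differ only in how the work is partitioned, which on a GPU determines load balance and memory-access patterns.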

  • Top-Down SBP: Turning Graph Clustering Upside Down

    2025-07-20

    article · Open access · Senior author

    Stochastic block partitioning (SBP) is a statistical inference-based algorithm for clustering vertices within a graph. It has been shown to be statistically robust and highly accurate even on graphs with a complex structure, but its poor scalability limits its usability to smaller-sized graphs. In this manuscript we argue that one reason for its poor scalability is the agglomerative, or bottom-up, nature of SBP's algorithmic design; the agglomerative computations cause high memory usage and create a large search space that slows down statistical inference, particularly in the algorithm's initial iterations. To address this bottleneck, we propose Top-Down SBP, a novel algorithm that replaces the agglomerative (bottom-up) block merges in SBP with a block-splitting operation. This enables the algorithm to start with all vertices in one cluster and subdivide them over time into smaller clusters. We show that Top-Down SBP is up to 7.7× faster than Bottom-Up SBP without sacrificing accuracy and can process larger graphs than Bottom-Up SBP on the same hardware due to an up to 4.1× decrease in memory usage. Additionally, we adapt existing methods for accelerating Bottom-Up SBP to the Top-Down approach, leading to up to 13.2× speedup over accelerated Bottom-Up SBP and up to 403× speedup over sequential Bottom-Up SBP on 64 compute nodes. Thus, Top-Down SBP represents substantial improvements to the scalability of SBP, enabling the analysis of larger datasets on the same hardware.

  • Balancing Performance and Productivity: A Comparative Study of Apache Arrow vs. MPI

    2025-09-15

    article · Senior author

    As large-scale data processing becomes increasingly essential in today’s world, balancing developer productivity and computational performance in high-performance computing (HPC) environments remains a persistent challenge. Conventional HPC workloads rely heavily on MPI-based solutions, often written in C for better performance. Meanwhile, Apache Arrow, specifically leveraging Arrow Flight for node-to-node data transfers, has gained traction as a flexible, in-memory columnar data approach that promises efficient, language-agnostic memory usage, although its usage in HPC environments is less explored. In this paper, we perform a preliminary study that consists of three different implementations of the same data-parallel workload, the "Monte Carlo Simulation for Financial Risk Assessment," in (1) MPI in C, (2) MPI in Python, and (3) Python-based Arrow. Specifically, we measure and compare the trade-offs between performance and lines of code for each implementation. In addition, we demonstrate the scalability of each implementation through a strong-scaling analysis, highlighting potential communication bottlenecks as we add more communicators. The results of this paper are intended to surface early trends for balancing developer productivity, raw performance, and scalability when selecting tools for large-scale computational tasks.

Frequent coauthors

  • Ümit V. Çatalyürek

    44 shared
  • David A. Bader

    44 shared
  • Quincey Koziol

    Lawrence Berkeley National Laboratory

    44 shared
  • Bora Uçar

    44 shared
  • Yale N. Patt

    The University of Texas at Austin

    44 shared
  • Federico Silla

    Universitat Politècnica de València

    44 shared
  • Thomas M. Stricker

    44 shared
  • Heshan Lin

    Institute of Oceanography

    42 shared

Education

  • Ph.D., Computer Science

    University of Illinois at Urbana-Champaign

    1996
  • M.S., Computer Engineering

    The Pennsylvania State University

    1990
  • B.S., Computer Engineering

    The Pennsylvania State University

    1988

Awards & honors

  • Elizabeth and James E. Turner Jr. '56 Faculty Fellow