Kirk Cameron

Assistant Professor · Verified

Virginia Tech · Computer Science

Active 1991–2025

h-index: 27
Citations: 3.2k
Papers: 162 (21 in the last 5 years)
Funding: $6.0M
About

Kirk Cameron is a Professor and Managing Director at the Virginia Tech Institute for Advanced Computing, located in Alexandria, VA. His research interests include systems data analytics, high-performance computing, computational science, machine learning, and software engineering. He holds a Ph.D. in computer science from Louisiana State University earned in 2000 and a B.S. in mathematics from the University of Florida obtained in 1994. Cameron is involved in advancing computational methods and data analysis techniques within the field of computer science. His professional activities are centered at the Institute for Advanced Computing, where he contributes to research and development in these areas.

Research topics

  • Computer Science
  • Parallel computing
  • Machine Learning
  • Engineering
  • Engineering management
  • Mathematics
  • Data Mining
  • Artificial Intelligence
  • Algorithm
  • Knowledge management
  • Mathematics education
  • Simulation
  • Software engineering
  • Operating system
  • Statistics
  • Pedagogy
  • Psychology
  • Computational science
  • Computer hardware
  • Distributed computing
  • Computer graphics (images)
  • Geometry

Selected publications

  • Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

    ArXiv.org · 2025-12-04

    preprint · Open access

    Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today's Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to "count without running" by predicting single and double-precision FLOP counts for 577 CUDA kernels drawn from HeCBench, annotated with ground-truth profiles and eight execution attributes that distinguish trivially analyzable code from kernels whose FLOPs depend on hidden compiler or runtime behavior. Evaluating current closed-source reasoning models shows clear but uneven progress: the newest LLMs achieve perfect classification on straightforward kernels but still incur multiple order-of-magnitude errors whenever implicit FLOPs arise from division, intrinsic math functions, or common subexpressions. These results surface a core limitation of existing code assistants -- the inability to internalize hardware-specific microcode effects -- and position gpuFLOPBench as a focused testbed for developing LLM tooling that can reason about performance with the same rigor as experienced GPU developers. Sources are available at our repository: https://github.com/Scientific-Computing-Lab/gpuFLOPBench

  • Can Large Language Models Predict Parallel Code Performance?

    ArXiv.org · 2025-05-06

    preprint · Open access

    Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the GPU kernel is compute-bound or bandwidth-bound? For this study, we build a balanced dataset of 340 GPU kernels, obtained from HeCBench benchmark and written in CUDA and OpenMP, along with their ground-truth labels obtained via empirical GPU profiling. We evaluate LLMs across four scenarios: (1) with access to profiling data of the kernel source, (2) zero-shot with source code only, (3) few-shot with code and label pairs, and (4) fine-tuned on a small custom dataset. Our results show that state-of-the-art LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. We also find that reasoning-capable LLMs significantly outperform standard LLMs in zero- and few-shot settings, achieving up to 64% accuracy on GPU source codes, without profiling information. Lastly, we find that LLM fine-tuning will require much more data than what we currently have available. This work is among the first to use LLMs for source-level roofline performance prediction via classification, and illustrates their potential to guide optimization efforts when runtime profiling is infeasible. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC performance analysis and performance portability.

  • Can Large Language Models Predict Parallel Code Performance?

    2025-07-20 · 1 citation

    article · Open access

    Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware - an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the GPU kernel is compute-bound or bandwidth-bound?

  • Using High Impact Practices to Broaden Undergraduate Participation in Computer Systems Research

    2024 · 3 citations

    • Computer Science
    • Engineering management

    She is focused on instructing and designing curriculum for CS2104 Problem Solving

  • Memory Allocation Under Hardware Compression

    2024-11-02

    article

    As the scaling of memory density slows physically, a promising solution is to scale memory logically by enhancing the CPU's memory controller to encode and store data more densely in memory. This is known as hardware memory compression. Hardware memory compression decouples OS-managed physical memory from actual memory (i.e., DRAM); the memory controller spends a dynamically varying amount of DRAM on each physical page, depending on the compressibility of the page's content. The newly-decoupled actual memory effectively forms a new layer of memory beyond the traditional layers of virtual, pseudo-physical, and physical memory. We note unlike these traditional memory layers, each with its own specialized allocation interface (e.g., malloc/mmap for virtual memory, page tables+MMU for physical memory), this new layer of memory introduced by hardware memory compression still awaits its own unique memory allocation interface; its absence makes the allocation of actual memory imprecise and, sometimes, even impossible. Imprecisely allocating less actual memory, and/or unable to allocate more, can harm performance. Even imprecisely allocating more actual memory to some jobs can be harmful as it can result in allocating less actual memory to other jobs in highly-occupied memory systems, where compression is useful. To restore precise memory allocation, we design a new memory allocation specialized for this new layer of memory and, subsequently, architect a new MMU-like component in the memory controller and tackle the corresponding design challenges. We create a full-system FPGA prototype of a hardware-compressed memory system with precise memory allocation. Our evaluations using the prototype show that jobs perform stably under colocation. The performance variation is only 1%-2%; in comparison, it is 19%-89% under the prior art.

  • A Detailed Historical and Statistical Analysis of the Influence of Hardware Artifacts on SPEC Integer Benchmark Performance

    arXiv (Cornell University) · 2024-01-30

    preprint · Open access · Senior author

    The Standard Performance Evaluation Corporation (SPEC) CPU benchmark has been widely used as a measure of computing performance for decades. The SPEC is an industry-standardized, CPU-intensive benchmark suite and the collective data provide a proxy for the history of worldwide CPU and system performance. Past efforts have not provided or enabled answers to questions such as, how has the SPEC benchmark suite evolved empirically over time and what micro-architecture artifacts have had the most influence on performance? -- have any micro-benchmarks within the suite had undue influence on the results and comparisons among the codes? -- can the answers to these questions provide insights to the future of computer system performance? To answer these questions, we detail our historical and statistical analysis of specific hardware artifacts (clock frequencies, core counts, etc.) on the performance of the SPEC benchmarks since 1995. We discuss in detail several methods to normalize across benchmark evolutions. We perform both isolated and collective sensitivity analyses for various hardware artifacts and we identify one benchmark (libquantum) that had somewhat undue influence on performance outcomes. We also present the use of SPEC data to predict future performance.

  • Prediction for distributional outcomes in high-performance computing input/output variability

    Journal of the Royal Statistical Society Series C (Applied Statistics) · 2024-01-22 · 4 citations

    article · Open access · Senior author

    Although high-performance computing (HPC) systems have been scaled to meet the exponentially growing demand for scientific computing, HPC performance variability remains a major challenge in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management. In this article, we propose a new framework to predict performance distributions. The proposed framework is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We predict the HPC I/O distribution using the proposed method for the IOzone variability data. Data analysis results show that our framework can generate accurate predictions, and outperform existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our prediction results can further be used for HPC system variability monitoring and optimization. This article has online supplementary materials.

  • Integrating DevOps to Enhance Student Experience in an Undergraduate Research Project

    2024 · 1 citation

    Senior author · Corresponding
    • Computer Science
    • Software engineering

    She is focused on instructing and designing curriculum

  • A Detailed Historical and Statistical Analysis of the Influence of Hardware Artifacts on SPEC Integer Benchmark Performance

    IEEE Transactions on Computers · 2024-02-14 · 5 citations

    article · Senior author

    The Standard Performance Evaluation Corporation (SPEC) CPU benchmark has been widely used as a measure of computing performance for decades. The SPEC is an industry-standardized, CPU-intensive benchmark suite and the collective data provide a proxy for the history of worldwide CPU and system performance. Past efforts have not provided or enabled answers to questions such as, how has the SPEC benchmark suite evolved empirically over time and what micro-architecture artifacts have had the most influence on performance?—have any micro-benchmarks within the suite had undue influence on the results and comparisons among the codes?—can the answers to these questions provide insights to the future of computer system performance? To answer these questions, we detail our historical and statistical analysis of specific hardware artifacts (clock frequencies, core counts, etc.) on the performance of the SPEC benchmarks since 1995. We discuss in detail several methods to normalize across benchmark evolutions. We perform both isolated and collective sensitivity analyses for various hardware artifacts and we identify one benchmark (libquantum) that had somewhat undue influence on performance outcomes. We also present the use of SPEC data to predict future performance.

  • An Exploration of Global Optimization Strategies for Autotuning OpenMP-based Codes

    2024-05-27 · 3 citations

    article · Open access

    Automatic parameter tuning of parallel codes is ubiquitous in today's HPC environments where the performance portability of said codes is expected to keep pace with the perpetual release of new hardware. With changes in hardware, it is often the case that finding an optimal configuration of these codes is challenging and only further complicated by the high dimensionality or discontinuous topologies of the tuning spaces. Selecting a proper optimization strategy to automatically search these spaces is paramount to minimizing the energy and time spent on exploring sub-optimal configurations. Unfortunately, it is often the case that these optimizers have hyperparameters of their own, which are sensitive and can greatly affect the outcome of quickly converging to, or even finding an optimal code configuration. Much of the existing autotuning literature tends to use particular optimizers without describing their hyperparameter selection, leaving readers to figure out how to configure their optimizer for the best performance. In this work we compare and contrast the popular global optimization strategy of Bayesian Optimization (BO) to two less popular strategies: Particle Swarm Optimization (PSO), and Covariance Matrix Adaptive Evolution Strategy (CMA-ES). We sweep the hyperparameters of these three optimizers in the context of tuning OpenMP hyperparameters of four classic OpenMP programs: BT, FT, HPCG, and Lulesh. Our study compares the long-term search behavior and average time-to-convergence between these three optimization strategies in tuning OpenMP codes. We contribute a detailed study of these strategies and provide deeper insights as to their sensitivities, noting the conditions where each performs well, and hinting at which optimizers require minimal tuning of their hyperparameters for desirable tuning results.
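The roofline classification named in the "Can Large Language Models Predict Parallel Code Performance?" abstract above can be illustrated with a short sketch. This is only the textbook roofline rule, not the authors' pipeline: a kernel's arithmetic intensity (FLOPs per byte moved) is compared against the machine's ridge point (peak FLOP rate divided by peak memory bandwidth). The function name `classify_roofline` and all numbers in the usage lines are hypothetical.

```python
def classify_roofline(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Label a kernel with the standard roofline rule (illustrative sketch)."""
    if bytes_moved <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("bytes_moved and peak_bytes_per_s must be positive")

    # Arithmetic intensity: useful work done per byte of memory traffic.
    arithmetic_intensity = flops / bytes_moved

    # Ridge point: the intensity at which the machine switches from being
    # limited by memory bandwidth to being limited by peak compute.
    ridge_point = peak_flops_per_s / peak_bytes_per_s

    if arithmetic_intensity >= ridge_point:
        return "compute-bound"
    return "bandwidth-bound"


# Hypothetical GPU: 10 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
# Hypothetical kernel: 1e9 FLOPs over 1e9 bytes of memory traffic.
label = classify_roofline(1e9, 1e9, 10e12, 1e12)
print(label)
```

In the profiled setting the abstract describes, the FLOP and byte counts would come from measurements; in the source-only settings a model would have to infer the label without them.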
