Jason (Jingsheng) Cong

Distinguished Professor · Verified

University of California, Los Angeles · Computer Science

Active 1988–2025

h-index: 68
Citations: 20.6k
Papers: 560 (128 in the last 5 years)
Funding: $14.2M

About

Jason (Jingsheng) Cong is a distinguished professor in the UCLA Samueli School of Engineering, holding the Volgenau Chair for Engineering Excellence in the Department of Computer Science, with a joint appointment in the Department of Electrical and Computer Engineering. He received his B.S. degree in computer science from Peking University in 1985, and his M.S. and Ph.D. degrees in computer science from the University of Illinois at Urbana-Champaign in 1987 and 1990, respectively. Dr. Cong's research interests encompass electronic design automation, customized computing for machine learning and big-data applications, quantum computing, and highly scalable algorithms. He has published over 500 research papers and has been recognized with numerous awards, including election to the National Academy of Engineering in 2017, IEEE Fellow in 2000, ACM Fellow in 2008, and Fellow of the National Academy of Inventors in 2020. His work on FPGA technology mapping (FlowMap) received the ACM/IEEE A. Richard Newton Technical Impact Award in 2011 and he was the first inductee into the FPGA and Reconfigurable Computing Hall of Fame. Dr. Cong has led significant research initiatives, including the establishment of the Center for Domain-Specific Computing through an NSF Expeditions in Computing Award, and has contributed to advancing US semiconductor technology. He is also a successful serial entrepreneur, founding and advising multiple companies that developed influential FPGA design tools and accelerators, many of which were acquired by major industry players. Throughout his career, Dr. Cong has mentored numerous students and postdoctoral researchers, many of whom now hold faculty positions or key R&D roles worldwide.

Research topics

  • Computer Science
  • Artificial Intelligence
  • Embedded system
  • Operating system
  • Computer vision
  • Computer engineering
  • Parallel computing
  • Engineering
  • Computer architecture

Selected publications

  • Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

    2025-01-01

    article · Open access · Senior author

    We introduce an effective and scalable data selection technique to accelerate the pretraining of large language models (LLMs). Given the variation in quality and informativeness of web-scale corpora, we present the Learn-Focus-Review (LFR) paradigm, a dynamic training approach that adapts to the model's learning progress. Inspired by human learning techniques like spaced repetition, LFR tracks the model's learning performance across data instances and prioritizes revisiting challenging and diverse regions of the dataset that are more prone to being forgotten, enabling better retention and more efficient learning. Through experiments spanning over 2200 GPU hours, we show that LFR significantly enhances data efficiency in pretraining while improving downstream performance across commonsense reasoning, question answering, problem-solving, language modeling, and translation tasks. LFR consistently achieves lower perplexity and higher accuracy than models trained on the full dataset while using just 5%-19% of the training tokens. Notably, LFR matches the performance of industry-standard Pythia models with up to 2× the parameter count while requiring only 3.2% of the training tokens.

  • Dynamic-Width Speculative Beam Decoding for LLM Inference

    Proceedings of the AAAI Conference on Artificial Intelligence · 2025-04-11 · 2 citations

    article · Open access

    Large language models (LLMs) based on transformer architecture have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling. Experimental results show that our approach achieves a 1.5-1.9x speed-up and 1.8-2.5x lower energy consumption compared to beam sampling, with no loss in downstream performance. Moreover, it can produce significantly higher-quality outputs than speculative decoding, while maintaining similar time, memory, and energy costs. In summary, our method offers a more efficient and effective inference process for LLMs.

  • Reconfigurable Stream Network Architecture

    2025-06-20 · 1 citation

    article
  • Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear Programming Approach

    ACM Transactions on Design Automation of Electronic Systems · 2025-01-09 · 8 citations

    preprint · Open access · Senior author

    High-Level Synthesis enables the rapid prototyping of hardware accelerators, by combining a high-level description of the functional behavior of a kernel with a set of micro-architecture optimizations as inputs. Such optimizations can be described by inserting pragmas e.g., pipelining and replication of units, or even higher level transformations for HLS such as automatic data caching using the AMD/Xilinx Merlin compiler. Selecting the best combination of pragmas, even within a restricted set, remains particularly challenging and the typical state-of-practice uses design-space exploration to navigate this space. But due to the highly irregular performance distribution of pragma configurations, typical DSE approaches are either extremely time consuming, or operating on a severely restricted search space. This work proposes a framework to automatically insert HLS pragmas in regular loop-based programs, supporting pipelining, unit replication, and data caching. We develop an analytical performance and resource model as a function of the input program properties and pragmas inserted, using non-linear constraints and objectives. We prove this model provides a lower bound on the actual performance after HLS. We then encode this model as a Non-Linear Program, by making the pragma configuration unknowns of the system, which is computed optimally by solving this NLP. This approach can also be used during DSE, to quickly prune points with a (possibly partial) pragma configuration, driven by lower bounds on achievable latency. We extensively evaluate our end-to-end, fully implemented system, showing it can effectively manipulate spaces of billions of designs in seconds to minutes for the kernels evaluated.

  • SAT-Accel: A Modern SAT Solver on a FPGA

    2025-02-26 · 7 citations

    article · Open access · Senior author

    Boolean satisfiability (SAT) solving is the first known NP-complete problem and is widely used in many application domains. Over the years, there have been so many consistent improvements in this area such that larger instances can be solved relatively quickly. Although these improvements have found their way onto CPU implementations, there has been limited progress adopting this on hardware accelerators mainly because it is difficult to implement the dynamic data structures needed to support a modern SAT solving algorithm.

  • ML-QLS: Multilevel Quantum Layout Synthesis

    2025-03-13 · 1 citation

    article · Open access · Senior author

    Quantum Layout Synthesis (QLS) plays a crucial role in optimizing quantum circuit execution on physical quantum devices. As we enter the era where quantum computers have hundreds of qubits, optimal QLS tools face scalability issues, while heuristic methods suffer a significant optimality gap due to the lack of global optimization. To address these challenges, we introduce a multilevel framework, which is an effective methodology for solving large-scale problems in VLSI design. In this paper, we present ML-QLS, the first multilevel quantum layout tool with a scalable refinement operation integrated with novel cost functions and clustering strategies. Our clustering provides valuable insights into generating a proper problem approximation for quantum circuits and devices. The experimental results demonstrate that ML-QLS can scale up to problems involving hundreds of qubits and achieve a remarkable 69% performance improvement over leading heuristic QLS tools for large circuits, which underscores the effectiveness of multilevel frameworks in quantum applications.

  • Fine Grain 3D Integration for Microarchitecture Design Through Cube Packing Exploration

    ArXiv.org · 2025-07-13

    article · Open access · Senior author

    Most previous 3D IC research focused on stacking traditional 2D silicon layers, so the interconnect reduction is limited to inter-block delays. In this paper, we propose techniques that enable efficient exploration of the 3D design space where each logical block can span more than one silicon layer. Although further power and performance improvement is achievable through fine grain 3D integration, the necessary modeling and tool infrastructure has been mostly missing. We develop a cube packing engine which can simultaneously optimize physical and architectural design for effective utilization of 3D in terms of performance, area and temperature. Our experimental results using a design driver show 36% performance improvement (in BIPS) over 2D and 14% over 3D with single layer blocks. Additionally, multi-layer blocks can provide up to 30% reduction in power dissipation compared to the single-layer alternatives. Peak temperature of the design is kept within limits as a result of thermal-aware floorplanning and thermal via insertion techniques.

  • Stream-HLS: Towards Automatic Dataflow Acceleration

    2025-02-26 · 11 citations

    preprint · Open access · Senior author · Corresponding

    High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in hardware design. Further, the hardware design space, especially for multi-kernel applications, grows exponentially. Therefore, several HLS automation and abstraction frameworks have been proposed recently, but many issues remain unresolved. These issues include: 1) relying mainly on hardware directives (pragmas) to apply hardware optimizations without exploring loop scheduling opportunities. 2) targeting single-kernel applications only. 3) lacking automatic and/or global design space exploration. 4) missing critical hardware optimizations, such as graph-level pipelining for multi-kernel applications.

  • LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

    ArXiv.org · 2025-11-09

    preprint · Open access · Senior author

    The rapid development of large language models (LLMs) has greatly enhanced everyday applications. While many FPGA-based accelerators, with flexibility for fine-grained data control, exhibit superior speed and energy efficiency compared to GPUs, recent GPU-specific optimizations have diminished this advantage. When limited to arithmetic-based computation, FPGAs often underperform GPUs due to their comparatively fewer computational resources. To address this challenge, we exploit a key advantage of FPGAs over GPUs: abundant distributed on-chip memory embedded among computational units. We believe that shifting LLM inference from arithmetic-based to memory-based computations through table lookups can improve the efficiency on FPGAs to compete with GPUs. However, existing methods are inefficient or unable to scale and deploy language models due to algorithm and architecture design limitations. This paper introduces LUT-LLM, the first FPGA accelerator that deploys a 1B+ language model with memory-based computation, leveraging vector quantization. We construct a performance model, evaluate multiple quantization schemes, and identify activation-weight vector co-quantization as the most effective approach. To support this scheme, LUT-LLM features (1) bandwidth-aware parallel centroid search to reduce decoding latency, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design to reduce data caching for a higher throughput table lookup. We develop a training recipe that converts existing models to support table lookups with high accuracy and prototype LUT-LLM for the Qwen 3 1.7B model on the AMD V80 FPGA, reducing arithmetic operations by 4× and achieving a 1.10-3.29× faster generation speed and a 3.05-6.60× higher energy efficiency than GPUs.

  • Invited: Coping with Interconnects

    2025-03-13 · 2 citations

    article · Open access · 1st author · Corresponding

    In this paper, I review the multi-decade research on overcoming the performance bottleneck of VLSI interconnects in deep sub-micrometer and nanometer technologies that started at UCLA in the early 1990s. Our research spans from interconnect topology and geometry optimization, to wire length reduction via scalable placement, to use of novel interconnect technologies such as 3D IC and RF-interconnects, to recent work on interconnect pipelining in chiplet designs, and the shift from interconnect to entanglement in quantum computing. The latter two efforts go beyond the typical physical design space and involve space-time co-optimization. This paper is dedicated to multiple generations of Ph.D. students, postdocs, and visiting researchers who contributed to build a strong physical design research program at UCLA.

Frequent coauthors

  • Glenn Reinman

    UCLA Health

    51 shared
  • Zhiru Zhang

    37 shared
  • Yuze Chi

    University of California, Los Angeles

    37 shared
  • Zhenman Fang

    Simon Fraser University

    35 shared
  • Deming Chen

    29 shared
  • Mau-Chung Frank Chang

    University of California, Los Angeles

    29 shared
  • Bingjun Xiao

    25 shared
  • Atefeh Sohrabizadeh

    University of California, Los Angeles

    24 shared

Labs

  • VLSI Architecture, Synthesis, and Technology (VAST) Laboratory · PI

Education

  • B.S., Computer Science

    Peking University

    1985
  • M.S., Computer Science

    University of Illinois at Urbana-Champaign

    1987
  • Ph.D., Computer Science

    University of Illinois at Urbana-Champaign

    1990

Awards & honors

  • IEEE Fellow (2000)
  • ACM Fellow (2008)
  • Member of the National Academy of Engineering (2017)
  • Fellow of the National Academy of Inventors (2020)
  • University Research Award from the Semiconductor Industry As…