Gurpreet Singh
Verified · University of Illinois Urbana-Champaign · Department of Biomedical and Translational Sciences
Active 1969–2026
Research topics
- Computer Science
- Artificial Intelligence
- Natural Language Processing
- Algorithm
- Mathematics
- Chemistry
- Mathematical optimization
Selected publications
Evolving Abstract Transformers for Gradient-Guided, Adaptable Abstract Interpretation
Zenodo (CERN European Organization for Nuclear Research) · 2026-04-15
other · Open access · Senior author
Artifact for our PLDI'26 paper titled "Evolving Abstract Transformers for Gradient-Guided, Adaptable Abstract Interpretation"
SAIL: Sound Abstract Interpreters with LLMs
Zenodo (CERN European Organization for Nuclear Research) · 2026-06-01
article · Open access · Senior author
This repo is the artifact associated with the PLDI 2026 submission #429 "SAIL: Sound Abstract Interpreters with LLMs".
Evolving Abstract Transformers for Gradient-Guided, Adaptable Abstract Interpretation
ArXiv.org · 2025-07-16
preprint · Open access · Senior author
Current numerical abstract interpretation relies on fixed, hand-crafted, instruction-specific transformers tailored to each domain, causing three key limitations: transformers cannot be reused across domains; precise compositional reasoning over instruction sequences is difficult; and all downstream tasks must use the same fixed transformer regardless of precision or efficiency needs. To address this, we propose the Evolving Abstract Transformer, which replaces the fixed single-output design with an adaptable search over a parametric space of sound outputs via two algorithms. First, the Universal Parametric Output Space Encoder (UPOSE) constructs a compact parametric space of sound outputs for any polyhedral numerical domain and any operator in the Quadratic-Bounded Guarded Operators (QGO) class, covering both individual instructions and structured sequences. Second, the Adaptive Gradient Guidance (AGG) algorithm leverages the differentiable structure of UPOSE's output space to efficiently search it according to downstream objectives and available runtime, continually evolving the output as more time is provided. We implement these ideas in the AbsEvolve framework and evaluate across three numerical abstract domains: Zones, Octagons, and Polyhedra. Results show the evolving transformer works across domains and instructions, enables efficient precision-efficiency tradeoffs by adjusting the number of gradient steps in the search, and reaches the most precise invariants up to 3.2x faster than existing baselines.
American Journal of Therapeutics · 2025-12-09
article
Learning a Pessimistic Reward Model in RLHF
ArXiv.org · 2025-05-26
preprint · Open access · Senior author
This work proposes "PET", a novel pessimistic reward fine-tuning method, to learn a pessimistic reward model robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, on which a KL regularization plays a pivotal role in mitigating reward hacking when optimizing a policy. Such an intuition-based method still suffers from reward hacking, and policies with large KL divergence from the dataset distribution are excluded during learning. In contrast, we show that when optimizing a policy on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our methods on the standard TL;DR summarization dataset. We find that one can learn a high-quality policy on our pessimistic reward without using any regularization. Such a policy has a high KL divergence from the dataset distribution while having high performance in practice. In summary, our work shows the feasibility of learning a pessimistic reward model against reward hacking. The agent can greedily search for the policy with a high pessimistic reward without suffering from reward hacking.
BEAVER: An Efficient Deterministic LLM Verifier
ArXiv.org · 2025-12-05
preprint · Open access · Senior author
As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify model outputs and characterize tail risk for safe deployment. While sampling-based estimates provide an ad-hoc intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM satisfaction of safety properties. Given a prompt and any safety property, BEAVER systematically explores the model output space using novel Token trie and Frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on 4 safety properties across 12 open-weight LLMs. BEAVER identifies 2-3x more risky instances compared to baselines while taking 1/10 of the compute budget, surfacing tail risks that loose bounds and ad-hoc evaluation miss.
SuperCoder: Assembly Program Superoptimization with Large Language Models
ArXiv.org · 2025-05-16
preprint · Open access
Superoptimization is the task of transforming a program into a faster one while preserving its input-output behavior. In this work, we investigate whether large language models (LLMs) can serve as superoptimizers, generating assembly programs that outperform code already optimized by industry-standard compilers. We construct the first large-scale benchmark for this problem, consisting of 8,072 assembly programs averaging 130 lines, in contrast to prior datasets restricted to 2-15 straight-line, loop-free programs. We evaluate 23 LLMs on this benchmark and find that the strongest baseline, Claude-opus-4, achieves a 51.5% test-passing rate and a 1.43x average speedup over gcc -O3. To further enhance performance, we fine-tune models with reinforcement learning, optimizing a reward function that integrates correctness and performance speedup. Starting from Qwen2.5-Coder-7B-Instruct (61.4% correctness, 1.10x speedup), the fine-tuned model SuperCoder attains 95.0% correctness and 1.46x average speedup, with additional improvement enabled by Best-of-N sampling and iterative refinement. Our results demonstrate, for the first time, that LLMs can be applied as superoptimizers for assembly programs, establishing a foundation for future research in program performance optimization beyond compiler heuristics.
Data Shifts Hurt CoT: A Theoretical Study
ArXiv.org · 2025-06-12
preprint · Open access · Senior author
Chain of Thought (CoT) has been applied to various large language models (LLMs) and proven to be effective in improving the quality of outputs. In recent studies, transformers are proven to have absolute upper bounds in terms of expressive power, and consequently, they cannot solve many computationally difficult problems. However, empowered by CoT, transformers are proven to be able to solve some difficult problems effectively, such as the $k$-parity problem. Nevertheless, those works rely on two imperative assumptions: (1) identical training and testing distribution, and (2) corruption-free training data with correct reasoning steps. In the real world, these assumptions do not always hold. Although the risks of data shifts have caught attention, to the best of our knowledge our work is the first to rigorously study the exact harm caused by such shifts. Focusing on the $k$-parity problem, in this work we investigate the joint impact of two types of data shifts, distribution shift and data poisoning, on the quality of trained models obtained by a well-established CoT decomposition. In addition to revealing a surprising phenomenon that CoT leads to worse performance on learning parity than directly generating the prediction, our technical results also give a rigorous and comprehensive explanation of the mechanistic reasons for this impact.
American Journal of Neuroradiology · 2025-12-01 · 1 citation
article · Open access
ABSTRACT
BACKGROUND AND PURPOSE: Differentiating true progression from treatment-related changes in patients with glioblastoma (GBM) remains a major diagnostic challenge. Amino acid PET tracers such as [F18]-Fluciclovine provide biologically specific information, but clinical real-world validation across institutions is limited. We aimed to evaluate the clinical diagnostic performance of [F18]-Fluciclovine PET/MRI for distinguishing true progression from treatment-related change in patients with presumed GBM progression across 2 academic centers.
MATERIALS AND METHODS: In this retrospective, multi-institutional, IRB-approved study, we analyzed [F18]-Fluciclovine PET/MRI scans performed in patients with presumed GBM progression. All PET/MRI examinations were clinically indicated and performed as part of routine standard-of-care imaging. Clinical classification was based on histopathology or imaging and clinical follow-up. SUVmax was measured in enhancing lesions. Group comparisons were assessed with Mann–Whitney U tests. Diagnostic performance was evaluated using receiver operating characteristic (ROC) analysis, including derivation of an optimal cutoff using Youden’s index and validation of the previously published diagnostic threshold of 4.66 (Nabavizadeh et al.). Subgroup analyses compared diagnostic performance across institutions.
RESULTS: Thirty-six patients with presumed GBM progression (Institution 1, n = 17; Institution 2, n = 19) provided 22 examinations classified as true tumor progression and 14 as treatment-related change. There were no significant differences in clinical or demographic study population characteristics between the two institutions. SUVmax was significantly higher in true tumor progression than in treatment-related change (median [interquartile range], 8.73 [5.86–10.83] versus 3.71 [1.70–4.67], p < .01). Combined ROC analysis demonstrated excellent diagnostic performance (AUC = 0.90; 95% CI, 0.79–0.98). The optimal threshold of 5.7 yielded 86% sensitivity (0.70–0.99) and 86% specificity (0.64–0.99). Applying the published threshold of 4.66 produced similar results (AUC = 0.90), with 91% sensitivity (0.71–0.99) and 71% specificity (0.42–0.92). A stratified analysis demonstrated comparable diagnostic performance across both institutions.
CONCLUSIONS: [F18]-Fluciclovine PET/MRI demonstrated high diagnostic accuracy for differentiating true GBM progression from treatment-related changes, with consistent SUVmax thresholds across 2 institutions. These findings support the generalizability of [F18]-Fluciclovine PET as a biologically specific adjunct to conventional MR imaging for patients with presumed GBM progression.
ABBREVIATIONS: [F18]-FACBC and [F18]-Fluciclovine = Trans-1-amino-3-[F18]-fluorocyclobutane-1-carboxylic acid; GBM = glioblastoma; IDH = Isocitrate Dehydrogenase; SUVmax = maximum standardized uptake value
Automated Verification of Soundness of DNN Certifiers
Proceedings of the ACM on Programming Languages · 2025-04-09 · 3 citations
article · Open access · Senior author
The uninterpretability of Deep Neural Networks (DNNs) hinders their use in safety-critical applications. Abstract Interpretation-based DNN certifiers provide promising avenues for building trust in DNNs. Unsoundness in the mathematical logic of these certifiers can lead to incorrect results. However, current approaches to ensure their soundness rely on manual, expert-driven proofs that are tedious to develop, limiting the speed of developing new certifiers. Automating the verification process is challenging due to the complexity of verifying certifiers for arbitrary DNN architectures and handling diverse abstract analyses. We introduce ProveSound, a novel verification procedure that automates the soundness verification of DNN certifiers for arbitrary DNN architectures. Our core contribution is the novel concept of a symbolic DNN, using which, ProveSound reduces the soundness property, a universal quantification over arbitrary DNNs, to a tractable symbolic representation, enabling verification with standard SMT solvers. By formalizing the syntax and operational semantics of ConstraintFlow, a DSL for specifying certifiers, ProveSound efficiently verifies both existing and new certifiers, handling arbitrary DNN architectures. Our code is available at https://github.com/uiuc-focal-lab/constraintflow.git
Frequent coauthors
- Martin Vechev · 40 shared
- Markus Püschel · ETH Zurich · 23 shared
- Saša Misailović · University of Illinois Urbana-Champaign · 12 shared
- Mislav Balunović · ETH Zurich · 10 shared
- Shubham Ugare · 9 shared
- Sudhir Chandra Sarangi · All India Institute of Medical Sciences · 9 shared
- Seema Sood · All India Institute of Medical Sciences · 9 shared
- Sarthak Das · All India Institute of Medical Sciences, Deoghar · 9 shared
Awards & honors
- Carle Illinois College of Medicine Awards