Lingming Zhang
Associate Professor · University of Illinois Urbana-Champaign · Computer Science
Active 2009–2026
About
Lingming Zhang is an Associate Professor at the Siebel School of Computing and Data Science at the University of Illinois Urbana-Champaign. His main research interest is Software Engineering, with a focus on its synergy with Machine Learning, Programming Languages, and Formal Methods. His work covers AI for Systems and Security, Code Large Language Models (LLMs) and Agents, and Software Engineering. His contributions span software testing, debugging, and the application of AI techniques to software engineering. He teaches Software Engineering courses and advanced seminars on software testing, debugging, and AI integration in software development.
Research topics
- Artificial Intelligence
- Computer Science
- Programming language
- Computer Security
- Embedded system
- Computer engineering
Selected publications
Myeloperoxidase regulates hypoxia-induced inflammation and oxidative stress in liver-spleen axis
Journal of Inflammation · 2026-05-02
article · Open access
BACKGROUND: High-altitude hypobaric hypoxia induces inflammation and oxidative stress, yet the role of myeloperoxidase (MPO) in this pathology remains incompletely understood. This study aimed to investigate whether MPO mediates injury to the liver-spleen axis under hypoxic conditions. RESULTS: mice displayed aggravated histopathological injury, accompanied by excessive phagocyte recruitment and elevated expression of key chemokines (KC, MCP1, MIP2) and proinflammatory mediators (TNFα, IL1β, IL17A). At the molecular level, MPO absence increased splenic protein expression of NFκB, NLRP3, and iNOS, while dysregulating the antioxidant response via the NRF2/HO1 pathway. CONCLUSIONS: These results reveal a novel protective role for MPO during hypoxic stress, where it functions to moderate the innate immune response and limit collateral tissue damage in the liver-spleen axis. The study provides new insights into the complex immunomodulatory functions of MPO and suggests its activity is essential for maintaining immune homeostasis during acute hypoxia.
KernelGPT: Enhanced Kernel Fuzzing via Large Language Models
2025-03-27 · 18 citations
article · Open access · Senior author
Bugs in operating system kernels can affect billions of devices and users all over the world. As a result, a large body of research has been focused on kernel fuzzing, i.e., automatically generating syscall (system call) sequences to detect potential kernel bugs or vulnerabilities. Kernel fuzzing aims to generate valid syscall sequences guided by syscall specifications that define both the syntax and semantics of syscalls. While there has been existing work trying to automate syscall specification generation, this remains largely manual work, and a large number of important syscalls are still uncovered.
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
ArXiv.org · 2025-06-13
preprint · Open access · Senior author
Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents' capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
ArXiv.org · 2025-11-17
preprint · Open access · Senior author
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
Demystifying LLM-Based Software Engineering Agents
Proceedings of the ACM on Software Engineering · 2025-06-19 · 36 citations
article · Open access · Senior author
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless – an agentless approach to automatically resolve software development issues. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents at the time of paper submission! Agentless also achieves more than 50% solve rate when using Claude 3.5 Sonnet on the new SWE-bench Verified benchmark. In fact, Agentless has already been adopted by OpenAI as the go-to approach to showcase the real-world coding performance of both GPT-4o and the new o1 models; more recently, Agentless has also been used by DeepSeek to evaluate their newest DeepSeek V3 and R1 models. Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patches or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-𝑆 by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the currently overlooked potential of a simplistic, cost-effective technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction. We have open-sourced Agentless at: https://github.com/OpenAutoCoder/Agentless
Proceedings of the ACM on Software Engineering · 2025-06-22
article · Open access
While existing machine learning (ML) frameworks focus on established platforms, like running CUDA on server-grade GPUs, there have been growing demands to enable emerging AI applications in a broader set of scenarios, such as running Large Language Models (LLMs) within browsers and mobile phones. However, deploying emerging models on new platforms (such as Metal and WebGPU) presents significant software engineering challenges due to rapid model evolution and limited tooling and practices for these platforms. Previous practice for ML model deployment often follows a bottom-up fashion, where engineers first implement individual required operators and then put them together. However, this traditional development approach fails to meet the productivity requirements when deploying emerging ML applications, with the testing and debugging part as a bottleneck. To this end, we introduce TapML, a top-down approach designed to streamline model deployment on diverse platforms. While the traditional bottom-up approach requires crafting manual tests, TapML automatically creates high-quality, realistic test data through operator-wise test carving. Furthermore, TapML uses a migration-based strategy to gradually offload model implementation from the mature source platform to the target platform, minimizing the debugging scope of compound errors. TapML has been used as the default development method in the MLC-LLM project to deploy emerging ML models. Within 2 years, TapML has accelerated the deployment of 105 emerging models in 27 model architectures across 5 emerging platforms. We show that TapML effectively boosts developer productivity while ensuring the quality of deployed models. Furthermore, we summarize comprehensive case studies from our real-world development, offering best practices for developing emerging ML systems.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
ArXiv.org · 2025-02-25 · 2 citations
preprint · Open access
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
UniDebugger: Hierarchical Multi-Agent Framework for Unified Software Debugging
2025-01-01
article · Open access
Cheryl Lee, Chunqiu Steven Xia, Longji Yang, Jen-tse Huang, Zhouruixing Zhu, Lingming Zhang, Michael R. Lyu. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
CWM: An Open-Weights LLM for Research on Code Generation with World Models
ArXiv.org · 2025-09-30
preprint · Open access
We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.
LEAM++: Learning for Selective Mutation Fault Construction
ACM Transactions on Software Engineering and Methodology · 2025-03-21 · 1 citation
article · Senior author
Mutation faults are the core of mutation testing and have been widely used in many software testing tasks. Hence, efficiently constructing high-quality mutation faults is critical. To address the effectiveness limitations of traditional and deep learning-based mutation techniques, we first proposed LEAM, utilizing a syntax-guided encoder–decoder architecture with extended grammar rules. While LEAM significantly enhances the effectiveness, it does not consider the associated testing cost. To further improve the efficiency of LEAM, we propose LEAM++, adopting a novel selective mutation fault construction module based on the probability of grammar rule sequences and the similarity of mutation faults. We extensively evaluate LEAM++ using Defects4J. Regarding effectiveness, the results demonstrate that the mutation faults constructed by LEAM++ can better represent real faults than two traditional techniques (Major and PIT) and the deep learning-based technique (DeepMutation), and substantially boost three downstream applications, i.e., mutation-based test case prioritization, mutation-based fault localization, and mutation-based bug detection. Regarding efficiency, LEAM++ demonstrates superiority over the four selective mutation testing techniques across three scenarios, i.e., mutation testing, mutation-based test case prioritization, and mutation-based fault localization. Our work serves as an important step toward the efficiently automated construction of mutation faults.
Recent grants
CRII: SHF: Machine-Learning-Based Test Effectiveness Prediction
NSF · $174k · 2016–2019
NSF · $363k · 2018–2021
CAREER: Maximal and Scalable Unified Debugging for the JVM Ecosystem
NSF · $520k · 2021–2026
NSF · $254k · 2020–2023
CAREER: Maximal and Scalable Unified Debugging for the JVM Ecosystem
NSF · $191k · 2020–2021
Frequent coauthors
- Lu Zhang · Tianjin University · 38 shared
- Dan Hao · 37 shared
- Sarfraz Khurshid · 22 shared
- Chunqiu Steven Xia · University of Illinois Urbana-Champaign · 20 shared
- Hong Mei · Beijing Institute of Technology · 19 shared
- Yuqun Zhang · 18 shared
- Junjie Chen · Tianjin University · 16 shared
- Yinlin Deng · University of Illinois Urbana-Champaign · 15 shared
Labs
Siebel School of Computing and Data Science · PI
Awards & honors
- 2025 Google Academic Research Award