About
Jian Pei is a professor in the field of computer science with research interests that include large language model (LLM)-based agents. His work focuses on the generalizability of these agents, which refers to their ability to maintain consistently high performance across varied instructions, tasks, environments, and domains, especially those different from the agent’s fine-tuning data. Pei has contributed to advancing the understanding of generalizability by providing comprehensive reviews that clarify its definition and boundaries, review existing benchmarks, and categorize strategies for improving generalizability. These strategies include methods targeting the backbone LLM, agent components, and their interactions. Furthermore, his research distinguishes between generalizable frameworks and generalizable agents, outlining how frameworks can be translated into agent-level generalizability. Pei’s work aims to establish a foundation for principled research on building LLM-based agents that generalize reliably across diverse real-world applications, identifying future directions such as standardized evaluation frameworks, variance- and cost-based metrics, and hybrid approaches integrating methodological innovations with agent architecture-level designs.
Research topics
- Computer Science
- Data Mining
- Artificial Intelligence
- Machine Learning
- Data science
Selected publications
Research Square · 2025-12-16
preprint · Open access
The 2nd Workshop on Large Language Models for E-Commerce
2025-08-03
article
Large Language Models (LLMs) are revolutionizing E-Commerce by enabling product recommendation, search, classification, question answering, and advertising applications. Their increasing adoption in real-world systems underscores their potential; however, challenges persist in ensuring accuracy, efficiency, fairness, and privacy. This workshop aims to bring together researchers and industry practitioners to explore both the limitations and opportunities of LLMs in e-commerce. The workshop seeks to foster collaboration, bridge the gap between academia and industry, and drive innovation in the application of LLMs to E-Commerce through discussions on model design, algorithmic advancements, and practical deployment.
AI4DE: The 1st International Workshop on AI for Data Editing
2025-08-03
article · Open access · Senior author
Machine learning traditionally emphasizes developing models for given datasets, but real-world data is often messy, making model improvement insufficient for enhancing performance. AI for data editing (AI4DE) is an emerging field that systematically improves datasets, leading to significant practical ML advancements. While experienced data scientists have manually refined datasets through trial-and-error and intuition, AI4DE approaches data enhancement as a systematic engineering discipline. AI4DE represents a shift from focusing on models to the underlying data used for training and evaluation. Despite the dominance of common model architectures and predictable scaling rules, building and using datasets remain labor-intensive and costly, lacking infrastructure and best practices. The AI4DE movement aims to develop efficient, high-productivity open data engineering tools for modern ML systems. This workshop seeks to foster an interdisciplinary AI4DE community to address practical data challenges, including data collection, generation, labeling, preprocessing, augmentation, quality evaluation, debt, and governance. By defining and shaping the AI4DE movement, this workshop aims to influence the future of AI and ML, inviting interested parties to contribute through paper submissions
Research on a lightweight traffic sign detection algorithm based on GCL-YOLOv8
2025-09-19
article · 1st author · Corresponding
A lightweight traffic sign detection algorithm, GCL-YOLOv8, is proposed as an improvement on YOLOv8n to address low detection accuracy and a large parameter count, which lead to long computation time and high complexity. First, a new module, GRA, is designed using GhostModule, RepConv, and ECA channel attention to replace the original C2f module, balancing inference efficiency against feature expression ability and reducing computational complexity while preserving accuracy. Second, the lightweight CARAFE upsampling module replaces ordinary upsampling, applying its content-aware reassembly mechanism to amplify and preserve feature details. Third, an LADH detection head is added to reduce the parameter count and computational complexity. Finally, Wise MPDIOU replaces the original CIoU loss function, simplifying the loss calculation and improving convergence speed and regression accuracy. Compared with the baseline YOLOv8n, the algorithm improves accuracy (P) by 2.5% and mAP50 by 2.7% while reducing the parameter count (Params) by 42.3% and GFLOPs by 48.8% on the TT100K dataset. These results demonstrate that the improved algorithm is both lightweight and accurate.
CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets
2025-05-19
article · Senior author
This paper introduces the novel concept of cost-sensitive data acquisition (CDA), a desirable addition to the data preparation process in a data science pipeline that focuses on strategically acquiring data from various priced sources, such as data markets, under budget constraints. CDA improves data quality by identifying the best set of values to acquire and integrating them into incomplete datasets, optimizing a particular objective defined in the resulting tables (data products). This paper focuses on CDA for a single relational table while also exploring possible extensions to multi-table contexts. First, we introduce an algorithm that utilizes conformal risk control to select rows likely to be included in the data product with probabilistic guarantees. We then investigate ways to acquire data to complete these rows under various CDA scenarios. We start with a scenario where data records are available on a row-wise basis, which proves to be an NP-hard problem. To solve this problem, we introduce an efficient row-wise greedy algorithm (RGreedy), which approaches an approximation ratio of 1. Subsequently, we explore a more generic scenario where each unit of data for acquisition may involve multiple records with a subset of the attributes. We propose a coverage minimum option selection (CMOS) algorithm for its solution, focusing on scalability. Through empirical evaluations on three real-world datasets and one synthetic dataset, we demonstrate that our methods yield performance improvements of 20% to 40% over applicable baselines.
Computing Shapley Values in Preference Queries
2025-05-19
article
This paper tackles the novel problem of computing Shapley values when multiple data owners collaborate to answer preference queries. Despite extensive existing research on preference queries and Shapley value computation separately, the evaluation of data owners' contributions to cooperatively answering such queries has not been systematically explored. To address this gap, we first establish that, for a linear preference utility function with one data point per owner, the Shapley value can be computed in polynomial time. This finding is applicable to attribute weight spaces that are subsets of a simplex and represent various linear preference utility functions. For scenarios involving multiple data points per owner, we observe that only the locally optimal points from each data owner can make non-zero marginal contributions. Thus, we partition the attribute weight space into a polynomial number of subsets, ensuring that in each subset, only one data point per owner needs to be considered. Experimental results on real Airbnb Listing data and synthetic data sets validate the effectiveness and efficiency of our algorithms, which significantly outperform baseline methods.
2025-08-03 · 5 citations
article · Open access
Large language models (LLMs) based on the Transformer architecture are powerful but face challenges with deployment, inference latency, and costly fine-tuning. These limitations highlight the emerging potential of small language models (SLMs), which can either replace LLMs through innovative architectures and technologies, or assist them as efficient proxy or reward models. Emerging architectures such as Mamba and xLSTM address the quadratic scaling of inference with window length in Transformers by enabling linear scaling. To maximize SLM performance, test-time compute scaling strategies reduce the performance gap with LLMs by allocating extra compute budget during test time. Beyond standalone usage, SLMs could also assist LLMs via weak-to-strong learning, proxy tuning, and guarding, fostering secure and efficient LLM deployment. Lastly, the trustworthiness of SLMs remains a critical yet underexplored research area. However, there is a lack of tutorials on cutting-edge SLM technologies, prompting us to conduct one.
Gene · 2025-10-01 · 6 citations
article · 1st author
Frontiers in Chemistry · 2025-12-12 · 3 citations
article · Open access · 1st author
Developing a highly sensitive and convenient immunosensor for biomarker detection is important for enhancing the effectiveness of melanoma prevention and control measures. In this work, an immunosensor was fabricated for sensitive detection of the melanoma biomarker S100B based on enhanced electrochemiluminescence (ECL) via electronic metal-support interactions. CoAl-layered double hydroxide (LDH) was selected to modify the low-cost indium tin oxide (ITO) electrode due to its high surface area and tunable structure. To improve its conductivity and electron transfer capability, oxygen vacancies (Ov) were introduced on LDH through an alkaline etching process, resulting in the LDH-Ov structure. Platinum nanoparticles (Pt) were then loaded in situ onto the LDH-Ov surface (Pt@LDH-Ov/ITO). The electronic metal-support interaction (EMSI) between LDH-Ov and Pt nanoparticles played a critical role in improving the catalytic activity, leading to an enhanced ECL signal in the luminol-dissolved oxygen (DO) system. The immunorecognition interface was fabricated on Pt@LDH-Ov/ITO, enabling selective detection of S100B. The constructed immunosensor exhibited a linear detection range for S100B from 100 fg/mL to 100 ng/mL, with a limit of detection (LOD) of 65 fg/mL. The high performance and enhanced sensitivity of the immunosensor make it a promising tool for the early diagnosis, monitoring of recurrence, and personalized treatment of melanoma.
Finding Non-Redundant Simpson's Paradox from Multidimensional Data
ArXiv.org · 2025-11-02
preprint · Open access
Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables, which obscure essential patterns and inflate computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy - sibling child, separator, and statistic equivalence - and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing redundant paradoxes and design efficient algorithms that integrate depth-first materialization of the base table with redundancy-aware paradox discovery. Experiments on real-world datasets and synthetic benchmarks show that redundant paradoxes are widespread, on some real datasets constituting over 40% of all paradoxes, while our algorithms scale to millions of records, reduce run time by up to 60%, and discover paradoxes that are structurally robust under data perturbation. These results demonstrate that Simpson's paradoxes can be efficiently identified, concisely summarized, and meaningfully interpreted in large multidimensional datasets.
Frequent coauthors
- Yang Yu · 666 shared
- Enhong Chen, University of Science and Technology of China · 666 shared
- Zhi‐Hua Zhou, Nanjing University · 652 shared
- João Gama, INESC TEC · 651 shared
- Chengqi Zhang · 651 shared
- Geoffrey I. Webb · 650 shared
- Hiroshi Motoda, Osaka University · 650 shared
- Jaideep Srivastava · 649 shared
Education
- 1994
Ph.D., Computer Science
University of California, Berkeley
- 1991
M.S., Computer Science
University of California, Berkeley
- 1988
B.S., Computer Science
University of Science and Technology of China
Awards & honors
- IEEE Fellow