Jiawei Han
Michael Aiken Chair · Verified · University of Illinois Urbana-Champaign · Computer Science
Active 1988–2025
About
Jiawei Han is a professor at the Siebel School of Computing and Data Science within The Grainger College of Engineering at the University of Illinois Urbana-Champaign. He holds a Ph.D. in Computer Sciences from the University of Wisconsin-Madison, obtained in 1985. His research areas include Artificial Intelligence, Bioinformatics and Computational Biology, and Data and Information Systems. Han has contributed to the fields of data mining, information systems, and AI, with notable work recognized through various awards and honors. He is actively involved in teaching courses such as Data Mining Principles and Text Mining with Large Language Models, and has been recognized for his research and mentorship, with his students and colleagues receiving prominent awards at conferences like KDD.
Research topics
- Computer Science
- Artificial Intelligence
- Data Mining
- Machine Learning
- Data science
- Theoretical computer science
- Humanities
- Computer Security
- Biology
- Natural Language Processing
- Information Retrieval
- Ecology
- Library science
- Software engineering
- Human–computer interaction
- Philosophy
- Database
- World Wide Web
- Art
- Programming language
Selected publications
Topic Coverage-based Demonstration Retrieval for In-Context Learning
ArXiv.org · 2025-09-15
preprint · Open access
The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
Multimodal Search in Chemical Documents and Reactions
2025-07-13
article · Senior author
PairSem: LLM-Guided Pairwise Semantic Matching for Scientific Document Retrieval
ArXiv.org · 2025-10-10
preprint · Open access
Scientific document retrieval is a critical task for enabling knowledge discovery and supporting research across diverse domains. However, existing dense retrieval methods often struggle to capture fine-grained scientific concepts in texts due to their reliance on holistic embeddings and limited domain understanding. Recent approaches leverage large language models (LLMs) to extract fine-grained semantic entities and enhance semantic matching, but they typically treat entities as independent fragments, overlooking the multi-faceted nature of scientific concepts. To address this limitation, we propose Pairwise Semantic Matching (PairSem), a framework that represents relevant semantics as entity-aspect pairs, capturing complex, multi-faceted scientific concepts. PairSem is unsupervised, base retriever-agnostic, and plug-and-play, enabling precise and context-aware matching without requiring query-document labels or entity annotations. Extensive experiments on multiple datasets and retrievers demonstrate that PairSem significantly improves retrieval performance, highlighting the importance of modeling multi-aspect semantics in scientific information retrieval.
GraphGPT-o: Synergistic Multimodal Comprehension and Generation on Graphs
2025-06-10 · 1 citation
article · Senior author
The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be publicly available at https://github.com/YiFang99/GraphGPT-o.
ISIJ International · 2025-12-10
article · Open access
This research aimed to investigate the influence of Fe-based addition on the solution-loss kinetics of coke in a hydrogen-enriched blast furnace. The solution-loss reactions of base coke (BC) and Fe-based coke (BC+Fe) in a CO2 + 20%H2O atmosphere across the temperature range 1000-1200 °C were carried out using a homemade coke reactivity measurement device with continuous water inflow. The kinetics of the Boudouard reaction (C + CO2 = 2CO) and the water-gas reaction (C + H2O = CO + H2) were assessed by monitoring the outlet gas composition (CO and H2) to quantitatively evaluate the catalytic influence of Fe2O3 on the solution-loss reaction. The results indicate that the solution-loss rates of BC+Fe coke are higher than those of BC coke, and the solution-loss ratios of BC+Fe coke are 10.5-26.8% higher for the Boudouard reaction and 12.1-42.2% higher for the water-gas reaction than those of BC coke. Furthermore, Fe2O3 lowers the apparent activation energy (Ea) of the Boudouard reaction by 4.2% and that of the water-gas reaction by 7.8%, which shows that the catalytic effect of Fe2O3 is stronger for the water-gas reaction than for the Boudouard reaction. SEM analysis shows that the BC+Fe coke has a more varied pore structure and a wider range of pore sizes on the surface. XRD analysis indicates that Fe2O3 reacts with Si and Al species in the minerals to form Fe-based silicates and aluminosilicates, which could contribute to the catalytic effect on the coke solution-loss reaction.
SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models
ArXiv.org · 2025-08-03 · 2 citations
preprint · Open access
Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained with massive optical imagery, multispectral/hyperspectral data still lack corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as the backbone while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The codes will be available from the website: https://github.com/YuxiangZhang-BIT.
Retrieval And Structuring Augmented Generation with Large Language Models
2025-08-03 · 5 citations
preprint · Open access · Senior author
Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explores text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigates how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.
Hybrid Latent Reasoning via Reinforcement Learning
ArXiv.org · 2025-05-24
preprint · Open access
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
GRACE: Generative Representation Learning via Contrastive Policy Optimization
ArXiv.org · 2025-10-06
preprint · Open access · Senior author
Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales--structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data and code are available at https://github.com/GasolSun36/GRACE.
Advanced Composites and Hybrid Materials · 2025-08-29 · 6 citations
article · Open access · 1st author · Corresponding
Conventional bamboo waterproofing modifications frequently face limitations such as complex processing, limited functionality, inadequate mechanical durability, and reliance on petroleum-based polymers. Inspired by the hierarchical enamel-dentin structure of teeth, we propose a novel biomimetic strategy that utilizes bamboo’s intrinsic components to in situ generate a robust 170 µm-thick protective layer. This is achieved through selective surface delignification, directional NaIO4 oxidation, and subsequent cell wall reconstruction via hot-pressing, effectively overcoming these longstanding challenges. Within this structure, the protective layer of the resulting tooth-mimetic bamboo hierarchical composite (TMB) forms via plasticization induced by the hydroxyl-aldehyde condensation reaction of dialdehyde cellulose, while the core layer densifies during hot-pressing. Consequently, TMB exhibits exceptional waterproofing, demonstrating a 99.0% reduction in surface water absorption rate compared to natural bamboo (NB). Remarkably, the protective layer maintains its waterproofing efficacy even after enduring over 100 cycles of abrasion and peeling. Additionally, TMB effectively repels common household liquids (e.g., coffee, milk, juice), and stubborn stains such as those from oil-based markers can be readily wiped off. Notably, TMB simultaneously achieves significant mechanical enhancement, attaining a Shore hardness of 92.0 HD alongside outstanding flexural and tensile properties. As a scalable composite material, TMB offers innovative strategies for protecting bamboo-based products and holds significant promise for diverse applications.
Recent grants
CPS: Small: Collaborative Research: Foundations of Cyber-Physical Networks
NSF · $275k · 2009–2014
NIH · $20.8M · 2021
NSF · $500k · 2010–2016
III: Small: Multi-Dimensional Structuring, Summarizing and Mining of Social Media Data
NSF · $500k · 2016–2021
NSF · $831k · 2009–2013
Frequent coauthors
- 98 shared
Xiang Ren
- 71 shared
Yizhou Sun
- 69 shared
Jingbo Shang
- 66 shared
Meng Yu
- 60 shared
Jiaming Shen
- 57 shared
Xifeng Yan
Beijing Institute of Technology
- 56 shared
Chao Zhang
- 54 shared
Jian Pei
Duke University
Labs
Siebel School of Computing and Data Science · PI
Education
- 1986
Ph.D., Computer Science
University of Wisconsin-Madison
- 1982
M.S., Computer Science
University of Science and Technology of China
- 1980
B.S., Computer Science
University of Science and Technology of China
Awards & honors
- 2025 ACM SIGKDD Rising Star Award
- 2025 ACM SIGKDD Dissertation Award, Runner-Up
- 2025 ACM SIGKDD Dissertation Award, Honorable Mention
- ACM SIGKDD 2024 Dissertation Award