Jiawei Han

Michael Aiken Chair · Verified

University of Illinois Urbana-Champaign · Computer Science

Active 1988–2025

h-index: 146

Citations: 121.9k

Papers: 1.2k (338 in the last 5 years)

Funding: $25.0M


About

Jiawei Han is a professor at the Siebel School of Computing and Data Science within The Grainger College of Engineering at the University of Illinois Urbana-Champaign. He holds a Ph.D. in Computer Sciences from the University of Wisconsin-Madison, obtained in 1985. His research areas include Artificial Intelligence, Bioinformatics and Computational Biology, and Data and Information Systems. Han has contributed to data mining, information systems, and AI, and his work has been recognized with awards and honors. He teaches courses such as Data Mining Principles and Text Mining with Large Language Models, and he has been recognized for his research and mentorship; his students and colleagues have received prominent awards at conferences such as KDD.

Research topics

Computer Science
Artificial Intelligence
Data Mining
Machine Learning
Data science
Theoretical computer science
Humanities
Computer Security
Biology
Natural Language Processing
Information Retrieval
Ecology
Library science
Software engineering
Human–computer interaction
Philosophy
Database
World Wide Web
Art
Programming language

Selected publications

Topic Coverage-based Demonstration Retrieval for In-Context Learning
ArXiv.org · 2025-09-15
preprint · Open access
The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
Publisher · OA PDF · DOI
Multimodal Search in Chemical Documents and Reactions
2025-07-13
article · Senior author
Publisher · DOI
PairSem: LLM-Guided Pairwise Semantic Matching for Scientific Document Retrieval
ArXiv.org · 2025-10-10
preprint · Open access
Scientific document retrieval is a critical task for enabling knowledge discovery and supporting research across diverse domains. However, existing dense retrieval methods often struggle to capture fine-grained scientific concepts in texts due to their reliance on holistic embeddings and limited domain understanding. Recent approaches leverage large language models (LLMs) to extract fine-grained semantic entities and enhance semantic matching, but they typically treat entities as independent fragments, overlooking the multi-faceted nature of scientific concepts. To address this limitation, we propose Pairwise Semantic Matching (PairSem), a framework that represents relevant semantics as entity-aspect pairs, capturing complex, multi-faceted scientific concepts. PairSem is unsupervised, base retriever-agnostic, and plug-and-play, enabling precise and context-aware matching without requiring query-document labels or entity annotations. Extensive experiments on multiple datasets and retrievers demonstrate that PairSem significantly improves retrieval performance, highlighting the importance of modeling multi-aspect semantics in scientific information retrieval.
Publisher · OA PDF · DOI
GraphGPT-o: Synergistic Multimodal Comprehension and Generation on Graphs
2025-06-10 · 1 citation
article · Senior author
The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be publicly available at https://github.com/YiFang99/GraphGPT-o.
Publisher · DOI
Effect of Fe2O3 on Coke Solution-Loss Characteristics under a CO2+H2O Atmosphere — A Kinetics Study
ISIJ International · 2025-12-10
article · Open access
This research aimed to investigate the influence of Fe-based addition on the solution-loss kinetics of coke in a hydrogen-enriched blast furnace. The solution-loss reactions of base coke (BC) and Fe-based coke (BC+Fe) in a CO2 + 20%H2O atmosphere across the temperature range 1000-1200 °C were carried out using a homemade coke reactivity measurement device with continuous water inflow. The kinetics of the Boudouard reaction (C + CO2 = 2CO) and the water-gas reaction (C + H2O = CO + H2) were assessed by monitoring the outlet gas composition (CO and H2) to quantitatively evaluate the catalytic influence of Fe2O3 on the solution-loss reaction. The results indicate that the solution-loss rates of BC+Fe coke are higher than those of BC coke, and the solution-loss ratios of BC+Fe coke exceed those of BC coke by 10.5-26.8% for the Boudouard reaction and 12.1-42.2% for the water-gas reaction. Furthermore, Fe2O3 lowers the apparent activation energy (Ea) of the Boudouard reaction by 4.2% and that of the water-gas reaction by 7.8%, which shows that the catalytic effect of Fe2O3 is stronger for the water-gas reaction than for the Boudouard reaction. SEM analysis shows that the BC+Fe coke has a more varied pore structure and a wider range of pore sizes on the surface. XRD analysis indicates that Fe2O3 reacts with Si and Al species in the minerals to form Fe-based silicates and aluminosilicates, which could contribute to the catalytic effect on the coke solution-loss reaction.
Publisher · OA PDF · DOI
SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models
ArXiv.org · 2025-08-03 · 2 citations
preprint · Open access
Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained with massive optical imagery, much multispectral/hyperspectral data still lacks corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter-efficient fine-tuning framework that adapts existing RSFMs as the backbone while introducing a two-stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked-reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute-oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi-attribute expert knowledge while performing layer-wise fine-tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute-refined Adapter (Are-adapter) into the first-stage framework. By iteratively querying low-level semantic features with high-level representations, the model learns to focus on task-beneficial attributes, enabling customized adjustment of RSFMs. Following this two-phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The code will be available at https://github.com/YuxiangZhang-BIT.
Publisher · OA PDF · DOI
Retrieval And Structuring Augmented Generation with Large Language Models
2025-08-03 · 5 citations
preprint · Open access · Senior author
Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explores text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigates how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.
Publisher · OA PDF · DOI
Hybrid Latent Reasoning via Reinforcement Learning
ArXiv.org · 2025-05-24
preprint · Open access
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
Publisher · OA PDF · DOI
GRACE: Generative Representation Learning via Contrastive Policy Optimization
ArXiv.org · 2025-10-06
preprint · Open access · Senior author
Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales--structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data and code are available at https://github.com/GasolSun36/GRACE.
Publisher · OA PDF · DOI
Nature-inspired tooth-mimetic bamboo hierarchical composites with superhard, waterproof, and stain-resistant protective structures
Advanced Composites and Hybrid Materials · 2025-08-29 · 6 citations
article · Open access · 1st author · Corresponding
Conventional bamboo waterproofing modifications frequently face limitations such as complex processing, limited functionality, inadequate mechanical durability, and reliance on petroleum-based polymers. Inspired by the hierarchical enamel-dentin structure of teeth, we propose a novel biomimetic strategy that utilizes bamboo’s intrinsic components to in situ generate a robust 170 µm-thick protective layer. This is achieved through selective surface delignification, directional NaIO4 oxidation, and subsequent cell wall reconstruction via hot-pressing, effectively overcoming these longstanding challenges. Within this structure, the protective layer of the resulting tooth-mimetic bamboo hierarchical composite (TMB) forms via plasticization induced by the hydroxyl-aldehyde condensation reaction of dialdehyde cellulose, while the core layer densifies during hot-pressing. Consequently, TMB exhibits exceptional waterproofing, demonstrating a 99.0% reduction in surface water absorption rate compared to natural bamboo (NB). Remarkably, the protective layer maintains its waterproofing efficacy even after enduring over 100 cycles of abrasion and peeling. Additionally, TMB effectively repels common household liquids (e.g., coffee, milk, juice), and stubborn stains such as those from oil-based markers can be readily wiped off. Notably, TMB simultaneously achieves significant mechanical enhancement, attaining a Shore hardness of 92.0 HD alongside outstanding flexural and tensile properties. As a scalable composite material, TMB offers innovative strategies for protecting bamboo-based products and holds significant promise for diverse applications.
Publisher · OA PDF · DOI

Recent grants

CPS: Small: Collaborative Research: Foundations of Cyber-Physical Networks
NSF · $275k · 2009–2014
DATA SCIENCE RESEARCH
NIH · $20.8M · 2021
III-Core:Small: MoveMine: Mining Sophisticated Patterns and Actionable Knowledge from Massive Moving Object Data
NSF · $500k · 2010–2016
III: Small: Multi-Dimensional Structuring, Summarizing and Mining of Social Media Data
NSF · $500k · 2016–2021
III: Medium: Collaborative Research: Towards On-Line Analytical Mining of Heterogeneous Information Networks
NSF · $831k · 2009–2013

Frequent coauthors

Xiang Ren
98 shared
Yizhou Sun
71 shared
Jingbo Shang
69 shared
Meng Yu
66 shared
Jiaming Shen
60 shared
Xifeng Yan
Beijing Institute of Technology
57 shared
Chao Zhang
56 shared
Jian Pei
Duke University
54 shared

Labs

Siebel School of Computing and Data Science · PI

Education

Ph.D., Computer Science
University of Wisconsin-Madison
1985
M.S., Computer Science
University of Science and Technology of China
1982
B.S., Computer Science
University of Science and Technology of China
1980

Awards & honors

2025 ACM SIGKDD Rising Star Award
2025 ACM SIGKDD Dissertation Award, Runner-Up
2025 ACM SIGKDD Dissertation Award, Honorable Mention
2024 ACM SIGKDD Dissertation Award
