Wenhao David Huang
Professor · University of Illinois Urbana-Champaign · Department of Biomedical and Translational Sciences
Active 2003–2025
About
Wenhao David Huang is a Professor in the Biomedical and Translational Sciences department at the Carle Illinois College of Medicine, University of Illinois Urbana-Champaign. His recent courses include gamified learning in medical education, program planning and evaluation, principles of health-related education, learning technologies, supervised internships, advanced theories in health-related education, learning systems, thesis seminars, and independent studies. His research focus involves health innovation, medical education, and translational sciences, contributing to the integration of engineering and innovation in medicine. He is actively involved in cross-disciplinary research and health innovation initiatives at the college.
Research topics
- Medicine
- Psychiatry
- Family medicine
- Internal medicine
- Nursing
- Emergency medicine
- Demography
- Obstetrics
- Psychology
Selected publications
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
ArXiv.org · 2025-05-27
Preprint · Open access
Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy the gap. MARS-Bench is constructed from play-by-play text commentary so to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives, explicit reasoning significantly boosts LLMs' robustness on handling long complex dialogue sessions, and LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, we provide mechanistic interpretability on how attention sinks due to special tokens lead to LLMs' performance degradation when handling long complex dialogue sessions based on attention visualization experiment in Qwen2.5-7B-Instruction.
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
ArXiv.org · 2025-05-29
Preprint · Open access
Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales -- clip (seconds), shot (tens of seconds), event (minutes), and story (hours) -- all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4 to 8 carefully designed questions, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
ArXiv.org · 2025-09-02
Preprint · Open access
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
ArXiv.org · 2025-08-14
Article · Open access
AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
Do Mobile Learning Experiences Affect College Students’ Online Learning Readiness?
Open Praxis · 2025-01-01
Article · Open access · Senior author
This study examines how mobile learning experiences influence college students’ readiness for online learning. The abrupt shift to online learning due to the COVID-19 pandemic posed challenges, including reduced student engagement and interaction. Previous research has indicated that the degree of students’ learning readiness is crucial to derive the greatest advantage from virtual learning. Also, a variety of Information & Communication Technology (ICT) such as computers, laptops, and smartphones, could play a key role in facilitating effective online learning. However, little research examines the effects of mobile learning on online learning readiness (OLR). The study explores how mobile learning affects key components of online learning readiness: computer & internet self-efficacy, online communication self-efficacy, motivation for learning, learner control, and self-directed learning. Two research questions guide the study: 1) To what extent do mobile learning experiences affect college students’ online learning readiness? and 2) In what ways do mobile learning experiences influence college students’ online learning readiness? Using a mixed-methods approach, data was collected from 73 survey participants and 11 interviewees. Quantitative analysis using a two-tailed t-test showed a positive impact of mobile learning on all factors of online learning readiness. Qualitative content analysis revealed key themes such as getting acquainted with learning technologies, access to related technologies/resources, awareness development, app features, motivation for searching information, learning environment, and limit on access.
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
ArXiv.org · 2025-05-20
Preprint · Open access
Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
ArXiv.org · 2025-12-14
Preprint · Open access
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
2025-01-01
Article · Open access
Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025.
Frontiers in Medicine · 2025-12-17 · 1 citation
Article · Open access
Background: Depression is a major global health challenge, and traditional cognitive behavioral therapy (CBT) is constrained by therapist shortages and economic barriers. CBT-based mobile applications offer a scalable and accessible alternative, yet a comprehensive overview of their global research landscape remains limited. Objective: To map global research trends, clinical progress, and emerging frontiers in CBT-based mobile applications for depression using bibliometric methods. Methods: Relevant studies were systematically retrieved from the Web of Science Core Collection, PubMed, and PsycINFO from inception to June 25, 2025. CiteSpace 6.4 R1, Microsoft Excel 2019, and Python were used for visualization and data analysis, including temporal publication trends, co-authorship, co-citation, keyword analyses, and citation burst detection. Results: The WoSCC analysis identified 350 articles published between 2013 and 2025, showing a marked growth trajectory with leading contributions from the United States and major academic centers. Dominant themes included smartphone interventions, blended treatments, CBT for insomnia, and adolescent depression, while citation bursts indicated recent shifts toward prevention, technology integration, and standardized outcome measures. The PubMed analysis included 72 clinical trial articles, highlighting randomized controlled trials as the predominant design and revealing growing interest in integrating interpersonal therapy and mindfulness within broader, interdisciplinary treatment frameworks. The PsycINFO analysis comprised 20 articles and provided a complementary behavioral science perspective, emphasizing mobile phone-delivered CBT for major depression, digital interventions targeting comorbid social anxiety, culturally adapted applications for Chinese cultural groups, and emerging work linking mobile health and virtual reality.
Conclusions: Research on CBT-based mobile applications for depression is rapidly advancing toward more personalized, adaptive, and preventive digital interventions grounded in robust clinical and behavioral evidence. Strengthening global, interdisciplinary collaboration and leveraging innovative technologies will be critical for translating these tools into effective, scalable services. Over the next 5-10 years, key research streams are likely to include the integration of Artificial Intelligence (AI) and large language models (LLMs) into mobile CBT platforms and the convergence of app-based interventions with sensor-based digital phenotyping, wearable devices, and immersive technologies to enhance real-time monitoring, user engagement, and long-term outcomes, with the potential to narrow treatment gaps across diverse populations. Systematic review registration: https://osf.io/, Identifier: https://doi.org/10.17605/OSF.IO/YCSR8.
BMC Medical Education · 2024-02-01 · 1 citation
Erratum · Open access · 1st author · Corresponding author
Frequent coauthors
- Sun Joo Yoo · 11 shared
- Wen-yeh Huang (National Taipei University) · 5 shared
- Tristan E. Johnson · 5 shared
- Eunjung Oh (University of Illinois Urbana-Champaign) · 5 shared
- Jung Sun Sung (University of Illinois Urbana-Champaign) · 5 shared
- Karen M. Tabb (University of Illinois Urbana-Champaign) · 4 shared
- Cameron Merrill · 4 shared
- Laura Shackelford (Centre National de la Recherche Scientifique) · 4 shared
Education
M.D.
Carle Illinois College of Medicine