
Mohit Bansal
Natural Language Processing and Multimodal AI · University of North Carolina at Chapel Hill · Computer Science
Active 2005–2026
About
Mohit Bansal is the Parker Distinguished Professor and Director of the MURGe-Lab within the UNC-AI Group at the University of North Carolina at Chapel Hill. His research covers multimodal learning, language and vision, and scalable AI agents, including multimodal foundation models, reasoning, and generative models, with a particular emphasis on trustworthy, responsible, and efficient AI systems. His honors include the Presidential Early Career Award for Scientists and Engineers (PECASE), AAAI Fellow, ACL Fellow, and ACM Distinguished Member. He has served as an associate editor-in-chief for the IEEE TPAMI journal and has been invited to deliver keynote and distinguished lectures at major conferences and institutions worldwide. He publishes in venues such as NeurIPS, ICML, CVPR, ACL, and EMNLP. His work combines neural networks, reasoning, and multimodal data to build models that understand, generate, and reason across diverse sources, with a view to real-world applications.
Research topics
- Computer Science
- Artificial Intelligence
- Natural Language Processing
- Machine Learning
- Applied mathematics
- Econometrics
- Mathematical economics
- Programming language
- Algorithm
- History
- Human–computer interaction
- Physics
- Data science
- Thermodynamics
- Mathematics
- Psychology
Selected publications
TimeRefine: Temporal Grounding with Time Refining Video LLM
2026-03-06
preprint · Open access
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate temporal grounding as a temporal refinement task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively improves its own temporal localization accuracy. Second, to enhance the model’s temporal perception capabilities, we incorporate an auxiliary prediction head that applies a larger penalty as a predicted segment deviates further from the ground truth, encouraging more precise temporal localizations. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models are available at https://github.com/SJTUwxz/TimeRefine_code.
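As a rough illustration of the refinement idea and the mIoU metric mentioned in the TimeRefine abstract, the following is a minimal sketch, not the authors' implementation. The function names `temporal_iou` and `refine`, and the hand-picked offsets, are hypothetical; in the actual method the offsets are predicted by the model.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def refine(segment, offsets):
    """Apply successive (d_start, d_end) offsets to a rough segment.

    Returns every intermediate segment, starting with the rough one.
    The offsets here are illustrative inputs, not model predictions.
    """
    start, end = segment
    history = [(start, end)]
    for d_start, d_end in offsets:
        start, end = start + d_start, end + d_end
        history.append((start, end))
    return history


# Hypothetical example: a rough guess moved toward the target in two steps.
ground_truth = (10, 20)
history = refine((6, 18), [(3, 1), (1, 1)])
ious = [temporal_iou(seg, ground_truth) for seg in history]
```

In this toy run each refinement step raises the IoU with the target segment; mIoU is the mean of such IoU values over a dataset.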
Accuracy and perforation rate of free-hand pedicle screw insertion in thoracic spine
Egyptian Journal of Neurosurgery · 2025-08-02
article · Open access · 1st author · Corresponding
Background: To assess the accuracy and perforation rate of free-hand pedicle screw insertion in the thoracic spine, patients aged 15–70 years with dorsal vertebrae pathology, with or without neurological deficit, undergoing dorsal spine pedicle screw fixation using the free-hand technique were included. Revision surgery and patients who needed deformity correction surgery were excluded. The accuracy of pedicle screw placement was assessed by Gertzbein and Robbins classification scores on computed tomography scans. Microsoft Excel and statistical software were used for data cleaning and statistical analysis. Categorical and continuous variables were reported as proportions and mean ± standard deviation. A paired-sample t-test was used to determine whether there was a mean difference between the two sets of observations. Statistical significance was determined at the 5% level. Results: Seventy (36.26%) pedicle screws were inserted for infective pathology, and 91.7% of pedicle screws showed no breach. Thoracic vertebrae 3–7 demonstrated the highest breach rate (11/16, 68.75%). The direction of the breach was lateral in ten screws (62.5%) and medial in six screws (37.5%); no screw breached inferiorly or superiorly. The breach was not statistically significant. Conclusions: The free-hand technique has excellent accuracy and an insignificant perforation rate in instrumentation of the thoracic spine.
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
2025-06-10 · 22 citations
article · Senior author
Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. To tackle these challenges, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which is often overlooked by existing LLM-based methods. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner. This enables the model to effectively handle a wide range of video queries with varying levels of detail. Finally, VideoTree aggregates the hierarchical query-relevant information within the tree structure and feeds it into an LLM reasoning model to answer the query. Our experiments show that our method improves both reasoning accuracy and efficiency. Specifically, VideoTree outperforms existing training-free approaches on EgoSchema and NExT-QA with less inference time, achieving 61.1% and 75.6% accuracy on the test set without additional video-specific training. Moreover, on the long split of Video-MME (average 44 minutes), VideoTree achieves better performance than GPT-4V and many other MLLMs that were extensively trained on video data.
SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
ArXiv.org · 2025-10-09
preprint · Open access
Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.
PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise
ArXiv.org · 2025-11-03
preprint · Open access
Natural Language Inference (NLI) models have been used in various ways to improve the factuality of LLM outputs. This is typically done by applying an NLI model to judge whether the model output is entailed by the supposed evidence, triggering some corrective action, such as beam reranking at inference time or RL rewards during training. While NLI models are trained to detect factual inconsistencies over complete sentences, decisions in the common autoregressive generation architecture are made for each evolving text prefix during decoding. Addressing this setting, we generalize the entailment detection task to apply over arbitrary text prefixes, and suggest its utility for improving generation faithfulness. Providing suitable evaluation and training datasets for this task, we train MiniTruePrefixes, a novel specialized model that better detects factual inconsistencies over text prefixes, outperforming comparable baseline NLI models by 5-14 F1 points in prefix-level entailment. We further demonstrate that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization. When guided by MiniTruePrefixes, LLaMA-3.2-3B-Instruct matches the faithfulness and runtime of the 8B model from the same model family, while using only half the memory.
Hierarchy-Aware Multimodal Unlearning for Medical AI
ArXiv.org · 2025-12-10
preprint · Open access · Senior author
Pretrained Multimodal Large Language Models (MLLMs) are increasingly used in sensitive domains such as medical AI, where privacy regulations like HIPAA and GDPR require specific removal of individuals' or institutions' data. This motivates machine unlearning, which aims to remove the influence of target data from a trained model. However, existing unlearning benchmarks fail to reflect the hierarchical and multimodal structure of real-world medical data, limiting their ability to properly evaluate unlearning in practice. Therefore, we introduce MedForget, a hierarchy-aware multimodal unlearning benchmark that models hospital data as a nested structure, enabling fine-grained evaluation of multimodal unlearning across retain and forget splits. Experiments with current unlearning methods show that existing approaches struggle to achieve effective hierarchy-aware forgetting without degrading downstream medical utility. To address this limitation, we propose Cross-modal Hierarchy-Informed Projection for unlearning (CHIP), a training-free, hierarchy-aware multimodal unlearning method that deletes information by selectively removing target-specific weight subspaces while preserving sibling-shared information. Experiments show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods. Overall, MedForget provides a practical, HIPAA-aligned benchmark for evaluating structured multimodal unlearning for medical data, and CHIP offers an effective and general solution for hierarchy-aware forgetting that balances deletion with utility.
CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
ArXiv.org · 2025-06-06
preprint · Open access · Senior author
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR's downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
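For readers unfamiliar with the nDCG@10 metric reported in the CLaMR abstract, here is a minimal sketch of one common formulation (linear gain, log2 rank discount). It is not taken from the paper: the helper names are hypothetical, and the ideal ranking is computed from the retrieved list only, which assumes every relevant item appears somewhere in that list.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top item is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the best possible ordering.

    Assumption: `relevances` holds the graded relevance of each retrieved
    item in ranked order, and contains all relevant items.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical rankings: the relevant item ranked first vs. ranked second.
perfect = ndcg_at_k([1, 0, 0])
second = ndcg_at_k([0, 1, 0])
```

The score lies in [0, 1]; papers often report it multiplied by 100, which may be how the point differences in the abstract should be read.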
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
2025-10-19
preprint · Open access
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework: we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously that outperforms or achieves highly comparable performance to task-specific agents.
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
ArXiv.org · 2025-06-02
preprint · Open access
Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism (adjusting predictions from conservative, neutral, and aggressive viewpoints) but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications (video understanding, video reasoning enhancement, and vision-language-action model alignment) demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.
ArXiv.org · 2025-07-09
preprint · Open access · Senior author
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about the data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
Frequent coauthors
- Ramakanth Pasunuru · 122 shared
- Shiyue Zhang · Chongqing University of Technology · 72 shared
- Hao Tan · 70 shared
- Swarnadeep Saha · 53 shared
- Peter Hase · 52 shared
- Jaemin Cho · University of North Carolina at Chapel Hill · 51 shared
- Yichen Jiang · Nantong University · 50 shared
- Yixin Nie · Tsinghua University · 49 shared
Awards & honors
- AAAI Fellow
- ACL Fellow
- Presidential Early Career Award for Scientists and Engineers…