
Mubbasir Kapadia
Associate Professor · Rutgers University · Computer Science
Active 2008–2026
About
Professor Mubbasir Kapadia is the Lab Director and an Associate Professor in the Computer Science department at Rutgers University. He leads the Intelligent Visual Interfaces Lab, which focuses on research areas including Autonomous Virtual Humans, Human Crowd Modeling and Analysis, Crowd-Aware Computer-Aided Environment Design, and Digital Storytelling. His work involves computational models related to crowd behavior and visual impairment for wayfinding, as well as machine learning applications in human-building interaction and crowd dynamics. Professor Kapadia's research integrates computer vision, graph representation learning, and computational simulation to better understand and predict human crowd behavior and interactions within built environments.
Research topics
- Computer Science
- Machine Learning
- Artificial Intelligence
- Data Mining
- Human–computer interaction
- Mathematics
- Theoretical computer science
- Geography
- Computer graphics (images)
- Mathematical optimization
Selected publications
Large Sign Language Models: Toward 3D American Sign Language Translation
2026-03-06
Article · Open access · Senior author
We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals’ virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
ArXiv.org · 2026-05-14
Article · Open access
Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
Early-stage architecture design assistance by LLMs and knowledge graphs
Automation in Construction · 2026-01-10 · 1 citation
Article · Open access · Senior author
Early-stage architectural design relies heavily on precedent cases and domain knowledge, yet existing assistance methods struggle with the dominance of visual data and the linguistic diversity of design descriptions. In this paper, a retrieval-augmented generation framework with a custom knowledge graph tailored to architecture is proposed. The approach features: (1) a lightweight graph structure representing design logic; (2) a knowledge extraction pipeline for visual and textual data; and (3) aggregation and question answering methods that consolidate precedent knowledge for design support. Experiments show improved retrieval accuracy, more comprehensive precedent recommendations, and enhanced user experience, advancing precedent-based assistance for early design.
- Custom knowledge graph structure for early-stage architecture design.
- Extracted design logic knowledge from multi-modal data in design cases.
- Aggregate and apply design logic knowledge for referenced question answering.
Behavioral effects of anti-pandemic preventive measures: a VR study
2025-10-24
Article · Open access
During the COVID-19 pandemic, a variety of public health measures aimed at decreasing transmission between people sharing public spaces, including social distancing and the use of face masks. However, the behavioral consequences of these policies on human navigation are mostly unknown. Using virtual reality, we investigated how some common preventive measures influenced how people navigate near other people. Participants (N=29) traversed virtual rooms containing virtual people (“agents”) while we manipulated their mask status, the mask status of the agents, and the perceived environmental safety (high/low community vaccination rates). We found that initial path selection was primarily determined by agent mask-wearing, with participants more likely to pass in front of masked than unmasked agents. The perception of environmental safety predominantly governed subsequent interpersonal distance, with closer proximity maintained in high-vaccination contexts. The participants’ own mask status was the least influential, suggesting that navigational choices were driven by external risk assessment rather than personal protection. We observed clear risk compensation: subjects generally reduced interpersonal distance in safer conditions (whether environmental, agent masking, or personal masking). Bayesian model comparisons favored a three-factor additive model, indicating independent rather than interactive influences on behavior. These findings reveal the interplay of decisions underlying pandemic-era social navigation, and highlight the importance of accounting for risk compensation in public health interventions.
Social and spatial predictors of collective search behaviors
Scientific Reports · 2025-05-30
Article · Open access
Understanding crowd behavior is critical for designing buildings and public spaces with efficient circulation. However, the interplay of social and spatial contexts makes this endeavor challenging. This paper examines scenarios in which crowds perform a search task with time constraints, akin to individuals shopping or officers searching a crime area. We formulate and test two sets of hypotheses defined at the crowd and individual levels using desktop VR experiments. We conducted four experimental sessions that employed different social incentives (collaborative versus competitive) with a total of 140 participants, using a mixed factorial design where each individual participated in 12 trials. We found that competitive incentives produced higher levels of crowd aggregation than collaborative incentives. In addition, individuals were more likely to be influenced by others' behaviors in the collaborative compared to the competitive condition. Notably, these social signals were conveyed among participants without any verbal communication. We also developed a novel graph theoretic measure, "search attractiveness," that accurately predicts space occupation during a search task. This paper highlights the roles of social and spatial contexts in understanding occupation and aggregation.
On the Separability of Human Navigational Behaviors in Virtual Reality
Underline Science Inc. · 2025-06-18
Other · Open access
Human navigation is shaped by cognitive strategies, spatial awareness, and learned heuristics, yet existing models struggle to capture individual differences in wayfinding. To investigate the cognitive basis of navigational behavior, we conducted a virtual reality experiment where participants maneuvered around a human obstacle in a controlled, static environment. Using trajectory-based features, we classified participants with PartNet, a neural network that outperformed ElasticNet and Random Forest classifiers. While PartNet captured subtle yet consistent behavioral patterns, its interpretability was limited. To address this, we developed an analysis pipeline revealing key behavioral factors, showing that navigational styles differ primarily in midline adherence and speed. Clustering and embedding analyses further demonstrated participant separability, highlighting both individual distinctions and shared tendencies. By identifying structured variability in navigation, our work advances cognitive models of spatial decision-making, informing theories of wayfinding, predictive modeling of human movement, and applications in assistive navigation and urban design.
CASIM: Composite Aware Semantic Injection for Text to Motion Generation
ArXiv.org · 2025-02-04
Preprint · Open access · Senior author
Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, primarily relying on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
Communication in the surgical subspecialties: current state and opportunities for improvement
Patient Education and Counseling · 2025-07-15
Article · Senior author
Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
2025-01-01
Article · Open access · Senior author
The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game variations, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated heuristic functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers. Code repo:
Frequent coauthors
- Petros Faloutsos, York University · 98 shared
- Markus Groß, Walt Disney (Switzerland) · 45 shared
- Samuel S. Sohn · 40 shared
- Norman I. Badler · 34 shared
- Tyler Thrash · 33 shared
- Vladimir Pavlović, Rutgers Sexual and Reproductive Health and Rights · 31 shared
- Brandon Haworth, University of Victoria · 30 shared
- Muhammad Usman · 29 shared
Labs
Intelligent Visual Interfaces Lab · PI
Education
Ph.D., Computer Science
Rutgers, The State University of New Jersey
Awards & honors
- NSF SA&S grant
- DARPA SocialSim grant
- NSF CHS grant