
Achuta Kadambi
Professor · Verified · University of California, Los Angeles · Computer Science
Active 2013–2026
About
Achuta Kadambi is a leader of the Visual Machines Group and an Associate Professor at UCLA in the Department of Electrical Engineering and Computer Science. He holds a PhD from the Massachusetts Institute of Technology. His research centers on visual machines and computer vision. As a faculty member, he advises graduate and postdoctoral researchers and collaborates on research in visual perception technologies.
Research topics
- Computer Science
- Artificial Intelligence
- Computer vision
- Algorithm
- Psychology
- Data science
- Internet privacy
Selected publications
WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models
ArXiv.org · 2026-01-29
Article · Open access · Senior author
Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.
MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
arXiv (Cornell University) · 2026-03-20
Preprint · Open access · Senior author
Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
arXiv (Cornell University) · 2026-03-28
Preprint · Open access
Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
JMIR Medical Informatics · 2025-07-29
Article · Open access · Senior author
2025-05-29
Article · Senior author
As generative machine learning and deepfakes become increasingly important, reliable methods for protecting data provenance and authenticity are essential. Current approaches for verifying data provenance often rely on cryptographic measures. While cryptography can ensure the authenticity of data, it cannot guarantee the honesty/correctness of the data itself; for instance, if a sensor is spoofed, the generated data may be false even before the cryptographic process takes place. This paper introduces this new attack surface, the Physical Layer. We show a real example of how such an attack can be conducted. We then explore various solutions to address this concern, including leveraging hardware, sensing, and physics.
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
2025-06-10
Article · Senior author
Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g. SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
JMIR Medical Education · 2025-10-24
Article · Open access · Senior author
Background: Operating room–to–intensive care unit (OR-to-ICU) handoffs are among the most complex and high-risk communication events in perioperative care. Despite the implementation of structured checklists, trainees often receive limited feedback on their communication skills, and simulation-based education rarely provides objective data on communication performance and checklist adherence. This study explores how an ambient AI handoff assistant used during simulation-based training of OR-to-ICU handoff discussions can enhance clinical communication training and AI literacy by mapping spoken handoff discussions to handoff checklist items, enabling the development of a handoff note that functioned as a structured, feedback-rich learning artifact.
Objective: To co-design and evaluate an ambient AI handoff assistant that captures spoken OR-to-ICU handoff communication, maps it to handoff checklist items, and provides immediate feedback on handoff completeness during simulated OR-to-ICU transitions in an educational setting.
Methods: A two-phase mixed-methods study was conducted within the UCLA Department of Anesthesiology and Perioperative Care (July–October 2025). Phase 1 comprised co-design interviews with four clinician educators to identify limitations of current handoff training and inform AI feature development. Phase 2 involved an error analysis, as well as evaluations of usability, workload, and educational impact, conducted through ten 60-minute simulation sessions with pairs of medical students and first-year residents. Quantitative measures included Physician Task Load Index (PTL), System Usability Scale (SUS), and a post-simulation survey; qualitative data from co-design sessions and simulation debrief interviews were thematically analyzed.
Results: Educators highlighted inconsistent checklist use and the absence of objective feedback on learners’ communication skills as key areas that could benefit from structured documentation of handoff discussions using AI. Error analysis of the ambient AI handoff assistant revealed a mean of 3.6 errors per note, with incorrect output being the most frequent error type. There was no statistically significant difference between the ambient AI handoff assistant and the paper checklist with respect to PTL and SUS measures. Trainees valued real-time transcripts and structured handoff notes for reflection of communication practices, and exposure to AI documentation errors enhanced critical thinking and awareness of AI technology limitations.
Conclusions: The ambient AI handoff assistant mapped simulated handoff discussions to checklist items and generated a structured handoff note, facilitating reflection on team-based communication skills in handoff education. Imperfections in the AI’s output encouraged critical appraisal of its capabilities and prompted discussion about automation complacency, suggesting that AI-assisted simulations can foster both communication and digital literacy skills essential for future AI-enabled clinical practice.
Recent grants
CAREER: On the Fairness of Light Transport for Unbiased Low-level Vision
NSF · $524k · 2021–2027
CRII: RI: Computational Thermal Imaging
NSF · $191k · 2019–2022
Frequent coauthors
- Ramesh Raskar · 36 shared
- Yunhao Ba · 24 shared
- Pradyumna Chari · 18 shared
- Vage Taamazyan, Intrinsic LifeSciences (United States) · 15 shared
- Howard Zhang · 12 shared
- Ayush Bhandari · 11 shared
- Boxin Shi, Peking University · 9 shared
- Refael Whyte · 9 shared
Education
- 2018: Ph.D., Electrical Engineering and Computer Science, Massachusetts Institute of Technology (MIT)
Awards & honors
- NSF CAREER Award (2021)
- DARPA Young Faculty Award (YFA) (2021)
- Army Research Office Young Investigator Award (YIP) (2021)
- National Academy of Engineering (NAE) Frontiers in Engineeri…
- Senior Member, National Academy of Inventors (2020)