Martial Hebert

· Dean and University Professor of RoboticsVerified

Carnegie Mellon University · Computer Science

Active 1975–2026

h-index102

Citations44.5k

Papers62757 last 5y

Funding$2.5M

Faculty page Lab page

See your match with Martial Hebert — sign in to PhdFit.Sign in

About

Martial Hebert is not mentioned in the provided page text, and there is no information about his research focus, background, or key contributions in the content given.

Research topics

Computer Science
Artificial Intelligence
Mathematical optimization
Computer vision

Selected publications

Sur la complémentarité conique des contacts planaires
HAL (Le Centre pour la Communication Scientifique Directe) · 2026-06-01
articleOpen access
International audience
Publisher OA PDF
Walk through Paintings: Egocentric World Models from Internet Priors
ArXiv.org · 2026-01-21
articleOpen accessSenior author
What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
Publisher OA PDF
Walk through Paintings: Egocentric World Models from Internet Priors
arXiv (Cornell University) · 2026-01-21
preprintOpen accessSenior author
What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
Publisher DOI
Geodesic Turnpikes for Robot Motion Planning
Springer proceedings in advanced robotics · 2026-01-01
articleOpen access
Publisher OA PDF DOI
ReferevErything: Towards Segmenting Everything we can Speak of in Videos
2025-10-19
preprintOpen accessSenior author
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method leverages the universal visual-language mapping learned by video diffusion models on Internet-scale data by fine-tuning them on small-scale Referring Object Segmentation datasets. Our key insight is to preserve the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment rare and unseen objects, despite only being trained on a limited set of categories. Additionally, it can effortlessly generalize to non-object dynamic concepts, such as smoke or raindrops, as demonstrated in our new benchmark for Referring Video Process Segmentation (Ref-VPS). REM performs on par with the state-of-the-art on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 12 IoU points out-of-domain, leveraging the power of generative pre-training. We also show that advancements in video generation directly improve segmentation.
Publisher OA PDF DOI
On the Conic Complementarity of Planar Contacts
ArXiv.org · 2025-09-30
preprintOpen access
We present a unifying theoretical result that connects two foundational principles in robotics: the Signorini law for point contacts, which underpins many simulation methods for preventing object interpenetration, and the center of pressure (also known as the zero-moment point), a key concept used in, for instance, optimization-based locomotion control. Our contribution is the planar Signorini condition, a conic complementarity formulation that models general planar contacts between rigid bodies. We prove that this formulation is equivalent to enforcing the punctual Signorini law across an entire contact surface, thereby bridging the gap between discrete and continuous contact models. A geometric interpretation reveals that the framework naturally captures three physical regimes -sticking, separating, and tilting-within a unified complementarity structure. This leads to a principled extension of the classical center of pressure, which we refer to as the extended center of pressure. By establishing this connection, our work provides a mathematically consistent and computationally tractable foundation for handling planar contacts, with implications for both the accurate simulation of contact dynamics and the design of advanced control and optimization algorithms in locomotion and manipulation.
Publisher OA PDF DOI
Dual Perspectives on Non-Contrastive Self-Supervised Learning
HAL (Le Centre pour la Communication Scientifique Directe) · 2025-01-01
preprintOpen access
The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
Publisher OA PDF
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
arXiv (Cornell University) · 2024-09-05
preprintOpen access
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D
Publisher OA PDF DOI
Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models
2024-07-12 · 3 citations
articleOpen accessSenior author
Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. In this work, we first show the fundamental reasons for such misalignment by identifying issues related to low attention activation and mask overlaps. Then we propose a compositional finetuning framework with two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on critical parameters, is able to directly perform inference given an arbitrary multi-object prompt, which enhances the scalability and generalizability. Through comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, we show that training our model with a diverse range of concepts enables it to generalize effectively to novel concepts, exhibiting enhanced performance compared to models trained on individual concept pairs.
Publisher DOI
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
2024-01-01 · 1 citations
article
Publisher DOI

Recent grants

RI: Detecting Boundaries for Segmentation and Recognition
NSF · $324k · 2007–2011
NRI-Large: Collaborative Research: Purposeful Prediction: Co-robot Interaction via Understanding Intent and Goals
NSF · $2.2M · 2012–2017

Frequent coauthors

Jean Ponce
Département d'Informatique
98 shared
J. Andrew Bagnell
66 shared
Gerhard Goos
RWTH Aachen University
49 shared
Jan Van Leeuwen
Netherlands Institute for Radio Astronomy
49 shared
Andrew Zisserman
49 shared
Yu-Xiong Wang
36 shared
Takeo Kanade
33 shared
Katsushi Ikeuchi
30 shared

Labs

Human-Computer Interaction Institute (HCII)PI

Resume-aware match score
Save to shortlist
AI-drafted outreach

See your match with Martial Hebert

PhdFit ranks faculty by your research interests, methods, and publications — grounded in their actual work, not templates.

Join the waitlist How it works

Free to start
No credit card
30-second signup

Find professors who actually fit you