
Jiajun Wu
Computer Vision, Machine Learning & Cognitive Science · Verified
Stanford University · Symbolic Systems
Active 1999–2026
About
Jiajun Wu is an Assistant Professor of Computer Science and, by courtesy, of Psychology at Stanford University. Prior to joining Stanford, he was a Visiting Faculty Researcher at Google Research, where he worked with Noah Snavely. He completed his PhD at MIT under the supervision of Bill Freeman and Josh Tenenbaum, and earned his undergraduate degrees at Tsinghua University, working with Zhuowen Tu. His academic background combines expertise in computer science, artificial intelligence, and cognitive science, reflecting a multidisciplinary approach to research and teaching.

Professor Wu's research group focuses on physical scene understanding, aiming to build machines that can see, reason about, and interact with the physical world. His work investigates the levels of abstraction necessary for AI systems in their representations and explores the origins of these abstractions. Drawing inspiration from the physical world and human cognition, his research addresses fundamental questions about how machines can develop meaningful representations of their environments. His projects span a variety of domains including multi-modal perception from visual, acoustic, and tactile signals; visual generation of the 4D physical world; visual reasoning through physical concept grounding often using neuro-symbolic methods; and robotics and embodied AI leveraging learned physical scene representations.

Through his research, Professor Wu has contributed to advancing the understanding of how intelligent agents can interpret and interact with complex physical environments. His work integrates learning algorithms with cognitive and symbolic reasoning, pushing the boundaries of AI systems' capabilities in perception, reasoning, and action. He is also actively involved in teaching courses that bridge computer science, psychology, and artificial intelligence, reflecting his commitment to interdisciplinary education and research.
Research topics
- Computer Science
- Artificial Intelligence
- Computer vision
- Human–computer interaction
- Mathematics
- Algorithm
- Applied mathematics
- Programming language
- Cartography
- Computer graphics (images)
- Geography
- Mathematical optimization
- Theoretical computer science
Selected publications
Biochar reshapes phosphorus distribution in soil aggregates and improves rice phosphorus uptake
Soil Ecology Letters · 2026-04-11
article · 1st author
IEEE Transactions on Automation Science and Engineering · 2025-12-08
article · Senior author
The jumping motion of wheel-legged robots (WLR) is of great significance to their obstacle-crossing ability. In existing studies, a Vertical Jumping (VJ) scheme that mimics human-like jumping has been realized in WLRs. However, the capacity limit of the actuator severely restricts the height of the wheel off the ground that VJ can reach, that is, the effective jumping height which directly affects the ability to cross obstacles. To enhance the effective jumping height within the actuator’s capacity, this paper proposes the Aerial Leg-Swing Jumping (ALSJ) scheme. By analyzing the take-off and flight phases from an energy perspective, the ALSJ scheme is designed to reduce peak torque and energy consumption. The implementation framework of the proposed scheme includes a vertical reachability map, a phased optimization planning method, and an offset-free whole-body control strategy. Simulation results show that, compared to the VJ scheme, the proposed scheme increases the maximum achievable effective jumping height by about 31.03% within actuator constraints. Additionally, these two schemes are compared in hardware experiments under a test condition with a desired jumping height of 0.15 m, and the flight time constraint of 0.32 s is introduced in ALSJ scheme to enhance the practical significance of the jump. Using the ALSJ scheme, the energy consumption during take-off phase is reduced by 16.23%, the total energy consumption throughout the entire jumping process is decreased by approximately 9.78%, and the peak knee joint torque is decreased by 32.57 Nm, further validating the effectiveness of the proposed scheme.
Research Square · 2025-09-15
preprint · Open access
3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds
ArXiv.org · 2025-07-09
preprint · Open access
Despite large-scale pretraining endowing models with language and vision reasoning capabilities, improving their spatial reasoning capability remains challenging due to the lack of data grounded in the 3D world. While it is possible for humans to manually create immersive and interactive worlds through 3D graphics, as seen in applications such as VR, gaming, and robotics, this process remains highly labor-intensive. In this paper, we propose a scalable method for generating high-quality 3D environments that can serve as training data for foundation models. We recast 3D environment building as a sequential decision-making problem, employing Vision-Language-Models (VLMs) as policies that output actions to jointly craft a 3D environment's layout, materials, lighting, and assets. Our proposed framework, 3D-Generalist, trains VLMs to generate more prompt-aligned 3D environments via self-improvement fine-tuning. We demonstrate the effectiveness of 3D-Generalist and the proposed training strategy in generating simulation-ready 3D environments. Furthermore, we demonstrate its quality and scalability in synthetic data generation by pretraining a vision foundation model on the generated data. After fine-tuning the pre-trained model on downstream tasks, we show that it surpasses models pre-trained on meticulously human-crafted synthetic data and approaches results achieved with real data orders of magnitude larger.
Meta‐Dispersive 3D Chromatic Confocal Measurement
Advanced Science · 2025-07-21 · 5 citations
article · Open access · 1st author
3D reconstruction can perceive the detailed structures of real-world objects. Integrating metasurfaces with stereo vision or structured-light projection enables compact and efficient 3D reconstruction systems, beneficial for next-generation sensing, augmented reality, and biomedical applications. Nevertheless, the limitations inherent in these visual measurement methods pose a significant challenge to achieving higher resolution. Here, a dispersive metalens (DML) combined with the chromatic confocal method is proposed to achieve high-precision 3D measurement. With appropriate engineered dispersion, the linearity of the dispersion is well maintained, alongside diffraction-limited focusing performance. As a proof-of-concept, experiments on both longitudinal and transversal measurements are conducted. Following the calibration of the DML, axial accuracy of ±4 µm and a subwavelength axial resolution of 0.325 µm is achieved. The successful reconstruction of a fabricated 3D USAF-1951 resolution test chart further corroborates the 3D measurement capability, demonstrating a lateral resolution exceeding 4.38 µm. It is envisioned that the proposed method will pave the way for future applications in areas such as microstructure characterization, industrial inspection, and on-chip 3D optical metrology.
ART: Articulated Reconstruction Transformer
ArXiv.org · 2025-12-16
preprint · Open access
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
Multimedia Systems · 2025-10-31
article · 1st author · Corresponding
Web-Scale Collection of Video Data for 4D Animal Reconstruction
ArXiv.org · 2025-11-03
preprint · Open access
Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited--offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower--revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.
Rotate Both Ways: Time-and-Order RoPE for Generative Recommendation
ArXiv.org · 2025-10-23
preprint · Open access
Generative recommenders, typically transformer-based autoregressive models, predict the next item or action from a user's interaction history. Their effectiveness depends on how the model represents where an interaction event occurs in the sequence (discrete index) and when it occurred in wall-clock time. Prevailing approaches inject time via learned embeddings or relative attention biases. In this paper, we argue that RoPE-based approaches, if designed properly, can be a stronger alternative for jointly modeling temporal and sequential information in user behavior sequences. While vanilla RoPE in LLMs considers only token order, generative recommendation requires incorporating both event time and token index. To address this, we propose Time-and-Order RoPE (TO-RoPE), a family of rotary position embedding designs that treat index and time as angle sources shaping the query-key geometry directly. We present three instantiations: early fusion, split-by-dim, and split-by-head. Extensive experiments on both publicly available datasets and a proprietary industrial dataset show that TO-RoPE variants consistently improve accuracy over existing methods for encoding time and index. These results position rotary embeddings as a simple, principled, and deployment-friendly foundation for generative recommendation.
Riemann-Silberstein geometric phase in 4D polarization space
ArXiv.org · 2025-10-10
preprint · Open access
Geometric phase is a far-reaching concept in quantum and classical physics. The first discovered geometric phase, the Pancharatnam-Berry (PB) phase, has profoundly shaped nanophotonics through metasurfaces. However, the PB phase arises from SU(2) polarization evolution and is constrained to a 2D polarization space, failing to capture the full polarization degrees of freedom. We generalize geometric phase to the 4D Riemann-Silberstein (RS) space that simultaneously describes electric, magnetic, and hybrid electric-magnetic polarizations. We show that SU(4) polarization evolution can generate a new geometric phase, the RS phase, alongside the PB phase. Unlike the PB phase that typically manifests in circularly polarized light, the RS phase can emerge in arbitrarily polarized light. Together, they enable a high-dimensional geometric framework for light propagation across general interfaces. We reveal that the phase shifts governed by Fresnel equations are direct manifestations of the RS-space geometric phases, integrating a century-old wave theory into this paradigm. We experimentally validate the framework using metasurfaces and achieve high-dimensional wavefront manipulation. Our work offers fundamental insights into the geometric nature of light-matter interactions, with implications for topological and non-Abelian physics in classical wave systems.
Recent grants
NSF · $400k · 2022–2027
CCRI: ENS: Activity-Centric Interactive Environments for Embodied AI
NSF · $1.8M · 2021–2025
Frequent coauthors
- Joshua B. Tenenbaum · Massachusetts Institute of Technology · 92 shared
- William T. Freeman · 50 shared
- Jiayuan Mao · 27 shared
- Hong-Xing Yu · Development Research Center · 26 shared
- Yunzhu Li · 25 shared
- Li Fei-Fei · 25 shared
- Antonio Torralba · 23 shared
- Zhoutong Zhang · Labs · 17 shared
Education
Ph.D., Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Awards & honors
- Faculty Fellowship, Microsoft Research (2026)
- Academic Grant, Nvidia (2025)
- Research Scholar Award, Google (2025)
- Best Paper Award Finalist, ICRA, IEEE (2025)
- Research Grant, Okawa Foundation (2024)