RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation

IROS 2026
Chuanrui Zhang¹, Zhengxian Wu², Guanxing Lu², Yansong Tang², Ziwei Wang¹
¹Nanyang Technological University, ²Tsinghua University

TL;DR

We present RoDyn, a Robot-Dynamic 2.5D world model that combines the inference efficiency of 2D video models with the spatial awareness of 3D models for robotic manipulation.

Abstract

Learned world models hold significant potential as neural simulators for robotic manipulation. However, prevalent 2D video-based models inherently lack the spatial and kinematic reasoning crucial for physical interactions. We introduce RoDyn, a novel Robot-Dynamic 2.5D World Model that formulates environmental dynamics within a highly efficient, geometry-aware latent space. Through the proposed Robot-Dynamic Tokenizer, we explicitly couple semantic visual appearances with spatial and agent-centric priors via an RGB-dominated cross-attention mechanism and dynamic mask guidance. Furthermore, by injecting these mask priors directly into sequence transitions, our Mask-guided Autoregressive architecture drives the model to focus on active robot-object interaction regions. Extensive experiments demonstrate that RoDyn achieves state-of-the-art generation fidelity across large-scale datasets. Crucially, it translates these predictive capabilities into substantial downstream gains, accelerating model-based reinforcement learning and achieving a 42% improvement in real-world imitation learning success rates over pure 2D baselines.

Architecture

Overview of RoDyn. Multi-modal 2.5D inputs are encoded into physics-aware discrete tokens via the Robot-Dynamic Tokenizer, which features RGB-dominated cross-attention and mask-driven fusion. To enforce causal physical dynamics during sequence transitions, both robotic action trajectories and extracted kinematic masks are explicitly injected into the Embodiment Transition Tokens. The Mask-guided Autoregressive Transformer then sequentially predicts future tokens, strictly grounding the generation on prior context and active robot-object interactions.
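
To make the tokenizer's fusion step more concrete, here is a minimal NumPy sketch of one plausible reading of "RGB-dominated cross-attention and mask-driven fusion". It is not the authors' implementation. It assumes that RGB patch tokens act as queries over depth tokens (keys and values), and that the attended geometry is added back to the RGB stream as a residual gated by a per-patch robot mask. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rgb_dominated_fusion(rgb, depth, mask, w_q, w_k, w_v):
    """Hypothetical fusion step (assumed design, not taken from the paper).

    rgb, depth: (N, D) patch tokens; mask: (N,) values in [0, 1].
    """
    q = rgb @ w_q      # queries come from RGB, so RGB "dominates" (assumption)
    k = depth @ w_k    # keys come from depth
    v = depth @ w_v    # values come from depth
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, N)
    geometry = attn @ v                                       # (N, D)
    # Mask-driven fusion (assumption): geometry is injected only where the
    # mask is active; patches with mask 0 keep their RGB features unchanged.
    fused = rgb + mask[:, None] * geometry
    return fused, attn

# Toy usage with random data.
rng = np.random.default_rng(0)
N, D = 16, 8
rgb = rng.normal(size=(N, D))
depth = rng.normal(size=(N, D))
mask = (rng.random(N) > 0.5).astype(float)
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
fused, attn = rgb_dominated_fusion(rgb, depth, mask, w_q, w_k, w_v)
```

In this sketch the RGB stream always passes through unchanged, and depth only contributes a masked residual. That is one simple way to keep appearance dominant while still attending to geometry in interaction regions.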

Comparisons with the State-of-the-art

We present qualitative comparisons with the following state-of-the-art models:

[Figure: SOTA comparisons]