DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

* equal contribution † corresponding authors
1Shanghai Jiao Tong University 2Eastern Institute of Technology 3Tsinghua University
4Galbot 5Peking University 6UIUC 7University of Science and Technology of China

Abstract

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are largely limited to image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and a 4.44 average length on the CALVIN ABC-D benchmark.
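To make the last design point concrete, below is a minimal PyTorch sketch of a diffusion-based transformer action head that denoises an n-step action chunk conditioned on a latent action embedding. The module structure, dimensions, noise schedule, and names are illustrative assumptions for exposition, not the released DreamVLA implementation.

# Hedged sketch of a diffusion-based transformer action head.
# Dimensions, noise schedule, and module names are assumptions, not DreamVLA's released code.
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    """Denoises a noisy action chunk of shape (B, horizon, action_dim), conditioned on a latent."""
    def __init__(self, action_dim=7, latent_dim=512, num_timesteps=100):
        super().__init__()
        self.num_timesteps = num_timesteps
        betas = torch.linspace(1e-4, 2e-2, num_timesteps)          # linear noise schedule (assumed)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))
        self.time_embed = nn.Embedding(num_timesteps, latent_dim)
        self.action_proj = nn.Linear(action_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(latent_dim, action_dim)

    def forward(self, noisy_actions, t, latent):
        # noisy_actions: (B, horizon, action_dim); t: (B,); latent: (B, latent_dim)
        tokens = self.action_proj(noisy_actions) + self.time_embed(t)[:, None, :]
        tokens = torch.cat([latent[:, None, :], tokens], dim=1)    # prepend the conditioning token
        return self.out(self.transformer(tokens)[:, 1:])           # per-step noise prediction

def diffusion_loss(head, actions, latent):
    """Standard DDPM-style objective: predict the Gaussian noise added to the action chunk."""
    B = actions.shape[0]
    t = torch.randint(0, head.num_timesteps, (B,), device=actions.device)
    noise = torch.randn_like(actions)
    a_bar = head.alphas_cumprod[t].view(B, 1, 1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(head(noisy, t, latent), noise)

At inference, such a head would iteratively denoise a Gaussian sample into an action sequence, conditioned on the latent action embedding produced by the model.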

Comparison with Previous VLA Paradigms


(a) Vanilla VLA directly maps visual observations and language instructions to actions. (b) Models that leverage separate image/video generation or copilot models to produce future frames or trajectories, which then guide an action head. (c) VLA variants that explicitly predict a subgoal image as an intermediate visual reasoning step before action generation. (d) Our proposed DreamVLA, which explicitly predicts dynamic regions, depth maps, and semantic (DINOv2 and SAM) features, significantly enhancing the model's action reasoning and generalization.

Pipeline


Given the current robot state, observation, and language instruction, DreamVLA encodes multimodal inputs via frozen text and visual encoders and a tunable state encoder. These tokens, together with a learnable set of <dream> queries, are processed by a large language model to produce a world embedding. Three lightweight decoders then project the corresponding parts of this embedding into dynamic regions, monocular depth, and high-level semantics. A separate <action> query extracts a latent action embedding, which conditions a diffusion transformer that refines Gaussian noise into an n-step action sequence. The dashed box highlights prediction heads that are used only during training; inference skips these heads and operates directly on the world embedding.
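A minimal sketch of this forward pass is given below, under assumed token shapes and module names: the LLM backbone is stood in by a generic transformer, and the query counts, hidden sizes, and decoder heads are placeholders rather than the released code.

# Illustrative sketch of the forward pass described above.
# Module names, dimensions, and query counts are assumptions for exposition.
import torch
import torch.nn as nn

class DreamVLASketch(nn.Module):
    def __init__(self, d=1024, n_dream_each=9):
        super().__init__()
        self.n_dream_each = n_dream_each
        self.dream_queries = nn.Parameter(torch.randn(3 * n_dream_each, d))  # dynamic / depth / semantic sub-queries
        self.action_query = nn.Parameter(torch.randn(1, d))                  # <action> query
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)          # stand-in for the LLM
        self.dyn_head = nn.Linear(d, 256)      # dynamic-region decoder   (training only)
        self.depth_head = nn.Linear(d, 256)    # monocular-depth decoder  (training only)
        self.sem_head = nn.Linear(d, 256)      # semantic decoder         (training only)

    def forward(self, vis_tokens, lang_tokens, state_token):
        # vis_tokens: (B, Nv, d) from frozen visual encoders; lang_tokens: (B, Nl, d) from a
        # frozen text encoder; state_token: (B, 1, d) from the tunable state encoder
        B = vis_tokens.shape[0]
        dream = self.dream_queries.expand(B, -1, -1)
        act = self.action_query.expand(B, -1, -1)
        seq = torch.cat([vis_tokens, lang_tokens, state_token, dream, act], dim=1)
        out = self.backbone(seq)
        world_emb = out[:, -dream.shape[1] - 1:-1]   # world embedding at the <dream> positions
        action_emb = out[:, -1]                      # latent action embedding at the <action> position
        return world_emb, action_emb                 # action_emb conditions the diffusion action head

    def decode_world(self, world_emb):
        # Training-only prediction heads (the dashed box in the figure); inference skips them.
        dyn, depth, sem = world_emb.split(self.n_dream_each, dim=1)
        return self.dyn_head(dyn), self.depth_head(depth), self.sem_head(sem)

During training, decode_world would supervise the world embedding against dynamic-region, depth, and semantic targets; at inference only forward is used.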

Dynamic Regions


Visualization of dynamic regions over time. We show the static camera (left) and wrist-mounted camera (right) observations alongside the corresponding dynamic masks generated by our method at multiple time steps. The masks highlight dynamic regions by leveraging optical flow trajectories extracted via CoTracker. Compared to the original observations, our method effectively suppresses irrelevant background and focuses on interaction-relevant areas (e.g., moving objects and end-effector), enabling more structured and efficient action reasoning.
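A minimal sketch of how point tracks (e.g., produced by CoTracker) could be turned into a binary dynamic-region mask is shown below; the track format, displacement threshold, and dilation radius are assumptions, and the actual DreamVLA preprocessing may differ.

# Sketch: derive a dynamic-region mask from 2D point tracks (e.g., produced by CoTracker).
# The track layout, threshold, and patch radius below are illustrative assumptions.
import numpy as np

def dynamic_region_mask(tracks, visibility, hw, disp_thresh=2.0, radius=6):
    """
    tracks:     (T, N, 2) array of (x, y) point positions over T frames
    visibility: (T, N) boolean array, True where a point is visible
    hw:         (H, W) size of the output mask
    Returns an (H, W) uint8 mask marking regions whose tracked points move.
    """
    H, W = hw
    disp = np.linalg.norm(np.diff(tracks, axis=0), axis=-1)   # (T-1, N) per-step displacement
    vis = visibility[1:] & visibility[:-1]                    # count only steps where the point is visible
    motion = np.where(vis, disp, 0.0).sum(axis=0)             # accumulated motion per point
    moving = tracks[-1][motion > disp_thresh]                 # last-frame positions of moving points

    mask = np.zeros((H, W), dtype=np.uint8)
    for x, y in moving:
        x0, x1 = int(max(x - radius, 0)), int(min(x + radius + 1, W))
        y0, y1 = int(max(y - radius, 0)), int(min(y + radius + 1, H))
        mask[y0:y1, x0:x1] = 1                                # dilate each moving point into a small patch
    return mask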

Block-wise Structured Attention


To preserve clear modality boundaries, the <dream> query is decomposed into three sub-queries (dynamic, depth, and semantics). If these sub-queries could freely attend to one another, high-frequency flow details would contaminate depth reasoning, and semantic cues might bleed into motion features, producing noisy, mixed representations. We therefore mask their mutual attention: each sub-query attends only to the shared visual, language, and state tokens, while direct links among the three are disabled, keeping their latent features disentangled and free of cross-talk.
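The mask itself is simple to construct. Below is a hedged sketch that builds a boolean attention mask with this block structure, assuming the shared context (visual/language/state) tokens come first and the three sub-query groups follow; the token layout and counts are assumptions.

# Sketch of a block-wise structured attention mask.
# Convention follows PyTorch's boolean attn_mask: True marks positions that may NOT be attended.
import torch

def blockwise_attn_mask(n_ctx, n_dyn, n_depth, n_sem):
    """Context tokens are fully visible; the dynamic, depth, and semantic sub-query groups
    attend to the context (and within their own group) but not to each other."""
    n = n_ctx + n_dyn + n_depth + n_sem
    mask = torch.zeros(n, n, dtype=torch.bool)      # False = attention allowed

    groups, start = [], n_ctx
    for size in (n_dyn, n_depth, n_sem):
        groups.append((start, start + size))
        start += size

    for i, (si, ei) in enumerate(groups):
        for j, (sj, ej) in enumerate(groups):
            if i != j:
                mask[si:ei, sj:ej] = True           # disable cross-attention between sub-query groups
    return mask

# Example: 64 shared context tokens and 9 queries per sub-group; the result can be passed
# as attn_mask to torch.nn.MultiheadAttention or a TransformerEncoder layer.
mask = blockwise_attn_mask(64, 9, 9, 9)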

CALVIN ABC-D Experiments

We report the average success rate computed over 1000 rollouts for each task and the average number of tasks completed when solving 5 instructions consecutively (Avg. Len.). DreamVLA shows a significant advantage over the baselines. The best results are in bold.
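For reference, these metrics can be computed from per-rollout outcomes as sketched below; the input format (a list of five booleans per rollout, one per instruction in the chain) is an assumption about how rollout logs are stored.

# Sketch: computing CALVIN-style metrics from rollout outcomes.
# Assumed input: each rollout is a list of 5 booleans, one per instruction in the chain.
def calvin_metrics(rollouts):
    n = len(rollouts)
    lengths = []
    for chain in rollouts:
        k = 0
        for ok in chain:              # count consecutive successes from the start of the chain
            if not ok:
                break
            k += 1
        lengths.append(k)
    avg_len = sum(lengths) / n        # "Avg. Len." = mean number of consecutively completed tasks
    # Fraction of rollouts that complete at least i+1 instructions in a row
    success_rates = [sum(l > i for l in lengths) / n for i in range(5)]
    return avg_len, success_rates

# Example with three rollouts:
# calvin_metrics([[True]*5, [True, True, False, False, False], [False]*5])
# -> (2.33..., [0.67, 0.67, 0.33, 0.33, 0.33])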


CALVIN Demo

Each demo below shows DreamVLA completing a chain of five consecutive language instructions from the CALVIN ABC-D long-horizon evaluation.

lift_pink_block_slider->place_in_slider->turn_on_led->close_drawer->rotate_blue_block_left
turn_off_led->move_slider_left->rotate_blue_block_right->lift_red_block_table->stack_block
turn_off_led->lift_red_block_slider->place_in_slider->move_slider_right->push_blue_block_left
rotate_pink_block_right->move_slider_right->turn_on_led->close_drawer->lift_pink_block_table
rotate_pink_block_left->turn_on_led->push_into_drawer->lift_pink_block_drawer->place_in_drawer
rotate_blue_block_right->lift_pink_block_table->place_in_slider->move_slider_left->open_drawer

Real-world Demo

The real-world demonstrations showcase the practical application of our model in a physical environment. We collect post-training data using SoFar, a modular robot manipulation method.

Pick up the bottle.
Pick up the yellow doll.
Pick up the white doll.
Place the banana into the basket.
Place the chili into the basket.
Place the chili into the white basket.

BibTeX

@article{dreamvla25,
  author       = {Wenyao Zhang and
                  Hongsi Liu and
                  Zekun Qi and
                  Yunan Wang and
                  Xinqiang Yu and
                  Jiazhao Zhang and
                  Runpei Dong and
                  Jiawei He and
                  He Wang and
                  Zhizheng Zhang and
                  Li Yi and
                  Wenjun Zeng and
                  Xin Jin},
  title        = {DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge},
  journal      = {CoRR},
  volume       = {abs/2507.04447},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2507.04447},
  doi          = {10.48550/ARXIV.2507.04447},
  eprinttype   = {arXiv},
  eprint       = {2507.04447}
}