ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Yuyang Zhang¹,²,³* · Wenyao Zhang¹,²,³*† · Zekun Qi⁴ · He Zhang³ · Haitao Lin³ · Jingbo Zhang³ · Yao Mu¹ · Xiaokang Yang¹ · Wenjun Zeng² · Xin Jin²,⁵✉
¹Shanghai Jiao Tong University · ²Eastern Institute of Technology · ³Tencent Robotics X · ⁴Tsinghua University · ⁵Zhongguancun Academy

* Equal contribution · † Project lead · ✉ Corresponding author

Figure: Teaser from the paper comparing video-generation WAMs with ImageWAM.
93.18% RoboTwin 2.0
98.4% LIBERO
83.1% LIBERO-Plus
84.7% FLOPs saved
4.1× Latency speedup

Abstract

World-action reasoning without dense future-video tokens

World Action Models (WAMs) often rely on video generation to bridge visual world modeling and robot control. This design is intuitive but expensive: dense multi-frame future tokens increase latency, full video prediction spends capacity on action-irrelevant appearance details, and long-horizon imagination can introduce visual errors that mislead action prediction.

ImageWAM asks whether WAMs really need video generation. It uses pretrained image editing models as robot policy backbones, because editing is naturally source-grounded, instruction-guided, and change-centric. During inference, ImageWAM does not decode the edited target frame. It takes the KV caches produced by a single image-editing forward step as compact world-action context, then uses a flow-matching action expert to generate future robot actions.
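A minimal sketch of the last stage described above, assuming a plain Euler integrator and the convention that t = 0 is Gaussian noise and t = 1 is the action chunk. The names `sample_action_chunk`, `toy_velocity`, and the `context` dictionary are hypothetical and used only for illustration. In ImageWAM the conditioning would be the editing backbone's KV caches and the velocity would come from the learned action expert; neither is reproduced here.

```python
import numpy as np


def sample_action_chunk(velocity_fn, context, horizon, action_dim,
                        num_steps=10, seed=0):
    """Integrate da/dt = v(a, t, context) from t=0 (noise) to t=1 (actions)."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim))  # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t never reaches 1.0, so (1 - t) below stays positive
        actions = actions + dt * velocity_fn(actions, t, context)
    return actions


def toy_velocity(actions, t, context):
    """Hypothetical stand-in for a learned action expert.

    For the straight path a_t = (1 - t) * noise + t * target, the exact
    velocity at a_t is (target - a_t) / (1 - t).
    """
    return (context["target"] - actions) / (1.0 - t)


# Toy conditioning: a fixed target chunk instead of real KV caches.
target = np.full((4, 7), 0.5)
chunk = sample_action_chunk(toy_velocity, {"target": target},
                            horizon=4, action_dim=7)
print(np.abs(chunk - target).max())
```

With the exact straight-path velocity, the Euler iterates land on the target chunk up to floating-point error. A learned velocity field would only approximate this, and the number of integration steps then becomes a quality and latency trade-off.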

Method

Image editing as an action-relevant visual transformation prior

A manipulation instruction usually specifies what should change in the scene. ImageWAM transfers this instruction-to-change prior from image editing into robot control.

1. Current-state grounding
The editing backbone is conditioned on the current observation, preserving source-image context while focusing on task-relevant edits.

2. Instruction-guided change
Language specifies the target transformation, encouraging features that encode object motion, spatial rearrangement, and contact-relevant changes.

3. Compact inference
At test time, ImageWAM uses one editing forward step and internal KV caches, avoiding dense future-video denoising and decoding.

Figure: ImageWAM framework with image editing backbone, KV caches, and action expert. Given an observation and instruction, the image editing backbone produces transformation-aware intermediate features. The action expert consumes the KV context and denoises an action chunk through flow matching.

Model family

The codebase supports FLUX.2 ImageWAM, OmniGen2 ImageWAM, and Ovis-U1 ImageWAM. FLUX.2 comes in 4B and 9B variants; Ovis-U1 is the smallest backbone, with a 1.1B-parameter image-editing DiT, and remains competitive in many settings.

Implementation signal from code

The implementation wraps editing backbones as video/editing experts, pairs them with ActionDiT-style action experts, and mixes them through a Mixture-of-Transformers interface. FLUX.2 uses a slim compatible action expert with matched attention dimensions and a flow-matching action scheduler.
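One way to picture how an action expert can consume cached backbone context is a single-head attention step in which action queries attend over the cached keys and values concatenated with the action tokens' own keys and values. The sketch below is an assumption-laden illustration in NumPy, not the released implementation: `action_attention`, the weight matrices, and all sizes are hypothetical. It does show why matched attention dimensions matter, since the action expert can have a narrower hidden width as long as its projections land in the same head dimension as the cache.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def action_attention(action_tokens, cached_k, cached_v, w_q, w_k, w_v):
    """Action queries attend over cached context K/V plus their own K/V."""
    q = action_tokens @ w_q                                    # (T_a, d_h)
    k = np.concatenate([cached_k, action_tokens @ w_k], axis=0)  # (T_ctx+T_a, d_h)
    v = np.concatenate([cached_v, action_tokens @ w_v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes: slim action width d_a, shared attention dim d_h.
rng = np.random.default_rng(0)
d_a, d_h, t_a, t_ctx = 8, 16, 4, 10
action_tokens = rng.standard_normal((t_a, d_a))
cached_k = rng.standard_normal((t_ctx, d_h))  # stands in for backbone cache
cached_v = rng.standard_normal((t_ctx, d_h))
w_q, w_k, w_v = (rng.standard_normal((d_a, d_h)) for _ in range(3))

out, weights = action_attention(action_tokens, cached_k, cached_v, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because the cache is computed once per observation, every denoising step of the action expert can reuse it; only the action-side projections are recomputed.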

Results

Strong policy performance with a cheaper world-action intermediate

ImageWAM is evaluated on LIBERO, LIBERO-Plus, RoboTwin 2.0, and real-world dual-arm manipulation without extra embodied policy pretraining.

RoboTwin 2.0

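| Method | P.T. | Clean | Rand. | Avg. |
| --- | --- | --- | --- | --- |
| π0 | | 65.92 | 58.40 | 62.16 |
| π0.5 | | 82.74 | 76.76 | 79.75 |
| FastWAM | × | 91.88 | 91.78 | 91.83 |
| ImageWAM | × | 92.65 | 93.70 | 93.18 |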
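The Avg. column above is consistent with an unweighted mean of the Clean and Rand. columns. The short check below recomputes it from the reported values; the dictionary layout and variable names are illustrative, and the equal weighting of the two settings is an assumption inferred from the numbers rather than a stated protocol.

```python
# Reported RoboTwin 2.0 values: (Clean, Rand., Avg.)
robotwin = {
    "π0": (65.92, 58.40, 62.16),
    "π0.5": (82.74, 76.76, 79.75),
    "FastWAM": (91.88, 91.78, 91.83),
    "ImageWAM": (92.65, 93.70, 93.18),
}

# Recompute each average as the plain mean of the two settings.
averages = {name: (clean + rand) / 2 for name, (clean, rand, _) in robotwin.items()}

for name, (_, _, reported) in robotwin.items():
    gap = abs(averages[name] - reported)
    print(f"{name}: recomputed {averages[name]:.3f}, reported {reported:.2f}, gap {gap:.3f}")

# Margin of ImageWAM over FastWAM on the recomputed average.
margin = averages["ImageWAM"] - averages["FastWAM"]
print(f"ImageWAM minus FastWAM: {margin:.3f} points")
```

Every recomputed mean falls within rounding distance of the reported Avg. value.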

LIBERO

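| Method | Spatial | Object | Goal | Long | Avg. |
| --- | --- | --- | --- | --- | --- |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| LingBot-VA | 98.5 | 99.6 | 97.2 | 98.5 | 98.5 |
| FastWAM | 98.2 | 100.0 | 97.0 | 95.2 | 97.6 |
| ImageWAM | 97.2 | 99.2 | 98.8 | 98.4 | 98.4 |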

LIBERO-Plus robustness

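| Variant | Camera | Language | Noise | Layout | Avg. |
| --- | --- | --- | --- | --- | --- |
| ImageWAM OmniGen2 | 80.0 | 70.9 | 77.1 | 71.8 | 71.8 |
| ImageWAM Ovis-U1 | 63.3 | 75.4 | 75.2 | 74.6 | 71.2 |
| ImageWAM FLUX.2 4B | 80.8 | 91.4 | 93.8 | 80.5 | 83.1 |
| ImageWAM FLUX.2 9B | 79.8 | 95.2 | 93.3 | 83.1 | 85.2 |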

Real robot & efficiency

| Item | FastWAM | ImageWAM |
| --- | --- | --- |
| Real-world avg. | 79.0% | 84.5% |
| Latency | Baseline | 4.1× faster / 1.15× faster* |
| FLOPs | Baseline | 84.7% less / 26.4% less* |
| Intermediate | Video or cache | Editing cache |

* FastWAM-IDM uses video intermediates; FastWAM 1-step uses cache intermediates.
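The table mixes two ways of expressing savings: latency as a speedup factor and FLOPs as a percentage reduction. The helper below converts between the two so the figures can be read on a common scale. The function names are illustrative, and the conversion assumes each figure is measured against its respective baseline's total cost.

```python
def reduction_factor(saved):
    """Convert 'X% less' (given as a fraction) into an N-times-smaller factor."""
    if not 0.0 <= saved < 1.0:
        raise ValueError("saved must be in [0, 1)")
    return 1.0 / (1.0 - saved)


def fraction_saved(speedup):
    """Convert an N-times-faster speedup into the fraction of cost removed."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# FLOPs savings reported in the table, expressed as factors.
for saved in (0.847, 0.264):
    print(f"{saved:.1%} fewer FLOPs is a {reduction_factor(saved):.2f}x reduction")

# Latency speedups reported in the table, expressed as fractions saved.
for speedup in (4.1, 1.15):
    print(f"{speedup}x faster removes {fraction_saved(speedup):.1%} of the latency")
```

Read this way, the larger FLOPs saving corresponds to a reduction of about 6.5×, which is greater than the 4.1× latency speedup reported alongside it.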

Figure: Experiment setup from the paper, covering RoboTwin 2.0, LIBERO, LIBERO-Plus, and real-world robot evaluation.
Figure: Attention visualization from the paper comparing ImageWAM with FastWAM on action-relevant regions.

Analysis

Why editing caches help

A. Task-relevant attention
The paper's analysis shows that ImageWAM concentrates attention on manipulated objects, target receptacles, and contact regions, while suppressing action-irrelevant background.

B. Avoiding future-video artifacts
Because inference does not decode dense future frames, the action expert is less exposed to geometry and layout artifacts that can appear in imagined video rollouts.
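The task-relevant attention claim can be made measurable by asking what fraction of an attention map's total mass falls inside a mask of task-relevant pixels or patches. The sketch below is one hypothetical way to compute that; `attention_mass_in_mask` and the toy maps are assumptions for illustration and are not a metric taken from the paper.

```python
import numpy as np


def attention_mass_in_mask(attn_map, mask):
    """Fraction of total (non-negative) attention that falls inside the mask."""
    attn = np.asarray(attn_map, dtype=float)
    region = np.asarray(mask, dtype=bool)
    if attn.shape != region.shape:
        raise ValueError("attention map and mask must have the same shape")
    total = attn.sum()
    if total <= 0.0:
        raise ValueError("attention map must have positive total mass")
    return float(attn[region].sum() / total)


# Toy 4x4 patch grid with a 2x2 task-relevant region in the center.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

uniform = np.ones((4, 4))   # attention spread evenly over the image
focused = np.ones((4, 4))
focused[1:3, 1:3] = 10.0    # attention concentrated on the masked region

print(attention_mass_in_mask(uniform, mask))
print(attention_mass_in_mask(focused, mask))
```

A uniform map gives a score equal to the mask's share of the image, so scores well above that share indicate attention concentrated on the masked region.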

Figure: Paper analysis illustrating why image editing can provide a more action-relevant intermediate than dense future-video generation.

Release

Code and checkpoints

ImageWAM code and model releases are available through GitHub and Hugging Face.

Citation

BibTeX

@article{zhang2026imagewam,
  title   = {ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?},
  author  = {Zhang, Yuyang and Zhang, Wenyao and Qi, Zekun and Zhang, He and Lin, Haitao and Zhang, Jingbo and Mu, Yao and Yang, Xiaokang and Zeng, Wenjun and Jin, Xin},
  year    = {2026}
}