Wenyao Zhang

I'm a final-year PhD student in the joint program between Shanghai Jiao Tong University and Eastern Institute of Technology, Ningbo, under the supervision of Wenjun Zeng and Xiaokang Yang. I collaborate closely with Xin Jin, Li Yi, Zhizheng Zhang, He Wang, and Zekun Qi.

From Jun. 2024 to Mar. 2026, I was an intern at Galbot.

From Mar. 2026 to Jun. 2026, I was an intern at Tencent Robotics X.

My research focuses on Robot Learning, Representation Learning and Multimodal Large Language Models.

Email  /  Google Scholar  /  Github  /  X

News

  • 2026-06: Four papers accepted to ECCV 2026 (4/4)
  • 2026-05: One paper accepted to ICML 2026 (1/1)
  • 2026-01: Two papers accepted to CVPR 2026 (2/3)
  • 2026-01: Three papers accepted to ICLR 2026 (3/3)
  • 2025-09: Two papers accepted to NeurIPS 2025, one of them as a Spotlight presentation (2/2)
  • 2025-06: Two papers accepted to ICCV 2025, one of them as a Highlight presentation (2/5)
  • 2024-12: One paper accepted to TMI
  • 2024-09: One paper accepted to NeurIPS 2024 as Spotlight presentation (1/1)
  • 2024-09: One paper accepted to TMM
  • 2024-07: Two papers accepted to ECCV 2024 (2/3)
  • 2022-12: One paper accepted to RA-L & ICRA 2023
Publications

    * indicates equal contribution & ^ indicates project lead

    ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
    Yuyang Zhang*, Wenyao Zhang*^, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin
    ArXiv Preprint, 2026
    [arXiv] [Project Page] [Code] [Huggingface]

    We introduce ImageWAM, a simple world-action model that repurposes pretrained image editing models for robot action prediction, using editing KV caches as compact world-action context to reduce inference cost while improving policy performance.

    MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models
    Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan
    ArXiv Preprint, 2026
    [arXiv] [Project Page] [Code]

    We introduce MaskWAM, an object-centric world-action model that unifies mask prompting and mask prediction to provide spatially grounded supervision and robust policy generalization in language-clear and language-ambiguous manipulation tasks.

    Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
    Zekun Qi*, Xuchuan Chen*, Dairu Liu*, Chenghuai Lin*, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)
    [arXiv] [Project Page] [Code]

    We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a 2B-frame retargeted motion corpus for whole-body control, achieving unprecedented zero-shot generalization to unseen motions and control tasks.

    LIMMT: Less Is More for Motion Tracking
    Yu Guan*, Zekun Qi*, Chenghuai Lin, Xuchuan Chen, Dairu Liu, Wenyao Zhang, Jilong Wang, Xinqiang Yu, He Wang, Li Yi
    International Conference on Machine Learning (ICML 2026)
    [arXiv] [Project Page]

    We introduce LIMMT, a data-centric framework for humanoid motion tracking that curates motion data through physics feasibility, action diversity, and action complexity, showing that training on just 3% of curated data outperforms using full corpora.

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
    Jingwen Sun*, Wenyao Zhang*, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
    European Conference on Computer Vision (ECCV 2026)
    [arXiv] [Project Page] [Code]

    We introduce VLA-JEPA, a JEPA-style pretraining framework that learns action-relevant transition semantics by predicting future latent states without pixel reconstruction or information leakage, achieving consistent gains in generalization and robustness over existing methods.

    ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models
    Baorui Peng*, Wenyao Zhang*^, Liang Xu, Zekun Qi, Jiazhao Zhang, Hongsi Liu, Wenjun Zeng, Xin Jin
    ArXiv Preprint, 2026
    [arXiv]

    We introduce ReWorld, a framework that employs reinforcement learning to align video-based embodied world models with physical realism, task completion capability, embodiment plausibility, and visual quality through a hierarchical reward model trained on a large-scale video preference dataset.

    Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining
    Wenyao Zhang*, Bozhou Zhang*, Zekun Qi, Wenjun Zeng, Xin Jin, Li Zhang
    International Conference on Learning Representations (ICLR 2026)
    [paper] [Code] [Huggingface]

    We propose DeFI, which decouples visual forward and inverse dynamics pretraining: a General Forward Dynamics Model (GFDM) handles future prediction and a General Inverse Dynamics Model (GIDM) extracts latent actions from video. The two are then unified during finetuning for robot manipulation.

    OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
    Mengdi Jia*, Zekun Qi*, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
    International Conference on Learning Representations (ICLR 2026)
    [arXiv] [Project Page] [Code] [Huggingface]

    Grounded in cognitive psychology, we introduce a comprehensive and challenging spatial reasoning benchmark covering 50 detailed categories and 1.5K manually labeled QA pairs.

    Reasoning in Space via Grounding in the World
    Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, Peidong Liu
    International Conference on Learning Representations (ICLR 2026)
    [arXiv] [Project Page] [Code] [HuggingFace]

    MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning
    Tianyu Xu*, Jiawei Chen*, Jiazhao Zhang*, Wenyao Zhang, Zekun Qi, Minghan Li, Zhizheng Zhang, He Wang
    European Conference on Computer Vision (ECCV 2026)
    [arXiv] [Project Page]

    We present MM-Nav, a multi-view VLA system with 360° perception. The model is trained on large-scale expert navigation data collected from multiple reinforcement learning agents and demonstrates robust generalization in complex navigation scenarios.

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
    Wenyao Zhang*, Hongsi Liu*, Zekun Qi*, Yunnan Wang*, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
    Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
    [arXiv] [Project Page] [Code] [Huggingface]

    We recast the vision–language–action model as a perception–prediction–action model and make the model explicitly predict a compact set of dynamic, spatial, and high-level semantic information, supplying concise yet comprehensive look-ahead cues for planning.

    SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
    Zekun Qi*, Wenyao Zhang*, Yufei Ding*, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
    Annual Conference on Neural Information Processing Systems (NeurIPS 2025) Spotlight
    [arXiv] [Project Page] [Code] [Huggingface]

    We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

    Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
    Wenyao Zhang*, Hongsi Liu*, Bohan Li*, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
    International Conference on Computer Vision (ICCV 2025)
    [arXiv] [Code] [Huggingface]

    We propose Hybrid-depth, a novel framework that systematically integrates foundation models (CLIP and DINO) to extract visual priors and acquire sufficient contextual information for self-supervised depth estimation methods. The method achieves SOTA accuracy on KITTI and boosts downstream perception.

    DexVLG: Dexterous Vision-Language-Grasp Model at Scale
    Jiawei He*, Danshi Li*, Xinqiang Yu*, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
    International Conference on Computer Vision (ICCV 2025) Highlight
    [arXiv] [Code]

    We introduce DexVLG, a vision-language-grasp model trained on a 170M-pose, 174k-object dataset that generates instruction-aligned dexterous grasp poses and achieves SOTA success and part-grasp accuracy.

    Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
    Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin
    Annual Conference on Neural Information Processing Systems (NeurIPS 2024) Spotlight
    [arXiv] [Code]

    We propose a framework that disentangles scene graphs into semantic components and recomposes them to achieve complex, generalizable image generation.

    Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction
    Wenyao Zhang, Letian Wu, Zequn Zhang, Tao Yu, Chao Ma, Xin Jin, Xiaokang Yang, Wenjun Zeng
    IEEE Transactions on Multimedia (TMM 2024)
    [Paper] [Code]

    We propose a framework that transfers VLMs to downstream tasks by designing visual prompts from an attention perspective that reduces the transfer solution space.

    Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback
    Xin Jin, Bohan Li, Baao Xie, Wenyao Zhang, Jinming Liu, Ziqiang Li, Tao Yang, Wenjun Zeng
    European Conference on Computer Vision (ECCV 2024)
    [Paper] [Code]

    We introduce CL-Dis, a closed-loop unsupervised disentanglement framework that integrates β-VAE distillation with diffusion-based feedback to learn semantically disentangled representations without labels.

    Hierarchical Temporal Context Learning for Camera-Based Semantic Scene Completion
    Bohan Li, Jiajun Deng, Wenyao Zhang, Zhujin Liang, Dalong Du, Xin Jin, Wenjun Zeng
    European Conference on Computer Vision (ECCV 2024)
    [Paper] [Code]

    We introduce HTCL, a hierarchical temporal context learning paradigm for camera-based 3D semantic scene completion.

    Predict the Rover Mobility Over Soft Terrain Using Articulated Wheeled Bevameter
    Wenyao Zhang, Shipeng Lyv, Feng Xue, Chen Yao, Zheng Zhu, Zhenzhong Jia
    IEEE Robotics and Automation Letters (RA-L 2022) & IEEE International Conference on Robotics and Automation (ICRA 2023)
    [Paper] [Code]

    We propose an on-board mobility prediction approach using an articulated wheeled bevameter that consists of a force-controlled arm and an instrumented bevameter (with force and vision sensors) as its end-effector.


    © Wenyao Zhang | Last updated: Jun 20, 2026