Wenyao Zhang

I'm a PhD student in the joint program between Shanghai Jiao Tong University and Eastern Institute of Technology, Ningbo, under the supervision of Wenjun Zeng and Xiaokang Yang. I collaborate closely with Xin Jin, Li Yi, Zhizheng Zhang and He Wang. I am currently a research intern at GalBot.

Previously, I obtained my master's degree from Southern University of Science and Technology, supervised by Zhenzhong Jia, and my bachelor's degree from Beijing Jiao Tong University, supervised by Xin Zhang.

My research focuses on Robot Learning, Representation Learning and Multimodal Large Language Models. I am looking for collaborators to work on the following topics:

  • Vision-Language-Action Models
  • World Models
  • Other interesting topics in Robot Learning

If you are interested in these topics, please feel free to contact me.

Email / Google Scholar / GitHub

ICCV 2025 DreamVLA: Dream Comprehensive World Knowledge for Vision-Language-Action Model
Wenyao Zhang*, Hongsi Liu*, Zekun Qi*, Yunnan Wang*, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Zhizheng Zhang, He Wang, Li Yi, Wenjun Zeng, Xin Jin
arXiv preprint, 2025
[arXiv] [Code]

ICCV 2025 Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Wenyao Zhang*, Hongsi Liu*, Bohan Li*, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
International Conference on Computer Vision (ICCV 2025)
[arXiv] [Code]
ICCV 2025 DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Jiawei He*, Danshi Li*, Xinqiang Yu*, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
International Conference on Computer Vision (ICCV 2025)
[arXiv] [Code]
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia*, Zekun Qi*, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
arXiv preprint, 2025
[arXiv] [Project Page] [Code] [Huggingface]

Based on cognitive psychology, we introduce a comprehensive and challenging spatial reasoning benchmark, including 50 detailed categories and 1.5K manually labeled QA pairs.

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi*, Wenyao Zhang*, Yufei Ding*, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
arXiv preprint, 2025
[arXiv] [Project Page] [Code] [Huggingface]

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

NeurIPS 2024 Spotlight Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation
Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin
Annual Conference on Neural Information Processing Systems (NeurIPS 2024) Spotlight
[arXiv] [Code]

We propose a framework that disentangles scene graphs into semantic components and recomposes them to achieve complex, generalizable image generation.

TMM Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction
Wenyao Zhang, Letian Wu, Zequn Zhang, Tao Yu, Chao Ma, Xin Jin, Xiaokang Yang, Wenjun Zeng
IEEE Transactions on Multimedia (TMM 2024)
[Paper] [Code]

We propose a framework that transfers VLMs to downstream tasks by designing visual prompts from an attention perspective, which reduces the transfer solution space.

ECCV 2024 Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback
Xin Jin, Bohan Li, Baao Xie, Wenyao Zhang, Jinming Liu, Ziqiang Li, Tao Yang, Wenjun Zeng
European Conference on Computer Vision (ECCV 2024)
[Paper] [Code]

We introduce CL-Dis, a closed-loop unsupervised disentanglement framework that integrates β-VAE distillation with diffusion-based feedback to learn semantically disentangled representations without labels.

ECCV 2024 Hierarchical Temporal Context Learning for Camera-Based Semantic Scene Completion
Bohan Li, Jiajun Deng, Wenyao Zhang, Zhujin Liang, Dalong Du, Xin Jin, Wenjun Zeng
European Conference on Computer Vision (ECCV 2024)
[Paper] [Code]

We introduce HTCL, a hierarchical temporal context learning paradigm for camera-based 3D semantic scene completion.

RA-L Predict the Rover Mobility Over Soft Terrain Using Articulated Wheeled Bevameter
Wenyao Zhang, Shipeng Lyv, Feng Xue, Chen Yao, Zheng Zhu, Zhenzhong Jia
IEEE Robotics and Automation Letters (RA-L 2022)
[Paper] [Code]

We propose an on-board mobility prediction approach using an articulated wheeled bevameter that consists of a force-controlled arm and an instrumented bevameter (with force and vision sensors) as its end-effector.


Website Template


© Zekun Qi | Last updated: Feb 19, 2025