Wenyao Zhang

	DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge Wenyao Zhang, "Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin arXiv preprint, 2025 [arXiv] [Code] We recast the vision–language–action model as a perception–prediction–action model and make the model explicitly predict a compact set of dynamic, spatial and high- level semantic information, supplying concise yet comprehensive look-ahead cues for planning.
	Hybrid-grained Feature Aggregation with Coare-to-fine Language Guidance for Self-supervised Monocular Depth Estimation Wenyao Zhang, Hongsi Liu , Bohan Li , Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao , Xinqiang Yu , Wenjun Zeng Xin Jin, International Conference on Computer Vision (ICCV 2025)* we propose Hybrid-depth, a novel framework that systematically integrates foundation models (CLIP and DINO) to extract visual priors and acquire sufficient contextual information for self-supervised depth estimation methods.
	DexVLG: Dexterous Vision-Language-Grasp Model at Scale Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi He Wang, International Conference on Computer Vision (ICCV 2025)* [arXiv] [Code] We introduce DexVLG, a vision‑language‑grasp model trained on the 170M‑pose, 174k‑object dataset that can generate instruction‑aligned dexterous grasp poses and achieves SOTA success and part‑grasp accuracy.
	OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi arXiv preprint, 2025 [arXiv] [Project Page] [Code] [Huggingface] Based on cognitive psychology, we introduce a comprehensive and complex spatial reasoning benchmark, including 50 detailed categories and 1.5K manual labeled QA pairs.
	SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation Zekun Qi, Wenyao Zhang*, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi arXiv preprint, 2025 [arXiv] [Project Page] [Code] [Huggingface] We introduce the concept of semantic orientation, representing the object orientation condition on open vocabulary language.
	Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation Yunnan Wang, Ziqiang Li, Wenyao Zhang, Zequn Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin Annual Conference on Neural Information Processing Systems (NeurIPS 2024) Spotlight [arXiv] [Code] We propose a framework that disentangles scene graphs into semantic components and recomposes them to achieve complex, generalizable image generation.
	Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multimodal Interaction Wenyao Zhang, Letian Wu , Zequn Zhang, Tao Yu, Chao Ma, Xin Jin, Xiaokang Yang, Wenjun Zeng IEEE Transactions on Multimedia (TMM 2024) [Paper] [Code] We propose a framework that transfers VLMs to downstream tasks by designing visual prompts from an attention perspective that reduces the transfer solution space.
	Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback Xin Jin, Bohan Li, Baao Xie, Wenyao Zhang, Jinming Liu, Ziqiang Li, Tao Yang, Wenjun Zeng European Conference on Computer Vision (ECCV 2024) [Paper] [Code] We introduce CL-Dis, a closed-loop unsupervised disentanglement framework that integrates β-VAE distillation with diffusion-based feedback to learn semantically disentangled representations without labels.
	Hierarchical Temporal Context Learning for Camera-Based Semantic Scene Completion Bohan Li, Jiajun Deng, Wenyao Zhang, Zhujin Liang, Dalong Du, Xin Jin, Wenjun Zeng European Conference on Computer Vision (ECCV 2024) [Paper] [Code] We introduce HTCL, a hierarchical temporal context learning paradigm for camera-based 3D semantic scene completion.
	Predict the Rover Mobility Over Soft Terrain Using Articulated Wheeled Bevameter Wenyao Zhang, Shipeng Lyv , Feng Xue, Chen Yao, Zheng Zhu, Zhenzhong Jia IEEE Robotics and Automation Letters (RA-L 2022) [Paper] [Code] We propose an on-board mobility prediction approach using an articulated wheeled bevameter that consists of a force-controlled arm and an instrumented bevameter (with force and vision sensors) as its end-effector.