World models have become critical in autonomous driving, where accurate scene understanding and future prediction are essential for safe control. Recent works have explored using Vision-Language Models (VLMs) for planning, but existing methods often treat perception, prediction, and planning as separate modules.
We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts future waypoints, which condition the VLM-based image generator to synthesize plausible future frames. These predictions provide additional supervision signals that enhance scene understanding and iteratively refine trajectory generation.
We further compare discrete and continuous output representations for future image prediction, analyzing their impact on downstream driving performance. On the challenging Bench2Drive benchmark, UniDrive-WM generates high-fidelity future images and improves over the previous best method by 5.9% in L2 trajectory error and 9.2% in collision rate. These results demonstrate the benefits of tightly integrating VLM-powered reasoning, planning, and generative world modeling for autonomous driving.
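For reference, the L2 trajectory error reported above is conventionally computed as the mean Euclidean distance between predicted and ground-truth waypoints over the planning horizon. The sketch below is an illustrative implementation of that metric under this assumption, not code from the paper; the function name `l2_trajectory_error` and the array shapes are ours.

```python
import numpy as np

def l2_trajectory_error(pred_waypoints: np.ndarray, gt_waypoints: np.ndarray) -> float:
    """Mean Euclidean (L2) distance between predicted and ground-truth waypoints.

    Both arrays have shape (T, 2): T future timesteps, (x, y) in the ego frame.
    """
    assert pred_waypoints.shape == gt_waypoints.shape
    per_step = np.linalg.norm(pred_waypoints - gt_waypoints, axis=-1)  # (T,)
    return float(per_step.mean())

# Example: a 3-step horizon with a constant 0.1 m lateral offset gives an error of 0.1 m.
pred = np.array([[1.0, 0.1], [2.0, 0.1], [3.0, 0.1]])
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
print(l2_trajectory_error(pred, gt))  # 0.1
```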
Figure 2. The overall architecture of UniDrive-WM. Our model unifies scene understanding, trajectory planning, and future image generation in a single VLM-based framework. The pipeline consists of: (1) a QT-Former-based encoder that encodes the historical context and multi-view vision input; (2) an LLM that performs the reasoning task; and (3) an output layer that generates the planning trajectory and the future image prediction, bridging the gaps between the planning, image, and reasoning spaces.
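To make the data flow in Figure 2 concrete, here is a minimal PyTorch-style sketch of the three stages (encoder, LLM backbone, output heads). All module and attribute names (`UniDriveWMSketch`, `plan_head`, `future_image_head`, hidden sizes) are placeholders of ours rather than the paper's API; the sketch only mirrors the pipeline order described in the caption.

```python
import torch
import torch.nn as nn

class UniDriveWMSketch(nn.Module):
    """Illustrative pipeline: QT-Former-style encoder -> LLM backbone -> output heads."""

    def __init__(self, d_model: int = 256, n_waypoints: int = 6, img_tokens: int = 64):
        super().__init__()
        self.n_waypoints = n_waypoints
        # (1) Encoder standing in for the QT-Former: fuses historical and multi-view features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        # (2) Stand-in for the LLM reasoning backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        # (3) Output heads: planning waypoints and future-image tokens/latents.
        self.plan_head = nn.Linear(d_model, n_waypoints * 2)      # (x, y) per waypoint
        self.future_image_head = nn.Linear(d_model, img_tokens)   # per-token logits or latents

    def forward(self, vision_tokens: torch.Tensor):
        ctx = self.encoder(vision_tokens)        # historical + multi-view context
        hidden = self.llm(ctx)                   # reasoning over the fused tokens
        pooled = hidden.mean(dim=1)              # simple pooling before the heads
        waypoints = self.plan_head(pooled).view(-1, self.n_waypoints, 2)
        future_image = self.future_image_head(pooled)
        return waypoints, future_image

# Example forward pass on dummy multi-view tokens: batch of 2, 100 tokens, 256 dims.
model = UniDriveWMSketch()
wps, img = model(torch.randn(2, 100, 256))
print(wps.shape, img.shape)  # torch.Size([2, 6, 2]) torch.Size([2, 64])
```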
Figure 3. Two design choices for image generation in a unified multimodal model. For future image prediction, we use both an autoregressive architecture and an autoregressive + diffusion (AR+Diffusion) architecture. (a) Left: autoregressive architecture; (b) right: AR+Diffusion architecture.
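The two design choices in Figure 3 correspond to a discrete output path (autoregressive prediction of quantized image tokens) versus a continuous one (an AR backbone whose hidden states condition a diffusion-style decoder). The sketch below is our own schematic contrast under those assumptions; names such as `ar_image_logits` and `DiffusionHeadSketch`, the codebook size, and the toy refinement loop are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # assumed size of a VQ codebook for discrete image tokens
D_MODEL = 256

# (a) Autoregressive: the backbone hidden state is mapped to logits over a codebook,
#     and future-image tokens are sampled one step at a time.
ar_image_logits = nn.Linear(D_MODEL, VOCAB_SIZE)

def ar_next_token(hidden_state: torch.Tensor) -> torch.Tensor:
    """Sample the next discrete image token from the current hidden state."""
    logits = ar_image_logits(hidden_state)                       # (B, VOCAB_SIZE)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

# (b) AR + Diffusion: the backbone hidden state conditions a denoiser that iteratively
#     refines a continuous image latent instead of emitting discrete tokens.
class DiffusionHeadSketch(nn.Module):
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.latent_dim = latent_dim
        self.denoiser = nn.Sequential(
            nn.Linear(latent_dim + D_MODEL, 512), nn.GELU(), nn.Linear(512, latent_dim)
        )

    def forward(self, cond: torch.Tensor, steps: int = 4) -> torch.Tensor:
        latent = torch.randn(cond.shape[0], self.latent_dim)     # start from noise
        for _ in range(steps):                                   # toy fixed-step refinement
            latent = latent - self.denoiser(torch.cat([latent, cond], dim=-1))
        return latent

hidden = torch.randn(2, D_MODEL)
print(ar_next_token(hidden).shape)           # torch.Size([2, 1])
print(DiffusionHeadSketch()(hidden).shape)   # torch.Size([2, 64])
```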
@misc{xiong2026unidrivewmunifiedunderstandingplanning,
      title={UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving},
      author={Zhexiao Xiong and Xin Ye and Burhan Yaman and Sheng Cheng and Yiren Lu and Jingru Luo and Nathan Jacobs and Liu Ren},
      year={2026},
      eprint={2601.04453},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.04453},
}