UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), Washington University in St. Louis, Arizona State University, Case Western Reserve University

Figure 1. Top: trajectory-conditioned visual generation. Middle: VLM-instructed end-to-end (E2E) planning. Bottom: our unified future world modeling method. Compared with conditioned future image generation models and VLM-instructed planning methods, our framework establishes the connection between the reasoning, action, and visual generation spaces via a joint VLM-guided trajectory planner and future frame generation.

Abstract

World models have become critical in autonomous driving, where accurate scene understanding and future prediction are essential for safe control. Recent works have explored using Vision-Language Models (VLMs) for planning, but existing methods often treat perception, prediction, and planning as separate modules.

We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts future waypoints, which condition the VLM-based image generator to synthesize plausible future frames. These predictions provide additional supervision signals that enhance scene understanding and iteratively refine trajectory generation (see the sketches in the Method section below).

We further compare discrete and continuous output representations for future image prediction, analyzing their impact on downstream driving performance. On the challenging Bench2Drive benchmark, UniDrive-WM generates high-fidelity future images and reduces L2 trajectory error by 5.9% and collision rate by 9.2% relative to the previous best method. These results demonstrate the benefits of tightly integrating VLM-powered reasoning, planning, and generative world modeling for autonomous driving.

Method

UniDrive-WM Architecture

Figure 2. The overall architecture of UniDrive-WM. Our model unifies scene understanding, trajectory planning, and future generation in a single VLM-based framework. The pipeline consists of: (1) a QT-Former-based encoder that extracts historical context and multi-view visual features; (2) an LLM that performs the reasoning task; and (3) an output layer that generates the planned trajectory and the future image prediction, bridging the gaps between the planning, image, and reasoning spaces.
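To make the three-stage pipeline concrete, below is a minimal PyTorch-style skeleton of a forward pass. Every interface here (qt_former, llm, image_generator, the head shapes) is an assumption for illustration, not the actual UniDrive-WM API.

import torch.nn as nn

class UniDriveWMSketch(nn.Module):
    """Illustrative skeleton of the Figure 2 pipeline; every submodule
    interface below is an assumption, not the paper's actual API."""

    def __init__(self, qt_former, llm, image_generator,
                 d_model=1024, num_waypoints=6):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.encoder = qt_former          # (1) QT-Former: history + multi-view input
        self.llm = llm                    # (2) LLM: reasoning over fused tokens
        self.traj_head = nn.Linear(d_model, num_waypoints * 2)  # (3) planning output
        self.image_generator = image_generator                  # (3) future frame output

    def forward(self, multi_view_images, history_frames, prompt_tokens):
        # Stage 1: fuse multi-view input and historical context into tokens.
        scene_tokens = self.encoder(multi_view_images, history_frames)
        # Stage 2: reason over scene and prompt tokens; assume a (B, L, d_model) output.
        hidden = self.llm(scene_tokens, prompt_tokens)
        query = hidden[:, -1]             # last token as a scene summary
        # Stage 3a: planned trajectory as (B, num_waypoints, 2) waypoints.
        waypoints = self.traj_head(query).view(-1, self.num_waypoints, 2)
        # Stage 3b: future frames generated conditioned on the planned
        # trajectory, linking the planning, image, and reasoning spaces.
        future_frames = self.image_generator(hidden, waypoints)
        return waypoints, future_frames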

Trajectory-Conditioned Future Image Generation
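As described in the abstract, the planned waypoints condition the future-frame generator, and the generation loss feeds back as extra supervision for the shared backbone. Below is a hedged sketch of what such a joint training step could look like; the module names (model.vlm, model.plan_head, model.img_head) and the loss weight lambda_img are hypothetical and not taken from the paper.

import torch.nn.functional as F

def joint_step(model, batch, lambda_img=0.5):
    """One hypothetical training step coupling planning and generation.

    The planner's waypoints condition the future-image head, and the
    image loss back-propagates into the shared VLM, acting as extra
    supervision for scene understanding. All names are illustrative.
    """
    # Shared VLM features from multi-view images and the language prompt.
    feats = model.vlm(batch["images"], batch["prompt"])

    # Trajectory planning: predict future waypoints of shape (B, T, 2).
    waypoints = model.plan_head(feats)
    loss_plan = F.l1_loss(waypoints, batch["gt_waypoints"])

    # Trajectory-conditioned future frame prediction.
    pred_frames = model.img_head(feats, waypoints)
    loss_img = F.mse_loss(pred_frames, batch["gt_future_frames"])

    # Joint objective: image prediction acts as auxiliary supervision for
    # the shared backbone, which in turn refines subsequent planning.
    return loss_plan + lambda_img * loss_img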


Figure 3. The two design choices for image generation in a unified multimodal model. For future image prediction, we explore both an autoregressive and an autoregressive+diffusion architecture. (a) Left: autoregressive architecture; (b) right: AR+diffusion architecture.
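To clarify the contrast in Figure 3, here is a minimal sketch of the two output heads: (a) a discrete head that autoregressively classifies next visual token IDs over a VQ codebook, and (b) a continuous head that regresses latents for a diffusion decoder. Class names, dimensions, and the diffusion_decoder interface are assumptions, not the paper's implementation.

import torch.nn as nn

class DiscreteARHead(nn.Module):
    """(a) Autoregressive: predict the next visual token ID over a VQ
    codebook; the future frame is decoded from the sampled token grid."""

    def __init__(self, d_model=1024, codebook_size=8192):
        super().__init__()
        self.to_logits = nn.Linear(d_model, codebook_size)

    def forward(self, hidden):               # hidden: (B, L, d_model)
        return self.to_logits(hidden)        # (B, L, codebook_size) logits

class ContinuousDiffusionHead(nn.Module):
    """(b) AR + diffusion: regress continuous latents with the LLM, then
    let a diffusion decoder denoise them into the future frame."""

    def __init__(self, diffusion_decoder, d_model=1024, latent_dim=16):
        super().__init__()
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.decoder = diffusion_decoder     # assumed callable: latents -> image

    def forward(self, hidden):
        latents = self.to_latent(hidden)     # (B, L, latent_dim)
        return self.decoder(latents)

In general, the discrete variant can reuse the LLM's standard cross-entropy training over token IDs, while the continuous variant defers pixel synthesis to the diffusion decoder; which trade-off UniDrive-WM favors is reported in the paper's comparison of the two representations.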

Qualitative Results

BibTeX

@misc{xiong2026unidrivewmunifiedunderstandingplanning,
  title={UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving}, 
  author={Zhexiao Xiong and Xin Ye and Burhan Yaman and Sheng Cheng and Yiren Lu and Jingru Luo and Nathan Jacobs and Liu Ren},
  year={2026},
  eprint={2601.04453},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.04453}, 
}