UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), Washington University in St. Louis, Arizona State University, Case Western Reserve University

Figure 1. Top: trajectory-conditioned visual generation. Middle: VLM-instructed end-to-end (E2E) planning. Bottom: our unified future world modeling method. Compared with conditioned future image generation models and VLM-instructed planning methods, our framework establishes the connection between the reasoning, action, and visual generation spaces via a joint VLM-guided trajectory planner and future frame generation.

Abstract

World models have become critical in autonomous driving, where accurate scene understanding and future prediction are essential for safe control. Recent works have explored using Vision-Language Models (VLMs) for planning, but existing methods often treat perception, prediction, and planning as separate modules.

We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts future waypoints, which condition the VLM-based image generator to synthesize plausible future frames. These predictions provide additional supervision signals that enhance scene understanding and iteratively refine trajectory generation (see the sketches in the Method section below).

We further compare discrete and continuous output representations for future image prediction, analyzing their impact on downstream driving performance. On the challenging Bench2Drive benchmark, UniDrive-WM generates high-fidelity future images and reduces L2 trajectory error by 5.9% and collision rate by 9.2% relative to the previous best method. These results demonstrate the benefits of tightly integrating VLM-powered reasoning, planning, and generative world modeling for autonomous driving.

Method

UniDrive-WM Architecture

Figure 2. The overall architecture of UniDrive-WM. Our model unifies scene understanding, trajectory planning, and future generation in a single VLM-based framework. The pipeline consists of: (1) a QT-Former-based encoder that extracts historical context and multi-view visual features; (2) an LLM that performs the reasoning task; and (3) an output layer that generates the planned trajectory and the future image prediction, bridging the gaps between the planning, image, and reasoning spaces.
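To make the three-stage pipeline concrete, below is a minimal PyTorch-style skeleton of a forward pass. Every interface here (qt_former, llm, image_generator, the head shapes) is an assumption for illustration, not the actual UniDrive-WM API.

import torch.nn as nn

class UniDriveWMSketch(nn.Module):
    """Illustrative skeleton of the Figure 2 pipeline; every submodule
    interface below is an assumption, not the paper's actual API."""

    def __init__(self, qt_former, llm, image_generator,
                 d_model=1024, num_waypoints=6):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.encoder = qt_former          # (1) QT-Former: history + multi-view input
        self.llm = llm                    # (2) LLM: reasoning over fused tokens
        self.traj_head = nn.Linear(d_model, num_waypoints * 2)  # (3) planning output
        self.image_generator = image_generator                  # (3) future frame output

    def forward(self, multi_view_images, history_frames, prompt_tokens):
        # Stage 1: fuse multi-view input and historical context into tokens.
        scene_tokens = self.encoder(multi_view_images, history_frames)
        # Stage 2: reason over scene and prompt tokens; assume a (B, L, d_model) output.
        hidden = self.llm(scene_tokens, prompt_tokens)
        query = hidden[:, -1]             # last token as a scene summary
        # Stage 3a: planned trajectory as (B, num_waypoints, 2) waypoints.
        waypoints = self.traj_head(query).view(-1, self.num_waypoints, 2)
        # Stage 3b: future frames generated conditioned on the planned
        # trajectory, linking the planning, image, and reasoning spaces.
        future_frames = self.image_generator(hidden, waypoints)
        return waypoints, future_frames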

Trajectory-Conditioned Future Image Generation
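As described in the abstract, the planned waypoints condition the future-frame generator, and the generation loss feeds back as extra supervision for the shared backbone. Below is a hedged sketch of what such a joint training step could look like; the module names (model.vlm, model.plan_head, model.img_head) and the loss weight lambda_img are hypothetical and not taken from the paper.

import torch.nn.functional as F

def joint_step(model, batch, lambda_img=0.5):
    """One hypothetical training step coupling planning and generation.

    The planner's waypoints condition the future-image head, and the
    image loss back-propagates into the shared VLM, acting as extra
    supervision for scene understanding. All names are illustrative.
    """
    # Shared VLM features from multi-view images and the language prompt.
    feats = model.vlm(batch["images"], batch["prompt"])

    # Trajectory planning: predict future waypoints of shape (B, T, 2).
    waypoints = model.plan_head(feats)
    loss_plan = F.l1_loss(waypoints, batch["gt_waypoints"])

    # Trajectory-conditioned future frame prediction.
    pred_frames = model.img_head(feats, waypoints)
    loss_img = F.mse_loss(pred_frames, batch["gt_future_frames"])

    # Joint objective: image prediction acts as auxiliary supervision for
    # the shared backbone, which in turn refines subsequent planning.
    return loss_plan + lambda_img * loss_img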


Figure 3. The two design choices for image generation in a unified multimodal model. For future image prediction, we explore both an autoregressive and an autoregressive+diffusion architecture. (a) Left: autoregressive architecture; (b) right: AR+diffusion architecture.
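To clarify the contrast in Figure 3, here is a minimal sketch of the two output heads: (a) a discrete head that autoregressively classifies next visual token IDs over a VQ codebook, and (b) a continuous head that regresses latents for a diffusion decoder. Class names, dimensions, and the diffusion_decoder interface are assumptions, not the paper's implementation.

import torch.nn as nn

class DiscreteARHead(nn.Module):
    """(a) Autoregressive: predict the next visual token ID over a VQ
    codebook; the future frame is decoded from the sampled token grid."""

    def __init__(self, d_model=1024, codebook_size=8192):
        super().__init__()
        self.to_logits = nn.Linear(d_model, codebook_size)

    def forward(self, hidden):               # hidden: (B, L, d_model)
        return self.to_logits(hidden)        # (B, L, codebook_size) logits

class ContinuousDiffusionHead(nn.Module):
    """(b) AR + diffusion: regress continuous latents with the LLM, then
    let a diffusion decoder denoise them into the future frame."""

    def __init__(self, diffusion_decoder, d_model=1024, latent_dim=16):
        super().__init__()
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.decoder = diffusion_decoder     # assumed callable: latents -> image

    def forward(self, hidden):
        latents = self.to_latent(hidden)     # (B, L, latent_dim)
        return self.decoder(latents)

In general, the discrete variant can reuse the LLM's standard cross-entropy training over token IDs, while the continuous variant defers pixel synthesis to the diffusion decoder; which trade-off UniDrive-WM favors is reported in the paper's comparison of the two representations.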

Qualitative Results

BibTeX

@misc{xiong2026unidrivewmunifiedunderstandingplanning,
  title={UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving}, 
  author={Zhexiao Xiong and Xin Ye and Burhan Yaman and Sheng Cheng and Yiren Lu and Jingru Luo and Nathan Jacobs and Liu Ren},
  year={2026},
  eprint={2601.04453},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.04453}, 
}