PrITTI: Primitive-based Generation of
Controllable and Editable 3D Semantic Urban Scenes

Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger
University of Tübingen, Tübingen AI Center · Zhejiang University · Noah's Ark Lab, Huawei

PrITTI generates (1) high-quality, controllable 3D semantic urban scenes in a compact primitive-based representation using a latent diffusion model. Starting from a generated scene (e.g., the middle sample), we demonstrate downstream applications including (2) scene editing, (3) inpainting, (4) outpainting, and (5) photo-realistic street-view synthesis.

Controllable Scene Synthesis

[Interactive examples: three generated scenes (Scene #1, Scene #2, Scene #3), shown for the controllable Low Vegetation class.]

Abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis.
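To make the contrast with dense voxel grids concrete, below is a minimal, hypothetical sketch of what a hybrid layout of vectorized object primitives plus a rasterized ground surface could look like in code. The class names, fields, and grid resolutions are illustrative assumptions, not the paper's actual data format.

from dataclasses import dataclass, field
import numpy as np


@dataclass
class ObjectPrimitive:
    """One vectorized 3D primitive (e.g., a car, building, or tree)."""
    semantic_class: str        # e.g., "car", "building", "vegetation"
    center: np.ndarray         # (3,) position in the scene frame [m]
    size: np.ndarray           # (3,) primitive extents [m]
    yaw: float                 # rotation about the vertical axis [rad]


@dataclass
class SceneLayout:
    """Hybrid layout: a list of object primitives plus a rasterized ground surface."""
    objects: list[ObjectPrimitive] = field(default_factory=list)
    # Height map (per-cell ground elevation) and per-class binary occupancy masks.
    ground_height: np.ndarray = field(
        default_factory=lambda: np.zeros((256, 256), dtype=np.float32))
    ground_mask: np.ndarray = field(
        default_factory=lambda: np.zeros((4, 256, 256), dtype=bool))


# Editing reduces to simple list/array operations, e.g. inserting a car primitive:
scene = SceneLayout()
scene.objects.append(ObjectPrimitive(
    semantic_class="car",
    center=np.array([12.0, 3.5, 0.0]),
    size=np.array([4.5, 1.8, 1.6]),
    yaw=np.pi / 2,
))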

Overview

Method Training Overview. An input 3D semantic layout \( \mathcal{S} \) comprises object primitives, encoded as feature vectors \( \mathbf{F} \), and extruded ground polygons, rasterized into height maps \( \mathbf{H} \) and binary occupancy masks \( \mathbf{B} \). A layout VAE with separate encoder-decoder pairs for objects (\( \mathcal{E}_\mathcal{O}\)/\(\mathcal{D}_\mathcal{O}\)) and ground (\(\mathcal{E}_\mathcal{G}\)/\(\mathcal{D}_\mathcal{G}\)) first compresses \( \mathcal{S} \) into a structured latent representation \( \mathbf{z}_\mathcal{L} \). In the second stage, a diffusion model is trained over this latent space for controllable scene generation. At inference, the diffusion model generates latent codes either unconditionally or conditioned on the scene label \( y\), which are then decoded by the VAE into novel 3D layouts.
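As a rough, hypothetical sketch of this two-stage setup, the layout VAE and a latent diffusion training step could be structured as follows. The module architectures, tensor shapes, the assumed denoiser signature denoiser(z_t, t, y), and the DDPM-style noise-prediction objective are all assumptions for illustration, not the released implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayoutVAE(nn.Module):
    """Stage 1: compress a layout S into a structured latent z_L.

    Separate encoder/decoder pairs handle object feature vectors (E_O / D_O)
    and rasterized ground height maps H plus occupancy masks B (E_G / D_G).
    """

    def __init__(self, obj_dim=16, latent_dim=32, ground_ch=5):
        super().__init__()
        self.enc_obj = nn.Sequential(nn.Linear(obj_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 2 * latent_dim))
        self.dec_obj = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, obj_dim))
        self.enc_gnd = nn.Conv2d(ground_ch, 2 * latent_dim, kernel_size=4, stride=4)
        self.dec_gnd = nn.ConvTranspose2d(latent_dim, ground_ch, kernel_size=4, stride=4)

    def encode(self, obj_feats, height, mask):
        ground = torch.cat([height, mask], dim=1)              # (B, ground_ch, h, w)
        mu_o, logvar_o = self.enc_obj(obj_feats).chunk(2, -1)  # objects: (B, N, latent)
        mu_g, logvar_g = self.enc_gnd(ground).chunk(2, 1)      # ground: (B, latent, h', w')
        z_o = mu_o + torch.randn_like(mu_o) * (0.5 * logvar_o).exp()
        z_g = mu_g + torch.randn_like(mu_g) * (0.5 * logvar_g).exp()
        return z_o, z_g

    def decode(self, z_o, z_g):
        return self.dec_obj(z_o), self.dec_gnd(z_g)


def diffusion_training_step(denoiser, z, y, T=1000):
    """Stage 2: one denoising step on frozen VAE latents, conditioned on scene label y."""
    t = torch.randint(0, T, (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)
    alpha_bar = torch.cos(0.5 * math.pi * t.float() / T) ** 2  # toy cosine schedule
    while alpha_bar.dim() < z.dim():
        alpha_bar = alpha_bar.unsqueeze(-1)
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * noise
    return F.mse_loss(denoiser(z_t, t, y), noise)              # predict the added noise

At inference, one would instead start from pure noise, iterate the denoiser (optionally conditioned on y) to obtain latent codes, and decode them with the VAE into a novel 3D layout.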

BibTeX

@article{Tze2025PrITTI,
  author    = {Tze, Christina Ourania and Dauner, Daniel and Liao, Yiyi and Tsishkou, Dzmitry and Geiger, Andreas},
  title     = {PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes},
  journal   = {arXiv preprint arXiv:2506.19117},
  year      = {2025},
}