Interactive results viewer: Low Vegetation (Scene #1, Scene #2, Scene #3).
Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis.
Method
Training Overview. An input 3D semantic layout \( \mathcal{S} \) comprises object primitives, encoded as feature vectors \( \mathbf{F} \), and extruded ground polygons, rasterized into height maps \( \mathbf{H} \) and binary occupancy masks \( \mathbf{B} \). A layout VAE with separate encoder-decoder pairs for objects (\( \mathcal{E}_\mathcal{O}\)/\(\mathcal{D}_\mathcal{O}\)) and ground (\(\mathcal{E}_\mathcal{G}\)/\(\mathcal{D}_\mathcal{G}\)) first compresses \( \mathcal{S} \) into a structured latent representation \( \mathbf{z}_\mathcal{L} \). In the second stage, a diffusion model is trained over this latent space for controllable scene generation. At inference, the diffusion model generates latent codes either unconditionally or conditioned on the scene label \( y\), which are then decoded by the VAE into novel 3D layouts.
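As a rough illustration of this two-stage setup, the sketch below pairs a toy layout VAE (separate object and ground branches) with a latent denoising objective. Module names, tensor shapes, the omitted KL term, and the flow-matching-style noising path are assumptions for clarity, not the paper's exact architecture or training formulation.

```python
import torch
import torch.nn as nn

class LayoutVAE(nn.Module):
    """Toy layout VAE: objects F -> per-primitive latents, ground (H, B) -> spatial latents."""
    def __init__(self, obj_dim=32, latent_dim=16):
        super().__init__()
        # Object branch: per-primitive feature vectors, encoded independently.
        self.enc_obj = nn.Linear(obj_dim, 2 * latent_dim)           # predicts (mu, logvar)
        self.dec_obj = nn.Linear(latent_dim, obj_dim)
        # Ground branch: height map H and occupancy mask B stacked as two channels.
        self.enc_gnd = nn.Conv2d(2, 2 * latent_dim, kernel_size=4, stride=4)
        self.dec_gnd = nn.ConvTranspose2d(latent_dim, 2, kernel_size=4, stride=4)

    def encode(self, F, H, B):
        mu_o, logvar_o = self.enc_obj(F).chunk(2, dim=-1)
        mu_g, logvar_g = self.enc_gnd(torch.stack([H, B], dim=1)).chunk(2, dim=1)
        z_o = mu_o + torch.randn_like(mu_o) * (0.5 * logvar_o).exp()   # reparameterization
        z_g = mu_g + torch.randn_like(mu_g) * (0.5 * logvar_g).exp()
        return z_o, z_g        # together they form the structured latent z_L (KL term omitted here)

    def decode(self, z_o, z_g):
        F_hat = self.dec_obj(z_o)
        H_hat, B_hat = self.dec_gnd(z_g).unbind(dim=1)
        return F_hat, H_hat, B_hat

def denoising_loss(denoiser, z_L, y=None):
    """Stage 2: train a denoiser on (a flattened view of) the latent, optionally conditioned on y."""
    t = torch.rand(z_L.shape[0], device=z_L.device)                  # noise level in (0, 1)
    noise = torch.randn_like(z_L)
    z_t = (1 - t).view(-1, 1, 1) * z_L + t.view(-1, 1, 1) * noise    # linear noising path
    target = noise - z_L                                             # velocity target
    return ((denoiser(z_t, t, y) - target) ** 2).mean()
```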
We generate controllable 3D semantic urban scenes under varying vegetation density conditions. The synthesized layouts exhibit realistic and diverse spatial compositions with clearly structured and well-shaped primitive geometries.
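One plausible way to realize label-conditioned sampling (e.g., with a "low vegetation" scene label) is classifier-free guidance over the latent denoiser. The sketch below assumes the velocity-predicting denoiser and linear noising path from the training sketch above; the guidance mechanism and hyperparameters are assumptions, not the paper's confirmed sampler.

```python
import torch

@torch.no_grad()
def sample_conditional(denoiser, label, shape, steps=50, guidance=4.0, device="cpu"):
    """Generate a latent z_L conditioned on an integer scene label (e.g. vegetation density)."""
    z = torch.randn(shape, device=device)                        # start from pure noise at t = 1
    y = torch.full((shape[0],), label, dtype=torch.long, device=device)
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps, device=device)
        v_cond = denoiser(z, t, y)                                # label-conditioned velocity
        v_unc = denoiser(z, t, None)                              # unconditional velocity
        v = v_unc + guidance * (v_cond - v_unc)                   # classifier-free guidance
        z = z - dt * v                                            # Euler step toward t = 0
    return z                                                      # decoded into a layout by the VAE
```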
Our method enables scene inpainting through a latent manipulation mechanism, where a binary mask controls which spatial regions are modified. By appropriately configuring this mask, we can edit scenes in diverse ways (e.g., targeting only top, bottom, or lateral regions) while keeping the unmasked content unchanged. The generated regions blend seamlessly with the existing geometry and semantics.
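The sketch below shows one common way such mask-based latent inpainting can be realized: at every denoising step, the unmasked part of the latent is re-anchored to a correspondingly noised copy of the original scene latent, so only the masked region is regenerated. The sampler and re-anchoring schedule follow the simplified sketches above and are assumptions, not the exact released procedure.

```python
import torch

@torch.no_grad()
def inpaint(denoiser, z_orig, mask, steps=50):
    """z_orig: clean latent of the existing scene; mask: 1 where new content is generated."""
    z = torch.randn_like(z_orig)
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i / steps
        t_vec = torch.full((z.shape[0],), t, device=z.device)
        z = z - dt * denoiser(z, t_vec, None)                     # one unconditional Euler step
        # Re-impose the known content: noise the original latent to the *current*
        # level (t - dt) and copy it into the unmasked region.
        t_cur = t - dt
        z_known = (1 - t_cur) * z_orig + t_cur * torch.randn_like(z_orig)
        z = mask * z + (1 - mask) * z_known
    return z
```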
We perform scene outpainting by leveraging the same latent manipulation mechanism used for inpainting, enabling controlled expansion of scenes beyond their original spatial extent. The generated scenes maintain semantic coherence, with realistic and diverse road structures as well as plausible object placements.
We extend scenes iteratively, growing the layout block by block.
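One way to picture this block-by-block growth is the loop below: each new block is initialized from the overlapping half of the previous one, and a mask-based inpainting routine (such as the sketch above) regenerates the unseen half. The 50% overlap, the shift direction, and the spatial (ground) latent layout are illustrative assumptions.

```python
import torch

@torch.no_grad()
def extend_scene(inpaint_fn, z_scene, num_blocks=3):
    """Iteratively grow a scene; inpaint_fn(z_init, mask) regenerates the masked region
    (e.g. inpaint_fn = functools.partial(inpaint, denoiser) with the sketch above).

    z_scene: (B, C, H, W) spatial (ground) latent of the current block.
    Returns a list of overlapping latent blocks, oldest first.
    """
    blocks = [z_scene]
    W = z_scene.shape[-1]
    for _ in range(num_blocks):
        prev = blocks[-1]
        # The left half of the new block is the known right half of the previous one;
        # the right half is left empty and regenerated by inpainting.
        z_init = torch.zeros_like(prev)
        z_init[..., : W // 2] = prev[..., W // 2 :]
        mask = torch.ones_like(prev)
        mask[..., : W // 2] = 0.0                  # keep the overlap region fixed
        blocks.append(inpaint_fn(z_init, mask))
    return blocks
```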
Our instance-level, primitive-based representation enables intuitive editing of individual objects through direct manipulation of their parameters. Unlike in voxel-based methods, editing operations such as rotation, translation, and scaling can be performed directly, without additional post-processing.
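The snippet below illustrates how such instance-level edits reduce to plain parameter updates under a simplified primitive parameterization (position, yaw, per-axis scale); the exact primitive encoding used by the model may differ.

```python
from dataclasses import dataclass, field
import math

@dataclass
class Primitive:
    """Simplified object primitive: semantic label plus pose and size parameters."""
    label: str                                                         # e.g. "car" or "building"
    position: list = field(default_factory=lambda: [0.0, 0.0, 0.0])    # x, y, z in meters
    yaw: float = 0.0                                                   # rotation about the up-axis, radians
    scale: list = field(default_factory=lambda: [1.0, 1.0, 1.0])       # per-axis extent

    def translate(self, dx, dy, dz=0.0):
        self.position = [self.position[0] + dx, self.position[1] + dy, self.position[2] + dz]

    def rotate(self, dyaw):
        self.yaw = (self.yaw + dyaw) % (2 * math.pi)

    def rescale(self, factor):
        self.scale = [s * factor for s in self.scale]

# Example edit: move a car 2 m along x, rotate it by 90 degrees, and enlarge it by 10%.
car = Primitive(label="car", position=[5.0, 2.0, 0.0])
car.translate(2.0, 0.0)
car.rotate(math.pi / 2)
car.rescale(1.1)
```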
Generated scenes are rendered into semantic maps that condition photo-realistic street-view synthesis, producing street-view images that are both realistic and semantically consistent with the generated layouts.
@article{Tze2025PrITTI,
author = {Tze, Christina Ourania and Dauner, Daniel and Liao, Yiyi and Tsishkou, Dzmitry and Geiger, Andreas},
title = {PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes},
journal = {arXiv preprint arXiv:2506.19117},
year = {2025},
}