Joint Camera-LiDAR Scene Synthesis and Perception for Autonomous Driving

Document Type

Article

Publication Title

IEEE Access

Abstract

The advancement of autonomous driving and embedded AI systems has intensified the need for large-scale, richly annotated multimodal datasets encompassing RGB images, semantic labels, and 3D LiDAR data. Manual collection and annotation of such datasets remain costly and time-consuming, especially when temporal and cross-modal consistency is required. This article introduces Joint Camera-LiDAR Scene Synthesis and Perception (JCLSP), a unified generative framework that simultaneously synthesizes photorealistic RGB images, semantic segmentation maps, and LiDAR range images through a compact, optimized diffusion process. Unlike prior approaches that employ separate diffusion branches, JCLSP fuses the image and LiDAR modalities early in the pipeline and leverages a shared latent space for coherent multimodal generation. The architecture integrates three key elements: BKSDM, which streamlines the diffusion process by eliminating redundant blocks; a joint image-LiDAR diffusion module that applies the BKSDM framework to enable depth-aware synthesis with geometric fidelity; and modality-specific decoders that extract semantic masks, LiDAR range images, and image scenes from the shared latent representation. Experimental results on synthetic datasets indicate that JCLSP captures meaningful cross-modal correlations and preserves spatial features. By generating joint representations from camera and LiDAR views along with semantic segmentation annotations, the method shows promise for cross-modal representation learning with labeled data.
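The pipeline sketched in the abstract (early camera-LiDAR fusion into a shared latent, a block-pruned BKSDM-style denoiser, and modality-specific decoders) can be pictured with a minimal PyTorch-style module. The sketch below is an illustrative assumption only: the class name, channel sizes, the single residual denoising step, and the convolutional decoders are placeholders, not the authors' implementation.

```python
# Minimal sketch of a shared-latent joint diffusion step (illustrative only).
# Module names, channel sizes, and the fusion scheme are assumptions; the
# paper's BKSDM-pruned U-Net and exact decoders are not reproduced here.
import torch
import torch.nn as nn


class JointLatentDenoiser(nn.Module):
    """Early-fuses an RGB image and a LiDAR range image, denoises a shared
    latent, and decodes modality-specific outputs (RGB, semantic mask, range)."""

    def __init__(self, latent_ch: int = 64, num_classes: int = 19):
        super().__init__()
        # Early fusion: concatenate 3-channel RGB and 1-channel range image.
        self.fuse = nn.Conv2d(3 + 1, latent_ch, kernel_size=3, padding=1)
        # Compact denoiser standing in for the pruned (BKSDM-style) backbone.
        self.denoise = nn.Sequential(
            nn.Conv2d(latent_ch, latent_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(latent_ch, latent_ch, 3, padding=1),
        )
        # Modality-specific decoders reading from the shared latent.
        self.dec_rgb = nn.Conv2d(latent_ch, 3, 3, padding=1)
        self.dec_sem = nn.Conv2d(latent_ch, num_classes, 3, padding=1)
        self.dec_range = nn.Conv2d(latent_ch, 1, 3, padding=1)

    def forward(self, rgb: torch.Tensor, rng: torch.Tensor):
        z = self.fuse(torch.cat([rgb, rng], dim=1))  # shared latent
        z = z + self.denoise(z)                      # one residual denoising step
        return self.dec_rgb(z), self.dec_sem(z), self.dec_range(z)


if __name__ == "__main__":
    model = JointLatentDenoiser()
    rgb = torch.randn(1, 3, 128, 256)  # camera image
    rng = torch.randn(1, 1, 128, 256)  # LiDAR range image (same resolution assumed)
    img, sem, depth = model(rgb, rng)
    print(img.shape, sem.shape, depth.shape)
```

The point of the sketch is the data flow: both modalities enter one latent tensor before any denoising, so the three decoder heads necessarily share geometry, which is the property the abstract attributes to the shared latent space.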

First Page

166740

Last Page

166759

DOI

10.1109/ACCESS.2025.3613054

Publication Date

1-1-2025
