UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

Anonymous


UAVTwin takes video captured by a UAV as input and generates data for training UAV-based human recognition methods.

Abstract

We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations—both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
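The appearance modeling strategy referenced above is not detailed in this section. As a purely illustrative sketch of the general idea behind per-image appearance modeling in 3DGS, the PyTorch snippet below modulates per-Gaussian colors with a learned per-image embedding; the class name `AppearanceModel` and all hyperparameters are assumptions, not the actual MsGS design.

```python
import torch
import torch.nn as nn

class AppearanceModel(nn.Module):
    """Illustrative per-image appearance modulation for 3DGS (hypothetical, not the exact MsGS design).

    Each training image gets a learned embedding; a small MLP maps
    (base color, embedding) to a corrected color, so lighting changes
    between captures do not corrupt the shared Gaussian parameters.
    """

    def __init__(self, num_images: int, embed_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embeddings = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, base_rgb: torch.Tensor, image_idx: torch.Tensor) -> torch.Tensor:
        # base_rgb: (N, 3) per-Gaussian colors; image_idx: index of the training image
        emb = self.embeddings(image_idx).expand(base_rgb.shape[0], -1)
        delta = self.mlp(torch.cat([base_rgb, emb], dim=-1))
        return (base_rgb + delta).clamp(0.0, 1.0)  # appearance-corrected colors
```

During training, the embedding of the image currently being fitted would be used; at data-generation time a single reference embedding can be selected so renders share a consistent appearance.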

Proposed Method

Our approach first constructs a digital twin from UAV images captured at different times. We introduce MsGS, a novel 3DGS method that handles images with varying appearance and reconstructs a clean mesh, Gaussian splats, and an MLP for novel-view synthesis. Our method then generates data by compositing foreground humans rendered in Blender with backgrounds rendered by the trained Gaussian splats.
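As a rough illustration of this compositing step, the foreground humans (with alpha and shadow masks from Blender) can be blended over the background rendered by the trained Gaussian splats. The NumPy sketch below is a minimal assumption of how such a composite could be formed; the file names and the shadow-darkening factor are placeholders, not the paper's exact procedure.

```python
import numpy as np
import imageio.v3 as iio

# Minimal compositing sketch (assumed pipeline, not the exact UAVTwin implementation):
# blend a Blender-rendered foreground (RGBA) over a 3DGS background render,
# darkening background pixels covered by the Blender shadow mask.
bg = iio.imread("background_3dgs.png")[..., :3] / 255.0      # 3DGS novel-view render
fg = iio.imread("foreground_blender.png") / 255.0            # RGBA human render
shadow = iio.imread("shadow_mask.png") / 255.0               # 1 where shadow falls

if shadow.ndim == 3:                                          # accept gray or RGB masks
    shadow = shadow[..., 0]

alpha = fg[..., 3:4]                                          # human coverage
shadow_strength = 0.4                                         # placeholder darkening factor
bg_shaded = bg * (1.0 - shadow_strength * shadow[..., None])  # darken shadowed background
composite = alpha * fg[..., :3] + (1.0 - alpha) * bg_shaded   # alpha blend humans on top

iio.imwrite("composite.png", (composite * 255).astype(np.uint8))
```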

Building Digital Twin

Structure from Motion · Mesh

Visualizations of Drone1-Noon and Drone2-Noon across 11 video sequences, along with the corresponding mesh reconstructions produced by our method. Each video sequence covers only part of the scene, so all 11 sequences are needed for a complete reconstruction.

Neural Data Generation


For data generation, our method comprises three steps: synthetic human placement, camera trajectory generation, and rendering of foreground humans with a graphics engine (e.g., Blender); a minimal scripting sketch follows the list below.

  • (1) 🧍Placing Humans: Place human actors in the scene.
  • (2) 🎥Placing Camera: Generate camera trajectories for rendering images.
  • (3) 🕺Applying Motion: Randomly apply motions to each human actor and render images.
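These three steps can be scripted through Blender's Python API. The sketch below is a minimal, assumed example: the object name "Human", the action name "Walk", and the trajectory waypoints are placeholders, and this is not the paper's actual generation script.

```python
import bpy
import random

scene = bpy.context.scene

# (1) Place a human actor at a random ground position ("Human" is a placeholder name).
human = bpy.data.objects["Human"]
human.location = (random.uniform(-10, 10), random.uniform(-10, 10), 0.0)

# (2) Generate a simple camera trajectory by keyframing the camera along waypoints.
cam = scene.camera
waypoints = [(-5, -5, 8), (0, -6, 9), (5, -5, 8)]  # placeholder UAV path
for frame, pos in enumerate(waypoints, start=1):
    cam.location = pos
    cam.keyframe_insert(data_path="location", frame=frame)

# (3) Apply a motion clip to the actor and render each frame.
if human.animation_data is None:
    human.animation_data_create()
human.animation_data.action = bpy.data.actions.get("Walk")  # placeholder action name

for frame in range(1, len(waypoints) + 1):
    scene.frame_set(frame)
    scene.render.filepath = f"//renders/frame_{frame:04d}.png"
    bpy.ops.render.render(write_still=True)
```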

Examples of Synthesized Data

Images · Images+BBox · Mask+Shadow

Examples of Synthesized Data with Corresponding Labels Generated Using Our Method

  • 🖼️ Images: Synthesized images created using our method.
  • 📦 BBox: Bounding boxes extracted from Blender.
  • 👥 Mask+Shadow: Combined foreground mask and shadow generated by Blender.
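How the bounding boxes are pulled from Blender is not detailed here; one common approach is to project each actor's mesh vertices into the camera and take the extremes. The sketch below is a hedged example of that idea (the helper name `bbox_2d` and the object name "Human" are assumptions), not necessarily the extraction used in UAVTwin.

```python
import bpy
from bpy_extras.object_utils import world_to_camera_view

def bbox_2d(scene, cam, obj):
    """Project an object's mesh vertices into the camera and return a pixel-space bbox.

    Illustrative helper (hypothetical name); assumes the object is in front of the camera.
    """
    # Evaluate the object so armature/motion deformations are baked into the vertices.
    depsgraph = bpy.context.evaluated_depsgraph_get()
    mesh = obj.evaluated_get(depsgraph).to_mesh()

    xs, ys = [], []
    for v in mesh.vertices:
        co_world = obj.matrix_world @ v.co
        co_ndc = world_to_camera_view(scene, cam, co_world)  # x, y in [0, 1], z = depth
        if co_ndc.z > 0:                                      # keep points in front of the camera
            xs.append(co_ndc.x)
            ys.append(co_ndc.y)

    res_x = scene.render.resolution_x
    res_y = scene.render.resolution_y
    # Blender's camera-view y axis points up, so flip it for image coordinates.
    x_min, x_max = min(xs) * res_x, max(xs) * res_x
    y_min, y_max = (1 - max(ys)) * res_y, (1 - min(ys)) * res_y
    return x_min, y_min, x_max, y_max

box = bbox_2d(bpy.context.scene, bpy.context.scene.camera, bpy.data.objects["Human"])
print("bbox (pixels):", box)
```

The foreground mask and shadow, by contrast, can be written directly as Blender render passes or as a separate shadow-catcher render, which is consistent with the Mask+Shadow outputs listed above.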