Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses

Abstract

In this paper, we present a method to reconstruct the world and multipledynamic humans in 3D from a monocular video input. As a key idea, we representboth the world and multiple humans via the recently emerging 3D GaussianSplatting (3D-GS) representation, enabling to conveniently and efficientlycompose and render them together. In particular, we address the scenarios withseverely limited and sparse observations in 3D human reconstruction, a commonchallenge encountered in the real world. To tackle this challenge, we introducea novel approach to optimize the 3D-GS representation in a canonical space byfusing the sparse cues in the common space, where we leverage a pre-trained 2Ddiffusion model to synthesize unseen views while keeping the consistency withthe observed 2D appearances. We demonstrate our method can reconstructhigh-quality animatable 3D humans in various challenging examples, in thepresence of occlusion, image crops, few-shot, and extremely sparseobservations. After reconstruction, our method is capable of not only renderingthe scene in any novel views at arbitrary time instances, but also editing the3D scene by removing individual humans or applying different motions for eachhuman. Through various experiments, we demonstrate the quality and efficiencyof our methods over alternative existing approaches.

Quick Read (beta)

loading the full paper ...