A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose

Abstract

Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset.
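
The back-projection step mentioned above (lifting pixels into the 3D world using a depth map) can be illustrated with a minimal sketch. It assumes a standard pinhole camera model, z-depth measured along the optical axis, and a known camera-to-world transform; the function name `backproject_depth` and its parameters are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift every pixel of a depth map to a 3D point in world space.

    depth:        (H, W) z-depth along the optical axis (assumed convention).
    K:            (3, 3) pinhole intrinsics matrix.
    cam_to_world: (4, 4) rigid camera-to-world transform.
    Returns an (H, W, 3) array of world-space points.
    """
    H, W = depth.shape
    # Pixel grid; u indexes columns, v indexes rows (integer pixel coordinates assumed).
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, (H, W, 3)
    # Camera-space rays with z = 1: K^{-1} [u, v, 1]^T, written for row vectors.
    rays = pix @ np.linalg.inv(K).T
    # Scale each ray by its depth to get camera-space points.
    pts_cam = rays * depth[..., None]
    # Apply the rigid transform: X_world = R X_cam + t.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

# Illustrative inputs only: a 3x4 depth map of constant depth 2.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]          # camera translated in world space
points = backproject_depth(np.full((3, 4), 2.0), K, pose)
```

In a pose-free setting such as the one the abstract describes, both `depth` and `cam_to_world` would be estimated quantities, which is why errors in either one propagate directly into the back-projected points.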
