SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction

Carnegie Mellon University
SparseFusion reconstructs a consistent and realistic 3D neural scene representation from as few as 2 input images with known relative poses, while generating detailed and plausible structures in uncertain or unobserved regions (such as the front of the hydrant, the teddybear's face, the back of the laptop, or the left side of the toybus). We show selected results below and random results here.

Abstract

We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation.

Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternate methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D representation. However, we find that this trade-off between 3D consistency and probabilistic image generation does not need to exist. In fact, we show that geometric consistency and generative inference can be complementary in a mode-seeking behavior.

By distilling a 3D consistent scene representation from a view-conditioned latent diffusion model, we are able to recover a plausible 3D representation whose renderings are both accurate and realistic. We evaluate our approach across 51 categories in the CO3D dataset and show that it outperforms existing methods, in both distortion and perception metrics, for sparse view novel view synthesis.
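To give intuition for the mode-seeking distillation described above, here is a toy 1D sketch (not the paper's code): a scalar "scene parameter" is optimized by ascending the score (gradient of log-density) of a hypothetical two-mode image prior, so it converges to a single plausible mode rather than an implausible blurry average. The Gaussian-mixture prior, its modes, and the learning rate are all illustrative assumptions.

```python
import numpy as np

# Toy illustration of mode-seeking distillation (illustrative only; the
# actual method distills a view-conditioned latent diffusion model).
MODES = np.array([-2.0, 3.0])  # hypothetical modes of a 1D "image prior"
SIGMA = 0.5                    # component standard deviation (assumed)

def score(x):
    """Gradient of the log-density of the Gaussian mixture at x."""
    w = np.exp(-0.5 * ((x - MODES) / SIGMA) ** 2)
    w /= w.sum()
    # Each component pulls x toward its mean, weighted by responsibility.
    return np.sum(w * (MODES - x)) / SIGMA ** 2

theta = 1.0  # initial scene parameter; its "rendering" is theta itself
lr = 0.01
for _ in range(2000):
    theta += lr * score(theta)  # ascend the prior's log-density

# theta converges to the nearest mode (~3.0), not the mixture mean (0.5):
# averaging over modes would yield an implausible in-between result.
print(theta)
```

The key point mirrored here is that score ascent commits to one plausible explanation; in SparseFusion the analogous objective pushes the 3D representation's renderings toward modes of the diffusion model's conditional image distribution, yielding renderings that are both accurate and realistic.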

Video

Results

Comparison Against Existing Methods

We compare novel view synthesis on 10 categories from CO3D. Please see randomly selected results from all 51 categories here.

Varying Number of Input Views

We compare novel view synthesis with varying numbers of input views. Please see randomly selected results from 10 categories here.

Related Links

There are many exciting related and concurrent works; here are some of them.

DreamFusion, Magic3D, and SJC lift 2D to 3D with large-scale diffusion models, enabling text-to-3D generation.

3DiM applies 2D diffusion models for multi-view consistent novel view synthesis.

NFD trains a triplane diffusion model to directly generate a 3D neural field.

BibTeX

@misc{zhou2022sparsefusion,
  title={SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction}, 
  author={Zhizhuo Zhou and Shubham Tulsiani},
  year={2022},
  eprint={2212.00792},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgements

We thank Naveen Venkat, Mayank Agarwal, Jeff Tan, Paritosh Mittal, Yen-Chi Cheng, and Nikolaos Gkanatsios for helpful discussions and feedback. We also thank David Novotny and Jonáš Kulhánek for sharing outputs of their work and for helpful correspondence. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Nos. DGE1745016 and DGE2140739.