A Very Basic Overview of Neural Radiance Fields (NeRF) | by Ta-Ying Cheng | Jul, 2022

Can they one day replace photos?

Figure 1. NeRF Pipeline. Given a large set of images, NeRF learns to implicitly represent the 3D shape, such that new views can later on be synthesised. Image retrieved from the original NeRF paper by Mildenhall et al.

The deep learning era began with the advances it brought to traditional 2D image-recognition tasks such as classification, detection, and instance segmentation. As these techniques matured, research in deep-learning-based computer vision shifted towards fundamental 3D problems, one of the most notable being synthesising new views of an object and reconstructing its 3D shape from images. Many approaches tackle this as a conventional machine learning problem, where the goal is to learn a system that “inflates” 3D geometry out of images after a finite set of training iterations. Recently, however, a completely new direction, Neural Radiance Fields (NeRF), has been introduced. This article covers the basic concepts of the originally proposed NeRF and looks at several of its extensions in recent years.

The biggest difference between a NeRF model and traditional neural networks for 3D reconstruction is that NeRF is an instance-specific implicit representation of an object.

In simple words, given a set of images capturing the same object from multiple angles, along with their corresponding camera poses, the network learns to represent the 3D object such that new views can be synthesised consistently with the training views.

Figure 2. NeRF Training Overview. Image retrieved from the original NeRF paper by Mildenhall et al.

While learning such an implicit representation seems difficult, Mildenhall et al. showed in the original NeRF paper that a simple Multilayer Perceptron (MLP) has enough capacity to perform this complex task.

Specifically, the input of this fully connected network is a single 5D coordinate (3 values for location and 2 for viewing direction), and the output is the density and colour at that location. In practice, density depends only on the location and not on the viewing direction, so only the location is used to predict the density, while the viewing direction is combined with the location features to predict the colour seen.
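To make the data flow concrete, here is a minimal NumPy sketch of the forward pass described above. It is not the architecture from the paper: the layer sizes, the random untrained weights, the class name `TinyRadianceField`, and the choice to pass the viewing direction as a 3D unit vector rather than two angles are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

class TinyRadianceField:
    """Hypothetical toy model: density from location only, colour from location + view."""

    def __init__(self, hidden=64):
        self.trunk = [dense(3, hidden), dense(hidden, hidden)]   # sees location only
        self.density_head = dense(hidden, 1)
        self.colour_hidden = dense(hidden + 3, hidden // 2)       # features + view direction
        self.colour_head = dense(hidden // 2, 3)

    def __call__(self, xyz, view_dir):
        h = xyz
        for w, b in self.trunk:
            h = relu(h @ w + b)
        w, b = self.density_head
        sigma = relu(h @ w + b)[..., 0]                # non-negative density
        w, b = self.colour_hidden
        c = relu(np.concatenate([h, view_dir], axis=-1) @ w + b)
        w, b = self.colour_head
        rgb = 1.0 / (1.0 + np.exp(-(c @ w + b)))       # sigmoid keeps colour in [0, 1]
        return sigma, rgb

model = TinyRadianceField()
xyz = np.array([[0.1, 0.2, 0.3], [0.5, -0.4, 0.0]])
front = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
side = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
sigma_front, rgb_front = model(xyz, front)
sigma_side, rgb_side = model(xyz, side)
```

Because the viewing direction only enters after the density head, `sigma_front` and `sigma_side` are identical, while the predicted colours are free to change with the view.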

Two implementation techniques help NeRF represent complex scenes better: positional encoding and hierarchical volume sampling.

Positional Encoding

Previous literature has shown that mapping inputs to a higher-dimensional space helps networks learn more complex functions. Positional encoding is a particular encoding function that does exactly that using high-frequency sinusoidal functions. Both the location coordinates and the view directions are passed through this encoding function before being fed into the MLP.
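As a rough sketch, the encoding maps every input coordinate p to sines and cosines at frequencies 2^k · π for k = 0, …, L − 1. The function name `positional_encoding` and its argument names are my own, and the output ordering (all sines, then all cosines, per coordinate) is a convenience of this sketch rather than the interleaved ordering written in the paper. The original paper reports L = 10 for locations and L = 4 for view directions.

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """Encode each coordinate p as sin(2^k * pi * p) and cos(2^k * pi * p), k = 0..num_freqs-1."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi            # shape (L,)
    angles = x[..., None] * freqs                             # shape (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # shape (..., D, 2L)
    return enc.reshape(*x.shape[:-1], -1)                     # shape (..., D * 2L)

locations = np.zeros((5, 3))                                  # five 3D points at the origin
encoded = positional_encoding(locations, num_freqs=10)        # shape (5, 60)
```

A 3D location therefore becomes a 60-dimensional vector when L = 10, which is what the MLP sees instead of the raw coordinates.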

Hierarchical Volume Sampling

Two networks, one coarse and one refined, are optimised jointly when training a NeRF. Specifically, the coarse network is first evaluated at points drawn along each ray with standard sampling. Then, given the outputs of the coarse network, a second set of points is sampled from the more relevant parts of the volume and fed to the refined network, which increases training efficiency.
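The second sampling step can be sketched as inverse transform sampling: treat the per-bin importance weights from the coarse pass as a histogram along the ray and draw new depths from it. This is an illustrative sketch only. The function name `sample_fine_depths`, the small epsilon, and the toy weights below are assumptions, and how the weights are computed from the coarse network's outputs is not shown here.

```python
import numpy as np

def sample_fine_depths(bin_edges, weights, num_samples, rng):
    """Draw depths along one ray with probability proportional to the coarse weights."""
    bin_edges = np.asarray(bin_edges, dtype=np.float64)       # len(weights) + 1 edges
    weights = np.asarray(weights, dtype=np.float64) + 1e-5    # epsilon avoids an all-zero pdf
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])             # same length as bin_edges
    u = rng.random(num_samples)                               # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1           # bin that each u falls into
    idx = np.clip(idx, 0, len(weights) - 1)
    frac = np.clip((u - cdf[idx]) / (cdf[idx + 1] - cdf[idx]), 0.0, 1.0)
    depths = bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
    return np.sort(depths)

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 5)             # four depth bins between near=2 and far=6
coarse_weights = [0.0, 0.0, 1.0, 0.0]        # toy case: the surface lies in the third bin
fine_depths = sample_fine_depths(edges, coarse_weights, num_samples=1000, rng=rng)
```

In this toy case almost all of the new depths land between 4 and 5, so the refined network spends its evaluations where the coarse pass believes the surface is, instead of on empty space.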

Figure 3. NeRF could one day replace photos as the new medium for capturing visual memories.

Photos have been the go-to medium when we want to remember a place we travelled to, a person we loved, or a memory we treasure. With the rise of NeRF, there may be a better medium for this.

If we can eliminate the constraints of training time and the number of images required, NeRF has a much greater capacity for storing visual memories from multiple views. It could become a “3D” photo, where every angle (even ones you didn’t capture) is properly presented to you in high resolution.

The introduction of NeRF is something of a breath of fresh air in the 3D reconstruction domain. “Overfitting” a model to a particular 3D instance is unorthodox, yet it produces impressive novel-view synthesis quality. Nevertheless, the originally proposed architecture has some major drawbacks, including:

  1. It requires a significant number of images of the same object.
  2. The training time is very long.
  3. Camera pose of each image is required.

Numerous works have recently been introduced to tackle all these issues. Below we list a couple that aim to solve each of these problems.

It requires a significant number of images of the same object.

The training time is very long.

Camera pose of each image is required.

Other Interesting NeRF-related Papers

And there you have it: a very simple overview of the original NeRF paper. This new way of representing visual data has opened up enormous potential and inspired a great deal of state-of-the-art research with constant improvements. Perhaps one day our memories will be stored as a combination of reality and imagination.

“Imagination is more important than knowledge. For knowledge is limited, whereas imagination embraces the entire world, stimulating progress, giving birth to evolution” — Albert Einstein

Thank you for making it this far 🙏! I regularly write about different areas of computer vision/deep learning, so join and subscribe if you are interested to know more! Also, this article did not go into any of the mathematics or details of implementations, and the extensions of radiance fields are far beyond the few I have mentioned here. Please read the original paper(s) for detailed explanations.
