23 November 2025

Novel View Synthesis: NeRF, Instant NGP, and 3D Gaussian Splatting

The techniques of Neural Radiance Fields (NeRF), Instant Neural Graphics Primitives (Instant NGP), and 3D Gaussian Splatting (3DGS) are all advanced methods for novel view synthesis—creating new, photorealistic images of a 3D scene from arbitrary viewpoints given a set of input images.

Below is a compact table summarizing the key differences:

Feature	NeRF (Neural Radiance Fields)	Instant NGP (Instant Neural Graphics Primitives)	3D Gaussian Splatting (3DGS)
Scene Representation	Implicit (Volumetric Field) – A Multi-Layer Perceptron (MLP) maps 5D coordinates (3D position and 2D viewing direction) to color and volume density.	Implicit/Hybrid – Uses a small MLP combined with a Multi-resolution Hash Encoding (a grid of learned feature vectors) to efficiently encode the scene.	Explicit (Point-Based) – A set of discrete, learnable 3D Gaussians, each with parameters for position, scale, orientation, color (using Spherical Harmonics), and opacity.
Rendering Method	Ray Marching/Volumetric Rendering – Slow, as it requires querying the neural network hundreds of times along each camera ray.	Ray Marching/Volumetric Rendering – Accelerated dramatically: the hash encoding allows a much smaller MLP, so each query is cheap, and an occupancy grid skips empty space, so fewer queries are needed.	Rasterization – Fast, direct projection ("splatting") of 3D Gaussians onto the 2D image plane, using a custom tile-based GPU rasterizer instead of per-ray network queries.
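To make the "Volumetric Rendering" row concrete, here is a minimal sketch of how the samples along one camera ray are composited into a pixel color, using the standard discrete volume rendering sum. The function name `render_ray` and the toy sample values are illustrative assumptions; in NeRF and Instant NGP the densities and colors would come from the network rather than from hard-coded lists.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: volume density sigma_i at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance between consecutive samples
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Toy ray: empty space, then a nearly opaque red surface hiding a blue one.
pixel, remaining = render_ray(
    densities=[0.0, 50.0, 50.0],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

In this toy example the pixel comes out almost pure red: the empty sample contributes nothing, and the blue sample behind the red surface is almost fully occluded. Every sample costs one network query, which is why the number of samples per ray dominates rendering time.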

🚀 Pros & Cons of Each Method:

Method	Pros (Advantages)	Cons (Disadvantages)
NeRF	* High Photorealism: Excellent at capturing complex view-dependent effects (e.g., reflections, specularity). * Compact Storage: The scene is stored as a neural network, which is often a relatively small file. * Continuous: Represents the scene as a continuous field.	* Slow Training: Can take many hours for a single scene. * Slow Rendering: Frame rates are low, making real-time interaction difficult. * Computationally Costly: High demands on memory and processing power during training and rendering.
Instant NGP	* Ultra-Fast Training: Achieves usable results in minutes or even seconds (hundreds to over 1000x faster than vanilla NeRF). * Faster Rendering: Significantly faster than vanilla NeRF, enabling near-real-time viewing (e.g., 10+ FPS). * Maintains NeRF’s Quality: Still produces high-fidelity results.	* Requires NVIDIA GPU: Highly optimized for NVIDIA CUDA/Tensor Cores, limiting hardware compatibility. * May have less detail than NeRF in some complex scenarios.
3D Gaussian Splatting	* Real-Time Rendering: Achieves very high frame rates (e.g., 100+ FPS) for interactive viewing. * Fast Training: Trains a scene in minutes. * High Quality: Offers visual fidelity comparable to or exceeding NeRF.	* Large File Size: Explicit representation with millions of Gaussians can result in very large scene files (e.g., roughly 10x more storage than a NeRF checkpoint). * Less Flexible for certain tasks like topology optimization or complex scene editing. * Initialization Sensitive: Typically requires a good initial point cloud from a Structure-from-Motion (SfM) tool such as COLMAP.

Training / Inference Pipeline & Initialization:

The overall process for all three begins with the same data preparation step:

Input Data: A set of 2D images of a scene.
Camera Pose Estimation: Structure-from-Motion (SfM) software, such as COLMAP, is used to estimate the intrinsic (focal length, principal point, lens distortion) and extrinsic (position, orientation) camera parameters for each image. SfM also generates a sparse 3D point cloud, which 3DGS uses for initialization; NeRF and Instant NGP need only the camera parameters.
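As a rough illustration of what the intrinsic and extrinsic parameters do, the sketch below projects a 3D world point into pixel coordinates with a simple pinhole camera model. It assumes a world-to-camera rotation and translation and ignores lens distortion; the function name `project_point` and all numeric values are made up for the example.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation (3x3 nested lists) and translation map world coordinates to
    camera coordinates; fx, fy, cx, cy are intrinsics in pixels.
    """
    # Extrinsics: x_cam = R @ x_world + t
    cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        return None  # point is behind the camera
    # Intrinsics: perspective divide, then scale and shift into pixels.
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uv = project_point([0.5, -0.25, 2.0], identity, [0.0, 0.0, 0.0],
                   fx=800.0, fy=800.0, cx=320.0, cy=240.0)
```

All three methods rely on this mapping: NeRF and Instant NGP invert it to cast a ray through each pixel, while 3DGS applies it in the forward direction to splat Gaussians onto the image.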

🐣 Initialization

The initial state of the scene representation is a critical factor for both training speed and final quality, and it represents a major difference between the methods.

1. NeRF (Vanilla)

Initialization: The weights of the Multi-Layer Perceptron (MLP) are typically initialized randomly (e.g., using Xavier or Kaiming initialization).
Why Random? NeRF is an implicit function. The network doesn’t rely on a pre-existing geometric structure.
It learns the entire scene’s geometry and appearance from scratch by minimizing the photometric loss.
This is one reason why vanilla NeRF training is so slow.
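For reference, a common way to write the photometric loss is as a squared error over a batch of camera rays. The set $\mathcal{R}$ of rays in the batch is notation introduced here for illustration; $\hat{\mathbf{C}}$ and $\mathbf{C}$ are the rendered and ground-truth pixel colors:

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\lVert \hat{\mathbf{C}}(\mathbf{r}) - \mathbf{C}(\mathbf{r}) \right\rVert_2^2$$

Because the only supervision is this per-pixel color error, geometry emerges indirectly: density has to end up in the right places for the rendered colors to agree across all input views.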

2. Instant NGP

Initialization: The core MLP weights are initialized randomly. The Multi-resolution Hash Encoding feature vectors (the trainable parameters in the hash grid) are also initialized randomly.
Why this way? Instant NGP is still an implicit (or hybrid) volumetric method. The geometry is still learned by the network. The fast training comes from the efficiency of the hash encoding and the optimized ray marching, not from better initial geometry.
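The following is a simplified, CPU-only sketch of a multi-resolution hash encoding lookup at a single level, to show where the randomly initialized feature vectors live. The hash primes and the small uniform initialization range follow the Instant NGP paper as far as I know; the function names, default level settings, and table size are illustrative assumptions, and the real implementation runs as fused CUDA kernels with one table per level.

```python
import math
import random

# Per-dimension primes used by the spatial hash.
PRIMES = (1, 2654435761, 805459861)


def level_resolution(level, n_levels=16, n_min=16, n_max=2048):
    """Grid resolution at a level, growing geometrically from n_min to n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    return math.floor(n_min * growth ** level)


def hash_index(ix, iy, iz, table_size):
    """Map integer grid-vertex coordinates to a slot in the feature table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return (h & 0xFFFFFFFF) % table_size  # mask mimics 32-bit wraparound


def make_table(table_size, n_features=2, seed=0):
    """Feature table with small random initial values."""
    rng = random.Random(seed)
    return [[rng.uniform(-1e-4, 1e-4) for _ in range(n_features)]
            for _ in range(table_size)]


def encode(point, table, level):
    """Trilinearly interpolate the features of the 8 surrounding vertices.

    point is assumed to lie in the unit cube [0, 1]^3.
    """
    res = level_resolution(level)
    scaled = [c * res for c in point]
    base = [math.floor(s) for s in scaled]
    frac = [s - b for s, b in zip(scaled, base)]
    out = [0.0] * len(table[0])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((frac[0] if dx else 1.0 - frac[0])
                          * (frac[1] if dy else 1.0 - frac[1])
                          * (frac[2] if dz else 1.0 - frac[2]))
                slot = hash_index(base[0] + dx, base[1] + dy, base[2] + dz,
                                  len(table))
                for k in range(len(out)):
                    out[k] += weight * table[slot][k]
    return out


table = make_table(2 ** 14)
features = encode((0.3, 0.6, 0.9), table, level=0)
```

In the full method, the features from all levels are concatenated and passed to the small MLP. Since the lookup is a few memory reads plus an interpolation, most of the scene's capacity sits in the table rather than in network weights, which is where the training speedup comes from.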

3. 3D Gaussian Splatting (3DGS)

Initialization: 3DGS is an explicit method and relies on a strong geometric initialization.
- Position: The position of each Gaussian is initialized using the sparse 3D point cloud generated by the initial Structure-from-Motion (SfM) step (e.g., COLMAP).
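- Color/Scale/Opacity: Initial color is typically set to the observed color of the corresponding SfM point in the input images. Initial scale is often set from the distance to the nearest neighboring points, so that Gaussians start larger in sparse regions and smaller in dense ones. Opacity starts at a small constant value.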
Why use SfM points? By starting with a good, although sparse, approximation of the scene’s geometry, the optimization begins much closer to a plausible solution than a randomly initialized network does. It then refines the parameters of the Gaussians (position, size, rotation, color, opacity) and adds new Gaussians where needed to fill in the missing details, which contributes to the rapid training time.
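A minimal sketch of this initialization is shown below. It assumes each SfM point becomes one isotropic Gaussian whose scale is the mean distance to its three nearest neighbors, with an identity rotation and a small constant opacity. The function name, the dictionary layout, and the defaults `k=3` and `initial_opacity=0.1` are illustrative assumptions, and the brute-force neighbor search is only suitable for tiny point sets.

```python
import math


def init_gaussians(points, colors, k=3, initial_opacity=0.1):
    """Build one isotropic Gaussian per SfM point."""
    gaussians = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first (O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scale = sum(nearest) / len(nearest)
        gaussians.append({
            "position": tuple(p),
            "scale": (scale, scale, scale),    # same size along every axis
            "rotation": (1.0, 0.0, 0.0, 0.0),  # identity quaternion (w, x, y, z)
            "color": tuple(colors[i]),
            "opacity": initial_opacity,
        })
    return gaussians


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
colors = [(0.5, 0.5, 0.5)] * 4
gaussians = init_gaussians(points, colors)
```

A practical implementation would use a spatial index or a GPU nearest-neighbor routine instead of the nested loop, and would store scale and opacity in an unconstrained form (for example, log-scale) so that gradient updates cannot make them invalid.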

You can see a comparison of the speed differences between these techniques in this video: 3D Gaussian Splatting vs. NeRF vs. Instant-NGP.

Pipeline Stage	NeRF (Neural Radiance Fields)	Instant NGP (Instant Neural Graphics Primitives)	3D Gaussian Splatting (3DGS)
Scene Representation	MLP (Multi-Layer Perceptron)	Multi-Resolution Hash Grid + Small MLP	Explicit 3D Gaussians (Position, Scale, Rotation, Color, Opacity)
Training (Forward Pass)	Ray Sampling: Ray casting is performed for a batch of pixels from a random input image. Hundreds of points are sampled along each ray.	Optimized Ray Sampling: Uses a multi-scale occupancy grid to efficiently skip sampling in empty space, dramatically reducing the number of points sampled.	3D Gaussian Projection: All 3D Gaussians are projected onto the 2D image plane using a differentiable rasterizer.
Training (Calculation)	Each sampled 3D point $(\mathbf{x}, \mathbf{d})$ is fed through the MLP to predict its color ($\mathbf{c}$) and volume density ($\sigma$).	Each sampled 3D point is first processed by the Hash Encoding to get features, then fed to the small MLP to predict $\mathbf{c}$ and $\sigma$.	The projected 2D Gaussians are sorted by depth and accumulated using alpha blending to form the final rendered image $\hat{\mathbf{C}}$.
Training (Loss & Backprop)	Volumetric Rendering: The predictions $(\mathbf{c}, \sigma)$ along the ray are integrated using volumetric rendering to predict the final pixel color $\hat{\mathbf{C}}$. Loss: The difference between $\hat{\mathbf{C}}$ and the ground-truth pixel color $\mathbf{C}$ (e.g., L2 loss) is backpropagated to update the MLP weights.	Same Volumetric Rendering and Loss as NeRF. The gradient is backpropagated through the MLP and the hash grid feature vectors.	Loss: The difference between the rendered image $\hat{\mathbf{C}}$ and the ground-truth image $\mathbf{C}$ (an L1 term combined with a D-SSIM term in the original 3DGS method) is backpropagated through the rasterizer to update the explicit Gaussian parameters (position, scale, rotation, color, opacity).
Key Optimization	Hierarchical Volume Sampling (HVS): An initial “coarse” network guides a “fine” network to sample more points in relevant areas (where density is high).	Hash Encoding: Replaces the frequency-based positional encoding with a fast, learnable, multi-resolution hash table lookup, which lets the MLP be much smaller. Kernel Fusion: Optimizes network operations for maximum GPU speed.	Adaptive Density Control: During training, small Gaussians in under-reconstructed regions are cloned, large Gaussians in over-reconstructed regions are split into smaller ones, and nearly transparent Gaussians are pruned.
Inference (Rendering)	Slow Volumetric Rendering: The full sampling and integration process must be performed for every pixel in the desired novel view.	Fast Volumetric Rendering: Same principle, but accelerated by the efficient hash encoding and occupancy grid, allowing near real-time rates.	Real-Time Rasterization: The final set of optimized 3D Gaussians is projected and rendered using a fast, GPU-optimized tile-based rasterizer, achieving very high frame rates.
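The adaptive density control step in the table can be sketched as a per-Gaussian decision rule. The sketch assumes the usual heuristic: a large view-space positional gradient signals a poorly reconstructed region, and the Gaussian's size decides between cloning and splitting. The function name and the threshold values are illustrative assumptions, not tuned settings.

```python
def densify_action(grad_norm, max_scale, opacity,
                   grad_threshold=0.0002, scale_threshold=0.01,
                   min_opacity=0.005):
    """Decide what to do with one Gaussian during a densification step.

    grad_norm: average magnitude of the view-space position gradient
    max_scale: largest axis of the Gaussian, relative to the scene extent
    opacity:   current opacity in [0, 1]
    """
    if opacity < min_opacity:
        return "prune"  # nearly transparent, contributes almost nothing
    if grad_norm < grad_threshold:
        return "keep"   # region is already reconstructed well enough
    if max_scale > scale_threshold:
        return "split"  # large Gaussian covering too much: over-reconstruction
    return "clone"      # small Gaussian in an under-covered region


actions = [
    densify_action(grad_norm=0.001, max_scale=0.05, opacity=0.5),
    densify_action(grad_norm=0.001, max_scale=0.001, opacity=0.5),
    densify_action(grad_norm=0.00001, max_scale=0.05, opacity=0.5),
    densify_action(grad_norm=0.001, max_scale=0.05, opacity=0.001),
]
```

Running such a rule periodically during training lets the number of Gaussians grow from the sparse SfM initialization to whatever the scene needs, which is also why the final file size varies so much between scenes.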

Input/Output and Datasets

Input

The primary input for all three methods is a set of multiple 2D images of a scene taken from different viewpoints. This typically includes:

A collection of calibrated RGB images (e.g., from a video or a set of photos).
Camera poses (position and orientation) for each input image. These are usually estimated using traditional computer vision techniques like Structure-from-Motion (SfM), such as COLMAP.

Output

The core output is a rendered image of the 3D scene from a new, arbitrary camera pose, i.e. a novel view.
Additionally, the methods train a model that represents the scene:

NeRF / Instant NGP: A trained neural network checkpoint (for Instant NGP, this also includes the hash grid feature vectors).
3DGS: A file (commonly a .ply file) containing the parameters of the millions of optimized 3D Gaussians.
- Storing these Gaussian parameters takes far more memory, because the representation is explicit. It resembles a point cloud, but each point carries much more than three values: in addition to x, y, z, it stores scale, orientation, color (Spherical Harmonics coefficients), and opacity.
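A back-of-the-envelope estimate shows why the files get large. The sketch below assumes 32-bit floats, a quaternion for orientation, and degree-3 Spherical Harmonics for color (16 basis functions per RGB channel). These are common choices but are stated here as assumptions, and real .ply files may carry extra fields.

```python
def gaussian_floats(sh_degree=3):
    """Number of float parameters stored per Gaussian."""
    position = 3
    scale = 3
    rotation = 4                          # quaternion
    opacity = 1
    sh_coeffs = 3 * (sh_degree + 1) ** 2  # RGB channels x SH basis functions
    return position + scale + rotation + opacity + sh_coeffs


def scene_megabytes(n_gaussians, sh_degree=3, bytes_per_float=4):
    """Approximate uncompressed size of a scene in megabytes."""
    return n_gaussians * gaussian_floats(sh_degree) * bytes_per_float / 1e6


per_gaussian = gaussian_floats()      # 59 floats per Gaussian
size_mb = scene_megabytes(1_000_000)  # 236.0 MB for one million Gaussians
```

Under these assumptions the Spherical Harmonics coefficients account for 48 of the 59 values, so lowering the SH degree or quantizing the coefficients is the most direct way to shrink a scene.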