Why “Deep & Narrow” Beats “Shallow & Wide” Neural Networks?

Why Deep Neural Networks? Imag…

One Deep backbone beats multiple shallow networks.

Why Deep Neural Networks?

Imagine you are trying to teach a self-driving car how to recognize a pedestrian.

If you build a neural network that is incredibly wide—meaning it has just one single, massive hidden layer with millions of neurons sitting side-by-side—the car struggles. It might memorize what a pedestrian looks like in an exact lighting condition, but give it a rainy night in downtown Taipei, and it panics.

However, if you take those exact same millions of neurons and stack them on top of each other—creating a model that is Deep and Narrow rather than Shallow and Wide—the car suddenly learns to see the world with incredible, robust abstraction.

Why does depth beat width? Why do we stack layers instead of just making them wider?

To answer that question, we have to unpack the inner machinery of how neural networks actually think, learn, and optimize. Grab a cup of coffee. By the end of this guide, you’ll understand the math, the mechanics, and the ultimate reason why deep learning scaled the way it did.

🗺️ Table of Contents

  1. Activation Functions: The Magic of Non-Linearity
  2. The Learning Engine: Loss Functions & Backpropagation
  3. The Navigation Engine: Navigating the Optimizer Landscape
  4. The Grand Finale: Why Deep & Narrow Beats Shallow & Wide
  5. The Robotics Connection: Depth in Autonomous Driving

1. Activation Functions: The Magic of Non-Linearity

Before we can talk about deep networks, we have to understand what a single layer actually does. Mathematically, a layer in a neural network performs a simple linear transformation:

z = Wx + b

Where x is your input (like raw camera pixels),  W represents the weights (the network’s knobs and dials), and b is the bias.

Here is the ultimate trap of linear algebra: If you stack ten linear layers back-to-back, the entire combination is still just a single linear transformation.

Mathematically, if y = W1(W2x), you can just multiply W1 x W2 into a single matrix Wnew.
No matter how “deep” you make a purely linear network, it has the exact same mathematical capacity as a single-layer network. It can only draw straight lines. It can never learn complex structures like the curve of a steering wheel or the silhouette of a human being.

On the left, you can see how a purely linear model (representing a network without activation functions) struggles to separate the interlocking data points with a straight line. On the right, a non-linear model (representing a neural network with activation functions) gracefully curves to perfectly classify the same complex dataset.

If you want to separate them into two groups with a straight line, you’d struggle to death. However, if you can draw non-linear lines (e.g. a horseshoe or parabola) you can classify points into 2 categories with very high accuracy!

To break this curse, we introduce the Activation Function.

Input (x) ──> [ Wx + b ] ──> [ Activation Function (σ) ] ──> Output (a)

An activation function is a non-linear mathematical filter wrapped around the output of a layer:

a = σ(z),
where z =Wx +b 

x is the input, (W, b) is the weights and biases of the layer (what the model learns). z is the output of a layer before applying the activation function. σ() is the activation function, and a is the final output of this layer.

This tiny tweak changes everything. It allows the network to bend, warp, and curve its decision boundaries to fit the messy, chaotic reality of real-world data, as shown in the example above.

The Evolution of Activation Functions

Over the years, the way we introduce non-linearity has shifted drastically. Let’s look at the three most famous ones in modern history:

A. Sigmoid

The Sigmoid function squashes any input value into a strict range between 0 and 1.

σ(z)=11+ez\sigma(z) = \frac{1}{1 + e^{-z}}

B. Tanh (Hyperbolic Tangent)

Tanh scales inputs into a range between -1 and 1.

tanh(z)=(ezez)(ez+ez)\tanh(z) = \frac{(e^z – e^{-z})}{(e^z + e^{-z})}

C. ReLU (Rectified Linear Unit) — The Modern Standard

ReLU is incredibly simple, yet it completely revolutionized deep learning.

f(z)=max(0,z)f(z) = max(0, z)

If the input is negative, the output is 0. If the input is positive, the output is exactly the input.

ReLU unlocks the Deep Neural Networks!

⚠️ The Hidden Cons of standard ReLU

While ReLU rescued deep learning from vanishing gradients, it introduced a new, notorious problem: The Dying ReLU Problem (also known as “Dead Neurons”).

Look at the left side of the ReLU graph where z < 0. The output is a flat line at 0, meaning its derivative (slope) is exactly zero.

If a neuron gets a large negative input during training, it will output 0. During backpropagation, when the optimizer multiplies gradients across the network, that 0 slope acts like a brick wall. No gradient can flow backward through it.

🛠️ The Fix: Modern Variants of ReLU

To prevent neurons from dying, engineers modified the negative side of the ReLU curve so it never completely drops to a flat zero.

1. Leaky ReLU

Instead of setting negative values to zero, Leaky ReLU multiplies negative inputs by a tiny, constant slope (usually = 0.01).

f(z)=(αz,z)f(z) = (\alpha z, z)

Why it matters: Because the slope on the negative side is non-zero ( ), a small gradient can always flow backward. Neurons can recover from bad initialization or aggressive updates and start learning again.

2. Parametric ReLU (PReLU)

PReLU takes Leaky ReLU a step further. Instead of hardcoding to a fixed number like 0.01, it treats as a learnable parameter.

3. ELU (Exponential Linear Unit)

ELU smooths out the sharp corner of ReLU at 0 using an exponential function for negative values.

4. GELU (Gaussian Error Linear Unit) — The Transformer Standard

GELU is the state-of-the-art activation function used in modern LLMs (like GPT architectures) and Vision Transformers (ViT). It weights inputs by their value according to a Gaussian cumulative distribution function.

A close approximation of its formula is:

f(z)0.5z(1+tanh(2π(z+0.044715z3)˙))f(z) \approx 0.5z (1 + \tanh(\sqrt \frac{2}{\pi} \dot{ (z + 0.044715 z^3 )} ))

Softmax:

The purpose of Softmax is to convert logits into probabilities. Logits are the outputs from your neural network layers. Probability sums up to one. Softmax is usually used at the final layer of a classification model, so we can classify an input vector into one class – by picking the argmax over the output vector.

Softmax v.s. Sigmoid:

2. The Learning Engine: Loss Functions & Backpropagation

Now that our network has non-linear layers, how does it actually learn?

A neural network starts its life completely random. We feed it an image of a stop sign, and because its weights (W, b) are randomly initialized, it might guess “green light.” We need a mechanism to quantify exactly how wrong the network is, and a way to adjust the weights to fix it.

Step 1: The Loss Function (The Scorecard)

The loss function is a mathematical formula that takes the network’s incorrect guess and compares it to the ground truth. It outputs a single number representing the “penalty score” of the model.

The choice of loss functions depends on the problem you try to solve. Here are some typical loss functions for Regression and Classification problems.

A. Regression Losses (Predicting Numbers)

When your robot needs to predict a continuous value—like the exact steering angle of a car or the joint position of a robotic arm—we use regression losses.

The figure below illustrates the differences between MAE(L1), MSE(L2),  Smooth L1, and Huber loss. 

Here’s a comparison between MAE and MSE losses:

FeatureMean Absolute Error (MAE)Mean Squared Error (MSE)
Formula1n i=1n y -y1n(y -y)2
Error PenaltyLinearQuadratic (punishes large errors heavily)
Handling of OutliersIgnores/RobustMagnifies/Sensitive
Mathematical TargetMedianMean
Gradient BehaviorConstant step size; non-differentiable at 0Variable step size; smoothly goes to 0

B. Classification Losses (Predicting Categories)

When your robot needs to identify what an object is (e.g., Pedestrian vs. Car vs. Tree), we transition to classification losses.

Source: https://superlinked.com/glossary/loss-function

🚀 Advanced Loss Functions in the Wild

As you move past the fundamentals into modern robotics and AI, you will encounter highly specialized loss functions designed for complex tasks. Here are some examples:

1. Chamfer Distance & Hungarian Matching Loss (For Online Mapping)

In modern autonomous driving, cars generate high-definition vector maps of their surroundings in real-time. The model predicts lines for road boundaries, crosswalks, and lanes, and must match them to the actual ground-truth layout.

To do this, the network calculates the Chamfer Distance—which measures the geometric distance between the shapes of the predicted lines and the true lines. It then uses Hungarian Matching Loss to find the absolute best one-to-one pairing that yields the absolute minimum total distance. If the model mistakenly matches a predicted road boundary with a ground-truth driveway, this loss hits it with a massive penalty. (We’ll dive deep into this fascinating topic in a future post! 😉)

2. Cosine Similarity Loss (For Contrastive Learning like CLIP)

This loss powers the foundation models ($VLM$) that allow modern robots to understand open-ended text commands. Models like OpenAI’s CLIP use Cosine Similarity Loss to train a visual encoder and a text encoder simultaneously.

Instead of asking “Is this a car?”, it measures the angle between a text embedding vector (e.g., the phrase “a red sedan”) and an image embedding vector. The loss function forces the weights to update so that matching image-text pairs point in the exact same direction in vector space, minimizing the similarity loss.

If you’re interested in learning more about the mathematics behind each loss function, their pros and cons, and their applications — we’ll dive deep into that in a separate blog post about Loss Functions

Step 2: Backpropagation (The Chain Rule)

Once we have our loss score, we need to send this error feedback backward through the network to calculate how much each individual weight contributed to the mistake. This process is called Backpropagation.

Think of this similar to the Closed-loop feedback control in a control system – we compute the error (mostly defined as the difference between the target and the output (error = Target – Output) and feed the error “back” to the system – with some weights and mechanism (so called P.I.D. control) so that the system can update it’s activation to minimize the error.

The differences between a control system and a neural network is: 

Backpropagation is powered entirely by the Chain Rule from calculus. 

Think of a neural network as a long chain of dominoes:

Weights → Layer 1 → Layer 2 → Loss

If you want to know how a change in the very first weight affects the final Loss, the chain rule tells us we can simply multiply the local slopes all the way back:

LossWeight=LossLayer2×Layer2Layer1×Layer1Weight\frac{\partial{Loss}}{\partial{Weight}} = \frac{\partial{Loss}}{\partial{Layer2}}\times \frac{\partial{Layer2}}{\partial{Layer1}} \times \frac{\partial{Layer1}}{\partial{Weight}}
          Forward Pass:  Input ──────> Layer 1 ──────> Layer 2 ──────> Loss (Error)

                                                                            │

          Backward Pass: Weight <─── dL/dW1 <─── dL/dLayer2   <─────────────┘

Remember the Vanishing Gradient Problem we mentioned with Sigmoid? Look at the chain rule above. If your activation functions have flat-lined, their local derivatives LayerPrevious are tiny numbers close to zero (like 0.01). If you multiply many tiny numbers together across 20 layers:

0.01 x 0.01 x 0.01 x 0.01 = 0.00000001

The gradient vanishes to zero by the time it reaches the early layers. The earliest layers stop learning entirely, breaking the network’s ability to extract basic structural features. This is why replacing Sigmoid with ReLU was so profoundly impactful.

白話文說:由於 Chain rule 會把下一層的 gradients 乘以上一層,如果每層的 gradients 都很小,從 Loss 端一路往回推到最前面幾層(假設你的 network 有 100層)時的影響會變得微乎其微。這讓 model 難以更新最前面幾層的 weights,導致學習成效不佳。

用人類來比喻: Sigmoid/Tanh 這種極端的函式,

3. The Navigation Engine: Navigating the Optimizer Landscape

Once backpropagation tells us the direction and magnitude we need to change our weights, the Optimizer steps in to physically update those weights.

Think of optimization as hiking down a dense, foggy mountain (the loss landscape) trying to find the lowest valley (the minimum error).

Loss

 │     \

 │       \  ⛰️ (High Error)

 │        \

 │          └───> 📉 Gradient Descent (Stepping down the mountain)

 │               \

 └───────────────┴─────────────> Minimum Error (The Valley Floor)

The Gradient Descent Family

At its core, Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent. In machine learning, it is the engine used to train models by minimizing the loss function (the error between the model’s predictions and the actual data).

Imagine you are blindfolded on top of a foggy mountain, and your goal is to find the very bottom of the valley. Here is how Gradient Descent solves that problem:

The Core Steps

This cartoon visualizes the foggy mountain analogy we discussed. Here, the blindfolded hiker represents the algorithm. He cannot see the bottom, but by feeling the slope (the Gradient) beneath his feet, he knows which way to take a step. The distance he moves each time is his Learning Rate (Step Size). By iteratively taking these steps downward, he eventually navigates from the “High Loss” altitude to the lowest point in the valley (the Minimum).

The Mathematical Update Rule

To update the parameters (weights w) of a model, the algorithm subtracts a portion of the gradient from the current weights:

wnew=woldL(w)w_{new} = w_{old} – L(w)

Where:

The core formula for gradient descent is:

wt+1=wtαL(wt) w_{t+1} = w_t – \alpha \nabla L(w_t)
Source: https://www.geeksforgeeks.org/deep-learning/optimization-rule-in-deep-neural-networks/

1. Batch Gradient Descent

2. Stochastic Gradient Descent (SGD)

3. Mini-Batch Gradient Descent (The Sweet Spot)

Advanced Adaptive Optimizers

Standard Mini-Batch SGD can get stuck in flat regions or bounce back and forth across narrow valleys. To solve this, engineers built smarter algorithms: (The problem of getting stuck at local minima)

Here is the comparison table featuring the mathematical update rules and formulas for each of these optimizers.

Core Setup & Notations

To keep the formulas clean and consistent, we define:

#OptimizerMathematical Update Rule & Key FormulasKey Mechanism / Components
1Batch Gradient Descentwt+1=wtα1Ni=1NLi(wt)w_{t+1} = w_t – \alpha \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(w_t)Computes the average gradient over the entire dataset (N samples) before updating.
2Stochastic Gradient Descent (SGD)wt+1=wtαLi(wt)w_{t+1} = w_t – \alpha \nabla L_i(w_t)Computes the gradient using only one random sample i per step.
3Mini-Batch Gradient Descentwt+1=wtα1Bi=1BLi(wt)w_{t+1} = w_t – \alpha \frac{1}{B} \sum_{i=1}^{B} \nabla L_i(w_t)Computes the average gradient over a small subset/batch of size B.
4SGD with Momentumvt=βvt1+αLi(wt)v_t = \beta v_{t-1} + \alpha \nabla L_i(w_t)
wt+1=wtvtw_{t+1} = w_t – v_t
vt: Velocity/momentum accumulator.
Beta: Momentum friction coefficient (usually 0.9).
5AdagradGt=Gt1+Li(wt)2G_t = G_{t-1} + \nabla L_i(w_t)^2
wt+1=wtαGt+ϵLi(wt)w_{t+1} = w_t – \frac{\alpha}{\sqrt {G_t}+\epsilon } \nabla L_i(w_t)
Gt: Running sum of historical squared gradients.
Scales down learning rate for frequently updated features.
6RMSPropvt=βvt1+(1β)(Li(wt))2v_t = \beta v_{t-1} + (1-\beta) (\nabla L_i(w_t))^2
wt+1=wtαvt+ϵL(wt)w_{t+1} = w_t – \frac{\alpha}{\sqrt{v_t} + \epsilon} \nabla L(w_t)
vt: Exponentially decaying moving average of squared gradients, preventing Gt from growing infinitely.
7AdamMomentum step:
mt=β1mt1+(1β1)L(wt)m_t = \beta_1 m_{t-1} + (1 – \beta_1) \nabla L(w_t)
RMSProp step:
vt=β2vt1+(1β2)(L(wt))2v_t = \beta_2 v_{t-1} + (1 – \beta_2) (\nabla L(w_t))^2
Bias Corrections:
m^t=mt1β1t,v^t=vt1β2t\hat{m}_t = \frac{m_t}{1 – \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 – \beta_2^t}
Update Rule:
wt+1=wtαv^t+ϵm^tw_{t+1} = w_t – \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
mt: First moment (mean gradient).
vt: Second moment (uncentered variance).
mt, vt: Corrects initialization bias toward zero during early steps.
8AdamWUpdate Rule:
wt+1=wtλwtαv^t+ϵm^tw_{t+1} = w_t – \lambda w_t – \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
(Uses identical mt and vt formulas as Adam above)
λ\lambda: Weight decay coefficient.
Decouples L2 regularization from the gradient tracking (mt, vt), applying decay directly to the current weight.

Key Takeaway on Adam vs. AdamW

In standard Adam, if you add an L2 regularization penalty (Lreg = 12 w2) to the loss function, the weight penalty gets mixed right into the gradient terms mt  and vt. This means weights that have large historical gradients get their regularization scaled down, which isn’t what you want.

AdamW mathematically isolates the weight decay ( wt ) by subtracting it directly during the parameter update phase, keeping the decay independent of the historical gradient variances. This is why it performs significantly better for complex, large-scale architectures.

4. Why Deep & Narrow Beats Shallow & Wide

Now that we understand the ingredients—activation functions, backpropagation, and advanced optimizers—we can finally tackle the central question: 

Why do we prefer Deep and Narrow networks over Shallow and Wide ones?

Mathematically, there is a famous theorem called the Universal Approximation Theorem. It proves that a network with just a single hidden layer can approximate absolutely any mathematical function, no matter how complex—if it is wide enough.

So if a shallow network can technically learn anything, why did the AI industry abandon width for depth?

Reason A: Hierarchical Feature Learning (How the Universe is Built)

The physical world is inherently hierarchical. Small things combine to make bigger things.

A Deep Network matches this reality perfectly through its architecture. Each layer extracts information at a higher level of abstraction than the layer before it:

Another way to think of it is: Each layer just does its own job. But, master its skills! 每個 layer 守好自己本分,想辦法做到精熟(非常善於辨識 edges, shapes, objects…etc)

Input –> [ Early Layers: Edges ] –> [ Mid Layers: Shapes ] –> [ Deep Layers: Concepts ]

A Shallow Network cannot do this. It tries to learn complex semantic concepts (like a pedestrian) directly from raw pixels in a single step. It is the educational equivalent of forcing a toddler to read Shakespeare without teaching them the alphabet or words first.

白話文來想:像原子習慣一樣,take baby steps, one step at a time,漸進式學習/成長比一次逼迫你大躍進容易。Aim for 1% better everyday — you’ll improve by 37x after one year.

A super wide NN with a Single layer tries to learn extremely complex concepts in one step. A narrower but deep NN tries to learn in a more systematic way. Like divide-and-conquer in algorithms, this method is much more effective and most importantly — practical. 

Reason B: Exponential Parameter Efficiency

This is where the raw math favors depth. To learn the exact same highly complex nonlinear function, a shallow network requires an exponential number of neurons. A deep network can achieve the exact same expressive capability with a fraction of the parameters.

💡 The Analogy of Origami: Imagine trying to cut out a highly intricate lace snowflake pattern using a pair of scissors on a flat piece of paper. It takes thousands of meticulous cuts (Shallow & Wide).

Now, imagine folding that paper in half, then in half again, and again (Deep & Narrow). With just a few simple cuts through the folded layers, you unfold a deeply intricate, massive pattern. Depth is mathematical folding.

By choosing deep and narrow, we get models that are exponentially faster to compute, require drastically less memory, and generalize to unseen data far better than wide models.

5. The Robotics Connection: Depth in Autonomous Driving

In robotics and autonomous driving, the “Deep & Narrow” paradigm isn’t just a preference—it is a critical requirement for real-world safety.

Consider an autonomous vehicle processing a live camera stream at 30 frames per second. The vehicle must simultaneously track lanes, detect pedestrians, predict vehicle trajectories, and plan paths.

                                    ┌──> Head 1: Object Detection (Cars/Pedestrians)

                                    │

Camera Feed ──> [ Deep Backbone ] ─┼> Head 2: Lane Segmentation ( Boundary/Lanes/Speed Bumps)

                    (Shared)  │

                                    └──> Head 3: Depth Estimation

Because deep networks learn a universal hierarchical representation of the visual world, we can use a single, deep, highly optimized network (called a Backbone) to read the camera pixels and extract core semantic concepts.

Once the deep layers have extracted these abstract concepts, we split the network into multiple small “heads” to perform different tasks:

  1. Perception Head: Uses the abstract features to draw bounding boxes around obstacles.
  2. Segmentation Head: Color-codes every pixel to isolate the drivable road area.
  3. Depth Estimation Head: Gauges how far away objects are based entirely on 2D visual cues.

If we tried to use shallow, wide networks for this, we would need completely separate, massive models for every single task, causing our vehicle’s onboard computer to overheat and fail under the computational weight. Depth gives autonomous vehicles the power to compress real-world context into a structured, highly efficient mental model, allowing them to make life-saving decisions in a fraction of a millisecond.

Summary

Recap on what we’ve discussed in this post:

1. Activation Functions: Stacking raw linear layers can only draw straight lines, so we use non-linear activation functions to let the network bend and warp to fit real-world data. While historical functions caused networks to freeze or die, modern standard variants like GELU keep the gradient pipeline flowing smoothly.

2. Loss Functions & Backpropagation: The loss function acts as a scorecard to measure how wrong a prediction is, while backpropagation uses the calculus chain rule to pass that error score backward. This feedback loop tells every individual weight exactly how much it needs to adjust to correct the model’s mistakes.

3. The Optimizer Landscape: Optimizers physically guide the model weights down the “error mountain,” with Mini-Batch GD serving as the efficient industry baseline. Advanced AdamW stands out as the modern gold standard because it decouples weight decay penalties, preventing highly active weights from bypassing regularization.

4. Why Deep & Narrow Beats Shallow & Wide: Deep networks learn hierarchically—breaking complex concepts down from edges to shapes to objects—which matches how the physical universe is structured. This “mathematical folding” approach allows deep models to be exponentially more memory-efficient and generalize better than massive, single-layer flat networks.

5. The Autonomous Driving Application: Real-world robotics relies on this efficiency for safety. Instead of running three massive, separate models that would overheat the car’s computer, a single deep Backbone processes the camera pixels once. It then passes those core concepts to lightweight Heads that instantly handle lane tracking, obstacle detection, and depth tracking all at the same time.

What’s Next?

Now that we’ve mastered the core foundations of network depth, activations, and optimizers, we are ready to tackle the messy reality of training models in production. In our upcoming deep dives, we will unpack how to diagnose, regularize, and normalize your architectures for peak performance.

Here is what we are covering next:

Comments

發佈留言

發佈留言必須填寫的電子郵件地址不會公開。 必填欄位標示為 *