January 23, 2026

Through the Lens of Math: How CNNs Cracked the Code of Computer Vision


Good morning! It is January 23rd, and we are on Day 03 of your sprint.

Yesterday, you mastered the “Global” brain (MLPs).
Today, we get specific. We are looking at Computer Vision and the architecture that saved Deep Learning from the “curse of dimensionality”: Convolutional Neural Networks (CNNs).


📚 Day 03: Computer Vision (CNNs)

The Reading: A Comprehensive Guide to CNNs (the ELI5 Way).
The Core Concept: Spatial Invariance.
A cat is a cat whether it’s in the top-left or bottom-right corner of a photo. CNNs are built to “slide” over images to find patterns regardless of location.
- This translation invariance is part of a CNN's inductive bias!

What is Inductive Bias?

Inductive bias is the set of assumptions a model architecture builds in about a task, which lets it generalize beyond the examples it was trained on. Refer to the link for more information.

TODO: Example (Linear layer, MLP, CNN, Transformer)

Deep Dive Question:

In a standard MLP (Day 2), if you move an object in an image by just one pixel, the model sees it as a completely different input because the “weights” are tied to specific pixel locations.
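To make this concrete, here is a minimal sketch under toy assumptions: a 5×5 image containing a one-pixel-wide vertical line, and a single hypothetical neuron whose per-pixel weights happen to match that line at its original position. The image size and the weights are made up for illustration.

```python
import numpy as np

# Toy 5x5 "image": a one-pixel-wide vertical line in column 2.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# The same line moved one pixel to the right (now in column 3).
shifted = np.roll(image, shift=1, axis=1)

# Hypothetical single MLP neuron: one weight per pixel position,
# tuned here to respond to the line at its original location.
weights = image.flatten()

response_original = float(weights @ image.flatten())
response_shifted = float(weights @ shifted.flatten())

print(response_original)  # 5.0
print(response_shifted)   # 0.0
```

The neuron fires strongly on the original image and not at all on the shifted one, even though both show the same line. Its weights are tied to pixel positions, not to the pattern.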

As you read, focus on this:

How does the “Kernel” (or Filter) solve this?
Explain the concept of “Parameter Sharing”: why is it more efficient to reuse a small 3×3 filter across an entire image than to have a unique weight for every single pixel?

⏱️ Your 40-Minute Breakdown

00:00 – 20:00: Read. Focus on the three main layers:
- Convolution (Feature extraction)
- Pooling (Downsampling/Summarizing)
- Fully Connected (The final decision).
20:00 – 40:00: Write. Explain the “Hierarchy of Features.”
- Layer 1: Detects lines/edges.
- Layer 2: Detects textures/circles.
- Layer 3: Detects eyes/ears/noses.

Coach’s Tip: In your post, try to use the “Flashlight” analogy.

A CNN doesn’t look at the whole room at once; it shines a small flashlight (the kernel) over every inch of the wall to find the details, then summarizes what it saw.

Notes:

A Convolutional Neural Network (CNN) is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image.
A digital image is a binary representation of visual data.
- It consists of pixels arranged in a grid, where each pixel value denotes how bright (intensity) and what color (RGB) that pixel should be.
By using a CNN, we can give computers sight!
- Like the human brain, each neuron works in its own receptive field.
- Each neuron responds to stimuli only in its restricted region.
- CNN layers are arranged such that neurons detect simpler patterns (lines, curves) first and more complex patterns (faces, objects) later.
A CNN typically has three types of layers:
- A convolutional layer.
- A pooling layer.
- A fully connected layer.

Convolutional Layer

This is the core building block of the CNN.

This layer performs “convolution” between the input features and the kernels. Kernels are the learnable parameters in your model!

The kernel is spatially smaller than the image but extends through its full depth. This means that, if the image is composed of three (RGB) channels, the kernel's height and width will be small, but its depth spans all three channels.

The kernel slides across the height and width of the image and produces the activation map.
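The sliding step can be sketched in plain NumPy. This is an illustrative loop, not how a real library implements it. The function name `conv2d_single_kernel` and the 6×6×3 input are assumptions, there is no padding, and (like most deep learning libraries) it computes a cross-correlation, meaning the kernel is not flipped.

```python
import numpy as np

def conv2d_single_kernel(image, kernel, stride=1):
    """Slide one kernel over an image (no padding) and return its activation map.

    image:  array of shape (H, W, D)
    kernel: array of shape (F, F, D), spanning every input channel
    """
    H, W, D = image.shape
    F = kernel.shape[0]
    assert kernel.shape == (F, F, D), "kernel depth must match image depth"
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # The patch the "flashlight" is currently shining on.
            patch = image[r:r + F, c:c + F, :]
            # Element-wise multiply and sum over height, width, and depth.
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))   # toy RGB image
kernel = rng.random((3, 3, 3))  # one 3x3 kernel covering all 3 channels
activation_map = conv2d_single_kernel(image, kernel)
print(activation_map.shape)  # (4, 4)
```

One kernel produces one 2D activation map. Stacking the maps from several kernels gives the depth of the layer's output.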

Input size: W x W x D
Dout: Number of kernels
F: Kernel size (F x F)
S: Stride (how far the kernel moves at each step)
Wout: Output size (we get Dout activation maps, each Wout x Wout)
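These quantities are tied together by the usual output-size formula, Wout = (W - F + 2P) / S + 1. The padding P is not listed in the notes above. It is an assumption added here, and it defaults to 0 in the sketch. The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1, rounded down."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 3))            # 30
print(conv_output_size(32, 3, s=1, p=1))  # 32
print(conv_output_size(32, 5, s=2))       # 14
```

For example, a 3×3 kernel with stride 1 and no padding shrinks a 32×32 input to 30×30, while one pixel of padding keeps it at 32×32.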

Convolution leverages three important ideas that motivated computer vision researchers:
1. Sparse interaction
2. Parameter sharing
3. Equivariant representation

Sparse Interaction

A trivial neural network (e.g. a linear layer) has every neuron interact with every input pixel. However, a useful pattern might occupy only tens or hundreds of pixels.

The idea of using a kernel much smaller than the input image is to capture the locality of patterns: a small kernel can still detect meaningful information, such as edges.
It also results in two main benefits:
1. Memory efficiency: we need to store fewer parameters (kernel size * number of kernels).
2. Improved statistical efficiency of the model. (What does statistical efficiency mean?)
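A quick back-of-the-envelope count shows the size of the memory saving. The numbers are assumptions chosen for illustration: a 224×224 RGB input, 64 output units for the dense layer, and 64 kernels of size 3×3 for the convolution.

```python
def dense_params(height, width, channels, units):
    # One weight per input value per unit, plus one bias per unit.
    return height * width * channels * units + units

def conv_params(kernel_size, channels, num_kernels):
    # Each kernel has kernel_size * kernel_size * channels weights plus one bias.
    return (kernel_size * kernel_size * channels + 1) * num_kernels

dense = dense_params(224, 224, 3, 64)
conv = conv_params(3, 3, 64)
print(dense)  # 9633856
print(conv)   # 1792
```

The convolutional layer's parameter count does not depend on the image size at all, only on the kernel size, input depth, and number of kernels.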

Parameter Sharing

In a traditional neural network, each element of the weight matrix is used once and never revisited, because the input is flattened to one dimension and every weight is tied to a single input position.
In contrast, convolution layers share parameters: the same kernel slides across the entire input image, so the same weights are applied at multiple locations of the input.

This leads to the property of equivariance to translation: if we shift the input, the output shifts in the same way.
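Equivariance can be checked numerically. The sketch below makes some simplifying assumptions: a single-channel 8×8 image with a small blob kept away from the borders, a random 3×3 kernel, and a hypothetical helper `correlate2d_valid` that slides the kernel with stride 1 and no padding.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide a square kernel over a 2D image with stride 1 and no padding."""
    F = kernel.shape[0]
    out_h = image.shape[0] - F + 1
    out_w = image.shape[1] - F + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + F, j:j + F] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.random((3, 3))

# A small blob on an otherwise empty image, away from the borders.
image = np.zeros((8, 8))
image[2:4, 2:4] = rng.random((2, 2))

# Move the blob one pixel to the right.
shifted_image = np.roll(image, shift=1, axis=1)

out = correlate2d_valid(image, kernel)
out_shifted = correlate2d_valid(shifted_image, kernel)

# Shifting the input by one pixel shifts the activation map by one pixel.
is_equivariant = np.allclose(out_shifted, np.roll(out, shift=1, axis=1))
print(is_equivariant)  # True
```

The activation map is not identical after the shift. It moves along with the input, which is what distinguishes equivariance from invariance.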

Pooling Layer

Fully Connected Layer

Non-Linearity in layers

In the previous post, we’ve discussed the importance of introducing “non-linearity” in neural networks – this is where the magic happens!

Goro Yeh