January 24, 2026

Attention is All You Need: How Transformers Replaced the Sequence with the Relationship

Good morning! It is January 24th, and we’ve reached the “pivot point” of modern AI.
Today is Day 04, and we are tackling the Transformer. If CNNs (Day 3) were inspired by the visual cortex, Transformers are inspired by the way we understand context. This is the architecture that powers GPT-4, Gemini, and almost every “Foundation Model” we will discuss next week.

📚 Day 04: Sequence & Transformers

The Reading: The Illustrated Transformer by Jay Alammar.
The Core Concept: Self-Attention. Moving away from processing data in order (like a line at a grocery store) to processing everything at once by “attending” to the most relevant parts.

Deep Dive Question:

In the sentence “The animal didn’t cross the street because it was too tired,” how does the model know that “it” refers to the animal and not the street?

As you read, focus on the Query, Key, and Value (Q, K, V) vectors. Can you explain, in plain English, how these three vectors act like a “Search Engine” inside the model to calculate the relationship between words?

⏱️ Your 40-Minute Breakdown

00:00 – 20:00: Read. Don’t let the “Multi-Head” part scare you. Focus strictly on Self-Attention and Positional Encoding (how the model knows word order even though it does not process words one after another).
20:00 – 40:00: Write. Explain why this was a breakthrough.
- Hint: It’s all about Parallelization. Unlike older models (RNNs), Transformers can look at the whole sentence at once, making them incredibly fast to train on massive hardware.

🧠 Coach’s Corner: The “GNN” Connection

The syllabus also mentions Graph Neural Networks (GNNs). If you have an extra 2 minutes, think of it this way: A Transformer can be viewed as a special type of Graph Neural Network on a fully connected graph, where every word is a “node” and the “attention” weights are the edges connecting them. They both care about the relationships between entities rather than just their order.

The timer is on! Ready to dive into the math of “Attention”? I’m here if you want me to break down the “Query/Key/Value” analogy using a library catalog example.

What is “Transformer”?

TODO: Put a transformer movie image & actual transformer hardware. (for FUN!)

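The Transformer is a model that leverages the attention mechanism to speed up training: all words in a sequence can be processed in parallel instead of one after another. (This pays off when you have a large amount of data.)
The Transformer was proposed in the paper Attention Is All You Need (Vaswani et al., 2017).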

It originates from the problem of machine translation, a task in Natural Language Processing (NLP). Put more simply: how does a machine understand human (natural) language?
The paper evaluates the model on English-to-German and English-to-French translation.

In this post, we’ll dive into the following topics, which I personally think are the most important for understanding how the Transformer works.

Attention mechanism.
- Word Embedding
- Scaled Dot-Product Attention
- Multi-Head Self-Attention
- Self-Attention vs. Cross-Attention
- Positional Encoding
Transformer Architecture
- Encoder
- Decoder
- Bridging encoder to decoder
Applications
Extended Research
- Vision Transformer (ViT)
- DEtection TRansformer (DETR)
- Deformable DETR (Deformable Attention)
- DINO
- Flash Attention
- Foundation Models
  - LLM
  - VLM
  - VLA
Applications in Autonomous Driving

Terminology

Throughout the post, we’ll use the following terminology:

Sentence = sequence.
Learnable parameters = weights.

Attention Mechanism

Word Embedding

How to feed an input sentence to the model? Vector Embeddings

You might be wondering: how do we feed an input sentence to the model?
E.g. “I have a dream.”
The answer is: use embedding algorithms!
We transform natural language into floating-point vectors using embedding algorithms. These vectors (or, stacked together, tensors) are called “embeddings”.

Each embedding vector corresponds to a word in the sequence.

TODO: Illustration.

The input to the Transformer model is usually a tensor of shape (B, N, E).

B: Batch size. The number of sentences in a batch input. (e.g. 4)
- We usually pass multiple sentences at a time for faster processing.
- Each sentence is processed independently, so we can leverage parallel computing with the power of modern GPUs.
N: Number of embeddings (words): One word <=> one embedding, so this is the number of words in a sentence. Usually we set this to the maximum sequence length in the dataset and pad shorter sentences up to that length.
E: Embedding dimension (size): the size of an embedding. The original paper sets it to 512. This means: after applying the embedding algorithm, every word is transformed into a vector of size 512.
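To make the (B, N, E) shape concrete, here is a minimal NumPy sketch of an embedding lookup. The toy vocabulary, the `encode` helper, and the small sizes (E = 8, N = 5) are assumptions made up for illustration, and the embedding table is random rather than learned.

```python
import numpy as np

# Hypothetical toy vocabulary; real models use learned tokenizers with far more entries.
vocab = {"<pad>": 0, "i": 1, "have": 2, "a": 3, "dream": 4, "hello": 5, "world": 6}

E = 8   # embedding dimension (512 in the paper; kept small here)
N = 5   # maximum sequence length; shorter sentences are padded

rng = np.random.default_rng(0)
# One row per vocabulary entry. Random here; learned during training in a real model.
embedding_table = rng.normal(size=(len(vocab), E))

def encode(sentence, max_len=N):
    """Map a sentence to a fixed-length list of token ids, padding with <pad>."""
    ids = [vocab[w] for w in sentence.lower().rstrip(".").split()]
    return ids[:max_len] + [vocab["<pad>"]] * (max_len - len(ids))

batch = ["I have a dream.", "Hello world"]
token_ids = np.array([encode(s) for s in batch])   # shape (B, N)
x = embedding_table[token_ids]                      # shape (B, N, E)

print(token_ids.shape, x.shape)  # (2, 5) (2, 5, 8)
```

Here B = 2 because the batch holds two sentences, and the second sentence is padded with three `<pad>` ids so that both rows have length N.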

Scaled Dot-Product Attention

This is the core part of the Transformer. Let’s dive into the math!
Below is the Scaled Dot-Product Attention from the original paper.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Given the input embedding vectors from the input sentence:

Multiply each embedding with Wq, Wk, and Wv matrices to get Q, K, V.
- This step generates Queries, Keys, and Values.
- Wq, Wk, Wv are the learnable parameters the model should learn.
  - a.k.a. Weights of the model.
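Take the dot product of the Queries with the Keys to get the score matrix S = QK^T.
Scale the score matrix S by dividing it by sqrt(dk) to get S’.
- This step keeps the scores in a range where Softmax does not saturate. Without it, large dot products push Softmax into regions with very small gradients, so scaling gives the model more stable gradients during backpropagation.
  - dk is the dimension of each key (and query) vector, not the embedding dimension: 64 in the paper.
  - This is due to “Multi-Head” (will discuss later): the 512-dimensional embedding is split across 8 heads, so 512/8=64.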
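(Optional) Add a mask to prevent queries from cheating.
- A word cannot attend to future words. When you’re speaking, you cannot peek at the future! In practice, the scores for future positions are set to negative infinity so that Softmax gives them zero weight.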
Apply Softmax to S’ to get the attention matrix A.
- The Softmax function converts logits into probabilities: it normalizes each row of A so that its entries sum to 1.
Multiply A by Value matrix to get the final result R.
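The steps above can be written out one equation per step. This is a sketch in the notation of the formula above, with two symbols introduced here as assumptions: X for the stacked input embeddings (one row per word) and M for the optional additive mask, shown in its causal form.

```latex
\begin{aligned}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
S &= Q K^T \\
S' &= \frac{S}{\sqrt{d_k}} \\
A &= \mathrm{softmax}\left(S' + M\right), \qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases} \\
R &= A V
\end{aligned}
```

Softmax is applied row by row, so row i of A holds the weights that word i assigns to every word in the sentence. Without a mask, M is simply zero everywhere.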

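What are the “query”, “key”, and “value” vectors?

TODO: Notes & Videos
TL;DR: They are abstractions that are useful for calculating attention scores. Think of a search engine: the query is what a word is looking for, the key is what each word advertises about itself, and the value is the content that is actually returned once a query matches a key.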
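One way to picture queries, keys, and values is as a “soft” dictionary lookup. The sketch below is an illustration under made-up numbers, not how a trained model stores words: the `catalog` entries, the 2-D key vectors, and the scalar values are all hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hard lookup: the query must match a key exactly, and exactly one value comes back.
catalog = {"cats": 1.0, "dogs": 2.0, "cars": 3.0}
print(catalog["dogs"])  # 2.0

# Soft lookup: the query and the keys are vectors. Every value contributes,
# weighted by how similar its key is to the query.
keys = np.array([[1.0, 0.0],    # "cats"
                 [0.9, 0.1],    # "dogs" (close to "cats")
                 [0.0, 1.0]])   # "cars"
values = np.array([1.0, 2.0, 3.0])
query = np.array([1.0, 0.0])    # a query that "looks like" an animal

scores = keys @ query           # similarity of the query to each key
weights = softmax(scores)       # non-negative and summing to 1
result = weights @ values       # weighted average of the values

print(weights.round(3), round(float(result), 3))
```

The animal-like keys receive most of the weight, so the result is pulled toward their values. Attention does the same thing with learned vectors for every word in the sentence.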

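# Example of multi-head attention: a minimal NumPy sketch (random weights stand in for learned ones)
import numpy as np

def softmax(x, axis=-1):
  # Subtract the max for numerical stability; each slice along `axis` sums to 1.
  e = np.exp(x - x.max(axis=axis, keepdims=True))
  return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
  # Q, K, V: (..., N, d_k). mask: boolean array, True where attention is allowed.
  S = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
  if mask is not None:
    S = np.where(mask, S, -1e9)  # masked positions get ~zero weight after softmax
  A = softmax(S)
  return A @ V

class MultiHeadAttention:
  def __init__(self, embed_dim=512, num_heads=8, seed=0):
    self.h, self.d_k = num_heads, embed_dim // num_heads
    rng = np.random.default_rng(seed)
    # Wq, Wk, Wv and the output projection Wo, each of shape (E, E).
    self.Wq, self.Wk, self.Wv, self.Wo = (
      rng.normal(0, embed_dim ** -0.5, (embed_dim, embed_dim)) for _ in range(4))

  def split_heads(self, x):
    B, N, _ = x.shape  # (B, N, E) -> (B, h, N, d_k)
    return x.reshape(B, N, self.h, self.d_k).transpose(0, 2, 1, 3)

  def forward(self, x, mask=None):
    # x: (B, N, E) input embeddings. Self-attention: Q, K, V all come from x.
    Q, K, V = (self.split_heads(x @ W) for W in (self.Wq, self.Wk, self.Wv))
    heads = scaled_dot_product_attention(Q, K, V, mask)
    return heads.transpose(0, 2, 1, 3).reshape(x.shape) @ self.Wo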

Multi-Head Attention

Why Multi-Head? Two benefits:

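It expands the model’s ability to focus on different positions.
- With a single head, each word gets only one set of attention weights over the sentence. For a very long sentence (e.g. 100 words), the attention matrix is 100×100, and the resulting vector for a word might be dominated by the word itself. Extra heads give the word more chances to attend elsewhere.
It gives the attention layer multiple “representation subspaces”.
- Taking 8 heads as the example, we end up with 8 sets of weights, each randomly initialized and then learned during training. Each set of (Wq, Wk, Wv) projects the input embeddings into a different representation subspace. The outputs of the 8 heads are concatenated and passed through one more learned projection.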
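The “8 sets of weights” can also be seen as column blocks of one larger matrix, which is how the projection is usually computed in one step. The sketch below checks that equivalence for the query projection only. The names `Wq_heads` and `Q_all` are assumptions for illustration, and the weights are random stand-ins for learned ones.

```python
import numpy as np

E, H = 512, 8
d_k = E // H                      # 64 dimensions per head
N = 10                            # words in the sentence

rng = np.random.default_rng(0)
x = rng.normal(size=(N, E))       # stand-in for one sentence of embeddings

# 8 separate query projections, one per head, each of shape (E, d_k).
Wq_heads = [rng.normal(scale=E ** -0.5, size=(E, d_k)) for _ in range(H)]

# Each head sees the same input through its own projection.
Q_heads = [x @ W for W in Wq_heads]           # 8 arrays of shape (N, d_k)

# Equivalent: stack the 8 matrices side by side into one (E, E) matrix,
# multiply once, then slice the result into heads.
Wq = np.concatenate(Wq_heads, axis=1)         # (E, H * d_k) = (512, 512)
Q_all = (x @ Wq).reshape(N, H, d_k)           # (N, H, d_k)

for h in range(H):
    assert np.allclose(Q_all[:, h, :], Q_heads[h])

print(Q_heads[0].shape, Wq.shape, Q_all.shape)  # (10, 64) (512, 512) (10, 8, 64)
```

The same trick applies to Wk and Wv, so using 8 heads of size 64 costs about the same as one head of size 512.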

Great Visualization:

Self-Attention vs. Cross-Attention

What is the difference between Self-Attention and Cross-Attention?

In one sentence: where the Queries, Keys, and Values come from.

Self-Attention: All of Q, K, V come from the same input sequence (or same input embedding).
Cross-Attention: Q comes from one input embedding, while K and V come from another. In the Transformer, Q comes from the decoder and K, V come from the encoder output.
- That’s why we call it “cross”! It’s trying to find the relationships a“cross” different inputs!
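The difference shows up directly in the shape of the attention matrix. The sketch below is a minimal single-head example with assumed sizes (a 7-word source, a 4-word target, d = 16) and random weights. For brevity it reuses the same Wq, Wk, Wv for both calls, whereas a real model has separate weights for each attention layer.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs; returns (output, weights)."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    S = S - S.max(axis=-1, keepdims=True)          # numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)
    return A @ V, A

d = 16
rng = np.random.default_rng(0)
# Random stand-ins for learned weights (shared here only to keep the sketch short).
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

source = rng.normal(size=(7, d))   # e.g. a 7-word sentence in the source language
target = rng.normal(size=(4, d))   # e.g. 4 words produced so far in the target language

# Self-attention: Q, K, V all come from the same sequence.
self_out, self_A = attention(source @ Wq, source @ Wk, source @ Wv)

# Cross-attention: Q comes from one sequence, K and V from the other.
cross_out, cross_A = attention(target @ Wq, source @ Wk, source @ Wv)

print(self_A.shape, cross_A.shape, cross_out.shape)  # (7, 7) (4, 7) (4, 16)
```

Self-attention gives a square matrix (every source word against every source word), while cross-attention gives one row per target word and one column per source word.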

Positional Encoding: Representing The Order of The Sequence

What about the “order” of a word in a sentence? How do we capture this information in the model?

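We use “Positional Encoding” to capture this information: a vector that depends only on the position is added to each word embedding, so the same word at two different positions produces two different inputs. The original paper uses fixed sine and cosine functions of different frequencies for these vectors.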
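As a rough illustration, here is a minimal NumPy sketch of the sinusoidal encoding defined in the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and the small sizes are assumptions for the example, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) array; d_model is assumed even."""
    pos = np.arange(num_positions)[:, None]            # (N, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # (N, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

N, E = 6, 8
pe = sinusoidal_positional_encoding(N, E)

rng = np.random.default_rng(0)
word = rng.normal(size=E)          # stand-in embedding for one word
x = np.tile(word, (N, 1)) + pe     # the same word placed at 6 different positions

print(pe[0])  # [0. 1. 0. 1. 0. 1. 0. 1.]
```

Every row of `x` starts from the same word embedding, yet the rows differ because each position adds its own pattern. That difference is what lets attention tell positions apart.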

Reference:

https://jalammar.github.io/illustrated-transformer/