27 January 2026

The Reward Loop: How Agents Learn to Navigate the World through Reinforcement

Good morning! It is January 25th, and we are shifting gears from Pattern Recognition (predicting the next pixel or word) to Decision Making.

Welcome to Day 05: Reinforcement Learning (RL). This is the logic behind AlphaGo, self-driving cars, and the RLHF fine-tuning step that makes LLMs helpful.

📚 Day 05: Reinforcement Learning

The Reading: A Thorough Introduction to RL (Paperspace).
The Core Concept: The Agent-Environment Loop. Learning via trial and error to maximize a cumulative reward.
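To make the Agent-Environment Loop concrete, here is a minimal sketch in plain Python. The `CorridorEnv` class, its `reset`/`step` methods, and the two policies are hypothetical names invented for illustration; they are not taken from the Paperspace article or from any RL library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, with the goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at cell 0 and return the starting state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) and return (state, reward, done)."""
        # The agent cannot walk off either end of the corridor.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward only arrives at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent chooses an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    """Pure trial and error: ignore the state and move at random."""
    return random.choice([-1, 1])


def always_right(state):
    """A fixed strategy that happens to be optimal in this corridor."""
    return 1
```

Calling `run_episode(CorridorEnv(), always_right)` walks straight to the goal and collects the single reward. Swapping in `random_policy` shows the trial-and-error side: the agent may or may not reach the goal within `max_steps`.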

The Deep Dive Question: In a Markov Decision Process (MDP), we assume that the “Future is independent of the past, given the present.” This means the current “State” must contain all the information the agent needs to make a decision.
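One common way to write this assumption down, using $S_t$ for the state and $A_t$ for the action at time step $t$ (notation assumed here, not taken from the reading):

$$
P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)
$$

In words: once you know the current state and action, the earlier history adds nothing to the prediction of the next state.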

As you read, focus on this: The Exploration vs. Exploitation Trade-off. If an agent finds a strategy that gives a +10 reward (Exploitation), how do we force it to keep trying new things that might lead to a +100 reward (Exploration)? Why is “Randomness” (Entropy) actually a feature, not a bug, in RL?
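To put a number on "Randomness," one option is the Shannon entropy of the policy's action probabilities. The sketch below is only an illustration: the function name `policy_entropy` is made up, and it assumes the policy is given as a plain list of probabilities over a finite set of actions.

```python
import math


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    # Terms with probability 0 contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in action_probs if p > 0)


deterministic = [1.0, 0.0, 0.0, 0.0]    # always the same action: pure exploitation
uniform = [0.25, 0.25, 0.25, 0.25]      # every action equally likely: maximum exploration

low = policy_entropy(deterministic)     # zero entropy
high = policy_entropy(uniform)          # log(4), the largest value possible for 4 actions
```

A policy with zero entropy never tries anything new, so it can never discover the +100 option. A policy with some entropy still has a chance to.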

⏱️ Your 40-Minute Breakdown

00:00 – 20:00: Read. Focus on the components of an MDP: State, Action, Reward, and Policy ($\pi$). Try to grasp the difference between a Policy (the strategy) and a Value Function (the prediction of future rewards).
20:00 – 40:00: Write. Explain why RL is harder than supervised learning.
- Hint: On Days 1–4, you had a “Label” (the right answer). In RL, you only have a “Reward,” which might be delayed (e.g., you don’t get the reward for a chess move until 20 moves later). This is the Delayed Reward/Credit Assignment Problem.
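One standard way to pass a delayed reward back to earlier moves is the discounted return. The sketch below assumes a discount factor `gamma` between 0 and 1 and a list holding the reward received after each step; both the function name and the default value of `gamma` are illustrative choices, not something the reading prescribes.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step t, where G_t = rewards[t] + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Chess-like episode: 19 moves with no reward, then a win worth 1.0 on move 20.
episode_rewards = [0.0] * 19 + [1.0]
returns = discounted_returns(episode_rewards, gamma=0.9)
```

Here `returns[-1]` is 1.0 while `returns[0]` is `0.9 ** 19`, so the first move still gets a small share of the credit for the win. The discounting does not tell the agent which move actually mattered, and that gap is the heart of the credit assignment problem.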

🧠 Coach’s Corner: The “Credit Assignment” Hook

For your blog post, try to use an analogy. Learning via RL is like training a dog: you don’t show the dog a picture of a “Sit,” you wait for the dog to sit and then give it a treat. The challenge is making sure the dog knows it got the treat for sitting, not for wagging its tail at the same moment.

The clock is running! Do you want me to explain the difference between “Model-Based” and “Model-Free” RL before you start your write-up?

My Reading Notes

The Exploration vs. Exploitation Trade-off

Exploration:

The agent tries different actions, including ones its current policy does not favor, to see whether they lead to a better return, and then updates the policy based on what it observed.

Exploitation:

The agent follows its current policy and picks the actions it already believes will give the maximum “Return” (the cumulative reward over an episode).
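A simple way to see both behaviors side by side is an epsilon-greedy rule: with probability `epsilon` the agent explores a random action, otherwise it exploits its current best estimate. The sketch below is a toy two-armed bandit built around the +10 versus +100 scenario from the reading prompt. The function names, the deterministic rewards, and the step count are all assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-looking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def run_bandit(true_rewards, epsilon, steps=2000, seed=0):
    """Play a bandit with fixed (noise-free) rewards and return (estimates, total reward)."""
    rng = random.Random(seed)
    q_values = [0.0] * len(true_rewards)   # the agent's reward estimate per action
    counts = [0] * len(true_rewards)
    total = 0.0
    for _ in range(steps):
        action = epsilon_greedy(q_values, epsilon, rng)
        reward = true_rewards[action]
        counts[action] += 1
        # Incremental average of the rewards seen so far for this action.
        q_values[action] += (reward - q_values[action]) / counts[action]
        total += reward
    return q_values, total


greedy_q, greedy_total = run_bandit([10.0, 100.0], epsilon=0.0)
explorer_q, explorer_total = run_bandit([10.0, 100.0], epsilon=0.1)
```

With `epsilon=0.0` the agent tries the first action, sees +10, and never looks at the second one, so its estimate for the +100 action stays at zero. With `epsilon=0.1` it occasionally picks at random, finds the +100 action, and ends up with a higher total.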

Goro Yeh 56