29 March 2026

Machine Learning Domain Knowledge (Co-authored with Notion AI)

These are notes that I had Notion AI summarize for me:

Here’s a technical fundamentals post you can use as a baseline “domain knowledge” doc and iterate on. It is intentionally opinionated toward how fundamentals show up in real ML systems, not just textbook definitions.

Fundamentals of Machine Learning (A Practical Technical Primer)

Machine learning is the practice of learning a function \(f_\theta\) that maps inputs \(x\) to outputs \(y\), using data rather than hard-coded rules.

The “learning” is the process of choosing parameters \(\theta\) that minimize a loss function on training data, while still performing well on unseen data.

That’s the entire game.

Prerequisite:

A machine learning model is simply a function of its parameters. In mathematical notation:
- y = f_theta(x).
- x is the input you feed to the model (for example images, text, audio, or LiDAR point clouds).
- y is the model's output: either a discrete category (classification) or a continuous value (regression).
- f_theta is the model we want to train: a function built from many trainable parameters.
These trainable (learnable) parameters are not set by hand; their values are whatever the model learns during training. How you design the model and the training procedure (which loss function and metrics you choose) strongly affects the final result. To improve the model's performance, you tune the hyperparameters, which are the settings a person does choose.
How is a model trained? Usually by iteration, where each iteration updates the model's learnable parameters. This process is called optimization, and there are multiple optimization algorithms (optimizers) we can choose from to train the model.
What is optimization? Given the objective defined by your loss function, we want to minimize the loss so that the model performs as well as possible. How do we reduce the loss? Each forward pass computes a loss value. Backpropagation then computes the gradient of the loss with respect to the parameters at the current operating point, and the optimizer you picked (e.g. gradient descent, SGD, mini-batch GD with momentum, Adagrad, RMSprop, Adam, AdamW) uses that gradient to update the parameters. For plain gradient descent, the update rule is:
- parameters_(t+1) <- parameters_t - learning_rate * gradient
The minus sign matters: the gradient points in the direction of steepest increase of the loss, so stepping against it should decrease the loss on the next iteration, provided the learning rate is small enough.
Now you know the basics of training an ML model. Congratulations!
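
To make the update rule concrete, here is a minimal sketch of plain gradient descent fitting a one-feature linear model with a mean squared error loss. The toy data, the learning rate, and the step count are illustrative assumptions, not values from these notes.

```python
# Minimal sketch: fit y = w * x + b by plain gradient descent on MSE.
# Data, learning rate, and step count are illustrative assumptions.

def predict(w, b, x):
    return w * x + b

def mse_loss(w, b, xs, ys):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b, xs, ys):
    # Analytic gradient of the MSE loss with respect to w and b.
    n = len(xs)
    grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def train(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b, xs, ys)
        # Step against the gradient: parameters <- parameters - lr * gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1
initial_loss = mse_loss(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
final_loss = mse_loss(w, b, xs, ys)
```

Flipping the sign in the two update lines makes the loss grow instead of shrink, which is a quick way to convince yourself of the direction.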

This post lays out the core building blocks:

Problem framing
Objectives
Optimization
Generalization
The failure modes you must diagnose to improve iteration speed.

1) Problem framing: what are we predicting and why?

Most ML problems can be framed as one of:

Supervised learning: learn (x -> y) from labeled pairs.
- Classification, regression, detection, segmentation.
Self-supervised learning (SSL): learn representations by creating pseudo-labels from raw data.
- Contrastive learning, masked modeling, autoencoding.
Unsupervised learning: discover structure without explicit prediction targets.
- Clustering, density estimation.
Reinforcement learning: learn a policy to maximize long-term reward under an environment.

Before architectures, the most important decision is:

What is the output space and what does “good” mean?

That answer determines your labels, your loss, your metrics, and often your model family.

2) Data: the “model” you actually ship is data + pipeline

A trained model is only as reliable as the data distribution it sees.

Key concepts:

Train / validation / test split: validation is for model selection; test is for final unbiased reporting.
Data leakage: any feature or processing step that gives the model access to information it would not have at inference time. Leakage can make metrics look “amazing” and then fail in production.
Distribution shift:
- Covariate shift: \(p(x)\) changes.
- Label shift: \(p(y)\) changes.
- Concept drift: \(p(y \mid x)\) changes.
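
One common source of leakage is computing preprocessing statistics on the full dataset before splitting. The sketch below, under the assumption of a single numeric feature and a random split, fits the normalization statistics on the training split only and reuses them for validation and test. The helper names are hypothetical, and a random split is itself inappropriate for time-ordered data.

```python
import random

def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    # Shuffle once, then carve out disjoint test / validation / train index sets.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

def fit_standardizer(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    return [(v - mean) / std for v in values]

feature = [float(i) for i in range(100)]  # toy feature column
train_idx, val_idx, test_idx = split_indices(len(feature))

# Statistics come from the training split only; val/test never influence them.
mean, std = fit_standardizer([feature[i] for i in train_idx])
train_x = standardize([feature[i] for i in train_idx], mean, std)
val_x = standardize([feature[i] for i in val_idx], mean, std)
test_x = standardize([feature[i] for i in test_idx], mean, std)
```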

Practical habit: track not only aggregate metrics but slices (conditions, classes, environments) and monitor drift over time.
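
As a small illustration of slice tracking, the sketch below computes accuracy per slice next to the aggregate. The record layout `(slice_name, y_true, y_pred)` and the "day" / "night" slices are assumptions made up for the example.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    return {name: correct[name] / total[name] for name in total}

records = [
    ("day", 1, 1), ("day", 0, 0), ("day", 1, 1), ("day", 0, 0),
    ("night", 1, 0), ("night", 0, 0),
]
overall = sum(int(t == p) for _, t, p in records) / len(records)
per_slice = accuracy_by_slice(records)
```

Here the aggregate accuracy is 5/6, which hides that the "night" slice is only at 50%.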

3) Model: choose inductive biases that match the data

A model is a parameterized function family. Different families encode different assumptions (inductive biases):

Linear / logistic models: strong bias, fast, interpretable.
Tree-based methods: handle heterogeneous tabular patterns well.
CNNs: locality + translation bias for images.
Transformers: flexible global interactions, high capacity, often need more data/compute.

A useful mental model:

More inductive bias → faster learning with less data, but less flexible.
Less inductive bias → more flexible, but needs more data/compute and careful training.
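
To see what the translation bias of a CNN means in practice, here is a sketch with a one-dimensional convolution written by hand: shifting the input shifts the output by the same amount. The signal and kernel are arbitrary toy values, and the equality holds exactly here only because the signal is zero near the borders.

```python
def conv1d_valid(signal, kernel):
    # Sliding dot product without padding (cross-correlation, as in most DL libraries).
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(values, steps):
    # Shift by `steps` positions, filling with zeros on the left.
    return [0.0] * steps + values[:len(values) - steps]

signal = [0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0]
kernel = [1.0, -1.0]

out = conv1d_valid(signal, kernel)
out_of_shifted = conv1d_valid(shift_right(signal, 2), kernel)
shifted_out = shift_right(out, 2)  # same result: the same weights apply at every position
```

A fully connected layer has no such constraint, so it would have to learn the same pattern separately at every position.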

4) Objective: loss functions define what the model learns

Training typically solves:

\[
\theta^* = \arg\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}(f_\theta(x), y)\big] + \lambda\,\Omega(\theta)
\]

Where:

\(\mathcal{L}\) is the task loss (alignment to prediction goal).
\(\Omega\) is regularization (controls complexity / stability).
\(\lambda\) trades off fit vs simplicity.
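
In practice the expectation is replaced by an average over the training set. The sketch below evaluates that empirical objective for a one-feature linear model, assuming squared error as the task loss and an L2 penalty on the weight as the regularizer. Both choices are illustrative, not prescribed by the formula.

```python
def regularized_objective(theta, data, lam):
    """Empirical version of: E[L(f_theta(x), y)] + lambda * Omega(theta)."""
    w, b = theta
    # Task loss: mean squared error of f_theta(x) = w * x + b.
    task_loss = sum((w * x + b - y) ** 2 for x, y in data) / len(data)
    # Regularizer: L2 penalty on the weight only (the bias is left unpenalized).
    penalty = w ** 2
    return task_loss + lam * penalty

data = [(1.0, 2.0), (2.0, 4.0)]  # toy pairs generated by y = 2x
perfect_fit = regularized_objective((2.0, 0.0), data, lam=0.1)
smaller_weight = regularized_objective((1.0, 0.0), data, lam=0.1)
```

With a large enough `lam`, the smaller weight wins even though it fits the data worse, which is exactly the fit versus simplicity trade-off.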

Common losses:

Cross-entropy for classification: encourages correct class probability.
MSE / MAE / Huber for regression: different sensitivity to outliers.
IoU-family losses for boxes/segmentation: optimize overlap directly.
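
The classification and regression losses listed above can be sketched in a few lines. This version works on a single example, takes raw logits for cross-entropy, and uses an assumed default `delta=1.0` for Huber.

```python
import math

def cross_entropy(logits, target_index):
    # Negative log-probability of the target class, computed via a
    # numerically stable log-sum-exp rather than an explicit softmax.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def squared_error(error):
    return 0.5 * error * error

def absolute_error(error):
    return abs(error)

def huber(error, delta=1.0):
    # Quadratic near zero, linear in the tails: less sensitive to outliers than MSE.
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

outlier = 3.0
losses_on_outlier = (squared_error(outlier), absolute_error(outlier), huber(outlier))
```

For an error of 3.0 the squared loss is 4.5, the absolute loss is 3.0, and Huber is 2.5, which shows how differently each loss weighs an outlier.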

If your metrics don’t improve, suspect either:

loss is misaligned with the metric, or
the data labels encode something different than what you think.

5) Optimization: how you actually find good parameters

Deep learning training mostly uses variants of stochastic gradient descent:

Compute gradient estimate on mini-batches.
Update weights with optimizer (SGD+momentum, Adam, AdamW).
Use learning rate schedules (warmup, cosine decay, step decay).
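
A learning rate schedule is just a function from the step index to a learning rate. The sketch below combines linear warmup with cosine decay. The function name and its parameters are hypothetical, and real frameworks differ in details such as whether warmup starts at zero.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

schedule = [lr_at_step(s, base_lr=1.0, warmup_steps=10, total_steps=100)
            for s in range(101)]
```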

The two most important knobs:

Learning rate
Effective batch size

Rules of thumb:

Larger batch → more stable gradients → can often raise LR.
Smaller batch → noisier gradients → lower LR or better schedule.
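
One way to turn these rules of thumb into numbers is the linear scaling heuristic: scale the learning rate in proportion to the effective batch size. This is an assumption layered on top of the notes, not a guarantee. It tends to break down at very large batch sizes and usually needs warmup. The helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    # Number of examples that contribute to a single optimizer step.
    return per_device_batch * num_devices * grad_accum_steps

def linearly_scaled_lr(base_lr, base_batch, new_batch):
    # Heuristic only: treat the result as a starting point for tuning.
    return base_lr * new_batch / base_batch

batch = effective_batch_size(per_device_batch=32, num_devices=4, grad_accum_steps=2)
suggested_lr = linearly_scaled_lr(base_lr=0.1, base_batch=256, new_batch=512)
```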

Instability patterns:

NaNs / exploding gradients: often caused by a learning rate that is too high, bad initialization, or numerical issues (log(0), division by zero).
Fixes: gradient clipping, lower LR, numerically stable ops, correctly configured mixed precision (e.g. loss scaling for fp16).
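
Two of these fixes are easy to sketch: clipping gradients by their global norm, and guarding a logarithm against zero. The gradients are a flat list of floats here for simplicity; the `eps` value is an assumed default.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

def safe_log(p, eps=1e-12):
    # Clamp away from zero so log(0) cannot produce -inf.
    return math.log(max(p, eps))

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Clipping preserves the direction of the gradient and only limits the step size, which is why it is usually preferred over clamping each component independently.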

Optimization is not “just training”; it is the mechanism that controls whether your architecture and data can actually converge.

6) Generalization: the core risk is fitting the wrong thing

Generalization is performance on unseen data. The classic regimes:

Underfitting: model too weak or constrained; high training error and high validation error.
Overfitting: model memorizes noise; low training error but high validation error.
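
As a rough sketch, the two regimes can be told apart from the training error and the validation gap. The thresholds `target_error` and `gap_tolerance` are hypothetical and task-specific; nothing in these notes fixes their values.

```python
def diagnose(train_error, val_error, target_error, gap_tolerance):
    # High training error: the model cannot even fit the data it has seen.
    if train_error > target_error:
        return "underfitting"
    # Low training error but a large train/validation gap: it fits noise.
    if val_error - train_error > gap_tolerance:
        return "overfitting"
    return "ok"

cases = {
    "weak_model": diagnose(0.30, 0.32, target_error=0.10, gap_tolerance=0.05),
    "memorizer": diagnose(0.01, 0.25, target_error=0.10, gap_tolerance=0.05),
    "healthy": diagnose(0.05, 0.07, target_error=0.10, gap_tolerance=0.05),
}
```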

Typical levers:

To address underfitting

Increase model capacity.
Reduce regularization.
Train longer.
Improve features / representation.
Fix the learning rate (one that is too high often prevents convergence).

To address overfitting

Increase regularization (weight decay, dropout).
Use more or better data, augmentation.
Early stopping.
Simplify the model.
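
Early stopping is simple enough to sketch directly: keep the best validation loss seen so far and stop once it has not improved for `patience` consecutive epochs. The function name and the loss curve are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and keep the best checkpoint.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 0.6]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

On this curve training stops at epoch 5 and restores epoch 2, so the later improvement at epoch 6 is never seen. That is the usual cost of a short patience.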

A subtle point: “get more data” helps with overfitting much more reliably than with underfitting, because an underfit model already fails to fit the data it has.

7) Representation learning: why SSL and contrastive methods matter

Many tasks become easier if you learn a good embedding space.

Contrastive learning: pull “positives” together, push “negatives” apart (e.g., InfoNCE).
Vision-language pretraining (CLIP-style): align image and text embeddings so text prompts become classifiers.
Masked modeling (MAE-style): learn by reconstructing missing parts.
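
To make "pull positives together, push negatives apart" concrete, here is a sketch of an InfoNCE-style loss in plain Python. It assumes one positive per anchor, uses the other positives in the batch as negatives, and picks a temperature of 0.1 arbitrarily. Published variants differ in details such as symmetrizing the loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    # positives[i] is the positive for anchors[i]; all other positives act as negatives.
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    losses = []
    for i in range(len(a)):
        logits = [dot(a[i], p[j]) / temperature for j in range(len(p))]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy where the "correct class" is the matching positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each anchor is closest to its own positive and large when the pairs are swapped.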

These are not “tricks”. They are ways to learn transferable features when labels are scarce or expensive.

8) Uncertainty: knowing when not to trust the model

Two core types:

Epistemic uncertainty: model uncertainty from lack of data coverage. Often reduced by more data or better modeling.
Aleatoric uncertainty: irreducible noise (sensor noise, ambiguous labels).
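
One commonly used approximation separates the two types with an ensemble: the entropy of the averaged prediction is the total uncertainty, the average entropy of the members stands in for the aleatoric part, and the difference (their disagreement) stands in for the epistemic part. This is a sketch of that approximation for a single input, not the only way to define the split.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decompose_uncertainty(member_probs):
    """member_probs: one class-probability list per ensemble member, for one input."""
    k = len(member_probs)
    num_classes = len(member_probs[0])
    mean_probs = [sum(m[c] for m in member_probs) / k for c in range(num_classes)]
    total = entropy(mean_probs)                               # uncertainty of the ensemble
    aleatoric = sum(entropy(m) for m in member_probs) / k     # noise every member agrees on
    epistemic = total - aleatoric                             # disagreement between members
    return total, aleatoric, epistemic

# Members agree the input is ambiguous: uncertainty is mostly aleatoric.
ambiguous = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Members are individually confident but disagree: uncertainty is mostly epistemic.
disagreeing = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```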

In safety-critical ML, “uncertainty handling” is often as important as accuracy:

abstain,
trigger fallback logic,
request human review,
or route to a safer subsystem.
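
The simplest form of this is a confidence threshold: predict when the top probability is high enough, otherwise abstain and hand off. The threshold of 0.9 is an arbitrary assumption, and raw softmax confidence is often poorly calibrated, so treat this as a sketch of the control flow only.

```python
def decide(probs, min_confidence=0.9):
    # Returns ("predict", class_index) or ("abstain", None).
    confidence = max(probs)
    if confidence < min_confidence:
        return ("abstain", None)  # caller triggers fallback or human review
    return ("predict", probs.index(confidence))

confident = decide([0.03, 0.95, 0.02])
uncertain = decide([0.55, 0.40, 0.05])
```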

9) Evaluation: metrics are the contract with reality

Always separate:

Training objective (loss) vs
Evaluation metric (what you actually care about)

Good evaluation practice:

report confidence intervals where possible,
test on hard slices,
validate that improvements are consistent across conditions,
guard against leakage,
measure both accuracy and cost (latency, memory, FLOPs).
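
For the first point, a percentile bootstrap is a simple way to put a confidence interval on accuracy without distributional assumptions. The sketch below resamples per-example correctness flags. The resample count, the 95% level, and the toy results are illustrative assumptions.

```python
import random

def bootstrap_accuracy_ci(correct_flags, num_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample examples with replacement, recompute accuracy.
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = []
    for _ in range(num_resamples):
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * num_resamples)]
    upper = stats[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lower, upper

flags = [1] * 80 + [0] * 20  # toy evaluation: 80 of 100 examples correct
accuracy = sum(flags) / len(flags)
lower, upper = bootstrap_accuracy_ci(flags)
```

With only 100 test examples the interval is several points wide, so a one-point "improvement" between two models on this test set would not be distinguishable from noise.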