Machine Learning Domain Knowledge (Co-authored with Notion AI)


These are my notes, summarized with the help of Notion AI:

Here’s a technical fundamentals post you can use as a baseline “domain knowledge” doc and iterate on. It is intentionally opinionated toward how fundamentals show up in real ML systems, not just textbook definitions.


Fundamentals of Machine Learning (A Practical Technical Primer)

Machine learning is the practice of learning a function f_theta that maps inputs x to outputs y, using data rather than hard-coded rules.

The “learning” is the process of choosing parameters theta that minimize a loss function on training data, while still performing well on unseen data.

That’s the entire game.

Prerequisites:

  • A machine learning model is simply a function of its parameters. In mathematical form:
    • y = f_theta(x)
    • x is the input you feed to the model (for example images, text, audio, or LiDAR point clouds).
    • y is the model’s output: either a discrete category (classification) or a continuous value (regression).
    • f_theta is the model we want to train: a function built from many trainable parameters.
  • These trainable (learnable) parameters are not set by hand; they are whatever the model learns during training. How you design the model and the training procedure (the choice of loss function and metrics) still strongly shapes the final result. To improve the model’s performance, you tune the hyperparameters, that is, all the parameters a human does get to choose.
  • How is a model trained? Usually by iterative optimization, which here means repeatedly updating the model’s learnable parameters. This process is called optimization, and there are several optimization algorithms (optimizers) to choose from.
  • What is optimization? Given the loss function you defined as the objective, we want to minimize the loss so that the model performs as well as possible. How is the loss lowered? Each forward pass computes a loss value. A backward pass then computes the gradient of the loss with respect to the parameters at the current operating point, and the optimizer you picked (e.g. gradient descent, SGD, mini-batch GD with momentum, Adagrad, RMSprop, Adam, AdamW) turns that gradient into a parameter update. For plain gradient descent, the update rule is:
    • parameters_(t+1) <- parameters_t - learning_rate * gradient
  • This way, the model’s learnable parameters are updated so that, with a suitably small learning rate, the loss should decrease on the next iteration.
  • Now you know the basics of training an ML model. Congratulations!
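The prerequisites above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a one-dimensional linear model with theta = (w, b), a mean squared error loss, and full-batch gradient descent. The data, learning rate, and step count are made up for the example.

```python
# Minimal sketch: fit y = w * x + b with plain (full-batch) gradient descent.
# The data, learning rate, and step count are illustrative assumptions.

def predict(w, b, x):
    """f_theta(x) with theta = (w, b)."""
    return w * x + b

def mse_loss(w, b, xs, ys):
    """Mean squared error over the training set."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b, xs, ys):
    """Analytic gradient of the MSE loss with respect to w and b."""
    n = len(xs)
    grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def train(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0  # learnable parameters, initialized arbitrarily
    history = []
    for _ in range(steps):
        history.append(mse_loss(w, b, xs, ys))    # forward pass and loss
        grad_w, grad_b = gradients(w, b, xs, ys)  # gradient at the current point
        w -= learning_rate * grad_w               # step against the gradient
        b -= learning_rate * grad_b
    return w, b, history

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b, history = train(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, first loss={history[0]:.3f}, last loss={history[-1]:.6f}")
```

The learning rate here is a hyperparameter chosen by hand, while w and b are the learned parameters. In a real framework the gradient would come from automatic differentiation rather than a hand-derived formula.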

This post lays out the core building blocks:

  1. Problem framing
  2. Objectives
  3. Optimization
  4. Generalization
  5. The failure modes you must diagnose to improve iteration speed.

1) Problem framing: what are we predicting and why?

Most ML problems can be framed as one of:

Before architectures, the most important decision is:

What is the output space and what does “good” mean?

That answer determines your labels, your loss, your metrics, and often your model family.


2) Data: the “model” you actually ship is data + pipeline

A trained model is only as reliable as the data distribution it sees.

Key concepts:

Practical habit: track not only aggregate metrics but slices (conditions, classes, environments) and monitor drift over time.
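As a rough sketch of that habit, the snippet below computes accuracy overall and per slice. The record layout `(slice_name, y_true, y_pred)` and the slice names `day` and `night` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return overall accuracy and accuracy per slice.

    Each record is (slice_name, y_true, y_pred); the slice names are
    hypothetical tags such as lighting condition or sensor type.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(total.values())
    per_slice = {name: correct[name] / total[name] for name in total}
    return overall, per_slice

# Eight easy "day" examples and two harder "night" examples (made-up data).
records = [
    ("day", 1, 1), ("day", 0, 0), ("day", 1, 1), ("day", 0, 0),
    ("day", 1, 1), ("day", 0, 0), ("day", 1, 1), ("day", 0, 0),
    ("night", 1, 0), ("night", 0, 0),
]
overall, per_slice = accuracy_by_slice(records)
print("overall:", overall)
print("per slice:", per_slice)
```

In this toy data the aggregate accuracy looks healthy while the `night` slice is at chance level, which is the kind of gap an aggregate metric hides.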


3) Model: choose inductive biases that match the data

A model is a parameterized function family. Different families encode different assumptions (inductive biases):

A useful mental model:


4) Objective: loss functions define what the model learns

Training typically solves:

\[
\theta^* = \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}(f_\theta(x), y)\big] + \lambda\,\Omega(\theta)
\]

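Where:

  • theta: the model parameters, with theta^* the minimizer
  • D: the data distribution that the (x, y) pairs are drawn from
  • L: the per-example loss comparing the prediction f_theta(x) with the label y
  • Omega(theta): a regularization penalty on the parameters
  • lambda: the weight that trades the penalty off against the data-fit term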

Common losses:

If your metrics don’t improve, suspect either:

  1. loss is misaligned with the metric, or
  2. the data labels encode something different than what you think.
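To make the training objective in this section concrete, here is a small sketch that evaluates it on a finite sample: the average loss over a dataset plus a weighted penalty. The linear model, squared-error loss, and squared L2 penalty are assumptions picked for the example. The objective itself does not prescribe them.

```python
def linear_model(theta, x):
    """f_theta(x) = theta . x (no bias term, for brevity)."""
    return sum(t * xi for t, xi in zip(theta, x))

def squared_error(y_pred, y_true):
    """Per-example loss L(f_theta(x), y) for a regression target."""
    return (y_pred - y_true) ** 2

def l2_penalty(theta):
    """Omega(theta): squared L2 norm of the parameters."""
    return sum(t * t for t in theta)

def regularized_risk(theta, dataset, lam):
    """Sample average of the loss plus lam * Omega(theta)."""
    data_term = sum(
        squared_error(linear_model(theta, x), y) for x, y in dataset
    ) / len(dataset)
    return data_term + lam * l2_penalty(theta)

# Made-up dataset that theta = [2, -1] fits exactly.
dataset = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
theta = [2.0, -1.0]
print(regularized_risk(theta, dataset, lam=0.0))  # data-fit term only
print(regularized_risk(theta, dataset, lam=0.1))  # data-fit term plus penalty
```

With a nonzero lambda, parameters that fit the data perfectly still pay a cost for their size. That is how the regularizer pulls the minimizer toward smaller parameter values.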

5) Optimization: how you actually find good parameters

Deep learning is mostly stochastic gradient descent variants:

The two most important knobs:

Rules of thumb:

Instability patterns:

Optimization is not “just training”; it is the mechanism that controls whether your architecture and data can actually converge.
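As one illustration of a stochastic gradient descent variant, the sketch below runs mini-batch SGD on a tiny linear regression problem. It uses only the standard library. The batch size, learning rate, epoch count, and data are assumptions for the example, not recommendations.

```python
import random

def minibatch_sgd(xs, ys, learning_rate=0.05, batch_size=2, epochs=200, seed=0):
    """Fit y = w * x + b with mini-batch SGD on a mean squared error loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Noise-free data from y = 2x + 1, with inputs centered around zero.
xs = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(f"w={w:.4f}, b={b:.4f}")
```

Each update sees only a noisy estimate of the full gradient. This is why the learning rate and batch size interact: a step size that is safe for the full-batch gradient can still be too large for the worst-conditioned mini-batch.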


6) Generalization: the core risk is fitting the wrong thing

Generalization is performance on unseen data. The classic regimes:

Typical levers:

To address underfitting

To address overfitting

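A subtle point: getting more data mitigates overfitting far more reliably than it fixes underfitting, because an underfit model is limited by its capacity or its optimization rather than by the sample size.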
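The following toy experiment is one way to see that asymmetry. It compares a deliberately low-capacity model (always predict the training mean) with a model that memorizes the training set (1-nearest neighbor) at two training-set sizes. The target function, noise level, and sizes are arbitrary choices for the sketch.

```python
import math
import random

def make_data(n, rng, noise=0.1):
    """Sample n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Low-capacity model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_nearest_neighbor(xs, ys):
    """High-capacity model: memorize the training set (1-nearest neighbor)."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
val_xs, val_ys = make_data(300, rng)
results = {}
for n in (10, 500):
    train_xs, train_ys = make_data(n, rng)
    for name, fit in (("constant", fit_constant), ("1-NN", fit_nearest_neighbor)):
        model = fit(train_xs, train_ys)
        results[(name, n)] = {
            "train": mse(model, train_xs, train_ys),
            "val": mse(model, val_xs, val_ys),
        }
        print(name, n, results[(name, n)])
```

The memorizing model has zero training error at both sizes, and its validation error shrinks as the training set grows. The constant model stays poor on both splits no matter how much data it sees.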


7) Representation learning: why SSL and contrastive methods matter

Many tasks become easier if you learn a good embedding space.

These are not “tricks”. They are ways to learn transferable features when labels are scarce or expensive.
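As a hedged illustration of what a contrastive objective can look like, the sketch below implements an InfoNCE-style loss in plain Python. For each anchor embedding, the embedding at the same index is treated as its positive and all others as negatives. The cosine similarity, the temperature value, and the two-dimensional toy embeddings are assumptions for the example.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchors, positives, temperature=0.1):
    """Average InfoNCE-style loss.

    positives[i] is the match for anchors[i]; every other positives[j]
    serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the true pair
    return total / len(anchors)

embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mismatched = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]
matched_loss = info_nce_loss(embeddings, embeddings)
mismatched_loss = info_nce_loss(embeddings, mismatched)
print("matched pairs:", matched_loss)
print("mismatched pairs:", mismatched_loss)
```

The loss is low when each anchor is closest to its own positive and high when the pairing is scrambled. That pressure is what shapes the embedding space during training.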


8) Uncertainty: knowing when not to trust the model

Two core types:

In safety-critical ML, “uncertainty handling” is often as important as accuracy:
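One simple form of uncertainty handling is to let the model abstain when its predictive distribution is too flat. The sketch below is an assumption-laden example: it uses the entropy of a softmax output with a hand-picked threshold, and `predict_or_abstain` is a hypothetical helper name. Softmax confidence is not a calibrated uncertainty estimate on its own, so treat this only as an outline of the control flow.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Entropy in nats; higher means a flatter, less certain distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict_or_abstain(logits, max_entropy=0.5):
    """Return the predicted class index, or None to defer the decision."""
    probs = softmax(logits)
    if predictive_entropy(probs) > max_entropy:
        return None  # too uncertain: hand off to a fallback or a human
    return max(range(len(probs)), key=probs.__getitem__)

confident = predict_or_abstain([4.0, 0.0, 0.0])
uncertain = predict_or_abstain([0.1, 0.0, 0.0])
print("peaked logits ->", confident)
print("near-uniform logits ->", uncertain)
```

The threshold trades coverage for reliability. Lowering it makes the system abstain more often but act only on predictions it is more sure about.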


9) Evaluation: metrics are the contract with reality

Always separate:

Good evaluation practice: