How to use this note:

Take a glance at the Machine Learning Lifecycle.
Follow the mock question and write down your answers.
Review the sample answers and compare with your own.
Update your answers/notes and iterate until you feel comfortable answering all in a natural flow!

Machine Learning Life Cycle:

Let’s use a sample question to walk through the pipeline:

Question Prompt:

You are tasked with designing an online traffic light and traffic sign recognition system for autonomous driving. The system must not only detect traffic lights/signs in real time from sensor inputs (camera, optionally lidar), but also associate each detected traffic light with the correct driving lane the ego vehicle is on (i.e., lane-level association). Design the overall ML system with particular focus on data-related components such as data collection, labeling, input representation, and handling edge cases.

Follow the ML System Design Framework:

1. Problem Formulation (Define ML Objectives)

This step defines what is in scope for this problem and what is out of scope.

Keywords:
1. Online -> real-time (runtime efficiency is critical)
2. recognition -> detection + classification (Bounding Boxes)
3. sensors: camera & lidar (optional)
Assumptions:
- Assume we have HD map (road information)

Define the Inputs/Outputs of the System:

Inputs:
- Images: from front-facing camera mounted on the car.
Outputs:
- 3D Bounding Boxes of Traffic Light & Traffic Signs
  - Attributes: (x, y, z, height, width, depth, orientation, class)
- Regression:
  - Predict Bbox location, dimension, and orientation
  - Total 3 + 3 + 4 parameters.
- Classification:
  - Two Separate Classifiers
  - Traffic Sign: Multi-class Classification (mutually-exclusive)
    - e.g. Stop Sign, Yield Sign, Regulatory Sign
  - Traffic Light: Multi-label Classification.
    - State: Red/Green/Yellow (multi-class classifier)
    - Directions (each direction is a binary classifier)
      - Left Turn (Y/N)
      - Right Turn (Y/N)
      - Go Straight (Y/N)

Plot the high-level diagram on the system pipeline: (From input to output)

How many output heads are we supervising? Keep this in mind: it connects directly to the loss function design in Model Training.

1. Define ML objectives:
- Detect the traffic light & signs and label their classes
  - Input:  images from front-facing camera on the car
  - Output: bounding boxes of traffic light & signs and their corresponding labels.
            Coordinate frame: 2D image-frame (pixel coordinates)
    - traffic sign -> multi-class classification (stop sign, speed limit)
    - traffic light -> multi-label classification
       - red / green / yellow
       - direction: all direction / left-turn / left+straight

2. Data Pipeline

In this step, we should talk about two important components in the ML lifecycle:

Data Collection
Data Preprocessing

Data Collection:

Assume this is a Supervised Learning problem: Need (training data, labels) pairs.
Datasets:
1. Can use SOTA open-source datasets (e.g. Mapillary for traffic sign classification, Waymo Open Dataset).
2. Use dataset collected in-house.
  - Discuss how to create own dataset from scratch.
  - This may be “out-of-scope” in this system design problem.
3. Use a mixture of open-source + in-house.

Data Preprocessing (on data & labels)

Images:

Resize to proper size (e.g. 640×640): this step is optional since it can be merged into the training/inference pipeline.
Normalize/Standardize pixel values (scale pixels to [0, 1] using pixel/255, or standardize to zero mean and unit standard deviation).
Crop unwanted region (Crop out ego vehicle’s hood, or camera housing)
Data Augmentation: Create more difficult scenes
- Nowadays, we can utilize generative models / VLM to create harder examples to make the model more generalized to unseen examples.
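
The two pixel normalization options above can be sketched in a few lines. This is a minimal illustration on a flat list of pixel values, not a production image pipeline; the function names `scale_to_unit` and `standardize` are hypothetical.

```python
def scale_to_unit(pixels):
    """Scale 8-bit pixel values (0..255) to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def standardize(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 for a constant image
    return [(v - mean) / std for v in values]


unit = scale_to_unit([0, 128, 255])
zero_mean = standardize(unit)
```

In practice the mean and standard deviation are usually computed per channel over the training set rather than per image; that choice is left open here.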

Labels

One-hot encoding (e.g. 8 classes-> [0 0 1 0 0 0 0 0] 8×1 vector)
- If labels are not provided: use an oracle model (off-the-shelf SOTA detectors) to generate pseudo-labels, then add human-in-the-loop review to verify label quality.
- Alternative: use LiDAR labels in world coordinate.
  - Project LiDAR points from 3D to 2D using camera projection matrix.
  - Run Clustering (DBSCAN) to get convex hulls of clustered points.
  - Approximate 2D bounding boxes from convex hull as the labels.
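
The LiDAR alternative above can be sketched as follows. This is a rough illustration under several assumptions: the points are already grouped into one cluster (the DBSCAN step is omitted), `P` is a 3x4 projection matrix that maps homogeneous 3D points directly to the image, and the helper names are hypothetical. For an axis-aligned box, the min/max of the projected points gives the same result as boxing their convex hull, so the hull is not computed explicitly.

```python
def project_point(P, xyz):
    """Project one 3D point to pixel coordinates with a 3x4 matrix P."""
    x, y, z = xyz
    h = [x, y, z, 1.0]  # homogeneous coordinates
    u, v, w = (sum(P[r][c] * h[c] for c in range(4)) for r in range(3))
    if w <= 0:
        return None  # point is behind the camera
    return (u / w, v / w)


def box_from_cluster(P, points):
    """Approximate a 2D box (x_min, y_min, x_max, y_max) from a 3D cluster."""
    pixels = [p for p in (project_point(P, q) for q in points) if p is not None]
    if not pixels:
        return None
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return (min(us), min(vs), max(us), max(vs))


# Toy pinhole camera: focal length 100 px, principal point (50, 50).
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
cluster = [(-1.0, -1.0, 10.0), (1.0, 1.0, 10.0)]
box = box_from_cluster(P, cluster)
```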

Handle Imbalanced Dataset

This is CRITICAL in ML System Design, as it happens very often in the real world.

There are many ways to handle imbalanced datasets:

Undersample the majority classes.
Oversample the minority classes.
Design the loss function to downweight the majority classes. (e.g. use Focal Loss)

1. What is focal loss? How does it downweight the majority classes?

Focal Loss adds a modulating factor to the standard Cross Entropy formula.

p_t: The model’s estimated probability for the correct class.
gamma: The “Focusing” parameter (usually set to 2).
alpha: A balancing weight for the class.

Standard Cross Entropy Loss:

CE Loss = -log(p_t)

Focal Loss:

FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)

Example:

Imagine a majority class example (like background) that the model is already 90% sure about (p_t = 0.9).

In Cross Entropy: The loss is -log(0.9) ≈ 0.105.
In Focal Loss (with gamma = 2, alpha = 1): The loss is (1 - 0.9)^2 * 0.105 ≈ 0.001.

By multiplying by the squared error (0.1^2 = 0.01), Focal Loss reduces the impact of that easy example by 100 times. Conversely, for a hard, minority class example where the model is only 10% sure (p_t = 0.1), the factor is (1 - 0.1)^2 = 0.81, which keeps the loss much closer to its original value.
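
The arithmetic in this example can be checked with a small sketch. It follows the focal loss formula given above for a single example; the function names are illustrative, and `alpha_t` defaults to 1 so that only the modulating factor is visible.

```python
import math


def cross_entropy(p_t):
    """Standard cross entropy for the true-class probability p_t."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss: cross entropy scaled by alpha_t * (1 - p_t)^gamma."""
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p_t)


for p in (0.9, 0.1):
    ce, fl = cross_entropy(p), focal_loss(p)
    print(f"p_t={p}: CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

The ratio FL/CE is exactly the modulating factor (1 - p_t)^gamma, which is what separates easy examples from hard ones.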

2. Handling Multiple Majority Classes

In your example (60 Stop Signs, 60 Background, 10 Yield Signs), you have two “majority” classes and one “minority” class. Focal Loss handles this through two distinct mechanisms:

A. Automatic Sample-Wise Downweighting (The $\gamma$ factor)

Focal Loss doesn’t actually care which label a class has; it cares about how easy the example is for the model.

If the model becomes very confident in identifying both “Stop Signs” and “Background,” the $(1 - p_t)^\gamma$ term will naturally shrink the loss for both.
The “Yield Sign” (with only 10 samples) will likely remain a “hard” example for longer, maintaining a higher relative weight in the total loss.

B. Explicit Class Balancing (The $\alpha$ factor)

To handle multiple majority classes more strictly, you use the $\alpha$ parameter, which is a vector of weights—one for each class.

You would assign $\alpha$ values inversely proportional to class frequency: (you give the smaller class a higher $\alpha$ to boost its importance).

$\alpha_{Background} = 0.1$ (Low weight)
$\alpha_{StopSign} = 0.1$ (Low weight)
$\alpha_{YieldSign} = 0.8$ (High weight)
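
One simple way to derive such weights is to normalize the inverse class frequencies so they sum to 1. The sketch below is only one possible convention (the helper name is hypothetical); for the 60/60/10 split it gives 0.125 / 0.125 / 0.75, close to the rounded 0.1 / 0.1 / 0.8 values above.

```python
def inverse_frequency_alpha(counts):
    """Return per-class weights proportional to 1 / count, summing to 1."""
    inverse = {name: 1.0 / n for name, n in counts.items()}
    total = sum(inverse.values())
    return {name: w / total for name, w in inverse.items()}


alpha = inverse_frequency_alpha(
    {"Background": 60, "StopSign": 60, "YieldSign": 10}
)
```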

Summary for your 60/60/10 Scenario:

Focal Loss treats each example individually. If the 60 Stop Signs are visually similar and “easy” to learn, their total loss contribution will drop off quickly.
Alpha Balancing will ensure that the 10 Yield Signs have a higher “baseline” importance so the model doesn’t just ignore them to perfect the Stop Signs.

2. Data Collection:
   Supervised learning -> (training data, labels)
   - Datasets: Mapillary (traffic sign), Waymo Open Source dataset
   - TODO: discuss how to create own dataset from scratch

   - Preprocessing:
     - images: resize (640x640)  <- optional (merged during training/inference)
               normalize/standardize pixel values (pixel/255 -> [0, 1] / 0-mean, 1-std)
               crop unwanted (crop out ego vehicle's hood)
               data augmentation -> create more difficult scene (copy & paste traffic signs/labels)
                   make the model more generalized
    - label - one-hot encoding (e.g. 8 classes -> [0 0 1 0 0 0 0 0] 8x1 vector)
            if label not provided  1. Oracle model
                                   2. lidar label in world coordinate. (Transform (project) lidar points from 3D to 2D using camera projection matrix)
                                      Clustering (DBSCAN) -> get convex hull of clustered points -> approximate 2D box from convex hull
            how to balance out label distribution?  (imbalanced)
                - majority will be background
                1. Undersample the majority
                2. Oversample minority (similar augmentation)
                3. Design loss function to down-weight the loss from majority class

3. Modeling

This step covers the core part of a Machine Learning System: model selection, model training, and model evaluation.

Model Selection

It’s vital to discuss the pros and cons of different model architectures. Understanding the tradeoffs between different options demonstrates your technical depth and domain knowledge.

There are two families of object detection models, grouped by architecture:

CNN based:

Two-stage: Faster R-CNN (good on small/crowded objects, generally more accurate) but slower due to its two-stage architecture
- Stage 1: Region Proposal Network
- Stage 2: Detection Network
Single-stage: YOLOv8: faster, which suits the real-time constraint (evaluate it on small objects such as traffic lights)

Transformer based:

Transformers are data-hungry: they require more training data and more compute than CNNs, so they are more expensive to train.

Detection Transformer (DETR):
- Performs poorly on small objects. Reason: it uses global attention, and small objects occupy only a small number of pixels, so they are difficult to learn from large attention matrices.
Deformable DETR: Better performance on small objects because it uses the deformable attention mechanism, which attends only to a small set of sampling points around each reference point.
Advantage of transformer-based detectors: end-to-end detection without post-processing such as Non-Maximum Suppression.

Model Training

Loss Function Design: A Combined Loss of:

Regression loss (Bounding Box locations): Smooth L1 loss / GIoU / CIoU
- This regresses the bounding box coordinates.
Classification loss (multi-class): Cross-entropy loss
Objectness Score (confidence): 1/0 Binary Cross-Entropy Loss.

What are the differences between IoU, GIoU, and CIoU losses?

1. IoU is the gold standard for measuring the overlap between two boxes. It is calculated by dividing the area of overlap by the area of the union.

IoU = \frac{|A \cap B|}{|A \cup B|}

The Problem: If two boxes do not overlap, the IoU is 0. Because the gradient becomes zero, the model receives no feedback on which direction to move the predicted box to find the target. It also doesn’t account for how the boxes are aligned (e.g., one could be inside the other, but off-center).
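
A minimal sketch of IoU for axis-aligned boxes makes the zero-gradient problem concrete. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x2 > x1` and `y2 > y1`; the function name is illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlapping = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1 / 7
near_miss = iou((0, 0, 1, 1), (2, 2, 3, 3))    # no overlap
far_miss = iou((0, 0, 1, 1), (5, 5, 6, 6))     # no overlap, much farther
```

Both misses score 0 regardless of distance, so an IoU-based loss cannot tell the model which of the two predictions is closer to the target.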

2. Generalized IoU (GIoU)

GIoU was designed to solve the “zero gradient” problem when boxes don’t overlap. It introduces a “penalty term” using the smallest enclosing convex box C that contains both the predicted box and the ground truth.

GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}

The Fix: When there is no overlap, GIoU will be negative. The loss function will then push the predicted box toward the ground truth to shrink the empty space inside C that neither box covers.
GIoU loss = 1 - GIoU
Minimizing the GIoU loss means:
- Maximizing IoU when boxes are overlapped
- Minimizing distance between boxes when boxes are not overlapped
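
The behavior for non-overlapping boxes can be seen in a short sketch of the GIoU formula above. Boxes are assumed to be non-degenerate `(x1, y1, x2, y2)` tuples; the function name is illustrative.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box C of the two boxes.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area


near = giou((0, 0, 1, 1), (2, 0, 3, 1))   # gap of 1 unit
far = giou((0, 0, 1, 1), (9, 0, 10, 1))   # gap of 8 units
near_loss, far_loss = 1 - near, 1 - far
```

Unlike plain IoU, the farther box gets a lower GIoU and therefore a higher loss, so there is a signal to move toward the target.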

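The Limitation: Once the boxes start overlapping significantly, GIoU struggles to distinguish between different orientations or “inner” alignments, often leading to slow convergence. For example, when one box fully encloses the other, C coincides with the larger box, so the penalty term vanishes and GIoU reduces to plain IoU no matter where the inner box sits.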

3. Complete IoU (CIoU)

CIoU is currently one of the most popular choices (often used in YOLOv5/v8). It argues that a good box loss should consider three geometric factors: Overlap area, Central point distance, and Aspect ratio.

It adds two specific terms to the IoU calculation:

Distance Penalty: Measures the Euclidean distance between the center points of the two boxes.
Aspect Ratio Penalty: Measures how well the predicted box matches the proportions (width/height) of the target.

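CIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v

Here \rho(b, b^{gt}) is the Euclidean distance between the two box centers, c is the diagonal length of the smallest enclosing box, v measures the aspect-ratio mismatch, and \alpha is a trade-off weight.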

The Fix: Even if the boxes overlap perfectly at the centers, the aspect ratio penalty v ensures the model shapes the box correctly. This leads to much faster convergence and better precision than GIoU.
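
A sketch of the CIoU formula above, using the commonly cited definitions v = (4 / pi^2) * (atan(w_gt / h_gt) - atan(w / h))^2 and alpha = v / ((1 - IoU) + v). Boxes are assumed to be non-degenerate `(x1, y1, x2, y2)` tuples, and the small `eps` is an added guard against division by zero, not part of the formula.

```python
import math


def ciou(pred, gt, eps=1e-9):
    """Complete IoU of a predicted box and a ground-truth box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (pw * ph + gw * gh - inter)
    # Squared distance between the two box centers (rho^2).
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Squared diagonal of the smallest enclosing box (c^2).
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return iou - rho2 / c2 - alpha * v


shifted = ciou((1, 0, 3, 2), (0, 0, 2, 2))           # same shape, offset center
stretched = ciou((-2, -0.5, 2, 0.5), (-1, -1, 1, 1))  # same center, wrong shape
```

Both cases have IoU = 1/3, yet CIoU penalizes the first through the center distance term and the second through the aspect ratio term.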

Loss Comparison:

Summary

Feature	IoU	GIoU	CIoU
Non-overlapping feedback	None (Gradient is 0)	Yes (Enclosing box)	Yes (Center distance)
Alignment Sensitivity	Low	Medium	High
Considers Aspect Ratio	No	No	Yes
Convergence Speed	Slowest	Medium	Fastest

Reference:

GIoU, CIoU and DIoU: Variants of IoU and how they are better compared to IoU

3. Modeling
   - Accurately detect & classify traffic light & signs
     - two-stage: faster r-cnn (small/crowded objects) slower
     - one-stage: YOLOv8: faster (evaluate traffic light (small objects)) -> V (real-time constraint)
     - transformer based (ViT/DETR/Deformable DETR) (require more data/higher compute cost/small objects)
          - Deformable attention (only attends to a set of neighbor pixels)
          - DETR standard attention -> attends to 
  - Train the model
    - Design the loss function (combined loss)
      - bounding box (regression): Smooth L1 loss / CIoU / GIoU (regress bounding box coordinates)
        IoU loss
      - Classification loss (multi-class): Cross-entropy loss
      - Objectness score (confidence): 1/0 Binary Cross-Entropy loss

Model Evaluation

There are usually two types of evaluations you should run for an ML system:

Evaluation DURING training.
- The metrics you pick should act as a proxy that tells you: How does the model perform? Is the model improving as training iterates?
- This gives you a quick glance at how well the model is learning. If training isn’t progressing as expected, troubleshoot the pipeline immediately instead of waiting for it to complete.
- Common metrics for ML problems:
  - Object Detection: mAP (mean Average Precision)
  - Object Tracking:
  - Semantic Segmentation: mIOU
  - HD Map Learning / Online Mapping: mAP, mIOU
- What is mAP? How to derive the mAP score?
  - mAP is the mean AP across all “categories”. (All classes)
  - First, plot the Precision-Recall (PR) Curve for each category.
    - Sweep the confidence threshold t and count the # True Positives at each value.
      - True Positive: The model detects a bounding box that is matched with a ground-truth bounding box of the same class.
      - The detection confidence score > t.
    - Compute (Precision, Recall) from the Confusion Matrix.
      - TODO: Confusion Matrix in ML.
    - Plot a point on the PR-Curve at that particular t.
  - Average Precision (AP) is the “area” under the PR-Curve.
- For classification, we can also use Accuracy as a metric.
  - Accuracy = (TP + TN) / (TP +FP + TN + FN)
Evaluation AFTER the model is trained.
- This is a more comprehensive evaluation (sometimes we call it gated evaluation) to evaluate your model once it finishes its training.
- There should be a curated test dataset specifically used to evaluate the overall performance of the model.
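
The mAP derivation described above can be sketched as follows. This assumes each detection has already been matched against ground truth (for example with an IoU threshold) and is given as a `(confidence, is_true_positive)` pair; sorting by confidence is equivalent to sweeping the threshold t. The area under the PR curve is computed with the monotone precision envelope, which is one common convention among several. All names are illustrative.

```python
def average_precision(detections, num_gt):
    """AP for one class from (confidence, is_true_positive) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:  # each step lowers the threshold t by one detection
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing in recall (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the PR curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """mAP: mean of per-class AP; per_class maps name -> (detections, num_gt)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)


per_class = {
    "stop_sign": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "yield_sign": ([(0.9, True), (0.8, True)], 2),
}
m_ap = mean_average_precision(per_class)
```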

4. Evaluation
   - Metrics:
     - mAP (mean Average Precision)
     - Precision & Recall @ different confidence threshold -> plot PR-Curve
       - AP for each class
       - mAP: average for all classes' AP
    - Classification: Accuracy / Precision (V)

4. Model Deployment & Monitoring & Maintenance

I did not have time to cover this part, and the interviewer did not ask much about it. I guess this step also depends on the role you’re interviewing for, the interviewer’s background, and what they are interested in seeing.

Model Deployment

In order to deploy a model to an onboard system, we need to downsize the model, and accelerate its inference speed.

There are some common methods we can use.

Knowledge Distillation

First train a teacher model.
Then, distill the knowledge into a much smaller student model.

Parameter Pruning

Analyze and prune the least important parameters.

Quantization:

Use lower numeric precision (e.g. using the TensorRT library to go from FP32 to FP16, or even INT8) to accelerate inference at runtime.
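
To make the INT8 idea concrete, here is a toy sketch of symmetric per-tensor quantization in plain Python. It only illustrates the underlying arithmetic and is not how TensorRT is invoked; the function names and the choice of a single scale per tensor are assumptions for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.3, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is the accuracy cost paid for the smaller, faster representation.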

Model Monitoring & Maintenance

TODO

Extra questions I was asked

1. How do you associate the detected traffic signs & lights to road lanes?

Non-learning based approach:
- 1. Bird’s eye view lifting: Assume we have depth information (from a depth camera or from LiDAR points), so we can project 2D pixels to 3D coordinates.
- 2. Use distance based heuristics
ML Based approach:
- Train another model to learn the TL/TS <-> lane associations.
- Input: object, GT(label): lane
  - Loss: Cross-entropy loss (mutually exclusive)
  - metric: accuracy/precision.
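
The distance-based heuristic can be sketched as a nearest-lane lookup in BEV space. This assumes the detected object has already been lifted to a BEV position `(x, y)` and that each lane is available from the HD map as a centerline polyline with at least two points; the lane IDs and function names are hypothetical. A real system would likely add more cues (heading, stop-line position), which are out of scope here.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (all 2D tuples)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len2
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def nearest_lane(obj_xy, lanes):
    """Return (lane_id, distance) of the lane centerline closest to the object."""
    best_id, best_d = None, float("inf")
    for lane_id, pts in lanes.items():
        d = min(point_segment_distance(obj_xy, pts[i], pts[i + 1])
                for i in range(len(pts) - 1))
        if d < best_d:
            best_id, best_d = lane_id, d
    return best_id, best_d


lanes = {
    "left_turn_lane": [(-3.5, 0.0), (-3.5, 50.0)],
    "ego_lane": [(0.0, 0.0), (0.0, 50.0)],
}
lane_id, dist = nearest_lane((-3.0, 40.0), lanes)
```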

Assume: have map (road info) / lidar 
        Project 2D -> 3d
              - bev lifting (assume depth)
              - lidar / 
        Distance based heuristic

ML -based:
Objective: associate corresponding traffic sign/light to their lane
Assume both (lane & objects) are in BEV space
Assume have GT:
(object1, lane) -> classification: which lane
  - loss: CE loss
  - metric: accuracy / precision

Goro Yeh 56

MLSD Notes
