January 4, 2026

DETR (DEtection TRansformer): Seeing Object Detection as a Bipartite Matching Task

In this post, we'll discuss a game changer in the deep learning "object detection" world: DETR.

Introduction

The DETR paper (End-to-End Object Detection with Transformers) was published at ECCV 2020 by Facebook AI.

The authors presented a new method, DETR, that views **object detection** as a **direct set prediction problem**. The main contributions of the DETR paper are:
1. The first end-to-end object detection method.
2. Removal of hand-designed components such as non-max suppression (NMS) and anchor generation, which explicitly encode prior knowledge about the task.

The main ingredients of DETR (DEtection TRansformer) are:
1. A set-based global loss that forces unique predictions via bipartite matching.
2. A transformer encoder-decoder architecture.

Given a small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. (Note: this differs from the traditional transformer decoder in NLP, which outputs text auto-regressively.)

DETR demonstrates (1) accuracy and (2) run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset.
Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner.

To summarize DETR's contributions:

- For the first time, object detection is treated as a direct set prediction problem.
- Hand-designed components (NMS, anchors) are removed.
- A transformer architecture is combined with bipartite matching, which forces unique predictions.
- All outputs are predicted in parallel, which makes inference efficient.

What does "direct set prediction" mean? Instead of the "indirect" set prediction that most modern methods use, DETR "directly" predicts the whole set of detections in one forward pass.

Indirect set prediction: first generate proposals, anchors, or window centers, then predict bounding boxes and their classes relative to these region proposals or anchor boxes.

Review the Object Detection Task

The goal of object detection is to predict a set of bounding boxes for all potential objects. Each bounding box has:

- Center (x, y)
- Dimensions (height, width)
- Category (the class this object belongs to)
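As a small illustration of this representation, here is a sketch of a box record in Python. The `Box` class, its field names, and the `to_corners` helper are hypothetical names chosen for this example, not part of any DETR codebase. It only shows how a (center, size, category) box maps to the corner format that overlap computations usually expect.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A bounding box stored as center + size + category (hypothetical layout)."""
    cx: float    # center x
    cy: float    # center y
    w: float     # width
    h: float     # height
    label: str   # category of the object

    def to_corners(self):
        """Return (x1, y1, x2, y2): top-left and bottom-right corners."""
        return (self.cx - self.w / 2, self.cy - self.h / 2,
                self.cx + self.w / 2, self.cy + self.h / 2)


cat = Box(cx=50, cy=40, w=20, h=10, label="cat")
corners = cat.to_corners()
print(corners)
```

Storing the center and size keeps the four numbers independent of each other, while the corner form is convenient when computing intersections.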

DEtection TRansformer (DETR)

DETR streamlines the training pipeline by viewing object detection as a direct set prediction problem. It adopts an encoder-decoder architecture based on transformers. The self-attention mechanism explicitly models all pairwise interactions between elements in a sequence.

DETR predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted objects and ground-truth objects. DETR doesn’t require any customized layers, and thus can be reproduced easily in any framework that contains standard CNN & Transformer classes.

Bipartite Matching Loss

The matching loss function uniquely assigns a prediction to a ground-truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.
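To make the matching step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `match_cost` function is a toy cost (L1 box distance plus a class penalty), not the cost used in the paper, and the boxes are made up. The search is brute force over permutations, which only works for a handful of boxes. Real implementations solve the same assignment problem with the Hungarian algorithm, for example via `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def match_cost(pred, gt):
    """Toy pairwise cost: L1 box distance plus 1.0 if the class is wrong."""
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    class_cost = 0.0 if pred["label"] == gt["label"] else 1.0
    return box_cost + class_cost


def bipartite_match(preds, gts):
    """Return (total_cost, assignment); assignment[j] is the index of the
    prediction matched to ground truth j. Assumes len(preds) >= len(gts)."""
    best = None
    # Try every way of giving each ground truth a distinct prediction.
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(match_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


# Boxes are (cx, cy, w, h) in normalized image coordinates (made-up values).
gts = [
    {"box": (0.2, 0.2, 0.1, 0.1), "label": "cat"},
    {"box": (0.7, 0.7, 0.2, 0.2), "label": "dog"},
]
preds = [
    {"box": (0.7, 0.7, 0.2, 0.25), "label": "dog"},
    {"box": (0.5, 0.5, 0.3, 0.3), "label": "cat"},
    {"box": (0.2, 0.25, 0.1, 0.1), "label": "cat"},
]

total, assignment = bipartite_match(preds, gts)

# Reordering the predictions changes the indices but not the optimal cost.
total_shuffled, assignment_shuffled = bipartite_match(preds[::-1], gts)
print(assignment, assignment_shuffled)
```

Each ground-truth box ends up with exactly one prediction, and the second call shows the permutation invariance: reversing the prediction order yields the same total cost. In DETR, predictions left unmatched are trained to output a "no object" class.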

Transformer with Parallel Decoding

Limitations in DETR

DETR performs worse on small objects. This is because it uses a "global attention" mechanism over a single feature map to capture information from the entire image. With a fixed number of tokens for the whole image, a small object is covered by very few tokens, which makes it harder to represent and to tell apart than a large object.

Fortunately, this is mitigated by a follow-up paper, Deformable DETR, which proposes "deformable attention" to address DETR's poor performance on small objects.
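A rough back-of-the-envelope calculation shows how few tokens a small object gets. The sketch below assumes a backbone that downsamples the image by a factor of 32 (the `stride` parameter), and the image and object sizes are hypothetical examples. The function names are made up for this illustration.

```python
def feature_tokens(image_h, image_w, stride=32):
    """Number of tokens the encoder sees for an image, at the assumed stride."""
    return (image_h // stride) * (image_w // stride)


def tokens_for_object(obj_h, obj_w, stride=32):
    """Approximate number of tokens covered by an object of this pixel size."""
    return (obj_h / stride) * (obj_w / stride)


total_tokens = feature_tokens(800, 1216)    # hypothetical input resolution
small_object = tokens_for_object(32, 32)    # a 32x32 pixel object
large_object = tokens_for_object(320, 320)  # a 320x320 pixel object
print(total_tokens, small_object, large_object)
```

Under these assumptions the small object is summarized by roughly a single token out of several hundred, while the large one spans about a hundred.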

A few questions:

How do Faster R-CNN/YOLO assign predictions to G.T. boxes? And DETR?
- Faster R-CNN/YOLO:
  - Roughly, for each G.T. box, every candidate box (anchor or proposal) that overlaps it with IoU > threshold (e.g. 0.7) is assigned to that G.T. box.
  - Thus, one G.T. box can have multiple matched predictions (one-to-many).
  - This is why non-max suppression is needed to eliminate redundant predictions.
- DETR:
  - Bipartite matching assigns each G.T. box to exactly one prediction (one-to-one), so NMS is not needed.
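The one-to-many assignment and the NMS step described for Faster R-CNN/YOLO can be sketched as follows. This is a simplified illustration in plain Python, not the actual training code of either detector: the function names, thresholds, boxes, and scores are all assumptions made for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_by_iou(candidates, gt_boxes, threshold=0.7):
    """One-to-many: every candidate with IoU > threshold goes to that G.T. box."""
    return {j: [i for i, c in enumerate(candidates) if iou(c, g) > threshold]
            for j, g in enumerate(gt_boxes)}


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


gt_boxes = [(10, 10, 50, 50)]
candidates = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51), (60, 60, 90, 90)]
scores = [0.9, 0.8, 0.7, 0.6]  # made-up confidence scores

assigned = assign_by_iou(candidates, gt_boxes)
kept = nms(candidates, scores)
print(assigned, kept)
```

Three candidates are matched to the single G.T. box, so at inference time the detector tends to emit near-duplicate boxes, and NMS is what reduces them to one.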
Why don't Faster R-CNN/YOLO have the permutation-invariant property that DETR has?
- In earlier detectors, each predicted box is tied to a fixed pixel location (an anchor or grid cell), so the position of an output in the list carries meaning.
Why is permutation invariance an advantage?
Why can't traditional methods also use bipartite matching to compute the loss?
- CNN-based methods have a "locality" inductive bias. Bipartite matching is a global operation, which is at odds with that property.
- DETR is a Transformer architecture. It is based on the attention mechanism and has a more global view than a CNN, which makes it better suited to using bipartite matching to assign predictions to G.T. boxes and compute the loss.