DETR (Detection Transformer)

DETR (Detection Transformer) is a 2020 architecture from Facebook AI Research (now Meta AI) that reformulated object detection as a direct set prediction problem, replacing the complex post-processing pipelines of traditional detectors with a clean end-to-end design. Instead of generating thousands of proposals and filtering them with Non-Maximum Suppression (NMS), DETR uses a fixed set of learned "object queries" that attend to the image features through a transformer decoder and directly output the final set of detections.
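The query-based decoding described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the real DETR: the class `MiniDETRHead` and all dimensions are invented for the example, and the actual model uses a ResNet backbone, a transformer encoder, positional encodings, and auxiliary losses that are omitted here.

```python
import torch
import torch.nn as nn

class MiniDETRHead(nn.Module):
    """Toy sketch of DETR's query-based decoding (illustrative only)."""
    def __init__(self, d_model=64, num_queries=10, num_classes=5, nhead=4):
        super().__init__()
        # A fixed set of learned object queries -- each one can become a detection
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized to [0, 1]

    def forward(self, img_feats):
        # img_feats: (batch, H*W, d_model) -- flattened backbone feature map
        b = img_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, img_feats)  # queries cross-attend to image features
        return self.class_head(h), self.box_head(h).sigmoid()

feats = torch.randn(2, 49, 64)        # e.g. a 7x7 feature map, flattened
logits, boxes = MiniDETRHead()(feats)
print(logits.shape, boxes.shape)      # one class score vector and box per query
```

Each query yields exactly one (class, box) prediction, so the model's output is a fixed-size set rather than a variable number of filtered proposals.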

The training process uses Hungarian matching — a bipartite assignment algorithm that pairs each prediction with a ground-truth object (or "no object") to compute the loss. This one-to-one matching eliminates duplicate detections by design, removing the need for NMS. The original DETR was slow to converge (requiring around 500 training epochs) and struggled with small objects. Deformable DETR fixed this by using deformable attention — attending to a small, learned set of sampling points instead of every pixel — cutting the required training epochs by roughly 10x.
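Hungarian matching itself is just an optimal one-to-one assignment over a cost matrix, and can be computed with `scipy.optimize.linear_sum_assignment`. The sketch below is a toy example with made-up numbers: the cost combines a classification term and an L1 box term (the real DETR matcher also adds a generalized IoU term and different weights).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy setup: 4 predictions, 2 ground-truth objects (boxes are cx, cy, w, h)
pred_boxes = np.array([[0.2, 0.2, 0.1, 0.1],
                       [0.8, 0.8, 0.2, 0.2],
                       [0.5, 0.5, 0.3, 0.3],
                       [0.1, 0.9, 0.1, 0.1]])
pred_probs = np.array([[0.9, 0.1],   # per-prediction class probabilities
                       [0.2, 0.8],
                       [0.5, 0.5],
                       [0.6, 0.4]])
gt_boxes = np.array([[0.25, 0.20, 0.1, 0.1],
                     [0.80, 0.75, 0.2, 0.2]])
gt_labels = np.array([0, 1])

# cost[i, j] = cost of assigning prediction i to ground truth j:
# low cost when the predicted class probability is high and the boxes are close
cls_cost = -pred_probs[:, gt_labels]
box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
cost = cls_cost + 5.0 * box_cost     # the 5.0 weight is illustrative

rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
print(list(zip(rows.tolist(), cols.tolist())))  # matched (prediction, gt) pairs
```

Matched predictions are supervised with the paired ground truth; all unmatched predictions are trained to output "no object", which is what suppresses duplicates without NMS.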

The DETR family has since expanded: RT-DETR (Baidu, 2023) achieved real-time inference speeds competitive with YOLO, D-FINE improved detection accuracy through fine-grained distribution refinement, and the architecture influenced SAM's mask decoder design. DETR proved that transformers could handle detection without hand-designed components like anchors or NMS.
