Bounding box

A bounding box is a rectangle, typically axis-aligned, drawn around an object in an image to mark where the object is located. In its simplest form, it is specified by four numbers: the x and y coordinates of the top-left corner plus the width and height, or alternatively the coordinates of two opposite corners. Bounding boxes are the most common annotation type for object detection tasks.
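The two representations described above carry the same information, so converting between them is simple arithmetic. The sketch below is only illustrative: the function names and the tuple layouts (x, y, width, height) and (x1, y1, x2, y2) are assumptions made for this example, and it assumes image coordinates with the origin at the top-left.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2) corner format."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corner format to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


# A box whose top-left corner is at (10, 20), 30 pixels wide and 40 tall.
corners = xywh_to_xyxy((10, 20, 30, 40))
```

Annotation tools and datasets differ in which layout they store, so it is worth checking the convention before mixing boxes from different sources.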

Detection models output bounding boxes, each with an associated class label and confidence score. During training, the model learns to predict boxes that closely overlap with the ground-truth annotations. The quality of this overlap is measured using Intersection over Union (IoU): the area of the intersection of the predicted and ground-truth boxes divided by the area of their union, that is, their combined area with the overlapping region counted only once. An IoU of 0.5 or higher is a common threshold for considering a detection correct, though some benchmarks also report results at stricter thresholds such as 0.75.
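As a rough illustration of the IoU calculation, here is a minimal sketch. It assumes both boxes are axis-aligned and given in corner format (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; the function name `iou` is chosen for this example and is not taken from any particular library.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region, clamped to zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the overlap, so the shared region is counted once.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate zero-area boxes.
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes offset horizontally by 5: overlap is 50, union is 150.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

In this example the score is 1/3, which would fall below a 0.5 threshold even though half of each box overlaps the other.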

Bounding boxes are fast to annotate (roughly 10 to 30 seconds per object) and sufficient for many applications, such as object counting, tracking, and coarse localization. Their limitation is that a box drawn around a non-rectangular object also includes background pixels, which matters for tasks that require precise shape information. For those cases, polygon annotations or pixel-level segmentation masks are used instead.

Resources