A Historical Breakdown of YOLO: A Landmark Model in Object Detection

Object Detection and YOLO

Object detection is a subfield of computer vision that involves the identification and localization of objects in images or videos with a certain degree of confidence. Generally, an annotated bounding box is drawn around each given object, which can provide valuable information to the viewer about the object’s nature and location in the scene.
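
To make this concrete, the sketch below shows the kind of output a detector produces - a class label, a confidence score, and box coordinates - and how such a box is typically drawn onto an image with OpenCV. The detection values and file names are purely illustrative.

# A toy detection result and how it might be rendered. Coordinates are
# in pixels as (x1, y1, x2, y2); the values are made up for illustration.
import cv2

detection = {"label": "dog", "confidence": 0.91, "box": (50, 40, 210, 300)}

image = cv2.imread("example.jpg")  # hypothetical input image
x1, y1, x2, y2 = detection["box"]
cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(image, f'{detection["label"]} {detection["confidence"]:.2f}',
            (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.imwrite("annotated.jpg", image)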

Through the lens of history, object detection can be divided into traditional methods, which were largely developed in the 1990s and early 2000s, and deep learning methods, which started to have significant breakthroughs soon after the world witnessed the rebirth of convolutional neural networks (CNNs) in 2012. 

A further distinction can be drawn between one-stage and two-stage object detection systems. One-stage detectors predict the object’s bounding box and its class probabilities in a single pass, whereas two-stage detectors first generate region proposals - candidate bounding boxes for potential objects - and only then perform object classification within those proposed regions. Due to this additional region proposal step, two-stage detectors are typically slower than their one-stage counterparts; however, they tend to be more accurate.

YOLO, or You Only Look Once, was the first one-stage object detector in the deep learning era, proposed by Redmon et al. in 2015. Unlike previous approaches, it predicts bounding boxes and class probabilities from the full image with a single neural network, dramatically increasing speed. Furthermore, this single-network approach allows the model to be optimized end-to-end directly on detection performance. Since its introduction in 2015, YOLO has seen continuous improvements in prediction accuracy and computational efficiency, which will be outlined in the following section.

Historical Breakdown

The initial version, YOLOv1, was based on a CNN that predicts both bounding boxes and class probabilities of objects on a full image in one pass. This approach resulted in extremely fast inference, enabling real-time object detection at 45 frames per second (fps) with a mean average precision (mAP) of 63.4. A smaller version of this model, termed Fast YOLO, could process an astonishing 155 fps while still achieving double the mAP of other real-time object detectors. YOLOv1 is also less likely to predict false positives, i.e. to predict objects where none exist. Its main disadvantage compared to two-stage detectors is that it makes more localization errors due to its simplified neural architecture.
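
For readers who want a feel for what “one pass” means in practice, here is a simplified sketch of decoding a YOLOv1-style output tensor, assuming the paper’s configuration of a 7 x 7 grid, 2 boxes per cell, and 20 classes. The threshold and decoding details are illustrative rather than a faithful reimplementation.

# Simplified sketch of decoding a YOLOv1-style output tensor.
# The paper uses S=7 grid cells, B=2 boxes per cell, and C=20 classes,
# giving a 7 x 7 x 30 prediction per image.
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for a network output

for row in range(S):
    for col in range(S):
        cell = pred[row, col]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
            # class-specific confidence = box confidence * class probability
            scores = conf * class_probs
            best = int(np.argmax(scores))
            if scores[best] > 0.5:  # illustrative threshold
                # (x, y) are offsets within the cell; (w, h) are relative
                # to the whole image
                cx, cy = (col + x) / S, (row + y) / S
                print(f"cell ({row},{col}) box {b}: class {best}, "
                      f"score {scores[best]:.2f}, center ({cx:.2f},{cy:.2f})")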

YOLOv2 was released by Redmon & Farhadi in 2016 in a paper titled “YOLO9000: Better, Faster, Stronger”. The 9000 signified its ability to detect over 9000 object categories. It sported an incredible 78.6 mAP at 40 fps - an improvement of almost 24% over YOLOv1. This precision boost was achieved mainly through three factors: (1) the introduction of anchor boxes, predefined box shapes that serve as reference priors for the positions and sizes of the objects to be detected, (2) the adoption of batch normalization, and (3) multi-scale training - the random resizing of the network’s input resolution throughout the training process.
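
The following minimal sketch illustrates the anchor-box idea using the decoding equations from the YOLO9000 paper: the network predicts offsets relative to a predefined prior of width pw and height ph, anchored at grid cell (cx, cy). All numeric values here are illustrative.

# Sketch of YOLOv2-style anchor-box decoding for one grid cell.
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network offsets to a box, following the YOLO9000 paper."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)    # center x, constrained to stay inside the cell
    by = cy + sigmoid(ty)    # center y, constrained to stay inside the cell
    bw = pw * math.exp(tw)   # width, scaled from the anchor prior
    bh = ph * math.exp(th)   # height, scaled from the anchor prior
    return bx, by, bw, bh

print(decode_box(0.2, -0.1, 0.5, 0.3, cx=3, cy=4, pw=1.8, ph=2.4))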

Two years later, the same authors introduced yet another version in a paper titled “YOLOv3: An Incremental Improvement”. While this model was a little bigger than the previous ones, its accuracy saw a marked increase and its inference speed remained sufficiently fast. The authors report that at 320 x 320, YOLOv3 runs in 22 ms at 28.2 mAP. The new architecture is composed of 75 convolutional layers and uses no fully connected or pooling layers, which helped reduce the model’s size and weight. The main contributor to increased accuracy, however, was the adoption of a feature extractor called a feature pyramid network, or FPN, which improved the detection of objects at different sizes by combining feature maps from multiple levels of the CNN. All of this is achieved while still maintaining real-time performance.
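
As a rough illustration of the FPN idea, the sketch below merges three backbone feature maps through a top-down pathway in PyTorch. Note that YOLOv3 itself concatenates upsampled features with earlier maps rather than adding them, and the channel sizes here are assumptions for demonstration only.

# Minimal sketch of a feature pyramid: coarse, semantically rich maps
# are upsampled and merged with finer, higher-resolution ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every level to a common width
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, c3, c4, c5):
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2)
        return p3, p4, p5  # detection heads then run on each scale

fpn = TinyFPN()
c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
              ((256, 52), (512, 26), (1024, 13)))
print([p.shape for p in fpn(c3, c4, c5)])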

After another two years, YOLOv4 was released, but this time by different authors. Bochkovskiy et al. (2020) introduced this version in a paper titled “YOLOv4: Optimal Speed and Accuracy of Object Detection”. This version further improved the speed and accuracy of object detection, performing at a speed of ~65 fps at 43.5 mAP on the COCO dataset. It achieves this through the application of various data augmentation techniques, most notably mosaic data augmentation, whereby multiple images are combined into a single image which is then used for training, along with experimentation with different bounding box regression types, regularization, and cross mini-batch normalization.
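
A toy version of mosaic augmentation might look as follows: four images are tiled into one composite. Real implementations also randomize the mosaic center and remap each image’s bounding box annotations, which is omitted here for brevity.

# Toy sketch of mosaic augmentation: four training images tiled 2x2.
import numpy as np

def simple_mosaic(images, out_size=416):
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        # naive nearest-neighbour resize via index sampling
        ys = np.arange(half) * img.shape[0] // half
        xs = np.arange(half) * img.shape[1] // half
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas

imgs = [np.random.randint(0, 255, (320, 320, 3), dtype=np.uint8)
        for _ in range(4)]
print(simple_mosaic(imgs).shape)  # (416, 416, 3)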

Only a few days after Bochkovskiy et al. released YOLOv4, YOLOv5 was released by the company Ultralytics. It achieves a remarkable mAP of 55.6 on the COCO dataset while also requiring less computational power than its predecessors. It comes with a variety of improvements, including a shift of the codebase from C to PyTorch, which facilitates better data augmentation and loss calculations, the auto-learning of anchor boxes, and the use of path aggregation networks (PANs) in the neck of the model. Furthermore, instead of CFG files, this version supports YAML files, which significantly enhance the readability and overall layout of the model’s configuration files.
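
Thanks to the PyTorch base, pretrained YOLOv5 models can be loaded in a couple of lines via PyTorch Hub, mirroring the usage documented in the Ultralytics repository at the time of writing; the image URL is simply the repository’s own example.

# Load a pretrained YOLOv5 model from PyTorch Hub and run inference.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("https://ultralytics.com/images/zidane.jpg")
results.print()          # per-class detection summary
print(results.xyxy[0])   # boxes as (x1, y1, x2, y2, confidence, class)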

In 2021, the YOLO family grew by yet another member with the release of YOLOX by Ge et al. (2021). Unlike previous YOLO architectures, which use anchor-based detection, YOLOX was implemented using an anchor-free method and an overall simpler design. Some of its key improvements include a decoupled head, a modified backbone network (Darknet-53), the addition of Mosaic and MixUp to its augmentation strategies, and SimOTA - an advanced label assignment methodology. The best accuracy is attained with the extra-large model version, YOLOX-x, which sports an mAP of 51.5 on the COCO dataset.
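
The sketch below shows a decoupled head in the spirit of YOLOX: classification and box regression run through separate convolutional branches instead of a single shared one, with the objectness prediction sharing the regression branch. Channel counts and layer depths are illustrative, not YOLOX’s exact configuration.

# Sketch of a decoupled detection head: separate branches for
# classification and for box regression/objectness.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, 1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, num_classes, 1),       # class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.box_out = nn.Conv2d(in_channels, 4, 1)       # box offsets
        self.obj_out = nn.Conv2d(in_channels, 1, 1)       # objectness

    def forward(self, x):
        x = self.stem(x)
        reg = self.reg_branch(x)
        return self.cls_branch(x), self.box_out(reg), self.obj_out(reg)

head = DecoupledHead()
cls, box, obj = head(torch.randn(1, 256, 20, 20))
print(cls.shape, box.shape, obj.shape)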

Last year, YOLOv6 was released by Li et al. (2022) in a paper titled “YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications”. This version is able to achieve 43.5 mAP at 495 fps, with the quantized version achieving a similar AP at an astounding 869 fps. As is the case with YOLOX, YOLOv6 was implemented using an anchor-free method, making it 51% faster than most anchor-based object detectors. Furthermore, it uses a revised backbone called EfficientRep, along with varifocal loss (VFL) for class probability prediction and distribution focal loss (DFL) for bounding box regression.
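
For a flavor of the classification loss, here is a hedged sketch of varifocal loss following the VarifocalNet formulation that YOLOv6 adopts: positive predictions are weighted by their IoU-derived target quality q, while negatives are down-weighted by the prediction itself. The hyperparameters and inputs are illustrative.

# Sketch of varifocal loss. p: predicted score in (0, 1);
# q: target quality (IoU for positives, 0 for negatives).
import torch

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    pos = q > 0
    bce = -(q * torch.log(p) + (1 - q) * torch.log(1 - p))
    # positives weighted by quality q; negatives by alpha * p^gamma
    weight = torch.where(pos, q, alpha * p.pow(gamma))
    return (weight * bce).mean()

p = torch.tensor([0.8, 0.3, 0.1])
q = torch.tensor([0.9, 0.0, 0.0])   # first prediction matched, IoU 0.9
print(varifocal_loss(p, q))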

In the same year, YOLOv7 was released by Wang et al. (2022) in a paper titled “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”. This version surpasses all other object detectors in both speed and accuracy, sporting a remarkable mAP of 56.8% at 36 fps. The goal of this version was to predict bounding boxes more accurately than its peers while maintaining similar inference speeds. This was achieved by implementing a more efficient layer aggregation network (E-ELAN), enhanced model scaling and re-parameterization techniques, and coarse-to-fine auxiliary head supervision, whereby supervision from the lead head is passed back to auxiliary heads at different granularities in order to overcome the otherwise long distance between the network’s head and its shallow layers.

At the time of writing, the latest and highest performing model of the YOLO family is YOLOv8 by Ultralytics. Its medium version, YOLOv8m, achieves an accuracy of 50.2 mAP on the COCO dataset. Apart from high accuracy and efficiency, however, this version stands out compared to its predecessors through its enhanced ease of use. Specifically, YOLOv8 comes with a CLI that makes model training more intuitive and a Python package that provides a more seamless coding experience for developers. Since no research paper has been published along with the model release, there is no direct insight into the underlying methodology and experimental studies. However, based on the GitHub repository, the following can be deduced: (1) YOLOv8 is an anchor-free detection model, meaning that it directly predicts the center of an object as opposed to the offset from an anchor box, (2) it contains an overhauled convolution module, and (3) modifications were made to the originally proposed mosaic augmentations. 
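
As an example of the improved developer experience, training and inference with the Ultralytics Python package take only a few lines, following the usage documented in the repository at the time of writing (the same workflow is also exposed through the yolo command-line tool):

# Fine-tune and run a pretrained YOLOv8 model with the Ultralytics
# package. 'coco128.yaml' is the small example dataset that ships
# with the package.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                                 # pretrained weights
model.train(data="coco128.yaml", epochs=3)                 # brief fine-tune
results = model("https://ultralytics.com/images/bus.jpg")  # inference
print(results[0].boxes)                                    # detected boxes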

Overall, since their inception, the YOLO models have been building upon their predecessors by continuously improving the network architecture, introducing novel data augmentation methodologies, and enhancing the accuracy of bounding box predictions. These iterative refinements, yielding ever more accurate, faster, and easier-to-use models, have earned the family enormous traction within the realm of object detection, and as a result, YOLO models have been adopted in numerous real-world applications. However, there remain challenges to overcome in the future, such as the detection of small or overlapping objects in complex scenery, and the high memory usage that these models often require.

Real-World Applications

Since YOLO was first released in 2015, it has been applied across an ever-growing multitude of domains. For instance, YOLO has been adopted by police forces and various surveillance systems to detect potential suspects and objects, which in turn enables the systematic tracking of their movements. In addition, it is being applied for research purposes such as the tracking of wildlife and the identification of endangered species. Most notably, it extends to autonomous driving: Tesla AI currently uses it to detect people and objects around its vehicles, enabling them to make autonomous decisions based on their surrounding environment.

Before the advent of the YOLO models, these tasks had to be performed using rule-based or traditional, less effective computer vision techniques. Examples include motion detection, object tracking, and background subtraction, all of which relied heavily on hand-crafted features and required extensive manual tuning. Autonomous vehicles at that time relied mainly on techniques such as edge detection or Hough transforms to detect lane markings and template matching to detect traffic signs. In addition, classical machine learning models such as Support Vector Machines or Random Forests were employed to detect and classify objects like pedestrians or vehicles. However, these approaches often lacked the accuracy, efficiency, and generalization capabilities that we see today with YOLO models.

Conclusion

The YOLO family has made significant contributions to the domain of object detection. Since their inception in 2015, the models have undergone several iterations, each demonstrating considerable improvements in prediction accuracy and computational efficiency. While they do have some limitations, such as the challenge of detecting small or overlapping objects, we can confidently assume that these are currently being worked on and will be addressed in future releases.

At Datature we are committed to facilitating a seamless end-to-end machine learning experience for first-timers and advanced developers alike, such that the user experience remains simple and intuitive while performance and technical depth are never compromised. Currently, we support the YOLOv4 and YOLOX models, and we plan to keep updating our platform offerings for computer vision models in the YOLO family to ensure that our model performance remains cutting-edge. Ready to get started? Begin training your model on Datature Nexus today!
