YOLO

YOLO (You Only Look Once) is a family of real-time object detection models that predict bounding boxes and class labels in a single forward pass through a neural network. Joseph Redmon and his co-authors introduced YOLO in 2015, and the core idea was simple: instead of scanning an image multiple times with sliding windows or region proposals, process the whole image once and output all detections at the same time. That one-pass design made YOLO fast enough to run on live video, which changed how engineers thought about deploying detection models in production.

Over the past decade, the YOLO family has grown through more than a dozen versions, each one pushing the boundary between speed and accuracy. The architecture has gone from a research prototype to the default choice for real-time detection across industries: factory inspection lines, autonomous vehicles, drone analytics, retail cameras, and agricultural robotics. For a full version-by-version walkthrough, see our historical breakdown of the YOLO family.

How YOLO Works

Every YOLO model follows the same basic pattern. The input image is divided into a grid of cells. Each cell is responsible for predicting a fixed number of bounding boxes along with the probability that an object of each class is present. The network processes the entire image in one shot (hence "You Only Look Once") and outputs a tensor containing all predicted boxes, confidence scores, and class probabilities.

The architecture has three main components:

Backbone. A convolutional neural network that extracts visual features from the raw image. Early YOLO versions used custom backbones (Darknet-19, Darknet-53). Later versions adopted more efficient designs: CSPDarknet in YOLOv5, and modified CSPNet variants in YOLOv8 through YOLO26. The backbone produces feature maps at multiple resolutions, capturing both fine-grained details (edges, textures) and high-level patterns (shapes, object parts).

Neck. A feature aggregation module that combines feature maps from different backbone layers. Why does this matter? The backbone produces features at multiple scales, but the detection head needs all of them at once. Low-resolution maps capture large objects well. High-resolution maps catch small ones. The neck merges both. FPN, PANet, and BiFPN are the most common neck designs across YOLO versions.

Head. The detection head takes the aggregated features and predicts bounding box coordinates, objectness scores, and class probabilities for each grid cell. How the head makes those predictions has changed over time. Anchor-based versions (YOLOv3 through YOLOv7) predict offsets relative to pre-defined reference boxes. YOLOv8 dropped anchors and predicts coordinates directly. YOLO26 goes further still: it also eliminates Non-Maximum Suppression, so the head's raw output is the final result. No post-processing step needed.
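The grid-cell bookkeeping underneath all of these heads can be sketched in a few lines. This is a toy illustration of the YOLOv1-style convention (box centres relative to the cell, box sizes relative to the whole image) with made-up numbers, not any particular version's exact parameterization:

```python
# Toy decode of a YOLO-style head output for one grid cell.
# Assumes an SxS grid over a square image; each cell predicts
# (x, y, w, h, objectness) with x, y relative to its cell and
# w, h relative to the full image -- the YOLOv1-style convention.

def decode_cell(pred, row, col, grid_size, img_size):
    """Map one cell's prediction to a pixel-space box (cx, cy, w, h, obj)."""
    x, y, w, h, obj = pred
    cell = img_size / grid_size
    cx = (col + x) * cell          # cell-relative -> image x
    cy = (row + y) * cell          # cell-relative -> image y
    return cx, cy, w * img_size, h * img_size, obj

# Cell (1, 1) of a 2x2 grid on a 640px image predicts a box
# centred in the middle of that cell, covering half the image.
box = decode_cell((0.5, 0.5, 0.5, 0.5, 0.9), row=1, col=1,
                  grid_size=2, img_size=640)
print(box)  # (480.0, 480.0, 320.0, 320.0, 0.9)
```

The real heads predict many boxes per cell across several grid resolutions, but the coordinate mapping is the same idea repeated at scale.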

The YOLO Timeline

The YOLO family is not a single project with a linear development path. Different research groups and companies have forked, extended, and reimagined the architecture at different points. Here are the versions that matter most.

YOLOv1 (2015)

Joseph Redmon's original paper reframed object detection as a single regression problem. The model divided the image into a 7x7 grid, predicted 2 bounding boxes per cell, and ran at 45 FPS on a GPU. It was fast but inaccurate by modern standards: 63.4 mAP on Pascal VOC. The key insight was that detection did not require the expensive two-stage pipeline of R-CNN. One network, one pass, real-time speed.

YOLOv2 / YOLO9000 (2016)

Introduced batch normalization, anchor boxes (borrowed from Faster R-CNN), multi-scale training, and a new backbone (Darknet-19). YOLO9000 could detect over 9,000 object categories by training jointly on detection and classification datasets. Accuracy jumped to 78.6 mAP on VOC.

YOLOv3 (2018)

Multi-scale detection arrived. YOLOv3 predicted at three different resolutions using a Feature Pyramid Network, which meant small objects that earlier versions missed were now detectable. Darknet-53 replaced Darknet-19 as the backbone, and the model now predicted across 10,647 anchor boxes per image. This was Redmon's last contribution. He left computer vision research shortly after, troubled by the military applications of the technology he had created.

YOLOv4 (2020)

With Redmon gone, the community took over. Alexey Bochkovskiy led this release. YOLOv4 introduced a bag of training tricks: mosaic augmentation (stitching four images into one training sample), CIoU loss for better box regression, cross-stage partial connections (CSP) in the backbone, and spatial pyramid pooling. These additions improved accuracy without sacrificing much speed. YOLOv4 also introduced the "bag of freebies" and "bag of specials" framework for categorizing detection improvements.

YOLOv5 (2020)

Released by Ultralytics, YOLOv5 was the first YOLO implemented entirely in PyTorch (previous versions used the Darknet framework in C). It was not a paper but an engineering product: a polished training pipeline with automatic anchor calculation, hyperparameter evolution, model export to ONNX/TensorRT/CoreML, and a range of model sizes (Nano, Small, Medium, Large, XLarge). YOLOv5 became the most widely deployed YOLO version because it was easy to train, easy to export, and well-documented. Many production systems still run YOLOv5 today.

YOLOv7 (2022)

Published by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, YOLOv7 introduced E-ELAN (Extended Efficient Layer Aggregation Network) for better gradient flow and compound model scaling. It briefly held the accuracy-speed crown, pushing past 56 AP on COCO at real-time speeds.

YOLOv8 (2023)

Three changes defined this Ultralytics release. The biggest: anchor-free detection. Instead of adjusting pre-defined reference boxes, the model now predicts box centers and dimensions directly, which eliminates an entire class of hyperparameters that teams previously had to tune per dataset. Second, a decoupled head that separates classification and regression into independent branches, which improves both tasks. Third, native support for multiple CV tasks beyond detection: instance segmentation, classification, pose estimation, and oriented bounding boxes, all sharing the same backbone and training pipeline.

YOLOv8 also introduced the Ultralytics Python package and CLI, making model training a one-liner: yolo train model=yolov8n.pt data=coco.yaml. For a hands-on guide, see our tutorial on training YOLOv8 on a custom dataset.

YOLOv9 (2024)

Introduced Programmable Gradient Information (PGI) and GELAN (Generalized Efficient Layer Aggregation Network). PGI addresses the information bottleneck problem in deep networks by using auxiliary reversible branches during training that preserve gradient information through the full network depth. Better accuracy at the same parameter count, with no extra inference cost. Our YOLOv9 guide covers the architecture and fine-tuning workflow.

YOLO11 (2024)

Ultralytics focused this release on modular architecture and efficiency. YOLO11 refined the backbone with attention mechanisms and improved small-object recall, while the export pipeline for edge runtimes got faster and more portable. Backward compatibility with the YOLOv8 training API held, so existing codebases required minimal changes to adopt it.

YOLO26 (January 2026)

This is the biggest architectural overhaul since YOLOv8. Five interconnected changes, all pointing in one direction: edge-first design.

NMS-free inference. Non-Maximum Suppression is gone. YOLO26 uses one-to-one label assignment during training so each ground truth object maps to exactly one prediction. The model outputs clean, non-redundant detections with no post-processing. This means deterministic latency regardless of how many objects appear in the scene, simpler deployment pipelines, and no IoU threshold tuning.
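For contrast, here is what the removed step looks like: a minimal greedy NMS in plain Python. Every earlier YOLO needed something like this after the forward pass; YOLO26's one-to-one assignment makes it unnecessary (this is a simplified single-class sketch, not any framework's actual implementation):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of the same object plus one distinct box.
boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the duplicate at index 1 is suppressed
```

Note the data-dependent loop: the more overlapping detections in a scene, the more IoU comparisons, which is exactly why NMS latency varies with scene content and why removing it makes latency deterministic.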

DFL removal. Distribution Focal Loss (the bounding box regression approach from YOLOv8) is replaced with direct linear regression. This removes Softmax operations from the head, simplifies ONNX and TensorRT exports, and improves large-object detection by removing fixed regression range limits.

MuSGD optimizer. A hybrid of classical SGD with momentum and Muon (an optimizer from the large language model community). MuSGD stabilizes training for smaller model variants where gradient noise is more problematic, reducing the need for extensive hyperparameter sweeps.

Small-Target-Aware Label Assignment (STAL). A dynamic IoU threshold that scales with object size, giving small objects a better chance of being matched to prediction anchors during training. This directly improves recall on objects under 32x32 pixels.
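The exact STAL formula has not been published, so treat the following as a hypothetical sketch of the idea only: a matching threshold that relaxes for objects smaller than a reference size (the 32x32-pixel figure above) and saturates at the base threshold for larger ones. Every name and constant here is illustrative:

```python
def dynamic_iou_thresh(box_w, box_h, base=0.5, floor=0.25, ref=32.0):
    """Hypothetical size-scaled matching threshold: relax the IoU
    requirement linearly for objects smaller than ref x ref pixels."""
    size = (box_w * box_h) ** 0.5    # geometric mean side length
    scale = min(size / ref, 1.0)     # in [0, 1], saturates at ref
    return floor + (base - floor) * scale

print(dynamic_iou_thresh(64, 64))  # 0.5   -- large objects keep the base threshold
print(dynamic_iou_thresh(16, 16))  # 0.375 -- small objects match more easily
```

The design intuition is that a few pixels of misalignment destroy the IoU of a tiny box far faster than a large one, so a fixed threshold systematically starves small objects of positive training matches.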

Progressive Loss (ProgLoss). A training schedule that shifts focus from coarse localization to fine-grained box refinement as training progresses, producing tighter bounding boxes without additional inference cost.

The numbers tell the story. YOLO26n runs 43% faster on CPU than YOLO11n while gaining +1.4 mAP. YOLO26x hits 57.5 mAP on COCO at 11.8ms on a T4 GPU. Six tasks out of one family: detection, segmentation, classification, pose estimation, OBB, and open-vocabulary segmentation. Datature Nexus supports training and deploying YOLO26 through its visual interface.

YOLO vs Other Detection Architectures

YOLO is not the only way to do object detection. Where does it fit?

Two-stage detectors (Faster R-CNN, Cascade R-CNN) generate region proposals first, then classify each one. More accurate on small objects and crowded scenes, but slower. Use them when accuracy matters more than speed and your GPU has headroom to spare.

Transformer-based detectors (D-FINE, RT-DETR, RF-DETR) use learned object queries and attention mechanisms to predict detections without anchors or NMS. RF-DETR (2026) broke 60 AP on COCO at real-time speed. These models are closing the gap with YOLO on speed while often surpassing it on accuracy, especially for complex scenes with many overlapping objects.

Other one-stage detectors (SSD, RetinaNet, EfficientDet) share YOLO's single-pass design but differ in the details. RetinaNet brought focal loss. EfficientDet brought compound scaling. Neither built the kind of ecosystem that makes YOLO sticky: community support, model variety from nano to XL, and export tooling for every runtime.

Why does YOLO dominate production? Not architecture alone. The Ultralytics package handles training, validation, export to 10+ formats, and inference in one unified API. That operational convenience matters as much as raw mAP numbers when a team has two weeks to ship.

Training a YOLO Model on Custom Data

Training YOLO on your own dataset follows a consistent workflow regardless of which version you use.

1. Collect and annotate data. Gather images that match production conditions, and label every object with a bounding box and class. Quality beats quantity: 500 well-annotated images often outperform 5,000 sloppy ones. Image augmentation (flips, rotations, mosaic, color jitter) stretches the training set further by exposing the model to variations it hasn't seen.
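Most YOLO tooling expects labels in the YOLO txt format: one line per object, with a class index followed by the box centre and size, all normalized to [0, 1] by the image dimensions. A small parser shows the convention (the label values and image size here are illustrative):

```python
def parse_yolo_label(line, img_w, img_h):
    """Parse one line of a YOLO-format .txt label file:
    'class x_center y_center width height', coords normalised to [0, 1]."""
    cls, x, y, w, h = line.split()
    x, w = float(x) * img_w, float(w) * img_w
    y, h = float(y) * img_h, float(h) * img_h
    # Convert centre/size to the corner format most viewers expect.
    return int(cls), (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

cls, box = parse_yolo_label("0 0.5 0.5 0.25 0.25", img_w=640, img_h=640)
print(cls, box)  # 0 (240.0, 240.0, 400.0, 400.0)
```

Because coordinates are normalized, the same label file remains valid when images are resized during training.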

2. Choose a model size. Every YOLO generation ships in multiple sizes, from Nano (runs on a Raspberry Pi) through Small (phones), Medium (GPU inference default), up to Large and XLarge (maximum accuracy when compute is not a constraint). Start with Nano or Small for prototyping. Benchmark on your data. Scale up only if the accuracy gain justifies the added latency and cost.

3. Start from pre-trained weights. Always fine-tune from COCO pre-trained checkpoints. Training from scratch throws away thousands of GPU-hours of learned visual features: edges, textures, shapes, object parts. Fine-tuning adapts those features to your domain. With under 1,000 images, pre-trained weights are the difference between a working model and one that outputs noise.

4. Train and monitor. Watch the training graphs: box loss, classification loss, and validation mAP. If validation loss starts rising while training loss keeps falling, the model is overfitting. Reduce training epochs, add augmentation, or collect more data.

5. Evaluate. Use mAP@0.50 and mAP@[.50:.95] on a held-out test set. Check per-class AP to find classes the model struggles with. Inspect false positives and false negatives visually. A confusion matrix reveals which classes the model confuses with each other. Eigen-CAM visualizations show which image regions the model focuses on, helping you diagnose whether it is looking at the right features.
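Underneath every detection metric is the same primitive: matching predictions to ground truth at an IoU threshold. A toy matcher makes the TP/FP/FN counts concrete (this greedy, single-class version is a simplification of the full COCO protocol):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def match_detections(preds, gts, thresh=0.5):
    """Greedy matching at a fixed IoU threshold: each ground-truth box
    can satisfy at most one prediction. Returns (TP, FP, FN)."""
    unmatched = list(gts)
    tp = 0
    for p in sorted(preds, key=lambda d: -d[1]):   # highest score first
        hit = next((g for g in unmatched if iou(p[0], g) >= thresh), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)

gts = [(0, 0, 100, 100), (200, 200, 300, 300)]
preds = [((2, 2, 98, 98), 0.9), ((400, 400, 500, 500), 0.8)]
print(match_detections(preds, gts))  # (1, 1, 1): one hit, one false alarm, one miss
```

Precision is TP / (TP + FP) and recall is TP / (TP + FN); sweeping the confidence threshold and averaging precision over recall levels is what produces the AP numbers above.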

Multi-Task Capabilities

Starting with YOLOv8, the YOLO family expanded beyond pure object detection. YOLO26 supports six tasks from a single framework:

Object detection: bounding boxes and class labels. The core task.

Instance segmentation: pixel-level masks for each detected object. Useful when you need precise boundaries rather than rectangles, like measuring the area of a defect or segmenting overlapping cells under a microscope.

Image classification: whole-image labeling. YOLO's detection backbone transfers to classification tasks with minimal fine-tuning, since it already encodes rich visual features from millions of training examples.

Pose estimation: predicting keypoint locations (body joints, facial landmarks) for applications like fitness tracking, ergonomic monitoring, sign-language recognition, and sports analytics.

Oriented bounding boxes (OBB): rotated rectangles for objects at arbitrary angles, most commonly used in aerial imagery where nothing is axis-aligned, but also useful for document layout analysis and conveyor-belt inspection.

Open-vocabulary segmentation: new in YOLO26. The model segments objects described by text prompts without retraining, bridging YOLO into the vision-language model space.

One API for all six. You switch tasks by changing the model suffix and dataset format. No new framework to learn.

Deploying YOLO Models

A trained model needs to run where the cameras are: cloud server, factory-floor GPU, or a $35 Raspberry Pi. The deployment target shapes every decision about model size, export format, and optimization.

Cloud / server GPU. Export to ONNX or TensorRT, serve behind an API. TensorRT on an NVIDIA T4 gives the best throughput for batch inference. YOLO26x at 11.8ms per image on a T4 handles most server workloads without breaking a sweat.

NVIDIA Jetson (Orin, Xavier, Nano). The standard edge platform for industrial vision. Export to TensorRT with FP16 or INT8 quantization, and YOLO26n on a Jetson Orin Nano runs well under 20ms per frame. Our guide to deploying vision models on edge hardware covers the full export and optimization pipeline.

Raspberry Pi. CPU-only inference, but YOLO26n was built for exactly this scenario: 38.9ms per frame with 40.9 mAP on COCO. Export to ONNX or TFLite. For workloads where even that latency is too high, YOLO26-pico variants trade accuracy for sub-20ms CPU speed.

Mobile (iOS / Android). CoreML handles iOS, TFLite handles Android. The Nano and Small variants fit comfortably within mobile memory and compute budgets, and our TFLite export guide walks through the conversion step by step.

Optimization techniques apply across all targets. Post-training quantization converts FP32 weights to INT8, cutting model size by 4x and speeding up inference on hardware with integer math units. Model pruning removes redundant parameters. YOLO26's NMS-free design removes a deployment variable that plagued earlier versions: latency no longer spikes when the scene is crowded.
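Post-training quantization is conceptually simple. Here is a minimal sketch of symmetric per-tensor INT8 quantization — real toolchains like TensorRT add calibration data and per-channel scales, which this omits, and the weight values are made up:

```python
def quantize_int8(weights):
    """Symmetric post-training quantisation sketch: map FP32 weights
    onto int8 codes in [-127, 127] with a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.8, -0.4, 0.02, -1.27]
q, scale = quantize_int8(weights)
print(q)  # [80, -40, 2, -127]
recovered = dequantize(q, scale)
# Each recovered weight is within one quantisation step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
```

Storing one byte per weight instead of four is where the 4x size reduction comes from; the speedup comes from hardware that executes integer multiply-accumulates faster than floating-point ones.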

Common Use Cases

Manufacturing inspection. Detecting surface defects (scratches, dents, cracks, missing components) on production lines at line speed. YOLO's real-time performance means inspection keeps pace with manufacturing output. Companies like Ingroth use YOLO-class models to inspect industrial barrels, and Trendspek applies them to structural crack detection in infrastructure.

Autonomous driving. Detecting pedestrians, vehicles, cyclists, traffic signs, and lane markings. YOLO's deterministic latency (especially YOLO26's NMS-free design) matters for safety-critical systems where worst-case inference time must be bounded.

Video surveillance and security. People counting, intrusion detection, unattended object alerts, and license plate reading. YOLO processes live camera feeds at 30+ FPS, enabling real-time alerting. Object tracking algorithms like ByteTrack run on top of YOLO detections to maintain persistent IDs across frames.

Agriculture. Counting fruit for yield estimation, detecting weeds for precision spraying, grading produce on sorting lines, and monitoring livestock health. Drones cover hundreds of acres per day, and edge-deployed YOLO models process images on-device with no cloud connection needed.

Retail. Shelf monitoring, planogram compliance, checkout-free shopping, foot traffic analysis. The Nano and Small variants fit on store-mounted edge devices with no dedicated GPU, keeping per-camera costs low.

Medical imaging. Detecting nodules, lesions, polyps, and fractures as a screening aid. Medical detection models flag regions for radiologist review rather than making diagnostic decisions. SAHI (Slicing Aided Hyper Inference) helps YOLO detect small findings in high-resolution medical scans by running inference on overlapping image crops.

Getting Started with YOLO

For a new project, start with the latest stable release (YOLO26 as of early 2026) and pick the Nano variant for prototyping. Fine-tune from COCO pre-trained weights on your annotated dataset. Evaluate with mAP and per-class metrics, then export to your target runtime. If accuracy falls short, scale up to a larger variant.

Datature Nexus supports training and deploying YOLO models, including YOLO26, through a visual interface: upload images, annotate with AI-assisted bounding box tools, train with one click, evaluate results with built-in metrics, and deploy to cloud or edge. No training infrastructure code required. For hands-on walkthroughs, see our tutorials on YOLOv8 training, YOLOX custom datasets, and YOLO11 vs YOLOv8 comparison.

Get Started Now

Get Started using Datature’s computer vision platform now for free.