Ultralytics YOLO11 is the latest advancement in the YOLO series of real-time object detection models. By replacing several architectural components of its predecessor, YOLOv8, it addresses the growing demand for faster and more precise predictions in applications such as self-driving cars, surveillance, and augmented reality.
In this article, we walk through a step-by-step comparison of YOLO11 against YOLOv8, measuring both speed and precision. The architectural enhancements not only streamline the detection pipeline but also improve the model's ability to recognize and classify objects in real time, making it a strong choice for demanding scenarios.
Introducing YOLO11
Released in version 8.3 of the Ultralytics package, YOLO11 is a state-of-the-art model designed for a range of computer vision tasks, including object detection, instance segmentation, pose estimation, and classification. This versatility makes it a powerful tool for applications requiring a nuanced understanding of visual data.
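Since YOLO11 ships with the Ultralytics Python package, trying it out takes only a few lines. The snippet below is a minimal sketch; the image path is a placeholder.

```python
from ultralytics import YOLO

# Load pretrained YOLO11-Medium weights and run inference;
# "field.jpg" is a placeholder path for any test image.
model = YOLO("yolo11m.pt")
results = model.predict("field.jpg")
results[0].show()  # render the predicted bounding boxes
```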
Main Differences from YOLOv8
- C3k2 Block: A major improvement in YOLO11 is the replacement of the C2f block in the neck with the C3k2 block. This variant of the CSP Bottleneck uses two convolutions with smaller kernels rather than the single large-kernel convolution used in YOLOv8, yielding faster processing while preserving strong performance (see the simplified sketch after this list).
- C2PSA Block: YOLO11 retains the SPPF block but adds a new C2PSA (Cross Stage Partial with Spatial Attention) block after it. This block strengthens the model's spatial attention over the feature maps, enabling it to focus on the most informative areas of the image. By pooling features spatially, the C2PSA block improves the model's effectiveness at identifying specific regions of interest, boosting overall detection accuracy.
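To make the small-kernel idea concrete, here is a simplified PyTorch sketch of a bottleneck that stacks two 3x3 convolutions in place of a single larger kernel. This illustrates the pattern only; it is not the actual Ultralytics C3k2 implementation.

```python
import torch
import torch.nn as nn

class SmallKernelBottleneck(nn.Module):
    """Two stacked 3x3 convolutions standing in for one larger kernel.
    A simplified illustration of the small-kernel idea, not Ultralytics code."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.bn1(self.conv1(x)))
        y = self.act(self.bn2(self.conv2(y)))
        return x + y  # residual shortcut, as in CSP-style bottlenecks

# Quick shape check on a dummy feature map.
out = SmallKernelBottleneck(64)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```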
Training YOLO11 on a Custom Dataset
We will train a YOLO11 model to detect crops in a field on Datature's Nexus platform. The dataset consists of top-down photographs of the soil and includes seven crop species: sugar beet, maize, soy, sunflower, bean, pumpkin, and potato.
Choosing This Dataset for YOLO11
In agriculture, the vast fields and diverse crops create challenges for data processing and analysis. Large field sizes complicate monitoring, while species diversity requires adaptable models. Class imbalances can lead to biases in machine learning, resulting in misidentification of less common crops, which can cause economic losses and ineffective pest management.
To address these issues, we need models that process data quickly and accurately identify all crop species, particularly underrepresented ones. This dataset exhibits exactly these challenges, making it a useful test of the capabilities a modern agricultural model needs. We will now evaluate the performance of YOLO11 on this dataset.
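For readers working with their own data, one quick way to surface this kind of class imbalance is to count per-class boxes in YOLO-format label files. The directory name and class list below are assumptions for illustration.

```python
from collections import Counter
from pathlib import Path

# Count class frequencies in YOLO-format labels to surface imbalance;
# "labels/" and the class list are placeholders for illustration.
names = ["sugar beet", "maize", "soy", "sunflower", "bean", "pumpkin", "potato"]
counts = Counter()
for label_file in Path("labels").glob("*.txt"):
    for line in label_file.read_text().splitlines():
        counts[int(line.split()[0])] += 1  # first field is the class index

total = sum(counts.values()) or 1  # guard against an empty directory
for idx, name in enumerate(names):
    print(f"{name}: {counts[idx]} boxes ({100 * counts[idx] / total:.1f}%)")
```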
Training
Parameters
We will train a YOLO11-Medium object detection model from its default pretrained weights for a total of 10,000 steps, with a batch size of 32 split across two NVIDIA T4 GPUs. Given our training set of 2,670 assets, this works out to just under 120 epochs (10,000 steps x 32 images per step / 2,670 images is roughly 120). We use the Adam optimizer with a learning rate of 0.001.
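Nexus configures this run through its UI, but the equivalent open-source training call with the Ultralytics package would look roughly like the following; the dataset config file name is hypothetical.

```python
from ultralytics import YOLO

model = YOLO("yolo11m.pt")  # default pretrained weights
model.train(
    data="crops.yaml",   # hypothetical dataset config listing images and the seven classes
    epochs=120,          # ~10,000 steps at batch size 32 over 2,670 images
    batch=32,
    optimizer="Adam",
    lr0=0.001,
    device=[0, 1],       # split the batch across two GPUs
)
```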
YOLOv8 Control
For comparison, we have set up an identical training regimen using a YOLOv8-Medium object detection model, maintaining the same hyperparameters. The seed-lock feature in Nexus simplifies the setup of test controls, ensuring that all models are exposed to the same training images at precisely the same training step, provided the batch size remains consistent. This allows for a fair and accurate comparison of model performance.
Training Results
After training the YOLO11m model, the post-training analysis revealed an impressive mean average precision (mAP) of 0.7586 at an IoU threshold of 0.50 at step 2000. This performance highlights YOLO11's effectiveness in object detection, especially considering the challenge of identifying less common classes. The model demonstrated strong capabilities in detecting underrepresented crops, with Pea (1.9% of the dataset) and Potato (2%) achieving average precisions of 0.666 and 0.669, respectively. This is a significant achievement, as it indicates that YOLO11 can accurately recognize and classify these minority species, which is crucial for comprehensive agricultural monitoring and management.
In comparison, the YOLOv8m model reached a peak mAP of 0.7459 at the same step, with lower average precisions for the underrepresented classes—0.624 for Pea and 0.618 for Potato. While YOLOv8m showed improvement in its average precision for Potato at step 6000, increasing it to 0.741, this came at the cost of overall performance, as the mAP decreased to 0.743. This trade-off suggests that YOLOv8 may struggle to maintain a balanced performance across all classes, particularly when focusing on enhancing accuracy for specific underrepresented species.
The results demonstrate that YOLO11 not only surpasses YOLOv8 in overall mAP but also maintains a stronger performance across minority classes without compromising the detection accuracy of other species. This balance is crucial for applications in agriculture, where accurate identification of all crop types is essential for effective management practices.
The ability of YOLO11 to achieve higher average precision for underrepresented crops while maintaining a robust overall performance underscores its advantages over YOLOv8. This makes YOLO11 a more reliable tool for farmers and agronomists seeking to leverage advanced object detection capabilities in their operations.
Model Size and Prediction Latency
After training the YOLO11m and YOLOv8m models, we compared their prediction latencies to assess operational efficiency. When quantized to ONNX int8, the YOLO11m model shrank from 64.3 MB to 13.2 MB, while YOLOv8m reduced from 74.1 MB to 13.9 MB. Evaluating 100 test images revealed an average prediction time of 124.6 ms for YOLO11m and 120.6 ms for YOLOv8m, a near-identical result with only a 4 ms difference in latency.
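One common route to an int8 ONNX model is to export with Ultralytics and then apply dynamic quantization with ONNX Runtime. The sketch below shows that approach; it is not necessarily the exact pipeline Nexus uses internally.

```python
from ultralytics import YOLO
from onnxruntime.quantization import quantize_dynamic, QuantType

# Export the trained model to FP32 ONNX, then quantize weights to int8.
YOLO("yolo11m.pt").export(format="onnx")  # writes yolo11m.onnx
quantize_dynamic(
    "yolo11m.onnx",
    "yolo11m_int8.onnx",
    weight_type=QuantType.QInt8,
)
```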
Despite both models achieving similar compactness and speed, YOLO11m stands out by delivering superior accuracy. These results emphasize that YOLO11m offers a well-balanced solution for real-time applications, combining efficiency with enhanced precision.
Conclusion
The importance of these findings extends beyond mere numbers; in real-world applications, even small differences in latency can impact overall system responsiveness. While YOLOv8m shows a slight edge with an average prediction time that is 4 milliseconds faster, this difference of about 3% is relatively modest. Nonetheless, faster prediction times can be advantageous in scenarios where speed is a priority.
On the other hand, YOLO11m’s strong performance in detecting minority classes, combined with its competitive latency, makes it a solid choice for agricultural applications, where accuracy and reliability are paramount. This balance allows users to select the model that best fits their specific needs without sacrificing essential performance aspects. YOLO11 also supports multiple export formats, allowing it to be deployed in a wide variety of environments.
In summary, while YOLOv8m edges out in prediction speed, YOLO11m stands out for its balanced performance across all classes, particularly in recognizing less common species. This balance positions YOLO11 as a strong candidate for practical implementations in agricultural monitoring and management, where both speed and precision are essential for effective decision-making.
Our Developer’s Roadmap
Datature is constantly looking to incorporate state-of-the-art models that can boost accuracy and inference throughput. With our extensive model offering ranging from segmentation to pose estimation, it becomes crucial to understand which model architectures are best-suited for each use case. Datature is actively developing a cross-training comparison tool built into Nexus that allows you to easily compare multiple training runs across a variety of evaluation metrics.
Want to Get Started?
If you have questions, feel free to join our Community Slack to post your questions or contact us if you wish to learn more about training a YOLO11 model on Datature Nexus.
For more detailed information about the YOLO11 architecture, customization options, or answers to any common questions you might have, read more on our Developer Portal.