Data Preprocessing

Data preprocessing covers the steps applied to raw images and annotations before they enter a training pipeline. The goal is to put the data into the consistent format the model expects and to remove noise that could hurt learning. Common operations include resizing images to a fixed resolution, normalizing pixel values (scaling to [0, 1] or standardizing with the dataset mean and standard deviation), converting color spaces (e.g. BGR to RGB), and padding images to uniform dimensions for batching.
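These operations can be sketched in a few lines. This is a minimal illustration using only NumPy; the nearest-neighbor resize, the ImageNet-style mean/std values, and the `preprocess` function name are assumptions for demonstration, not any particular library's API.

```python
import numpy as np

# Illustrative normalization statistics (ImageNet-style RGB mean/std).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_bgr, size=(224, 224)):
    """Resize, convert BGR->RGB, scale to [0, 1], and standardize."""
    h, w = img_bgr.shape[:2]
    th, tw = size
    # Nearest-neighbor resize via index mapping (real pipelines would
    # typically use bilinear interpolation from cv2 or PIL).
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    img = img_bgr[rows[:, None], cols]
    img = img[..., ::-1]                    # BGR -> RGB channel order
    img = img.astype(np.float32) / 255.0    # scale to [0, 1]
    return (img - MEAN) / STD               # standardize per channel
```

A real pipeline would also handle padding for batching, but the core resize/normalize/convert sequence is the same.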

For object detection and segmentation, preprocessing also transforms the labels: bounding box coordinates are rescaled when images are resized, segmentation masks are resampled to match the new dimensions, and class labels are encoded as integer tensors. YOLO models expect a specific letterbox resize that preserves aspect ratio with gray padding. Transformer-based models often use different normalization statistics (ImageNet mean/std vs. dataset-specific values).
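The letterbox transform and the matching box rescale can be sketched as follows. This is a simplified sketch assuming a square target size and centered padding; the function names and the `(x1, y1, x2, y2)` box convention are illustrative choices, not a specific framework's interface.

```python
def letterbox_params(h, w, target=640):
    """Compute the scale and padding that fit an (h, w) image into a
    target x target canvas while preserving aspect ratio."""
    scale = min(target / h, target / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    pad_top = (target - new_h) // 2   # gray padding goes here and below
    pad_left = (target - new_w) // 2  # ...and on the left/right
    return scale, pad_left, pad_top

def transform_box(box, scale, pad_left, pad_top):
    """Apply the same scale and offset to an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_left, y1 * scale + pad_top,
            x2 * scale + pad_left, y2 * scale + pad_top)
```

For example, a 480x640 image letterboxed to 640x640 keeps scale 1.0 and gains 80 pixels of padding above and below, so every box's y-coordinates shift down by 80. Masks need the same resampling and padding so pixels stay aligned with the image.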

Poor preprocessing is a common source of silent bugs. A mismatch between training and inference preprocessing — different resize methods, forgotten normalization, or inconsistent color channel order — can drop model accuracy significantly without any obvious error. Standardizing the preprocessing pipeline and version-controlling it alongside the model is a best practice.
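One lightweight way to enforce this is to define the preprocessing parameters once and fingerprint them, so training and inference can assert they agree. This is a sketch of the idea, not a prescribed tool; the `PreprocessConfig` class and its fields are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class PreprocessConfig:
    """Single source of truth for preprocessing, shared by train and inference."""
    size: tuple = (640, 640)
    mean: tuple = (0.485, 0.456, 0.406)
    std: tuple = (0.229, 0.224, 0.225)
    channel_order: str = "RGB"

    def fingerprint(self) -> str:
        """Deterministic hash of the config, suitable for storing
        alongside model weights and checking at inference time."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```

At export time the fingerprint is saved with the model; at load time the serving code recomputes it from its own config and refuses to run on a mismatch, turning a silent accuracy drop into a loud error.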
