Articles

Solving Class Imbalance: Upsampling in Machine Learning Projects

Marcus Neo

April 29, 2025

MIN READ

Class imbalance represents a significant challenge in machine learning where the distribution of classes in a dataset is heavily skewed. When one class substantially outnumbers others, models tend to develop a bias toward the majority class, potentially compromising their performance on minority classes. In this article, we explore upsampling as an effective strategy to address class imbalance problems.

What is Class Imbalance?

Class imbalance occurs when the distribution of classes in a dataset is disproportionate. For instance, in defect detection systems, the vast majority of images typically show no defects, while only a small percentage contain imperfections. This natural imbalance can severely impact a model's ability to learn patterns associated with the minority class.

Why is Class Imbalance Problematic?

Most machine learning algorithms optimize for overall accuracy, which becomes problematic with imbalanced data. In such scenarios, a model can achieve seemingly high accuracy by simply predicting the majority class in most cases, while performing poorly on the minority class. This is particularly concerning in critical applications like medical diagnosis, where failing to detect a rare condition (the minority class) could have serious consequences for patient outcomes.

Common Solutions to Class Imbalance

Class imbalance can be addressed through several effective methodologies that engineers often combine for maximum effectiveness:

Undersampling reduces majority class samples to better balance with the minority class, preventing bias but potentially discarding useful information.
Oversampling increases minority class samples through duplication or synthesis, preserving all majority class information while balancing the dataset, though risk of overfitting exists.
SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic minority examples by interpolating between existing samples and their neighbors, expanding minority representation while avoiding overfitting.
Adjusted Loss Functions modify the objective function to penalize minority class misclassifications more heavily. Class weighting and focal loss are common implementations that maintain all data while addressing imbalance in the learning process itself, though finding optimal weights can be challenging and may require careful tuning.

Addressing Class Imbalance Through Upsampling

Understanding Upsampling

Upsampling is a resampling technique that balances skewed datasets by increasing the representation of minority classes. Unlike downsampling (which reduces majority class instances), upsampling preserves all available data while creating a more balanced class distribution.

How Upsampling Works - Wildlife Dataset

Imagine a simple wildlife classification dataset containing:

1 Squirrel image
3 Butterfly images
6 Dog images

In this scenario, butterflies and squirrels are the minority classes. If a model is trained as-is, it might become biased toward predicting dogs because dogs dominate the dataset. With upsampling, you could take those three butterfly images and copy them twice each, resulting in 6 butterfly images and 6 dog images. The model now sees butterflies just as frequently as dogs during training, which helps it learn both classes more evenly.

The more extreme case is portrayed by the squirrel class. To balance it perfectly, the single squirrel image is replicated 6 times. This highlights a drawback of naive upsampling: lack of diversity. If the minority class is duplicated too many times, the model might overfit to those specific examples, as it is seeing the same image over and over.

This simple example demonstrates why upsampling requires careful consideration. In practice, upsampling involves strategically adding more samples to the minority class through one of two primary approaches:

Replication: Duplicating existing minority class samples to increase their representation. While straightforward, this approach can lead to overfitting if samples are replicated excessively, as we saw with the squirrel example.
Synthetic Sample Generation: To mitigate the risk of overfitting, upsampling is often combined with domain-specific data augmentation techniques. These create meaningful variations of minority samples rather than exact duplicates, thereby enriching the dataset while maintaining its integrity. An example library that we can use is Albumentations.

Caveat of Upsampling with Context-Bound Classes - Blood Cell Dataset

While upsampling (with or without augmentation) can balance the numbers, there are important caveats, especially in computer vision tasks. Unlike tabular data where each sample is independent, images can contain multiple objects or classes within one image, and the way classes appear might have natural ratios that are difficult to modify.

Imagine an object detection dataset of blood samples. In each image:

Red Blood Cells are the majority class
Platelets and White Blood Cells are the minority classes

In this scenario, Platelets and White Blood Cells are the minority classes, but in a different way than the wildlife example. These classes are inherently rare not just across the dataset, but within each individual image due to the inherent nature of blood. This creates a unique challenge for upsampling: even if images containing Platelets are duplicated to increase their representation, each duplicated image still contains many more Red Blood Cells. Unlike the wildlife dataset where each image contained a single class, here the class imbalance exists within each sample itself.

This highlights a fundamental limitation of traditional upsampling: artificially creating an image with equal numbers of Platelets, Red Blood Cells and White Blood Cells would significantly alter the authentic nature of the data. Some classes are not only globally rare across the dataset, but also locally constrained by their natural context.

In such cases, a practical compromise is necessary—improving class balance to a reasonable degree while respecting the inherent distribution constraints that exist in the real world.

Datature’s Class Balancing Approach

Datature's new Class Balancing feature allows you to tune the class representation to achieve the most balanced outcome. We also offer Autobalancing to intelligently identify the best class weights to optimize model performance across all classes, so you don't have to manually fiddle with your dataset or manually adjust the class weights. Here's how it works and what it brings to your workflow.

Calculating Optimal Upsampling Ratios

Determining the ideal upsampling frequency requires a balance—enough samples of minority classes are needed without introducing excessive duplication that might cause overfitting. A systematic mathematical approach can be used to calculate precise upsampling ratios for each class, while acknowledging that perfect balance may not be achievable. This best-effort approach aims to improve representation as much as feasible while respecting practical limitations of the dataset.

Represent class distribution as a matrix: Create an M × N matrix where M rows represent classes and N columns represent individual items in the dataset. For classification tasks, each cell typically contains a binary indicator (1 if the item belongs to that class, 0 otherwise). For object detection tasks, cells may contain counts of objects from each class present in the image or scene.
Apply double normalization to ensure balanced representation:
- Column normalization: Normalize each column so values sum to 1, giving equal weight to each sample
- Row normalization: Normalize each row so values sum to 1, giving equal weight to each class
Apply sampling weights: Multiply the normalized matrix by the desired sampling weights for each class (for balanced classes, these weights would be equal)
Calculate column sums: After double normalization, sum each column to obtain the relative weight for each image
Scale by the smallest entry: Divide all values in the column sum by the smallest non-zero entry to establish relative sampling frequencies
Quantize to whole numbers: Round each value up to the nearest integer to determine how many times each sample should be included in the balanced dataset

This approach ensures that minority classes receive sufficient representation while maintaining the diversity of the original dataset.

Classification Example: Wildlife Dataset

Let's apply this method to the previously mentioned wildlife dataset containing:

1 squirrel image
3 butterfly images
6 dog images

Step 1: We represent this class distribution in a matrix where rows correspond to classes and columns to images:

Step 2: Now we perform double normalization:

For classification tasks, column-wise normalization effectively treats each image equally
After row-wise normalization, the squirrel's value remains at 1 (since it's the only instance of its class), while each butterfly becomes 1/3 and each dog 1/6 of its respective classes

Step 3: We then apply desired sampling weights to the normalized matrix. For this scenario, the weights can be set to one.

Step 4: Next, we calculate column sums of the resultant matrix to determine relative image weights.

Step 5: We divide these weights by the smallest value (1/6) to obtain integer sampling frequencies.

Calculating the Image Sampling Frequency

‍

Results interpretation: This calculation shows us that to achieve balanced class representation:

The squirrel image should be sampled 6 times (addressing its severe underrepresentation)
Each butterfly image should be sampled twice (moderate underrepresentation)
Each dog image requires only a single sample (already well-represented)

Through this strategic upsampling, our dataset grows from 10 to 18 instances. Each class constituting approximately one-third of the total, creating an optimal balance for training without losing any original information.

Object Detection Example: Blood Cell Dataset

In this second example, the focus is on detecting and differentiating cells from the aforementioned blood cell dataset.

Five sample images will be selected to demonstrate how the upsampling frequencies can be obtained for this object detection task, where Red Blood Cells form the majority class while White Blood Cells and Platelets remain heavily underrepresented throughout.

Step 1: We first obtain the class-image matrix. Unlike the classification example, this matrix isn't binary (ones and zeros) because each image contains multiple instances of different cells.

Step 2: We perform double normalization on the matrix:

First, we normalize at the image (column) level
Then, we normalize at the class (row) level

Step 3: We multiply the resulting normalized matrix by the desired class weights to address the varying degrees of class imbalance:

Platelets (heavily underrepresented): weight of 45
Red Blood Cells (majority class): weight of 10
White Blood Cells (moderately underrepresented): weight of 45

Step 4: Next, we calculate column sums of the weighted normalized matrix to determine relative image weights.

‍

Step 5: We divide these weights by the smallest value to obtain integer sampling frequencies. The resulting sampling frequency becomes [3, 1, 2, 2, 4].

This means that the first image should be sampled three times, the second image once, the third and fourth images twice each, and the fifth image four times. This strategic upsampling ensures that the underrepresented Platelets and White Blood Cells receive appropriate emphasis in the training process while maintaining the natural diversity of the dataset.

Advantages and Limitations of Datature’s Upsampling Approach

When implementing this upsampling technique to address class imbalance, it's important to understand its strengths and limitations:

Advantages

Complete Data Preservation: All items in the dataset will be sampled at least once, preventing any data from disappearing in the resulting balanced dataset.
Adaptive Sampling: The method automatically adapts to varying degrees of imbalance across different classes, providing appropriate emphasis without requiring manual tuning for each class.
Preservation of Data Relationships: By upsampling entire images rather than generating synthetic examples, the natural co-occurrence relationships between objects are preserved.
Mathematically Principled: The approach is based on matrix normalization techniques with a clear mathematical foundation, making it more systematic than ad-hoc sampling methods.
Customizable Class Importance: The sampling weights provide a mechanism to incorporate domain knowledge about the relative importance of different classes beyond just addressing imbalance.
No Information Loss: Unlike downsampling techniques, this approach retains all original data points, which is particularly valuable for small datasets.

Limitations

Class Weight Constraints: For object detection datasets where images contain both majority and minority classes, you will never reach the exact desired class weights due to the inherently skewed nature of the dataset.
Potential Overfitting: Repeated sampling of the same images, especially for severely underrepresented classes, may lead to overfitting as the model sees the same instances multiple times.
Computational Overhead: The increased dataset size after upsampling requires more computational resources for training.
Matrix Calculation Complexity: The double normalization and weighting process is more complex than simple random oversampling techniques, potentially making it harder to implement and explain.
Limited by Original Distribution: The effectiveness is constrained by the original distribution of objects within images—if a minority class never appears alone, perfect balance remains impossible.
Less Effective Without Augmentation: Without additional data augmentation, the technique may not introduce enough variability in minority class examples to create a robust model.
Class Boundary Challenges: For images containing multiple classes, the technique doesn't specifically address the challenge of object boundary detection between majority and minority classes.

To mitigate some of these limitations, this upsampling approach is often combined with data augmentation techniques to introduce more variability while maintaining the calculated sampling frequencies.

Class Balancing Feature On Datature Nexus

Implementing class balancing in real-world machine learning projects has been made significantly more accessible through Datature’s Nexus platform. This section walks through the practical steps of applying the upsampling techniques discussed earlier, directly within the Nexus workflow interface.

Setting Up a Training Workflow

To leverage Nexus’s class balancing capabilities, start by creating or opening a workflow in your project. The workflow interface presents a visual pipeline of your machine learning process, with connected modules representing each step from dataset preparation to model training.

Step 1: The Dataset Module

Begin with the Dataset module, which allows you to configure your dataset for training.

After configuring your dataset, you can activate the class balancing feature:

Navigate to the “Class Balancing Setup” section in the dataset configuration panel.
Toggle “Enable Class Balancing” to “Yes”.
The distribution control interface will appear.

Step 2: Using the Class Distribution Balancer

The Class Distribution Balancer provides a visual interface to manage class representation in your dataset:

Current Distribution View: The interface displays your dataset’s original class distribution, showing the proportion of each class in the dataset.
Sample Count: You can see the number of samples in the original dataset, and what the balanced dataset will contain.
Class-Specific Controls: For each class in your dataset (e.g. “dog”, “squirrel” or “butterfly” in our wildlife dataset example), you can view and adjust the target representation percentage.

Step 3: Adjusting Target Distributions

You have two options for determining the target class distribution:

Manual Adjustment: Use the slider controls to specify the desired percentage for each class. As you adjust one class, the platform dynamically updates others to maintain a valid distribution totalling 100%. Behind the scenes, these target weights are used with the double normalization technique explained earlier to calculate the best-effort upsampling ratios for each image.
Auto-Balance Feature: Click the “Auto-Balance Classes” button to have Nexus automatically search for the optimal target weights that will result in a distribution as close as possible to uniform across all classes.

Both methods use the same mathematical principles outlined earlier; the Auto-Balance feature simply eliminates the guesswork by identifying the target weights that will achieve the most balanced outcome possible given your dataset’s constraints.

The interface provides immediate visual feedback on your adjustments, showing both the best-effort and target percentages side by side for each class, helping you understand how the balancing will affect your dataset composition.

Step 4: Finalize and Confirm

After setting your desired class distribution:

Review the proposed changes, noting the overall image count for the balanced dataset.
Click “Confirm” to apply the class balancing settings to your workflow.
The system will calculate the appropriate sampling frequencies for each image based on the mathematical approach described earlier.

Integration with the Training Pipeline

Once class balancing is enabled and configured, it seamlessly integrates with the rest of your workflow:

Augmentation Module: Combine class balancing with augmentation techniques to further enhance diversity in minority classes
Model Configuration: Set your chosen architecture (e.g. DFINE) and batch size for training.
Training Process: When you run training, the platform automatically applies the upsampling according to your specified distribution.

Implementation Notes and Best Practices

When using Nexus’s class balancing feature, keep these considerations in mind:

Best-Effort Balancing: As emphasized earlier, the balancing operates on a best-effort basis; perfect equality between classes may not be achievable in all scenarios due to inherent dataset characteristics or extreme imbalances.
Increased Dataset Size: Remember that upsampling increases your overall dataset size, sometimes significantly. Be mindful that you will need to adjust your training parameters accordingly, particularly by increasing the total number of training steps to account for the larger dataset.
Complementary Techniques: Class balancing works most effectively when combined with appropriate data augmentation.
Performance Monitoring: After training with a balanced dataset, pay close attention to the per-class metrics (precision, recall, etc.) rather than just overall metrics to ensure the balancing has improved minority class performance.
Interactive Refinement: You may need to experiment with different target distributions to find the optimal balance for your specific use case and dataset characteristics.

By implementing class balancing through Nexus, you can effectively address the challenges of imbalanced datasets without requiring in-depth knowledge of the underlying mathematical techniques. The platform handles the complex calculations of sampling frequencies, allowing you to focus on optimizing the overall performance of your model across all classes.

Training Setup

We conducted an experiment using an imbalanced animal dataset to demonstrate the impact of class balancing techniques on model performance. Our dataset consisted of:

10 squirrel samples (10%)
30 butterfly samples (30%)
60 dog samples (60%)

We trained identical models for 100 epochs under two conditions:

Without class balancing (using the raw imbalanced dataset)
With class balancing (using oversampling techniques)

Note: When class balancing was enabled, the effective dataset size increased due to oversampling of minority classes, resulting in more training steps per epoch.

Testing The Animal Dataset Without Class Balancing

Testing The Animal Dataset With Class Balancing

Results Analysis

The performance difference between the two approaches was striking, particularly for underrepresented classes:

Key Findings

Our experiment revealed several important insights:

Minority Class Improvement: The squirrel class, our most underrepresented category with only 10 samples, saw the most dramatic improvement with a 32% increase in F1 score (from 0.25 to 0.33).
Well-Represented Classes: Butterfly samples, despite being outnumbered by dogs, had sufficient representation to achieve perfect F1 scores in both scenarios.
Majority Class Enhancement: Interestingly, even the dog class (our majority class) benefited from the balanced training approach, with its F1 score improving from 0.85 to 0.94.
Extended Training Effect: The increased dataset size from oversampling resulted in more training steps per epoch, which likely contributed to the overall performance improvement.

‍

Conclusion

Matrix-based upsampling provides a systematic approach to addressing class imbalance in both classification and object detection tasks. By calculating precise sampling frequencies through double normalization and weighted adjustments, we can create balanced datasets that give appropriate representation to all classes.

For classification tasks with mutually exclusive classes, this approach can achieve a perfect balance. For object detection tasks with co-occurring objects, it offers a significant improvement over the original distribution while preserving natural relationships between objects.

The key benefit of this approach is its mathematical foundation, which removes subjective decision-making from the balancing process. By following the steps outlined in this document and understanding the advantages and limitations, machine learning practitioners can implement effective upsampling strategies that lead to more robust and fair models.

When combined with appropriate data augmentation techniques, this upsampling method can significantly improve model performance on minority classes without compromising overall accuracy, addressing one of the fundamental challenges in applied machine learning.