Contrastive Learning

Contrastive learning trains a model to produce similar representations for related inputs and different representations for unrelated inputs, without needing class labels. Take an image, create two augmented versions (random crop, color jitter, flip), and train the model to map both versions close together in embedding space while pushing embeddings of different images apart.
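The "two augmented views" step can be sketched in a few lines. This is a minimal illustration using only random crop and horizontal flip on a NumPy array; real pipelines (e.g. in SimCLR) also apply color jitter, grayscale, and blur, and the `augment` function and its parameters here are illustrative, not from any particular library.

```python
import numpy as np

def augment(img, rng, crop=24):
    """Produce one random view of an image: random crop + random horizontal flip.

    A minimal sketch -- production pipelines add color jitter, blur, etc.
    """
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    view = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]  # horizontal flip
    return view

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))          # stand-in for a real image
view1, view2 = augment(img, rng), augment(img, rng)  # a positive pair
```

Both views come from the same image, so the model is trained to embed them close together; views of different images become negatives.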

SimCLR uses a shared encoder that maps augmented pairs through a projection head, with a contrastive loss (NT-Xent) that maximizes agreement between positive pairs relative to negatives in the batch. MoCo maintains a momentum-updated encoder and a queue of negative embeddings, removing the need for very large batch sizes. BYOL and SimSiam showed that negative pairs aren't strictly necessary — asymmetric architectures with stop-gradients can learn good representations from positive pairs alone. DINO and DINOv2 apply self-distillation with vision transformers and produce features with strong emergent properties, including semantic segmentation without any pixel-level training.
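The NT-Xent loss mentioned above is compact enough to write out. The sketch below is a NumPy version for a batch of N images with two views each (2N embeddings total): each embedding's positive is its sibling view, and all other 2N−2 embeddings in the batch serve as negatives. The temperature value is an assumption; SimCLR tunes it per setup.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent loss as used in SimCLR (a minimal NumPy sketch).

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / tau                               # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = z1.shape[0]
    # row i's positive is its other view: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy: -log softmax probability at the positive index
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = nt_xent_loss(z, z)                          # perfectly matched views
mismatched = nt_xent_loss(z, rng.normal(size=(8, 16)))
```

Matched views score a much lower loss than unrelated ones, which is exactly the gradient signal that pulls positive pairs together relative to the batch negatives.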

These methods produce general-purpose visual representations that transfer well to downstream tasks — classification, detection, segmentation — with minimal labeled data. They're especially useful in domains where labels are expensive: medical imaging, satellite analysis, and industrial inspection.

In the vision-language domain, contrastive learning takes a specific form: CLIP and SigLIP train by pulling matching image-text pairs together and pushing non-matching pairs apart in embedding space. Given a batch of N image-text pairs, the model learns to maximize similarity for the N correct pairings while minimizing it for the N*(N-1) incorrect ones. This produces a shared embedding space where images and text can be directly compared, forming the foundation for zero-shot classification, visual search, and VLM pre-training. The quality of this contrastive pre-training largely determines how well downstream VLMs perform on vision-language tasks.
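The batch structure described above translates directly into a symmetric cross-entropy loss: the N diagonal entries of the N×N image-text similarity matrix are the correct pairings, and the off-diagonal entries are the negatives. A minimal NumPy sketch (the temperature value is an assumption; CLIP learns it during training):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric image-text contrastive loss (CLIP-style), a minimal sketch.

    img_emb, txt_emb: (N, D) embeddings of N matching image-text pairs.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau          # (N, N); diagonal = correct pairings
    labels = np.arange(len(img))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()   # -log p of the diagonal entries

    # average of image->text and text->image cross-entropies
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
paired = clip_contrastive_loss(img, img)               # perfectly aligned pairs
unpaired = clip_contrastive_loss(img, rng.normal(size=(8, 16)))
```

Minimizing this loss over large web-scale batches is what yields the shared embedding space used for zero-shot classification and retrieval.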

Get Started Now

Get Started using Datature’s computer vision platform now for free.