Contrastive Learning

Contrastive learning trains a model to produce similar representations for related inputs and different representations for unrelated inputs, without needing class labels. Take an image, create two augmented versions (random crop, color jitter, flip), and train the model to map both versions close together in embedding space while pushing embeddings of different images apart.
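The "two augmented views" step can be sketched in a few lines. This is a minimal illustration using only random crop and horizontal flip on a NumPy array; real pipelines (e.g. in SimCLR) also apply color jitter, grayscale, and blur, and the `augment` function and its parameters here are illustrative, not from any particular library.

```python
import numpy as np

def augment(img, rng, crop=24):
    """Produce one random view of an image: random crop + random horizontal flip.

    A minimal sketch -- production pipelines add color jitter, blur, etc.
    """
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    view = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]  # horizontal flip
    return view

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))          # stand-in for a real image
view1, view2 = augment(img, rng), augment(img, rng)  # a positive pair
```

Both views come from the same image, so the model is trained to embed them close together; views of different images become negatives.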

SimCLR uses a shared encoder that maps augmented pairs through a projection head, with a contrastive loss (NT-Xent) that maximizes agreement between positive pairs relative to negatives in the batch. MoCo maintains a momentum-updated encoder and a queue of negative embeddings, removing the need for very large batch sizes. BYOL and SimSiam showed that negative pairs aren't strictly necessary — asymmetric architectures with stop-gradients can learn good representations from positive pairs alone. DINO and DINOv2 apply self-distillation with vision transformers and produce features with strong emergent properties, including semantic segmentation without any pixel-level training.
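The NT-Xent loss mentioned above is compact enough to write out. The sketch below is a NumPy version for a batch of N images with two views each (2N embeddings total): each embedding's positive is its sibling view, and all other 2N−2 embeddings in the batch serve as negatives. The temperature value is an assumption; SimCLR tunes it per setup.

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent loss as used in SimCLR (a minimal NumPy sketch).

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / tau                               # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = z1.shape[0]
    # row i's positive is its other view: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy: -log softmax probability at the positive index
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = nt_xent_loss(z, z)                          # perfectly matched views
mismatched = nt_xent_loss(z, rng.normal(size=(8, 16)))
```

Matched views score a much lower loss than unrelated ones, which is exactly the gradient signal that pulls positive pairs together relative to the batch negatives.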

These methods produce general-purpose visual representations that transfer well to downstream tasks — classification, detection, segmentation — with minimal labeled data. They're especially useful in domains where labels are expensive: medical imaging, satellite analysis, and industrial inspection.

In the vision-language domain, contrastive learning takes a specific form: CLIP and SigLIP train by pulling matching image-text pairs together and pushing non-matching pairs apart in embedding space. Given a batch of N image-text pairs, the model learns to maximize similarity for the N correct pairings while minimizing it for the N*(N-1) incorrect ones. This produces a shared embedding space where images and text can be directly compared, forming the foundation for zero-shot classification, visual search, and VLM pre-training. The quality of this contrastive pre-training largely determines how well downstream VLMs perform on vision-language tasks.
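The batch structure described above translates directly into a symmetric cross-entropy loss: the N diagonal entries of the N×N image-text similarity matrix are the correct pairings, and the off-diagonal entries are the negatives. A minimal NumPy sketch (the temperature value is an assumption; CLIP learns it during training):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric image-text contrastive loss (CLIP-style), a minimal sketch.

    img_emb, txt_emb: (N, D) embeddings of N matching image-text pairs.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau          # (N, N); diagonal = correct pairings
    labels = np.arange(len(img))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()   # -log p of the diagonal entries

    # average of image->text and text->image cross-entropies
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
paired = clip_contrastive_loss(img, img)               # perfectly aligned pairs
unpaired = clip_contrastive_loss(img, rng.normal(size=(8, 16)))
```

Minimizing this loss over large web-scale batches is what yields the shared embedding space used for zero-shot classification and retrieval.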

Get Started Now

Get Started using Datature’s computer vision platform now for free.