CLIP (Contrastive Language-Image Pre-training)

CLIP (Contrastive Language-Image Pre-training) is a model from OpenAI that learns a shared embedding space for images and text. Give it an image and a set of text descriptions, and CLIP picks the best match, even for tasks it was never trained on. This zero-shot capability reshaped how vision models are built: CLIP-based systems can skip task-specific labeled datasets entirely.

CLIP uses a dual-encoder architecture. A vision encoder (ViT or ResNet) converts the image into a vector. A text encoder (Transformer) does the same for a caption. During training on 400 million image-text pairs from the internet, a contrastive loss pulls matching pairs together in embedding space and pushes non-matching pairs apart. At inference, you encode the image and a set of candidate text prompts ("a photo of a dog," "a photo of a cat"), then pick the text with the highest cosine similarity. SigLIP, Google's successor to CLIP, swaps the softmax contrastive loss for a sigmoid loss that scales better to larger batch sizes. SigLIP powers PaliGemma 2 and other recent VLMs.
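The inference step above can be sketched in a few lines. This is a minimal illustration of cosine-similarity matching, not real CLIP usage: the 4-dimensional vectors below are made-up stand-ins for encoder outputs (actual CLIP embeddings are 512+ dimensions and come from the trained encoders).

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a . b / (|a| * |b|); embeddings are compared by angle, not magnitude
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image_emb, text_embs, labels):
    # Score the image embedding against every candidate caption embedding,
    # then return the label of the closest caption.
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return labels[int(np.argmax(scores))], scores

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_emb = np.array([0.9, 0.1, 0.0, 0.2])
text_embs = [
    np.array([0.8, 0.2, 0.1, 0.1]),  # embedding of "a photo of a dog"
    np.array([0.1, 0.9, 0.3, 0.0]),  # embedding of "a photo of a cat"
]
label, scores = zero_shot_classify(image_emb, text_embs, ["dog", "cat"])
# The image embedding points in roughly the same direction as the "dog" caption,
# so "dog" wins.
```

Swapping in a real checkpoint (e.g. via the `open_clip` or `transformers` libraries) only changes where the vectors come from; the matching logic stays the same.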

CLIP's embedding space sits at the foundation of modern VLMs, open-vocabulary detectors like YOLO-World and Grounding DINO, and text-to-image generators (Stable Diffusion uses CLIP's text encoder). The model also works well on its own. Teams use CLIP for image search, content moderation, and data curation: finding duplicates, spotting outliers, or filtering mislabeled images across large collections. It's a practical choice for zero-shot classification when collecting labeled data costs too much time or money.

Get Started Now

Get Started using Datature’s computer vision platform now for free.