CLIP (Contrastive Language-Image Pre-training)

CLIP (Contrastive Language-Image Pre-training) is a model from OpenAI that learns a shared embedding space for images and text. Give it an image and a set of text descriptions, and CLIP picks the best match, even for tasks it was never trained on. This zero-shot capability reshaped how vision models are built: CLIP-based systems can skip task-specific labeled datasets entirely.

CLIP uses a dual-encoder architecture. A vision encoder (ViT or ResNet) converts the image into a vector. A text encoder (Transformer) does the same for a caption. During training on 400 million image-text pairs from the internet, a contrastive loss pulls matching pairs together in embedding space and pushes non-matching pairs apart. At inference, you encode the image and a set of candidate text prompts ("a photo of a dog," "a photo of a cat"), then pick the text with the highest cosine similarity. SigLIP, Google's successor to CLIP, swaps the softmax contrastive loss for a sigmoid loss that scales better to larger batch sizes. SigLIP powers PaliGemma 2 and other recent VLMs.
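The inference step above can be sketched in a few lines. This is a minimal illustration of cosine-similarity matching, not real CLIP usage: the 4-dimensional vectors below are made-up stand-ins for encoder outputs (actual CLIP embeddings are 512+ dimensions and come from the trained encoders).

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a . b / (|a| * |b|); embeddings are compared by angle, not magnitude
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image_emb, text_embs, labels):
    # Score the image embedding against every candidate caption embedding,
    # then return the label of the closest caption.
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return labels[int(np.argmax(scores))], scores

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_emb = np.array([0.9, 0.1, 0.0, 0.2])
text_embs = [
    np.array([0.8, 0.2, 0.1, 0.1]),  # embedding of "a photo of a dog"
    np.array([0.1, 0.9, 0.3, 0.0]),  # embedding of "a photo of a cat"
]
label, scores = zero_shot_classify(image_emb, text_embs, ["dog", "cat"])
# The image embedding points in roughly the same direction as the "dog" caption,
# so "dog" wins.
```

Swapping in a real checkpoint (e.g. via the `open_clip` or `transformers` libraries) only changes where the vectors come from; the matching logic stays the same.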

CLIP's embedding space sits at the foundation of modern VLMs, open-vocabulary detectors like YOLO-World and Grounding DINO, and text-to-image generators (Stable Diffusion uses CLIP's text encoder). The model also works well on its own. Teams use CLIP for image search, content moderation, and data curation: finding duplicates, spotting outliers, or filtering mislabeled images across large collections. It's a practical choice for zero-shot classification when collecting labeled data costs too much time or money.

Get Started Now

Get Started using Datature’s computer vision platform now for free.