SigLIP
SigLIP (Sigmoid Loss for Language-Image Pre-training) is a vision-language model from Google that improves on CLIP's training approach. Like CLIP, SigLIP learns a shared embedding space for images and text through contrastive learning on image-text pairs. The key difference is the loss function: CLIP uses a softmax-based contrastive loss that requires computing scores across all pairs in a batch, while SigLIP uses a sigmoid loss that treats each image-text pair independently. This seemingly small change has practical consequences.
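The pairwise sigmoid objective can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the reference code: the function name is made up, and the fixed temperature `t=10` and bias `b=-10` stand in for values that are learned during training (the paper initializes them near these numbers). Each of the N² image-text pairs contributes an independent binary term: +1 for the N matching pairs on the diagonal, -1 for everything else.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Sigmoid loss over all pairwise image-text scores.

    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    t, b: temperature and bias (learned in real training; fixed here
    for illustration).
    """
    logits = t * img_emb @ txt_emb.T + b      # (N, N) pair scores
    n = img_emb.shape[0]
    labels = 2.0 * np.eye(n) - 1.0            # +1 on the diagonal, -1 off it
    # -log sigmoid(label * logit) == softplus(-label * logit);
    # np.logaddexp(0, x) computes softplus(x) stably.
    return np.logaddexp(0.0, -labels * logits).sum() / n
```

Because every term is independent, the double sum can be evaluated in chunks, which is what allows the large-batch training described below: no row or column ever needs to be normalized against the whole batch at once.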
Because each pair is scored independently, the sigmoid loss lets SigLIP scale to much larger batch sizes: there is no softmax normalization over the whole batch, so the pairwise logits can be processed in chunks across devices rather than gathered into one full matrix. SigLIP 2, released by Google DeepMind in February 2025, introduced multi-resolution training, captioning-based pre-training alongside the contrastive objective, and self-distillation. SigLIP models serve as the vision encoder in PaliGemma and PaliGemma 2, meaning they form the visual backbone of Google's VLM family. Available variants range from ViT-B/16 (86M parameters) to ViT-SO400M (400M parameters), with the larger models used in production VLMs.
SigLIP is used as a drop-in replacement for CLIP wherever a vision-language encoder is needed: VLM backbones, image retrieval, zero-shot classification, and multimodal embedding generation. For practitioners fine-tuning PaliGemma or building custom VLMs, understanding that SigLIP is the vision encoder helps with debugging, choosing the right model variant, and understanding resolution and performance tradeoffs.
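One practical consequence for zero-shot classification: SigLIP scores each candidate label with an independent sigmoid, so the resulting probabilities need not sum to 1 (unlike CLIP's softmax over labels). A minimal sketch, with a hypothetical function name and the same illustrative `t`/`b` values standing in for the learned temperature and bias:

```python
import numpy as np

def zero_shot_probs(img_emb, txt_embs, t=10.0, b=-10.0):
    """Independent per-label match probabilities for one image.

    img_emb: (D,) L2-normalized image embedding.
    txt_embs: (K, D) L2-normalized embeddings of K label prompts.
    Returns K probabilities in [0, 1]; they do not need to sum to 1,
    which makes the scores usable for multi-label settings.
    """
    logits = t * txt_embs @ img_emb + b       # one logit per label
    return 1.0 / (1.0 + np.exp(-logits))      # elementwise sigmoid
```

With a perfectly matching prompt (cosine similarity 1) and `t=10`, `b=-10`, the logit is 0 and the probability is 0.5; well-trained models learn values of `t` and `b` that push matched pairs much higher, but the independence of the scores is the point being illustrated.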

