Vision Transformer (ViT)
A Vision Transformer (ViT) takes the transformer architecture, originally designed for text, and applies it to images. Instead of processing pixels with convolutional filters, ViT splits an image into fixed-size patches (typically 16x16 or 14x14 pixels), flattens each patch into a vector, projects it into the model dimension, adds positional embeddings, and feeds the resulting sequence through standard transformer encoder layers. This lets the model attend to relationships between distant image regions from the very first layer, whereas CNNs need many stacked layers to build up the same global receptive field.
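The patch-embedding step described above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular library's implementation: the patch size, model dimension, and random projection weights are placeholder choices (a real ViT learns the projection and positional embeddings during training, and also prepends a class token).

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    Assumes H and W are divisible by patch_size.
    """
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    # Group the two patch-grid axes together, then flatten each patch.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)
    return patches

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # stand-in for a 224x224 RGB image

patches = patchify(image)                    # (196, 768): 14*14 patches of 16*16*3 values
d_model = 768                                # hypothetical embedding dimension

# Linear projection of flattened patches into the model dimension
# (learned in a real ViT; random here for illustration).
W_embed = rng.standard_normal((patches.shape[1], d_model)) * 0.02
tokens = patches @ W_embed                   # (196, 768)

# Add (here random, normally learned) positional embeddings, one per patch.
pos_embed = rng.standard_normal((tokens.shape[0], d_model)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)                          # (196, 768)
```

The sequence of 196 token vectors is what the transformer encoder layers then consume, exactly as they would consume a sequence of word embeddings in a text model.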
Google's original ViT (2020) proved that a pure transformer, pre-trained on large datasets like ImageNet-21k, matches or beats CNNs at image classification. The idea spread fast. DeiT added training tricks for smaller datasets. Swin Transformer brought shifted windows for high-resolution images. DINOv2 showed that self-supervised ViT pre-training produces strong general-purpose visual features. Today, ViT is the default vision encoder in nearly every major VLM: PaliGemma uses SigLIP-ViT, Qwen-VL uses its own ViT, and Florence-2 uses DaViT.
ViTs serve as the backbone for classification, object detection (DETR, DINO), segmentation (SAM, Mask2Former), and vision-language models. Pick any recent object detector, segmentation model, or VLM, and the vision encoder is almost always some form of ViT.

