Vision Transformer (ViT)
A Vision Transformer (ViT) takes the transformer architecture, originally designed for text, and applies it to images. Instead of processing pixels with convolutional filters, ViT splits an image into fixed-size patches (typically 16x16 or 14x14 pixels), flattens each patch into a vector, projects it into the model dimension, adds positional embeddings, and feeds the resulting sequence through standard transformer encoder layers. This lets the model attend to relationships between distant image regions from the very first layer, whereas CNNs need many stacked layers to build up the same global receptive field.
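The patch-embedding step described above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular library's implementation: the patch size, model dimension, and random projection weights are placeholder choices (a real ViT learns the projection and positional embeddings during training, and also prepends a class token).

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    Assumes H and W are divisible by patch_size.
    """
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    # Group the two patch-grid axes together, then flatten each patch.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)
    return patches

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # stand-in for a 224x224 RGB image

patches = patchify(image)                    # (196, 768): 14*14 patches of 16*16*3 values
d_model = 768                                # hypothetical embedding dimension

# Linear projection of flattened patches into the model dimension
# (learned in a real ViT; random here for illustration).
W_embed = rng.standard_normal((patches.shape[1], d_model)) * 0.02
tokens = patches @ W_embed                   # (196, 768)

# Add (here random, normally learned) positional embeddings, one per patch.
pos_embed = rng.standard_normal((tokens.shape[0], d_model)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)                          # (196, 768)
```

The sequence of 196 token vectors is what the transformer encoder layers then consume, exactly as they would consume a sequence of word embeddings in a text model.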
Google's original ViT (2020) proved that a pure transformer, pre-trained on large datasets like ImageNet-21k, matches or beats CNNs at image classification. The idea spread fast. DeiT added training tricks for smaller datasets. Swin Transformer brought shifted windows for high-resolution images. DINOv2 showed that self-supervised ViT pre-training produces strong general-purpose visual features. Today, ViT is the default vision encoder in nearly every major VLM: PaliGemma uses SigLIP-ViT, Qwen-VL uses its own ViT, and Florence-2 uses DaViT.
ViTs serve as the backbone for classification, object detection (DETR, DINO), segmentation (SAM, Mask2Former), and vision-language models. Pick any recent object detector, segmentation model, or VLM, and the vision encoder is almost always some form of ViT.

