Vision Transformers (ViTs)
Vision Transformers (ViTs) apply the transformer architecture, originally designed for natural language processing, to image understanding. The core idea is simple: split an image into fixed-size patches (typically 16x16 pixels), flatten each patch into a vector, project it to the model dimension, and add positional embeddings to retain spatial information. A learnable class token is typically prepended for classification, and the resulting sequence is fed through standard transformer encoder layers with self-attention and feed-forward networks.
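The patch-embedding pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real ViT: the projection matrix and positional embeddings are random here, whereas in practice both are learned parameters.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    return (image
            .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
            .transpose(0, 2, 1, 3, 4)          # group the two patch-grid axes first
            .reshape(-1, patch_size * patch_size * C))

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))     # a 224x224 RGB image
patches = patchify(image)                      # (196, 768): 14x14 patches of 16*16*3 values

# Linear projection to the model dimension plus positional embeddings.
# Both are learned in a real ViT; random values stand in for them here.
d_model = 768
W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
pos_emb = rng.standard_normal((patches.shape[0], d_model)) * 0.02
tokens = patches @ W_proj + pos_emb            # (196, 768): the encoder's input sequence
```

A 224x224 image with 16x16 patches yields a sequence of 14 * 14 = 196 tokens, which is why ViT sequence lengths grow quadratically as input resolution increases.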
Self-attention allows each patch to attend to every other patch in the image, giving ViTs a global receptive field from the first layer. This contrasts with CNNs, which build global understanding gradually through many stacked local convolution operations. ViTs excel when pre-trained on large datasets (ImageNet-21K, JFT-300M, or with self-supervised objectives like DINOv2) and then fine-tuned on target tasks. Without large-scale pre-training, they tend to underperform CNNs on smaller datasets because they lack the built-in spatial inductive biases that convolutions provide.
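The global receptive field is visible directly in the attention weights: for a sequence of N patch tokens, each layer computes an N x N matrix in which every entry is generally nonzero. A minimal single-head sketch (random weights standing in for learned projections):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over patch tokens X of shape (N, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (N, N): every patch scores every patch
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
N, d = 196, 64                                      # 196 patch tokens, head dimension 64
X = rng.standard_normal((N, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
# attn has shape (196, 196) with strictly positive rows summing to 1:
# even in the first layer, the top-left patch can draw on the bottom-right one.
```

Contrast this with a 3x3 convolution, where a first-layer unit sees only a 3-pixel neighborhood and the receptive field widens only as layers stack.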
Key ViT variants include DeiT (data-efficient training with distillation), Swin Transformer (hierarchical feature maps with shifted windows for dense prediction), BEiT (BERT-style masked image modeling), and DINOv2 (self-supervised ViT producing strong general features). ViTs now serve as backbones in state-of-the-art detection (DINO-DETR), segmentation (SAM, Mask2Former), and multimodal models (CLIP, LLaVA), and they have become the default architecture for foundation models in computer vision.


