Transformer

A transformer is a neural network architecture built on self-attention, a mechanism that lets each element in a sequence attend to every other element and learn which relationships matter most. Originally designed for natural language processing, transformers have become the dominant architecture across nearly all areas of deep learning, including computer vision, speech recognition, and multimodal reasoning.
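The core mechanism can be shown in a few lines. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention; the weight matrices `Wq`, `Wk`, and `Wv` are illustrative random initializations, not a trained model, and real transformers add multiple heads, residual connections, and layer normalization around this step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n): every token scores every token
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                             # toy sequence: 4 tokens, dimension 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The `(n, n)` score matrix is where "each element attends to every other element" becomes literal: entry `(i, j)` is how much token `i` draws on token `j`.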

In computer vision, the Vision Transformer (ViT) introduced a simple idea: split an image into fixed-size patches (typically 16x16 pixels), flatten each patch into a vector, add positional embeddings, and feed the sequence through standard transformer encoder layers. This approach matches or exceeds convolutional networks on image classification when pre-trained on large datasets. Subsequent architectures like Swin Transformer added hierarchical feature maps and shifted windows to handle dense prediction tasks like detection and segmentation more efficiently.
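The patch-splitting step above can be sketched in NumPy. The `patchify` helper here is a hypothetical illustration of the idea, not ViT's actual implementation: a 224×224×3 image with 16×16 patches yields 196 tokens, each a flattened vector of 16·16·3 = 768 values (a linear projection and positional embeddings would follow in a real ViT).

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly into patches"
    # Carve the grid of patches, then flatten each patch into one vector
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return p  # shape: (num_patches, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` then plays the same role as a word embedding in NLP: the image becomes a sequence the transformer encoder can process unchanged.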

Transformers now form the backbone of most state-of-the-art vision systems. DETR uses a transformer decoder for end-to-end object detection without anchor boxes or non-maximum suppression (NMS). SAM (Segment Anything Model) uses a ViT encoder for promptable, general-purpose image segmentation. Multimodal models like CLIP and GPT-4V pair vision transformers with language models to connect images and text. The main drawback is compute cost: self-attention scales quadratically with sequence length, making high-resolution image processing expensive. Windowed attention, linear attention approximations, and hybrid CNN-transformer designs all address this trade-off.
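To make the quadratic cost concrete: the attention score matrix for n tokens has n² entries, so doubling an image's height and width quadruples the token count and multiplies attention cost by sixteen. A back-of-the-envelope sketch, assuming 16×16 patches:

```python
# Attention cost grows with the square of the token count.
# For a square image of side `res` split into 16x16 patches:
for res in [224, 448, 896]:
    n = (res // 16) ** 2      # number of patch tokens
    print(res, n, n * n)      # resolution, tokens, attention-matrix entries
```

This is why windowed attention (as in Swin) restricts each token to a local neighborhood: cost then grows linearly in the number of windows rather than quadratically in the full token count.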
