Patch Embedding
Patch embedding is how Vision Transformers convert an image into a sequence of tokens. The image is divided into a grid of fixed-size patches (commonly 16x16 or 14x14 pixels). Each patch is flattened into a 1D vector and passed through a linear projection (or a small convolutional layer) to produce an embedding vector. A 224x224 image split into 16x16 patches produces 196 tokens, each representing a region of the image, similar to how words are tokens in text.
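The arithmetic above can be sketched in a few lines. This is an illustrative helper (not from any particular library) that computes the token count and the flattened patch dimension before the linear projection:

```python
def patch_tokens(image_size, patch_size, channels=3):
    """Token count and flattened patch length for ViT-style patch
    embedding (assumes image_size is divisible by patch_size)."""
    grid = image_size // patch_size          # patches per side
    num_tokens = grid * grid                 # total patches = tokens
    patch_dim = patch_size * patch_size * channels  # values per flattened patch
    return num_tokens, patch_dim

# 224x224 RGB image, 16x16 patches -> 14x14 grid = 196 tokens,
# each a 16*16*3 = 768-value vector before projection.
print(patch_tokens(224, 16))  # (196, 768)
```

Each of those 768-value vectors is then mapped by a learned linear layer to the model's embedding dimension.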
Positional embeddings are added to each patch embedding so the model knows where each patch sits in the original image. Without them, the model would treat patches as an unordered set. Some architectures add a special [CLS] token that aggregates global image information. Patch size is a key design choice: smaller patches (e.g., 14x14) give more tokens and finer spatial resolution, but because self-attention cost grows quadratically with token count, the compute cost rises steeply. Larger patches (e.g., 32x32) are cheaper but lose detail. Variable-resolution approaches like NaViT and Qwen-VL's dynamic resolution handle different image sizes by adjusting the number of patches rather than resizing every image to a fixed shape.
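The patch-size tradeoff is easy to quantify. A small sketch (illustrative, not tied to any specific model) counts tokens and pairwise attention interactions, showing why halving the patch size roughly 16x's attention cost:

```python
def attention_pairs(image_size, patch_size):
    """Token count and pairwise attention interactions (tokens^2)
    for a square image split into square patches."""
    n = (image_size // patch_size) ** 2
    return n, n * n

# Halving patch size: 4x the tokens, 16x the attention pairs.
for p in (32, 16, 8):
    tokens, pairs = attention_pairs(224, p)
    print(f"patch {p:2d}: {tokens:4d} tokens, {pairs:7d} pairs")
```

Halving the patch size quadruples the token count, and squaring that gives a 16x increase in attention interactions, which is why fine-grained patching is expensive.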
Understanding patch embedding explains why VLMs have resolution limits, why processing high-resolution images is expensive, and why some models handle fine details better than others. When a VLM fails to read small text in an image, it's often because the text falls within a single patch and the embedding can't represent those details. Tiling strategies (splitting large images into overlapping crops, processing each separately) are a common workaround.
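A minimal sketch of the tiling idea: given an oversized image, compute top-left coordinates for overlapping fixed-size crops that cover it. The tile and overlap values here are illustrative, not taken from any particular model:

```python
def tile_coords(width, height, tile=224, overlap=32):
    """Top-left (x, y) coordinates of overlapping crops covering an
    image, so each crop can be patch-embedded at full resolution."""
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Ensure the right and bottom edges are fully covered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y) for y in ys for x in xs]

# A 224x224 image needs one crop; larger images get overlapping tiles.
print(tile_coords(224, 224))   # [(0, 0)]
print(len(tile_coords(448, 448)))
```

Each crop is processed through the patch embedding separately, so small text that would have fallen inside one patch of a downscaled image now spans many patches of a full-resolution crop.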