Patch Embedding
Patch embedding is how Vision Transformers convert an image into a sequence of tokens. The image is divided into a grid of fixed-size patches (commonly 16x16 or 14x14 pixels). Each patch is flattened into a 1D vector and passed through a linear projection (or a small convolutional layer) to produce an embedding vector. A 224x224 image split into 16x16 patches produces 196 tokens, each representing a region of the image, similar to how words are tokens in text.
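The arithmetic above can be sketched in a few lines. This is an illustrative helper (not from any particular library) that computes the token count and the flattened patch dimension before the linear projection:

```python
def patch_tokens(image_size, patch_size, channels=3):
    """Token count and flattened patch length for ViT-style patch
    embedding (assumes image_size is divisible by patch_size)."""
    grid = image_size // patch_size          # patches per side
    num_tokens = grid * grid                 # total patches = tokens
    patch_dim = patch_size * patch_size * channels  # values per flattened patch
    return num_tokens, patch_dim

# 224x224 RGB image, 16x16 patches -> 14x14 grid = 196 tokens,
# each a 16*16*3 = 768-value vector before projection.
print(patch_tokens(224, 16))  # (196, 768)
```

Each of those 768-value vectors is then mapped by a learned linear layer to the model's embedding dimension.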
Positional embeddings are added to each patch embedding so the model knows where each patch sits in the original image. Without them, the model would treat patches as an unordered set. Some architectures add a special [CLS] token that aggregates global image information. Patch size is a key design choice: smaller patches (e.g., 14x14) give more tokens and finer spatial resolution, but because self-attention cost grows quadratically with token count, the compute cost rises steeply. Larger patches (e.g., 32x32) are cheaper but lose detail. Variable-resolution approaches like NaViT and Qwen-VL's dynamic resolution handle different image sizes by adjusting the number of patches rather than resizing every image to a fixed shape.
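The patch-size tradeoff is easy to quantify. A small sketch (illustrative, not tied to any specific model) counts tokens and pairwise attention interactions, showing why halving the patch size roughly 16x's attention cost:

```python
def attention_pairs(image_size, patch_size):
    """Token count and pairwise attention interactions (tokens^2)
    for a square image split into square patches."""
    n = (image_size // patch_size) ** 2
    return n, n * n

# Halving patch size: 4x the tokens, 16x the attention pairs.
for p in (32, 16, 8):
    tokens, pairs = attention_pairs(224, p)
    print(f"patch {p:2d}: {tokens:4d} tokens, {pairs:7d} pairs")
```

Halving the patch size quadruples the token count, and squaring that gives a 16x increase in attention interactions, which is why fine-grained patching is expensive.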
Understanding patch embedding explains why VLMs have resolution limits, why processing high-resolution images is expensive, and why some models handle fine details better than others. When a VLM fails to read small text in an image, it's often because the text falls within a single patch and the embedding can't represent those details. Tiling strategies (splitting large images into overlapping crops, processing each separately) are a common workaround.
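A minimal sketch of the tiling idea: given an oversized image, compute top-left coordinates for overlapping fixed-size crops that cover it. The tile and overlap values here are illustrative, not taken from any particular model:

```python
def tile_coords(width, height, tile=224, overlap=32):
    """Top-left (x, y) coordinates of overlapping crops covering an
    image, so each crop can be patch-embedded at full resolution."""
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Ensure the right and bottom edges are fully covered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y) for y in ys for x in xs]

# A 224x224 image needs one crop; larger images get overlapping tiles.
print(tile_coords(224, 224))   # [(0, 0)]
print(len(tile_coords(448, 448)))
```

Each crop is processed through the patch embedding separately, so small text that would have fallen inside one patch of a downscaled image now spans many patches of a full-resolution crop.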