Cross-Attention

Cross-attention is a variant of the transformer attention mechanism where queries come from one sequence and keys/values come from a different one. In a VLM, text tokens generate queries that attend to image patch tokens. This lets the language model focus on specific image regions when generating each output word. A captioning model "looks at" the dog in the image when writing "dog" in the caption.
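The mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any particular model's implementation: the dimensions, weight matrices, and random inputs are placeholders, and real VLMs use multi-head attention with learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_feats, Wq, Wk, Wv):
    # Queries come from the text side; keys and values from the image side.
    Q = text_hidden @ Wq       # (n_text, d)
    K = image_feats @ Wk       # (n_patches, d)
    V = image_feats @ Wv       # (n_patches, d)
    d = Q.shape[-1]
    # Each text token gets a distribution over image patches.
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n_text, n_patches)
    return attn @ V, attn

# Toy sizes: 4 text tokens, 16 image patches, hidden dim 8.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
img = rng.standard_normal((16, 8))
W = lambda: rng.standard_normal((8, 8)) * 0.1
out, attn = cross_attention(text, img, W(), W(), W())
print(out.shape, attn.shape)  # (4, 8) (4, 16)
```

Each row of `attn` sums to 1: it is the weighting over image patches that the corresponding text token uses when building its output.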

In self-attention, queries, keys, and values all come from the same input. Cross-attention splits them: queries from the decoder (text side), keys and values from the encoder (image side). Models implement this in different ways. Flamingo inserts cross-attention layers between frozen LLM layers. BLIP-2's Q-Former uses learned query tokens that cross-attend to image features. Other models like LLaVA and PaliGemma skip explicit cross-attention, instead projecting image tokens into the LLM's input and relying on self-attention over the combined sequence. The choice affects training cost, memory usage, and how tightly vision and language components are coupled.
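The LLaVA/PaliGemma-style alternative mentioned above can be sketched as well: instead of dedicated cross-attention layers, image features are projected into the LLM's embedding space and prepended to the text sequence. The shapes and the single linear projection here are illustrative assumptions; real systems use trained projections (sometimes an MLP) and far larger dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patches, d_vision = 16, 32   # vision encoder output (toy sizes)
n_text, d_model = 8, 64        # LLM embedding space (toy sizes)

image_feats = rng.standard_normal((n_patches, d_vision))
text_embeds = rng.standard_normal((n_text, d_model))

# A learned projection maps vision features into the LLM embedding space.
W_proj = rng.standard_normal((d_vision, d_model)) * 0.1
image_tokens = image_feats @ W_proj               # (16, 64)

# One combined sequence; ordinary self-attention in the LLM now
# mixes modalities without any explicit cross-attention layer.
sequence = np.concatenate([image_tokens, text_embeds], axis=0)
print(sequence.shape)  # (24, 64)
```

The trade-off is visible in the shapes: every image patch becomes a full token in the LLM's context, which increases sequence length (and attention cost) but leaves the LLM architecture unchanged.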

Cross-attention appears in VLMs, text-to-image generators (Stable Diffusion uses it to condition generation on text prompts), video understanding models, and multimodal retrieval systems. Its attention maps are also a useful debugging tool: they show which image regions the model relies on for each output token. This helps teams build trust in predictions and catch failure modes early.
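Inspecting those attention maps amounts to reshaping one token's row of attention weights back into the image's patch grid. A minimal sketch, assuming a 4x4 patch layout and stand-in random weights in place of a real model's cross-attention output:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in cross-attention weights: 5 text tokens over 16 image patches.
rng = np.random.default_rng(2)
attn = softmax(rng.standard_normal((5, 16)))

# Reshape one token's attention row into the assumed 4x4 patch grid
# to see which image regions that token relied on.
heatmap = attn[2].reshape(4, 4)
print(heatmap.shape)  # (4, 4)
```

Overlaying such a heatmap (upsampled to image resolution) on the input image is a quick way to check whether, say, the token for "dog" actually attends to the dog.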

Get Started Now

Get Started using Datature’s computer vision platform now for free.