Cross-Attention
Cross-attention is a variant of the transformer attention mechanism in which queries come from one sequence while keys and values come from a different one. In a vision-language model (VLM), text tokens generate queries that attend to image patch tokens. This lets the language model focus on specific image regions when generating each output word. A captioning model "looks at" the dog in the image when writing "dog" in the caption.
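The mechanism can be sketched in a few lines of numpy. This is a minimal single-head version (no batching, masking, or multi-head reshaping); the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_hidden, Wq, Wk, Wv):
    """Queries from the text side; keys and values from the image side."""
    Q = text_hidden @ Wq                       # (n_text, d)
    K = image_hidden @ Wk                      # (n_patches, d)
    V = image_hidden @ Wv                      # (n_patches, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_text, n_patches)
    weights = softmax(scores, axis=-1)         # each text token's distribution over patches
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))             # 5 text tokens
image = rng.standard_normal((16, d))           # 16 image patch tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = cross_attention(text, image, Wq, Wk, Wv)
# out has one vector per text token; each row of attn sums to 1 over the patches
```

The only difference from self-attention here is which tensors feed the key/value projections; the scaled dot-product itself is unchanged.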
In self-attention, queries, keys, and values all come from the same input. Cross-attention splits them: queries from the decoder (text side), keys and values from the encoder (image side). Models implement this in different ways. Flamingo inserts cross-attention layers between frozen LLM layers. BLIP-2's Q-Former uses learned query tokens that cross-attend to image features. Other models like LLaVA and PaliGemma skip explicit cross-attention, instead projecting image tokens into the LLM's input and relying on self-attention over the combined sequence. The choice affects training cost, memory usage, and how tightly vision and language components are coupled.
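The projection-based alternative (the LLaVA/PaliGemma style mentioned above) can be sketched as follows. This is a simplified illustration with made-up dimensions; real models typically use an MLP projector and feed the combined sequence through the full LLM:

```python
import numpy as np

def project_and_concat(image_feats, text_embeds, W_proj):
    """Projection-style fusion sketch: map image patch features into the
    LLM's embedding space and prepend them to the text token embeddings.
    The LLM's ordinary self-attention then mixes the two modalities,
    so no dedicated cross-attention layers are needed."""
    image_tokens = image_feats @ W_proj                       # (n_patches, d_model)
    return np.concatenate([image_tokens, text_embeds], axis=0)

rng = np.random.default_rng(1)
image_feats = rng.standard_normal((16, 32))   # vision encoder output, d_vision=32
text_embeds = rng.standard_normal((5, 64))    # LLM token embeddings, d_model=64
W_proj = rng.standard_normal((32, 64))        # learned projection (illustrative)
seq = project_and_concat(image_feats, text_embeds, W_proj)
# seq is one combined sequence of 16 + 5 tokens for standard self-attention
```

The trade-off is visible in the shapes: the projection route lengthens the LLM's input sequence (raising self-attention cost), while the Flamingo-style route keeps the text sequence short but adds new cross-attention parameters between frozen layers.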
Cross-attention appears in VLMs, text-to-image generators (Stable Diffusion uses it to condition generation on text prompts), video understanding models, and multimodal retrieval systems. It's also a useful debugging tool. Attention maps show which image regions the model relies on for each output token. This helps teams build trust in predictions and catch failure modes early.
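For the debugging use case, the key step is reshaping one text token's attention weights over image patches back into the 2D patch grid so it can be rendered as a heatmap. A minimal sketch, assuming a 4x4 patch grid and row-major patch ordering (both hypothetical choices for illustration):

```python
import numpy as np

def attention_map_for_token(attn_weights, token_idx, grid=(4, 4)):
    """Reshape one text token's attention distribution over image patches
    into the 2D patch grid, e.g. to overlay on the input image."""
    row = attn_weights[token_idx]              # (n_patches,)
    return row.reshape(grid)

rng = np.random.default_rng(2)
scores = rng.standard_normal((5, 16))          # 5 text tokens x 16 patches
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = e / e.sum(axis=-1, keepdims=True)       # softmax: rows sum to 1
heat = attention_map_for_token(attn, token_idx=3, grid=(4, 4))
# heat.argmax() locates the patch this token attends to most strongly
```

In practice the weights would come from a real model (e.g. a library option that returns attention tensors) and would be averaged or inspected per head, but the reshape-and-overlay step is the same.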


