Cross-Attention
Cross-attention is a variant of the transformer attention mechanism in which queries come from one sequence while keys and values come from a different one. In a vision-language model (VLM), text tokens generate queries that attend to image patch tokens. This lets the language model focus on specific image regions when generating each output word. A captioning model "looks at" the dog in the image when writing "dog" in the caption.
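The mechanism can be sketched in a few lines of numpy. This is a minimal single-head version (no batching, masking, or multi-head reshaping); the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_hidden, image_hidden, Wq, Wk, Wv):
    """Queries from the text side; keys and values from the image side."""
    Q = text_hidden @ Wq                       # (n_text, d)
    K = image_hidden @ Wk                      # (n_patches, d)
    V = image_hidden @ Wv                      # (n_patches, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_text, n_patches)
    weights = softmax(scores, axis=-1)         # each text token's distribution over patches
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))             # 5 text tokens
image = rng.standard_normal((16, d))           # 16 image patch tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out, attn = cross_attention(text, image, Wq, Wk, Wv)
# out has one vector per text token; each row of attn sums to 1 over the patches
```

The only difference from self-attention here is which tensors feed the key/value projections; the scaled dot-product itself is unchanged.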
In self-attention, queries, keys, and values all come from the same input. Cross-attention splits them: queries from the decoder (text side), keys and values from the encoder (image side). Models implement this in different ways. Flamingo inserts cross-attention layers between frozen LLM layers. BLIP-2's Q-Former uses learned query tokens that cross-attend to image features. Other models like LLaVA and PaliGemma skip explicit cross-attention, instead projecting image tokens into the LLM's input and relying on self-attention over the combined sequence. The choice affects training cost, memory usage, and how tightly vision and language components are coupled.
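The projection-based alternative (the LLaVA/PaliGemma style mentioned above) can be sketched as follows. This is a simplified illustration with made-up dimensions; real models typically use an MLP projector and feed the combined sequence through the full LLM:

```python
import numpy as np

def project_and_concat(image_feats, text_embeds, W_proj):
    """Projection-style fusion sketch: map image patch features into the
    LLM's embedding space and prepend them to the text token embeddings.
    The LLM's ordinary self-attention then mixes the two modalities,
    so no dedicated cross-attention layers are needed."""
    image_tokens = image_feats @ W_proj                       # (n_patches, d_model)
    return np.concatenate([image_tokens, text_embeds], axis=0)

rng = np.random.default_rng(1)
image_feats = rng.standard_normal((16, 32))   # vision encoder output, d_vision=32
text_embeds = rng.standard_normal((5, 64))    # LLM token embeddings, d_model=64
W_proj = rng.standard_normal((32, 64))        # learned projection (illustrative)
seq = project_and_concat(image_feats, text_embeds, W_proj)
# seq is one combined sequence of 16 + 5 tokens for standard self-attention
```

The trade-off is visible in the shapes: the projection route lengthens the LLM's input sequence (raising self-attention cost), while the Flamingo-style route keeps the text sequence short but adds new cross-attention parameters between frozen layers.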
Cross-attention appears in VLMs, text-to-image generators (Stable Diffusion uses it to condition generation on text prompts), video understanding models, and multimodal retrieval systems. It's also a useful debugging tool. Attention maps show which image regions the model relies on for each output token. This helps teams build trust in predictions and catch failure modes early.
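For the debugging use case, the key step is reshaping one text token's attention weights over image patches back into the 2D patch grid so it can be rendered as a heatmap. A minimal sketch, assuming a 4x4 patch grid and row-major patch ordering (both hypothetical choices for illustration):

```python
import numpy as np

def attention_map_for_token(attn_weights, token_idx, grid=(4, 4)):
    """Reshape one text token's attention distribution over image patches
    into the 2D patch grid, e.g. to overlay on the input image."""
    row = attn_weights[token_idx]              # (n_patches,)
    return row.reshape(grid)

rng = np.random.default_rng(2)
scores = rng.standard_normal((5, 16))          # 5 text tokens x 16 patches
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = e / e.sum(axis=-1, keepdims=True)       # softmax: rows sum to 1
heat = attention_map_for_token(attn, token_idx=3, grid=(4, 4))
# heat.argmax() locates the patch this token attends to most strongly
```

In practice the weights would come from a real model (e.g. a library option that returns attention tensors) and would be averaged or inspected per head, but the reshape-and-overlay step is the same.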


