Image Captioning

Image captioning generates a natural language sentence describing what's in an image. Given a photo, the model might produce "two people sitting at a table in a restaurant" or "an aerial view of a flooded highway." This is different from classification, which picks from fixed labels, or detection, which draws boxes. Captioning produces free-form text covering objects, actions, relationships, and context.

Modern captioning uses vision-language models (VLMs) with an encoder-decoder design: a vision encoder (such as a ViT or SigLIP) extracts visual features, and a language-model decoder generates the caption token by token. BLIP-2 introduced the Q-Former to bridge frozen vision and language models efficiently. CoCa combined contrastive and captioning objectives in a single model. PaliGemma, Florence-2, and Qwen-VL all support captioning as a core task. Dense captioning extends this by generating a separate caption for each of several regions in the image. Common evaluation metrics include CIDEr (consensus with human references), BLEU (n-gram overlap), and METEOR (semantic matching).
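To make the metrics concrete, here is a minimal sketch of BLEU for a single caption: clipped n-gram precisions combined with a brevity penalty. This is a simplified illustration (no smoothing, whitespace tokenization), not a replacement for a standard implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU for one candidate caption against one or more references.

    Tokens are whitespace-split; no smoothing is applied, so any
    missing n-gram order zeroes the score (fine for illustration).
    """
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            return 0.0
        # Clip counts: each candidate n-gram is credited at most as
        # many times as it appears in the best-matching reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_prec_sum += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty: discourage captions shorter than the
    # closest-length reference.
    closest = min(refs, key=lambda r: abs(len(r) - len(cand)))
    bp = 1.0 if len(cand) > len(closest) else math.exp(1 - len(closest) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)

# All n-grams match, but the candidate is 6 tokens vs. a 9-token
# reference, so only the brevity penalty reduces the score.
print(bleu("two people sitting at a table",
           ["two people sitting at a table in a restaurant"]))  # → exp(-0.5) ≈ 0.607
```

CIDEr and METEOR follow the same evaluate-against-references pattern but weight n-grams by TF-IDF consensus and allow stem/synonym matches, respectively.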

Captioning powers several practical workflows: automatic alt-text for web accessibility, visual search engines that index images by description, content moderation, and product catalog generation from photographs. Medical teams use it to draft reports from diagnostic images. For organizations managing thousands of images, automated captioning cuts per-image processing time from minutes to milliseconds.
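As a sketch of the alt-text workflow above: a model-generated caption usually needs light post-processing before it becomes an HTML `alt` attribute. The `caption_to_alt` helper and the 125-character cap are illustrative assumptions (the cap follows common screen-reader guidance, not a fixed standard), and a fixed string stands in for real model output.

```python
import html

def caption_to_alt(caption, max_len=125):
    """Turn a model-generated caption into an HTML img tag with alt text.

    Hypothetical helper: trims the trailing period, truncates long
    captions at a word boundary, and HTML-escapes the result.
    """
    text = caption.strip().rstrip(".")
    if len(text) > max_len:
        text = text[:max_len].rsplit(" ", 1)[0] + "…"
    return f'<img src="photo.jpg" alt="{html.escape(text)}">'

# In practice the caption comes from a captioning model; a fixed
# string stands in for the model output here.
print(caption_to_alt("two people sitting at a table in a restaurant."))
# → <img src="photo.jpg" alt="two people sitting at a table in a restaurant">
```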

Get Started Now

Get Started using Datature’s computer vision platform now for free.