Visual Grounding
Visual grounding is the task of taking a natural language phrase and locating the corresponding region in an image. Given the text "the person wearing a blue jacket on the left" and a photo of a crowded sidewalk, a visual grounding model returns the bounding box coordinates around that specific person. This differs from object detection, which uses fixed class labels, and from image captioning, which generates text. Visual grounding goes the other direction: from language to spatial location.
Visual grounding models use cross-modal attention to align text tokens with image regions. Grounding DINO combines a DINO-based object detector with a text encoder to perform open-set grounding, detecting objects based on arbitrary text prompts. Florence-2 supports grounding as one of its multi-task outputs. Evaluation uses the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, which test whether models can distinguish objects based on spatial relationships, attributes, and context ("the smaller dog" vs "the larger dog"). Accuracy is reported as the percentage of predictions whose box overlaps the ground-truth box with an IoU above 0.5.
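The IoU-based accuracy metric above is straightforward to compute. A minimal sketch, using `(x1, y1, x2, y2)` pixel coordinates (the function and variable names here are illustrative, not from any benchmark toolkit):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle; width/height clamp to 0 when boxes don't overlap.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)
```

For example, a predicted box shifted by half its width against its ground truth gives IoU = 1/3, which counts as a miss at the standard 0.5 threshold.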
Visual grounding enables text-based annotation (describe what you want labeled rather than clicking on it), interactive image editing ("remove the tree behind the house"), and robotic manipulation (connecting spoken commands to physical objects). It also powers accessibility features that answer spatial questions like "where is the exit sign?" with a highlighted region. For computer vision workflows, the key benefit is dataset querying: search for "images where the defect is near the edge" across thousands of inspection photos using plain language.

