Referring Expression Comprehension

Referring expression comprehension (REC) is the task of localizing a specific object in an image given a natural language description that uniquely identifies it. Unlike object detection (which finds every instance of "dog"), REC finds one specific object: "the brown dog sitting under the table," not the one standing by the door. The model must parse the language, understand spatial relationships and attributes, and return the bounding box of the intended object.
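To make the contrast concrete, here is a toy sketch in pure Python. The hand-made `candidates` list and the keyword-matching `refer` function are illustrative assumptions standing in for a real detector and REC model, not an actual implementation:

```python
# Toy illustration of detection vs. REC.
# The candidate boxes and naive keyword matching below are hand-crafted
# assumptions standing in for real model outputs.

candidates = [
    {"label": "dog", "attrs": {"brown", "under the table"}, "box": (40, 120, 110, 200)},
    {"label": "dog", "attrs": {"white", "by the door"},     "box": (300, 80, 370, 190)},
    {"label": "cat", "attrs": {"black", "on the sofa"},     "box": (180, 60, 230, 110)},
]

def detect(label):
    """Object detection: return ALL boxes of a category."""
    return [c["box"] for c in candidates if c["label"] == label]

def refer(expression):
    """Naive REC stand-in: keep the single candidate whose label and
    attributes all appear in the expression; a real REC model grounds
    the language in image features instead of matching strings."""
    matches = [
        c for c in candidates
        if c["label"] in expression
        and all(a in expression for a in c["attrs"])
    ]
    return matches[0]["box"] if len(matches) == 1 else None

print(detect("dog"))                                    # both dog boxes
print(refer("the brown dog sitting under the table"))   # exactly one box
```

Detection returns two boxes for "dog"; the referring expression narrows the result to the single intended object.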

REC is evaluated on the RefCOCO family of benchmarks: RefCOCO (short referring expressions), RefCOCO+ (no location words, forcing attribute-based descriptions), and RefCOCOg (longer, more complex descriptions). Models like Florence-2, Grounding DINO, and Qwen-VL support REC natively. The task requires cross-modal reasoning: the model must ground the linguistic description ("the person wearing red") in specific image regions. Accuracy is measured as the fraction of predictions whose box overlaps the ground-truth box with intersection over union (IoU) above 0.5, commonly reported as Acc@0.5.
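The Acc@0.5 metric fits in a few lines. This is a minimal sketch assuming boxes in `(x1, y1, x2, y2)` pixel coordinates; the function names are ours, not from any benchmark toolkit:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Acc@0.5: fraction of predicted boxes whose IoU with the
    ground truth exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```

Because each expression refers to exactly one object, there is no matching step as in detection mAP: every prediction is scored against its single ground-truth box.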

REC enables natural-language-driven annotation: describe the object you want to label instead of drawing a box. It also powers interactive image editing, robotic manipulation, and accessibility tools. Tell a robot "pick up the blue wrench on the left," and it knows which object you mean. For data annotation workflows, REC can reduce labeling time when the target object is easier to describe than to click on.

Get Started Now

Get Started using Datature’s computer vision platform now for free.