Referring Expression Comprehension

Referring expression comprehension (REC) is the task of localizing a specific object in an image given a natural language description that uniquely identifies it. Unlike object detection (which finds every instance of "dog"), REC finds one specific object: "the brown dog sitting under the table," not the one standing by the door. The model must parse the language, understand spatial relationships and attributes, and return the bounding box of the intended object.
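To make the contrast concrete, here is a toy sketch in pure Python. The hand-made `candidates` list and the keyword-matching `refer` function are illustrative assumptions standing in for a real detector and REC model, not an actual implementation:

```python
# Toy illustration of detection vs. REC.
# The candidate boxes and naive keyword matching below are hand-crafted
# assumptions standing in for real model outputs.

candidates = [
    {"label": "dog", "attrs": {"brown", "under the table"}, "box": (40, 120, 110, 200)},
    {"label": "dog", "attrs": {"white", "by the door"},     "box": (300, 80, 370, 190)},
    {"label": "cat", "attrs": {"black", "on the sofa"},     "box": (180, 60, 230, 110)},
]

def detect(label):
    """Object detection: return ALL boxes of a category."""
    return [c["box"] for c in candidates if c["label"] == label]

def refer(expression):
    """Naive REC stand-in: keep the single candidate whose label and
    attributes all appear in the expression; a real REC model grounds
    the language in image features instead of matching strings."""
    matches = [
        c for c in candidates
        if c["label"] in expression
        and all(a in expression for a in c["attrs"])
    ]
    return matches[0]["box"] if len(matches) == 1 else None

print(detect("dog"))                                    # both dog boxes
print(refer("the brown dog sitting under the table"))   # exactly one box
```

Detection returns two boxes for "dog"; the referring expression narrows the result to the single intended object.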

REC is evaluated on the RefCOCO family of benchmarks: RefCOCO (short referring expressions), RefCOCO+ (no location words, forcing attribute-based descriptions), and RefCOCOg (longer, more complex descriptions). Models like Florence-2, Grounding DINO, and Qwen-VL support REC natively. The task requires cross-modal reasoning: the model must ground the linguistic description ("the person wearing red") in specific image regions. Accuracy is measured as the fraction of predictions whose box overlaps the ground-truth box with intersection over union (IoU) above 0.5, commonly reported as Acc@0.5.
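The Acc@0.5 metric fits in a few lines. This is a minimal sketch assuming boxes in `(x1, y1, x2, y2)` pixel coordinates; the function names are ours, not from any benchmark toolkit:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Acc@0.5: fraction of predicted boxes whose IoU with the
    ground truth exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```

Because each expression refers to exactly one object, there is no matching step as in detection mAP: every prediction is scored against its single ground-truth box.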

REC enables natural-language-driven annotation: describe the object you want to label instead of drawing a box. It also powers interactive image editing, robotic manipulation, and accessibility tools. Tell a robot "pick up the blue wrench on the left," and it knows which object you mean. For data annotation workflows, REC can reduce labeling time when the target object is easier to describe than to click on.

Get Started Now

Get Started using Datature’s computer vision platform now for free.