Visual Grounding
Visual grounding is the task of taking a natural language phrase and locating the corresponding region in an image. Given the text "the person wearing a blue jacket on the left" and a photo of a crowded sidewalk, a visual grounding model returns the bounding box coordinates around that specific person. This differs from object detection, which uses fixed class labels, and from image captioning, which generates text. Visual grounding goes the other direction: from language to spatial location.
Visual grounding models use cross-modal attention to align text tokens with image regions. Grounding DINO combines a DINO-based object detector with a text encoder to perform open-set grounding, detecting objects based on arbitrary text prompts. Florence-2 supports grounding as one of its multi-task outputs. Evaluation uses the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, which test whether models can distinguish objects based on spatial relationships, attributes, and context ("the smaller dog" vs "the larger dog"). Accuracy is reported as the percentage of predictions whose box overlaps the ground-truth box with an IoU above 0.5.
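The IoU-based accuracy metric above is straightforward to compute. A minimal sketch, using `(x1, y1, x2, y2)` pixel coordinates (the function and variable names here are illustrative, not from any benchmark toolkit):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle; width/height clamp to 0 when boxes don't overlap.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)
```

For example, a predicted box shifted by half its width against its ground truth gives IoU = 1/3, which counts as a miss at the standard 0.5 threshold.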
Visual grounding enables text-based annotation (describe what you want labeled rather than clicking on it), interactive image editing ("remove the tree behind the house"), and robotic manipulation (connecting spoken commands to physical objects). It also powers accessibility features that answer spatial questions like "where is the exit sign?" with a highlighted region. For computer vision workflows, the key benefit is dataset querying: search for "images where the defect is near the edge" across thousands of inspection photos using plain language.

