Visual Prompting

Visual prompting is a way to guide a vision model by providing spatial cues directly on the image rather than through text. Click a point on an object, draw a bounding box around a region, or paint a rough mask. The model then uses that visual prompt to understand which part of the image you care about. The Segment Anything Model (SAM) is the most prominent example: click a point, get a precise segmentation mask.

SAM accepts three types of visual prompts: point prompts (positive clicks on the object, negative clicks on background), box prompts (a bounding box around the region of interest), and mask prompts (a rough mask that SAM refines). The prompt encoder converts these spatial inputs into tokens that condition the mask decoder. This is distinct from text prompting: visual prompts are spatial, while text prompts are semantic. Some models combine both: Grounding DINO takes text prompts and outputs boxes, which can then serve as visual prompts for SAM to generate precise masks.
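The three prompt formats above can be sketched concretely. The snippet below builds the arrays in the shapes that the `segment_anything` package's `SamPredictor.predict()` expects; the model call itself is shown only in comments, since running SAM requires downloading a checkpoint. The specific coordinates and the `"vit_h"` variant are illustrative assumptions, not values from the article.

```python
import numpy as np

# Point prompts: N x 2 (x, y) pixel coordinates plus N labels,
# where 1 = foreground (positive click) and 0 = background (negative click).
point_coords = np.array([[320, 240],   # positive click on the object
                         [50, 50]],    # negative click on background
                        dtype=np.float32)
point_labels = np.array([1, 0], dtype=np.int32)

# Box prompt: a single box in XYXY order (x_min, y_min, x_max, y_max).
box = np.array([100, 80, 540, 400], dtype=np.float32)

# Mask prompt: a rough low-resolution mask logit map; SAM expects shape
# (1, 256, 256). Positive values mark the region to refine.
mask_input = np.zeros((1, 256, 256), dtype=np.float32)
mask_input[0, 60:200, 80:220] = 10.0  # crude blob over the object

# With a loaded model, these arrays would condition the mask decoder:
#   from segment_anything import sam_model_registry, SamPredictor
#   sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
#   predictor = SamPredictor(sam)
#   predictor.set_image(image)          # RGB numpy array, HxWx3
#   masks, scores, logits = predictor.predict(
#       point_coords=point_coords, point_labels=point_labels,
#       box=box, mask_input=mask_input, multimask_output=True)

print(point_coords.shape, point_labels.shape, box.shape, mask_input.shape)
```

In the Grounding DINO + SAM pipeline mentioned above, the box array would simply come from Grounding DINO's text-conditioned detections instead of being hand-drawn.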

Visual prompting powers interactive annotation tools (click to segment), medical image analysis (point at a lesion to get its boundary), image editing (select an object by boxing it), and robotic grasping (indicate which object to pick up). In annotation workflows, visual prompting with SAM reduces per-object labeling time from 30-60 seconds with manual polygon drawing to 2-5 seconds with a single click.

Get Started Now

Get started with Datature’s computer vision platform now for free.