Open Vocabulary Detection

Open vocabulary detection (OVD) lets a model find objects described by arbitrary text, not just a fixed set of classes defined during training. Consider a traditional detector trained on 80 COCO classes. It can only find those 80 categories. An OVD model can spot a "rusty bolt," "cracked windshield," or "overripe banana" without retraining, because it matches image regions against text embeddings instead of learned class IDs.
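The key mechanic — matching a region against text embeddings rather than fixed class IDs — can be sketched in a few lines. This is a minimal illustration, not any particular model's API: the random vectors stand in for embeddings that a real system would get from a vision-language model's image and text encoders, and `classify` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-in for a 512-d region embedding from a vision encoder.
region_feature = normalize(rng.standard_normal(512))

def classify(region, prompts, text_embeddings):
    # Cosine similarity of one region feature against each text prompt.
    sims = normalize(text_embeddings) @ region
    return prompts[int(np.argmax(sims))]

# Changing the vocabulary is just changing the prompt list -- no retraining.
vocab = ["rusty bolt", "cracked windshield", "overripe banana"]
text_embeddings = rng.standard_normal((3, 512))
print(classify(region_feature, vocab, text_embeddings))
```

With real encoders, the prompt list can be swapped at inference time for any categories you can describe in words.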

OVD models use CLIP-style vision-language alignment to match region proposals with text descriptions. YOLO-World extends the YOLO architecture with a text encoder so you can pass custom category names at inference time. Grounding DINO pairs a DINO-based transformer detector with phrase grounding, so it can localize objects referred to by free-form text. OWL-ViT from Google uses a ViT backbone with text-conditioned detection heads. The shared pattern: replace the classification head (which outputs scores for N fixed classes) with a similarity head (which computes cosine similarity between region features and text features). This opens up the class space to anything you can describe in words.
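The shared pattern above can be made concrete by putting the two heads side by side. This is a hedged NumPy sketch, not code from any of the models named: the dimensions, the random features, and the temperature value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_CLASSES, M_PROMPTS, R = 256, 80, 3, 5  # feature dim, fixed classes, prompts, regions

region_feats = rng.standard_normal((R, D))  # stand-in region features

# Closed-vocabulary head: a learned weight matrix hard-codes the class count.
W = rng.standard_normal((D, N_CLASSES))
closed_scores = region_feats @ W            # shape (R, 80) -- always 80 classes

# Open-vocabulary head: scores are similarities to text embeddings, so the
# "class count" is simply however many prompts you pass in.
text_feats = rng.standard_normal((M_PROMPTS, D))

def cosine_scores(regions, texts, temperature=0.07):
    # Normalize both sides, then take dot products (= cosine similarity),
    # divided by a CLIP-style temperature before any softmax.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    t = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    return (r @ t.T) / temperature          # shape (R, M)

open_scores = cosine_scores(region_feats, text_feats)
print(closed_scores.shape, open_scores.shape)  # (5, 80) (5, 3)
```

The closed head's output width is fixed at training time; the similarity head's output width tracks the prompt list, which is what makes the vocabulary open.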

OVD is valuable when object categories change often or are too numerous to label up front. Manufacturing defect types vary by product line and evolve over time. Warehouse contents shift constantly. Environmental monitoring targets change by season. In domains like these, where the list of things to find isn't static, OVD lets teams deploy detection without retraining every time the target vocabulary changes.

Get Started Now

Get Started using Datature’s computer vision platform now for free.