Vision-language models combine image understanding with text reasoning in a single architecture. Instead of just drawing a bounding box, a VLM can point to a region in an image and describe it in natural language. This tutorial shows you how to build one from scratch using Datature Vi, from raw data to working model.
What This Tutorial Covers
- What phrase grounding is and how it differs from standard object detection
- How visual and textual inputs merge inside a VLM architecture
- Annotating multimodal training data with text-region pairs
- Setting up and running a VLM training workflow in Datature Vi
- Running inference on new images and interpreting grounded outputs
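Before diving in, it helps to see what a text-region pair looks like in practice. The sketch below is purely illustrative, not Datature Vi's actual annotation schema: each free-form phrase in a caption is linked to the image region it describes.

```python
# Hypothetical text-region pair annotation (illustrative format only,
# not Datature Vi's real schema). Each phrase from the caption is tied
# to the pixel-coordinate box [x1, y1, x2, y2] of the region it describes.
annotation = {
    "image": "shelf_001.jpg",
    "caption": "a red mug next to a silver laptop",
    "regions": [
        {"phrase": "a red mug", "box": [34, 60, 140, 180]},
        {"phrase": "a silver laptop", "box": [150, 40, 420, 200]},
    ],
}

# Sanity check: every grounded phrase should appear in the caption,
# since grounding links caption spans to regions.
for region in annotation["regions"]:
    assert region["phrase"] in annotation["caption"]
```

The key property is that the text is free-form rather than drawn from a fixed label set, which is exactly what the annotation step in this tutorial produces.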
When Phrase Grounding Matters
Standard detection gives you boxes and class labels. Phrase grounding gives you boxes tied to free-form text descriptions. That distinction matters when labels alone aren't enough: product catalogs where items need text descriptions tied to specific regions, medical scans where radiologists need findings linked to anatomical locations, or retail settings where visual search needs to match natural language queries to image content.
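The difference is easiest to see in the shape of the outputs. The structures below are hypothetical, simplified for illustration: a detector returns a class label from a fixed vocabulary, while a phrase-grounding model returns the same box tied to a free-form description.

```python
# Hypothetical output structures (illustrative only, not any specific API).
# Standard object detection: a box plus a class label from a fixed vocabulary.
detection = {"box": [120, 45, 310, 220], "label": "bottle", "score": 0.91}

# Phrase grounding: the same box tied to a free-form phrase, so the model
# can distinguish "the green glass bottle on the left shelf" from every
# other bottle in the scene.
grounded = {
    "box": [120, 45, 310, 220],
    "phrase": "the green glass bottle on the left shelf",
    "score": 0.87,
}

# A class label answers "what is this?"; a grounded phrase answers
# "which region does this description refer to?"
print(detection["label"])   # a closed-vocabulary class
print(grounded["phrase"])   # open-ended natural language
```

That open vocabulary is what makes the catalog, medical, and retail use cases above possible: the text is not limited to a predefined label set.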
What Makes This Tutorial Different
Most VLM content focuses on using pre-trained models for inference. This tutorial starts from annotation and goes through training. You control the vocabulary, the grounding targets, and the model behavior. No pre-trained checkpoints, no prompt engineering around someone else's model. Every step runs in the browser with no local GPU or code required.

