Vision-language models combine image understanding with text reasoning in a single architecture. Instead of just drawing a bounding box, a VLM can point to a region in an image and describe it in natural language. This tutorial shows you how to build one from scratch using Datature Vi, from raw data to working model.
What This Tutorial Covers
- What phrase grounding is and how it differs from standard object detection
- How visual and textual inputs merge inside a VLM architecture
- Annotating multimodal training data with text-region pairs
- Setting up and running a VLM training workflow in Datature Vi
- Running inference on new images and interpreting grounded outputs
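Before diving in, it helps to see what a text-region pair looks like in practice. The sketch below is purely illustrative, not Datature Vi's actual annotation schema: each free-form phrase in a caption is linked to the image region it describes.

```python
# Hypothetical text-region pair annotation (illustrative format only,
# not Datature Vi's real schema). Each phrase from the caption is tied
# to the pixel-coordinate box [x1, y1, x2, y2] of the region it describes.
annotation = {
    "image": "shelf_001.jpg",
    "caption": "a red mug next to a silver laptop",
    "regions": [
        {"phrase": "a red mug", "box": [34, 60, 140, 180]},
        {"phrase": "a silver laptop", "box": [150, 40, 420, 200]},
    ],
}

# Sanity check: every grounded phrase should appear in the caption,
# since grounding links caption spans to regions.
for region in annotation["regions"]:
    assert region["phrase"] in annotation["caption"]
```

The key property is that the text is free-form rather than drawn from a fixed label set, which is exactly what the annotation step in this tutorial produces.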
When Phrase Grounding Matters
Standard detection gives you boxes and class labels. Phrase grounding gives you boxes tied to free-form text descriptions. That distinction matters when labels alone aren't enough: product catalogs where items need text descriptions tied to specific regions, medical scans where radiologists need findings linked to anatomical locations, or retail settings where visual search needs to match natural language queries to image content.
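The difference is easiest to see in the shape of the outputs. The structures below are hypothetical, simplified for illustration: a detector returns a class label from a fixed vocabulary, while a phrase-grounding model returns the same box tied to a free-form description.

```python
# Hypothetical output structures (illustrative only, not any specific API).
# Standard object detection: a box plus a class label from a fixed vocabulary.
detection = {"box": [120, 45, 310, 220], "label": "bottle", "score": 0.91}

# Phrase grounding: the same box tied to a free-form phrase, so the model
# can distinguish "the green glass bottle on the left shelf" from every
# other bottle in the scene.
grounded = {
    "box": [120, 45, 310, 220],
    "phrase": "the green glass bottle on the left shelf",
    "score": 0.87,
}

# A class label answers "what is this?"; a grounded phrase answers
# "which region does this description refer to?"
print(detection["label"])   # a closed-vocabulary class
print(grounded["phrase"])   # open-ended natural language
```

That open vocabulary is what makes the catalog, medical, and retail use cases above possible: the text is not limited to a predefined label set.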
What Makes This Tutorial Different
Most VLM content focuses on using pre-trained models for inference. This tutorial starts from annotation and goes through training. You control the vocabulary, the grounding targets, and the model behavior. No pre-trained checkpoints, no prompt engineering around someone else's model. Every step runs in the browser with no local GPU or code required.

