Tutorial

Building VLMs for Phrase Grounding with Datature Vi

https://www.youtube.com/embed/x0JEe0hz8ls

Vision-language models combine image understanding with text reasoning in a single architecture. Instead of just drawing a bounding box, a VLM can point to a region in an image and describe it in natural language. This tutorial shows you how to build one from scratch using Datature Vi, from raw data to working model.
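Concretely, a grounded prediction pairs a bounding box with a free-form phrase and a confidence score. A minimal sketch of what that output could look like (field names are illustrative, not Datature Vi's actual API):

```python
from dataclasses import dataclass

@dataclass
class GroundedRegion:
    # Bounding box in pixel coordinates: (x_min, y_min, x_max, y_max)
    box: tuple
    # Free-form phrase the model grounds to this region
    phrase: str
    # Model confidence for the phrase-region match, in [0, 1]
    score: float

prediction = GroundedRegion(
    box=(34, 112, 210, 365),
    phrase="a red ceramic vase on the left shelf",
    score=0.91,
)
```

Unlike a class label, the phrase is open vocabulary: the same model can ground descriptions it never saw as fixed categories during training.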

What This Tutorial Covers

  • What phrase grounding is and how it differs from standard object detection
  • How visual and textual inputs merge inside a VLM architecture
  • Annotating multimodal training data with text-region pairs
  • Setting up and running a VLM training workflow in Datature Vi
  • Running inference on new images and interpreting grounded outputs
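The fusion step in the list above can be sketched in miniature: image patches and text tokens are projected into a shared embedding space, and the text tokens attend over the patches to pick up visual context. A toy single-head cross-attention with random embeddings and no learned weights, purely illustrative of the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: 16 image patch tokens and 6 text tokens,
# both already projected to a shared dimension d = 32
d = 32
image_tokens = rng.standard_normal((16, d))
text_tokens = rng.standard_normal((6, d))

def cross_attention(queries, keys_values):
    """Each query (text token) attends over all keys (image patches)."""
    # Scaled dot-product scores: (num_queries, num_keys)
    scores = queries @ keys_values.T / np.sqrt(queries.shape[1])
    # Row-wise softmax over the image patches
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of patch embeddings: each text token gains visual context
    return weights @ keys_values

fused = cross_attention(text_tokens, image_tokens)
print(fused.shape)  # (6, 32)
```

Real VLMs add learned projection matrices, multiple heads, and stacked layers, but the core idea is the same: grounding emerges from text tokens selectively weighting image regions.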

When Phrase Grounding Matters

Standard detection gives you boxes and class labels. Phrase grounding gives you boxes tied to free-form text descriptions. That distinction matters when labels alone aren't enough: product catalogs where items need text descriptions tied to specific regions, medical scans where radiologists need findings linked to anatomical locations, or retail settings where visual search needs to match natural language queries to image content.
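One way to picture a text-region pair: each annotated region carries both a box and a free-form caption, rather than a single class ID. A hypothetical annotation record for a product-catalog image (field names are assumptions for illustration, not Datature Vi's export schema):

```python
import json

# Hypothetical text-region annotations for one image
annotation = {
    "image": "catalog_0042.jpg",
    "regions": [
        {"bbox": [120, 40, 380, 290],
         "caption": "blue denim jacket with brass buttons"},
        {"bbox": [400, 60, 560, 310],
         "caption": "white cotton t-shirt, crew neck"},
    ],
}
print(json.dumps(annotation, indent=2))
```

Swapping the captions for class IDs would collapse this back to standard detection; the free-form text is what lets the trained model answer natural language queries against image content.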

What Makes This Tutorial Different

Most VLM content focuses on using pre-trained models for inference. This tutorial starts from annotation and goes through training. You control the vocabulary, the grounding targets, and the model behavior. No pre-trained checkpoints, no prompt engineering around someone else's model. Every step runs in the browser with no local GPU or code required.


Resources

More reading:

  • Improving Your Computer Vision Models with Metadata (July 1, 2025): Improve model accuracy by adding metadata to your training pipeline. Learn how camera settings, timestamps, and sensor data boost CV predictions.
  • Class Imbalance in Computer Vision, Explained (June 6, 2025): Learn why class imbalance hurts model performance and how to fix it. Covers oversampling, weighted loss functions, focal loss, and augmentation strategies.
  • Upload DICOM and NIfTI Files to Datature Nexus (May 16, 2025): Upload DICOM and NIfTI medical imaging files to Datature Nexus. Prepare CT and MRI volumes for 3D annotation and segmentation model training.
Get Started Now

Get started with Datature’s computer vision platform now for free.