The learning rate controls how much a model's weights change in response to each gradient update during training. A large learning rate takes big steps on the loss surface (fast progress but risk of overshooting the minimum and diverging), while a small learning rate takes tiny steps (stable but slow, and prone to getting stuck in poor local minima). Getting the learning rate right is often the difference between a model that converges to a good solution and one that fails entirely.
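The effect of step size on convergence can be seen in a minimal sketch. This is a toy example, not any particular framework's API: plain gradient descent on a one-dimensional quadratic loss, with illustrative rates chosen to show the two failure/success regimes.

```python
# Toy gradient descent on loss(w) = w**2, whose gradient is 2*w.
# The learning rate scales each weight update.

def train(lr, steps=50, w=5.0):
    """Run `steps` gradient updates at rate `lr`; return the final weight."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad  # the update: step size is lr times the gradient
    return w

# A moderate rate shrinks w toward the minimum at 0...
good = train(lr=0.1)
# ...while too large a rate overshoots on every step and diverges.
bad = train(lr=1.1)
print(good, bad)
```

On this loss, each update multiplies the weight by (1 - 2·lr), so any rate above 1.0 flips the sign and grows the magnitude every step, which is the divergence described above.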
Most training runs don't use a fixed learning rate. Schedules adjust it over the course of training: step decay drops the rate by a factor at specific epoch milestones; cosine annealing smoothly decreases it along a cosine curve (the most popular schedule for modern detectors); warm restarts periodically reset the rate to re-explore the loss landscape; and the one-cycle policy ramps the rate up and then back down for fast convergence. Warmup (starting with a very small rate for the first few hundred steps) is standard for transformer training and fine-tuning, since it avoids destroying pre-trained weights early on.
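The warmup-plus-cosine combination can be written as a small function of the step index. The specific values here (base rate, 500 warmup steps, total step count) are illustrative assumptions, not recommendations:

```python
import math

def lr_at(step, total_steps=10_000, base_lr=1e-3, warmup_steps=500):
    """Linear warmup to base_lr, then cosine annealing down to zero.
    (Hypothetical schedule parameters, for illustration only.)"""
    if step < warmup_steps:
        # linear ramp from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining training steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice a framework scheduler (e.g. PyTorch's `torch.optim.lr_scheduler`) plays this role, but the shape is the same: a short ramp up, then a smooth cosine descent to near zero by the final step.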
Common starting points: 0.01-0.1 for SGD training from scratch, 0.0001-0.001 for Adam, and 1e-5 to 5e-5 for fine-tuning pre-trained models. Learning rate finders (popularized by fast.ai) sweep across rates to locate the point of steepest loss descent automatically, taking much of the guesswork out of the process.
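The idea behind a range test can be sketched with a toy loss. This is a simplified stand-in for the fast.ai finder: a real finder sweeps the rate across training batches of an actual model and inspects the loss curve, whereas here we just run a few steps at each candidate rate on a quadratic and keep the rate with the lowest final loss (a cruder criterion than the steepest-descent point).

```python
# Toy learning-rate range test on loss(w) = w**2 (gradient 2*w).
# Hypothetical candidate rates; divergent rates show up as large losses.

def range_test(lrs, steps=20, w0=5.0):
    """Run a few gradient steps at each candidate rate; return final losses."""
    losses = []
    for lr in lrs:
        w = w0
        for _ in range(steps):
            w -= lr * 2.0 * w
        losses.append(w * w)
    return losses

candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = range_test(candidates)
best = candidates[losses.index(min(losses))]
```

Rates that are too small barely move the loss, and rates that are too large oscillate or diverge, so the minimum of the sweep sits in the usable middle of the range.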
