Learning Rate

The learning rate controls how much a model's weights change in response to each gradient update during training. A large learning rate takes big steps on the loss surface (fast progress but risk of overshooting the minimum and diverging), while a small learning rate takes tiny steps (stable but slow, and prone to getting stuck in poor local minima). Getting the learning rate right is often the difference between a model that converges to a good solution and one that fails entirely.
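The update rule described above can be sketched in a few lines of plain Python. The names here (`sgd_step`, `w`, `grad`, `lr`) are illustrative, not from any framework:

```python
# Minimal sketch of one gradient-descent weight update: each weight moves
# against its gradient, scaled by the learning rate.
def sgd_step(w, grad, lr):
    return [wi - lr * gi for wi, gi in zip(w, grad)]

weights = [1.0, -2.0]
grads = [0.5, -0.5]
new_weights = sgd_step(weights, grads, lr=0.1)  # small lr -> small, stable steps
```

A larger `lr` scales the same gradients into bigger jumps, which is exactly the overshoot risk described above.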

Most training runs don't use a fixed learning rate. Schedules adjust it over the course of training: step decay drops the rate by a constant factor at specific epoch milestones, cosine annealing smoothly decreases it along a cosine curve (the most popular schedule for modern detectors), warm restarts periodically reset the rate to re-explore the loss landscape, and the one-cycle policy ramps the rate up and then back down for fast convergence. Warmup (starting with a very small rate for the first few hundred steps) is standard for transformer training and fine-tuning, since large early updates can destroy pre-trained weights.
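A warmup-then-cosine schedule like the one described above can be written as a small function of the step count. This is a sketch, not any library's API; the parameter names (`warmup_steps`, `base_lr`, `min_lr`) are our own:

```python
import math

def lr_at(step, total_steps, base_lr=0.01, warmup_steps=500, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine annealing."""
    if step < warmup_steps:
        # Linear ramp from near zero up to the base rate.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the rate peaks at `base_lr`, then the cosine term carries it smoothly down to `min_lr` by the final step.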

Common starting points: 0.01-0.1 for SGD training from scratch, 0.001-0.0001 for Adam, and 1e-5 to 5e-5 for fine-tuning pre-trained models. Learning rate finders (popularized by fast.ai) sweep across rates to find the steepest loss descent point automatically, taking much of the guesswork out of the process.
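The idea behind a learning rate finder can be illustrated with a toy loss. This sketch sweeps rates exponentially and picks the one where the loss fell fastest; it is a simplified stand-in for the fast.ai-style range test, not its actual implementation, and all names here are made up:

```python
def lr_range_test(lr_min=1e-5, lr_max=1.0, num_steps=50):
    """Sweep lr exponentially on a 1-D quadratic loss, return the rate
    that produced the steepest single-step loss drop."""
    w = 5.0  # start away from the optimum of loss(w) = w**2
    lrs, losses = [], []
    for i in range(num_steps):
        # Exponential sweep from lr_min up to lr_max.
        lr = lr_min * (lr_max / lr_min) ** (i / (num_steps - 1))
        grad = 2 * w           # d/dw of w**2
        w = w - lr * grad      # one SGD step at this rate
        lrs.append(lr)
        losses.append(w * w)
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    return lrs[drops.index(max(drops))]
```

In a real finder the loss comes from mini-batches of training data and the curve is inspected (or smoothed) before choosing a rate somewhat below the divergence point.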
