The learning rate controls how much a model's weights change in response to each gradient update during training. A large learning rate takes big steps on the loss surface (fast progress but risk of overshooting the minimum and diverging), while a small learning rate takes tiny steps (stable but slow, and prone to getting stuck in poor local minima). Getting the learning rate right is often the difference between a model that converges to a good solution and one that fails entirely.
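The effect of step size on convergence can be seen in a minimal sketch. This is a toy example, not any particular framework's API: plain gradient descent on a one-dimensional quadratic loss, with illustrative rates chosen to show the two failure/success regimes.

```python
# Toy gradient descent on loss(w) = w**2, whose gradient is 2*w.
# The learning rate scales each weight update.

def train(lr, steps=50, w=5.0):
    """Run `steps` gradient updates at rate `lr`; return the final weight."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad  # the update: step size is lr times the gradient
    return w

# A moderate rate shrinks w toward the minimum at 0...
good = train(lr=0.1)
# ...while too large a rate overshoots on every step and diverges.
bad = train(lr=1.1)
print(good, bad)
```

On this loss, each update multiplies the weight by (1 - 2·lr), so any rate above 1.0 flips the sign and grows the magnitude every step, which is the divergence described above.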
Most training runs don't use a fixed learning rate. Schedules adjust it over the course of training: step decay drops the rate by a factor at specific epoch milestones; cosine annealing smoothly decreases it along a cosine curve (the most popular schedule for modern detectors); warm restarts periodically reset the rate to re-explore the loss landscape; and the one-cycle policy ramps the rate up and then back down for fast convergence. Warmup (starting with a very small rate for the first few hundred steps) is standard for transformer training and fine-tuning, since it avoids destroying pre-trained weights early on.
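The warmup-plus-cosine combination can be written as a small function of the step index. The specific values here (base rate, 500 warmup steps, total step count) are illustrative assumptions, not recommendations:

```python
import math

def lr_at(step, total_steps=10_000, base_lr=1e-3, warmup_steps=500):
    """Linear warmup to base_lr, then cosine annealing down to zero.
    (Hypothetical schedule parameters, for illustration only.)"""
    if step < warmup_steps:
        # linear ramp from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining training steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In practice a framework scheduler (e.g. PyTorch's `torch.optim.lr_scheduler`) plays this role, but the shape is the same: a short ramp up, then a smooth cosine descent to near zero by the final step.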
Common starting points: 0.01-0.1 for SGD training from scratch, 0.0001-0.001 for Adam, and 1e-5 to 5e-5 for fine-tuning pre-trained models. Learning rate finders (popularized by fast.ai) sweep across rates to locate the point of steepest loss descent automatically, taking much of the guesswork out of the process.
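The idea behind a range test can be sketched with a toy loss. This is a simplified stand-in for the fast.ai finder: a real finder sweeps the rate across training batches of an actual model and inspects the loss curve, whereas here we just run a few steps at each candidate rate on a quadratic and keep the rate with the lowest final loss (a cruder criterion than the steepest-descent point).

```python
# Toy learning-rate range test on loss(w) = w**2 (gradient 2*w).
# Hypothetical candidate rates; divergent rates show up as large losses.

def range_test(lrs, steps=20, w0=5.0):
    """Run a few gradient steps at each candidate rate; return final losses."""
    losses = []
    for lr in lrs:
        w = w0
        for _ in range(steps):
            w -= lr * 2.0 * w
        losses.append(w * w)
    return losses

candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = range_test(candidates)
best = candidates[losses.index(min(losses))]
```

Rates that are too small barely move the loss, and rates that are too large oscillate or diverge, so the minimum of the sweep sits in the usable middle of the range.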
