Gradient Descent

Gradient descent is the optimization algorithm that neural networks use to learn. It works by computing how much the loss (prediction error) would change if each weight were nudged slightly, then updating all weights in the direction that reduces the loss. The gradient is just the vector of these partial derivatives, and "descent" means you move downhill on the loss surface.
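The update rule described above can be sketched in a few lines. This is a toy example (the quadratic loss and all values are illustrative, not from the article): we minimize f(w) = (w − 3)², whose derivative is 2(w − 3), by repeatedly stepping against the gradient.

```python
# Toy loss: f(w) = (w - 3)^2, with gradient f'(w) = 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0                 # initial weight
lr = 0.1                # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)   # "descent": move against the gradient

# w has converged very close to the minimum at w = 3
```

With a single weight the gradient is just the derivative; with millions of weights the same update is applied to every parameter at once, using the partial derivatives computed by backpropagation.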

In practice, computing the gradient over the entire dataset per update is too slow, so stochastic gradient descent (SGD) estimates it from a random mini-batch (often a few dozen to a few hundred samples). This introduces noise but allows much faster updates and helps escape shallow local minima. Modern optimizers build on SGD: Adam combines momentum (a running average of past gradients that smooths updates) with a per-parameter adaptive learning rate, making it a robust default across many tasks. AdamW decouples weight decay from the gradient update for better regularization. That said, SGD with momentum still matches or outperforms Adam on some image classification benchmarks and remains the optimizer of choice for training YOLO and ResNet from scratch.
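Mini-batching and momentum can be combined in a short sketch. The dataset, model (a single weight fitting y = 2x), and all hyperparameter values below are hypothetical, chosen only to show the mechanics: each step samples a random batch, estimates the gradient from it, folds that estimate into a velocity term, and applies the velocity as the update.

```python
import random

random.seed(0)
# Hypothetical noise-free dataset: y = 2x, fit by a single weight w.
data = [(x, 2.0 * x) for x in [i * 0.01 for i in range(100)]]

w, velocity = 0.0, 0.0
lr, momentum, batch_size = 0.05, 0.9, 16

for step in range(300):
    batch = random.sample(data, batch_size)      # random mini-batch
    # Gradient of mean squared error w.r.t. w: mean of 2 * (w*x - y) * x
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    velocity = momentum * velocity - lr * g      # smooth with past gradients
    w += velocity                                # momentum update

# w ends up close to the true slope of 2.0
```

Each batch gives a noisy gradient estimate, but the momentum term averages the noise out over steps, which is exactly why momentum pairs so well with mini-batch SGD.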

The learning rate controls step size and is the single most important hyperparameter. Too large and training diverges (loss explodes). Too small and training stalls or gets stuck. Learning rate schedulers (cosine decay, step decay, one-cycle) reduce the rate over training to balance exploration with convergence.
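Cosine decay, one of the schedulers mentioned above, is simple enough to write out directly. The function and values below are an illustrative sketch (most frameworks ship a built-in equivalent): the rate starts at `base_lr`, follows half a cosine wave, and lands on `min_lr` at the final step.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.001):
    """Cosine-decay schedule: base_lr at step 0, min_lr at the last step."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Early steps take large, exploratory updates; late steps take small,
# fine-tuning updates:
print(cosine_lr(0, 1000))     # ~0.1, the full base rate
print(cosine_lr(500, 1000))   # ~0.05, midway through the decay
print(cosine_lr(1000, 1000))  # ~0.001, the floor
```

The smooth taper is the point: large steps early for exploration, tiny steps late so the weights settle into a minimum instead of bouncing around it.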

Get Started Now

Get Started using Datature’s platform now for free.