Gradient Descent

Gradient descent is the optimization algorithm that neural networks use to learn. It works by computing how much the loss (prediction error) would change if each weight were nudged slightly, then updating all weights in the direction that reduces the loss. The gradient is just the vector of these partial derivatives, and "descent" means you move downhill on the loss surface.
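The update rule described above can be sketched in a few lines. This is a toy example (the quadratic loss and all values are illustrative, not from the article): we minimize f(w) = (w − 3)², whose derivative is 2(w − 3), by repeatedly stepping against the gradient.

```python
# Toy loss: f(w) = (w - 3)^2, with gradient f'(w) = 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0                 # initial weight
lr = 0.1                # learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)   # "descent": move against the gradient

# w has converged very close to the minimum at w = 3
```

With a single weight the gradient is just the derivative; with millions of weights the same update is applied to every parameter at once, using the partial derivatives computed by backpropagation.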

In practice, computing the gradient over the entire dataset per update is too slow, so stochastic gradient descent (SGD) estimates it from a random mini-batch (often a few dozen to a few hundred samples). This introduces noise but allows much faster updates and helps escape shallow local minima. Modern optimizers build on SGD: Adam combines momentum (a running average of past gradients that smooths updates) with a per-parameter adaptive learning rate, making it a robust default across many tasks. AdamW decouples weight decay from the gradient update for better regularization. That said, SGD with momentum still matches or outperforms Adam on some image classification benchmarks and remains the optimizer of choice for training YOLO and ResNet from scratch.
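Mini-batching and momentum can be combined in a short sketch. The dataset, model (a single weight fitting y = 2x), and all hyperparameter values below are hypothetical, chosen only to show the mechanics: each step samples a random batch, estimates the gradient from it, folds that estimate into a velocity term, and applies the velocity as the update.

```python
import random

random.seed(0)
# Hypothetical noise-free dataset: y = 2x, fit by a single weight w.
data = [(x, 2.0 * x) for x in [i * 0.01 for i in range(100)]]

w, velocity = 0.0, 0.0
lr, momentum, batch_size = 0.05, 0.9, 16

for step in range(300):
    batch = random.sample(data, batch_size)      # random mini-batch
    # Gradient of mean squared error w.r.t. w: mean of 2 * (w*x - y) * x
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    velocity = momentum * velocity - lr * g      # smooth with past gradients
    w += velocity                                # momentum update

# w ends up close to the true slope of 2.0
```

Each batch gives a noisy gradient estimate, but the momentum term averages the noise out over steps, which is exactly why momentum pairs so well with mini-batch SGD.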

The learning rate controls step size and is the single most important hyperparameter. Too large and training diverges (loss explodes). Too small and training stalls or gets stuck. Learning rate schedulers (cosine decay, step decay, one-cycle) reduce the rate over training to balance exploration with convergence.
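Cosine decay, one of the schedulers mentioned above, is simple enough to write out directly. The function and values below are an illustrative sketch (most frameworks ship a built-in equivalent): the rate starts at `base_lr`, follows half a cosine wave, and lands on `min_lr` at the final step.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.001):
    """Cosine-decay schedule: base_lr at step 0, min_lr at the last step."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Early steps take large, exploratory updates; late steps take small,
# fine-tuning updates:
print(cosine_lr(0, 1000))     # ~0.1, the full base rate
print(cosine_lr(500, 1000))   # ~0.05, midway through the decay
print(cosine_lr(1000, 1000))  # ~0.001, the floor
```

The smooth taper is the point: large steps early for exploration, tiny steps late so the weights settle into a minimum instead of bouncing around it.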

Get Started Now

Get Started using Datature’s platform now for free.