Gradient Descent Doesn't Know Where the Minimum Is: It Only Knows Which Way Is Down

gradient-descent optimization momentum stochastic-gradient-descent math

Backpropagation tells you the gradient: how much each weight should change to reduce the cost. Gradient Descent is the algorithm that actually uses that gradient to update the weights, and its entire job is to minimize the cost function using exactly two pieces of information: which direction reduces the cost, and how big a step to take in that direction. The step size is the learning rate. Set it too high and you overshoot the minimum, bouncing around instead of settling. Set it too low and convergence crawls.

The three variants differ only in how much data they look at before taking that step.

Batch Gradient Descent computes the gradient using the entire training dataset on every single iteration: $x_{new} = x_{old} - (\eta \times \text{slope of entire data})$ . This gives the most stable, accurate direction, but it's slow to converge and memory-hungry, since every step requires touching every row.

Stochastic Gradient Descent goes to the opposite extreme: one training example per iteration. $x_{new} = x_{old} - (\eta \times \text{slope of one row})$ . Convergence is noisier since each step is based on a single, possibly unrepresentative, data point, but each step is cheap and memory usage stays low.

Mini-Batch Gradient Descent splits the difference: a small batch of rows per iteration instead of one or all. It's the compromise that's actually used in practice almost everywhere, because it keeps the stability of batch descent without the memory cost.

Plain gradient descent has a real weakness: it can oscillate badly across steep, narrow valleys in the cost surface. Momentum fixes this by giving the update a memory of past movement, the same way a ball rolling downhill keeps some of its speed instead of resetting at every step:

v_{new} = \gamma \cdot v_{old} + \eta \cdot \nabla J(\theta) \qquad \theta_{new} = \theta_{old} - v_{new}

Nesterov Accelerated Gradient improves on this further by looking ahead before computing the gradient: first estimate where momentum would carry you ( $\theta_{lookahead} = \theta_{old} - \gamma v_{old}$ ), compute the gradient at that future point instead of the current one, then update velocity and parameters using that lookahead gradient. It converges faster than plain momentum precisely because it's reacting to where it's about to be, not where it already was. AdaGrad, Adam, and RMSProp push this idea one step further, adapting the learning rate per parameter instead of using one global $\eta$ , but that's a story for another day.