Backpropagation Through Time Is Just Backprop That Remembers Every Timestep
An RNN's hidden state at time depends on the hidden state at , which depends on , all the way back to the start of the sequence. That chain is what gives the network memory, and it's also exactly what makes training one harder than training a feedforward network. Backpropagation Through Time (BPTT) is ordinary backpropagation, but applied across an unfolded sequence instead of across static layers.
The forward equations are simple to state. At each timestep, the hidden state combines the current input with the previous hidden state, then applies an activation function:
The error at time compares the actual output against the desired one: .
Updating the weights is where BPTT earns its name. There are three weight matrices to adjust, and each one gets a different treatment because each one is used differently across the unfolded sequence. The output weight only affects the current timestep's output directly, so its gradient is a simple two-term chain rule: .
The hidden-state weight and input weight are different: both were reused at every single timestep leading up to the current one, so the error at time 3 has to account for their contribution at time 1, time 2, and time 3 all at once. That gives a summed chain rule instead of a single term:
with the identical structure for , just swapping in . Every term in that sum is itself a product of several derivatives chained across timesteps, since depends on , which depends on , and so on.
That's precisely the mechanism behind BPTT's two failure modes. Multiply enough numbers smaller than 1 together across a long sequence and the gradient shrinks toward zero before it reaches the early timesteps: the vanishing gradient problem, where the network effectively forgets anything more than a few steps back. Multiply enough numbers larger than 1 and the gradient blows up instead: exploding gradients, where weight updates become unstable. Both are consequences of the same summed, multiplied chain, not separate bugs, and both are the exact motivation for the gated architectures that came next.