GPT-2 from Scratch — Siddhesh More

The only way to really understand a transformer is to build one from nothing and watch it learn. This project follows Karpathy's curriculum exactly: micrograd first, then makemore, then GPT-2.

micrograd is a roughly 100-line autograd engine. Every Value node stores a scalar, its gradient, and a backward function that passes that gradient on to the node's inputs. Calling .backward() on the loss sorts the computation graph topologically, walks it in reverse, and accumulates gradients via the chain rule. Building this by hand makes backpropagation stop feeling like magic.
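The idea above can be sketched in a few dozen lines. This is a minimal illustration in the spirit of micrograd, not a copy of it or of this project's code: it supports only addition, multiplication, and tanh, and the example inputs x, w, and b are made up.

```python
import math


class Value:
    """A scalar that remembers how it was computed (micrograd-style sketch)."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # pushes self.grad into the inputs
        self._prev = set(children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topological sort so each node runs after everything that depends on it.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0  # d(loss)/d(loss)
        for v in reversed(topo):
            v._backward()


# A single tanh neuron with illustrative inputs: y = tanh(x * w + b)
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
y = (x * w + b).tanh()
y.backward()
```

Gradients are accumulated with += rather than assigned, which is what keeps the result correct when the same node feeds into the graph more than once.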

makemore is a character-level language model trained to generate names. The progression goes from a bigram model (a lookup table, no neural net), to an MLP with BatchNorm, to the Backprop Ninja session, where you derive every gradient in the backward pass by hand, including through BatchNorm, before touching PyTorch's autograd again.
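The bigram stage can be sketched as a plain table of counts. The five names below are a made-up stand-in for the real training set, and the helper names next_char_probs and sample_name are hypothetical; the point is only that the "model" is a normalized lookup table.

```python
import random
from collections import Counter, defaultdict

# Tiny illustrative dataset (an assumption; the real list of names is much larger).
names = ["emma", "olivia", "ava", "mia", "ella"]

# Count how often each character follows another; "." marks both start and end.
counts = defaultdict(Counter)
for name in names:
    chars = ["."] + list(name) + ["."]
    for prev, nxt in zip(chars, chars[1:]):
        counts[prev][nxt] += 1


def next_char_probs(prev):
    """Normalize one row of the count table into a probability distribution."""
    row = counts[prev]
    total = sum(row.values())
    return {ch: n / total for ch, n in row.items()}


def sample_name(rng, max_len=20):
    """Sample one character at a time until the end marker is drawn."""
    out, prev = [], "."
    while len(out) < max_len:
        probs = next_char_probs(prev)
        prev = rng.choices(list(probs), weights=list(probs.values()))[0]
        if prev == ".":
            break
        out.append(prev)
    return "".join(out)


rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample_name(rng) for _ in range(5)]
```

Because each character depends only on the one before it, the samples tend to look only loosely name-like, which is the motivation for moving on to the MLP.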

The GPT-2 build starts from CausalSelfAttention: Q/K/V linear projections, scaled dot-product attention (scores are scaled by 1/sqrt(d_k) to keep the softmax out of saturation), a causal mask so each token attends only to itself and earlier tokens, and multiple heads so different heads can specialize on different relationships. The attention module is then wrapped in Blocks with residual connections and LayerNorm, the Blocks are stacked, and the model is trained on TinyShakespeare.
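The core of the attention step can be sketched for a single head. This is a simplified illustration, not the project's CausalSelfAttention: it uses plain Python lists instead of tensors, assumes q, k, and v have already been produced by the linear projections, and uses made-up toy values.

```python
import math


def softmax(xs):
    """Numerically stable softmax; entries of -inf come out as exactly 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors. Returns (outputs, attention weights).
    """
    T, d_k = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d_k)
    out, weights = [], []
    for t in range(T):
        # Score query t against every key; future positions (s > t) get -inf.
        scores = [
            scale * sum(a * b for a, b in zip(q[t], k[s])) if s <= t else float("-inf")
            for s in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))])
    return out, weights


# Toy inputs: 3 positions, 2 dimensions each (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)
```

Setting masked scores to -inf before the softmax is the same trick a tensor implementation typically applies with a lower-triangular mask; multi-head attention runs this computation in parallel on separate slices of the projected vectors and concatenates the results.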

Current implementation milestones: the training loop is verified, W&B logs tokens/sec and loss, and an FSDP launch works with torchrun --nproc-per-node=2. In progress: a KV cache (storing the K and V tensors for past tokens so that each new token computes only its own attention row instead of recomputing attention over the full sequence), ONNX export, INT8 quantization, and ablation experiments on context length and number of attention heads.
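The KV cache idea can be sketched for one head and one layer. Since the cache is still in progress, this is only an illustration under stated assumptions: KVCache, attend, and full_recompute are hypothetical names, the vectors are plain lists rather than tensors, and q, k, and v are assumed to be already projected.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_t, keys, values):
    """One attention row: a single query against the given keys and values."""
    scale = 1.0 / math.sqrt(len(q_t))
    w = softmax([scale * sum(a * b for a, b in zip(q_t, k_s)) for k_s in keys])
    return [sum(wi * v_s[j] for wi, v_s in zip(w, values)) for j in range(len(values[0]))]


class KVCache:
    """Keeps the K and V vectors of all past tokens for one head (hypothetical helper)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q_t, k_t, v_t):
        # Append the new token's K and V, then compute only the new attention row.
        self.keys.append(k_t)
        self.values.append(v_t)
        return attend(q_t, self.keys, self.values)


def full_recompute(q, k, v):
    """Baseline without a cache: every row is recomputed from the whole prefix."""
    return [attend(q[t], k[: t + 1], v[: t + 1]) for t in range(len(q))]


# Toy sequence of 3 tokens with 2-dimensional vectors (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached = [cache.step(q_t, k_t, v_t) for q_t, k_t, v_t in zip(q, k, v)]
baseline = full_recompute(q, k, v)
```

Causality is implicit here: the cache only ever contains past and current tokens, so no mask is needed during generation. Each step does work proportional to the current sequence length, whereas recomputing every row at every step grows quadratically.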