LeNet Proved Deep Learning Works, One Layer at a Time

lenet cnn deep-learning architecture image-classification

Before CNNs were a default assumption, someone had to prove the idea worked end to end on a real task. LeNet did that: a small convolutional architecture built for digit recognition, and its structure is worth walking through because every later CNN is a variation on the same seven-layer idea. It applies hierarchical learning, simple patterns in early layers building up to complex ones in later layers, and it's simple and efficient enough to work well even on small datasets.

The input is a $32 \times 32$ grayscale image. Every convolutional and pooling layer that follows has its own trainable-parameter count, computed the same way throughout:

\text{Trainable Params} = \big((f_h \times f_w \times c_{in}) + 1\big) \times f_{num}

Layer C1 is a convolutional layer: 6 filters, each $5\times5$ , stride 1, no padding, taking the $32\times32\times1$ input down to a $28\times28\times6$ feature map with $5\times5\times1\times6 + 6 = 156$ trainable parameters. Layer S2 pools that down with a $2\times2$ filter, stride 2, giving $14\times14\times6$ with just 12 parameters, since pooling only learns a coefficient and bias per filter, not a full kernel. Layer C3 applies 16 new $5\times5$ filters across all 6 input channels, producing $10\times10\times16$ with $(5\times5\times6\times10)+16 = 1516$ parameters. Layer S4 pools again down to $5\times5\times16$ with 32 parameters.

At this point the spatial structure is small enough to flatten. Layer C5 is fully connected, taking the $5\times5\times16 = 400$ values down to 120 units, with $(400\times120)+120 = 48120$ parameters, the single largest jump in the whole network. Layer S6 is another fully connected layer, 120 down to 84 units. The output layer finally maps those 84 units to 10 classes, one per digit.

What jumps out laid end to end like this: the convolutional layers are cheap (156, then 1516 parameters) while the fully connected layers are enormous by comparison (48120, then thousands more). Convolution's weight sharing keeps early layers lean; it's only once the network flattens into dense connections that the parameter count explodes. That asymmetry is exactly why modern architectures push convolutional layers as deep as they can before ever reaching for a fully connected one.