Entry 21 of 24
ML Fundamentals Series

LeNet Proved Deep Learning Works, One Layer at a Time

Before CNNs were a default assumption, someone had to prove the idea worked end to end on a real task. LeNet did that with a small convolutional architecture built for handwritten-digit recognition. Its structure is worth walking through because most later CNNs are variations on the same seven-layer idea: hierarchical learning, where early layers pick up simple patterns and later layers combine them into complex ones. It is also simple and efficient enough to work well even on small datasets.

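The input is a $32 \times 32$ grayscale image. Every convolutional layer that follows has a trainable-parameter count computed the same way, from the filter height $f_h$, filter width $f_w$, number of input channels $c_{in}$, and number of filters $f_{num}$, with the $+1$ accounting for each filter's bias (pooling layers are simpler and are counted as they come up):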

$$\text{Trainable Params} = \big((f_h \times f_w \times c_{in}) + 1\big) \times f_{num}$$
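As a quick check on the formula, here is a minimal Python sketch. The function name `conv_params` and its argument names are illustrative choices that mirror the symbols above, not part of any library, and the function assumes every filter sees all input channels.

```python
def conv_params(f_h: int, f_w: int, c_in: int, f_num: int) -> int:
    """Trainable parameters of a conv layer whose filters see all c_in channels."""
    weights_per_filter = f_h * f_w * c_in
    return (weights_per_filter + 1) * f_num  # +1 is each filter's bias


# 6 filters of size 5x5 applied to a single-channel (grayscale) input
print(conv_params(f_h=5, f_w=5, c_in=1, f_num=6))  # 156
```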

Layer C1 is a convolutional layer: 6 filters, each $5\times5$, stride 1, no padding, taking the $32\times32\times1$ input down to a $28\times28\times6$ feature map with $5\times5\times1\times6 + 6 = 156$ trainable parameters. Layer S2 pools that down with a $2\times2$ window, stride 2, giving $14\times14\times6$ with just 12 parameters, since pooling here learns only one coefficient and one bias per feature map, not a full kernel. Layer C3 applies 16 new $5\times5$ filters, producing $10\times10\times16$. In the original design each C3 filter reads only a subset of the 6 input maps, for 60 input-to-output connections in total, which gives $(5\times5\times60)+16 = 1516$ parameters; connecting every filter to all 6 channels would instead give $(5\times5\times6+1)\times16 = 2416$. Layer S4 pools again down to $5\times5\times16$ with 32 parameters.
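The shapes and counts in this walkthrough can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are made up for illustration, the output-size rule is the standard `(n + 2p - f) / s + 1`, and the C3 count assumes the sparse connection scheme of the original LeNet-5, in which 6 filters read 3 input maps, 9 read 4, and 1 reads all 6.

```python
def out_size(n: int, f: int, stride: int = 1, padding: int = 0) -> int:
    """Side length after sliding an f x f window over an n x n input."""
    return (n + 2 * padding - f) // stride + 1


def pool_params(channels: int) -> int:
    """LeNet-style subsampling: one coefficient and one bias per feature map."""
    return 2 * channels


# Trace the side length through C1, S2, C3, S4.
n = 32
sizes = {}
for name, f, stride in [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2)]:
    n = out_size(n, f, stride)
    sizes[name] = n
print(sizes)  # {'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5}

print(pool_params(6), pool_params(16))  # 12 32

# C3: how many input maps each of the 16 filters is connected to
# (assumed sparse scheme: 6 filters x 3 maps, 9 filters x 4 maps, 1 filter x 6 maps).
fan_in = [3] * 6 + [4] * 9 + [6]
c3_sparse = sum(5 * 5 * k + 1 for k in fan_in)  # 60 connections * 25 weights + 16 biases
c3_full = (5 * 5 * 6 + 1) * 16                  # if every filter saw all 6 maps
print(c3_sparse, c3_full)  # 1516 2416
```

The gap between the two C3 figures is why a reimplementation that connects every filter to every channel will report 2416 rather than 1516 for this layer.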

At this point the spatial structure is small enough to flatten. Layer C5 acts as a fully connected layer (its $5\times5$ filters cover the entire $5\times5$ input), taking the $5\times5\times16 = 400$ values down to 120 units, with $(400\times120)+120 = 48120$ parameters, the single largest jump in the whole network. Layer F6 is another fully connected layer, 120 down to 84 units, with $(120\times84)+84 = 10164$ parameters. The output layer finally maps those 84 units to 10 classes, one per digit.
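The dense-layer counts follow the usual weights-plus-biases rule. Below is a minimal sketch with a hypothetical helper, `dense_params`. It treats the output layer as an ordinary dense layer with biases, which is an assumption made for illustration rather than something stated above.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair plus one bias per output unit."""
    return n_in * n_out + n_out


flat = 5 * 5 * 16  # 400 values after flattening the last pooled feature map
layers = [("400->120", flat, 120), ("120->84", 120, 84), ("84->10", 84, 10)]
counts = {name: dense_params(n_in, n_out) for name, n_in, n_out in layers}
print(counts)  # {'400->120': 48120, '120->84': 10164, '84->10': 850}
```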

What jumps out when the layers are laid end to end: the convolutional layers are cheap (156, then 1516 parameters) while the fully connected layers are enormous by comparison (48120, then 10164 for the 120-to-84 layer). Convolution's weight sharing keeps early layers lean; it's only once the network flattens into dense connections that the parameter count explodes. That asymmetry is one reason modern architectures push convolutional layers as deep as they can before ever reaching for a fully connected one.
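To put numbers on that asymmetry, the sketch below totals the two groups and also counts what the first layer would cost without weight sharing. The figures for the last two dense layers are computed as weights plus biases and assume a plain dense output layer; the variable names are illustrative.

```python
conv_and_pool = [156, 12, 1516, 32]            # C1, S2, C3, S4
dense = [48120, 120 * 84 + 84, 84 * 10 + 10]   # 400->120, 120->84, 84->10 (assumed dense output)

total = sum(conv_and_pool) + sum(dense)
dense_share = sum(dense) / total
print(sum(conv_and_pool), sum(dense), total)   # 1716 59134 60850
print(f"{dense_share:.1%} of parameters sit in the dense layers")

# Hypothetical comparison: a dense layer mapping the 32x32 input directly
# to the 28x28x6 values that C1 produces, with no weight sharing.
n_in, n_out = 32 * 32, 28 * 28 * 6
dense_c1 = n_in * n_out + n_out
print(dense_c1)  # 4821600, versus 156 for the convolutional C1
```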