A CNN Doesn't See an Image: It Sees a Stack of Filtered Patches
A fully connected network fed a raw image ignores the one thing that makes images images: spatial structure. A pixel's neighbors matter more than a pixel three rows away, but dense layers start out treating every pixel as equally related to every other pixel. Convolutional Neural Networks exist specifically to process data with grid-like topology, images being the obvious case, by respecting that local structure instead of flattening it away.
Four components do the work. Convolutional layers slide filters (kernels) across the input to detect features: edges, textures, shapes, whatever each filter has learned to respond to. Pooling layers downsample the spatial dimensions of the feature maps, which reduces the computation in later layers and the number of weights the fully connected layers at the end need; pooling itself has no learned parameters. Activation functions introduce non-linearity, the same role they play in any network. Fully connected layers sit at the end and turn the high-level features the earlier layers extracted into an actual prediction.
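To make the first three components concrete, here is a minimal pure-Python sketch of one filter, one activation, and one pooling step. The helper names (`conv2d`, `relu`, `max_pool2d`), the toy 6x6 image, and the hand-written edge kernel are all assumptions for illustration; in a real CNN the kernel values are learned, and a library would do this far more efficiently.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    This is cross-correlation, which is what deep learning libraries
    usually mean by "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Activation: keep positive responses, zero out negative ones."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Downsample by keeping the largest value in each size x size block."""
    out_h = len(fmap) // size
    out_w = len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image: a bright vertical stripe (1) on a dark background (0).
image = [[0, 0, 1, 1, 0, 0] for _ in range(6)]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 0, 1] for _ in range(3)]

features = conv2d(image, edge_kernel)   # 4x4 feature map
activated = relu(features)              # negative responses removed
pooled = max_pool2d(activated)          # 2x2 summary

print(features[0])   # [3, 3, -3, -3]
print(activated[0])  # [3, 3, 0, 0]
print(pooled)        # [[3, 0], [3, 0]]
```

The filter fires positively on the left edge of the stripe and negatively on the right edge; the activation discards the negative response, and pooling halves each spatial dimension while keeping the strongest signal in each block.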
The pipeline usually runs in the same order: an input image goes through a convolutional layer (followed by an activation), then a pooling layer, then (often) another conv/pool pair; the result is flattened and passed through fully connected layers, which produce the output. Each conv+pool pair extracts features at a higher level of abstraction than the one before it: early layers might respond to edges, later ones to shapes made of those edges, later still to parts of objects.
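One way to get a feel for that pipeline is to trace how the tensor shape changes at each stage. The sketch below is only bookkeeping, not a network: the 28x28 grayscale input, the two 3x3 conv blocks with 8 and 16 filters, the 2x2 pooling, and the 10 output classes are all assumed values chosen for illustration, and `trace_shapes` is a hypothetical helper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2):
    """Spatial size after non-overlapping pooling along one dimension."""
    return size // window


def trace_shapes(height, width, channels, blocks, num_classes):
    """Return (stage, shape) pairs for a conv/pool stack plus a dense head.

    blocks is a list of (kernel_size, num_filters) pairs, one per conv+pool pair.
    """
    shapes = [("input", (height, width, channels))]
    for kernel, filters in blocks:
        height = conv_output_size(height, kernel)
        width = conv_output_size(width, kernel)
        channels = filters  # one output channel per filter
        shapes.append(("conv", (height, width, channels)))
        height = pool_output_size(height)
        width = pool_output_size(width)
        shapes.append(("pool", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    shapes.append(("dense", (num_classes,)))
    return shapes


shapes = trace_shapes(28, 28, 1, blocks=[(3, 8), (3, 16)], num_classes=10)
for name, shape in shapes:
    print(f"{name:8s}{shape}")
```

Under these assumptions the spatial size shrinks from 28x28 to 5x5 while the channel count grows from 1 to 16, so the flattened vector handed to the fully connected layer has 400 values.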
Training a CNN follows the same loop as any supervised network, with one image-specific step at the front. Images are typically preprocessed and resized to a consistent shape: convolution and pooling can slide over inputs of any size, but the fully connected layers at the end expect a fixed number of inputs, and batching images together requires matching dimensions. After that it's the familiar sequence: a loss function measures how wrong the predictions are, backpropagation computes the gradient of that loss with respect to every weight, and an optimizer (some flavor of gradient descent) uses those gradients to update the weights. It is exactly the same chain-rule machinery as in any other network, just applied through convolutional layers instead of dense ones.
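The loop can be sketched at toy scale with a single 2x2 filter and hand-derived gradients. Everything here is an assumption made for illustration: the 4x4 image, the "true" kernel used to generate targets, mean squared error as the loss, plain gradient descent with a learning rate of 0.1, and the helper names. A real CNN would rely on a framework's automatic differentiation rather than a hand-written `kernel_gradient`.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


def mse(pred, target):
    """Loss function: mean squared error over the output feature map."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / n


def kernel_gradient(image, pred, target, kh, kw):
    """Backpropagation for this one layer: dLoss/dKernel via the chain rule.

    Each kernel weight touches every output position, so its gradient
    sums the error at each position times the input pixel it multiplied.
    """
    n = len(pred) * len(pred[0])
    return [
        [
            sum(2 * (pred[r][c] - target[r][c]) / n * image[r + i][c + j]
                for r in range(len(pred)) for c in range(len(pred[0])))
            for j in range(kw)
        ]
        for i in range(kh)
    ]


image = [
    [0.0, 0.5, 1.0, 0.5],
    [0.5, 1.0, 0.5, 0.0],
    [1.0, 0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5, 1.0],
]
true_kernel = [[1.0, 0.0], [0.0, -1.0]]  # assumed; used only to make targets
target = conv2d(image, true_kernel)

kernel = [[0.0, 0.0], [0.0, 0.0]]        # start from an uninformed filter
learning_rate = 0.1
losses = []
for step in range(200):
    pred = conv2d(image, kernel)                       # forward pass
    losses.append(mse(pred, target))                   # loss
    grad = kernel_gradient(image, pred, target, 2, 2)  # backpropagation
    kernel = [                                         # optimizer step
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(kernel, grad)
    ]

print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.6f}")
```

Note the division of labor: `kernel_gradient` only computes gradients, and the update line is the entire optimizer. Swapping in a different optimizer would change that one line, not the backpropagation.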
What makes CNNs work isn't a different learning algorithm. It's a different assumption baked into the architecture: that nearby pixels are related, and that the same filter useful in one part of an image is probably useful everywhere else in it.
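The practical payoff of that shared-filter assumption shows up in the parameter count. The comparison below is a rough sketch with assumed sizes (a 28x28 and a 224x224 grayscale image, a dense layer with 100 units, a conv layer with eight 3x3 filters); `dense_params` and `conv_params` are hypothetical helpers that just apply the standard weights-plus-biases counts.

```python
def dense_params(num_inputs, num_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return num_inputs * num_units + num_units


def conv_params(kernel_size, in_channels, num_filters):
    """Each filter has kernel_size^2 * in_channels weights plus one bias.

    The count does not depend on the image size, because the same
    filter is reused at every position.
    """
    return num_filters * (kernel_size * kernel_size * in_channels + 1)


small_dense = dense_params(28 * 28, 100)
large_dense = dense_params(224 * 224, 100)
conv = conv_params(kernel_size=3, in_channels=1, num_filters=8)

print(f"dense on 28x28:   {small_dense:,} parameters")
print(f"dense on 224x224: {large_dense:,} parameters")
print(f"conv, any size:   {conv:,} parameters")
```

With these numbers the dense layer needs 78,500 parameters for the small image and 5,017,700 for the large one, while the conv layer needs 80 in both cases.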