Every Neural Network Is Just Weighted Sums Stacked on Weighted Sums
Starting a new block: Neural Networks. An Artificial Neural Network (ANN) is a system loosely modeled on how the human brain processes information, and the core promise is that it learns from data instead of being explicitly programmed with rules. Strip away the biology metaphor and what's actually there is simpler than it sounds.
A typical ANN has three kinds of layers. The input layer receives the raw data. Hidden layers process it: this is where most of the computation happens, and there can be one or many. The output layer produces the final decision or prediction. Learning happens in two passes. In the forward pass, the network takes training data and makes predictions. In the backward pass, it measures how wrong those predictions were and uses backpropagation to work out how much each weight contributed to the error, then adjusts the weights to reduce it.
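To make the two passes concrete, here is a minimal sketch of one training step on a single neuron. The sigmoid activation, the halved squared-error loss, and the starting numbers are all assumptions picked for readability; nothing above commits to them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, inputs):
    """Forward pass: weighted sum of the inputs, then a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def loss(prediction, target):
    """Squared error, halved so the derivative is simply (prediction - target)."""
    return 0.5 * (prediction - target) ** 2

def train_step(weights, bias, inputs, target, learning_rate):
    """One forward pass plus one backward pass; returns updated (weights, bias)."""
    prediction = forward(weights, bias, inputs)
    # Backward pass: chain rule from the loss back to each parameter.
    dloss_dpred = prediction - target
    dpred_dz = prediction * (1.0 - prediction)  # derivative of the sigmoid
    dloss_dz = dloss_dpred * dpred_dz
    new_weights = [w - learning_rate * dloss_dz * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * dloss_dz
    return new_weights, new_bias

# Illustrative starting values (assumed), chosen only to keep the arithmetic readable.
weights, bias = [0.5, -0.3], 0.1
inputs, target = [1.0, 2.0], 1.0

loss_before = loss(forward(weights, bias, inputs), target)
weights, bias = train_step(weights, bias, inputs, target, learning_rate=0.1)
loss_after = loss(forward(weights, bias, inputs), target)
print(loss_before, loss_after)
```

With these numbers the weighted sum starts at roughly zero, so the neuron predicts about 0.5 against a target of 1.0. One step nudges every parameter in the direction that raises the prediction, and the loss comes out lower than before. A real network repeats this over many examples and many layers, but the mechanics are the same.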
The simplest topology is a Single Layer Feedforward Network: input connects directly to output through one set of weights, no hidden layer, no feedback. Data flows one direction only. It's essentially linear regression or logistic regression (or a perceptron, if the output is a hard threshold) wearing a neural network's clothing, and as a classifier it's limited to problems that are linearly separable, meaning a single straight line or hyperplane can split the classes.
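The linear-separability limit is easy to see in a small experiment. The sketch below assumes a hard-threshold output and the classic perceptron learning rule (one possible way to train a single-layer network, not the only one), and compares AND, which a straight line can split, with XOR, which it cannot.

```python
def predict(weights, bias, inputs):
    """Single-layer network: one weighted sum, then a hard threshold."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=50, learning_rate=1):
    """Perceptron learning rule; stops early once every sample is classified correctly."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:
            break
    return weights, bias

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def accuracy(samples):
    """Train on the samples, then score the trained weights on the same samples."""
    weights, bias = train_perceptron(samples)
    correct = sum(predict(weights, bias, x) == t for x, t in samples)
    return correct / len(samples)

and_accuracy = accuracy(AND)
xor_accuracy = accuracy(XOR)
print(and_accuracy, xor_accuracy)
```

AND ends up fully correct after a handful of epochs. XOR never does, no matter how many epochs you allow, because no single line puts (0, 1) and (1, 0) on one side and (0, 0) and (1, 1) on the other.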
A Multilayer Feedforward Network adds one or more hidden layers between input and output. Data still flows strictly forward, but now there's a nonlinear transformation in the middle: input signals hit the first hidden layer, each neuron computes a weighted sum and applies an activation function, that output becomes the input to the next layer, and so on until you reach the output. During training, the network then computes how wrong the output was and propagates that error backward to update every weight along the way. This is what makes tasks like image classification, medical diagnosis, and fraud detection tractable, since the classes in those problems are generally not linearly separable.
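Here is a sketch of that layer-by-layer forward pass on XOR, the standard example of a problem that is not linearly separable. The weights are hand-picked rather than learned, and the ReLU activation and the helper names (dense, xor_network) are assumptions made for this illustration.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One layer: each neuron takes a weighted sum of the inputs, then applies the activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hand-picked weights (not learned), assumed here purely for illustration.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [[1.0, -2.0]]
OUTPUT_BIASES = [0.0]

def xor_network(inputs):
    """Two inputs -> two hidden ReLU neurons -> one linear output."""
    hidden = dense(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    output = dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)
    return output[0]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), xor_network([a, b]))
```

The four inputs produce 0.0, 1.0, 1.0, 0.0. The second hidden neuron only switches on when both inputs are 1, and the output layer subtracts it twice, which is exactly the bend a single weighted sum cannot make. Remove the ReLU and the two layers collapse back into one linear map.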
Beyond feedforward, two other architectures matter enormously: Convolutional Networks, which process grid-structured data like images by sliding small filters across the input to extract features such as edges and textures, and Recurrent Networks, which handle sequential data like time series or text by feeding the hidden state from one step back in at the next step, giving the network something resembling memory. All three families get trained with variants of the same idea: gradient descent, whether that's plain stochastic gradient descent (SGD), RMSProp, or Adam. The architecture changes what the network can see. The training loop underneath barely does.
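Both ideas still reduce to weighted sums, which a toy sketch can show. Everything here is an assumed simplification: a one-dimensional signal instead of an image, a fixed two-tap filter instead of a learned one, and a single scalar recurrent unit with a tanh activation and made-up weights.

```python
import math

def conv1d(signal, kernel):
    """Slide one small filter across the signal (no padding, stride 1)."""
    width = len(kernel)
    return [
        sum(k * s for k, s in zip(kernel, signal[i:i + width]))
        for i in range(len(signal) - width + 1)
    ]

def run_recurrent(sequence, w_input=1.0, w_hidden=0.5, bias=0.0):
    """Scalar recurrent unit: each step mixes the new input with the previous hidden state."""
    hidden = 0.0
    states = []
    for x in sequence:
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# The filter [-1, 1] responds only where the signal steps up: an edge detector.
edges = conv1d([0, 0, 0, 1, 1, 1], kernel=[-1, 1])
print(edges)  # [0, 0, 1, 0, 0]

# Same final input (0.0) in both runs, but different histories.
with_pulse = run_recurrent([1.0, 0.0, 0.0])
silent = run_recurrent([0.0, 0.0, 0.0])
print(with_pulse[-1], silent[-1])
```

The convolution reuses the same two weights at every position, which is why it finds the edge wherever it occurs. The recurrent unit ends with a nonzero state after the pulse even though its last two inputs were zero, while the silent run stays at exactly zero: that leftover state is the "memory".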