Without Activation Functions, a Neural Network Is Just Linear Regression in Disguise

Stack ten linear layers on top of each other with no activation function between them and you still only have a linear model: a matrix multiplication composed with a matrix multiplication is still just a matrix multiplication, and adding biases doesn't change that, since $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$. Activation functions are what break that collapse. Applied to a neuron's weighted sum of inputs before the result is passed on, they introduce non-linearity, which is what lets a network draw curved decision boundaries and model patterns that aren't just straight lines through the data.
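The collapse is easy to check numerically. The sketch below is a minimal illustration, not anyone's production code: the weight matrices `W1` and `W2`, the input `x`, and the helper names are all made-up values chosen for the example, and biases are left out to keep it short.

```python
# Two stacked linear layers with no activation, versus one combined layer.
# All weights and inputs are hypothetical values chosen for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # layer 1: 2 inputs -> 3 units
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]     # layer 2: 3 units -> 2 outputs
x = [1.0, -2.0]

# Apply the two layers one after another, with nothing in between.
two_layers = matvec(W2, matvec(W1, x))

# Fold both layers into a single matrix and apply it once.
W_combined = matmul(W2, W1)
one_layer = matvec(W_combined, x)

# Same two layers, but with a ReLU between them.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers)   # [-5.0, 5.0]
print(one_layer)    # [-5.0, 5.0]  (identical: the extra layer added nothing)
print(with_relu)    # [1.0, 2.0]   (no single matrix reproduces this for every x)
```

The first two results match because the two layers are equivalent to the single matrix `W_combined`. Once the ReLU sits between them, no such single matrix exists.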

Sigmoid is the classic early choice. It is S-shaped, with output squashed between 0 and 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

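Good for producing something that reads as a probability. Bad because its gradient, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, peaks at 0.25 and flattens toward zero at both extremes. Backpropagation multiplies these small factors together layer after layer, and that is where the vanishing gradient problem starts.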
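To see the saturation in numbers, here is a small sketch using only Python's `math` module. The function names `sigmoid` and `sigmoid_grad` are illustrative, and the derivative uses the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

```python
import math

def sigmoid(x):
    """Logistic sigmoid. Naive form: fine for moderate x, but math.exp
    overflows for very large negative x (e.g. x = -1000)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at x = 0 and shrinks quickly on both sides.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")

# Upper bound on the product of ten sigmoid derivatives (weights ignored):
# each factor is at most 0.25, so the product is at most 0.25 ** 10.
print(0.25 ** 10)
```

Even in the best case, where every pre-activation sits at exactly 0, ten sigmoid layers scale the gradient by about one millionth before the weights are taken into account.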

Tanh fixes one issue: it's zero-centered, with output between -1 and 1 instead of 0 and 1:

$$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$$

Zero-centered outputs make optimization slightly better behaved, but tanh still saturates at the extremes the same way sigmoid does.
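As a quick sanity check, the sketch below compares the formula above against Python's built-in `math.tanh` and evaluates the derivative $1 - \tanh^2(x)$ at a few points. The helper names are illustrative only.

```python
import math

def tanh_from_formula(x):
    """tanh written as a rescaled sigmoid: 2 / (1 + e^(-2x)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-5.0, -0.5, 0.0, 0.5, 5.0):
    # The formula and the library function agree up to rounding error.
    assert abs(tanh_from_formula(x) - math.tanh(x)) < 1e-12
    print(f"x={x:5.1f}  tanh={math.tanh(x):+.5f}  gradient={tanh_grad(x):.5f}")
```

The gradient is 1 at the origin, four times sigmoid's peak of 0.25, but it still falls toward zero as the input moves away from zero in either direction.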

ReLU (Rectified Linear Unit) is the default in most modern networks: $f(x) = \max(0, x)$. Its output range is $[0, \infty)$, so it passes positive inputs through unchanged and zeroes out everything else. For positive inputs the gradient is exactly 1, so it does not saturate the way sigmoid and tanh do, and it is cheap to compute. Because only neurons that received a positive signal activate, the outputs are sparse, and that sparsity is a feature: fewer active neurons can mean faster, more efficient computation. The cost is the dying ReLU problem: a neuron whose weighted sum ends up negative for every input outputs zero, receives zero gradient, and stops learning.
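The dying ReLU problem can be reproduced with a single neuron. In this sketch the weights `w`, the bias `b`, and the `inputs` are hypothetical values picked so that the weighted sum is negative for every example. Defining the gradient at exactly 0 to be 0 is a common convention rather than the only option.

```python
def relu(x):
    """ReLU: pass positive values through, clamp everything else to 0."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 for x < 0.
    At x == 0 the derivative is undefined; 0 is used here by convention."""
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron whose bias has been pushed far into the negative range.
w, b = [0.5, -0.3], -10.0
inputs = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]

# Weighted sum (pre-activation) for each input example.
pre_activations = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in inputs]

outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # [0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0]
```

With a zero gradient on every example, gradient descent has nothing to update `w` and `b` with, so the neuron stays stuck at zero.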

Leaky ReLU patches this by allowing a small negative slope instead of a hard zero:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \leq 0 \end{cases}$$

With a small $\alpha$ (0.01 is a common choice), the gradient for negative inputs is $\alpha$ rather than zero. That keeps a gradient flowing even for negative inputs, so the neuron never fully dies.
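A minimal sketch of the piecewise definition above follows. The default `alpha=0.01` is an assumption for illustration (a commonly used value), not something fixed by the definition.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive x, small slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of leaky ReLU: 1 for x > 0, alpha for x <= 0
    (matching the x <= 0 branch of the piecewise definition)."""
    return 1.0 if x > 0 else alpha

for x in (-10.0, -1.0, 0.5, 3.0):
    print(f"x={x:5.1f}  output={leaky_relu(x):+.3f}  gradient={leaky_relu_grad(x)}")
```

For `x = -10.0` the output is a small negative number and the gradient is `alpha`, not zero, so a neuron in that state can still be updated.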

Softmax is different from the other four. It's not really about a single neuron's decision; it turns a whole layer's raw output scores into a probability distribution across classes, with each output between 0 and 1 and all of them summing to 1. It's most commonly used in the output layer of multiclass classifiers, the very last step before the network commits to an answer. Everything before it is about shaping signal inside the network; softmax is about turning that signal into a decision you can actually read.
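For reference, softmax maps a score vector $z$ to $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below is one common way to compute it in plain Python. The example `logits` are made-up scores, and subtracting the maximum score first is a standard numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Shift by the max so the largest exponent is exp(0) = 1; this avoids
    # overflow for large scores without changing the final probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]          # hypothetical raw scores for 3 classes
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 3) for p in probs])
print(sum(probs))
print(predicted)                   # 0 (the class with the largest score)
```

Softmax preserves the ordering of the scores, so the class with the highest raw score always gets the highest probability. What it adds is a normalized scale that makes the outputs comparable.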