Activations and softmax, the curves that bring a network to life

A deep network needs a little bend in it, or every layer just collapses back into one. Here we meet the small functions that add that bend, and then meet softmax, the trick that turns raw scores into honest probabilities. We build both up slowly, assuming you have never opened PyTorch.

First, a couple of words we will lean on

Before we dive in, let us agree on a tiny bit of vocabulary so nothing trips you up later. A neuron is just a small unit inside a network that holds a number. A layer is a row of these neurons working together. A linear layer takes the numbers coming in, multiplies them by some weights, adds them up, and passes the result along. That is it. If you can picture multiplying and adding, you already understand the most common building block in a neural network.

We are going to meet two ideas in this article. The first is the activation, a small function that adds a gentle bend to those numbers. The second is softmax, a function that turns a list of raw scores into a set of probabilities. Both are short, both are everywhere inside a language model, and both become obvious once you see them move. If any of this feels abstract at first, that is completely normal. Stick with it and the interactive plots will make it concrete.

Why a network needs a curve in it

Here is a surprising fact that motivates the whole topic. If you stack two linear layers directly on top of each other, with nothing in between, you do not get something more powerful. You get one linear layer. The math just folds the two together. You could add ten of them, or a hundred, and they would still collapse down to a single straight-line transformation. All that depth, wasted.
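
To see the collapse with your own eyes, here is a minimal sketch in plain Python (no PyTorch needed). The weights and the helper names `linear` and `matmul` are made up for illustration. It pushes an input through two stacked linear layers, then builds a single layer from the same weights and checks that both give the same answer.

```python
# Plain-Python sketch: two linear layers with nothing between them equal one linear layer.
# All weights below are made-up illustrative numbers.

def linear(x, W, b):
    """Compute W @ x + b for a vector x, a matrix W (list of rows), and a bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, 0.0, -1.0]   # 2 inputs -> 3 hidden
W2, b2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]], [1.0, 2.0]           # 3 hidden -> 2 outputs

x = [2.0, -3.0]
stacked = linear(linear(x, W1, b1), W2, b2)   # layer 1, then layer 2

# Fold the two layers into one: W = W2 @ W1 and b = W2 @ b1 + b2.
W = matmul(W2, W1)
b = linear(b1, W2, b2)
folded = linear(x, W, b)                      # a single layer

print(stacked, folded)   # [1.5, 8.5] [1.5, 8.5]
```

The two results match for any input you try, which is the whole problem: the second layer added nothing that a single layer could not already do.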

The fix is to slip a small nonlinear function between the layers. Nonlinear simply means “not a straight line”, so the function is allowed to curve. This little function is the activation, and it is applied to each number on its own (one value in, one value out). That curve is what stops the layers from folding together, and it is what lets a deep network represent rich shapes: curves, switches that turn on and off, and interactions between inputs. Without it, depth means nothing. With it, depth means everything.

A few activations you will see by name. ReLU is the simplest: it keeps positive numbers as they are and flattens every negative number to zero, making a sharp hinge at zero. GELU and SiLU (also called Swish) do something similar but smoother. They are nearly a straight line for large positive inputs, and they gently squeeze values toward zero when the input is negative, letting just a little negative signal slip through instead of chopping it off hard. That smoothness is one reason they are the default inside modern transformers.
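
The three activations above are short enough to write by hand. This is a sketch in plain Python using the standard formulas (ReLU as max(0, x), SiLU as x times sigmoid(x), and the exact erf form of GELU). It is meant for reading the numbers, not as a replacement for a library implementation.

```python
import math

def relu(x):
    # Keep positives, flatten negatives to zero.
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    # Also known as Swish: the input scaled by its own sigmoid.
    return x * sigmoid(x)

def gelu(x):
    # Exact form: x times the standard normal CDF, written with erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.1f}  relu={relu(x):.3f}  gelu={gelu(x):.3f}  silu={silu(x):.3f}")
```

Look at the row for x = -1.0: ReLU gives exactly zero, while GELU and SiLU give small negative numbers. At x = 3.0 all three are close to 3, which is the "nearly a straight line" behaviour described above.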

Here is something you can play with: pick an activation below and watch its curve appear, then read its formula right off the plot. Notice where each one bends and how ReLU’s hard corner compares to the soft, rounded shape of GELU and SiLU.

[Interactive widget: activation curve explorer]

GELU and SiLU let a little negative signal through, unlike ReLU, which slams everything negative to a hard zero. Because their slope is not exactly zero on the negative side, a neuron whose input dips below zero still receives some training signal, and that helps the network keep learning. One popular variant called SwiGLU, used in LLaMA and Mistral, multiplies a SiLU-gated branch by a plain linear branch to give the network an extra dial to turn.
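
Here is a rough sketch of the SwiGLU idea in plain Python. The names `W_gate` and `W_up` and all the numbers are assumptions chosen for illustration, and biases are left out to keep it short. Real models typically follow this gate with one more linear layer, which is not shown here.

```python
import math

def silu(v):
    # x * sigmoid(x), written as a single fraction.
    return v / (1.0 + math.exp(-v))

def linear(x, W):
    """Compute W @ x for a vector x and a matrix W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_up):
    gate = [silu(g) for g in linear(x, W_gate)]   # the SiLU-gated branch
    up = linear(x, W_up)                          # the plain linear branch
    return [g * u for g, u in zip(gate, up)]      # multiply them element by element

# Hypothetical weights, picked so the arithmetic is easy to follow.
W_gate = [[1.0, 0.0], [0.0, 1.0]]
W_up = [[2.0, 0.0], [0.0, 2.0]]

out = swiglu([0.0, 3.0], W_gate, W_up)
print(out)
```

The first output is zero because its gate is zero, so the gate acts like a soft switch on the linear branch. That switch is the extra dial.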

Softmax, turning scores into probabilities

Often a network finishes by producing a list of raw scores, one score per option it is choosing between. These raw, unprocessed scores have a name: we call them logits. The trouble is that logits can be any numbers at all, positive or negative, large or small, and they do not add up to anything meaningful. We usually want probabilities instead: a set of values that are all positive and that sum to exactly 1, so they can be read as “how likely is each option”.

Softmax is the function that does this conversion. The formula is softmax(x)ᵢ = e^(xᵢ) / Σⱼ e^(xⱼ), and let us unpack it gently rather than rush past it. The symbol e is just a fixed number (about 2.718), and raising it to a power, written eˣ, is the exponential function. Two things to notice. First, eˣ is always positive, no matter what x is, which guarantees our probabilities are never negative. Second, the bottom of the fraction (the symbol Σ means “add them all up”, here over every logit in the list) divides each value by the total, which forces everything to sum to 1. Larger logits end up with larger probabilities, smaller ones with smaller probabilities, and they all share the pie.
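
The formula is only three steps, so it helps to write it out once by hand. This is a bare-bones sketch in plain Python that follows the formula literally. It deliberately skips the safety measures a library version would include.

```python
import math

def softmax(logits):
    exps = [math.exp(v) for v in logits]   # step 1: exponentiate, so everything is positive
    total = sum(exps)                      # step 2: add them all up
    return [e / total for e in exps]       # step 3: divide each by the total

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # [0.659, 0.242, 0.099]
print(sum(probs))                     # 1, up to floating-point rounding
```

The biggest logit (2.0) gets the biggest share, but the smaller ones still get something. Nothing is thrown away, it is only rescaled.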

Two details matter enormously for language models, so let us slow down for both.

The first is numerical stability. If a logit is large, say 1000, then eˣ is an astronomically big number, far too big for a computer to store, and it overflows to inf (the computer’s word for “infinity, gave up”). The clever fix is that subtracting the same value from every logit before the exponential does not change the answer at all, because it cancels out top and bottom. PyTorch quietly subtracts the largest logit in the row first, which keeps every exponential small and safe. So softmax([1000, 1001, 1002]) works perfectly even though e¹⁰⁰² on its own would explode.
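
The subtraction trick can be sketched in a few lines of plain Python. One difference worth flagging: PyTorch's `exp` overflows to inf, while Python's `math.exp` raises an `OverflowError` instead, so the naive version below fails loudly rather than quietly. The function names are made up for this example.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(v) for v in logits]       # blows up for huge logits
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    m = max(logits)                            # the largest logit in the row
    exps = [math.exp(v - m) for v in logits]   # every exponent is now <= 0, so exp <= 1
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(big)
    naive_failed = False
except OverflowError:
    naive_failed = True

print(naive_failed)                                            # True
print([round(p, 3) for p in stable_softmax(big)])              # [0.09, 0.245, 0.665]
print([round(p, 3) for p in stable_softmax([0.0, 1.0, 2.0])])  # [0.09, 0.245, 0.665]
```

The last two lines print the same probabilities. Shifting every logit by the same amount changes nothing, which is exactly why the trick is safe.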

The second is temperature. Before applying softmax, we can divide every logit by a positive number T that we call the temperature. Setting T to 1 leaves the logits untouched. Low temperature makes the biggest logit dominate, so the distribution becomes sharp and pointy, leaning almost entirely on the single top option (this is close to just picking the maximum, which people call greedy decoding). High temperature flattens the distribution, spreading the probability out more evenly across the options. This one knob is exactly how text generation is tuned between safe-and-predictable and wild-and-creative.
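
Temperature is a one-line change to softmax, as this plain-Python sketch suggests. The function name and the three temperatures are picked only for illustration.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    scaled = [v / T for v in logits]            # the only new step: divide by T
    m = max(scaled)                             # subtract the max for stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in [0.1, 1.0, 10.0]:
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

At T = 0.1 nearly all the probability lands on the first option. At T = 10 the three options come out close to an even split. At T = 1 you get plain softmax back.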

Here is something to try: drag the temperature toward zero and watch the bars collapse onto the single tallest logit, then push it up and watch them spread out and even out. Notice that a logit of negative infinity always maps to exactly zero, which turns out to be how a model “blocks” an option entirely.

[Interactive widget: softmax bars with a temperature slider]

Drag temperature toward 0 and the distribution collapses onto the top logit. Push it up and the probability spreads out toward an even split. This is the very same knob that sampling uses when a model generates text (topic 16), and a logit of negative infinity always becomes exactly 0.

A few traps worth keeping in your pocket

Each idea here has a sharp edge that catches almost everyone the first time. Here are the ones worth remembering before you meet them in the wild.

[Interactive widget: common traps]

Where this shows up in a real language model

None of this is a toy you will outgrow. Inside a real transformer, the part that does most of the per-position thinking is a small stack called the MLP block, and its basic shape is Linear → activation → Linear (using GELU, or a SwiGLU gate like the one mentioned above). The attention mechanism, which lets each word look at other words, uses softmax to decide how much weight to give each one. When the model measures how wrong it is during training, it uses log_softmax, which is softmax followed by a logarithm, computed in one careful pass for stability. And when the model generates text, it divides the logits by a temperature before softmax, just as we saw. One small thing ties it all together: the axis we normalize over is almost always written dim=-1, which is PyTorch’s way of saying “the last dimension”, the row of scores.
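
Two of these pieces fit in one small plain-Python sketch: an MLP block shaped Linear → GELU → Linear, and a log_softmax computed in one pass. The weights, the helper names, and the two-row batch are all invented for illustration. Normalizing each row separately, as the loop does, is what dim=-1 asks PyTorch to do.

```python
import math

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp_block(x, W_in, b_in, W_out, b_out):
    hidden = [gelu(h) for h in linear(x, W_in, b_in)]   # Linear, then activation
    return linear(hidden, W_out, b_out)                 # then Linear again

def log_softmax(logits):
    # log(softmax(x)) in one pass: x minus the log of the summed exponentials.
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]

# Hypothetical weights: 2 inputs -> 3 hidden -> 2 outputs.
W_in, b_in = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]], [0.0, 0.0, 0.0]
W_out, b_out = [[1.0, 1.0, 1.0], [0.0, 2.0, -1.0]], [0.0, 0.0]

batch = [[1.0, 2.0], [3.0, -1.0]]
for row in batch:                        # each row is normalized on its own
    log_probs = log_softmax(mlp_block(row, W_in, b_in, W_out, b_out))
    print([round(lp, 3) for lp in log_probs], sum(math.exp(lp) for lp in log_probs))

# Taking log(softmax(x)) in two steps would hit log(0) here; the one-pass form does not.
print(log_softmax([0.0, -1000.0]))   # [0.0, -1000.0]
```

Each row's probabilities sum to 1 (up to rounding) once you exponentiate the log-probabilities. The last line shows why the one-pass version matters: the second probability is too small to store, yet its logarithm comes out fine.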

The whole thing, runnable

If you have PyTorch installed, you can paste this into a file and run it. The comments walk through what each line is doing, so read them as you go and do not worry about memorizing anything.

import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4)        # a block of random numbers to feed the activations

F.relu(x)                       # keeps positives, flattens negatives to 0:  max(0, x)
F.gelu(x)                       # smooth version; exact (erf) or approximate='tanh'
F.silu(x)                       # x * sigmoid(x), also known as Swish

logits = torch.tensor([[2.0, 1.0, 0.1]])  # three raw scores for three options
p = F.softmax(logits, dim=-1)   # turns them into probabilities [0.659, 0.242, 0.099] that sum to 1
print(p)                        # show the probabilities when this runs as a script

big = torch.tensor([[1000., 1001., 1002.]])    # deliberately huge logits
print(torch.isinf(big.exp()).any())          # True: exponentiating these directly overflows to inf
print(torch.isnan(F.softmax(big, -1)).any()) # False: softmax stays safe by subtracting the row max first

masked = torch.tensor([[2.0, float('-inf'), 0.1]])  # the middle option is blocked
print(F.softmax(masked, -1))    # [0.87, 0.0, 0.13]: a logit of -inf becomes exactly 0
