Dropout, and why we switch neurons off on purpose
When a network leans too hard on a few neurons, it memorises instead of truly learning. Dropout gently breaks that habit by hiding random neurons while the model trains, and one small rescaling trick keeps everything in balance. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
First, the problem we are trying to solve
Picture studying for an exam with a study group, where one friend always knows the answer. If you lean on that friend for everything, you will look brilliant in the group and then freeze the moment you sit the exam alone. Neural networks fall into the exact same trap. While they train, they sometimes lean too hard on a small handful of neurons, so they end up memorising the training examples instead of learning the general pattern underneath. We call this overfitting, and it is one of the most common reasons a model scores beautifully on data it has already seen and then stumbles on anything new.
Before we go further, two words that will keep coming up. A neuron is just a tiny unit inside the network that holds a number. The number it produces is called an activation. That is all an activation is: a single number flowing forward through the network. You do not need any more than that to follow along.
Dropout is a wonderfully simple cure for overfitting, and once it clicks you will reach for it often. While the network trains, we randomly switch off some of its neurons on every single step. Because no neuron can count on being there next time, the network cannot afford to rely on any one of them, so it learns to spread its understanding across many neurons at once. The result is a model that leans on the whole group rather than one star student.
Drop a few, then rescale the rest
Here is the actual mechanism, and it is short. The layer Dropout(p) walks through each activation and, independently of all the others, sets it to zero with probability p. We say that neuron was dropped for this step. Every activation that survives then gets divided by 1 - p.
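To make the two steps concrete, here is a minimal sketch in plain Python with no PyTorch at all. The function name inverted_dropout and its arguments are made up for this illustration; this is not how the library implements it internally, only the same rule written out by hand.

```python
import random

def inverted_dropout(activations, p, rng):
    """Zero each activation with probability p; divide survivors by (1 - p)."""
    # This sketch assumes p < 1 so the division below is always defined.
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1) for this sketch")
    keep_scale = 1.0 / (1.0 - p)
    out = []
    for a in activations:
        is_dropped = rng.random() < p  # True with probability p
        out.append(0.0 if is_dropped else a * keep_scale)
    return out

rng = random.Random(0)  # fixed seed so reruns give the same mask
activations = [1.0, 2.0, 3.0, 4.0]
result = inverted_dropout(activations, p=0.5, rng=rng)
print(result)  # each entry is either 0.0 or twice the original value
```

Each neuron gets its own independent coin flip, so the number of zeros changes from call to call. Only the probability is fixed.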
That little division is the clever part, and it even has a name: inverted dropout. The reason we do it is worth slowing down for. Suppose we drop half the neurons (p = 0.5). Without any correction, only about half the signal would reach the next layer, so that layer would suddenly receive roughly half the magnitude it expected. By dividing the survivors by 1 - p (which is 0.5 here, so we multiply them by 2), we top the signal back up. On any single step the total still wobbles with the random mask, but averaged over many steps it matches the value with dropout switched off, which means the rest of the network barely notices the difference.
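The same argument fits in one line of arithmetic. The symbols x (an activation before dropout) and y (the same activation after dropout) are introduced here only for illustration. The activation survives with probability 1 - p and is zeroed with probability p, so its expected value is:

```latex
\mathbb{E}[y] \;=\; (1 - p)\cdot\frac{x}{1 - p} \;+\; p \cdot 0 \;=\; x
```

With p = 0.5 and x = 1, that reads 0.5 times 2 plus 0.5 times 0, which is 1.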
There is one more piece, and it is the part beginners most often forget. When you are done training and you actually want to use the model, you put it in eval mode, and in eval mode dropout does nothing at all. No neurons are zeroed, nothing is rescaled, and the output is completely deterministic. Training is the noisy, playful phase. Evaluation is calm and predictable.
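The train and eval behaviour can be mimicked with a tiny hand-written class. TinyDropout is a hypothetical stand-in written for this article, not a PyTorch class. It only borrows the idea of a training flag that decides whether the random mask is applied.

```python
import random

class TinyDropout:
    """Hypothetical stand-in for a dropout layer, for illustration only."""

    def __init__(self, p, seed=0):
        self.p = p                      # this sketch assumes 0 <= p < 1
        self.training = True            # start in training mode
        self._rng = random.Random(seed)

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, activations):
        if not self.training:
            return list(activations)    # eval: pass everything through unchanged
        scale = 1.0 / (1.0 - self.p)
        return [0.0 if self._rng.random() < self.p else a * scale
                for a in activations]

layer = TinyDropout(p=0.5)
x = [1.0] * 8

layer.train()
noisy_a = layer(x)   # one random mask
noisy_b = layer(x)   # a fresh random mask, usually different from the first

layer.eval()
calm_a = layer(x)    # identical to x
calm_b = layer(x)    # identical to x again

print(noisy_a)
print(noisy_b)
print(calm_a == x and calm_b == x)
```

Forgetting the eval() call is the mistake to watch for: the model would keep dropping neurons at prediction time and give a different answer on every run.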
Below is a small vector of activations you can play with directly. Drag the p slider to change how many neurons get dropped, press resample to roll a new random mask, and toggle between train and eval. Watch the bars disappear, the survivors grow taller, and the average stay close to where it started.
More dropout means more zeros, but the survivors grow taller to compensate. On average the total stays close to the original, which is exactly why you do not have to change anything at test time except switching to eval mode.
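If the interactive demo is not available where you are reading this, the same experiment can be approximated in a few lines of plain Python. The helper dropout_once and the choice of 10,000 activations are assumptions made for this sketch. The large count is there only so that the averages settle down.

```python
import random

def dropout_once(values, p, rng):
    """Apply one random inverted-dropout mask (sketch; assumes p < 1)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)
values = [1.0] * 10_000   # many activations, all equal to 1.0

zero_fractions = {}
means = {}
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    out = dropout_once(values, p, rng)
    zero_fractions[p] = sum(1 for v in out if v == 0.0) / len(out)
    means[p] = sum(out) / len(out)
    survivor_height = 1.0 / (1.0 - p)   # how tall each surviving bar becomes
    print(f"p={p:.1f}  zeroed={zero_fractions[p]:.2f}  "
          f"survivor={survivor_height:.2f}  mean={means[p]:.3f}")
```

As p grows, the zeroed column climbs and the survivors get taller, while the mean column hovers near 1.0. It wobbles more at high p because fewer, larger survivors carry the whole average.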
A few traps worth remembering
Every idea here has a sharp edge that catches people the first time. These are the ones to keep in your pocket.
Where this shows up in a real language model
This is not a toy idea you will outgrow. In a real transformer, dropout sits after the attention weights, after the residual projections, and inside the feed forward block. It is controlled by a flag called .training that every module carries, which is why calling model.train() switches dropout on throughout the model and model.eval() switches it off for generation and for measuring loss. One surprising detail: many of the largest pretraining runs actually set p = 0, because when you have an enormous amount of data, overfitting is much less of a worry and the regularisation can even hurt. The layers stay in the model regardless, quietly doing nothing until you ask them to.
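The flag is easy to inspect for yourself. The sketch below assumes PyTorch is installed and uses a toy nn.Sequential stack rather than a real transformer. The layer sizes and p = 0.1 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# A toy stack, not a real transformer: just enough layers to show the flag.
model = nn.Sequential(
    nn.Linear(4, 4),
    nn.Dropout(p=0.1),
    nn.Linear(4, 4),
)

model.train()   # sets .training = True on the model and every submodule
train_flags = [m.training for m in model.modules()]

model.eval()    # sets .training = False on the model and every submodule
eval_flags = [m.training for m in model.modules()]

print(train_flags)
print(eval_flags)

# With p = 0 the layer stays in the model but changes nothing, even in training mode.
idle = nn.Dropout(p=0.0)
idle.train()
x = torch.randn(3, 4)
print(torch.equal(idle(x), x))
```

Because train() and eval() walk through every submodule, one call at the top of the model is enough. You do not need to visit each dropout layer yourself.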
The whole thing, runnable
If you have PyTorch installed, you can paste this into a file and run it. Read the comments as you go; they walk through exactly what each line does.
import torch
import torch.nn as nn
torch.manual_seed(0) # makes the random dropout repeatable so your numbers match mine
drop = nn.Dropout(p=0.5) # a dropout layer that zeros each value with probability 0.5
x = torch.ones(10) # ten activations, all equal to 1.0, to keep things easy to read
drop.train() # turn ON dropout (training mode)
y = drop(x)
print(y) # about half are 0, the rest are 2.0, because 1 / (1 - 0.5) = 2
print(y.mean()) # near 1.0 on average; with only ten values a single run can land on, say, 0.8 or 1.2
drop.eval() # turn OFF dropout (evaluation mode)
print(torch.equal(drop(x), x)) # True, because in eval mode dropout is a pure pass through