Optimizers, and how a model actually learns with AdamW
A model learns by nudging its numbers in the right direction, over and over. The optimizer is the part that decides how big each nudge should be, and AdamW is the one almost everyone reaches for. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
First, what an optimizer even does
A neural network is, at heart, a big pile of numbers. Each of those adjustable numbers is called a parameter (you will also hear people say weight, which means the same thing for our purposes). When we say a model is “learning”, what we really mean is that these numbers are slowly being changed until the model gets better at its job. The optimizer is the piece that does the changing.
Two more words before we go on, because everything below leans on them. The loss is a single number that measures how wrong the model currently is, where smaller is better. A gradient is a number that tells you, for one parameter, which way the loss goes up and how steeply. Think of it as an arrow pointing uphill, so moving a parameter against its gradient makes the model a little less wrong. So the whole game is: compute the loss, look at the gradients (the uphill arrows), and take a step the opposite way, downhill. If that feels abstract right now, that is completely normal. It will get concrete fast.
The simplest possible optimizer just walks straight downhill: take each parameter, subtract a little bit of its gradient, repeat. That little bit is scaled by the learning rate, a knob you choose that controls how big each step is. This plain approach works, but it is slow and it gets fooled easily. AdamW is the grown-up version, and it is what real language models are trained with. Let us build up to it.
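To make that concrete, here is a minimal sketch of plain downhill stepping in ordinary Python, no PyTorch needed. It borrows the bowl-shaped loss ½(w − 3)² that this article uses further down, whose gradient is simply w − 3. The function names, the starting point of 0, and the learning rate of 0.1 are all choices made for this illustration.

```python
def bowl_grad(w):
    """Gradient of the bowl loss 0.5 * (w - 3)**2 with respect to w."""
    return w - 3.0


def plain_step(w, grad, lr):
    """Simplest possible optimizer: subtract a little bit of the gradient."""
    return w - lr * grad


w = 0.0  # starting guess (an arbitrary choice for this sketch)
for _ in range(50):
    w = plain_step(w, bowl_grad(w), lr=0.1)

print(w)  # ends up close to 3, the bottom of the bowl
```

With these settings each step covers a tenth of the remaining distance, so progress slows down as the parameter gets closer to the bottom.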
Adding memory and a sense of scale
Plain downhill stepping treats every step as if the past never happened. AdamW does something smarter: it remembers.
Adam (the name comes from “adaptive moment estimation”) keeps two running averages for each parameter, and they update a little on every step. The first is m, a smoothed average of the recent gradients. This is momentum, and the analogy is exactly what it sounds like: a ball rolling downhill builds up speed in a consistent direction instead of stopping and starting. If the gradient keeps pointing the same way, m grows and the model moves faster. If the gradient keeps flip-flopping, those pushes cancel out and the model does not get thrown around.
The second average is v, a smoothed average of the gradient squared. Squaring throws away the direction and keeps only the size, so v is basically a memory of how big and noisy the gradients have been for that one parameter. This is the “adaptive scale” part, and it is genuinely clever.
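If you would like to see the two averages behave differently, here is a small sketch in plain Python. It feeds the same kind of running average two made-up gradient histories, one steady and one flip-flopping. The helper name and the smoothing factors 0.9 and 0.95 are assumptions for this example, and bias correction is left out to keep it short.

```python
def running_average(values, beta):
    """Exponential moving average starting from zero, as m and v do."""
    avg = 0.0
    for x in values:
        avg = beta * avg + (1.0 - beta) * x
    return avg


steady = [1.0] * 20            # gradient always points the same way
flip_flop = [1.0, -1.0] * 10   # gradient reverses on every step

beta1, beta2 = 0.9, 0.95       # smoothing factors (assumed values for this sketch)

m_steady = running_average(steady, beta1)
m_flip = running_average(flip_flop, beta1)

# v averages the *squared* gradients, so direction is thrown away.
v_steady = running_average([g * g for g in steady], beta2)
v_flip = running_average([g * g for g in flip_flop], beta2)

print(m_steady, m_flip)  # m is large for the steady run, near zero for the flip-flop
print(v_steady, v_flip)  # v is identical: both runs have gradients of size 1
```

The momentum m tells the two histories apart, while v cannot, because it only remembers how large the gradients were.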
Here is why v matters. The update Adam applies is, roughly, −lr · m / (√v + eps). Read that as: step in the direction of the momentum m, but divide by the square root of v first. (The eps is a tiny number, just there so we never divide by zero.) Dividing by √v means every parameter gets its own personal learning rate. A parameter whose gradients have been huge and jumpy gets a smaller, more careful step, while a parameter sitting in a flat, quiet region gets a bigger step to help it along. Without this, you would have to pick one learning rate that somehow works for thousands of very different parameters at once, which is nearly impossible.
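Here is a rough sketch of that "personal learning rate" effect, using the approximate formula from the paragraph above. The three parameters and their m and v values are invented for illustration, and the function name is not from any library.

```python
import math


def adam_update(m, v, lr=1e-3, eps=1e-8):
    """The rough Adam update from the text: -lr * m / (sqrt(v) + eps)."""
    return -lr * m / (math.sqrt(v) + eps)


lr = 1e-3

# Parameter A has seen steady gradients of 100, so m ~ 100 and v ~ 100**2.
# Parameter B has seen steady gradients of 0.01, so m ~ 0.01 and v ~ 0.01**2.
adam_a = adam_update(m=100.0, v=100.0 ** 2, lr=lr)
adam_b = adam_update(m=0.01, v=0.01 ** 2, lr=lr)

# Parameter C has huge but jumpy gradients: they mostly cancel in m (small),
# yet v still remembers their size (large).
adam_c = adam_update(m=5.0, v=100.0 ** 2, lr=lr)

# Plain downhill stepping would just scale the raw gradient by lr.
plain_a = -lr * 100.0
plain_b = -lr * 0.01

print(plain_a, plain_b)  # four orders of magnitude apart
print(adam_a, adam_b)    # both very close to -lr
print(adam_c)            # much smaller: the jumpy parameter steps carefully
```

The same learning rate gives A and B nearly the same step, even though their raw gradients differ by a factor of ten thousand.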
There is one small bookkeeping detail called bias correction. Both m and v start at zero, so for the very first few steps they read as artificially small, like a thermometer that has not warmed up yet. Adam corrects for this by scaling both averages back up during those early steps, so the first updates are sensibly sized rather than wildly off. You do not have to do anything; it happens automatically inside the optimizer. (The gradual learning-rate “warmup” you may see in training recipes is a separate thing that comes from the schedule, not from bias correction.)
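The arithmetic behind bias correction is short enough to check by hand. The sketch below works through the very first step for one parameter. The gradient value 2.0 is arbitrary, and the betas are PyTorch's defaults for Adam and AdamW, so treat the specific numbers as an illustration.

```python
import math

beta1, beta2 = 0.9, 0.999   # PyTorch's default betas for Adam and AdamW
g = 2.0                     # the first gradient ever seen (arbitrary value)
t = 1                       # step counter, starting at 1

# Both averages start at zero, so after one update they badly under-read.
m = (1 - beta1) * g          # about 0.2, although the only gradient seen is 2.0
v = (1 - beta2) * g * g      # about 0.004, although the only squared gradient is 4.0

# Bias correction divides out the fraction that is still missing at step t.
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)

raw_ratio = m / math.sqrt(v)                # scale of the step without correction
corrected_ratio = m_hat / math.sqrt(v_hat)  # scale Adam actually uses (ignoring eps)

print(m_hat, v_hat)                # recovers 2.0 and 4.0
print(raw_ratio, corrected_ratio)  # roughly 3.16 versus 1.0
```

With these betas the uncorrected first step would be about three times too large. With a smaller second beta such as 0.95 it would come out too small instead, so the correction matters in both directions.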
The W in AdamW
So far we have described Adam. The W stands for weight decay, and it is the one extra idea AdamW adds.
Weight decay is a gentle pressure that pulls every weight slightly toward zero on each step, which discourages the model from letting any number grow enormous. Why would we want that? Big, extreme weights are often a sign that a model is memorizing its training data instead of learning the general pattern, and keeping the weights modest tends to make the model behave better on data it has never seen. (This is a form of regularization, the same family of ideas as dropout.)
The “decoupled” detail in AdamW, and the reason it exists as its own optimizer, is that it applies this shrink directly to the weight (subtracting lr · wd · w) rather than smuggling it in through the gradient. In plain Adam the usual trick was to add wd · w to the gradient, which means the decay also gets divided by √v along with everything else. The two got tangled together: how much a weight actually shrank depended on its gradient history, and the decay did not work the way people expected. AdamW separates them cleanly, so every weight shrinks by the same fraction on each step. You do not need to derive this; just know that AdamW is the version that does weight decay right.
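A tiny experiment makes the tangle visible. The sketch below compares the two approaches on the very first step for a single weight, with the loss gradient set to zero so that only the decay acts. All function names are made up for this example, and it leans on one simplification: on the first step, bias correction makes m equal the gradient and v equal its square, so the betas drop out.

```python
import math


def first_adam_update(grad, lr, eps=1e-8):
    """Adam's very first update from fresh state. After bias correction,
    m_hat == grad and v_hat == grad**2, so the betas cancel out."""
    return -lr * grad / (math.sqrt(grad * grad) + eps)


def adam_with_l2_step(w, loss_grad, lr, wd):
    """Coupled: the decay term wd * w is folded into the gradient first."""
    return w + first_adam_update(loss_grad + wd * w, lr)


def adamw_step(w, loss_grad, lr, wd):
    """Decoupled: shrink the weight by lr * wd * w, then apply Adam."""
    w = w - lr * wd * w
    return w + first_adam_update(loss_grad, lr)


w0, lr = 2.0, 0.1
loss_grad = 0.0  # pretend the loss is flat here, so only decay acts

for wd in (0.1, 0.01):
    coupled = adam_with_l2_step(w0, loss_grad, lr, wd)
    decoupled = adamw_step(w0, loss_grad, lr, wd)
    print(wd, coupled, decoupled)
```

In the coupled version, making wd ten times smaller barely changes the result, because dividing by √v normalizes the decay away. In the decoupled version the shrink scales with wd, which is the behavior you would expect from a knob called weight decay.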
Something to play with
Here is the fun part. Below is a single parameter sitting on a simple bowl-shaped loss, loss = ½(w − 3)². The bottom of the bowl, the spot we want the parameter to roll to, is at w = 3. You can press a button to take one optimizer step at a time and watch the parameter move, along with the running averages m and v.
Try this: step a few times and notice that the early steps each come out at roughly one learning rate in size, even though the bowl is steep out there. That is the division by √v at work, with bias correction keeping the very first steps honest while m and v are still filling up. Now crank the learning rate way up and step again, and watch it overshoot the bottom and oscillate back and forth, like pushing a swing too hard. Then add some weight decay and notice the resting spot shifts a little away from 3 and toward 0, because that gentle pull toward zero is now competing with the bowl. Playing with these knobs is the fastest way to build real intuition for what each one does.
The first few steps are each about one learning rate in size, because bias correction compensates for m and v starting at zero, and the parameter then rolls toward the bottom of the bowl at w = 3. Set the learning rate too high and it overshoots and oscillates; add weight decay and the resting point drifts away from the true minimum toward zero.
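If you would rather run the experiment yourself, here is a minimal sketch of AdamW on the same bowl, written in plain Python so you can see every moving part. It is an illustration under assumptions, not PyTorch's actual implementation: the function name, the starting point w = 0, the step counts, and the specific learning rates and weight decay values are all choices made for this example. The default betas and eps match PyTorch's defaults.

```python
import math


def adamw_on_bowl(lr, wd, steps, w0=0.0, betas=(0.9, 0.999), eps=1e-8):
    """Run AdamW on loss = 0.5 * (w - 3)**2 and return every visited w."""
    beta1, beta2 = betas
    w, m, v = w0, 0.0, 0.0
    path = [w]
    for t in range(1, steps + 1):
        g = w - 3.0                              # gradient of the bowl
        w -= lr * wd * w                         # decoupled weight decay
        m = beta1 * m + (1 - beta1) * g          # momentum
        v = beta2 * v + (1 - beta2) * g * g      # adaptive scale
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        path.append(w)
    return path


gentle = adamw_on_bowl(lr=0.1, wd=0.0, steps=500)
too_hot = adamw_on_bowl(lr=1.0, wd=0.0, steps=500)
decayed = adamw_on_bowl(lr=0.1, wd=0.1, steps=500)

print(gentle[1], gentle[-1])  # first step is about lr; the run ends near 3
print(max(too_hot))           # overshoots past 3 before coming back
print(decayed[-1])            # should sit a little short of 3
```

Try printing a whole path, or changing one argument at a time, to see how each knob changes the trajectory.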
A few traps worth remembering
Every idea here has a sharp edge that catches people the first time they use it. These are the ones worth keeping in your pocket.
Where this shows up in a real language model
This is not a toy you will outgrow. Real training code builds torch.optim.AdamW and feeds it parameter groups, which is just a way of giving different parameters different rules. The common pattern: the big weight matrices get weight decay, while one-dimensional things like biases and the gains inside LayerNorm or RMSNorm get weight_decay=0. (Applying decay to a single one-dimensional gain or bias tends to hurt rather than help, so we turn it off for those.)
A couple of practical numbers you will see everywhere: the two smoothing factors, called betas, are usually (0.9, 0.95) for language models, where the first controls how much m remembers and the second controls v. And every training step follows the same four-beat rhythm: optimizer.zero_grad() to clear out the old gradients, loss.backward() to compute fresh ones, clip_grad_norm_() to cap any gradients that got dangerously large, and finally optimizer.step() to actually move the weights. The learning rate itself is usually not held constant; it is driven by a schedule that changes it over time, which is its own topic.
The whole thing, runnable
If you have PyTorch installed, here is roughly what the setup looks like in real code. Read the comments as you go; they walk through exactly what each line is doing. (This assumes you already have a model and a loss from the rest of your training loop.)
```python
import torch

# Sort the parameters into two buckets: those that should get weight decay
# (the big 2-D weight matrices) and those that should not (1-D biases and gains).
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim < 2 else decay).append(p)  # 1-D params get no decay

# Build AdamW with two parameter groups, one per bucket, each with its own rule.
opt = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.1},     # matrices: regularize them
    {"params": no_decay, "weight_decay": 0.0},  # biases/gains: leave them alone
], lr=3e-4, betas=(0.9, 0.95))                  # lr and the m/v smoothing factors

# One training step, the four-beat rhythm:
opt.zero_grad(set_to_none=True)                          # 1. clear out last step's gradients
loss.backward()                                          # 2. compute this step's gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 3. cap oversized gradients
opt.step()                                               # 4. nudge the weights downhill
```