Learning rate schedules, and why the speed of learning changes over time

A model learns by taking small steps, and how big those steps are matters more than almost anything else. Start too fast and everything falls apart; stay slow forever and you never arrive. Here is the gentle idea of changing your step size as training goes on, built up slowly, assuming you have never touched PyTorch.

First, what is a learning rate at all

Before we talk about changing the learning rate, let us make sure we agree on what it is, because everything here rests on it.

When a model trains, it is trying to get better at a task by adjusting a huge pile of internal numbers. Those numbers are called parameters, and they are simply the knobs the model is allowed to turn. At each step of training, the model looks at how wrong it currently is, figures out which direction to nudge each knob to be a little less wrong, and then takes a small step in that direction. The size of that step is the learning rate. That is the whole definition: it is how big a nudge we take each time we update the model.
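To make that definition concrete, here is a minimal sketch in plain Python, with no PyTorch needed. The single knob `w`, the target value of 3.0, and the squared-error measure of wrongness are all made up for illustration. A real model has millions of knobs, but each update has this same shape.

```python
def gradient(w, target=3.0):
    # slope of the squared error (w - target) ** 2 with respect to w
    return 2.0 * (w - target)

def update(w, learning_rate):
    # step against the slope; the learning rate scales how far we move
    return w - learning_rate * gradient(w)

w = 0.0                                  # the knob starts far from the target
for step in range(5):
    w = update(w, learning_rate=0.1)
    print(step, round(w, 4))             # w moves closer to 3.0 each step
```

With a learning rate of 0.1, each update closes a fifth of the remaining gap, so the knob keeps approaching the target without jumping past it.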

If the learning rate is tiny, the model creeps toward a good answer very slowly, like tiptoeing across a room. If it is large, the model takes bold leaps, which is faster but risky, because a leap that is too big can overshoot the target entirely and send everything in the wrong direction. So the learning rate is a balance between speed and safety, and getting it right matters more than almost any other choice you make while training.
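The trade-off can be seen on the same kind of toy problem. The sketch below is only an illustration: the one-knob squared-error setup and the three rates are assumptions chosen to show the three behaviours, not values you would use on a real model.

```python
def distance_after(learning_rate, steps=20, start=0.0, target=3.0):
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - target)        # slope of (w - target) ** 2
        w = w - learning_rate * grad     # one update at this learning rate
    return abs(w - target)               # how far from the best value we ended

for lr in (0.001, 0.1, 1.1):
    print(lr, distance_after(lr))
```

The tiny rate has barely moved after twenty steps, the middle rate lands close to the target, and the large rate overshoots further on every step and ends up much farther away than where it started.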

Here is the key insight of this whole article: the best learning rate is not the same at the start of training as it is at the end. Instead of picking one number and leaving it fixed, we let it change over time, following a planned curve. That planned curve is called a schedule, and it turns out a very particular shape works beautifully for large language models. If that sounds abstract right now, that is completely normal. By the end it will feel obvious.

Warm up slowly, then cool down along a curve

There are two dangers that sit at the two ends of training, and the schedule is designed to handle both.

At the very beginning, the model knows nothing, and the machinery that decides which direction to step is still finding its feet. (If you have heard of optimizers like Adam, this is the stage where the optimizer's internal running averages are still mostly noise.) If we hit the model with a full-sized learning rate right away, those bold early steps can blow up the loss, which is the single number measuring how wrong the model is. Bigger loss means worse, so a blown-up loss is a disaster. The fix is gentle and intuitive: we warm up. We start the learning rate near zero and ramp it up in a straight line over the first stretch of training, often the first few hundred to a few thousand steps. This gives the model time to steady itself before we ask it to take real strides.
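The warmup on its own is only a few lines. This is a sketch under assumed numbers: the function name `warmup_lr`, the 2000-step ramp, and the peak of 3e-4 are illustrative choices, not requirements.

```python
def warmup_lr(step, warmup_steps, peak_lr):
    # once the ramp is over, simply hold the peak
    if step >= warmup_steps:
        return peak_lr
    # straight-line ramp: 0 at step 0, rising evenly toward peak_lr
    return peak_lr * step / warmup_steps

for step in (0, 500, 1000, 1500, 2000):
    print(step, warmup_lr(step, warmup_steps=2000, peak_lr=3e-4))
```

Halfway through the ramp the rate is half the peak, which is all "a straight line" means here.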

At the other end, near the finish, the opposite problem appears. With a large learning rate, the model keeps taking big steps and bounces around the good answer without ever settling into it, like a marble that will not stop rolling around the bottom of a bowl. The fix is to slow down on purpose. We decay the learning rate, shrinking it back down so the model can settle gently into a good resting place. The most popular way to shrink it follows the shape of a cosine curve, which is just a smooth wave that starts off nearly flat, falls steepest through the middle, and flattens out again as it approaches the bottom. So the rate eases down from its peak to a small floor value without any sudden jolts.
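The cosine shape is easier to trust once you see the numbers. In this sketch, `progress` is the fraction of the decay phase completed, from 0.0 to 1.0; the names and the peak and floor values are assumptions for illustration.

```python
import math

def cosine_decay(progress, peak_lr, floor_lr):
    # wave falls smoothly from 1.0 to 0.0 as progress goes from 0.0 to 1.0
    wave = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor_lr + (peak_lr - floor_lr) * wave

previous = None
for tenth in range(11):
    lr = cosine_decay(tenth / 10, peak_lr=3e-4, floor_lr=3e-5)
    drop = 0.0 if previous is None else previous - lr   # fall since the last row
    print(f"{tenth / 10:.1f}  lr={lr:.2e}  drop={drop:.2e}")
    previous = lr
```

The `drop` column is small in the first and last rows and largest around the middle, which is the "nearly flat at both ends, steepest in the middle" behaviour described above.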

Put those two pieces together and you get the classic shape: a straight ramp up, then a smooth curve down. The beautiful part is that the schedule is just a plain function of the step number. You hand it the current step, and it hands you back the learning rate to use right now. Each iteration you read that value and set it on the optimizer, the piece of machinery that applies the updates, before taking the step.

Here is something you can play with directly. Drag the warmup length to change how long the gentle ramp lasts, drag the total steps to stretch or squeeze the whole curve, and drag the peak rate to set how high the climb goes. The moving marker shows you the exact learning rate at the current step. Notice how it climbs in a straight line, then bends over the top and glides down, nearly flat at both ends and steepest in the middle.


Drag the warmup length, the total number of steps, and the peak rate, then watch the curve change shape. The marker reads off the exact learning rate at the current step. It is nearly flat at the start of the cosine and again near the end, and steepest right through the middle.

A few traps worth remembering

Every idea here has a sharp edge that catches people the first time. Here are the ones to keep in your pocket, with something you can poke at to see each one.


Where this shows up in a real language model

This is not a toy idea you will outgrow. In a real training run, the trainer computes the learning rate once per iteration by calling the schedule with the current step, then writes that value into the optimizer before stepping. The optimizer organizes the parameters into groups called param groups, and we set the rate on each one. In plain terms, the loop says: figure out today’s rate, tell every group of parameters to use it, then take the step.

A couple of details from real codebases are worth knowing. The warmup is usually a small slice of the whole run, often just a percent or two of the total steps. The cosine typically decays down to roughly ten percent of the peak rather than all the way to zero, because a little bit of learning at the end still helps. PyTorch does ship ready-made schedulers with names like LambdaLR and CosineAnnealingLR that you advance by calling scheduler.step(), and they work fine. But many language model codebases skip them and just compute the number by hand, because it gives you complete control and the math is short, as you are about to see.
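Those two rules of thumb turn into concrete settings with a little arithmetic. The helper below is hypothetical, and its defaults of a two percent warmup and a ten percent floor are just the rough figures mentioned above, not fixed rules.

```python
def schedule_settings(total_steps, peak_lr, warmup_fraction=0.02, floor_ratio=0.1):
    # warmup_fraction and floor_ratio are illustrative defaults, not fixed rules
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    min_lr = peak_lr * floor_ratio       # the floor the cosine decays down to
    return warmup_steps, min_lr

warmup_steps, min_lr = schedule_settings(total_steps=100_000, peak_lr=3e-4)
print(warmup_steps, min_lr)
```

For a run of 100,000 steps with a peak of 3e-4, this gives a 2000-step warmup and a floor of about 3e-5.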

The whole thing, runnable

If you have PyTorch installed, the pattern below is the heart of it. Read the comments as you go; they walk through exactly what each line is doing. The function takes the current step and returns the learning rate for that moment.

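import math
import torch

def lr_at(step, warmup, total, max_lr, min_lr):
    if step < warmup:                        # still warming up: ramp straight up from 0
        return max_lr * step / warmup        # exactly 0 at step 0, climbing toward the peak at step == warmup
    if step >= total:                        # at or past the planned end: just hold the floor value
        return min_lr
    # how far we are through the decay phase, as a fraction from 0.0 to 1.0
    prog = (step - warmup) / (total - warmup)
    # cosine glides from the peak down to the floor; cos goes 1 -> -1 as prog goes 0 -> 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * prog))

# a tiny stand-in model and random data, only so the loop below runs on its own
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# a real run might use warmup=2000 and total=100_000; small numbers here finish instantly
for step in range(200):
    loss = model(torch.randn(8, 4)).pow(2).mean()   # placeholder loss on random inputs
    optimizer.zero_grad()
    loss.backward()                          # work out which way to nudge each knob
    # each training iteration, before taking a step:
    lr = lr_at(step, warmup=20, total=200, max_lr=3e-4, min_lr=3e-5)
    for g in optimizer.param_groups:         # tell every group of parameters
        g["lr"] = lr                         # to use today's learning rate
    optimizer.step()                         # now take the step at that rate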