Autograd, the engine that figures out gradients for you

You only ever write the easy part, the forward calculation. PyTorch quietly remembers everything you did and then hands you every gradient for free. We will build a tiny example, press one button, and watch the numbers flow backward. No PyTorch experience needed.

Before we start, three words you will need

This article is about how a neural network learns, so let us get three words straight first.

A tensor is just a box of numbers. It might hold a single number, a list of numbers, or a whole grid of them. When you hear “tensor”, picture an array of numbers and you are already there.
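To make the "box of numbers" picture concrete, here is a minimal sketch that builds the three kinds of tensor just mentioned. The values and variable names are made up for illustration.

```python
import torch

single = torch.tensor(3.14)                      # one number
a_list = torch.tensor([1.0, 2.0, 3.0])           # a list of numbers
grid = torch.tensor([[1.0, 2.0], [3.0, 4.0]])    # a grid of numbers

# The shape tells you how the numbers are arranged in the box.
print(single.shape, a_list.shape, grid.shape)
# torch.Size([]) torch.Size([3]) torch.Size([2, 2])
```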

A parameter is one of the numbers the network is allowed to adjust as it learns. People often call these the weights. At the start they are random, and learning is nothing more than nudging them, a little at a time, until the network gets good at its job.

A gradient is the answer to one very practical question: “if I increase this particular number a tiny bit, does my error go up or down, and by how much?” A gradient is a slope. If you know the slope, you know which way to step to make things better. That is the whole game, and if that already feels like a lot, do not worry. We are going to build it up slowly.
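You can check the slope idea with a few lines of code. The sketch below uses a made-up error, error(w) = (w - 3)^2, chosen only because it is easy to reason about. It nudges w by a tiny amount to measure the slope by hand, then asks autograd for the same slope.

```python
import torch

def error(w):
    # A made-up error for illustration: it is smallest when w equals 3.
    return (w - 3.0) ** 2

# Slope by hand: nudge w a tiny bit each way and see how much the error moves.
eps = 1e-6
numeric_slope = (error(5.0 + eps) - error(5.0 - eps)) / (2 * eps)

# Slope from autograd: mark w as something to track, then call backward().
w = torch.tensor(5.0, requires_grad=True)
error(w).backward()
autograd_slope = w.grad.item()

print(numeric_slope, autograd_slope)  # both are close to 4.0
```

The slope is positive, so raising w raises the error, and the helpful step is to lower w toward 3.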

You write the forward pass, PyTorch writes the backward one

Here is the part that feels almost too good to be true. To train a network you need gradients for every single parameter, and there can be billions of them. Working those out by hand would be hopeless. Autograd is the tool inside PyTorch that does it for you, automatically.

The trick is that as your calculation runs forward, autograd quietly writes down every step you took, like keeping a receipt for each operation. This running record is called a computation graph, which is just a map of what was computed from what. Once you have your final error (we usually call that error the loss, since lower is better), you call one method on it, loss.backward(). Autograd then reads its receipts in reverse and works out the gradient for everything. You never derive a single formula yourself.

Every tensor quietly carries a few labels that explain its role in this graph. The most important one is requires_grad. If a tensor has requires_grad=True, autograd watches it and will compute a gradient for it. If it is False, autograd leaves it alone. That single flag is how PyTorch tells the difference between your data, which is fixed, and your parameters, which it should learn.
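Here is a small sketch of those labels in action. The tensors and their values are invented for the example. grad_fn is another of the labels: it is the receipt that says which operation produced a tensor.

```python
import torch

data = torch.tensor([1.0, 2.0])                          # requires_grad defaults to False
weight = torch.tensor([0.5, -0.5], requires_grad=True)   # autograd should watch this one

out = (data * weight).sum()

print(data.requires_grad)       # False: plain data
print(weight.requires_grad)     # True: a parameter
print(out.requires_grad)        # True: computed from something autograd is watching
print(out.grad_fn is not None)  # True: out carries a receipt for how it was made

out.backward()
print(weight.grad)              # tensor([1., 2.])
print(data.grad)                # None: autograd left the data alone
```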

Below is a small live example you can poke at. It computes n = x1*w1 + x2*w2 + b, then squashes that through a function called tanh to get o, and finally measures how far o is from a target with loss = (o - target)^2. The x values are the incoming data and the w values and b are the learnable parameters. Drag the weights around, press the backward button, and read the gradient that appears on every node. If the numbers feel mysterious at first, that is completely normal. Just watch how a change in one place ripples through the rest.

The x1 and x2 values are data, so they have requires_grad set to False and their gradient stays empty. Turn a weight off and watch its gradient vanish. Switch on no_grad and the graph is never even recorded, so backward has nothing to work with (in real PyTorch, calling backward() on such a result raises an error).
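If you would rather run the same little network yourself, here is a sketch in plain PyTorch. The starting values for the inputs, weights, bias, and target are made up. The last few lines redo the chain rule by hand for comparison, using the fact that the slope of tanh is 1 - tanh^2.

```python
import torch

# Data: plain tensors, so requires_grad stays False.
x1, x2 = torch.tensor(2.0), torch.tensor(-1.0)
target = torch.tensor(1.0)

# Parameters: autograd should track these. The starting values are invented.
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

# Forward pass, exactly as described in the text.
n = x1 * w1 + x2 * w2 + b
o = torch.tanh(n)
loss = (o - target) ** 2

# The "backward button".
loss.backward()

print("w1.grad:", w1.grad.item())
print("w2.grad:", w2.grad.item())
print("b.grad: ", b.grad.item())
print("x1.grad:", x1.grad)  # None: x1 is data

# The same gradients by hand, one link of the chain at a time.
o_val = o.item()
dloss_dn = 2 * (o_val - 1.0) * (1 - o_val ** 2)   # dloss/do times do/dn
print(abs(w1.grad.item() - dloss_dn * 2.0) < 1e-5)     # dn/dw1 is x1
print(abs(w2.grad.item() - dloss_dn * (-1.0)) < 1e-5)  # dn/dw2 is x2
print(abs(b.grad.item() - dloss_dn) < 1e-5)            # dn/db is 1
```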

The one gradient that trains every language model

Let us connect this to real language models, because the payoff is lovely. A language model is trained to guess the next word, or more precisely the next token, which is just a chunk of text. For each guess the model produces a set of raw scores, one per possible token. Those raw scores are called logits. We turn logits into probabilities with a function called softmax, which simply squeezes the scores so they are all positive and add up to one, like turning a vote count into percentages.

Now for the beautiful part. When you ask autograd for the gradient that gets pushed back into those logits, the answer is astonishingly simple. It is the predicted probabilities minus a vector that is 1 for the correct token and 0 everywhere else (that all-or-nothing vector is called the one-hot target). In words: every token the model thought was likely but was wrong gets nudged down, and the correct token gets nudged up. That single, tidy signal is what flows backward through the entire model and teaches it to write.

Here is something you can play with. Pick which token is the right answer, then nudge the logits up and down. Watch the gradient on the correct token go negative, which is autograd’s way of saying “push this one higher”, while every wrong token gets a positive gradient saying “push these lower”. And notice the moment it clicks: as the probability on the correct token approaches 1, every gradient shrinks toward zero, because there is almost nothing left to fix.

Choose the correct token and nudge the logits. The right token gets a negative gradient, which means push it up, and every other token gets a positive gradient, which means push it down. As the prediction approaches perfect, all the gradients shrink toward zero.
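You can confirm the "probabilities minus one-hot" result in a few lines. This is a sketch with four invented logits and an arbitrarily chosen correct token. It uses F.cross_entropy as the loss, which applies softmax internally.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0, 0.0], requires_grad=True)  # made-up scores for 4 tokens
correct = 1                                                       # hypothetical right answer

# cross_entropy expects a batch, so we add a batch dimension of size 1.
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([correct]))
loss.backward()

# Build "predicted probabilities minus one-hot target" by hand.
probs = F.softmax(logits.detach(), dim=-1)   # detach(): just the numbers, no graph
one_hot = torch.zeros(4)
one_hot[correct] = 1.0

print(logits.grad)
print(probs - one_hot)
print(torch.allclose(logits.grad, probs - one_hot, atol=1e-6))  # True
```

The entry for the correct token is negative and all the others are positive, which matches the nudges described above.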

backward() adds up, so remember to reset

There is one habit you must build early, and it trips up almost everyone at the start. When you call backward(), the gradients are not freshly written. They are added on top of whatever was already stored. PyTorch keeps a running total.

Most of the time this is exactly what you want. It powers a technique called gradient accumulation: you run several small groups of examples (each small group is called a micro-batch), call backward() after each one, and let the gradients pile up. As long as each micro-batch loss is divided by the number of micro-batches (assuming equal-sized micro-batches and a loss that is an average over examples), the sum matches what you would get from processing one big group all at once, which is handy when a big group will not fit in memory.
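Here is a sketch that checks this claim on a toy problem. The data, the tiny linear model, and the helper name mean_loss are all invented for the example. Note the division by the number of micro-batches: because the loss is an average, each micro-batch has to be scaled down so the pile adds up to the big-batch average.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)   # a toy parameter
x = torch.randn(8, 3)                    # 8 made-up examples
y = torch.randn(8)                       # 8 made-up targets

def mean_loss(xb, yb):
    # Average squared error over whatever examples are passed in.
    return ((xb @ w - yb) ** 2).mean()

# Option A: one big batch of 8.
mean_loss(x, y).backward()
big_batch_grad = w.grad.clone()
w.grad = None                            # clear the stored gradient before trying option B

# Option B: four micro-batches of 2, letting backward() pile the gradients up.
num_micro = 4
for xb, yb in zip(x.chunk(num_micro), y.chunk(num_micro)):
    (mean_loss(xb, yb) / num_micro).backward()

print(torch.allclose(w.grad, big_batch_grad, atol=1e-6))  # True
```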

But it also means you have to clean up before each new training step, or yesterday’s leftovers will quietly poison today’s update. That cleanup is one line, optimizer.zero_grad(), and it resets the stored gradients back to empty. Forget it and your gradients keep stacking across steps that have nothing to do with each other, and your model will train on nonsense. So the rhythm of every training loop is simple: reset, compute, step, repeat.

Each call to backward adds this micro-batch's gradient into the running total. Calling zero_grad resets it back to empty: in recent PyTorch versions the default is to set the gradient to None, nothing at all, rather than to fill it with zeros. That is why you should never assume a gradient is already there before the first backward.
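The whole life cycle of a stored gradient fits in a short sketch. The parameter and the loss here are invented. set_to_none=True is written out explicitly so the behavior does not depend on your PyTorch version (it is the default from PyTorch 2.0 onward).

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optimizer = torch.optim.SGD([w], lr=0.1)

before_any_backward = w.grad
print(before_any_backward)          # None: nothing is stored before the first backward

(w ** 2).sum().backward()
after_first = w.grad.clone()
print(after_first)                  # tensor([2., 4.])

(w ** 2).sum().backward()
after_second = w.grad.clone()
print(after_second)                 # tensor([4., 8.]): added on top, not overwritten

optimizer.zero_grad(set_to_none=True)
print(w.grad)                       # None again: reset to empty, not to zeros
```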

no_grad, detach, and eval, three things people mix up

These three tools sound similar and get confused constantly, so let us pull them apart gently. They are not competing options; they do completely different jobs.

torch.no_grad() is a way of telling autograd “stop keeping receipts for a moment”. Inside a no_grad block, no computation graph is built, so nothing can be differentiated. You reach for it whenever you just want an answer and have no intention of learning from it, which saves both memory and time.

model.eval() does something unrelated. It flips certain layers into their “we are done training” behavior. (For example, dropout, a trick that randomly switches off neurons during training, stops doing that once you call eval.) Importantly, eval does not stop the graph from being built. It only changes how a few layers act.
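A short sketch makes the difference between the first two visible. The tiny model below (one linear layer followed by dropout) is made up for the demonstration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.Dropout(p=0.5),
)
x = torch.ones(1, 4)

# eval() changes layer behavior: dropout stops switching neurons off,
# so two passes over the same input give the same output.
model.eval()
out_eval = model(x)
out_eval_again = model(x)
print(model.training)                         # False
print(torch.equal(out_eval, out_eval_again))  # True
print(out_eval.requires_grad)                 # True: eval() did NOT stop the graph

# no_grad() is what actually stops the receipts.
with torch.no_grad():
    out_no_grad = model(x)
print(out_no_grad.requires_grad)              # False
```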

.detach() is the surgical option. It snips one specific tensor out of the graph so gradients stop flowing through that point, while the numbers themselves stay exactly the same. When you simply want to record a loss value for a chart, you call .item(), which hands you a plain Python number with no graph attached at all. A good rule: generating text needs both no_grad() and eval(), and logging a number needs .item().
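Here is a small sketch of the snip. The numbers are made up. The loss uses the same value twice, once through the graph and once through a detached copy, so you can see that the gradient only flows along the attached path.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

a = w * 3             # tracked: part of the graph
a_cut = a.detach()    # same number, but snipped out of the graph

print(a_cut.item() == a.item())   # True: the value did not change
print(a_cut.requires_grad)        # False: no gradient flows through it

loss = a * a_cut      # the gradient can only reach w through a
loss.backward()
print(w.grad)         # tensor(18.): a_cut acted like the constant 6, so 3 * 6
                      # (without detach it would have been 36)

logged = loss.item()  # a plain Python float for a chart or a log line
print(type(logged), logged)
```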

A few traps worth keeping in your pocket

Every idea here has a sharp edge that catches people the first time. Here are the ones worth memorizing, with something to click through.

How this looks in a real language model project

None of this is a toy you will outgrow. A real pretraining loop follows the exact rhythm we just described. It calls optimizer.zero_grad() once at the top of each step to clear the slate, then runs a small loop over micro-batches where each one does loss = loss / grad_accum followed by loss.backward() so the gradients add up cleanly. After that comes a safety step called gradient clipping (which caps gradients that grow too large), and finally a single optimizer.step() that actually nudges the parameters. Any number that gets written to a log uses .item() so the heavy graph is thrown away immediately and memory stays low. Text generation is wrapped in no_grad() and paired with model.eval(). One last detail you will see in reinforcement learning code: log-probabilities are computed in high precision (F.log_softmax(logits.float(), ...)), because those methods subtract two log-probabilities and low precision rounding would wash out the tiny difference that matters.
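To see that rhythm in one place, here is a minimal sketch of such a loop. Everything concrete in it is an assumption made for illustration: the toy model, the random data from the hypothetical get_micro_batch helper, the learning rate, and the clipping threshold. A real project would swap in its own model, data loader, and optimizer.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, n_embed, grad_accum = 5, 4, 2       # toy sizes, chosen arbitrarily

# A stand-in "language model": token id -> embedding -> scores over the vocabulary.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, n_embed),
    torch.nn.Linear(n_embed, vocab),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def get_micro_batch():
    # Hypothetical data source: random tokens and random "next tokens".
    tokens = torch.randint(0, vocab, (8,))
    targets = torch.randint(0, vocab, (8,))
    return tokens, targets

logged_losses = []
model.train()
for step in range(3):
    optimizer.zero_grad()                   # reset: clear the slate
    step_loss = 0.0
    for _ in range(grad_accum):             # compute: pile up micro-batch gradients
        tokens, targets = get_micro_batch()
        loss = F.cross_entropy(model(tokens), targets)
        loss = loss / grad_accum            # scale so the sum acts like one big batch
        loss.backward()
        step_loss += loss.item()            # .item(): a plain number, no graph kept
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap large gradients
    optimizer.step()                        # step: nudge the parameters
    logged_losses.append(step_loss)
    print(step, step_loss)

# Generation: eval() for layer behavior, no_grad() so no graph is recorded.
model.eval()
with torch.no_grad():
    next_logits = model(torch.tensor([0]))
    next_token = next_logits.argmax(dim=-1)
print(next_logits.requires_grad)            # False
```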

The whole thing, runnable

If you have PyTorch installed, paste this into a file and run it. Read the comments as you go; they walk through what each line is doing and why.

import torch
import torch.nn.functional as F
torch.manual_seed(0)            # fixes the randomness so your numbers match mine

B, T, n_embed, vocab = 2, 3, 4, 5
W = torch.randn(n_embed, vocab, requires_grad=True)  # a learnable parameter: autograd will track it
b = torch.zeros(vocab, requires_grad=True)           # another learnable parameter
x = torch.randn(B, T, n_embed)                        # the input data: requires_grad is False, so it is fixed
targets = torch.randint(0, vocab, (B, T))             # the correct token for each position

def forward():
    logits   = x @ W + b                              # raw scores; tracked because W and b require grad
    logprobs = F.log_softmax(logits.float(), dim=-1)  # use full precision before log_softmax
    # pick out the log-probability of the correct token, then average to get one loss number
    loss = -logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).mean()
    return logits, loss

logits, loss = forward()
print(logits.requires_grad, logits.is_leaf)   # True False: logits were computed, so they are not a leaf
loss.backward()                                # this fills in every gradient automatically
print(W.grad.shape == W.shape)                 # True: every parameter gets a gradient the same shape as itself
print(x.grad is None)                          # True: x was not trainable, so it has no gradient

# accumulation: a SECOND backward ADDS into the stored gradient (rebuild the graph first)
g1 = W.grad.clone()
_, loss2 = forward()                           # run the forward pass again to build a fresh graph
loss2.backward()
print(torch.allclose(W.grad, 2 * g1))          # True: the two gradients summed, they did not overwrite

W.grad = None                                  # this is what zero_grad does: reset the stored gradient
with torch.no_grad():                          # inside here, autograd keeps no receipts
    probe = x @ W + b
print(probe.requires_grad)                     # False: no graph was built, so nothing can be differentiated
