Cross-entropy loss made friendly

Every time a language model guesses the next word, we need one honest number that says how good that guess was. Cross-entropy is that number, and the rule it follows is simple: being confident and right is cheap, while being confident and wrong is expensive. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.

First, what are we even measuring

Imagine a language model reading a sentence and trying to guess the next word. It does not just pick one word and stop. Instead, it spreads its belief across every possible word it knows, giving each one a slice of confidence. We call the full list of words it knows the vocabulary, and a probability is just a number between 0 and 1 that says how strongly the model believes in a particular choice. All those slices add up to 1, the way the slices of a single pie add up to one whole pie.

Now, training a model means nudging it to make better guesses over and over. To nudge it, we need a single number that scores how wrong each guess was. That number is called the loss. A small loss means a good guess, and a big loss means a bad one. The whole game of training is gently pushing the loss down.

Cross-entropy is the particular loss we use for this kind of guessing. If some of these words feel unfamiliar, that is completely normal, and we will define each one as it comes up. By the end you will see that the core idea fits in a single sentence.

The negative log of the right answer

Here is that single sentence. Cross-entropy looks at the probability the model gave to the word that actually came next, and the loss is the negative logarithm of that one number. In symbols, loss = -log p[target], where p[target] is the probability the model assigned to the correct word.

Let us unpack that slowly, because two pieces might be new.

The target is simply the right answer, the word that really did come next in the text. We are not interested in how confident the model was about the other words. We only care about the slice of belief it put on the truth.

The logarithm, written log, is a function that grows very slowly for big inputs and plunges toward negative infinity as its input approaches 0. Here it always means the natural logarithm, the one that undoes the exponential function. Because a probability never exceeds 1, its logarithm is never positive, so we put a minus sign in front to make the loss a positive number that we want to make small. The effect is this: if the model put almost all of its belief on the correct word, then p[target] is close to 1, and -log of a number near 1 is close to 0, so the loss is tiny. If the model confidently backed the wrong word, then p[target] is close to 0, and -log of a number near 0 shoots up toward infinity, so the loss is enormous. That asymmetry is the heart of it: confident and right is cheap, and confident and wrong is brutal.
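To put numbers on that asymmetry, here is a minimal sketch in plain Python. The function name neg_log_loss and the sample probabilities are made up for illustration; the only thing taken from the text is the formula loss = -log p[target].

```python
import math

def neg_log_loss(p_target: float) -> float:
    """Cross-entropy for one guess: minus the natural log of the
    probability the model put on the correct word."""
    return -math.log(p_target)

# Illustrative probabilities, from confident-and-right to confident-and-wrong.
for p in (0.99, 0.5, 0.1, 0.001):
    print(f"p[target] = {p:<6} loss = {neg_log_loss(p):.3f}")
```

The loss stays near 0 while p[target] is close to 1 and grows quickly as p[target] shrinks.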

One more useful number to hold onto. Before a model has learned anything, it has no reason to prefer any word, so it spreads its belief roughly evenly across all of them. If the vocabulary has V words, a perfectly even guess gives each word a probability of 1/V, and the loss works out to -log(1/V), which is exactly log V. That is about the score a brand new, untrained model starts at, and it is a handy baseline to compare against. If your loss is sitting near log V, your model has not learned anything yet.
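The baseline is easy to check by hand. This small sketch assumes nothing beyond the formula above; the function name uniform_baseline and the vocabulary sizes are illustrative choices, not values from any particular model.

```python
import math

def uniform_baseline(vocab_size: int) -> float:
    """Loss of a guess that gives every word the same probability 1/V."""
    return -math.log(1.0 / vocab_size)

# Illustrative vocabulary sizes; the two printed numbers should match.
for V in (5, 1000, 50000):
    print(V, round(uniform_baseline(V), 3), round(math.log(V), 3))
```

Because the baseline grows only logarithmically, even a very large vocabulary starts at a loss of around 10 or 11 rather than in the thousands.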

The raw scores the model produces before they get turned into probabilities have a name too. We call them logits. They can be any number, positive or negative, and a step called softmax squeezes them into proper probabilities that add up to 1. You do not need the details of softmax to follow along here, just the idea that logits go in and probabilities come out.
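If you are curious what softmax looks like, here is a minimal sketch in plain Python. The logits are made-up numbers for a four-word vocabulary, and subtracting the largest logit first is a common safety habit assumed here, not something the text above requires.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)                            # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]   # exponentiate: every value becomes positive
    total = sum(exps)
    return [e / total for e in exps]           # divide so the slices add up to 1

logits = [2.0, -1.0, 0.5, 0.0]   # hypothetical scores, one per word
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))                # 1.0, up to floating-point rounding
```

The word with the highest logit always ends up with the largest slice of probability.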

Here is a way to build intuition. Picture the logits as a row of bars, one per word, with one word marked as the target, and imagine dragging the bars up and down. Push the target word's logit up and the loss falls toward 0, because the model is getting more confident about the right answer. Push a wrong word's logit up instead and the loss climbs. The reference line to keep in mind is log V, the score of an even guess and roughly where a fresh model starts before it has learned anything.
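The same experiment can be reproduced in a few lines of plain Python. This is a sketch under simple assumptions: a made-up vocabulary of four words, logits that all start at 0, and a hypothetical helper called cross_entropy that computes -log softmax(logits)[target].

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], written as logsumexp(logits) - logits[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

V, target = 4, 2               # four words, word number 2 is the right answer
even = [0.0] * V               # all logits equal: an even guess

right = list(even)
right[target] = 5.0            # raise the target word's logit

wrong = list(even)
wrong[0] = 5.0                 # raise a wrong word's logit instead

print("even guess:       ", cross_entropy(even, target), " log V:", math.log(V))
print("target raised:    ", cross_entropy(right, target))
print("wrong word raised:", cross_entropy(wrong, target))
```

The even guess lands on log V, raising the target drops the loss below it, and raising a wrong word lifts the loss well above it.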

There is a lovely detail hiding underneath this. When training adjusts the model, it needs to know how the loss changes as each logit moves. That sensitivity is called the gradient, and for cross-entropy, taken with respect to the logits, it comes out to be p - onehot(target). In words, the gradient is the model’s predicted probabilities minus a list that is 1 at the correct word and 0 everywhere else. Training steps each logit in the opposite direction of its gradient, so the logit of the truth is pushed up and every other logit is pushed down, each by exactly the amount it is currently off. If that result looks familiar, it is the same clean formula we derived in the Autograd article.
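One way to convince yourself of the p - onehot(target) formula is to compare it with a numerical estimate. The sketch below uses plain Python and made-up logits; it nudges each logit by a tiny amount and measures how the loss changes, which is a generic finite-difference check rather than how any framework computes gradients.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

logits, target = [1.0, -0.5, 2.0, 0.0], 1   # hypothetical values

# Analytic gradient from the text: predicted probabilities minus the one-hot target.
p = softmax(logits)
analytic = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

# Numerical gradient: move one logit up and down by eps and compare the losses.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    up[i] += eps
    down = list(logits)
    down[i] -= eps
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

for a, n in zip(analytic, numeric):
    print(f"analytic {a:+.6f}   numeric {n:+.6f}")
```

The two columns agree to several decimal places. Only the target's entry is negative, which is why stepping against the gradient raises that logit and lowers the rest.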

The loss is a curve you can picture

Let us strip away every word except the right one and watch what the loss does as the probability of that one correct word changes. The formula becomes just -log(p), where p is the belief placed on the truth.

The shape of this curve tells a story. When the model is already fairly confident and p is large, the curve is gently sloped, so there is little pressure to improve. But as p slides toward 0, the curve turns punishingly steep. That steep wall is exactly what drives the model hard away from confident mistakes. A near miss costs a little, and a confident blunder costs a fortune.
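The steepness can be made concrete with one line of calculus: the slope of -log(p) with respect to p is -1/p. The sketch below, in plain Python with illustrative probabilities, prints the loss and the slope side by side.

```python
import math

def loss(p: float) -> float:
    return -math.log(p)

def slope(p: float) -> float:
    """Derivative of -log(p) with respect to p, which is -1/p."""
    return -1.0 / p

# Illustrative probabilities on the correct word, from confident to badly wrong.
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<5} loss = {loss(p):.3f}   slope = {slope(p):.1f}")
```

Near p = 1 the slope is close to -1, while at p = 0.01 it is -100, so the pull away from a confident mistake is about a hundred times stronger.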

Here is another way to picture it. Imagine sliding the probability of the correct word left and right and watching the loss trace out that curve. The loss barely moves near p = 1, where the model is already confident, then climbs violently as p heads toward 0. The same shape governs a closely related quantity called perplexity.

That word perplexity is worth a quick note since you will hear it constantly. Perplexity is simply exp(loss), the loss fed back through the exponential function. You can read it as the effective number of words the model feels it is choosing between. A perplexity of 1 means the model is certain, while a perplexity of 50 means it is about as unsure as if it were picking from 50 equally likely words.
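Here is the conversion as a minimal Python sketch. The function name perplexity is just a label for exp(loss), and the example losses are chosen to match the two cases described above.

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is the loss fed back through the exponential function."""
    return math.exp(loss)

print(perplexity(0.0))            # a loss of 0 means the model is certain
print(perplexity(math.log(50)))   # the loss of an even guess over 50 words
```

A loss of 0 gives a perplexity of 1, and the loss of an even guess over 50 words gives a perplexity of 50, up to floating-point rounding.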

A few traps worth remembering

Every idea here has a sharp edge that catches people the first time. These are the ones to keep in your pocket.

[Interactive widget: common cross-entropy traps]

How this shows up in a real language model

This is not a toy idea you will outgrow. In real PyTorch code, the workhorse is a function called F.cross_entropy, where F is the usual nickname for torch.nn.functional, and there are a few practical details worth knowing.

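First, it wants raw logits, not probabilities. You do not run softmax yourself beforehand. If you pass probabilities by mistake, the function quietly applies softmax a second time and reports the wrong loss. The function fuses the two steps, softmax and the negative log, into one operation, which is both convenient and far more numerically stable, meaning it avoids the overflow and underflow that creep in when you exponentiate large logits and take the log as a separate step.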
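To see why fusing the steps matters, here is a sketch in plain Python that compares a naive two-step version with a fused one built on the log-sum-exp trick. Both helper names are hypothetical, and this illustrates the general idea only; it is not a claim about how PyTorch implements F.cross_entropy internally.

```python
import math

def naive_cross_entropy(logits, target):
    """Softmax first, then the log, as two separate steps."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])

def fused_cross_entropy(logits, target):
    """Same quantity in one step: logsumexp(logits) - logits[target]."""
    m = max(logits)   # shifting by the max keeps every exponent at or below 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, 0.5, -1.0]            # tame logits: both versions agree
print(naive_cross_entropy(small, 0), fused_cross_entropy(small, 0))

extreme = [1000.0, 0.0, -1000.0]    # deliberately extreme logits
try:
    naive_cross_entropy(extreme, 0)
except OverflowError as err:
    print("naive version failed:", err)   # exp(1000) does not fit in a float

print(fused_cross_entropy(extreme, 0))    # finite: the target dominates, loss is 0.0
print(fused_cross_entropy(extreme, 2))    # finite: a very wrong guess, loss is 2000.0
```

The fused version never exponentiates a large positive number, which is what keeps it finite on inputs where the naive version breaks.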

Second, the shapes. A language model processes many sequences at once. A batch is a group of sequences handled together for efficiency, often written with the letter B. Each sequence has some number of positions, written T, one per word slot. So the model produces logits shaped (B, T, vocab), a score for every word at every position in every sequence. Before handing them to the loss, the code flattens the batch and time dimensions together with logits.reshape(B*T, vocab) and matches them with the targets reshaped to (B*T,). Now it is just a long list of guesses and a long list of right answers.

Third, padding. To make every sequence the same length, shorter ones get filled out with meaningless filler tokens. We do not want the model graded on those. The usual trick is to set the target at each padded position to -100. Passing ignore_index=-100, which is also the default, tells cross-entropy to skip any position whose target has that value, so the padding affects neither the loss nor the average. The same exact machinery, log-softmax followed by gathering the target’s log-probability, also shows up in reinforcement learning fine-tuning, just dressed in slightly different clothes.
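The padding behavior can be checked directly. This sketch assumes PyTorch is installed and uses random logits with made-up shapes; which positions count as padding is an arbitrary choice for illustration. It compares the loss with ignore_index against a hand-filtered version that drops the padded positions first.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                         # make the random numbers repeatable
B, T, vocab = 2, 4, 5                        # 2 sequences, 4 positions, 5 words
logits = torch.randn(B, T, vocab)
targets = torch.randint(0, vocab, (B, T))

# Pretend the last two positions of the second sequence are padding.
targets[1, 2:] = -100

flat_logits = logits.reshape(B * T, vocab)   # (8, 5): one row of scores per position
flat_targets = targets.reshape(B * T)        # (8,): one answer per position

# Built-in: positions whose target is -100 are skipped.
loss = F.cross_entropy(flat_logits, flat_targets, ignore_index=-100)

# By hand: keep only the real positions, then average their losses.
keep = flat_targets != -100
manual = F.cross_entropy(flat_logits[keep], flat_targets[keep])

print(loss, manual)
print(torch.allclose(loss, manual))
```

The two numbers match because the average is taken over the six real positions only, not over all eight.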

The whole thing, runnable

If you have PyTorch installed, you can paste this into a file and run it. Read the comments as you go; they walk through exactly what each line does.

import torch
import torch.nn.functional as F
B, T, vocab = 2, 3, 5            # 2 sequences, 3 positions each, a tiny vocabulary of 5 words
logits  = torch.randn(B, T, vocab)   # random raw scores, one per word at every position
targets = torch.randint(0, vocab, (B, T))  # a random correct word index for each position

# F.cross_entropy wants a flat list of guesses (N, C) and a flat list of answers (N,), so flatten:
loss = F.cross_entropy(logits.reshape(B*T, vocab), targets.reshape(B*T))
print(loss)                       # a single number: the average loss over all positions

# the same thing by hand: turn logits into log-probabilities, then pick out the target's log-prob
logp = F.log_softmax(logits, dim=-1)   # log of each word's probability
manual = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1).mean()  # grab the right word, negate, average
print(torch.allclose(loss, manual))   # True, proving the built-in matches our hand version

# the score a fresh, untrained model starts at is about log(vocab)
print(torch.log(torch.tensor(float(vocab))))   # about 1.609 for a vocabulary of 5
