How a language model writes one word at a time

A language model does not write whole sentences in one go. It guesses the next word, adds it, then guesses again, over and over. A few friendly knobs let you steer how adventurous those guesses are, and you can turn each one yourself and watch a sentence appear.

First, what does it mean for a model to generate text?

When you watch a chatbot reply, it feels like it knows the whole sentence before it starts. It does not. Under the hood it is doing something much simpler and, honestly, much more charming. It looks at everything written so far and asks a single question: which word is likely to come next? It picks one word, sticks it onto the end, and then asks the very same question again with that new word now included. One word at a time, again and again, until it produces a stop signal or reaches a length limit. That loop is the entire trick, and by the end of this page you will understand every piece of it.

Before we go further, a couple of words that will keep coming up. We keep saying “word”, but a model actually works in tokens, which are small chunks of text. A token might be a whole word like cat, or a piece of one like ing, or even a single space. You can keep picturing words for now; just know that the real unit is the token. The other word is logits. When the model finishes thinking about what comes next, it hands you one raw score for every possible token in its vocabulary. Those raw scores are the logits. A higher score means the model finds that token more plausible. They are not yet probabilities, just unpolished numbers, and a lot of what follows is about turning them into something we can actually sample from.
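To make "unpolished numbers" concrete, here is a minimal sketch of the usual conversion from logits to probabilities, called softmax (the same function the full code at the end of this page uses). The four-token vocabulary and the scores are made up for illustration; a real model has tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    biggest = max(logits)                            # subtracting the max avoids overflow
    exps = [math.exp(x - biggest) for x in logits]   # exponentiate so every value is positive
    total = sum(exps)
    return [e / total for e in exps]                 # rescale so the values sum to 1

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["cat", "dog", "ing", " "]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

for token, score, p in zip(vocab, logits, probs):
    print(f"{token!r:>6}  logit={score:+.1f}  prob={p:.3f}")
```

The ordering is preserved: the token with the highest logit also gets the highest probability. Only the scale changes.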

The loop, step by step

Let us walk the loop slowly, because every generation system you will ever meet is built from these same moves.

First, we look at the text so far and keep only the most recent stretch of it, the last context_length tokens. A model can only pay attention to so much at once, so if the conversation gets long we drop the oldest tokens from the front. Think of it as the model’s working memory.

Second, we run the model on that recent text. The model looks at every position, but here is a detail that surprises people: we only care about its opinion at the very last position. That is the one that predicts what comes next. In code this shows up as logits[:, -1, :], which is just a way of saying “give me the scores at the final spot, throw away the rest.” If that slice notation looks cryptic, do not worry, it simply grabs the last position’s logits.

Third, we shape those logits into a set of probabilities and draw one token from them. This is where the fun knobs live, and we will spend most of our time here.

Fourth, we glue the chosen token onto the end of the text, and then we go back to step one and do it all again. That is it. The loop just keeps turning.
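The four steps can be sketched in plain Python with a stand-in model. Everything here is an assumption for illustration: toy_model is a hypothetical function that returns one list of scores per position, and step three is simplified to "take the top score" because the sampling knobs come later.

```python
def toy_model(tokens, vocab_size):
    """Hypothetical stand-in for a real model: one list of scores per position.

    It always favors the token id that follows the current one (wrapping
    around), so the output is easy to predict by eye.
    """
    all_scores = []
    for tok in tokens:
        scores = [0.0] * vocab_size
        scores[(tok + 1) % vocab_size] = 5.0
        all_scores.append(scores)
    return all_scores

def generate_greedy(tokens, max_new, context_length, vocab_size):
    tokens = list(tokens)
    for _ in range(max_new):
        recent = tokens[-context_length:]            # step 1: keep only the most recent tokens
        all_scores = toy_model(recent, vocab_size)   # step 2: run the model on them...
        last_scores = all_scores[-1]                 #         ...and keep only the last position
        next_token = max(range(vocab_size), key=lambda t: last_scores[t])  # step 3, simplified: top score
        tokens.append(next_token)                    # step 4: glue it on, then loop again
    return tokens

print(generate_greedy([0], max_new=6, context_length=3, vocab_size=4))
# → [0, 1, 2, 3, 0, 1, 2]
```

Note that tokens[-context_length:] simply returns the whole list while the text is still shorter than context_length, so no special case is needed at the start.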

The knobs that control how it picks

If we always grabbed the single highest-scoring token, the model would be perfectly predictable and often dull, repeating itself and reaching for the same safe phrases. So instead we give ourselves a few ways to tune the boldness of each pick. Here are the three you will use constantly.

Temperature controls how adventurous the model is. We divide every logit by the temperature number before going further. A low temperature (close to zero) makes the high scores tower even higher over the low ones, so the model almost always takes its top choice. We call that greedy, because it greedily grabs the most likely token every time. Because we divide by it, the temperature can get close to zero but cannot be exactly zero. A high temperature flattens the scores so they look more equal, which lets unlikely tokens slip through and makes the writing wilder and more surprising. Turn it too high and the output drifts into gibberish. A temperature of exactly 1.0 leaves the scores as they are.
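A small sketch of that effect, using made-up logits for three tokens. The helper name probs_at is hypothetical; it just divides by the temperature and applies softmax.

```python
import math

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(x - biggest) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probs_at(logits, temperature):
    """Divide every logit by the temperature, then convert to probabilities."""
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0]   # made-up scores for three tokens
for temperature in (0.25, 1.0, 4.0):
    rounded = [round(p, 3) for p in probs_at(logits, temperature)]
    print(f"temperature={temperature}: {rounded}")
```

At 0.25 nearly all of the probability sits on the first token, at 1.0 the distribution is the plain softmax of the logits, and at 4.0 the three tokens are much closer to equal.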

top-k is a simple guardrail. After scoring, it keeps only the k most likely tokens and throws the rest away entirely, no matter how high the temperature is. If k is 40, the model may only choose from its top 40 candidates. This stops the long tail of bizarre, barely-plausible tokens from ever being picked.
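One common way to implement this, sketched here in plain Python, is to overwrite every score below the k-th best with negative infinity, so the token ends up with probability zero once the scores are turned into probabilities. The function name top_k_filter and the scores are assumptions for illustration.

```python
def top_k_filter(logits, k):
    """Keep the k highest logits; replace the rest with -inf."""
    k = min(k, len(logits))                          # k cannot exceed the vocabulary size
    cutoff = sorted(logits, reverse=True)[k - 1]     # the k-th highest score
    return [x if x >= cutoff else float("-inf") for x in logits]

logits = [2.0, 1.0, 0.5, -1.0]                       # made-up scores
print(top_k_filter(logits, 2))
# → [2.0, 1.0, -inf, -inf]
```

If several tokens tie exactly at the cutoff score, this sketch keeps all of them, so slightly more than k tokens can survive in that case.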

top-p, also called nucleus sampling, is a smarter version of the same idea. Instead of always keeping a fixed number of tokens, it keeps the smallest group of top tokens whose probabilities add up to at least p. So if the model is very confident, top-p might keep just two or three tokens, and if the model is unsure, it might keep many. It adapts to how certain the model feels at each step.
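A minimal sketch of that selection rule, assuming the scores have already been turned into probabilities. The function name top_p_filter and both example distributions are made up for illustration.

```python
def top_p_filter(probs, p):
    """Return the indices of the smallest set of top tokens whose probabilities sum to at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)               # always take the next most likely token...
        running += probs[i]
        if running >= p:             # ...and stop as soon as the group is big enough
            break
    return kept

confident = [0.70, 0.20, 0.05, 0.03, 0.02]   # the model is fairly sure
unsure = [0.30, 0.25, 0.20, 0.15, 0.10]      # the model is hedging

print(top_p_filter(confident, 0.85))   # → [0, 1]
print(top_p_filter(unsure, 0.85))      # → [0, 1, 2, 3]
```

The same p keeps two tokens in the confident case and four in the unsure one, which is the adaptivity that a fixed k cannot give you.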

Once we have filtered down to the eligible tokens, we renormalize, which just means we rescale the survivors’ probabilities so they add up to 1.0 again (we threw some away, so they no longer summed to a whole). Then we draw one token at random, weighted by those probabilities, using a function called torch.multinomial. Weighted random drawing is the heart of it: more likely tokens get picked more often, but the rare ones still get an occasional turn, which is what keeps the writing alive instead of robotic.
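Here is a sketch of both moves in plain Python. It uses random.choices from the standard library as a stand-in for torch.multinomial; the two are not the same function, but both draw an item at random with a chance proportional to its weight. The probabilities and the set of survivors are made up.

```python
import random

def renormalize(probs, keep):
    """Zero out tokens not in `keep`, then rescale the survivors to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.15, 0.05]                # made-up probabilities for four tokens
survivors = renormalize(probs, keep={0, 1})   # roughly [0.625, 0.375, 0.0, 0.0]

rng = random.Random(0)                        # fixed seed so the run is repeatable
draws = rng.choices(range(len(survivors)), weights=survivors, k=1000)

print("share of token 0:", draws.count(0) / len(draws))
print("share of token 1:", draws.count(1) / len(draws))
```

Over many draws the shares settle near the renormalized probabilities, and the filtered-out tokens never appear because their weight is zero.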

Here is something you can play with directly. The widget below shows a fixed set of scores for the next token. Turn the temperature, top-k, and top-p knobs and watch which tokens stay eligible (the filtered-out ones grey out), then press to sample tokens one by one and watch a little sequence build up.

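[Interactive widget: temperature, top-k, and top-p controls over a fixed set of next-token scores, with a button to sample tokens one at a time.]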

Slide temperature toward zero and the model turns greedy, always taking its top token. Crank it up with no top-k or top-p and watch the rare tokens light up and the output drift toward nonsense. top-k and top-p trim the unlikely tail so the sampling stays coherent while still being a little unpredictable.
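The same experiment can be approximated in plain Python. This is a sketch under stated assumptions, not the widget's actual code: the vocabulary and scores are invented, sample_next and draw_many are hypothetical helpers, and top-p is applied to the probabilities computed before top-k trimming (implementations differ on that ordering).

```python
import math
import random
from collections import Counter

# Hypothetical vocabulary and fixed scores, made up for illustration.
VOCAB = ["the", "a", "cat", "zebra", "qux"]
LOGITS = [3.0, 2.0, 1.0, -1.0, -3.0]

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token index after applying temperature, top-k, and top-p."""
    scaled = [x / temperature for x in logits]        # temperature must be > 0
    biggest = max(scaled)
    exps = [math.exp(x - biggest) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]                         # keep only the k most likely
    if top_p is not None:
        kept, running = [], 0.0
        for i in order:
            kept.append(i)
            running += probs[i]
            if running >= top_p:                      # smallest group reaching p
                break
        order = kept
    weights = [probs[i] for i in order]               # choices() rescales the weights itself
    return rng.choices(order, weights=weights, k=1)[0]

def draw_many(n=1000, seed=0, **knobs):
    """Sample n tokens with the given knob settings and count how often each appears."""
    rng = random.Random(seed)
    return Counter(VOCAB[sample_next(LOGITS, rng=rng, **knobs)] for _ in range(n))

for knobs in ({"temperature": 0.05}, {"temperature": 3.0}, {"temperature": 3.0, "top_k": 2}):
    print(knobs, dict(draw_many(**knobs)))
```

With a temperature of 0.05 essentially every draw is the top token, with 3.0 and no filter the rare tokens show up regularly, and with 3.0 plus top_k=2 only the two most likely tokens can ever appear.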

A few traps worth remembering

Every knob here has a sharp edge that catches people the first time they generate text. These are the ones to keep in your pocket.

[Interactive widget: common sampling traps.]

How this shows up in a real language model

This is not a toy version you will outgrow. The same loop runs inside real systems, with a few practical touches. The generation function is wrapped in @torch.no_grad(), which tells PyTorch not to record the operations it would normally save for computing gradients during learning. While generating we are only using the model, not teaching it, so skipping that bookkeeping saves a lot of memory and time. The function also calls model.eval(), which flips the model into inference mode: training-only behavior such as dropout is switched off, so the model stays calm and predictable.

Each step does exactly what we described: it crops the context with idx[:, -context_length:], takes the last position with logits[:, -1, :], divides by the temperature, optionally hides the unwanted tokens by setting their scores to negative infinity (so they can never be chosen), turns the survivors into probabilities with softmax, draws one with torch.multinomial, and appends it with torch.cat. One extra note if you ever wander into reinforcement learning: those systems also record log_softmax(logits) for each token they sampled, which is just the logarithm of the probability, kept around so the training step can later reward or discourage that choice. You do not need it for plain generation.
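For the reinforcement learning note, here is a minimal sketch of what "recording the log-probability of the sampled token" means, written in plain Python rather than with PyTorch's log_softmax. The logits and the sampled token index are made up.

```python
import math

def log_softmax(logits):
    """Logarithm of the softmax probabilities, computed in a numerically stable way."""
    biggest = max(logits)
    log_total = biggest + math.log(sum(math.exp(x - biggest) for x in logits))
    return [x - log_total for x in logits]

logits = [2.0, 1.0, 0.5, -1.0]    # made-up scores for four tokens
sampled_token = 1                 # hypothetical: pretend the sampler picked token 1

log_prob = log_softmax(logits)[sampled_token]   # the one number a training step would keep
print(f"log-probability of the sampled token: {log_prob:.4f}")
print(f"which corresponds to probability:     {math.exp(log_prob):.4f}")
```

Log-probabilities are always zero or negative, and exponentiating one gives back the ordinary probability.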

The whole thing, runnable

If you have PyTorch installed, here is the complete loop with temperature and top-k (top-p is left out to keep it short). It expects a model that returns logits of shape (batch, positions, vocabulary), and context_length is the most tokens that model can read at once. Read the comments as you go; they walk through what each line does.

import torch

@torch.no_grad()                                  # we are only using the model, not training it
def generate(model, idx, max_new, context_length, temperature=1.0, top_k=None):
    # idx: integer tensor of token ids, shape (batch, tokens so far)
    # context_length: the most tokens the model can read at once
    model.eval()                                  # inference mode: dropout and similar training-only behavior off
    for _ in range(max_new):                      # produce one new token per loop
        idx_cond = idx[:, -context_length:]       # keep only the most recent tokens (working memory)
        logits = model(idx_cond)[:, -1, :]        # scores for the next token, from the LAST position only
        logits = logits / temperature             # lower = sharper/greedier, higher = wilder; must be > 0
        if top_k is not None:
            k = min(top_k, logits.size(-1))       # top_k cannot exceed the vocabulary size
            v, _ = torch.topk(logits, k)          # find the k highest scores
            logits[logits < v[:, [-1]]] = float('-inf')  # hide everything below the cutoff so it can't be picked
        probs = torch.softmax(logits, dim=-1)     # turn the surviving scores into probabilities
        nxt = torch.multinomial(probs, num_samples=1) # draw one token, weighted by probability
        idx = torch.cat((idx, nxt), dim=1)        # stick the new token on the end, then loop again
    return idx