nn.Linear, the workhorse behind almost everything
Inside a language model, the same simple building block shows up again and again: the attention projections, the feed forward block, the final layer that picks the next word. They are all one humble layer called nn.Linear. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
First, a couple of words we will lean on
Before we touch any code, let us agree on two plain words, because everything else rests on them.
A tensor is just a box of numbers. A single number is a tensor, a list of numbers is a tensor, a grid (rows and columns) is a tensor, and so on. When you hear “tensor”, picture an organised pile of numbers and nothing more. The numbers flowing through a network are tensors, and the shape of a tensor (how many numbers, arranged how) is something we will keep an eye on the whole way through.
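To make "a box of numbers" concrete, here is a small sketch that builds a few tensors and looks at their shapes. The variable names are made up for illustration; only torch.tensor, torch.zeros, .shape and .numel() come from PyTorch itself.

```python
import torch

# Illustrative names only: four "boxes of numbers" of growing dimension.
scalar = torch.tensor(3.0)                    # a single number, shape ()
vector = torch.tensor([1.0, 2.0, 3.0, 4.0])   # a flat list of four numbers, shape (4,)
grid = torch.zeros(2, 3)                      # 2 rows and 3 columns, shape (2, 3)
stack = torch.zeros(2, 3, 4)                  # 2 grids, each 3 by 4, shape (2, 3, 4)

for name, t in [("scalar", scalar), ("vector", vector), ("grid", grid), ("stack", stack)]:
    # .shape says how the numbers are arranged, .numel() says how many there are
    print(name, tuple(t.shape), t.numel())
```

The last shape, three numbers long, is the kind a language model passes around all the time.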
A parameter is a number the model gets to adjust as it learns. At the start these numbers are random, and over many rounds of training they slowly settle into useful values. You never set them by hand. The layer we are about to meet is basically a tidy container for a bunch of parameters, plus one small calculation that uses them. If that sounds underwhelming, good. The magic of large language models comes from stacking this simple thing thousands of times, not from any one piece being clever.
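If you would like to see parameters being adjusted, here is a minimal sketch of a single training step. The random data, the squared-error loss and the learning rate of 0.1 are all assumptions chosen for illustration, not anything a real language model uses as-is.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 3)  # a small container of parameters

# Count every adjustable number: 4*3 weights plus 3 biases.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)

before = layer.weight.detach().clone()  # snapshot of the random starting values

# Made-up data and a made-up target, purely for illustration.
x = torch.randn(8, 4)
target = torch.randn(8, 3)

optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = ((layer(x) - target) ** 2).mean()  # how wrong the layer currently is
loss.backward()                           # work out which way to nudge each parameter
optimizer.step()                          # nudge them

changed = not torch.equal(before, layer.weight.detach())
print(changed)  # the weights moved without anyone setting them by hand
```

Real training repeats that nudge many times over real data, but the mechanism is the same.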
What nn.Linear actually does
When you write nn.Linear(in_features, out_features), you are asking PyTorch to create a layer that takes in vectors of size in_features and hands back vectors of size out_features. A vector here just means a flat list of numbers, so a vector of size four is four numbers in a row.
Inside, the layer keeps two sets of parameters. The first is a weight, a grid of numbers with shape (out, in), which you can read as “one row of weights for each output number you want”. The second is an optional bias, a flat list of size (out,), one spare number per output. You can turn the bias off when you do not want it, and we will see exactly when that makes sense.
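You can check those two sets of parameters yourself. This short sketch uses sizes of 4 in and 3 out, picked arbitrarily, and shows what happens to the bias when it is turned off.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=4, out_features=3)
print(layer.weight.shape)  # the weight grid: one row per output, one column per input
print(layer.bias.shape)    # the bias: one spare number per output

no_bias = nn.Linear(4, 3, bias=False)
print(no_bias.bias)        # with the bias turned off there is nothing stored at all

x = torch.randn(4)         # one input vector of size 4
y = layer(x)               # one output vector of size 3
print(y.shape)
```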
The calculation it performs is y = x Wᵀ + b. That looks like a lot, so let us unpack it in words, because the words are friendlier than the symbols. To produce one output number, the layer takes your input vector x, lines it up against one row of the weight grid W, multiplies the matching numbers together, and adds all those products into a single total. That “multiply matching pairs and add them up” move is called a dot product, and it is the heartbeat of this whole article. Then it adds the bias for that row, and that is your output number. Repeat once per row of W, and you have your full output vector. (The little ᵀ symbol and the talk of a “matmul”, short for matrix multiplication, are just the fast, batched way of doing all those dot products at once. You do not need to picture the matrix algebra to understand what is happening.)
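The description above can be spelled out as plain Python loops, one dot product per row, and compared against what the layer returns. This is a sketch for understanding only; the sizes are arbitrary and nobody would compute it this slowly in practice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 3)
x = torch.randn(4)

# Pull the numbers out as ordinary Python lists so the arithmetic is visible.
W = layer.weight.detach().tolist()  # 3 rows of 4 weights
b = layer.bias.detach().tolist()    # 3 biases
xs = x.tolist()                     # 4 input numbers

manual = []
for row, bias_value in zip(W, b):
    # dot product: multiply matching pairs, then add the products up
    total = sum(x_i * w_i for x_i, w_i in zip(xs, row))
    manual.append(total + bias_value)  # then add this row's bias

matches = torch.allclose(torch.tensor(manual), layer(x).detach(), atol=1e-5)
print(matches)  # the slow loops and the layer agree
```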
Here is the detail that trips people up at first, so let us say it plainly. The layer only touches the last dimension of whatever you feed it. Language models pass around tensors shaped (B, T, C), where B is the batch (how many separate sequences you are processing together, just for speed), T is the number of tokens (roughly, word pieces) in each sequence, and C is the size of the vector that represents each token. When you feed that whole thing to a Linear layer, the (B, T) part rides along untouched and only the C at the end gets transformed, giving you (B, T, out). If that feels strange, that is normal. The short version is: a Linear layer transforms each per-token vector and leaves the bookkeeping about which token and which sequence completely alone.
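One way to convince yourself of this is to run the layer on the whole (B, T, C) tensor, then on a single token's vector, and compare. The sizes below are arbitrary, and OUT is just an illustrative name for the output size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, C, OUT = 2, 3, 4, 5
x = torch.randn(B, T, C)
layer = nn.Linear(C, OUT)

y = layer(x)
print(tuple(y.shape))  # the (B, T) prefix is kept, only the last size changes

# Feeding one token's vector on its own gives the same result as the batched call.
one_token = layer(x[1, 2])
same_token = torch.allclose(y[1, 2], one_token, atol=1e-6)

# Flattening (B, T) into one long list of tokens and unflattening afterwards
# also gives the same result: the layer never looks across tokens.
flat = layer(x.reshape(B * T, C)).reshape(B, T, OUT)
same_flat = torch.allclose(y, flat, atol=1e-6)

print(same_token, same_flat)
```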
Something to play with
Here is something you can poke at directly. There is an input vector x of size four on one side, a weight grid in the middle that you can resize, and the output it produces on the other side. Each output cell shows you the exact dot product it came from, so you can watch the multiply and add happen in slow motion. Try toggling the bias on and off, and try resizing out to see how the same layer becomes wildly different things just by changing one number.
Drag the controls and watch the output rebuild itself. Set out to 2 with the bias off and you have a Q, K, or V projection from attention. Set out to 10 with the bias on and you have the final layer that scores every word in the vocabulary. Set out to 1 and you have a layer that produces a single score, like a reward. It is the same layer every time, just reshaped.
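If you would rather run the same experiment in code, here is a rough sketch that prints each output cell next to the dot product it came from. The helper name show_linear and the input values are invented for this example; the weights are random, so your numbers will depend on the seed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def show_linear(x, out_features, use_bias):
    """Hypothetical helper: build a fresh layer and describe every output cell."""
    layer = nn.Linear(x.numel(), out_features, bias=use_bias)
    y = layer(x)
    lines = []
    for i, row in enumerate(layer.weight.detach().tolist()):
        # spell out the multiply-and-add for this row
        terms = " + ".join(f"{xv:.1f}*{wv:+.2f}" for xv, wv in zip(x.tolist(), row))
        if use_bias:
            terms += f" + bias {layer.bias[i].item():+.2f}"
        lines.append(f"y[{i}] = {terms} = {y[i].item():+.3f}")
    return y.detach(), lines

x = torch.tensor([1.0, 2.0, 3.0, 4.0])  # an input vector of size four

# The three settings mentioned above: out=2 without bias, out=10 with bias, out=1.
results = {}
for out_features, use_bias in [(2, False), (10, True), (1, False)]:
    y, lines = show_linear(x, out_features, use_bias)
    results[out_features] = (y, lines)
    print(f"out={out_features}, bias={use_bias}")
    print("\n".join(lines))
```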
A few traps worth keeping in your pocket
Every idea here has a sharp edge that catches people the first time they meet it. Here are the ones worth remembering, with a small widget so you can see each one rather than just take our word for it.
Where this shows up in a real language model
This is not a toy you will outgrow once you reach the real thing. Inside an actual transformer, nn.Linear is genuinely everywhere, just wearing different shapes.
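Attention builds its three ingredients, the query, key, and value (often written q, k, and v), with nn.Linear(n_embed, head_size, bias=False). Notice the bias is switched off here. This is a common choice, not a rule, and part of the reason is neat. The attention scores, built from dot products of queries with keys, flow into a softmax, a step that turns a list of numbers into proportions that add up to one, and a softmax ignores a constant added to every number in the list. A bias on the key adds exactly that kind of constant to each query's scores, so it would change nothing. Biases on the query and value are not redundant in quite the same way, but many models leave them off as well, which saves a few parameters.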
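One piece of that argument is easy to check numerically: adding the same constant to every score in a row leaves the softmax output unchanged, and a bias on the key adds exactly that kind of constant. In the sketch below, b_k is a made-up key bias and the random q and k stand in for real projections. It only tests the key; it says nothing about biases on the query or value.

```python
import torch

torch.manual_seed(0)

T, head_size = 3, 2
q = torch.randn(T, head_size)  # stand-in queries, one per token
k = torch.randn(T, head_size)  # stand-in keys, one per token
b_k = torch.randn(head_size)   # hypothetical bias added to every key

scores = q @ k.T                     # (T, T): each row is one query against every key
scores_with_bias = q @ (k + b_k).T   # row i is shifted by the single number q[i] . b_k

weights = torch.softmax(scores, dim=-1)
weights_with_bias = torch.softmax(scores_with_bias, dim=-1)

# The raw scores differ, but the attention weights do not.
unchanged = torch.allclose(weights, weights_with_bias, atol=1e-5)
print(unchanged)
```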
The feed forward block, the part of the model that does a chunk of the actual “thinking” between attention steps, is simply two Linear layers with an activation in between. An activation is a small bend applied to each number that lets the network represent curves and not just straight lines, and two stacked Linears with that bend between them can learn far richer patterns than one alone.
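Here is a minimal sketch of such a block, plus a quick check of why the bend matters. The ReLU activation and the hidden size of four times n_embed are assumptions picked for illustration; real models vary on both.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embed = 4
hidden = 4 * n_embed  # assumed widening factor, not a fixed rule

# Two Linear layers with an activation in between.
ffn = nn.Sequential(
    nn.Linear(n_embed, hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_embed),
)

x = torch.randn(2, 3, n_embed)  # (B, T, C)
y = ffn(x)
print(tuple(y.shape))           # same shape in and out

# Without the activation, two stacked Linears collapse into a single one:
lin1, lin2 = ffn[0], ffn[2]
stacked = lin2(lin1(x))                       # no bend between the layers
W = lin2.weight @ lin1.weight                 # one combined weight grid
b = lin2.weight @ lin1.bias + lin2.bias       # one combined bias
collapsed = x @ W.T + b
no_gain = torch.allclose(stacked, collapsed, atol=1e-5)
print(no_gain)  # stacking alone adds nothing a single Linear could not do
```

The activation is what stops that collapse, which is why it sits between the two layers.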
The final piece, the one that actually picks the next word, is nn.Linear(n_embed, vocab_size). It takes each position’s vector and produces one number for every possible word in the vocabulary. Those raw scores are called logits, and a softmax later turns them into probabilities. And if you ever build a model that rates things, say a reward model that scores how good an answer is, you reach for nn.Linear(n_embed, 1), a layer that boils a whole vector down to a single number. Same humble layer, four different jobs.
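To see the last two jobs side by side, here is a small sketch with an imaginary vocabulary of five words. Picking the highest-probability word and scoring from the last token are simplifications assumed for this example; real systems often sample instead, and reward models differ in which position they read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embed, vocab_size = 4, 5
x = torch.randn(2, 3, n_embed)            # (B, T, C) stand-in for the model's final vectors

lm_head = nn.Linear(n_embed, vocab_size)
logits = lm_head(x)                       # (B, T, vocab_size): raw scores
probs = torch.softmax(logits, dim=-1)     # each position's scores become proportions

# Greedy choice from the last position of each sequence (an assumed, simplest strategy).
next_token = probs[:, -1].argmax(dim=-1)  # shape (B,)

reward_head = nn.Linear(n_embed, 1)
# Assumption: read the score off the last token's vector only.
score = reward_head(x[:, -1]).squeeze(-1)  # shape (B,): one number per sequence

print(tuple(logits.shape), tuple(next_token.shape), tuple(score.shape))
```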
The whole thing, runnable
If you have PyTorch installed, you can paste this into a file and run it. Read the comments as you go; they walk through exactly what each line is doing and why.
import torch
import torch.nn as nn
torch.manual_seed(0) # fix the random seed so every run produces the same numbers
B, T, C = 2, 3, 4 # batch of 2 sequences, 3 tokens each, vectors of size 4
x = torch.randn(B, T, C) # random input shaped (2, 3, 4) to stand in for real data
# Q/K/V-style projection: bias turned off, turns size-C vectors into size-2 vectors
q_proj = nn.Linear(C, 2, bias=False)
q = q_proj(x) # shape becomes (B, T, 2): only the last dim changed
print(torch.allclose(q, x @ q_proj.weight.T)) # True: the layer really is just x times Wᵀ
# Final word-scoring layer: C -> 5 imaginary vocabulary words, bias on this time
lm_head = nn.Linear(C, 5)
logits = lm_head(x) # shape (B, T, 5): one raw score per word
print(lm_head.weight.shape, lm_head.bias.shape) # torch.Size([5, 4]) and torch.Size([5]): the stored parameters
# Single-score head: C -> 1, the shape a reward model uses
reward = nn.Linear(C, 1, bias=False)
print(reward(x).shape) # (B, T, 1): one number per token
# Proof that Linear only touches the LAST dim: the (B, T) prefix rides through untouched
print(logits.shape[:2] == x.shape[:2]) # True