Self-attention from scratch

Imagine every word in a sentence quietly asking the words before it, “who here matters to me right now?” Self-attention is exactly that conversation, written out in math. We will build it up one gentle step at a time, assuming you have never touched PyTorch.

First, what problem are we even solving

When you read the sentence “the cat sat because it was tired”, your brain quietly figures out that “it” refers to the cat, not the sitting. You do this without thinking. A language model has to learn that same trick, and the mechanism it uses is called self-attention. It is the heart of nearly every modern model that writes text, and the core building block of the architecture known as the “transformer”. If that sounds intimidating, take a breath. By the end of this page it will feel almost obvious.

Let me introduce a few words gently, because they will keep coming up. A token is just a chunk of text the model works with, often a word or a piece of a word. Each token is turned into a list of numbers called an embedding, which is the model’s private way of describing what that token means. A list of numbers like this is called a vector, and a grid of numbers is called a matrix. PyTorch calls both of these a tensor, which is simply the general word for an array of numbers the model can do math on. That is the entire vocabulary we need to begin.
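To make that vocabulary concrete, here is a small sketch in PyTorch. The numbers are made up for illustration and the variable names are my own, so treat it as a picture of the idea, not as anything a real model contains.

```python
import torch

# A vector: one list of numbers (a made-up 4-number embedding for one token).
embedding = torch.tensor([0.2, -1.0, 0.5, 0.3])

# A matrix: a grid of numbers (3 tokens stacked, each with 4 numbers).
sentence = torch.tensor([
    [0.2, -1.0, 0.5, 0.3],
    [0.9, 0.1, -0.4, 0.0],
    [-0.3, 0.7, 0.2, 1.1],
])

# PyTorch calls both of them tensors; .shape reports the size along each dimension.
print(embedding.shape)  # torch.Size([4])
print(sentence.shape)   # torch.Size([3, 4])
```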

The whole goal of self-attention is this: for each token, look back at itself and the earlier tokens and pull in a blend of their meanings, paying more attention to the ones that matter. The word “it” should reach back and borrow heavily from “cat”. Let us see how the model decides who to borrow from.

Three roles every token plays

Here is the first idea, and it is a lovely one. Every token gets to play three roles at once, and each role is just its embedding passed through a small transformation.

The first role is the query, which you can think of as the question a token is asking: “what kind of information am I looking for?” The second is the key, which is like a label a token wears: “here is the kind of information I offer.” The third is the value, which is the actual content a token will hand over if someone decides to pay attention to it. Query, key, and value. Ask, advertise, give.

How do we produce these three from one embedding? With something called a linear layer, written nn.Linear in PyTorch. A linear layer is just a learned multiplication: it takes a vector in and gives a transformed vector out, using numbers the model adjusts as it learns. We use three separate linear layers, one for queries, one for keys, and one for values. If the phrase “learned numbers” feels vague right now, that is completely normal. For today, treat each linear layer as a small machine that reshapes a token’s embedding into one of its three roles.
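Here is a minimal sketch of those three linear layers in PyTorch. The sizes (embedding width 8, head size 4, 3 tokens) and the names to_query, to_key, and to_value are assumptions chosen for illustration, and the embeddings are random stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

C, hs = 8, 4                 # illustrative sizes: embedding width 8, head size 4
x = torch.randn(3, C)        # 3 tokens with random stand-in embeddings

# Three separate learned transformations, one per role.
to_query = nn.Linear(C, hs, bias=False)
to_key = nn.Linear(C, hs, bias=False)
to_value = nn.Linear(C, hs, bias=False)

q, k, v = to_query(x), to_key(x), to_value(x)
print(q.shape, k.shape, v.shape)   # each one has 3 rows (tokens) and 4 columns (head size)

# With bias=False, a linear layer is just a multiplication by its weight matrix.
same = torch.allclose(q, x @ to_query.weight.T, atol=1e-5)
print(same)
```

The weights start out random, which is why the outputs mean nothing yet. Training is what nudges them toward useful queries, keys, and values.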

How a token decides who to listen to

Now the magic step. To find out how much a query should care about a key, we compare them with a dot product. A dot product takes two vectors, multiplies them element by element, and adds up the results into a single number. The bigger that number, the more the two vectors “agree”, which here means the more relevant that token is. So the query for “it” and the key for “cat” should produce a large number, because the model has learned they belong together.
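If the dot product is new to you, this tiny sketch in plain Python (no PyTorch needed) shows the arithmetic. The vectors are made up, and the helper name dot is my own.

```python
def dot(a, b):
    """Multiply two equal-length vectors element by element and add up the results."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 2.0, 0.0]            # made-up query vector
key_similar = [2.0, 1.0, 0.0]      # points in a similar direction to the query
key_different = [-1.0, 0.0, 3.0]   # points somewhere else

print(dot(query, key_similar))     # 1*2 + 2*1 + 0*0 = 4.0
print(dot(query, key_different))   # 1*(-1) + 2*0 + 0*3 = -1.0
```

The first key “agrees” with the query and scores higher, so the query would listen to it more.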

We do this for every query against every key at once, which gives us a grid of scores. In code that grid is scores = Q @ Kᵀ, where the @ symbol means matrix multiplication and the small ᵀ means transpose: we swap the rows and columns of K so the shapes line up. Do not worry about memorizing the shape rules. Just hold onto the picture: one number for every pair of tokens, saying how much one wants to listen to the other.
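As a quick sketch of that grid, assuming 4 tokens and a head size of 8 with random stand-in queries and keys:

```python
import torch

torch.manual_seed(0)
T, hs = 4, 8
Q = torch.randn(T, hs)   # one query vector per token (random stand-ins)
K = torch.randn(T, hs)   # one key vector per token (random stand-ins)

scores = Q @ K.T         # every query against every key
print(scores.shape)      # torch.Size([4, 4]): one score per pair of tokens

# Entry [i, j] is the dot product of query i with key j.
i, j = 2, 1
matches = torch.allclose(scores[i, j], Q[i] @ K[j], atol=1e-5)
print(matches)
```

Row i of the grid belongs to query i, and column j belongs to key j.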

There are three small adjustments we make to these raw scores before they are ready, and each one fixes a real problem.

First, we scale the scores by dividing by the square root of the head size (the length of those query and key vectors). Without this, longer vectors tend to produce larger dot products, and very large scores push the softmax in a later step to put almost all of its weight on a single token. The division keeps the numbers in a sensible range no matter how long the vectors are.
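You can see the effect of the scaling with a small experiment. This sketch assumes the query and key entries are independent random numbers with mean 0 and variance 1, which is only a rough stand-in for what a trained model produces. The head size of 64 is an illustrative choice.

```python
import torch

torch.manual_seed(0)
hs = 64                                  # illustrative head size
q = torch.randn(10_000, hs)              # 10,000 random query vectors
k = torch.randn(10_000, hs)              # 10,000 random key vectors

raw = (q * k).sum(dim=-1)                # one dot product per (query, key) pair
scaled = raw * hs ** -0.5                # same as dividing by sqrt(head size)

print(raw.std())      # spread of the raw scores, roughly sqrt(64) = 8
print(scaled.std())   # spread after scaling, roughly 1
```

Under these assumptions the spread of the raw scores grows like the square root of the head size, which is exactly the factor the division removes.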

Second, we apply a causal mask. When a model writes text left to right, a token is only allowed to look at itself and the tokens before it, never the ones ahead, because in real life it has not written those yet. So for any score where a token would be peeking at the future, we set that score to negative infinity. That sounds dramatic, and there is a reason: in the next step, negative infinity turns into exactly zero attention, which is the cleanest way to say “you may not look here.”
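Here is what the mask looks like on a made-up 4 by 4 grid of scores. The scores are just the numbers 0 to 15 so that you can see which positions get replaced.

```python
import torch

T = 4
scores = torch.arange(16.0).reshape(T, T)   # made-up scores 0..15, one row per query

# Lower-triangular grid: 1 on and below the diagonal, 0 above it.
tril = torch.tril(torch.ones(T, T))

# Wherever the grid is 0 (a future token), overwrite the score with negative infinity.
masked = scores.masked_fill(tril == 0, float('-inf'))

print(tril)
print(masked)   # row 0 keeps 1 score, row 1 keeps 2, row 2 keeps 3, row 3 keeps all 4
```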

Third, we run each row of scores through softmax. Softmax is a function that takes a list of numbers and squashes them into positive values that add up to 1, so they become proper weights, like slices of a pie. After softmax, each query’s row tells you what fraction of its attention goes to each earlier token. A token that scored high gets a fat slice. The ones masked to negative infinity get a slice of zero. This is the moment the model commits to who it is listening to, and these final numbers are called the attention weights.
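Softmax is short enough to write by hand. This plain-Python sketch (the helper name softmax and the example row are my own) shows one row with two real scores and two masked ones. PyTorch's own version is F.softmax.

```python
import math

def softmax(row):
    """Turn a list of scores into non-negative weights that add up to 1."""
    biggest = max(row)                             # subtracting the max avoids overflow
    exps = [math.exp(s - biggest) for s in row]    # exp of negative infinity is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 1.0, float('-inf'), float('-inf')]     # two visible scores, two masked ones
weights = softmax(row)

print(weights)        # the first token gets the largest slice, the masked ones get 0.0
print(sum(weights))   # the slices add up to 1 (up to floating-point rounding)
```

This helper assumes at least one score in the row is not masked. With a causal mask that always holds, because every token can see itself.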

Pulling it all together into an output

We have decided how much attention each token pays to each earlier token. The last step is to actually collect the information. Remember the values, the content each token offers? We take a weighted blend of those value vectors, using the attention weights as the recipe. In code that is out = attn @ V. A token that gave most of its attention to “cat” will have an output that looks mostly like the cat’s value vector, with a little seasoning from the others.
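To see the blend happen with numbers you can check by hand, here is a sketch with made-up value vectors and made-up attention weights for three tokens.

```python
import torch

# Made-up value vectors for 3 tokens, 2 numbers each.
V = torch.tensor([
    [1.0, 0.0],     # token 0
    [0.0, 1.0],     # token 1
    [10.0, 10.0],   # token 2
])

# Made-up attention weights: each row adds up to 1 and never looks at a later token.
attn = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.1, 0.1, 0.8],
])

out = attn @ V
print(out)
# row 0: all attention on token 0, so the output is [1.0, 0.0]
# row 1: an even mix of tokens 0 and 1, so [0.5, 0.5]
# row 2: 0.1*[1, 0] + 0.1*[0, 1] + 0.8*[10, 10] = [8.1, 8.1]
```

Row 2 gave most of its attention to token 2, so its output looks mostly like token 2's value vector.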

And that is the whole of one self-attention head (a single complete copy of this query, key, value machinery). Five steps: compare queries and keys into scores, scale them, mask the future, softmax into weights, then blend the values. If you followed even the shape of that, you genuinely understand attention. The rest is detail.
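If you like seeing things in one line, the five steps can be written compactly as a single formula. The symbols d_k (the head size) and M (the causal mask, written as a matrix that is added to the scores) are notation I am introducing here for convenience.

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V,
\qquad
M_{ij} =
\begin{cases}
  0       & \text{if } j \le i \\
  -\infty & \text{if } j > i
\end{cases}
```

Adding negative infinity to a score has the same effect as overwriting it with negative infinity, so this matches the masking step described above. The softmax is applied to each row separately.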

Here is something you can play with. Below is a real four-token sequence run through one head. Flip through the stages to watch the scores turn into a mask, then into soft weights, then into outputs. Click any query row to see exactly which tokens it chose to attend to and how their values combine.

Notice the staircase shape: query 0 can only attend to itself, while query 3 can see all four tokens. That triangle is the causal mask at work, stopping any token from peeking ahead. Each output is a blend of the value vectors the token was allowed to see, weighted by that token's row of attention weights.

A few traps worth keeping in your pocket

Every idea here has a sharp edge that catches people the first time they try it themselves. Here are the ones worth remembering.

How this looks inside a real language model

What you just learned is not a simplified toy. It is essentially the same computation that runs inside real models. One head computes q @ k.transpose(-2, -1) * scale, where scale is one over the square root of the head size, then hides the future with masked_fill(tril[:T,:T] == 0, float('-inf')), then softmaxes along the last axis with dim=-1, then blends the values with attn @ v. Here tril is a lower-triangular grid of ones and zeros that encodes “you may look here, but not there”, and it is stored once as a fixed buffer rather than learned. The slice [:T,:T] trims that grid down to the length of the current sequence.
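Packaged as a PyTorch module, one head might look like the sketch below. The class name Head, the argument names n_embd, head_size, and block_size, and the sizes in the usage lines are assumptions for illustration. Real codebases differ in naming and details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head (an illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A buffer travels with the module but is not a learned parameter.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape                                   # batch, tokens, embedding width
        q, k, v = self.query(x), self.key(x), self.value(x)
        scale = k.shape[-1] ** -0.5                         # 1 / sqrt(head size)
        scores = q @ k.transpose(-2, -1) * scale            # (B, T, T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)                    # each row adds up to 1
        return attn @ v                                     # (B, T, head_size)

torch.manual_seed(0)
head = Head(n_embd=8, head_size=4, block_size=16)
x = torch.randn(2, 5, 8)                                    # 2 sequences of 5 tokens
print(head(x).shape)                                        # torch.Size([2, 5, 4])
print([name for name, _ in head.named_parameters()])        # the three weights, no tril
```

Here block_size is the longest sequence the head will ever see, so the mask can be built once and sliced to fit shorter inputs.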

Real models also run many heads at the same time, which is called multi-head attention. Each head can learn to notice a different kind of relationship: one might track grammar while another tracks meaning. In code, the embedding dimension C is split into n_head pieces of size head_dim, so a tensor of shape (B, T, C) is reshaped into (B, n_head, T, head_dim). That lets all heads run the same math in parallel, and afterward the heads are stitched back together by concatenating them. The letters there are just sizes: B is the batch (how many sequences at once), T is the number of tokens, and C is the embedding width. If those letters are new to you, that is fine. They are only labels for the dimensions of the tensors.
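The reshaping itself is only a couple of lines. In this sketch the sizes are made up, and x stands in for the queries, keys, or values that a real model would reshape this way.

```python
import torch

torch.manual_seed(0)
B, T, C = 2, 4, 8            # illustrative sizes
n_head = 2
head_dim = C // n_head       # each head gets an equal slice of the embedding width

x = torch.randn(B, T, C)     # stand-in for queries, keys, or values

# Split the width C into n_head chunks, then move the head axis ahead of the token axis.
heads = x.view(B, T, n_head, head_dim).transpose(1, 2)
print(heads.shape)           # torch.Size([2, 2, 4, 4]), that is (B, n_head, T, head_dim)

# ... each head would run its own attention here ...

# Stitch the heads back together: undo the transpose, then merge the last two axes.
merged = heads.transpose(1, 2).contiguous().view(B, T, C)
round_trip = torch.equal(merged, x)
print(round_trip)            # True: nothing was lost, only rearranged
```

The call to contiguous() is there because view needs the numbers laid out in memory in the new order, and transpose alone does not move them.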

The whole thing, runnable

If you have PyTorch installed, you can paste this into a file and run it. Read the comments as you go, because they walk through exactly what each line is doing.

import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(0)              # fix the randomness so every run gives the same numbers

B, T, C, hs = 1, 4, 8, 8          # 1 sequence, 4 tokens, 8 embedding numbers each, head size 8
x = torch.randn(B, T, C)          # fake input embeddings: random numbers standing in for real tokens

# turn each token's embedding into its three roles: query, key, value
q, k, v = [nn.Linear(C, hs, bias=False)(x) for _ in range(3)]

scores = q @ k.transpose(-2, -1) * hs ** -0.5      # how much each query agrees with each key, then scaled down
tril   = torch.tril(torch.ones(T, T))              # a lower-triangular mask: 1 means "allowed to look", 0 means "future"
scores = scores.masked_fill(tril == 0, float('-inf'))   # block the future by sending those scores to negative infinity
attn   = F.softmax(scores, dim=-1)                 # turn each row into attention weights that add up to 1
out    = attn @ v                                  # blend the value vectors using those weights

print(attn[0])        # each row i has exactly i+1 nonzero weights, the staircase from the widget
print(out.shape)      # torch.Size([1, 4, 8]): one output vector per token, ready for the next layer