Embeddings, the lookup table that turns words into vectors
Before a language model can do any math, it has to turn words into numbers. An embedding is just a trainable lookup table that hands each word its own little list of numbers, and the model slowly tunes those numbers as it learns. Here is the whole idea, built up gently, assuming you have never touched PyTorch.
First, why we even need this
A computer cannot do arithmetic on the word “cat”. It can only work with numbers. So the very first job in any language model is to turn each word into numbers, and not just one number but a small list of them. That list is called a vector, which is simply an ordered row of numbers like [0.2, -1.1, 0.5, 0.0]. Think of a vector as a handful of dials, where each dial captures some shade of meaning the model has learned.
Before we go further, here are three terms you will see again and again. A token is one piece of text the model reads, often a whole word but sometimes a word fragment. Every token gets a token id, which is just a whole number that names it, like token number 7 or token number 412. And a parameter is any number inside the model that gets adjusted during training. The model has lots of these, and learning is nothing more than nudging them toward better values. Keep those three in your pocket and the rest will follow.
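To make "token" and "token id" concrete, here is a minimal sketch. It assumes the simplest possible scheme: split on spaces and number each distinct word in order of first appearance. The names text, vocab_ids, and token_ids are made up for illustration, and real tokenizers are more elaborate and often split words into fragments.

```python
# Toy tokenizer: one token per whitespace-separated word (an assumption for illustration).
text = "the cat sat on the mat"
tokens = text.split()

# Give each distinct token a whole-number id, in order of first appearance.
vocab_ids = {}
for token in tokens:
    if token not in vocab_ids:
        vocab_ids[token] = len(vocab_ids)

# Replace every token with its id; repeated words share the same id.
token_ids = [vocab_ids[token] for token in tokens]
print(tokens)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```

Notice that "the" appears twice and gets the id 0 both times. That repeated id is what will later pull the same row out of the table.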
So the question becomes: how do we go from a token id (a single whole number) to a vector (a small list of numbers the model can actually compute with)? The answer is an embedding, and it is delightfully simple.
A lookup, not a multiplication
Imagine a spreadsheet. Down the left side you have one row for every possible token in your vocabulary, and across the top you have a few columns. Each cell holds a number. To find the vector for token id 7, you do not multiply anything. You just walk to row 7 and read off that row. That is the entire idea behind an embedding: it is a table, and you look things up in it.
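The spreadsheet picture can be sketched in plain Python, with no PyTorch at all. The table below is hypothetical and its numbers are arbitrary placeholders chosen so each row is easy to recognize; a real table would hold learned values.

```python
# A hypothetical table: 10 rows (one per token id), 4 columns (numbers per vector).
vocab, C = 10, 4
table = [[row * 10 + col for col in range(C)] for row in range(vocab)]

def lookup(token_id):
    # No arithmetic on the stored values: walk to the row and hand it back.
    return table[token_id]

vector = lookup(7)
print(vector)  # [70, 71, 72, 73]
```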
In PyTorch this table is called nn.Embedding(num_embeddings, embedding_dim). The first argument, num_embeddings, is how many rows the table has, which is the size of your vocabulary. The second, embedding_dim, is how many numbers are in each row, which is how wide each vector is. People often shorten embedding_dim to C (for channels), so you will see C floating around. Inside, the layer keeps one big grid of numbers of shape (vocab, C), meaning vocab rows and C columns. This grid is called the weight matrix, and a matrix is just a grid of numbers, rows by columns.
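If you want to see the grid for yourself, a short sketch like the following should do it. The sizes 10 and 4 are illustrative choices, not taken from any real model.

```python
import torch
import torch.nn as nn

vocab, C = 10, 4                      # illustrative sizes only
token_embed = nn.Embedding(vocab, C)  # vocab rows, C columns

# The weight matrix is the table itself.
print(token_embed.weight.shape)       # torch.Size([10, 4])

# Every cell is a trainable parameter, so the count is rows times columns.
n_params = token_embed.weight.numel()
print(n_params)                       # 40
```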
Now the part that surprises people, so let us say it slowly. When you hand the embedding a token id, it does not do any multiplication. It literally goes to that row and copies it out. In code, token_embed(idx) gives you exactly the same thing as token_embed.weight[idx], which is just reading row idx straight from the table. If you have heard that neural networks are full of matrix multiplication, this might feel strange, and that is completely normal. The embedding is the one friendly exception: it is pure lookup.
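One way to convince yourself that the lookup is a shortcut rather than a different kind of math: a multiplication by a one-hot row (all zeros with a single 1) would pick out the same row, just with far more work. The sketch below compares the two. The one-hot route is shown only for comparison and is an assumption of this example, not something the embedding layer does internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, C = 10, 4
token_embed = nn.Embedding(vocab, C)
idx = torch.tensor([7, 2, 7])                          # three token ids

looked_up = token_embed(idx)                           # reads rows 7, 2, 7 -> shape (3, 4)

# The slow way: build one-hot rows and multiply them by the whole table.
one_hot = F.one_hot(idx, num_classes=vocab).float()    # shape (3, 10)
multiplied = one_hot @ token_embed.weight              # shape (3, 4)

same = torch.allclose(looked_up, multiplied)
print(same)
```

Both routes land on the same numbers, but the lookup never touches the other rows.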
One more piece worth knowing. The rows of this table are parameters, which means they start as random numbers and get improved during training. Here is the lovely part: on any given step, only the rows for the tokens that actually showed up receive a gradient, which is the signal that says how to adjust them. With plain gradient descent, rows for words the model did not see that step are left untouched (optimizers that use momentum or weight decay can still move them a little). The table teaches itself, one row at a time, which numbers make each word most useful.
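You can check which rows get a training signal with a small experiment. The loss here is a stand-in (just the sum of the looked-up numbers) chosen only so that backward() has something to work with; it is not a real training objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
token_embed = nn.Embedding(10, 4)
idx = torch.tensor([1, 3, 3])          # only tokens 1 and 3 appear this step

loss = token_embed(idx).sum()          # stand-in loss, just to produce gradients
loss.backward()

grad = token_embed.weight.grad         # same shape as the table: (10, 4)

# Rows whose gradient is not all zeros are the ones that would be adjusted.
touched_rows = grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(touched_rows)                    # [1, 3]
```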
To make this concrete, picture a tiny table with vocab=10 rows and C=4 columns. Each spot in the sequence holds a token id, and that id picks out one row, which flows straight into the output.
Two spots that hold the same token id pull the very same row, so they get identical vectors. Adding the position embedding, which we will explain in a moment, breaks that tie and gives each spot in the sequence its own distinct vector, so the model can tell where each token sits.
Why position matters too
There is a catch. The lookup we just built gives the same vector to a word no matter where it appears in the sentence. But word order carries meaning. “Dog bites man” and “man bites dog” use the same words yet mean very different things. So the model needs a way to know not just which token sits at each spot, but where that spot is in the sequence.
The fix is charmingly simple: use a second lookup table, one indexed by position instead of by word. The first position gets its own vector, the second position gets another, and so on. We then add the position vector onto the token vector. Now the model receives, for each spot, a blend that says both “this is the word cat” and “this is position 3”. The numbers in this second table are parameters too, so the model learns for itself what each position should mean.
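Here is a small sketch of that blend, assuming a 10-row token table and an 8-row position table with randomly initialized values. The same token id is placed at the first and third spots so you can see the tie before the addition and the difference after it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
token_embed = nn.Embedding(10, 4)       # one row per token
position_embed = nn.Embedding(8, 4)     # one row per position

idx = torch.tensor([7, 2, 7])           # token 7 sits at positions 0 and 2
tok = token_embed(idx)                  # shape (3, 4)
pos = position_embed(torch.arange(len(idx)))  # rows for positions 0, 1, 2
x = tok + pos                           # blend of "which token" and "which position"

same_before = torch.equal(tok[0], tok[2])  # identical rows: same token id
same_after = torch.equal(x[0], x[2])       # differ once distinct position rows are added
print(same_before, same_after)
```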
A few traps worth remembering
Every idea here has a sharp edge that catches people the first time. These are the ones to keep handy.
Where this shows up in a real language model
This is not a toy you will outgrow. A GPT-style transformer with learned positions keeps exactly these two tables side by side: a token_embedding_table = nn.Embedding(vocab, n_embed) for the words and a position_embedding_table = nn.Embedding(context_length, n_embed) for the spots. (Some other models encode position in different ways, but the idea of a token table carries over.) When the model runs, it builds the list of positions with torch.arange(T) (which just makes the numbers 0, 1, 2 up to T - 1, where T is how many tokens are in the sequence), looks up both tables, and adds them together with x = tok + pos. The position vectors have shape (T, C), and PyTorch automatically lines them up across every example in the batch, so you only need to look them up once.
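The "lines up across the batch" step is PyTorch broadcasting. The sketch below, with illustrative sizes, checks that adding the (T, C) positions gives the same result as copying them out once per sequence by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, context_length, n_embed = 10, 8, 4        # illustrative sizes only
token_embedding_table = nn.Embedding(vocab, n_embed)
position_embedding_table = nn.Embedding(context_length, n_embed)

idx = torch.randint(0, vocab, (2, 3))            # fake token ids, shape (B, T)
B, T = idx.shape
tok = token_embedding_table(idx)                 # (B, T, C)
pos = position_embedding_table(torch.arange(T))  # (T, C)
x = tok + pos                                    # broadcast to (B, T, C)

# The explicit version: repeat the positions for each sequence, then add.
explicit = tok + pos.unsqueeze(0).expand(B, T, n_embed)
matches_explicit = torch.equal(x, explicit)
print(x.shape)  # torch.Size([2, 3, 4])
```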
There is one limit to keep in mind. The position table has a fixed number of rows, set by context_length, which is the longest sequence the model can handle. If you ever ask for a position past the last row, PyTorch stops you with an error rather than guessing. That is the model honestly telling you it has run out of positions it was trained to understand.
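One common way to stay inside the table is to crop a too-long sequence to its most recent context_length tokens before looking up positions. The sketch below shows that idea under the assumption that dropping the oldest tokens is acceptable for your use; whether it is depends on the model and the task.

```python
import torch
import torch.nn as nn

context_length, C = 8, 4
position_embed = nn.Embedding(context_length, C)

idx = torch.arange(11).unsqueeze(0)       # a too-long sequence of 11 fake ids: shape (1, 11)
idx_cropped = idx[:, -context_length:]    # keep only the most recent 8 tokens
T = idx_cropped.shape[1]

pos = position_embed(torch.arange(T))     # positions 0..7, all inside the table
print(T, pos.shape)                       # 8 torch.Size([8, 4])
```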
The whole thing, runnable
If you have PyTorch installed, you can paste this into a file and run it. Read the comments as you go; they walk through exactly what each line does.
import torch
import torch.nn as nn
torch.manual_seed(0) # makes the random tables repeatable so your numbers match mine
vocab, context_length, C = 10, 8, 4 # 10 possible tokens, up to 8 positions, 4 numbers per vector
token_embed = nn.Embedding(vocab, C) # the word table: one row per token, C numbers wide
position_embed = nn.Embedding(context_length, C) # the position table: one row per spot in the sequence
idx = torch.randint(0, vocab, (2, 3)) # fake token ids: 2 sequences, each 3 tokens long (B, T)
tok = token_embed(idx) # look up each token's row -> shape (B, T, C)
pos = position_embed(torch.arange(3)) # look up rows for positions 0, 1, 2 -> shape (T, C)
x = tok + pos # add them; the (T, C) positions line up across both sequences
# A lookup IS just reading a row from the table, not a multiplication:
print(torch.allclose(token_embed(idx), token_embed.weight[idx])) # True
# The position table has a fixed size: asking past the last row is an error.
try:
    position_embed(torch.tensor([context_length])) # position 8 does not exist (rows are 0..7)
except IndexError as err:
    print("IndexError:", err) # caught here so the script finishes cleanly