Tensors, the one object that PyTorch is built on

Every number inside a language model lives in a tensor, so if you understand this one object you understand most of PyTorch. We will build the idea up slowly by poking at live ones: change a shape, swap a stride, hide the future, and watch the numbers move. No prior PyTorch needed.

Where we are starting

If you are new to PyTorch, welcome. We are going to begin at the very beginning, with the single object that everything else is made of. You do not need any background in machine learning to follow along, and if a word looks unfamiliar, we will stop and define it the first time it appears. Take your time, play with the interactive bits, and let the ideas settle.

The good news is that there is really only one thing to learn here, and we will spend the whole article getting comfortable with it.

The one object everything is made of

A tensor is just a box of numbers arranged in a grid. A single number is a tensor with no dimensions. A list of numbers is a one-dimensional tensor (like a row in a spreadsheet). A table of numbers is two-dimensional, and you can keep stacking from there into three, four, or more dimensions. In PyTorch this box is called a torch.Tensor, and it carries a few labels that describe it:

its shape, which is how many numbers sit along each dimension (for example, three rows and four columns),
its dtype, short for data type, which is the kind of number each cell holds (a whole number, or a decimal, and how many bits of precision it uses),
its device, which says where the numbers physically live, either on the CPU (your computer’s main processor) or on the GPU (a graphics chip that does many calculations at once),
and, when the tensor is part of training, some extra bookkeeping for gradients, which we will meet in a later article. For now, a gradient is just a number that tells the model how to improve. You can safely ignore it today.
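
As a small sketch of those labels, the snippet below builds one tensor and prints each of them. It assumes a default PyTorch install where new tensors are float32 and live on the CPU; the variable names are made up for illustration.

```python
import torch

# A 3-by-4 table of decimals (float32 on the CPU under default settings).
t = torch.zeros(3, 4)

print(t.shape)          # how many numbers sit along each dimension
print(t.dtype)          # the kind of number in each cell
print(t.device)         # where the numbers live
print(t.requires_grad)  # the gradient bookkeeping flag, off by default

# A single number is a tensor with zero dimensions.
scalar = torch.tensor(7)
print(scalar.ndim, scalar.shape)
```

Moving to a GPU, when one is available, is a matter of calling something like t.to("cuda"); the shape and dtype stay the same and only the device label changes.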

Here is why this matters so much. Inside a language model, every single quantity is a tensor: the token ids coming out of the data loader, the embedding table, the attention scores, the logits, the loss, and the gradients. (Do not worry about those terms yet, they each get their own article.) Once you are at home with this one object, the rest of PyTorch is just functions that take tensors in and hand tensors back out. That is the whole trick.

One more idea before we make one. In practice a network rarely processes a single example at a time. It processes a batch, which is a stack of many examples handled together in one go. We do this because a GPU has thousands of small cores, and feeding it a batch keeps them busy at once, which is far faster than going one example at a time. So let us manufacture our first tensor. Pick a constructor (the function that builds the tensor) and a shape, and watch what comes out.

Here is something you can play with: choose a constructor and a shape, and the tensor appears. Watch the dtype badge, which tells you what kind of number is inside. Notice that arange (when given whole-number arguments) and randint hand back int64 (whole numbers), because they make indices, and indices are always whole.
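
If you would rather type than click, here is a rough equivalent of that experiment. The particular set of constructors and the shape (2, 3) are choices made for this sketch, not a list taken from the widget.

```python
import torch

shape = (2, 3)

# A handful of common constructors, keyed by name.
made = {
    "zeros":   torch.zeros(shape),            # all 0.0
    "ones":    torch.ones(shape),             # all 1.0
    "randn":   torch.randn(shape),            # random decimals around 0
    "arange":  torch.arange(6),               # 0, 1, ..., 5 as whole numbers
    "randint": torch.randint(0, 10, shape),   # random whole numbers in [0, 10)
}

for name, t in made.items():
    print(f"{name:8s} shape={tuple(t.shape)} dtype={t.dtype}")
```

Under default settings the first three report float32 and the last two report int64.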

How a tensor is really stored, and why some operations are free

Now for a secret that explains why PyTorch is so fast, and it is genuinely delightful once it clicks. If it feels strange at first, that is completely normal, so read it twice.

Underneath, a tensor does not store its numbers in a grid at all. It stores them in one long flat line in memory, one number after another. The grid you see is an interpretation laid on top of that line. PyTorch figures out the grid using two pieces of information: the shape (which you already know) and the stride, which is the number of steps to take along the flat line to move one position along a given dimension. So a tensor is really a view: a way of reading a flat block of memory as if it were n-dimensional.
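
To make the stride idea concrete, here is a minimal sketch that finds one element by hand. It assumes a freshly built tensor, so the view starts at the very beginning of the flat line; the index (1, 2, 3) is just an example.

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)
print(t.stride())   # (12, 4, 1): steps along the flat line per dimension

# Position in the flat line = sum of (index * stride) over the dimensions.
i, j, k = 1, 2, 3
offset = sum(idx * step for idx, step in zip((i, j, k), t.stride()))
print(offset)                       # 1*12 + 2*4 + 3*1 = 23

# Reading the flat line at that offset gives the same number as grid indexing.
print(t.flatten()[offset].item())   # 23
print(t[i, j, k].item())            # 23
```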

Here is the payoff. Many operations that look like they rearrange data do not actually move a single number. Slicing, transpose (swapping two dimensions), permute (reordering dimensions), squeeze (removing a dimension of size one), and view (relabeling the shape) are usually free. They simply hand you a new tensor object that points at the same flat memory with different shape and stride labels. Nothing is copied. That is wonderful for speed.

Below is a contiguous tensor built by arange(24).reshape(2, 3, 4), which lays out the numbers 0 through 23 and views them as 2 by 3 by 4. (“Contiguous” just means the numbers sit in their natural order in memory.) Run operations on it and watch the storage, the strides, and whether it stays contiguous. Hover over any cell to see exactly where that number lives in the flat line.

Drag and run operations, then watch the storage row and the strides. Most operations only relabel the view, so the storage stays put. After a transpose the strides no longer line up for view, so PyTorch asks you to use reshape() or contiguous().view() instead. The storage row only reshuffles when a real copy actually happens.
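
The same experiment can be sketched in code. Here data_ptr() is used as a rough check of whether two tensors start at the same place in memory; that works in this example because neither view is offset from the start of the storage.

```python
import torch

base = torch.arange(24).reshape(2, 3, 4)
tr = base.transpose(1, 2)                  # swap the last two dims -> (2, 4, 3)

print(tr.shape, tr.stride())               # torch.Size([2, 4, 3]) (12, 1, 4)
print(tr.is_contiguous())                  # False: strides were swapped, data was not
print(tr.data_ptr() == base.data_ptr())    # True: same flat memory, nothing copied

copy = tr.contiguous()                     # this one really does move numbers
print(copy.stride())                       # (12, 3, 1)
print(copy.data_ptr() == base.data_ptr())  # False: a fresh block of memory
```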

Broadcasting, or how a small tensor stretches to meet a big one

Often you want to combine two tensors of different shapes, for example adding a short list of numbers to a big block. PyTorch handles this with a rule called broadcasting, and it is one of those things that saves you from writing loops everywhere.

The rule lines the two shapes up starting from the right. For each dimension, the sizes must either be equal, or one of them must be 1 (or absent entirely). When a dimension has size 1, PyTorch virtually stretches it to match the other tensor, and it does this without copying any data. That is how a bias of shape (C,) can be added to a whole block of activations of shape (B, T, C) with no loop in sight. (Here B is the batch size, T is the sequence length, and C is the number of channels, which are just the names we give those three dimensions.)

Edit either shape, or load a preset, and watch them line up from the right. Dimensions that get stretched glow. If two dimensions disagree and neither is 1, that is a hard error, and the widget will show you exactly where the clash happens.
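
Here is a small sketch of the rule in action, using the bias example from above. The sizes B, T, C = 2, 3, 4 are arbitrary, and expand is used only to peek at how the stretching is done without a copy.

```python
import torch

B, T, C = 2, 3, 4
x = torch.zeros(B, T, C)
bias = torch.arange(C, dtype=torch.float32)   # shape (C,)

# (B, T, C) and (C,) line up from the right, so the result is (B, T, C).
print(torch.broadcast_shapes(x.shape, bias.shape))
y = x + bias
print(y.shape)

# The stretch is virtual: the new dimensions get stride 0, so no data is copied.
stretched = bias.expand(B, T, C)
print(stretched.stride())                     # (0, 0, 1)

# A clash: the last dimensions are 4 and 3, and neither is 1.
try:
    x + torch.zeros(T)
except RuntimeError as err:
    print("cannot broadcast:", err)
```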

Picking out pieces with indexing and slicing

Most of the time you do not want the whole tensor, you want a slice of it. Basic slicing uses ranges, single positions, and negative positions (which count from the end), and like before, it usually returns a view rather than a copy.

Two patterns come up constantly, so they are worth seeing now. Writing x[:, -1, :] keeps every batch, then picks the last time step (using a single position, which removes that dimension), then keeps every channel. Writing x[:, 1:] keeps the time dimension but drops its first element. That second one is the “teacher-forcing shift” you will meet again when we train a model, where the targets are the inputs slid over by one step. You do not need the details yet, just notice how a small slice expresses it.

Set the slice for each dimension on its own and watch the chosen elements light up. A single position (an integer index) removes its dimension from the result shape, while a range keeps the dimension but trims it.
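
The two patterns above, plus one close cousin, can be sketched like this. The tensor is filled with arange so that each cell is easy to recognize; the names are illustrative.

```python
import torch

B, T, C = 2, 3, 4
x = torch.arange(B * T * C).reshape(B, T, C)

last    = x[:, -1, :]    # integer index on time: that dimension disappears -> (B, C)
kept    = x[:, -1:, :]   # a range of length one: the dimension stays -> (B, 1, C)
shifted = x[:, 1:]       # drop the first time step -> (B, T-1, C)
print(last.shape, kept.shape, shifted.shape)

# Basic slices are views, so writing through one changes the original.
shifted[0, 0, 0] = -1
print(x[0, 1, 0].item())   # -1
```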

Joining tensors with cat and stack

Sometimes you need to glue tensors together, and PyTorch gives you two ways that are easy to confuse, so let us be precise. torch.cat joins tensors along a dimension that already exists, so the number of dimensions stays the same and that one dimension just grows longer. torch.stack instead creates a brand new dimension whose length equals how many tensors you passed in.

Mixing these two up gives you a tensor with one too many (or one too few) dimensions, and that kind of mistake often stays hidden until something far downstream breaks. So it is worth building a clear mental picture now.

Blue cells come from tensor a, amber cells from tensor b. Flip between cat and stack and watch the result shape: stack adds a fresh dimension, while cat just makes an existing one longer.
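
A quick way to keep the two apart is to print the result shapes side by side. In this sketch a and b are two small (2, 3) tensors invented for the purpose.

```python
import torch

a = torch.zeros(2, 3)
b = torch.ones(2, 3)

# cat: same number of dimensions, one of them grows.
print(torch.cat([a, b], dim=0).shape)     # torch.Size([4, 3])
print(torch.cat([a, b], dim=1).shape)     # torch.Size([2, 6])

# stack: a brand new dimension of length 2 (the number of tensors passed in).
print(torch.stack([a, b], dim=0).shape)   # torch.Size([2, 2, 3])
print(torch.stack([a, b], dim=-1).shape)  # torch.Size([2, 3, 2])

# stack behaves like "add a size-one dimension to each, then cat along it".
same = torch.equal(
    torch.stack([a, b], dim=0),
    torch.cat([a.unsqueeze(0), b.unsqueeze(0)], dim=0),
)
print(same)                               # True
```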

Hiding the future with a mask

Here is an idea that sounds fancy but is simple once you see it. A language model that reads left to right must not be allowed to peek at words that come later, because at prediction time those words do not exist yet. We enforce this with a mask, which is a way of blocking out the positions a model is not allowed to look at.

The trick is scores.masked_fill(tril == 0, float('-inf')), where tril is a square grid with ones on and below the diagonal and zeros above it. The call finds every forbidden position (the zeros) and writes negative infinity into it. Then a function called softmax, which turns a row of scores into probabilities that add up to one, runs across each row. Because negative infinity becomes exactly probability zero after softmax, the model gives those future positions no weight at all. Two small but important details: the mask is built from a boolean test (== 0, which gives true or false), and the fill value is true negative infinity. A merely large negative number only reaches exactly zero if it happens to underflow, which depends on the dtype and on how large the real scores are, whereas negative infinity gives zero by construction.

Toggle the mask on and off, and reroll the scores. The grid on the right is the attention distribution each query row would actually use. With the mask on, it is lower-triangular (the diagonal included), meaning each position can see itself and the past but never the future.
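
Here is a minimal sketch of the masking step on its own, with random scores standing in for real attention scores. The size T = 4 and the seed are arbitrary.

```python
import torch

torch.manual_seed(0)
T = 4
scores = torch.randn(T, T)                 # stand-in for real attention scores

tril = torch.tril(torch.ones(T, T))        # 1s on/below the diagonal, 0s above
masked = scores.masked_fill(tril == 0, float("-inf"))
probs = torch.softmax(masked, dim=-1)      # softmax across each row
print(probs)

# Everything above the diagonal is exactly zero, and each row still sums to one.
future_is_zero = bool(torch.all(probs.triu(diagonal=1) == 0))
print(future_is_zero)                      # True
print(probs.sum(dim=-1))
```

The first row can only see itself, so its single allowed entry gets all of the probability.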

gather, the heart of the language-model loss

This last operation is the one sitting at the center of how a language model measures its own mistakes, so it is worth slowing down for. Suppose the model has produced, for every position, a full set of scores over the whole vocabulary (every word it could possibly predict). What we actually care about is the score it gave to the one word that truly came next. We need to reach into that big grid and pull out exactly those chosen entries.

That is what gather does. Given a tensor of shape (B, T, vocab) (batch, time, and one score per vocabulary word), the call logp.gather(-1, idx), where idx holds the target ids as whole numbers with shape (B, T, 1), reads at every position the single entry whose last coordinate matches the target word’s id. The result has shape (B, T, 1), so squeeze(-1) then removes the now-pointless dimension of size one, leaving a clean (B, T) grid of exactly the numbers the loss needs.

Click any cell to choose the realized token (the word that actually came next) for that position. The matrix at the bottom is what the loss truly sees: just the gathered log-probabilities. (One detail for later: the log-softmax is computed in fp32, a higher-precision number format, because doing it in a lower-precision format would throw away too many digits.)
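
The sketch below walks through that gather step with random numbers. Turning the gathered values into a loss by taking the negative mean is an assumption made here for illustration; it is compared against torch.nn.functional.cross_entropy as a sanity check.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5                               # V is a toy vocabulary size
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))           # the token that actually came next

logp = torch.log_softmax(logits.float(), dim=-1)    # (B, T, V), computed in fp32
idx = targets.unsqueeze(-1)                         # (B, T, 1), int64
picked = logp.gather(-1, idx).squeeze(-1)           # (B, T)

# The same values, fetched one at a time with plain indexing.
by_hand = torch.stack(
    [torch.stack([logp[b, t, targets[b, t]] for t in range(T)]) for b in range(B)]
)
print(torch.equal(picked, by_hand))             # True

# Assumed loss: negative mean of the gathered log-probabilities.
loss = -picked.mean()
ref = F.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T))
print(torch.allclose(loss, ref))                # True
```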

A few traps worth keeping in your pocket

Every idea above has a sharp edge that catches people the first time they meet it. Here are the ones to remember, and do not feel bad if you have already hit one of them, everyone does.
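
As one illustration of such a sharp edge (picked for this sketch from the broadcasting rule above, not taken from any fixed list), subtracting a (N,) tensor from a (N, 1) tensor does not raise an error. It quietly broadcasts to (N, N).

```python
import torch

pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), e.g. a model output
target = torch.tensor([1.0, 2.0, 3.0])       # shape (3,)

# Lined up from the right: (3, 1) against (3,) stretches to (3, 3).
diff = pred - target
print(diff.shape)                            # torch.Size([3, 3]), not (3,)

# Dropping the size-one dimension first gives the intended element-wise result.
fixed = pred.squeeze(-1) - target
print(fixed.shape)                           # torch.Size([3])
print(fixed)                                 # all zeros
```

Printing shapes right after an operation like this is a cheap habit that catches the problem early.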

How all of this shows up in a real language model

None of what we covered is filler. Every piece is load-bearing in an actual model. The data loader slices random_samples[:, :ctx] for the inputs and [:, 1:ctx+1] for the targets, which is exactly the drop-the-first-token shift you saw in slicing. The attention layer computes its scores with q @ k.transpose(-2, -1) (a matrix multiply using the transpose you met in the storage section), then uses a tril mask and masked_fill(... == 0, -inf) to hide the future before softmax. The outputs of multiple attention heads are glued together with cat(..., dim=-1). The forward pass flattens with logits.reshape(B*T, C), and it uses reshape rather than view because reshape still works when a tensor is non-contiguous (as sliced data is), where view would throw an error. And the whole scoring core is gather plus fp32 plus boolean masks: logprobs_all.gather(-1, targets).squeeze(-1). Every line traces straight back to something you just played with.
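
The data-loader shift is easy to check in isolation. In this sketch random_samples is a made-up (B, ctx + 1) block of token ids, and the sizes are arbitrary; only the two slices come from the description above.

```python
import torch

torch.manual_seed(0)
B, ctx, vocab = 2, 4, 10
random_samples = torch.randint(0, vocab, (B, ctx + 1))   # hypothetical token ids

inputs  = random_samples[:, :ctx]         # tokens 0 .. ctx-1
targets = random_samples[:, 1:ctx + 1]    # tokens 1 .. ctx, i.e. inputs slid by one

# Each target is the token that follows the matching input.
print(torch.equal(inputs[:, 1:], targets[:, :-1]))   # True

# The slices are views into a wider row, so they are not contiguous.
print(targets.is_contiguous())                       # False
try:
    targets.view(B * ctx)
except RuntimeError:
    print("view failed on the sliced targets")

flat_targets = targets.reshape(B * ctx)              # reshape copies when it has to
print(flat_targets.shape)
```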

The whole thing, runnable

If you have PyTorch installed, paste this into a file and run it. Read the comments as you go, since they walk through what each line is doing.

import torch
torch.manual_seed(0)                 # fixes the randomness so your numbers match mine

# creation: the building blocks of one LLM batch
B, T, C = 2, 3, 4                     # batch size, sequence length, channels
ids = torch.randint(0, 10, (B, T))   # whole-number token ids (int64)
x   = torch.randn(B, T, C)           # random decimal activations (fp32)

# slicing: the last time step, and the teacher-forcing shift
last    = x[:, -1, :]                # keep all batches/channels, pick the last step -> (B, C)
shifted = ids[:, 1:]                 # drop the first token along time -> (B, T-1)

# broadcasting: a (C,) bias adds to every (B, T) position with no loop
bias = torch.arange(C, dtype=torch.float32)
y = x + bias                         # the size-C bias is stretched across B and T -> (B, T, C)

# view / reshape / transpose: relabeling shapes, usually with no copy
flat = x.reshape(B * T, C)           # flatten the batch for the loss
tr   = x.transpose(-2, -1)           # swap the last two dims -> (B, C, T) for q @ k.T

# causal mask: write -inf into the future before softmax
tril   = torch.tril(torch.ones(T, T))            # 1s on/below the diagonal, 0s above
scores = torch.randn(T, T)
masked = scores.masked_fill(tril == 0, float('-inf'))  # block the future positions
probs  = torch.softmax(masked, dim=-1)           # forbidden spots become exactly 0

# gather: pull out the log-prob of the token that actually came next
logp   = torch.log_softmax(torch.randn(B, T, 5).float(), dim=-1)
tgt    = ids.clamp(max=4).unsqueeze(-1)          # target ids, shaped (B, T, 1)
picked = logp.gather(-1, tgt).squeeze(-1)        # one chosen number per position -> (B, T)

# the classic gotcha: view needs contiguous memory, reshape does not
try:
    tr.view(B, T, C)                 # FAILS, because tr is non-contiguous after transpose
except RuntimeError:
    print("view failed; use reshape")
print(tr.reshape(B, T, C).shape)     # reshape copies if it must, so this works
