nn.Module, the building block you assemble models from
Every layer and every full model in PyTorch is built from one friendly class called nn.Module. It quietly keeps track of all the numbers your model needs to learn, lets you save and load the whole thing in one line, and switches the entire model between training mode and evaluation mode with a single call. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
First, what a model is actually made of
Before we name anything, let us get a feel for the thing. A neural network is really just a big pile of numbers, plus some rules for how those numbers get combined. When you hear that a model “learned” something, what physically changed is that pile of numbers. Training is the process of nudging those numbers, over and over, until the model gets good at its job.
Two words will keep coming up, so let us pin them down right now. A tensor is just PyTorch’s word for an array of numbers. A single number, a list of numbers, a grid of numbers, all of those are tensors. You can think of it as a spreadsheet that can have any number of dimensions. A parameter is a special tensor, one that the model is allowed to adjust while it learns. The parameters are the pile of numbers we just talked about. Everything else (the rules, the wiring) stays fixed; the parameters are what move.
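To make "a tensor is just an array of numbers" concrete, here is a minimal sketch that builds a few tensors of different dimensions. The variable names are made up for illustration.

```python
import torch

scalar = torch.tensor(3.0)              # a single number: 0 dimensions
vector = torch.tensor([1.0, 2.0, 3.0])  # a list of numbers: 1 dimension
grid = torch.zeros(2, 3)                # a grid of numbers: 2 dimensions
cube = torch.zeros(2, 3, 4)             # nothing stops you at 2 dimensions

for t in (scalar, vector, grid, cube):
    # ndim is the number of dimensions, shape is the size along each one
    print(t.ndim, tuple(t.shape))
```

Each line of output pairs a dimension count with a shape, starting at `0 ()` for the single number and ending at `3 (2, 3, 4)` for the cube.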
If a model has millions of these numbers, you might worry about how anyone keeps track of them all. That is the exact problem nn.Module solves, and it solves it so gracefully that you mostly forget it is even there. If this feels abstract right now, that is completely normal. It will get concrete fast.
One class, the whole tree
An nn.Module is a container, and it automatically keeps track of two kinds of things for you.
The first kind is parameters, the adjustable tensors we just met. In PyTorch you mark a tensor as a parameter by wrapping it in nn.Parameter. The moment you do that, its requires_grad attribute is set to True, which is PyTorch’s way of saying “please remember how to adjust this number during training”. You do not have to set that flag yourself; wrapping the tensor in nn.Parameter turns it on for you. Once you assign the wrapped tensor to an attribute of your module (self.something = ...), the module registers it and keeps track of it from then on.
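Here is a small sketch of that difference. The `Scaler` class and its attribute names are hypothetical, invented only for this illustration; the point is that the wrapped tensor is tracked by the module and the plain tensor is not.

```python
import torch
import torch.nn as nn

plain = torch.ones(3)                    # an ordinary tensor
learnable = nn.Parameter(torch.ones(3))  # the same numbers, marked as learnable

print(plain.requires_grad)      # False
print(learnable.requires_grad)  # True

class Scaler(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.offset = torch.zeros(3)             # plain tensor: NOT registered

    def forward(self, x):
        return x * self.gain + self.offset

names = [name for name, _ in Scaler().named_parameters()]
print(names)  # ['gain']
```

Only `gain` shows up, because only `gain` was wrapped in nn.Parameter.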
The second kind is submodules, which are simply other nn.Modules living inside this one. A real model is made of smaller pieces (a layer here, a block there), and each of those pieces is itself an nn.Module. When you write something like self.head = nn.Linear(...) inside your model, you are tucking a child module inside the parent. From that moment on, the parent can see everything inside the child. Ask the parent for its parameters and it hands you the child’s parameters too. This nesting goes as deep as you like, and the top model can reach every single number in the entire structure. That is why we picture it as a tree: one model at the top, branching down into blocks, and blocks branching down into layers.
Two handy methods let you walk this tree. .parameters() gives you every adjustable tensor, and .named_parameters() gives you the same thing but with a readable name attached to each one, like blocks.0.ffn.0.weight. Those dotted names are not random; they are just the path down the tree, the same way folders nest inside folders on your computer.
Picture a small model shaped like a tiny GPT, the same family of model that powers chatbots. As you descend its tree, the names build up piece by piece, and the parameter counts at the leaves add up as you climb back toward the total.
named_parameters() walks this whole structure top to bottom, and the dotted names it yields (like blocks.0.ffn.0.weight) are just the path to each tensor. Saving a model uses these exact same names, which is the whole reason loading a saved model only needs the names and shapes to line up.
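If you would like to see those dotted names come out of real code, here is a rough sketch. The `Block` and `TinyModel` classes are assumptions made up for this example (they are not the article's model), chosen so that the path blocks.0.ffn.0.weight actually appears.

```python
import torch
import torch.nn as nn

class Block(nn.Module):  # hypothetical block with a small feed-forward part
    def __init__(self, n_embed):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),  # becomes "ffn.0"
            nn.ReLU(),                        # "ffn.1", has no parameters
            nn.Linear(4 * n_embed, n_embed),  # becomes "ffn.2"
        )

    def forward(self, x):
        return x + self.ffn(x)

class TinyModel(nn.Module):  # hypothetical parent model
    def __init__(self, n_embed=4, n_layer=2):
        super().__init__()
        self.blocks = nn.ModuleList([Block(n_embed) for _ in range(n_layer)])
        self.head = nn.Linear(n_embed, 10)

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

model = TinyModel()
names = [name for name, _ in model.named_parameters()]
for name in names:
    print(name)  # first line printed: blocks.0.ffn.0.weight
```

Read each name from left to right and you are reading the route from the top of the tree down to one tensor: the attribute `blocks`, then item `0` of the list, then `ffn`, then item `0` inside it, then that layer's `weight`.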
train() and eval() flip the whole tree at once
Every module carries a single true-or-false flag called .training. When it is true, the model knows it is in the middle of learning. When it is false, the model knows it is being used for real, which we usually call inference or evaluation.
You almost never set this flag by hand. Instead you call model.train() to switch the whole model into training mode, or model.eval() to switch it into evaluation mode. Here is the lovely part: one call flips the flag on every submodule in the tree, all the way down to the smallest leaf. You set it once at the top and the whole model agrees.
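A quick sketch of that ripple, using a throwaway nested model built only for this illustration. `model.modules()` yields the parent and every module beneath it, so we can read the flag at every level.

```python
import torch
import torch.nn as nn

# A small nested model: a parent holding a layer and an inner container
model = nn.Sequential(
    nn.Linear(4, 4),
    nn.Sequential(nn.Dropout(0.5), nn.Linear(4, 2)),
)

print(model.training)  # True: modules start out in training mode

model.eval()           # one call at the top...
flags_eval = [m.training for m in model.modules()]
print(flags_eval)      # ...and every module in the tree reports False

model.train()          # and one call flips them all back
flags_train = [m.training for m in model.modules()]
print(flags_train)
```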
Why bother having two modes at all? Most layers genuinely do not care and behave identically either way. But a couple of layers behave differently on purpose. Dropout, for example, randomly switches off some numbers during training to keep the model from leaning too hard on any one of them, and then does nothing at all during evaluation. BatchNorm is another one that changes its behavior between the two modes. So forgetting to call model.eval() before using your model is a classic beginner stumble, and now you know to watch for it.
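To see Dropout change its behavior, here is a minimal sketch on a tensor of ones. One detail the text above does not mention: in training mode, PyTorch's Dropout also scales the surviving values by 1/(1-p), so with p=0.5 the survivors become 2.0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)     # only so repeated runs pick the same random zeros
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
y_train = drop(x)        # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)

drop.eval()
y_eval = drop(x)         # evaluation mode: the input passes through untouched

print(y_train)
print(y_eval)
```

Which entries get zeroed depends on the random draw, so only the second line of output is fixed: it is always all ones.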
One more thing worth separating in your mind, because the names sound similar. This .training flag is not the same as torch.no_grad(). The flag controls how certain layers behave. torch.no_grad() is a separate tool that tells PyTorch to stop tracking how to adjust the numbers, which saves memory and time when you are only using the model and not training it. They often get used together, but they are doing two different jobs.
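The two tools can be told apart with a short experiment. This is a sketch on a single throwaway layer; the variable names are made up for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 1)
x = torch.randn(2, 3)

layer.eval()                       # changes layer behavior only
out_tracked = layer(x)
print(out_tracked.requires_grad)   # True: eval() did not stop gradient tracking

with torch.no_grad():              # stops gradient tracking only
    out_untracked = layer(x)
print(out_untracked.requires_grad) # False

layer.train()
with torch.no_grad():
    flag_inside = layer.training
print(flag_inside)                 # True: no_grad() did not touch the .training flag
```

So for plain inference you usually want both: eval() for the layers, and no_grad() for the bookkeeping.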
One call ripples out to every leaf of the tree. The rule to remember: your training loop calls train() before it starts learning (and again after any evaluation pass), and generating text or measuring how good the model is must call eval() first.
A few traps worth keeping in your pocket
Every idea here has a sharp edge that catches people the first time around. None of these are hard once you have seen them once. Two have already come up: forgetting to call model.eval() before using the model, and mistaking the .training flag for torch.no_grad().
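Two more traps follow directly from how registration works, and they are easy to reproduce. This is a sketch with hypothetical class names (`Broken`, `Fixed`, `Forgetful`), not a complete list of pitfalls: a plain Python list hides layers from the parent, and skipping super().__init__() means the bookkeeping does not exist yet.

```python
import torch
import torch.nn as nn

class Broken(nn.Module):  # hypothetical: layers kept in a plain Python list
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(2)]  # parent cannot see these

class Fixed(nn.Module):   # hypothetical: same layers, wrapped in nn.ModuleList
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])

broken_count = sum(p.numel() for p in Broken().parameters())
fixed_count = sum(p.numel() for p in Fixed().parameters())
print(broken_count, fixed_count)  # 0 40

class Forgetful(nn.Module):  # hypothetical: forgot super().__init__()
    def __init__(self):
        self.head = nn.Linear(4, 2)  # bookkeeping is not set up yet

try:
    Forgetful()
    forgot_error = None
except AttributeError as err:
    forgot_error = err
print(type(forgot_error).__name__)  # AttributeError
```

The `Broken` model would still run if you called its layers by hand, which is what makes this trap sneaky: nothing crashes, but an optimizer given `Broken().parameters()` would have nothing to train, and a saved state_dict would be empty.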
How this shows up in a real language model
This is not a toy idea you will outgrow. A real transformer, the kind behind modern language models, is itself just a subclass of nn.Module. Inside it you will find an nn.Embedding (a lookup table that turns each word-piece into a tensor of numbers), an nn.ModuleList holding a stack of repeated blocks, a final normalization layer, and an nn.Linear layer at the end that produces the model’s predictions. Its forward() method, the method that says what the model actually does with an input, simply calls these pieces one after another in order.
Because nn.Module is tracking everything for you, some genuinely useful one-liners just work. Writing sum(p.numel() for p in model.parameters()) counts every single adjustable number in the model (numel means “number of elements”), which is how people arrive at figures like “this model has 400 million parameters”. Calling model.to(device) moves every registered tensor onto the device you name (typically your graphics card) in one go, so the math runs faster. And torch.save(model.state_dict(), path) writes the entire tree of named tensors to a file so you can come back to it later. The state_dict is just a tidy dictionary mapping those dotted names to their tensors, which is exactly the same naming scheme you explored in the tree above.
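Here are those three one-liners together in a rough sketch, followed by loading the file back into a fresh copy. The small nn.Sequential model and the temporary file path are stand-ins chosen for this example, and the device line falls back to the CPU when no graphics card is available.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_model():
    # A stand-in model; any nn.Module works the same way
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

model = make_model()

# 1. Count every adjustable number: (4*8 + 8) + (8*2 + 2) = 58
n_params = sum(p.numel() for p in model.parameters())
print(n_params, "params")

# 2. Move the whole tree to a device (GPU if present, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 3. Save the named tensors, then load them into a brand-new model
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save(model.state_dict(), path)

fresh = make_model()  # same shape, different random numbers
fresh.load_state_dict(torch.load(path, map_location="cpu"))

keys = list(fresh.state_dict().keys())
print(keys)  # ['0.weight', '0.bias', '2.weight', '2.bias']
```

Loading worked because `fresh` has the same names and shapes as the saved model; the values inside it were simply overwritten.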
The whole thing, runnable
If you have PyTorch installed, you can paste this into a file and run it. Read the comments as you go; they walk through exactly what each line is doing. If a line looks mysterious, that is fine, let the comment carry you and trust that it will make more sense the second time through.
import torch
import torch.nn as nn

class GPTLite(nn.Module):
    def __init__(self, vocab, n_embed, n_layer):
        super().__init__()                                # MUST call this first, it sets up the bookkeeping
        self.scale = nn.Parameter(torch.ones(n_embed))    # a bare learnable tensor on its own
        self.token_embed = nn.Embedding(vocab, n_embed)   # lookup table: each token gets a vector
        self.blocks = nn.ModuleList([                     # a list of layers the parent will track
            nn.Linear(n_embed, n_embed) for _ in range(n_layer)
        ])
        self.ln_f = nn.LayerNorm(n_embed)                 # final normalization layer
        self.head = nn.Linear(n_embed, vocab)             # turns the result into one score per token

    def forward(self, idx):                               # forward says what the model does with an input
        x = self.token_embed(idx)                         # look up a vector for each input token
        for blk in self.blocks:                           # run through each block in turn
            x = torch.relu(blk(x))                        # relu just clamps negatives to zero
        return self.head(self.ln_f(x))                    # normalize, then produce the final scores

model = GPTLite(vocab=10, n_embed=4, n_layer=2)

for name, p in model.named_parameters():                  # walk the whole tree, name by name
    print(name, tuple(p.shape))
# scale (4,) | token_embed.weight (10, 4) | blocks.0.weight (4, 4) ...

print(sum(p.numel() for p in model.parameters()), "params")  # count every adjustable number

model.eval()                                              # switch the whole tree into evaluation mode for using the model
torch.save(model.state_dict(), "ckpt.pt")                 # save every named tensor to a file