19 results

PyTorch for LLMs · Jun 2026 · 6 min read

Training across many GPUs without losing your mind

When one GPU is too slow, the trick is wonderfully simple: put a full copy of your model on every GPU, feed each copy a different slice of the data, and then average the gradients they compute so all the copies stay perfectly in sync. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 7 min read

Weight initialization and the trick of sharing one matrix

Before a network learns anything, every one of its numbers has to start somewhere. The starting values turn out to matter enormously: pick them too big or too small and learning stalls before it begins. Here is how to choose them well, plus a lovely space-saving trick called weight tying, all explained assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 6 min read

Saving and loading checkpoints so you never lose a training run

A checkpoint is much more than the model's weights. To pick up a training run exactly where it stopped, you also need the optimizer's memory, the step counter, and the random number state. We will build that complete bundle together, save it, load it back, and see why each piece matters, all assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 7 min read

How a language model writes one word at a time

A language model does not write whole sentences in one go. It guesses the next word, adds it, then guesses again, over and over. A few friendly knobs let you steer how adventurous those guesses are, and you can turn each one yourself and watch a sentence appear.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 7 min read

Fitting a bigger model into a smaller machine

Sometimes the model you want to train is simply too big for the memory you have. Two gentle tricks come to the rescue: gradient accumulation lets you act as if you trained on a large batch while only ever holding a tiny one, and activation checkpointing trades a little extra computing time for a much smaller memory footprint. We build both up slowly, assuming you have never touched PyTorch.
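
The gradient accumulation idea can be checked with a few lines of arithmetic. This is only a sketch in plain Python, not PyTorch: the one-weight model, the data, and the helper name `grad_mse` are all made up for illustration. It shows that adding up scaled gradients from tiny batches gives the same gradient as one large batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of the toy model y = w * x, with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: the true rule is y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

# The gradient we would get if the whole batch fit in memory at once.
full_batch_grad = grad_mse(w, xs, ys)

# Gradient accumulation: hold only two examples at a time and add up
# each tiny batch's gradient, scaled by the number of tiny batches.
micro_batch_size = 2
num_micro_batches = len(xs) // micro_batch_size
accumulated_grad = 0.0
for i in range(num_micro_batches):
    start = i * micro_batch_size
    stop = start + micro_batch_size
    accumulated_grad += grad_mse(w, xs[start:stop], ys[start:stop]) / num_micro_batches

print(full_batch_grad, accumulated_grad)
```

The two printed numbers agree (up to floating point rounding) because every tiny batch here has the same size. With unequal sizes, each gradient would need to be weighted by its share of the examples.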

lettuceresearch
PyTorch for LLMs · Jun 2026 · 8 min read

Mixed precision and the art of using fewer bits

Numbers inside a neural network are usually stored in a big, careful format. If we switch most of them to a smaller, lighter format, training can run up to roughly twice as fast on hardware that supports it, and it uses noticeably less memory. The whole skill is knowing which few numbers we must leave in the careful format, and this walks you through it slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 6 min read

The training loop, where a model actually learns

This is the heartbeat of every neural network. One step is just five small moves done in a fixed order, repeated again and again, and slowly the model gets better. We will walk through each move gently, assuming you have never written a line of PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 7 min read

Learning rate schedules, and why the speed of learning changes over time

A model learns by taking small steps, and how big those steps are matters more than almost anything else. Start too fast and everything falls apart; stay slow forever and you never arrive. Here is the gentle idea of changing your step size as training goes on, built up slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 8 min read

Optimizers, and how a model actually learns with AdamW

A model learns by nudging its numbers in the right direction, over and over. The optimizer is the part that decides how big each nudge should be, and AdamW is the one almost everyone reaches for. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 9 min read

Cross-entropy loss made friendly

Every time a language model guesses the next word, we need one honest number that says how good that guess was. Cross-entropy is that number, and the rule it follows is simple: being confident and right is cheap, while being confident and wrong is expensive. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
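
The rule above is easy to see with three numbers. This is a minimal sketch in plain Python, not PyTorch, and the helper name `cross_entropy` and the probabilities are invented for illustration. For a single guess, the loss is the negative log of the probability the model gave to the correct word.

```python
import math

def cross_entropy(prob_of_correct_word):
    """Loss for one guess: negative natural log of the probability given to the right word."""
    return -math.log(prob_of_correct_word)

# Hypothetical probabilities the model assigned to the correct next word.
confident_right = cross_entropy(0.9)    # model was sure, and it was right
unsure = cross_entropy(0.5)             # model hedged
confident_wrong = cross_entropy(0.01)   # model was sure of something else

print(round(confident_right, 3), round(unsure, 3), round(confident_wrong, 3))
# prints: 0.105 0.693 4.605
```

Going from 0.9 to 0.01 makes the loss more than forty times larger, which is the "confident and wrong is expensive" part.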

lettuceresearch
PyTorch for LLMs · Jun 2026 · 5 min read

Dropout, and why we switch neurons off on purpose

When a network leans too hard on a few neurons, it memorises instead of truly learning. Dropout gently breaks that habit by hiding random neurons while the model trains, and one small rescaling trick keeps everything in balance. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
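
Here is a rough sketch of the hiding and the rescaling in plain Python rather than PyTorch. The function name `dropout` and its arguments are made up for illustration, and it assumes the common "inverted dropout" form, where each surviving value is divided by the keep probability so the average stays the same.

```python
import random

def dropout(values, p, training=True, rng=random):
    """Zero each value with probability p; scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # When the model is being used rather than trained, nothing is hidden.
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Hypothetical layer output: a long row of ones, so the average is easy to read.
activations = [1.0] * 10000
dropped = dropout(activations, p=0.5, rng=random.Random(0))

print(sorted(set(dropped)))          # only zeros and rescaled survivors remain
print(sum(dropped) / len(dropped))   # stays close to the original average of 1.0
```

Without the division by `keep`, the average would fall to about half during training and then jump back when dropout is switched off, which is the imbalance the rescaling avoids.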

lettuceresearch
PyTorch for LLMs · Jun 2026 · 9 min read

Self-attention from scratch

Imagine every word in a sentence quietly asking the words before it: who here matters to me right now? Self-attention is exactly that conversation, written out in math. We will build it up one gentle step at a time, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 8 min read

Activations and softmax, the curves that bring a network to life

A deep network needs a little bend in it, or its whole stack of layers collapses into the equivalent of a single one. Here we meet the small functions that add that bend, and then meet softmax, the trick that turns raw scores into honest probabilities. We build both up slowly, assuming you have never opened PyTorch.
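
For a feel of what softmax does, here is a small sketch in plain Python, not PyTorch. The function name `softmax` and the three scores are invented for illustration. It exponentiates each raw score and divides by the total, so the results are all positive and add up to one.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score first does not change the result,
    # but it keeps exp() from overflowing on big inputs.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next words.
probs = softmax([2.0, 1.0, 0.1])

print([round(p, 3) for p in probs])
# prints: [0.659, 0.242, 0.099]
```

The highest score gets the largest share, but the other candidates still keep some probability instead of being thrown away.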

lettuceresearch
PyTorch for LLMs · Jun 2026 · 8 min read

LayerNorm and RMSNorm, the gentle reset button for deep networks

Stack enough layers and the numbers flowing through a network tend to drift, growing huge or shrinking to nothing until training falls apart. Normalization is a small, reliable trick that resets those numbers to a sane size at every layer. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 7 min read

nn.Linear, the workhorse behind almost everything

Inside a language model, the same simple building block shows up again and again: the attention projections, the feed-forward block, the final layer that picks the next word. They are all one humble layer called nn.Linear. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 7 min read

Embeddings, the lookup table that turns words into vectors

Before a language model can do any math, it has to turn words into numbers. An embedding is just a trainable lookup table that hands each word its own little list of numbers, and the model slowly tunes those numbers as it learns. Here is the whole idea, built up gently, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 8 min read

nn.Module, the building block you assemble models from

Every layer and every full model in PyTorch is built from one friendly class called nn.Module. It quietly keeps track of all the numbers your model needs to learn, lets you save and load the whole thing in one line, and switches the entire model between training mode and evaluation mode with a single call. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 10 min read

Autograd, the engine that figures out gradients for you

You only ever write the easy part, the forward calculation. PyTorch quietly remembers everything you did and then hands you every gradient for free. We will build a tiny example, press one button, and watch the numbers flow backward. No PyTorch experience needed.

lettuceresearch
PyTorch for LLMs · Jun 2026 · 13 min read

Tensors, the one object that PyTorch is built on

Every number inside a language model lives in a tensor, so if you understand this one object you understand most of PyTorch. We will build the idea up slowly by poking at live ones: change a shape, swap a stride, hide the future, and watch the numbers move. No prior PyTorch needed.

lettuceresearch