Fitting a bigger model into a smaller machine

Sometimes the model you want to train is simply too big for the memory you have. Two gentle tricks come to the rescue: gradient accumulation lets you act as if you trained on a large batch while only ever holding a tiny one, and activation checkpointing trades a little extra computing time for a much smaller memory footprint. We build both up slowly, assuming you have never touched PyTorch.

First, the problem we are trying to solve

Imagine you want to bake a hundred cookies, but your oven only fits ten at a time. You have two honest options. You can bake ten, then another ten, and so on, keeping a running tally until all hundred are done. Or you could squeeze the cookies closer together so more fit in one tray, though there is a limit before they burn into each other. Training a big neural network on a small machine feels exactly like this. You have a model you would love to train, and a piece of hardware that simply does not have enough room to hold all of it at once.

Before we go further, a few plain words that will keep coming up. A model is the thing we are training: a big pile of numbers, called parameters, that slowly get adjusted until the model is good at its task. A batch is the group of training examples we show the model in one go. Memory here means the working space on your graphics card (often called VRAM), and it is the thing that runs out. You do not need anything more technical than that to follow along.

The two tricks in this article both let you train a model that, on paper, should not fit. Neither one changes what the model eventually learns. They only change how the work is arranged so it squeezes into the space you have. If that sounds like a free lunch, it almost is, and we will be honest about the small price each one charges.

Trick one: pretend the batch is big by adding up the pieces

There is a quiet fact about training that makes the first trick possible. Every time the model sees a batch, it works out a gradient, which is just a set of numbers that say “nudge each parameter this much, in this direction, to do a little better next time.” Bigger batches give a steadier, less jumpy gradient, which often makes training smoother. The trouble is that bigger batches also need more memory, and memory is exactly what we are short of.

Gradient accumulation is the cookies-in-ten approach. Instead of one big batch, we run several small ones, which we call micro-batches. After each micro-batch we compute its gradient and quietly add it onto a running total, without updating the model yet. Only once we have added up all the pieces do we let the model take a single step. As far as that step is concerned, the model cannot tell the difference: as long as the total is averaged over the number of micro-batches, the accumulated gradient matches the gradient of one big batch, up to tiny rounding differences. We say the effective batch is micro_batch * accum_steps, where accum_steps is how many micro-batches we summed. The beautiful part is that peak memory only ever has to hold one micro-batch at a time, because we finish with each one before starting the next. We get the calm gradient of a large batch while paying the memory bill of a tiny one. The only cost is time: the micro-batches run one after another, so four of them take about four times as long as one.
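To make the "adding up the pieces" claim concrete, here is a minimal sketch in plain Python, with no PyTorch needed. The one-parameter model, the data values, and the helper name mean_loss_grad are all made up for illustration. It compares the gradient of one batch of eight examples with the gradient accumulated over four micro-batches of two.

```python
# Toy model: predict y = w * x with a single parameter w (all values are illustrative).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

def mean_loss_grad(w, xs, ys):
    """Gradient of the mean squared error over one batch with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# One big batch of 8 examples.
full_grad = mean_loss_grad(w, xs, ys)

# The same 8 examples as 4 micro-batches of 2, accumulated before any update.
micro_batch, accum_steps = 2, 4
accumulated = 0.0
for step in range(accum_steps):
    lo = step * micro_batch
    g = mean_loss_grad(w, xs[lo:lo + micro_batch], ys[lo:lo + micro_batch])
    accumulated += g / accum_steps  # scale each piece so the total is an average

print(full_grad, accumulated)  # the two numbers agree up to rounding
```

Only two examples are ever in flight inside the loop, which is the whole point. If the division by accum_steps is left out, the total comes out accum_steps times too large.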

Trick two: forget things on purpose, then recompute them

The second trick needs one more word. As the model processes a batch, every layer produces intermediate numbers called activations, which are simply the partial results flowing forward through the network. Here is the catch: training has two phases. First a forward pass runs the data through the model to get an answer, and then a backward pass runs in reverse to compute the gradient. The backward pass needs those activations from the forward pass to do its work, so normally we keep every single one of them in memory the whole time. In a deep model that is a mountain of saved numbers.

Activation checkpointing asks a daring question: what if we just throw most of them away? During the forward pass we keep only a few activations, at chosen “checkpoints,” and discard the rest. Then, when the backward pass needs the activations we deleted, we redo a small slice of the forward computation to get them back. Activation memory drops sharply, from growing with the number of layers (O(layers)) down toward the square root of that (O(sqrt(layers))) when the checkpoints are spaced about sqrt(layers) layers apart, which for a deep model is an enormous saving. The price is roughly one extra forward pass's worth of computation. Since the backward pass costs about twice as much as the forward pass, that comes to roughly 30 percent more work. If this feels like a strange bargain, that is normal: we are spending time to buy back memory, which is exactly the trade we want when memory is the thing in short supply.
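The forget-and-recompute idea can be sketched without any deep learning library. The toy below is only an illustration under stated assumptions: each "layer" is a single tanh on one number, and the function names are invented for this example. It computes the same gradient twice, once keeping every activation and once keeping only a checkpoint every eight layers, and counts how many activations each version holds at its peak.

```python
import math

def layer(x, a):
    """One toy layer: a scalar tanh unit with weight a."""
    return math.tanh(a * x)

def layer_grad(x, a):
    """Derivative of the layer with respect to its input x (needs the saved input)."""
    return a * (1.0 - math.tanh(a * x) ** 2)

def backward_store_all(x0, weights):
    """Keep every layer input during the forward pass, then walk backward."""
    inputs, x = [], x0
    for a in weights:
        inputs.append(x)
        x = layer(x, a)
    grad = 1.0
    for x_in, a in zip(reversed(inputs), reversed(weights)):
        grad *= layer_grad(x_in, a)
    return grad, len(inputs)

def backward_checkpointed(x0, weights, segment):
    """Keep one input per segment; recompute the rest segment by segment in backward."""
    checkpoints, x = {}, x0
    for i, a in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x          # the only activations kept from the forward pass
        x = layer(x, a)
    grad, peak = 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        inputs, x = [], checkpoints[start]
        for a in seg_weights:           # redo this slice of the forward pass
            inputs.append(x)
            x = layer(x, a)
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, a in zip(reversed(inputs), reversed(seg_weights)):
            grad *= layer_grad(x_in, a)
    return grad, peak

weights = [0.9 + 0.01 * i for i in range(64)]   # 64 toy layers
g_all, stored_all = backward_store_all(0.3, weights)
g_ckpt, stored_ckpt = backward_checkpointed(0.3, weights, segment=8)
print(stored_all, stored_ckpt)  # prints: 64 16
```

With 64 layers and a segment length of 8 (the square root of 64), the checkpointed version holds at most 16 activations instead of 64, and the gradient it returns is the same. Real libraries apply the same idea to whole tensors rather than single numbers.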

Put the two tricks together and you can train a model far larger than your card has room for. Adding accumulation steps lifts the effective batch while memory use stays put, because we only ever hold one micro-batch at a time. Turning on checkpointing cuts memory use and pushes training time up: that is the classic trade of computing time for memory.
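If you would like to poke at these trade-offs with numbers, the following toy calculator may help. It is a rough sketch, not a measurement: the function name training_budget and its cost model are assumptions made for illustration. Activation memory is taken to scale with the micro-batch size times the number of layers whose activations are held, checkpointing is assumed to hold about 2 * sqrt(layers) of them (the checkpoints plus one segment being recomputed), and the time penalty uses the rough 30 percent figure from above.

```python
import math

def training_budget(micro_batch, accum_steps, layers, checkpointing):
    """Return (effective_batch, relative_memory, relative_time) under a toy cost model."""
    effective_batch = micro_batch * accum_steps
    # Toy assumption: activation memory ~ micro-batch size * layers whose activations are held.
    layers_held = 2 * math.sqrt(layers) if checkpointing else layers
    relative_memory = micro_batch * layers_held
    # Toy assumption: one optimizer step costs accum_steps micro-batch passes,
    # and checkpointing adds roughly 30 percent on top.
    relative_time = accum_steps * (1.3 if checkpointing else 1.0)
    return effective_batch, relative_memory, relative_time

for accum_steps in (1, 4, 16):
    for checkpointing in (False, True):
        print(accum_steps, checkpointing, training_budget(8, accum_steps, 36, checkpointing))
```

Changing accum_steps moves the effective batch and the time but leaves the memory figure alone, while switching checkpointing on lowers the memory figure and raises the time. The absolute values mean nothing; only the direction of each change is worth reading from this sketch.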

A few traps worth remembering

Every idea here has a sharp edge that catches people the first time. These are the ones to keep in your pocket.

[Interactive widget: the list of traps is not available in this text version.]

How this shows up in a real language model

Neither of these is a toy idea you will outgrow; both live in real training code. A training script usually exposes a grad_accum setting, and it scales each micro-batch loss by 1 / grad_accum so that the summed gradient comes out equal to the average over the whole effective batch (otherwise the gradient would be too large by a factor of grad_accum). Checkpointing is opt-in: you wrap each transformer block in torch.utils.checkpoint.checkpoint(block, x), and PyTorch handles recomputing that block’s activations during the backward pass for you. The thing to hold onto is that both of these are pure memory and compute knobs. They do not change the math, and apart from tiny floating-point rounding differences they do not change the weights the model ends up with. They only change how the work fits into the room you have.
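The claim that these knobs leave the math alone can be checked on a small scale. The sketch below assumes PyTorch is installed. The stack of small linear blocks is a hypothetical stand-in for real transformer blocks, and the helper names grads and close are invented for this example. It compares parameter gradients from a plain run against a checkpointed run, a run with scaled accumulation, and a run where the 1 / grad_accum scaling was forgotten.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Hypothetical stand-in for a stack of transformer blocks.
blocks = nn.ModuleList([nn.Sequential(nn.Linear(16, 16), nn.Tanh()) for _ in range(4)])
x = torch.randn(8, 16)  # one "big" batch of 8 examples

def grads(use_checkpoint=False, grad_accum=1, scale_loss=True):
    """Run forward + backward over x in grad_accum micro-batches; return gradient copies."""
    blocks.zero_grad(set_to_none=True)
    for micro in x.chunk(grad_accum):
        h = micro
        for block in blocks:
            h = checkpoint(block, h, use_reentrant=False) if use_checkpoint else block(h)
        loss = h.pow(2).mean()
        if scale_loss:
            loss = loss / grad_accum  # so the summed gradients form an average
        loss.backward()               # gradients add up in p.grad across micro-batches
    return [p.grad.clone() for p in blocks.parameters()]

def close(a, b):
    """True if two lists of gradient tensors match within a small tolerance."""
    return all(torch.allclose(u, v, atol=1e-6) for u, v in zip(a, b))

baseline = grads()
same_with_checkpoint = close(baseline, grads(use_checkpoint=True))
same_with_accum = close(baseline, grads(grad_accum=4))
unscaled = grads(grad_accum=4, scale_loss=False)
unscaled_is_4x = close([4 * g for g in baseline], unscaled)
print(same_with_checkpoint, same_with_accum, unscaled_is_4x)
```

All three checks should come out true: checkpointing and scaled accumulation reproduce the baseline gradients, and dropping the scaling inflates them by exactly the number of accumulation steps.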

The whole thing, runnable

If you have PyTorch installed, the snippet below shows both tricks in their natural setting. Read the comments as you go; they walk through exactly what each line is doing.

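import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyLM(nn.Module):                       # toy stand-in for a real language model (illustrative only)
    def __init__(self, vocab=100, dim=32, n_blocks=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)    # turn token ids into vectors
        self.blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks)])
        self.ln_f, self.head = nn.LayerNorm(dim), nn.Linear(dim, vocab)
    # Activation checkpointing, used inside the model's own forward pass:
    def forward(self, idx, targets):
        x = self.emb(idx)
        for block in self.blocks:              # walk through each block in turn
            x = checkpoint(block, x, use_reentrant=False)  # drop this block's activations, recompute them in backward
        logits = self.head(self.ln_f(x))       # final layers turn the result into an answer
        return logits, nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())

def get_batch(micro_bs=4, seq_len=8, vocab=100):  # random tokens stand in for real training data
    idx = torch.randint(0, vocab, (micro_bs, seq_len))
    return idx, idx                            # toy target: predict the input token itself

model, grad_accum = TinyLM(), 4
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Gradient accumulation: the effective batch = micro_bs * grad_accum.
opt.zero_grad(set_to_none=True)            # clear out any leftover gradients first
for _ in range(grad_accum):                # run one micro-batch per loop
    logits, loss = model(*get_batch())     # forward pass on a small micro-batch
    (loss / grad_accum).backward()         # divide so the summed grads come out as an average
opt.step()                                 # take ONE optimizer step after all pieces are summed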