lettuceresearch

A personal research archive for AI work in progress: published openly, discussed in the open.
https://lettuceresearch.com/
en-us

Training across many GPUs without losing your mind
https://lettuceresearch.com/articles/distributed-training/
When one GPU is too slow, the trick is wonderfully simple: put a full copy of your model on every GPU, feed each copy a different slice of the data, and then average their lessons so all the copies stay perfectly in sync. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Fri, 19 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, distributed, ddp, multi-gpu, interactive

Weight initialization and the trick of sharing one matrix
https://lettuceresearch.com/articles/weight-init/
Before a network learns anything, every one of its numbers has to start somewhere. The starting values turn out to matter enormously: pick them too big or too small and learning stalls before it begins. Here is how to choose them well, plus a lovely space-saving trick called weight tying, all explained assuming you have never touched PyTorch.
Thu, 18 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, initialization, weight-tying, interactive

Saving and loading checkpoints so you never lose a training run
https://lettuceresearch.com/articles/gradient-checkpointing/
A checkpoint is much more than the model's weights. To pick up a training run exactly where it stopped, you also need the optimizer's memory, the step counter, and the random number state. We will build that complete bundle together, save it, load it back, and see why each piece matters, all assuming you have never touched PyTorch.
Wed, 17 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, checkpoints, training, interactive

How a language model writes one word at a time
https://lettuceresearch.com/articles/generation/
A language model does not write whole sentences in one go. It guesses the next word, adds it, then guesses again, over and over. A few friendly knobs let you steer how adventurous those guesses are, and you can turn each one yourself and watch a sentence appear.
Tue, 16 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, generation, sampling, inference, interactive

Fitting a bigger model into a smaller machine
https://lettuceresearch.com/articles/scaling-up/
Sometimes the model you want to train is simply too big for the memory you have. Two gentle tricks come to the rescue: gradient accumulation lets you act as if you trained on a large batch while only ever holding a tiny one, and activation checkpointing trades a little extra computing time for a much smaller memory footprint. We build both up slowly, assuming you have never touched PyTorch.
Mon, 15 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, memory, checkpointing, scaling, interactive

Mixed precision and the art of using fewer bits
https://lettuceresearch.com/articles/mixed-precision/
Numbers inside a neural network are usually stored in a big, careful format. If we switch most of them to a smaller, lighter format, training gets roughly twice as fast and uses about half the memory. The whole skill is knowing which few numbers we must leave in the careful format, and this walks you through it slowly, assuming you have never touched PyTorch.
Sun, 14 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, mixed-precision, bf16, autocast, interactive

The training loop, where a model actually learns
https://lettuceresearch.com/articles/training-loop/
This is the heartbeat of every neural network. One step is just five small moves done in a fixed order, repeated again and again, and slowly the model gets better. We will walk through each move gently, assuming you have never written a line of PyTorch.
Sat, 13 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, training, optimization, interactive

Learning rate schedules, and why the speed of learning changes over time
https://lettuceresearch.com/articles/lr-schedules/
A model learns by taking small steps, and how big those steps are matters more than almost anything else. Start too fast and everything falls apart; stay slow forever and you never arrive. Here is the gentle idea of changing your step size as training goes on, built up slowly, assuming you have never touched PyTorch.
Fri, 12 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, learning-rate, scheduling, training, interactive

Optimizers, and how a model actually learns with AdamW
https://lettuceresearch.com/articles/optimizers/
A model learns by nudging its numbers in the right direction, over and over. The optimizer is the part that decides how big each nudge should be, and AdamW is the one almost everyone reaches for. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Thu, 11 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, optimizers, adamw, interactive

Cross-entropy loss made friendly
https://lettuceresearch.com/articles/cross-entropy/
Every time a language model guesses the next word, we need one honest number that says how good that guess was. Cross-entropy is that number, and the rule it follows is simple: being confident and right is cheap, while being confident and wrong is expensive. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Wed, 10 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, cross-entropy, loss, interactive

Dropout, and why we switch neurons off on purpose
https://lettuceresearch.com/articles/dropout-and-regularization/
When a network leans too hard on a few neurons, it memorises instead of truly learning. Dropout gently breaks that habit by hiding random neurons while the model trains, and one small rescaling trick keeps everything in balance. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Tue, 09 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, regularization, dropout, interactive

Self-attention from scratch
https://lettuceresearch.com/articles/self-attention/
Imagine every word in a sentence quietly asking the words before it, who here matters to me right now. Self-attention is exactly that conversation, written out in math. We will build it up one gentle step at a time, assuming you have never touched PyTorch.
Mon, 08 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, attention, transformers, interactive

Activations and softmax, the curves that bring a network to life
https://lettuceresearch.com/articles/activations/
A deep network needs a little bend in it, or every layer just collapses back into one. Here we meet the small functions that add that bend, and then meet softmax, the trick that turns raw scores into honest probabilities. We build both up slowly, assuming you have never opened PyTorch.
Sun, 07 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, activations, softmax, interactive

LayerNorm and RMSNorm, the gentle reset button for deep networks
https://lettuceresearch.com/articles/normalization/
Stack enough layers and the numbers flowing through a network tend to drift, growing huge or shrinking to nothing until training falls apart. Normalization is a small, reliable trick that resets those numbers to a sane size at every layer. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Sat, 06 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, normalization, layernorm, rmsnorm, interactive

nn.Linear, the workhorse behind almost everything
https://lettuceresearch.com/articles/linear-layers/
Inside a language model, the same simple building block shows up again and again: the attention projections, the feed forward block, the final layer that picks the next word. They are all one humble layer called nn.Linear. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Fri, 05 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, linear, projections, interactive

Embeddings, the lookup table that turns words into vectors
https://lettuceresearch.com/articles/embeddings/
Before a language model can do any math, it has to turn words into numbers. An embedding is just a trainable lookup table that hands each word its own little list of numbers, and the model slowly tunes those numbers as it learns. Here is the whole idea, built up gently, assuming you have never touched PyTorch.
Thu, 04 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, embedding, transformers, interactive

nn.Module, the building block you assemble models from
https://lettuceresearch.com/articles/modules/
Every layer and every full model in PyTorch is built from one friendly class called nn.Module. It quietly keeps track of all the numbers your model needs to learn, lets you save and load the whole thing in one line, and switches the entire model between training and using mode with a single call. Here is the whole idea, built up slowly, assuming you have never touched PyTorch.
Wed, 03 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, nn-module, parameters, interactive

Autograd, the engine that figures out gradients for you
https://lettuceresearch.com/articles/autograd/
You only ever write the easy part, the forward calculation. PyTorch quietly remembers everything you did and then hands you every gradient for free. We will build a tiny example, press one button, and watch the numbers flow backward. No PyTorch experience needed.
Tue, 02 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, autograd, gradients, backprop, interactive

Tensors, the one object that PyTorch is built on
https://lettuceresearch.com/articles/tensors-the-atom-of-everything/
Every number inside a language model lives in a tensor, so if you understand this one object you understand most of PyTorch. We will build the idea up slowly by poking at live ones: change a shape, swap a stride, hide the future, and watch the numbers move. No prior PyTorch needed.
Mon, 01 Jun 2026 00:00:00 GMT
PyTorch for LLMs
pytorch, tensors, foundations, interactive