The conventional assumption in large model training is that parameters live on GPU memory, which means model size is bounded by how many GPUs you can coordinate. MegaTrain inverts this. It treats the GPU as a transient compute engine and stores parameters and optimiser states in CPU host memory. A pipelined double-buffered execution engine overlaps data movement with computation across multiple CUDA streams, and stateless layer templates replace persistent autograd graphs to eliminate graph metadata overhead. The result: full-precision training of a 120B parameter model on a single H200 GPU with 1.5TB of host memory, and 7B models with 512k token contexts on hardware that doesn’t require a cluster.
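The core of the execution engine is the double-buffering schedule: while the GPU computes layer i, the weights for layer i+1 are already in flight from host memory on a separate stream. A minimal sketch of that schedule (function and buffer names are hypothetical, and the sequential Python below only models the buffer rotation, not the actual CUDA stream overlap):

```python
# Illustrative sketch of double-buffered layer streaming (names hypothetical).
# While one buffer's layer computes, the other prefetches the next layer's
# weights from host memory; in MegaTrain the prefetch runs on a second CUDA
# stream so it overlaps with compute.

def run_pipeline(num_layers, fetch, compute):
    """fetch(i) loads layer i's weights into a buffer; compute(i, w) runs layer i."""
    buffers = [None, None]            # two staging slots in GPU memory
    buffers[0] = fetch(0)             # prime the pipeline with layer 0
    for i in range(num_layers):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < num_layers:
            buffers[nxt] = fetch(i + 1)   # overlapped with compute in the real system
        compute(i, buffers[cur])
        buffers[cur] = None               # free the slot for layer i + 2

# Trace the schedule on a 3-layer toy model.
events = []
run_pipeline(
    3,
    fetch=lambda i: (events.append(f"fetch {i}"), f"w{i}")[1],
    compute=lambda i, w: events.append(f"compute {i} with {w}"),
)
print(events)
# → fetch i+1 is always issued before compute i, which is what lets the
#   two operations run concurrently on separate streams.
```

The key property is visible in the trace: each `fetch i+1` is issued before `compute i` begins, so on real hardware the transfer and the kernel can execute concurrently.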
MegaTrain achieves a 1.84x throughput improvement over DeepSpeed ZeRO-3 on 14B models. That’s a meaningful comparison because ZeRO-3 is the standard baseline for distributed training with aggressive memory offloading. Beating it on a single GPU while maintaining full precision isn’t a toy result. The architecture choice (CPU host memory as the primary store) has been explored before, but the pipelining approach that keeps the GPU occupied while parameters are being swapped is what makes the throughput competitive rather than just theoretically possible.
The practical ceiling is the 1.5TB host memory requirement. An H200 GPU paired with enough CPU RAM to hit that figure is not consumer hardware, but it is significantly cheaper than the multi-node cluster that 100B training previously required. For teams doing research, fine-tuning experiments, or production training runs on large models, the reduction in infrastructure complexity is as significant as the cost savings. Multi-GPU training introduces distributed systems problems: communication overhead, failure modes that span nodes, checkpoint complexity. A single-GPU setup eliminates that entire class of problem, which matters for iteration speed.
The 512k context figure for 7B models is the detail most practitioners should sit with. Long-context training has been limited by the quadratic memory cost of attention and the GPU memory ceiling. Offloading state to CPU host memory effectively decouples context length from GPU memory capacity, at the cost of throughput. For use cases where context length matters more than training speed, such as document-processing models, code understanding, or long-form reasoning tasks, this unlocks experiments that weren’t practically feasible on single-machine setups.
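Some back-of-envelope arithmetic makes the decoupling concrete. The model dimensions below are assumed (a typical 7B-class configuration of 32 layers and 4096 hidden size, not figures from the paper), but they show why a 512k context cannot live in GPU memory alone:

```python
# Illustrative memory arithmetic for 512k-token contexts.
# Model dimensions are ASSUMED (typical 7B-class config), not from the paper.

GiB = 2**30
seq_len = 512 * 1024              # 512k tokens

# Naive attention materialises an n x n score matrix per head.
score_bytes = seq_len**2 * 2      # fp16
print(score_bytes / GiB)          # 512.0 GiB -- for a SINGLE head

# K and V activations for a hypothetical 7B-class model:
# 32 layers, 4096 hidden, fp16 K and V per layer.
kv_per_token = 2 * 32 * 4096 * 2  # bytes per token
kv_total = kv_per_token * seq_len
print(kv_total / GiB)             # 256.0 GiB, vs roughly 141 GB of H200 HBM
```

Even with memory-efficient attention kernels that avoid the full score matrix, the KV activations alone exceed the GPU’s HBM, which is why spilling state to host memory is what makes the 512k figure reachable.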
The limitation to note: host memory bandwidth is much lower than the GPU’s HBM bandwidth, so compute utilisation depends heavily on the pipelining implementation. The authors’ double-buffering approach addresses this, but the technique requires careful tuning for different model architectures and batch sizes. This is a research prototype that demonstrates the approach works, not a drop-in replacement for your training stack. The significance is the existence proof: if the memory model is right, single-GPU training of 100B models is tractable. That changes what’s worth attempting.
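The size of the gap the pipeline has to hide can be sketched with rough numbers (approximate interconnect and HBM figures, not measurements from the paper):

```python
# Rough bandwidth arithmetic (approximate figures, not from the paper):
# streaming parameters from host memory is the cost the pipeline must hide.

params = 120e9                    # 120B parameters
bytes_fp32 = params * 4           # 480 GB of full-precision weights

pcie_bps = 64e9                   # ~PCIe 5.0 x16, GPU <-> host
hbm_bps = 4.8e12                  # ~H200 HBM3e

t_pcie = bytes_fp32 / pcie_bps    # seconds to stream every weight in once
t_hbm = bytes_fp32 / hbm_bps
print(t_pcie)                     # 7.5 s per one-way sweep over the interconnect
print(t_hbm)                      # 0.1 s if the same weights lived in HBM
```

A ~75x gap between the two transfer times is why utilisation hinges on overlap: every second of parameter movement that isn’t hidden behind compute is a second the GPU sits idle.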