Why Your GPU Isn't Going Brrrr: Compute, Memory, and Overhead from First Principles

Performance tuning for deep learning often devolves into superstition, but Horace He argues it becomes tractable once you classify where time is actually being spent: compute (FLOPS on the GPU), memory bandwidth (moving tensors in and out of DRAM), or overhead (everything else). Each regime calls for different optimizations, and applying the wrong fix wastes effort — adding FLOPS to a bandwidth-bound workload buys nothing, and rewriting in C++ won’t help a matmul-bound model.

Modern accelerators are wildly lopsided toward matrix multiplication. On Nvidia’s A100, Tensor Cores deliver 312 teraflops on matmuls, while everything else runs on general-purpose units that top out at 19.5. Compute capacity also keeps growing faster than memory bandwidth, so it gets harder with each generation to keep those units fed. The lopsidedness is tolerable because non-matmul ops account for a tiny fraction of total FLOPS in models like BERT. Those same ops can still dominate wall-clock time, though, because each kernel reads its inputs from DRAM and writes its results back, and for cheap pointwise ops that traffic costs far more than the arithmetic.
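A back-of-the-envelope estimate makes the imbalance concrete. The sketch below is only a rough model, not a profiler: it uses the 312 and 19.5 teraflop figures quoted above, assumes a DRAM bandwidth of about 1.5 TB/s (a figure not stated in this summary), assumes one FLOP per element for a pointwise op, and ignores overhead entirely. The function name `estimate` and the tensor sizes are illustrative choices.

```python
# Rough model: an op takes as long as the slower of its compute and its memory traffic.
MATMUL_PEAK = 312e12      # FLOP/s on matmuls (figure quoted in the text)
POINTWISE_PEAK = 19.5e12  # FLOP/s on everything else (figure quoted in the text)
DRAM_BANDWIDTH = 1.5e12   # bytes/s; ASSUMED, not taken from the text


def estimate(flops, bytes_moved, peak_flops, bandwidth=DRAM_BANDWIDTH):
    """Return (regime, seconds) for one kernel under this simplified model."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / bandwidth
    regime = "compute-bound" if compute_s >= memory_s else "memory-bound"
    return regime, max(compute_s, memory_s)


# Pointwise op on 2**24 float32 values: read 4 bytes and write 4 bytes per element,
# with roughly one FLOP per element (an assumption for illustration).
n = 2**24
pointwise = estimate(flops=n, bytes_moved=2 * 4 * n, peak_flops=POINTWISE_PEAK)

# Square float16 matmul, 4096 x 4096: 2*m^3 FLOPs, three matrices of 2-byte elements.
m = 4096
matmul = estimate(flops=2 * m**3, bytes_moved=3 * 2 * m * m, peak_flops=MATMUL_PEAK)

print("pointwise:", pointwise[0])
print("matmul:   ", matmul[0])
```

Under these assumptions the pointwise op spends far longer moving bytes than computing, while the matmul is limited by arithmetic, which is the split the paragraph above describes.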

The canonical fix is operator fusion: instead of writing intermediate results back to global memory between pointwise ops, chain them into a single kernel so the data stays near the compute. Horace He frames this with a factory analogy (DRAM as the warehouse, SRAM as the factory floor, bandwidth as the trucks) to make clear why fusion pays off. Unfused, x.cos().cos() makes four trips to global memory: read x, write the intermediate, read it back, write the result. Fused, it makes two, which halves the memory traffic and roughly doubles throughput. The broader lesson: profile to identify your regime first, then pick the optimization that actually targets it.
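The traffic accounting behind the x.cos().cos() example can be sketched in plain Python. This is an illustration of the bookkeeping only, not GPU code: each list comprehension stands in for one kernel, the list `tmp` stands in for an intermediate tensor written to global memory, and the names `unfused_cos_cos`, `fused_cos_cos`, and `traffic_bytes` are hypothetical helpers invented for this sketch.

```python
import math


def unfused_cos_cos(x):
    """Two 'kernels': the intermediate is materialized between them."""
    tmp = [math.cos(v) for v in x]      # kernel 1: read x, write tmp
    return [math.cos(v) for v in tmp]   # kernel 2: read tmp, write result


def fused_cos_cos(x):
    """One 'kernel': each value is read once, transformed twice, written once."""
    return [math.cos(math.cos(v)) for v in x]


def traffic_bytes(n_elements, bytes_per_element, n_ops, fused):
    """Global-memory bytes moved by a chain of pointwise ops.

    Every kernel reads its input once and writes its output once, so an
    unfused chain pays that cost n_ops times and a fused chain pays it once.
    """
    kernels = 1 if fused else n_ops
    return 2 * kernels * n_elements * bytes_per_element


x = [0.0, 0.5, 1.0, 2.0]
assert unfused_cos_cos(x) == fused_cos_cos(x)  # same result either way

n = 1_000_000
unfused = traffic_bytes(n, 4, n_ops=2, fused=False)
fused = traffic_bytes(n, 4, n_ops=2, fused=True)
print(unfused, fused, unfused / fused)
```

The ratio grows with the length of the chain: under this model, fusing k pointwise ops cuts global-memory traffic by a factor of k, which is why fusion matters most for long runs of cheap elementwise work.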