Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. This is deceptive. It’s much more useful to relate memory bandwidth to, say, the number of clock cycles or instructions being executed, to get a feel for what you can (and can’t) get away with.
Let’s start with a historical example: the MOS 6502, first released in 1975 – 42 years ago, and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline. A typical 6502 instruction took 3-5 cycles; some instructions with more complex addressing modes took longer, a few were quicker, and there was some overlapping of the fetch of the next instruction with execution of the current instruction, but no full pipelining like you’d see in later (and more complex) workstation and then microcomputer CPUs, starting around the mid-80s. That gives us a baseline of 1 byte/cycle and, let’s say, about 4 bytes/instruction of memory bandwidth on a 40-year-old CPU. A large fraction of that bandwidth went simply into fetching instruction bytes.
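To make the baseline concrete, here is a minimal sketch of that arithmetic. The constant names are made up for illustration, and the 4 cycles/instruction average is just the rough midpoint of the 3-5 cycle range above, not a measured figure.

```python
# 6502 baseline, using the round figures from the text.
CLOCK_HZ = 1_000_000          # typical clock rate: 1 MHz
BYTES_PER_CYCLE = 1           # one 1-byte memory access per clock cycle
CYCLES_PER_INSTRUCTION = 4    # assumed average, midpoint of "3-5 cycles"

# Absolute bandwidth: bytes moved per second.
bandwidth_bytes_per_s = CLOCK_HZ * BYTES_PER_CYCLE

# Relative bandwidth: bytes available per executed instruction.
bytes_per_instruction = BYTES_PER_CYCLE * CYCLES_PER_INSTRUCTION

print(f"absolute bandwidth: {bandwidth_bytes_per_s / 1e6:.0f} MB/s")
print(f"bytes per instruction: {bytes_per_instruction}")
```

The absolute number (1 MB/s) looks tiny, but the relative one (4 bytes per instruction) is the figure to compare against later hardware.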
Next, let’s look at a recent (as of this writing) and relatively high-end desktop CPU. An Intel Core i7-7700K has about 50GB/s of memory bandwidth and 4 cores, so if all 4 cores are under equal load, they get about 12.5GB/s each. They also clock at about 4.2GHz (it’s safe to assume that with all 4 cores active and hitting memory, none of them is going to be in “turbo boost” mode), so they come in just under 3 bytes per cycle of memory bandwidth. Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle. So well-optimized code running with all cores busy has about 1 byte/instruction of available memory bandwidth. Note that we’re 40 years of Moore’s law scaling later and the available memory bandwidth per instruction has gone down substantially. And while the 6502 is an 8-bit microprocessor doing 8-bit operations, these modern cores can execute multiple (again, usually up to three) 256-bit SIMD operations in one cycle; if we treat the CPU like a GPU and count each 32-bit vector lane as a separate “thread” (appropriate when running SIMT/SPMD-style code), then we get 24 “instructions” executed per cycle and a memory bandwidth of about 0.125 bytes per cycle per “SIMT thread”, or, less unwieldy, one byte every 8 “instructions”.
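The chain of divisions in that paragraph is easy to lose track of, so here is a small sketch that reproduces it. The helper `bytes_per_cycle` and the variable names are hypothetical; the inputs are the approximate figures quoted above, and 3 instructions per cycle is the text's estimate for well-optimized code.

```python
def bytes_per_cycle(bandwidth_gb_s, clock_ghz, units=1):
    """Memory bandwidth per clock cycle, split evenly across `units`."""
    # GB/s divided by GHz (Gcycles/s) leaves bytes/cycle.
    return bandwidth_gb_s / clock_ghz / units

# Approximate i7-7700K figures from the text.
BANDWIDTH_GB_S = 50.0
CLOCK_GHZ = 4.2
CORES = 4
INSTRUCTIONS_PER_CYCLE = 3     # well-optimized code, per the text
LANES = 256 // 32              # 32-bit lanes in one 256-bit SIMD op

per_core = bytes_per_cycle(BANDWIDTH_GB_S, CLOCK_GHZ, units=CORES)
per_instruction = per_core / INSTRUCTIONS_PER_CYCLE
per_lane = per_core / (INSTRUCTIONS_PER_CYCLE * LANES)

print(f"bytes/cycle per core:      {per_core:.2f}")         # just under 3
print(f"bytes per instruction:     {per_instruction:.2f}")  # about 1
print(f"bytes/cycle per SIMT lane: {per_lane:.3f}")
print(f"lane instructions per byte: {1 / per_lane:.1f}")    # about 8
```

The 0.125 in the text comes from rounding the per-core figure up to 3 bytes/cycle first; without that rounding the per-lane number is marginally lower.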
It gets even worse if we look at GPUs. Now, GPUs generally look like they have insanely high memory bandwidths. But they also have a lot of compute units and (by CPU standards) extremely small amounts of cache per “thread” (invocation, lane, CUDA core, pick your terminology of choice). Let’s take the (again quite recent as of this writing) NVIDIA GeForce GTX 1080 Ti as an example. It has (as per Wikipedia) a memory bandwidth of 484GB/s, with a stock core clock of about 1.48GHz, for an overall memory bandwidth of about 327 bytes/cycle for the whole GPU. However, this GPU has 28 “Streaming Multiprocessors” (SMs, roughly comparable to CPU cores) and 3584 “CUDA cores” (SIMT lanes). We get about 11.7 bytes/cycle per SM, so about 4x what the i7-7700K core gets; that sounds good, but each SM drives 128 “CUDA cores”, each corresponding to a thread in the SIMT programming model. Per thread, we get about 0.09 bytes of memory bandwidth per cycle – or perhaps less awkward at this scale, one byte every 11 instructions.
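The same back-of-the-envelope calculation for the GPU, as a sketch. The constant names are made up for illustration, the inputs are the figures quoted above, and the last step assumes each lane issues one instruction per cycle.

```python
# GTX 1080 Ti figures as quoted in the text.
BANDWIDTH_GB_S = 484.0
CLOCK_GHZ = 1.48
SM_COUNT = 28
CUDA_CORES = 3584

# GB/s divided by GHz leaves bytes per clock cycle.
gpu_bytes_per_cycle = BANDWIDTH_GB_S / CLOCK_GHZ
sm_bytes_per_cycle = gpu_bytes_per_cycle / SM_COUNT

# Each SM drives an equal share of the SIMT lanes.
lanes_per_sm = CUDA_CORES // SM_COUNT
lane_bytes_per_cycle = sm_bytes_per_cycle / lanes_per_sm

# Assumption: one instruction per lane per cycle.
instructions_per_byte = 1 / lane_bytes_per_cycle

print(f"whole GPU: {gpu_bytes_per_cycle:.0f} bytes/cycle")
print(f"per SM:    {sm_bytes_per_cycle:.1f} bytes/cycle")
print(f"per lane:  {lane_bytes_per_cycle:.2f} bytes/cycle "
      f"({lanes_per_sm} lanes per SM)")
print(f"one byte every {instructions_per_byte:.0f} instructions")
```

Swapping in the numbers for a different GPU gives a quick feel for how much arithmetic a kernel needs to do per byte fetched before it stops being bandwidth-bound.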
That, in short, is why everything keeps getting more and larger caches, and why even desktop GPUs have started using tile-based rendering approaches (some quietly, others announcing it openly). Absolute memory bandwidths in consumer devices have gone up by several orders of magnitude from the ~1MB/s of early-80s home computers, but available compute resources have grown much faster still, and the only way to stop bumping into bandwidth limits all the time is to make sure your workloads have reasonable locality of reference so that the caches can do their job.
Final disclaimer: bandwidth is only one part of the equation. Not considered here is memory latency (and that’s a topic for a different post). The good news is absolute DRAM latencies have gone down since the 80s – by a factor of about 4-5 or so. The bad news is that clock rates have increased by a factor of about 3000 since then – oops. CPUs generally hit much lower memory latencies than GPUs (and are designed for low-latency operation in general) whereas GPUs are all about throughput. When CPU code is limited by memory, it is more commonly due to latency than bandwidth issues (running out of independent work to do while waiting for a memory access). GPU kernels have tons of runnable warps at the same time, and are built to schedule something else during the wait; on GPUs, it’s much easier to run into bandwidth issues.
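To put the “oops” in numbers, here is a rough sketch that combines the two factors above into latency measured in clock cycles rather than in seconds. Both inputs are the approximate figures from the text, and the variable names are made up for illustration.

```python
# Rough factors from the text, relative to the 80s.
CLOCK_SPEEDUP = 3000                  # clock rates went up ~3000x
LATENCY_IMPROVEMENTS = (4, 5)         # DRAM latency went down ~4-5x

# Latency in cycles = latency in seconds * clock rate, so its growth
# is the clock speedup divided by the latency improvement.
cycle_latency_growth = [CLOCK_SPEEDUP / f for f in LATENCY_IMPROVEMENTS]

for factor, growth in zip(LATENCY_IMPROVEMENTS, cycle_latency_growth):
    print(f"latency {factor}x better in seconds -> "
          f"{growth:.0f}x worse in cycles")
```

In other words, a memory access that cost a cycle or so back then costs on the order of several hundred cycles now, by these figures.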