Skip to content

Memory bandwidth

Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. This is deceptive. It’s much more useful to relate memory bandwidth to say the number of clock cycles or instructions being executed, to get a feel for what you can (and can’t) get away with.

Let’s start with a historical example: the MOS 6502, first released in 1975 – 42 years ago, and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline. A typical 6502 instruction took 3-5 cycles; some instructions with more complex addressing modes took longer, a few were quicker, and there was some overlapping of the fetch of the next instruction with execution of the current instruction, but no full pipelining like you’d see in later (and more complex) workstation and then microcomputer CPUs, starting around the mid-80s. That gives us a baseline of 1 byte/cycle and let’s say about 4 bytes/instruction memory bandwidth on a 40-year old CPU. A large fraction of that bandwidth went simply into fetching instruction bytes.

Next, let’s look at a recent (as of this writing) and relatively high-end desktop CPU. An Intel Core i7-7700K, has about 50GB/s and 4 cores, so if all 4 cores are under equal load, they get about 12.5GB/s each. They also clock at about 4.2GHz (it’s safe to assume that with all 4 cores active and hitting memory, none of them is going to be in “turbo boost” mode), so they come in just under 3 bytes per cycle of memory bandwidth. Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle. So well-optimized code running with all cores busy has about 1 byte/instruction of available memory bandwidth. Note that we’re 40 years of Moore’s law scaling later and the available memory bandwidth per instruction has gone down substantially. And while the 6502 is a 8-bit microprocessor doing 8-bit operations, these modern cores can execute multiple (again, usually up to three) 256-bit SIMD operations in one cycle; if we treat the CPU like a GPU and count each 32-bit vector lane as a separate “thread” (appropriate when running SIMT/SPMD-style code), then we get 24 “instructions” executed per cycle and a memory bandwidth of about 0.125 bytes per cycle per “SIMT thread”, or less unwieldy, one byte every 8 “instructions”.

It gets even worse if we look at GPUs. Now, GPUs generally look like they have insanely high memory bandwidths. But they also have a lot of compute units and (by CPU standards) extremely small amounts of cache per “thread” (invocation, lane, CUDA core, pick your terminology of choice). Let’s take the (again quite recent as of this writing) NVidia GeForce GTX 1080Ti as an example. It has (as per Wikipedia) a memory bandwidth of 484GB/s, with a stock core clock of about 1.48GHz, for an overall memory bandwidth of about 327 bytes/cycle for the whole GPU. However, this GPU has 28 “Shading Multiprocessors” (roughly comparable to CPU cores) and 3584 “CUDA cores” (SIMT lanes). We get about 11.7 bytes/cycle per SM, so about 4x what the i7-7700K core gets; that sounds good, but each SM drives 128 “CUDA cores”, each corresponding to a thread in the SIMT programming model. Per thread, we get about 0.09 bytes of memory bandwidth per cycle – or perhaps less awkward at this scale, one byte every 11 instructions.

That, in short, is why everything keeps getting more and larger caches, and why even desktop GPUs have quietly started using tile-based rendering approaches (or just announced so openly). Absolute memory bandwidths in consumer devices have gone up by several orders of magnitude from the ~1MB/s of early 80s home computers, but available compute resources have grown much faster still, and the only way to stop bumping into bandwidth limits all the time is to make sure your workloads have reasonable locality of reference so that the caches can do their job.

Final disclaimer: bandwidth is only one part of the equation. Not considered here is memory latency (and that’s a topic for a different post). The good news is absolute DRAM latencies have gone down since the 80s – by a factor of about 4-5 or so. The bad news is that clock rates have increased by about a factor of 3000 since then – oops. CPUs generally hit much lower memory latencies than GPUs (and are designed for low-latency operation in general) whereas GPUs are all about throughput. When CPU code is limited by memory, it is more commonly due to latency than bandwidth issues (running out of independent work to run while waiting for a memory access). GPU kernels have tons of runnable warps at the same time, and are built to schedule something else during the wait; running on GPUs, it’s much easier to run into bandwidth issues.

Rounding up to the nearest integer that’s congruent to k mod N

It’s fairly well-known (among programmers anyway) that say rounding up x to the nearest multiple of 8 can be accomplished using the formula (x + 7) & ~7, and that in general rounding up to the nearest multiple of N (where N is a power of 2) can be accomplished using (x + N - 1) & ~(N - 1). But sometimes you need a slightly generalized version: round up to the nearest value that is congruent to some k \pmod N; for example, this crops up in boundary tag-using memory allocators when the user requests aligned memory. Such allocators put a header before allocated blocks (just before the address returned to the caller). For the user-visible pointer to be aligned by say 32, that header needs to fall at an address that’s off alignment by a specified distance (which brings us to our problem).

It’s not immediately obvious how to adapt the original formula to this case (there is a way; I’ll get to it in a second). Now this is not exactly a frequent problem, nor is there any real need for a clever solution, but it turns out there is a very nice, satisfying solution anyway, and I wanted to write a few words about it. The solution is simply x + ((k - x) & (N - 1)) for power-of-2 N. The basic approach works in principle for arbitrary N, but x + ((k - x) % N) will not work properly in environments using truncated division where taking the modulus of a negative argument can return negative results, which sadly is most of them. That said, in the remainder of this short post I’ll write % N instead of & (N - 1) with a “N needs to be a power of 2” disclaimer anyway, since there’s really nothing about the method that really requires it. Finally, this expression works fine even in overflowing unsigned integer arithmetic when N is a power of 2, but not for non-power-of-2 N.

What I like about this solution is that, once you see it written down, it’s fairly clear that and why it works (unlike many bit manipulation tricks), provided you know the rules of modular arithmetic: x + ((k - x) \bmod N) \equiv x + (k - x) = k \pmod N. We’re adding a non-negative value to x, so it’s clear that the result is ≥ x (provided there is no overflow). And we’re adding the smallest possible value we can to get to a value that’s congruent to k (mod N); I wrote about similar things before in my post “Intervals in modular arithmetic”.

There’s an equivalent expression for rounding down to the nearest value congruent to k (mod N): x - ((x - k) % N) that works (and is easy to prove) the same way.

It’s interesting to consider the case k=0. The round-down variant, x - (x % N), feels fairly natural and is something I’ve seen in “real-world” code more than once. The round-up variant, x + (-x % N) is something I’ve never seen anywhere. Once you throw the k in there, it all makes sense, but without it the expression looks quite odd.

Finally, here’s the aforementioned way to adapt the “regular” round-up formula to produce a value that’s congruent to k (instead of 0) mod N (and we’re back to requiring power-of-2 N here): ((x - k + N - 1) & ~(N - 1)) + k. This uses a different trick from the intervals in modular arithmetic paper: shift the origin around. In this case, we don’t have a formula for arbitrary k, but we do have a formula to round up to the nearest multiple of N. So we first subtract k; in this new shifted coordinate system, we want to round up to the next-larger multiple of N, which we know how to do. And finally, we add back k. It gets the job done, but it’s not as pretty as the other variant (in my opinion anyway), and it takes some thinking to convince yourself that it works at all.

How many x86 instructions are there?

It’s surprisingly hard to give a good answer (the question was raised in this article). It depends on how you count, and the details are interesting (to me anyway).

To not leave you hanging: Intel has an official x86 encoder/decoder library called XED. According to Intel’s XED, as of this writing, there are 1503 defined x86 instructions (“iclasses” in XED lingo), from AAA to XTEST (this includes AMD-specific extensions too, by the way). Straightforward, right?

Well, it depends on what you wanted to count. For example, as per XED, ADD and LOCK ADD are different “instruction classes”. Many assembly programmers would consider LOCK a prefix and LOCK ADD an addition with said prefix, not a distinct instruction, but XED disagrees. And in fact, for the purposes of execution, so do current x86s. An atomic add does very different things from a regular add. The prefix thing crops up elsewhere: is say MOVSD (copy a single 32-bit word) a different “instruction class” from REP MOVSD (block copy several 32-bit words)? XED says yes. But it doesn’t handle all prefixes this way in all contexts. For example, the operand-size prefix (0x66) turns integer instructions operating on 32-bit registers into the equivalent instruction operating on their lower 16-bit halves, but unlike with the REP or LOCK prefixes, XED does not count these as separate instruction classes. If you disagree about any of these choices, your count will come out different.


It all depends on how precisely we define an instruction. Is it something with a distinct mnemonic? Let’s first look at what the article I quoted above says is by far the most common x86 instruction, at 33% of the total sample set: MOV. So let’s look up MOV in the Intel Architecture manuals. And… there are 3 different top-level entries? “MOV—Move”, “MOV—Move to/from Control registers”, “MOV—Move to/from Debug Registers”. The latter are sufficiently distinct from “regular” MOV to rate their own documentation pages, they have completely different instruction encodings (not even in the same encoding block as regular MOV), and they’re privileged instructions, meaning lowly user-mode code isn’t even allowed to execute them. Consequently they’re also extremely rare, and are likely to account for approximately 0% of the test sample. And, sure enough, XED counts them as separate instruction classes (MOV_CR and MOV_DR).

So these instructions may be called MOV, but they’re weird, special snowflakes, and from the processor’s point of view they’re entirely different instructions in a different part of the encoding space and with different rules. Calling them MOV is essentially nothing but syntactic sugar in the official Intel assembly language.

And on the subject of syntactic sugar: some mnemonics are just aliases. For example, SAL (shift arithmetic left) is a long-standing alias for SHL (shift left). Both are just bit shifts; there is no distinction between “arithmetic” and “logical” left shifts like there is between arithmetic and logical right shifts, but the Intel manuals list SAL (with an encoding that happens to be the same as SHL) and all x86 assemblers I’ve ever used accept it. Hilariously, in official Intel syntax, we’re simultaneously miscounting in the other direction, since at least two mnemonics got assigned twice: we already saw the “copy” variant of MOVSD (which has no explicit operands), but there’s also MOVSD as in “move scalar double” (which always has two explicit operands) which is an entirely different instruction (XED calls it MOVSD_XMM to disambiguate, and the same problem happens with CMPSD).

There’s also SSE compares like CMPSD (the two-operand one!) and CMPPS. XED counts these as one instruction each. But they have an 8-bit immediate constant byte that specifies what type of comparison to perform. But disassemblers usually won’t produce the hard-to-read CMPSD xmm0, xmm1, 2; they’ll disassemble that instruction as the pseudo-instruction CMPLESD (compare scalar doubles for lesser-than-or-equal) instead. So is CMPSD one instruction (just the base opcode with an immediate operand), is it 8 (for the 8 different standard compare modes), or something else?

This is getting messy. AT&T syntax to the rescue? Well, it solves some of our problems but also introduces new ones. For example, AT&T adds suffixes to the mnemonics to distinguish different operation widths. What Intel calls just ADD turns into ADDB (8-bit bytes), ADDW (16-bit “words”), ADDL (32-bit “long words”) and ADDQ (64-bit “quadwords”) in x86-64 AT&T syntax. Do we count these as separate? As per Intel syntax, no. As per XED instruction classes, also no. But maybe we consider these distinct enough to count separately after all? Or maybe we decide that if our definition depends on the choice of assembly syntax, of which there are several, then maybe it’s not a very natural one. What does the machine do?

Instruction bytes

Note I haven’t specified what part of the machine yet. This is thorny too. We’ll get there in a bit.

But first, instruction bytes. Let’s look at the aforementioned manual entry for real now: “MOV—Move”. If you check that page out in the current Intel Architecture Software Developer’s Manual, you’ll find it lists no less than thirty-four encodings (not all of them distinct; I’ll get to that). Some of these are more special, privileged operations with special encodings (namely, moves to and from segment registers). This time, XED doesn’t seem to consider segment register loads and stores to be special and lumps them into plain old MOV, but I consider them distinct, and the machine considers them distinct enough to give them a special opcode byte in the encoding that’s not used for anything else, so let’s call those distinct.

That leaves us with 30 “regular” moves. Which are… somewhat irregular: 10 of them are doing their own thing and involve moves between memory and different parts of the RAX (in 64-bit mode) register, all with a special absolute addressing mode (“moffs”) that shows up in these instructions and, to my knowledge, nowhere else. These instructions exist, and again, pretty much nothing uses them. They were useful on occasion in 16-bit mode but not anymore.

This specialness of the accumulator register is a recurring theme in x86. “op (AL/AX/EAX/RAX), something” has its own encoding (usually smaller) and various quirks for a lot of the instructions that go back to the 8086 days. So even though an asssembly programmer might consider say TEST ebx, 128 and TEST eax, 128 the same instruction (and the XED instruction class list agrees here!), these have different opcodes and different sizes. So a lot of things that look the same in an assembly listing are actually distinct for this fairly random reason. Keep that in mind. But back to our MOV!

The remaining 20 listed MOV variants fall into four distinct categories, each of which has 5 entries. These four categories are:

  • “Load-ish” – move from memory or another same-sized register to a 8/16/32/64-bit register.
  • “Store-ish” – move from a 8/16/32/64-bit register to either another register of the same size, or memory.
  • “Load-immediate-ish” – load an integer constant into a 8/16/32/64-bit register.
  • “Store-immediate-ish” – store an integer constant to either a 8/16/32/64-bit memory location, or a register.

All processor have some equivalent of the first three (the “store immediate” exists in some CPU architectures, but there’s also many that don’t have it). Load/store architectures generally have explicit load and store instructions (hence the name), and everyone has some way to load immediates (large immediate constants often require multiple instructions, but not on x86) and to move the content of one register to another. (Though the latter is not always a dedicated instruction.) So other than the fact that our “load-ish” and “store-ish” instructions also support “storing to” and “loading from” a register (in particular, there’s two distinct ways to encode register-register MOVs), this is not that remarkable. It does explain why MOVs are so common in x86 code: “load”, “store” and “load immediate” in particular are all very common instruction, and MOV subsumes all of them, so of course you see plenty of them.

Anyway, we have four operand sizes, and four categories. So why are there five listed encodings per category? Okay, so this is a bit awkward. x86-64 has 16 general-purpose registers. You can access them as 16 full 64-bit registers. For all 16 registers, you can read from (or write to) their low 32-bit halves. Writing to the low 32-bit half zero-extends (i.e. it sets the high half to zero). For all 16 register, you can read from (or write to) their low 16-bit quarter. Writing to the low 16-bit quarter of a register does not zero-extend; the remaining bits of the register are preserved, because that’s what 32-bit code used to do and AMD decided to preserve that behavior when they specced 64-bit x86 for some reason. And for all 16 registers, you can read from (or write to) their low 8-bit eighth (the lowest byte). Writing the low byte again preserves all the higher bytes, because that’s what 32-bit mode did. With me so far? Great. Because now is when it gets weird. In 16-bit and 32-bit mode, you can also access bits 8 through 15 of the A, B, C and D registers as AH, BH, CH and DH. And x86-64 mode still lets you do that! But due to a quirk of the encoding, that works only if there’s no REX prefix (which is the prefix that is used to extend the addressable register count from 8 to 16) on the instruction.

So x86-64 actually has a total of 20 addressable 8-bit registers, in 3 disjoint sets: AL through DL, which can be used in any encoding. AH through DH, which can only be accessed if no REX prefix is present on the instruction. And the low 8 bits of the remaining 12 registers, which can only be accessed if a REX prefix is present.

This quirk is why Intel lists all 8-bit variants twice: once without REX and one with REX, because they can access slightly different parts of the register space! Alright, but surely, other than that, we must have 4 different opcodes, right? One each for move byte, word, doubleword, quadword?

Nope. Of course not. In fact, in each of these categories, there are two different opcode bytes: one used for 8-bit accesses, and one for “larger than 8-bit”. This dates back to the 8086, which was a 16-bit machine: “8-bit” and “16-bit” was all the distinction needed. Then the 386 came along and needed a way to encode 32-bit destinations, and we got the already mentioned operand size prefix byte. In 32-bit mode (handwaving here, the details are a bit more complicated), the instructions that used to mean 16-bit now default to 32-bit, and getting actual 16-bit instrutions requires an operand size prefix. And I already mentioned that 64-bit mode added its own set of prefixes (REX), and this REX prefix is used to upgrade the now default-32-bit “word” instructions to 64-bit width.

So even though Intel lists 5 different encodings of the instructions in each group, all of which have somewhat different semantics, there’s only 2 opcodes each associated to them: “8-bit” or “not 8-bit”. The rest is handled via prefix bytes. And as we (now) know, there’s lots of different types of MOVs that do very different things, all of which fall under the same XED “instruction class”.

Maybe instruction classes is the wrong metric to use? XED has another, finer-grained thing called “iforms” that considers the different subtypes of instructions separately. For example, for the just-discussed MOV, we get this list:


As you can see, that list basically matches the way the instruction encoding works, where 8-bit anything is considered a separate instruction, but size overrides by way of prefixes are not. So that’s basically the rule for XED iforms: if it’s a separate instruction (or a separate encoding), it gets a new iform. But just modifying the size of an existing instruction (for example, widening MMX instructions to SSE, or changing the size of a MOV via prefix bytes) doesn’t.

So how many x86 instructions are there if we count distinct iforms as distinct? Turns out, an even 6000. Is that all of them? No. There are some undocumented instructions that XED doesn’t include (in addition to the several formerly undocumented instructions that Intel at some point just decided to make official). If you look at the Intel manuals, you’ll find the curious “UD2”, the defined “Undefined instruction” which is architecturally guaranteed to produce an “invalid opcode” exception. As the name suggests, it’s not the first of its kind. Its older colleague “UD1” half-exists, but not officially so. Since the semantics of UD1 are exactly the same as if it was never defined to begin with. Does a non-instruction that is non-defined and unofficially guaranteed to non-execute exactly as if it had never been in the instruction set to begin with count as an x86 instruction? For that matter, does UD2 itself, the defined undefined instruction, count as an instruction?

Instruction decoders

But back to those iforms: 6000 instructions, huh? And these must all be handled in the decoder? That must be terrible.

Well, no. Not really. I mean, it’s not pleasant, but it’s not the end of the world.

First off, let’s talk about how x86 is decoded in the first place: all x86 CPUs you’re likely to interact with can decode (and execute) multiple instructions per cycle. Think about what that means: we have an (aggressively!) variable-length encoding, and we’re continually fetching instructions. These chips can decode (given the right code) 4 instructions per clock cycle. How does that work? They’re variable-length! We may know where the first instruction we’re looking at in this cycle starts, but how does the CPU know where to start decoding the second, third, and fourth instructions? That’s straightforward when your instructions are fixed-size, but for x86 they are most certainly not. And we do need to decide this quickly (within a single cycle), because if we take longer, we don’t know where the last instruction in our current “bundle” ends, and we don’t know where to resume decoding in the next cycle!

You do not have enough time in a 4GHz clock cycle (all 0.25ns of it) to fully decode 4 x86 instructions. For that matter, you don’t even have close to enough time to “fully
decode” (what exactly that means is fuzzy, and I won’t try to make it precise here) one. Two basic ways to proceed: the first is simply, don’t do that! Try to avoid it at all cost. Keep extra predecoding information (such as marking the locations where instructions start) in your instruction cache, or keep a separate decoded cache altogether, like Intels uOp caches. This works, but it doesn’t help you the first time round when you’re running code that isn’t currently cached.

Which brings us to option two: deal with it. And the way to do it is pretty much brute force. Keep a queue of upcoming instruction bytes (this ties in with branch target prediction and other things). As long as there’s enough space in there, you just keep fetching another 16 (or whatever) instruction bytes and throw them into the queue.

Then, for every single byte position in that queue, you pretend that an x86 instruction starts at that byte, and determine how long it is. Just the length. No need to know what the instruction is. No need to know what the operands are, or where the bytes denoting these operands are stored, or whether it’s an invalid encoding, or if it’s a privileged instruction that we’re not allowed to execute. None of that matters at this stage. We just want to know “supposing that this is a valid instruction, what is it’s length?”. But if we add 16 bytes per cycle to the queue, we need 16 of these predecoders in parallel to make sure that we keep up and get an instruction length for every single possible starting location. We can pipeline these predecoders over multiple cycles if necessary; we just keep fetching ahead.

Once our queue is sufficiently full and we know that size estimate for every single location in it, then we decide where the instruction boundaries are. That’s the stage that keeps track. It grabs 16 queue entries (or whatever) starting at the location for the current instruction, and then it just needs to “switch through”. “First instruction says size starting from there is 5 bytes, okay; that means second instruction is at byte 5, and the queue entry says that one’s 3 bytes; okay, third instruction starts at byte 8, 6 bytes”. No computation in that stage, just “table lookups” in the small size table we just spent a few cycles computing.

That’s one way to do it. As said, very much brute force, but it works. However, if you need 16 predecoders (as you do to sustain a fetch rate of 16 bytes/cycle), then you really want these to be as dumb and simple as you can possibly get away with. These things most certainly don’t care about 6000 different iforms. They just squint at the instruction just enough to figure out the size, and leave the rest for later.

Luckily, if you look at the actual opcode map, you’ll see that this is not all that bad. There’s large groups of opcodes that all have basically the same size and operands, just with different operations – which we don’t care about at this stage at all.

And this kind of pattern exists pretty much everywhere. For example, look at that conspicuous, regular block of integer ALU instructions near the top of the opcode map. These all look (and work) pretty similar to the CPU. Most of them have essentially the same encodings (except for a few opcode bits that are different) and the same operand patterns. In fact, the decoder really doesn’t care whether it’s an OR, an ADD, a CMP, or a XOR. To an assembly-language programmer, a compiler, or a disassembler, these are very different instructions. To the CPU instruction decoder, these are all pretty much the same instruction: “ALU something-or-other mumble-mumble don’t care”. Which one of these gets performed will only be decided way later (and probably only after that operation make it to the ALU itself). What the decoder cares about is whether it’s an ALU instruction with an immediate operand, or if it has a memory operand, and what that memory operand looks like. And the instructions are conveniently organized in groups where the answers to these questions are always the same. With plenty of exceptions of course, because this is still x86, but evidently it can be made to work.

Further down the pipe

Instructions really don’t get decoded all at once, in one big “switch statement”, and after that they go to disjoint parts of the chip never to meet again. That’s not how these things are built. There’s plenty of similarity between different instructions, and the “understanding” of what an instruction does is distributed, not centralized.

For example, for the purposes of most of the instruction decoder, the SSE2 instructions ADDPS, SUBPS, MULSD and DIVPD are all pretty much the same thing. They’re FP ALU instructions, they accept the same types of operands, all of which are in the same place.

Some of these instructions are so similar that they’re almost certain to never fully get “decoded”. For example, for IEEE floats, a subtraction is literally just an addition where the sign bit of the second operand is flipped. If you look at the opcode table, the difference between the encoding for ADDPS and SUBPS is precisely one flipped bit: that bit is clear for ADDPS and set for SUBPS. Literally all you need to do to support both instructions is to decode them the same, make sure to grab that one bit from the instruction, and then feed it (along with the original second operand sign bit) into a single XOR gate in front of the FP adder. That’s it. You now support both floating point addition and subtraction.

Some of these differences matter more. For example, FP multiplies go to a different functional unit than FP adds, and they have a different latency. So the data needs a different routing, and the latency for say an add and a multiply is different, which the schedulers care about. If there’s a memory load involved, then the load unit needs to know what size of access, and what part of the operand bypass network to send the results to (integer, float/SIMD?). So there’s a bunch of control signals computed eventually that express the differences between all these instructions. But a lot of it happens really late. There’s certainly no big monolithic 6000-case “switch statement” anywhere.

And then there’s further differences still. For example, MOV elimination. Many x86s can in many cases avoid real execution of register-register MOVs altogether. They just resolve it as part of their register renaming. Likewise, zeroing a register by XORing it with itself (something the author of the original article I linked to) gets resolved by renaming that register to point to a hard-wired zero register and likewise doesn’t actually take any execution resources (even though it still needs to decode).

Where does that fit in our taxonomy? MOV rax, rbx will most often take 0 cycles, but sometimes take 1 cycle due to various reasons. Does the 0-cycle version, when it happens, count as a special instruction? Is XOR rax, rax (which goes down the magic implicit zeroing path and takes 0 cycles to execute) a different instruction from XOR rax, rcx which is encoded essentially the same way? These two instructions differ by exactly 1 bit in both the assembly-language source file and the assembled object code, yet execute in drastically different ways and with different latencies. Should that make them a separate instruction or not? The most useful answer really depends on what part of the pipeline you’re interested in. If you’re designing a CPU core, they pretty much are separate instructions. If you’re writing a disassembler, they are not.

In conclusion…

So, is there a point to all this? I wrote it because I think it’s fun, but is there something to learn here?

I think so. It makes a wonderful example for a general phenomenon I’ve encountered in a lot of different situations: questions to which a ballpark answer is fairly easy to give, but that keeps getting gnarlier the more you try to pin it down. It’s essentially an instance of the “coastline paradox“: the closer you look, the more detail you see, and the more the answer changes.

Suppose I ask you “where am I?”, and I’m okay with getting an answer that’s within about 10 meters or so. If you have a handheld GPS unit, you can just hand it to me, and if I look at the display I’ll get an answer. If I ask “where am I, down to the millimeter?”, things get a lot more complicated. Specifying the position of a person down to a meter or so makes sense, but specifying it down to a millimeter does not. Position of what exactly? My center of gravity? The position of my center of gravity, projected onto the ground? The position of the tip of my nose? The center point of the hangnail on my left pinky? You can’t answer that question precisely when the uncertainty inherent in the question is so much larger than the level of precision you’re aiming at.

And by the way, I used x86 as an example here, but don’t believe for a second the same thing doesn’t apply to, say, the ARM chip in your phone. Modern ARM chips support multiple encodings and also rank over 1000 instructions if you count them at the same level of granularity as XEDs “iforms”. In fact it’s pretty easy to get high instruction counts on any architecture as soon as any kind of vector/SIMD instruction set is involved, since most of them basically come in the form of “instantiate these 40 instructions for 10 different data types” (with lots of special magic that is either typeless only works on certain types, of course). And yeah, x86 has plenty of historical warts in its encoding, but so does ARM – many of them on display in the current generation of chips, where chip makers have the pleasurable task of designing 3 distinct instruction decoders: old-school 32-bit ARM or “A32”, the more compact but variable-size Thumb-2 or “T32”, and the fixed-size-again 64-bit “A64” encoding.

Why do CPUs have multiple cache levels?

This is a reader question from “jlforrest” that seems worth answering in more detail than just a single sentence:

I understand the need for a cache but I don’t understand why there are multiple levels of cache instead of having just one larger level. In other words, let’s say the L1 cache is 32K, the L2 cache is 256K, and the L3 cache is 2M, why not have a single 32K + 256K + 2M L1 cache?

The short version is that the various cache levels have very large variations in how they are designed; they are subject to different constraints and fulfill different purposes. As a rule of thumb, as you climb up the levels of the cache hierarchy, the caches get larger, slower, higher density (more bits stored per unit area), consume less power per bit stored, and gain additional tasks.

In the hopes of building some intuition here, let’s start with an elaborate and somewhat quaint analogy. That’s right, it is…

Cache story time!

Suppose you’re a white-collar office worker in some unnamed sprawling 1960s bureaucracy, with no computers in sight, and your job involves a lot of looking at and cross-referencing case files (here being folders containing sheets of paper).

You have a desk (the L1 data cache). On your desk are the files (cache lines) you’re working on right now, and some other files you recently pulled and are either done with or expect to be looking at again. Working with a file generally means looking at the individual pages it contains (corresponding to bytes in a cache line). But unless they’re on your desk, files are just treated as a unit. You always grab whole files at a time, even if there’s only one page in them that you actually care about right now.

Also in the office is a filing cabinet (L2 cache). That cabinet contains files you’ve handled recently, but aren’t using right now. When you’re done with what’s on your desk, most of these files will go back into that filing cabinet. Grabbing something from the cabinet isn’t immediate – you need to walk up there, open the right drawer and thumb through a few index cards to find the right file – but it’s still pretty quick.

Sometimes other people need to look at a file that’s in your cabinet. There’s a guy with a cart, Buster (representing a sort of ring bus) who just keeps doing his rounds through all the offices. When an office worker needs a file they don’t have in their private cabinet, they just write a small request slip and hand it to Buster. For simplicity’s sake, let’s just say Buster knows where everything is. So the next time he comes by your office, Buster will check if someone requested any files that are in your cabinet, and if so will just silently pull these files out of the cabinet and put them on his cart. The next time he comes by the office of whoever requested the file, he’ll drop the file in their cabinet, and leave a receipt on the desk.

Every once in a while, Buster notices a requested file isn’t in the cabinet, but on the desk instead. In that case, he can’t just silently grab it; he needs to ask the worker at the desk whether they’re done with it, and if no, that worker and the one who put in the request need to agree on what to do. There are tediously elaborate corporate protocols on what to do in that situation (meetings will be called for sure!).

The filing cabinets are usually full. That means Buster can’t just put a new file in; he has to make space first, which he does by grabbing another file, preferably one that hasn’t been used in a while. Those files, Buster takes to the archive in the basement (L3 cache). In the basement, groups of files are kept densely packed in cardboard boxes on industrial shelving units. The regular office workers don’t get to come down here at all; it’s well out of their way and they’re not familiar with the details of the filing system. We leave that to the archival clerks.

Whenever Buster gets down here, he drops all the old files he grabbed for archival in the “in” tray at the front desk. He also drops off all the request slips for files that aren’t in any of the filing cabinets upstairs, but he doesn’t wait around until the clerks bring the files back, since it takes a while. They’re just going to take the request slips, pick up the file in question, and drop it off in the “out” tray whenever they’re done. So every time Buster comes around, he grabs whatever is in the “out” tray, puts it on his cart and takes it to the recipient when he next comes by.

Now, the problem is, there’s a lot of these files, and even with the efficient packing, they’re not even close to fitting in the basement. Most of the files are kept off-site; this is an office building in a nice part of town, and rents over here are way too high to spend that much space on storage. Instead, the company rents warehouse space 30 minutes out of town, where most of the old files are kept (this corresponds to DRAM). At the front desk of the basement sits Megan, the Head Archivist. Megan keeps track of which files are kept in the basement, and which are warehoused. So when Buster drops his request slips in the “in” tray, she checks which of them correspond to files in the basement (to be handled by the archival clerks) and which aren’t on-site. The latter just get added to a big pile of requests. Maybe once or twice a day, they send a van to the warehouse to grab the requested files, along with a corresponding number of old files to be mothballed (as said, the basement is full; they need to make space before they can store any more files from the warehouse).

Buster doesn’t know or care about the details of the whole warehousing operation; that’s Megan’s job. All he knows is that usually, those request slips he hands to the archive are handled pretty quickly, but sometimes they take hours.

Back to the original question

So, what’s the point of this whole elaborate exercise? Briefly, to establish a more concrete model than an opaque “magic cache” that allows us to think more clearly about the logistics involved. The logistics are just as important in designing a chip as they are in running an efficient office.

The original question was “why don’t we build a single large cache, instead of several small ones”. So if you have say a quad-core machine with 4 cores that each have 32KB L1 data cache, 256KB L2 cache, plus 2MB of shared L3 cache, why don’t we just have a ~3MB shared cache to begin with?

In our analogy: for pretty much the same reason we have individual office desks that are maybe 1.50m wide, instead of seating four different people at a single enormous desk that is 150m wide.

The point of having something on the desk is that it’s within easy reach. If we make the desk too big, we defeat that purpose: if we need to walk 50m to get the file we want, the fact that it’s technically “right here on our desk” isn’t really helping. The same goes for L1 caches! Making them bigger makes them (physically) larger. Among other things, this makes accessing them slower and consume more energy (for various reasons). L1 caches are sized so they’re large enough to be useful, but small enough so they’re still fast to access.

A second point is that L1 caches deal with different types of accesses than other levels in the cache hierarchy. First off, there’s several of them: there’s the L1 data cache, but there’s also a L1 instruction cache, and e.g. Intel Core CPUs also have another instruction cache, the uOp cache, which is (depending on your point of view) either a parallel L1 instruction cache or a “L0 instruction cache”.

L1 data caches gets asked to read and write individual items that are most commonly between 1 and 8 bytes in size, somewhat more rarely larger (for SIMD instructions). Cache levels higher up in the hierarchy don’t generally bother with individual bytes. In our office analogy, everything that’s not on a desk somewhere is just handled at the granularity of individual files (or larger), corresponding to cache lines. The same is true in memory subsystems. When a core is performing a memory access, you deal in individual bytes; the higher-level caches generally handle data wholesale, one cache line at a time.

L1 instruction caches have quite different access patterns than data caches do, and unlike the L1 data cache, they’re read-only as far as the core is concerned. (Writing into the instruction cache generally happens indirectly, by putting the data in one of the higher-level unified caches, and then having the instruction cache reload their data from there). For these (and other) reasons, instruction caches are generally built quite differently from data caches; using a single unified L1 cache means that the resulting design needs to meet several conflicting design criteria, forcing compromises that make it worse at either purpose. A unified L1 cache also needs to handle both the instruction and data traffic, which is quite the load!

As an aside: as a programmer, it’s easy to ignore how much cache bandwidth is needed to fetch instructions, but it’s quite a lot. For example, when not running code from the uOp cache, all Intel Core i7 CPU cores can fetch 16 bytes worth of instructions from the L1 instruction cache, every cycle, and will in fact keep doing so as long as instruction execution is keeping up with decoding. At 3GHz, we’re talking on the order of 50GB/s per core here, just for instruction fetches – though, granted, only if the instruction fetch unit is busy all the time, instead of being stalled for some reason or other. In practice, the L2 cache usually only sees a small fraction of this, because L1 instruction caches work quite well. But if you’re designing a unified L1 cache, you need to anticipate at least bursts of both high instruction and high data traffic (think something like a fast memcpy of a few kilobytes of data with both source and destination in the L1 data caches).

This is a general point, by the way. CPU cores can handle many memory accesses per cycle, as long as they all hit within the L1 caches. For a “Haswell” or later Core i7 at 3GHz, we’re talking aggregate code+data L1 bandwidths well over 300GB/s per core if you have just the right instruction mix; very unlikely in practice, but you still get bursts of very hot activity sometimes.

L1 caches are designed to be as fast as possible to handle these bursts of activity when they occur. Only what misses L1 needs to be passed on to higher cache levels, which don’t need to be nearly as fast, nor have as much bandwidth. They can worry more about power efficiency and density instead.

Third point: sharing. A big part of the point of having individual desks in the office analogy, or per-core L1 caches, is that they’re private. If it’s on your private desk, you don’t need to ask anyone; you can just grab it.

This is crucial. In the office analogy, if you were sharing a giant desk with 4 people, you can’t just grab a file. That’s not just your desk, and one of your other 3 co-workers might need that file right now (maybe they’re trying to cross-reference it with another file they just picked up at the other end of the desk!). Every single time you want to pick up something, you need to yell out “everyone OK if I grab this?”, and if someone else wanted it first, you have to wait. Or you can have some scheme where everyone needs to grab a ticket and wait in line until it’s their turn if there’s a conflict. Or something else; the details don’t matter much here, but anything you do requires coordination with others.

The same thing applies with multiple cores sharing a cache. You can’t just start stomping over data unannounced; anything you do in a shared cache needs to be coordinated with all the others you’re sharing it with.

That’s why we have private L1 caches. The L1 cache is your “desk”. While you’re sitting there, you can just go ahead and work. The L2 cache (“filing cabinet”) handles most of the coordination with others. Most of the time, the worker (the CPU core) is sitting at the desk. Buster can just come by, pick up a list of new requests, and put previously requested files into the filing cabinet without interrupting the worker at all.

It’s only when the worker and Buster want to access the filing cabinet at the same time, or when someone else has requested a file that’s lying on the worker’s desk, that they need to stop and talk to each other.

In short, the L1 cache’s job is to serve its CPU core first and foremost. Because it’s private, it requires very little coordination. The L2 cache is still private to the CPU core, but along with just caching, it has an extra responsibility: deal with most bus traffic and communication instead of interrupting the core (which has better things to do).

The L3 cache is a shared resource, and access to it does need to be coordinated globally. In the office analogy, this worked by the workers only having a single way to access it: namely, by going through Buster (the bus). The bus is a choke point; the hope is that the preceding two cache levels have winnowed down the number of memory accesses far enough that this doesn’t end up being a performance bottleneck.


This article covers one particular cache topology that is matches current desktop (and notebook) x86 CPUs: per-core split L1I/L1D caches, per-core unified L2 cache, shared unified L3 cache, with the cores connected via a ring bus.

Not every system looks like this. Some (primarily older) systems don’t have split instruction and data caches; some have full Harvard architectures that treat instruction and data memory as completely separate all the way through. Often L2s are shared between multiple cores (think one office with one filing cabinet and multiple desks). In this type of configuration, the L2 caches effectively act as part of the bus between cores. Many systems don’t have L3 caches, and some have both L3 and L4 caches! I also didn’t talk about systems with multiple CPU sockets etc.

I stuck with a ring bus because it fits in nicely with the analogy. Ring buses are reasonably common. Sometimes (especially when only two or three blocks need to be connected) it’s a full mesh; sometimes it is multiple ring buses connected with a crossbar (which maps reasonably to the office analogy: a multi-story office building with one “Buster” making his rounds per floor, and an elevator connecting the floors).

As a software developer, there’s a natural tendency to assume that you can magically connect module A to module B and the data just teleports from one end to the other. The way memory actually works is incredibly complicated by now, but the abstraction presented to the programmer is just one of a large, flat array of bytes.

Hardware doesn’t work like that. Pieces aren’t magically connected through an invisible ether. Modules A and B aren’t abstract concepts; they’re physical devices, actual tiny machines, that take up actual physical area on a silicon die. Chips have a “floor plan”, and that’s not a vague nod or an in-joke; it’s an actual 2D map of what goes where. If you want to connect A to B, you need to run an actual, physical wire between them. Wires take up space, and driving them takes power (more the longer they are). Running a bunch of wires between A and B means it’s physically blocking area that could be used to connect other things (yes, chips use wires on multiple layers, but it’s still a serious problem; google “routing congestion” if you’re interested). Moving data around on a chip is an actual logistical problem, and a fiendishly complicated one at that.

So while the office thing is tongue-in-cheek, “who needs to talk to whom” and “how does the geometry of this system look – does it admit a sensible layout?” are very relevant questions for both hardware and overall system design that can have a huge impact. Spatial metaphors are a useful way of conceptualizing the underlying reality.

SSE: mind the gap!

SSE and SSE2 are available in every single x86-family CPU with 64-bit support. You too can play around with SIMD, which is great fun! Unfortunately, SSE2 level in particular also happens to be what is probably the most maddeningly non-orthogonal SIMD instruction set in the world, where operations are either available or not available for particular data types with little rhyme or reason, especially where integers are involved. Later revisions (especially starting around SSE4.1) fill in some of the more annoying gaps, but plenty of us are stuck with supporting the older CPUs for at least a few more years, and besides – not to mess with the authentic SSE experience – even on AVX2-supporting CPUs, there’s still a few of the classic gaps remaining.

So, here’s a list of tricks to get you around some of the more common, eh, “idiosyncrasies” of SSE and its descendants. This happens to be mostly focused on the integer side; the floating-point side is generally less, well, weird. I’ll keep the individual descriptions relatively brief since the whole point of this post is to collect lots of tricks. The assumption here is that you’re already somewhat familiar with the instructions, so I’ll not explain the basics (maybe another time). I’ll use the official Intel intrinsics (as exposed in C/C++) since that’s probably the most common way people interact with these instructions intentionally (awkward glance in the direction of auto-vectorization here. No making eye contact. Moving on.)

Branchless “select” (cond ? a : b)

The natural mode of operation in SIMD computations is to do things branchlessly. If some part of a computation is conditional, rather than doing the equivalent of an if, it’s more typical to do both the computation for the “if” and “else” forks, and then merge the results based on the condition. The “select” I mean is the operation which takes the condition and both results and performs the rough equivalent of C’s ternary operator cond ? a : b. You first evaluate both sides, giving a and b. You then evaluate the condition using a SIMD compare, which returns a vector containing a bit mask that is has all bits set for lanes that meet cond, and all bits clear for lanes that don’t.

This select operation can always be done using a few bitwise operations (which is well known), but starting in SSE 4.1 we get slightly more efficient variants too (less well known, and the reason I mention this):

  • Integer (all vers): _mm_or_si128(_mm_and_si128(a, cond), _mm_andnot_si128(cond, b)).
  • 32-bit float (all vers): _mm_or_ps(_mm_and_ps(a, cond), _mm_andnot_ps(cond, b)).
  • 64-bit float (all vers): _mm_or_pd(_mm_and_pd(a, cond), _mm_andnot_pd(cond, b)).
  • Integer (SSE4.1+): _mm_blendv_epi8(a, b, cond).
  • 32-bit float (SSE4.1+): _mm_blendv_ps(a, b, cond).
  • 64-bit float (SSE4.1+): _mm_blendv_pd(a, b, cond).

The andnot operations don’t come in handy very often, but they’re the best choice here (pre-SSE4.1).

If you don’t want to use cond but its logical negation, just switch the positions of a and b, since (!cond) ? a : b is the same as cond ? b : a.

Unsigned integer compares

SSE, in all incarnations, offers precisely two types of integer comparisons: test for equality (PCMPEQt, _mm_cmpeq_T, where t and T stand for various type suffixes) and test for signed greater-than (PCMPGTt, _mm_cmpgt_T). Most other comparison types can be produced using nothing but logical negation and standard identities:

  • a == b is supported directly.
  • a != b is !(a == b).
  • a > b (signed) is supported directly.
  • a < b (signed) is the same as b > a (swap a and b).
  • a >= b (signed) is !(a < b) (which in turn is !(b > a)).
  • a <= b (signed) is !(a > b).

See previous note on selection operations on how to get rid of the NOT in the most common use case. Conspicuously absent from that list is any type of unsigned ordered comparison. However, a trick that works is to bias both integers so that signed comparison does the right thing:

  • a > b (unsigned, 8-bit) is the same as (a - 0x80) > (b - 0x80) (signed, 8-bit).
  • a > b (unsigned, 16-bit) is the same as (a - 0x8000) > (b - 0x8000) (signed, 16-bit).
  • a > b (unsigned, 32-bit) is the same as (a - 0x80000000) > (b - 0x80000000) (signed, 32-bit).

The same argument-swapping and NOT-ing tricks as above still apply to give you the other compare types. In general, the trick is to add (or subtract, or XOR – they all do the same thing in this particular case) the INT_MIN for the respective type from both operands before doing the compare. This turns the smallest possible unsigned integer, 0, into the smallest possible signed integer for the given type; after that, the ordering works out. In particular, when comparing against a constant, this addition (or subtraction, or XOR) can be baked into the constant operand, so the unsigned compare “only” ends up doing one more operation than a signed compare (instead of two).

A completely different approach is to use the unsigned integer min/max instructions (more about those in a second) to build less-or-equal or greater-or-equal comparisons:

  • a <= b if and only if max(a, b) == b.
  • a >= b if and only if min(a, b) == b.

The good news is that this reduces unsigned comparisons to either an unsigned min or a max, followed by an equality comparison, which is only 2 instead of 3 operations. The bad news is that the requisite unsigned min/max operations only exist for uint8s in SSE2. The uint16/uint32 variants were finally added with SSE4.1; if your minimum target is earlier, you’re stuck with the bias-then-compare variants above.

Integer min and max

SSE4.1 has the full set of integer min/max for 8-, 16- and 32-bit types, both signed and unsigned. So if you’re targeting SSE4.1 or later, good for you!

If you’re stuck with anything earlier, you’re decidedly more limited. In SSE2, you get integer min/max for uint8 and int16. If you need min/max for int8, uint16, or anything 32-bit, you’re on your own.

Luckily, we can just combine some of the techniques above to derive a solution. The general patterns here are:

  max(a, b) == (a > b) ? a : b;
  min(a, b) == (a > b) ? b : a;

So this is just a combination of a compare and a “select” operation. When the compare is signed (the int8 and int32 cases), the comparison maps to a single SSE intrinsic. The unsigned compares (uint16 and uint32) can be solved using the bias-then-signed-compare trick which in turn gives us an unsigned min/max.

32-bit and 64-bit loads/stores

This one has nothing to do with the actual instruction set and everything to do with the intrinsics: yes, SSE2 has 32-bit (MOVD) and 64-bit (MOVQ) loads and stores, the standard intrinsics just do their best to confuse you about it:

  • 64-bit loads are _mm_loadl_epi64. This intrinsic takes a __m128i * as an argument. Don’t take that seriously. The actual load is 64-bit sized, not 128-bit sized, and there is no alignment requirement.
  • 64-bit stores are _mm_storel_epi64. Again, the __m128i * is confusing and does not mean that the actual store is 128-bit or that there are alignment requirements. It isn’t and there are not.
  • 32-bit loads are even more hidden! Namely, you write _mm_cvtsi32_si128(*x) where x is a pointer to a 32-bit integer. No direct load intrinsic, but compilers will turn this into a MOVD with memory operand where applicable.
  • 32-bit stores, likewise: *x = _mm_cvtsi128_si32(value). Now you know.


There’s lots of different ways to provide multiplies in a SIMD instruction set, and by now SSE has tried most of them in one form or another.

Let’s start with the (historically) first variant: multiplying 16-bit numbers. The relevant instructions originated in the Pentium MMX and compute the low and high halves (bottom and top 16 bits) of a signed 16-bit×16-bit product. MMX only has signed multiplies, but SSE also added a “high half of unsigned 16-bit times 16-bit product” instruction (the low halves of signed and unsigned products are identical), so we’re not gonna have to worry about that particular problem, not yet anyway.

These instructions are fine if you want the low or high halves of the product. What if you want the full 32-bit product of vectors of 16-bit values? You compute the low and high halves and then merge them using the “unpack” instructions. This is the standard approach, but not very obvious if you haven’t deal with this kind of thing before. So for a full 16×16→32-bit product (note this produces two vectors worth of results), we get:

  // EITHER: a*b (16-bit lanes), signed
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epi16(a, b);

  // OR: a*b (16-bit lanes), unsigned
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epu16(a, b);

  // THEN: merge results
  __m128i res0 = _mm_unpacklo_epi16(lo16, hi16); // result lanes 0..3
  __m128i res1 = _mm_unpackhi_epi16(lo16, hi16); // result lanes 4..7

But what if you’re working with 32-bit values? There is a 32×32→32-bit product (PMULLD / _mm_mullo_epi32), but it was only added with SSE4.1, and it’s significantly slower than the other SSE2 multiplies in many implementations. So you might either not want to set your minimum target that high, or you might be looking for something quicker.

There’s full 32×32→64-bit products, which are available from SSE2 on as
PMULUDQ/_mm_mul_epu32 (unsigned). SSE4.1 adds the signed equivalent PMULDQ/_mm_mul_epi32 (UPDATE: An older version of this post incorrectly stated that PMULDQ was SSE2. Thanks Exophase for pointing it out!). These ones only compute two products (between the even lanes of the two sources) and place them in a 128-bit result. The odd 32-bit lanes are ignored completely, so if you want four 32×32→32-bit products, you need at least two of these multiplies and a lot of wrangling:

  // res = _mm_mullo_epi32(a, b) equivalent using SSE2, via PMULUDQ.

  // even and odd lane products
  __m128i evnp = _mm_mul_epu32(a, b);
  __m128i odda = _mm_srli_epi64(a, 32);
  __m128i oddb = _mm_srli_epi64(b, 32);
  __m128i oddp = _mm_mul_epu32(odda, oddb);

  // merge results
  __m128i evn_mask = _mm_setr_epi32(-1, 0, -1, 0);
  __m128i evn_result = _mm_and_si128(evnp, evn_mask);
  __m128i odd_result = _mm_slli_epi64(oddp, 32);
  __m128i res = _mm_or_si128(evn_result, odd_result);

It works, but it’s a mouthful.

But what if you’re using 32-bit vector lanes, but happen to know that the numbers we’re trying to multiply are in fact in the range [-32768,32767] (i.e. representable as signed 16-bit integers)? We could try narrowing the 32-bit lanes into 16 bits then using the 16×16→32 sequences above, but is that really the best we can do?

It is not: PMADDWD (_mm_madd_epi16), MMX/SSE2’s amazing and strange (but mostly amazing) dot product operation, has our back, for we can do this:

   // a and b have 32-bit lanes with values that fit in int16s.
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // clears high 16 bits of every 32-bit lane
   __m128i bm = _mm_and_si128(b, _mm_set1_epi32(0xffff));

   // after this, madd_epi16 does what we want!
   __m128i res = _mm_madd_epi16(a, bm);

   // can swap role of a and b above too, when convenient.

That’s a lot shorter than narrowing to 16-bit first would be! Alas, it only works for int16 (signed). What if we’re working in 32-bit lanes with values that fit inside a uint16 (unsigned)? It’s not quite as slick, but still, better than narrowing to 16-bit first or dealing with the logistics when synthesizing 32×32→32-bit muls from PMULDQ/PMULUDQ:

   // a and b have 32-bit lanes with values that fit in uint16s,
   // i.e. a[i] == (uint16)a[i] and same for b[i].
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // compute low and high 16-bit products
   __m128i lop = _mm_mullo_epi16(a, b);
   __m128i hip = _mm_mulhi_epu16(a, b);

   // merge results
   __m128i res = _mm_or_si128(lop, _mm_slli_epi32(hip, 16));

Horizontal adds, dot products etc. (float)

SSE3 adds horizontal adds HADDPS (_mm_hadd_ps) and HADDPD (_mm_hadd_pd) and SSE4.1 throws in the dot-product instructions DPPS (_mm_dp_ps) and DPPD (_mm_dp_pd).

Generally, don’t expect these operations to be magic. They exist in the instruction set but are fast precisely nowhere; in all x86 implementations I’m familiar with, they just turn into a canned sequence of more basic (SSE2-level) operations. So more often that not, you will end up requiring a higher minimum CPU target for little to no speed gain. Caveat: these instructions are a smaller than their replacement instruction sequence, so using them can reduce code size slightly. But still, don’t expect this to be fast.

If you want good SIMD performance, don’t lean on horizontal and dot-product style operations; process data in batches (not just one vec4 at a time) and transpose on input, or use a SoA layout to begin with.

The other kind of horizontal adds, dot products etc. (integer)

SSE does have a bunch of horizontal add and dot product-style operations that don’t suck, but they’re on the integer pipe, and not what you’d expect.

Nope, not PHADDW/PHADDD (_mm_hadd_epi16/_mm_hadd_epi32). These are SSSE3 and later only and OK but not particularly great (similar disclaimer as for the floating-point horizontal adds applies).

No, I’m talking about PMADDWD (_mm_madd_epi16, SSE2 with its ancestor around since the original MMX), PSADBW (_mm_sad_epu8, SSE2) and PMADDUBSW (_mm_maddubs_epi16, SSSE3). The official manuals describe what these instructions do, so I won’t bother going into too much detail, but here’s the basic idea: PMADDWD and PMADDUBSW are 2-element dot-product instructions between pairs of adjacent SIMD lanes. PMADDWD computes two int16 by int16 products for each pair of 16-bit lanes and sums the 32-bit integer products to yield the 32-bit result lanes. PMADDUBSW computes two uint8 by int8 products for each pair of 8-bit lanes and sums the 16-bit integer products to yield the 16-bit result lanes. These can be used to compute dot products of this problem; but they also have “degenerate” configurations that are very useful:

  • _mm_madd_epi16(x, _mm_set1_epi16(1)) sums the 16-bit even and odd lanes of x in pairs to yield 32-bit results.
  • _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), _mm_setr_epi8(1, -1, 1, -1, ..., 1, -1)) happens to be the fastest way to compute the 16-bit signed differences between 8-bit unsigned vectors a and b on processors that support SSSE3.
  • The 16-bit multiply example above shows another special configuration.

Long story short, these dot product instructions are surprisingly versatile in decidedly non-obvious ways.

Finally, PSADBW (_mm_sad_epu8, SSE2). This one is intended for motion estimation in video codecs, but it also happens to be the one actually really fast horizontal add you get on x86. In particular, _mm_sad_epu8(x, _mm_setzero_si128()) computes two 16-bit horizontal sums of groups of 8 uint8 lanes in a single, and quite fast, operation. We can do the same trick we did for compares in reverse to compute the sum of 8 int8s instead: add (or subtract, or XOR) _mm_set1_epi8(-128) to x (before the PSADBW), the subtract 128×8 from the resulting 16-bit sums.

To be continued!

There’s a lot more of these, but this feels like enough to chew on for a single blog post. So there will be a sequel covering, at least, integer widening/narrowing and variable shifts. Until then!

Repeated match offsets in BitKnit

I spent the majority of last year working on LZ77-style codecs. I’ve written about some results before. But there were also several smaller (in scope) but still quite neat discoveries along the way.

One of them has to do with repeated match offsets. BitKnit was originally designed for Granny files, which usually contain 3D meshes, animations, sometimes textures, and can also store other user-defined data. As far as a compressor is concerned, Granny files are highly structured, mostly consisting of a few large, homogeneous arrays of fixed-size records.

Repeated match offsets

Often, there is significant correlation between adjacent records, for various reasons. What this means in a LZ77-style dictionary compressor is that there will usually be a lot of matches with a match distance (or match offset) that is a small integer multiple of the record size, and matches with the same offset tend to clump together.

The way LZ77 compressors typically exploit this fact is by reserving special codes for “reuse a recent match distance”. To my knowledge, this technique first appeared in LZX, which keeps a 3-element cache of recent match offsets with a LRU eviction policy. The basic idea seems to have spread from there. Many compressors (too many to list) reserve at least a single special, cheaper code to send another match with the same offset as the previous one (corresponding to a 1-element “cache”). This, among other things, gives a cheaper way to code “gap matches” (a match that resumes after being interrupted by a few mismatching bytes) and appears to be beneficial on most types of data.

On text and data that skews towards variable-size records, having extra codes for more repeated match offsets doesn’t help much, if at all (at least, they don’t seem to hurt, either). However, on data heavy on fixed-size records, it is often a big win. LZX, as mentioned before, has 3 “repeat offset slots”. LZMA uses 4. Several experiments early in the design of BitKnit indicated that at least for the highly structured Granny files it was designed for, there was a good case to be made for having even more repeat offset slots. We re-evaluated this several times, but a repeat offset count of 8 made it into the final codec; essentially, having a larger number of offset slots allows us to “get away” with an overall less sophisticated offset coder (reducing compression, but improving decoder speed), and is a very solid win on the highly structured data that was the target use case.

Experiments with lots of repeat match offsets

Two interesting problems arise from making the “repeat offset cache” this large. First, 8 entries is large enough that it’s worth thinking about different algorithmic variants. Second, at that size, it makes sense to investigate different eviction policies as well as other strategies such as maybe “pinning” a few match distances that we expect to be useful (for example, multiples of the record size in front of homogeneous sections).

Second part first: the effect of either “preloading” or pinning useful match distances was either “in the noise” (almost any change to a LZ encoder using adaptive models will make some files larger and others smaller simply due to getting a slightly different parse) or strictly worse in all our tests. Considering how much interface complications this implies (the pre-loaded offsets for different sections need to get to the compressor somehow, and they either need to be known in the decoder from other sources or stored in the compression stream, reducing the gains even further) that makes the idea fairly uninteresting. Empirically, at least in our tests, LZ compressors find useful match distances quickly, and once they’re in the cache, they tend to stick around. Since such a cache is naturally adaptive to the data (whereas static pinned offsets are not), keeping them fully dynamic seems like a good idea.

The next test was to decouple the eviction policy from offset modeling. LZX, LZMA etc. always keep their list of recent match offsets in MRU order: slot 0 is the most recently used offset, slot 1 is the second-most recent, and so forth. One experiment we tried with a very early version of BitKnit was a variant I dubbed “stable index MRU”: offsets are still evicted on a LRU basis, but instead of shuffling the indices around on every match so that the new most recent match gets offset 0, new offsets would get inserted into the least recently referenced slot without moving the slot IDs around.

This affects the modeling; before, you have a very skewed distribution: slot 0 (most recent) is much more important than slot 1, which is more important than slot 2, and so forth. After, they are more spread out; but the idea was that in highly structured files where the same few offsets stick around for a fairly long time, you might capture more useful correlations by keeping these offsets in a single spot (which the entropy coder then tried to capitalize on).

Here were the results on a few granny files, listing the compressed sizes in bytes, with “anchor” being a x86 executable file that doesn’t have a significant amount of record-structured data in it, as a baseline. “MTF” refers to move-to-front index update policy, “stable” is the stable-index variant just described.

Configuration granny1 granny2 granny3 anchor
4 offsets, MTF 18561994 22127519 15156085 1300427
4 offsets, stable 18533728 22261691 15177584 1300980
8 offsets, MTF 17825495 21746140 14800616 1300534
8 offsets, stable 17580201 21777192 14819439 1304406
12 offsets, MTF 17619775 21640749 14677706 1301007
12 offsets, stable 17415560 21448142 14681142 1306434
16 offsets, MTF 17554769 21523058 14600474 1300984
16 offsets, stable 17341154 21462241 14669793 1308508

First off, as you can see from these experiments, going from 4 to 8 repeat match offsets really does help significantly on these files; an extra 1.5%-4% reduction in file size may not sound like much, but it’s a fairly big deal in compression terms. The experiments with even more repeat offsets were mainly to get a feel for when we start to hit diminishing returns; also, as you can see, the compressed size for the “anchor” file (which is not record-structured) doesn’t seem to care much about the difference between 4 and 8 repeat offset slots, and gets worse after.

As for stable-index coding, well, it’s a mixed bag. It does help on some files, and on the files that do seem to improve from it, it’s a bigger win when using more offset slots, but on e.g. “granny3” and “anchor” it was a net negative. Interesting experiment, but it didn’t go in.

Insertion policy

Another experiment we ran was on insertion policy. Specifically, our hypothesis was that at least on highly structured data (where the repeat offsets really help), we really want to make sure the important offsets stay “in front”. But occasionally, you will still get other matches that “don’t fit the pattern”. The problem is that this puts some other random offset in front that will then slowly “slide down” and meanwhile cause our actually important offsets to be more expensive to code.

This is more of a problem with greedy LZ parsers (which make their decisions locally); optimizing parsers (which usually try to look ahead by a few kilobytes or so) are better at correctly estimating the cost of “disrupting” the set of offsets. Either way, it’s annoying.

We tried a couple different approaches with this; the best overall approach we found in our tests was to stick with a basic MTF coding scheme and LRU eviction policy (bog-standard in other words), but distinguish between updates caused by repeat matches and those caused by inserting a new offset not currently in the repeat offset set. The former (repeat matches) just do a full move-to-front step, as usual. The latter don’t; instead of inserting a new offset all the way in front, we insert it further back. If it then gets reused a second time, it really will be moved all the way to the front of the list; but if it doesn’t get referenced again, it will drop out more quickly and with less disruption of the remaining repeat offsets.

Here’s the batch of test results, from the same compressor version. “kSlotNew” is the slot where new offsets are inserted; 0 corresponds to inserting in front (regular MTF), 1 is the second position, and so forth.

Configuration granny1 granny2 granny3 anchor
4 offsets, kSlotNew=0 18561994 22127519 15156085 1300427
4 offsets, kSlotNew=1 18450961 22153774 15154609 1304707
4 offsets, kSlotNew=2 18118014 22000266 15181128 1305755
4 offsets, kSlotNew=3 17736663 22002942 15209073 1307550
8 offsets, kSlotNew=0 17825495 21746140 14800616 1300534
8 offsets, kSlotNew=4 17327247 21546289 14771634 1305128
8 offsets, kSlotNew=6 17197347 21425116 14713121 1305588
16 offsets, kSlotNew=0 17554769 21523058 14600474 1300984
16 offsets, kSlotNew=14 17122510 21177337 14578492 1305432

We can see that the anchor prefers pure MTF, but the Granny files definitely see a win from not moving new offsets all the way to the front the first time they’re seen. There were a few more tests than the one shown, but in general, inserting new offsets in the second-to-last slot seemed like a good rule of thumb for the Granny files.

This one is definitely more contextual. As you can see, different types of files really prefer different settings here. BitKnit went with 8 offsets and insertion at the second-to-last slot (corresponding to the “8 offsets, kSlotNew=6” row above), because it produced the overall best results on the data it was designed for. (As evaluated on a larger test set not shown here.)

So, this is fairly neat, and a comparatively major win over the baseline 4 offsets and insert-in-front variant (a la LZMA) for the data in question. Now how to implement this efficiently?

Implementation notes

The basic implementation of the offset maintenance logic in the decoder is dead simple. You just keep an array of recent offsets and shuffle it around with something like this:

if (is_repeat_match) {
    // move slot "rep_idx" to front.
    // this involves grabbing the offset at the corresponding
    // location and then sliding everything before that position
    // down by one slot.
    tmp = offsets[rep_idx];
    for (uint i = rep_idx; i > 0; --i)
        offsets[i] = offsets[i - 1];
    offsets[0] = tmp;
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements.
    offsets[kNumReps - 1] = offsets[kNumReps - 2];
    offsets[kNumReps - 2] = newOffset;

This works just fine, but it has a lot of data-dependent branches in the repeat match case, which is a performance trap in decompressors; generally speaking, you want to avoid branching on data you just read out of a bitstream, because it tends to be relatively high entropy and thus cause a lot of branch mispredictions, which are expensive.

One way to fix this is to add several entries worth of padding in front of the actual used part of offsets, and always copy the same number of entries in the “sliding down” phase. This gets rid of the data-dependent branches and makes it easy to unroll the loop fully (since the trip count is now constant) or express it using a few unaligned SIMD loads/stores (where supported).

However, BitKnit uses a different approach derived from our earlier experiments with “stable index MRU” that doesn’t need anything beyond regular integer arithmetic. The basic idea is to leave the offsets array alone; instead, we keep a secondary “data structure” that tells us which logical “repeat offset” list position corresponds to which index in the offsets array.

I write “data structure” in quotes because that information is actually stored in (drum roll)… a single 32-bit unsigned integer! Here’s the idea: we have a uint32_t mtf_state that represents the current offset permutation. It does this by storing the offset array index for the i’th logical repeat offset slot in the i’th nibble (numbered starting from the LSB upwards). At initialization time, we set mtf_state = 0x76543210, the identity mapping: the logical and actual offset indices coincide.

Why does this help? Because the fundamental operation for move-to-front processing is moving a bunch of offsets “one slot down” in their array position. If they’re separate integers, that means either a lot of copying, or less copying but using much wider (e.g. SIMD) instructions. Our array of 4-bit indices is compact enough that 8 indices fit inside a single 32-bit uint; we can slide them all “down” or “up” using nothing but a single bit shift. Now, our code above doesn’t actually move all elements, just the ones at position ≥rep_idx; but that turns out to be easily remedied with some bit masking operations.

So the alternative variant is this:

if (is_repeat_match) {
    // move slot "rep_idx" to front by permuting mtf_state. first,
    // determine the offset slot ID at that position in the list
    uint32_t rep_idx4 = rep_idx*4;
    uint32_t slot_id = (mtf_state >> rep_idx4) & 0xf;
    match_offs = offsets[slot_id]; // decoder needs this later!

    // moved_mtf: slide down everything by one slot, then put
    // "slot_id" in front.
    uint32_t moved_mtf = (mtf_state << 4) + slot_id;
    uint32_t keep_mask = ~0xf << rep_idx4; // bits that don't move
    mtf_state = (mtf_state & keep_mask) | (moved_mtf & ~keep_mask);
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements. this is easier
    // to do by just modifying the offsets directly.
    uint32_t last = (mtf_state >> ((kNumReps - 1)*4)) & 0xf;
    uint32_t before_last = (mtf_state >> ((kNumReps - 2)*4)) & 0xf;

    offsets[last] = offsets[before_last];
    offsets[before_last] = newOffset;

It’s a bit of integer arithmetic, but not a lot, and there’s no dependence on vector instructions, fast unaligned memory access, or in fact anything outside of standard C/C++. BitKnit uses a 32-bit mtf_state to implement an 8-entry LRU cache. Using 64-bit values (and still using nibbles to store array indices), the exact same approach (with essentially no modifications to the source save for type names) can manage a 16-entry LRU.

An 8-entry LRU actually only needs 24 bits (when storing array indices in groups of three bits instead of nibbles), but that’s not a very useful size. A 4-entry LRU state fits in 4*log2(4) = 8 bits, which is nice and compact, although for 4 entries, this way is generally not a win (at least in our tests).

And now this is time to come clean: I kinda like this approach, and it’s the real reason I wrote this whole thing up. I probably would’ve still written it up even if it hadn’t turned out to be useful in practice, but it did, which is always a nice bonus.

Finally, over the years, I’ve found a few instances like this where packing a small “data structure” (using the term loosely) inside a single register-width integer produces interesting results. There’s a good chance I’ll write about more in the future! Until then.


I dislike the way many (most?) people seem to conceptualize “smartness” or intelligence in others, because I feel it misses the mark in two separate, important ways.

1. Many of the things most people consider “intelligence” are in fact acquired (or at least acquirable) skills

You think someone being “smart” means they automatically can do things you can’t, and will never be able to learn, so there’s no point in even trying? Maybe, but it’s generally unlikely.

She has a phenomenal memory for facts and can just rattle them off? Must be eidetic memory, right? Actually, probably not. You too can improve your memory for abstract facts greatly by learning mnemonic techniques, if you want to. More so than you probably think.

He is great at mathematical problem-solving? Some of that requires genuine insight, sure. A lot of it is just pattern matching (which takes mainly familiarity and practice), some fairly general problem-solving heuristics that help you if you’re stuck (if you don’t know that book and want to become better at math, just buy it or lend it at a library!), and enough patience and stamina to keep going.

And so forth. Now I don’t mean to suggest that all that stands between you and a Nobel prize is three self-help books, a week of work and some autosuggestion! Anyone who claims that is a crank trying to sell you something (probably self-help books). But many people “don’t understand science” or “are just not smart” or “just don’t get math” in the same way that I am terrible at pole vaulting: not only do I not possess the skill, I also have never once seriously tried it or made an effort to become better at it in my life!

Which brings me to my second and more important point.

2. A lot of “being smart” actually consists of getting comfortable with feeling stupid

I knew a few people back in my early teens who were Mensa members and made sure everyone knew. They didn’t really do so well in the medium and long term. The problem was that they were brilliant, they knew it, and so they never really learned how to work for something; when they ran into a problem they didn’t immediately see how to handle, they would quickly give up in frustration.

Guess what; many of the problems you will actually face, both professionally and personally, cannot be solved using brilliance. They just take effort and stamina. And those that can benefit from brilliance…. well, usually we don’t really know how to solve them yet.

Most schools teach you well-known solutions to well-known, well-specified problems. And standardized IQ tests likewise ask clear questions with known “right” answers. Being good at that is a particular (and somewhat peculiar) skill. Real-world problem solving is mostly about heuristic solutions to messy, unclear, unfamiliar problems, usually subject to random external constraints, frequently not all satisfiable at once. And it tends to make you feel stupid.

This is normal, and it has been said better elsewhere, for example in “The importance of stupidity in scientific research”. I’ll quote a bit from it, but really you should just read the whole essay, it’s pretty short.

I recently saw an old friend for the first time in many years. We had been Ph.D. students at the same time, both studying science, although in different areas. She later dropped out of graduate school, went to Harvard Law School and is now a senior lawyer for a major environmental organization. At some point, the conversation turned to why she had left graduate school. To my utter astonishment, she said it was because it made her feel stupid. After a couple of years of feeling stupid every day, she was ready to do something else.

I had thought of her as one of the brightest people I knew and her subsequent career supports that view. What she said bothered me. I kept thinking about it; sometime the next day, it hit me. Science makes me feel stupid too. It’s just that I’ve gotten used to it. [..] But high-school and college science means taking courses, and doing well in courses means getting the right answers on tests. If you know those answers, you do well and get to feel smart.

A Ph.D., in which you have to do a research project, is a whole different thing. For me, it was a daunting task. How could I possibly frame the questions that would lead to significant discoveries; design and interpret an experiment so that the conclusions were absolutely convincing; foresee difficulties and see ways around them, or, failing that, solve them when they occurred? My Ph.D. project was somewhat interdisciplinary and, for a while, whenever I ran into a problem, I pestered the faculty in my department who were experts in the various disciplines that I needed. I remember the day when Henry Taube (who won the Nobel Prize two years later) told me he didn’t know how to solve the problem I was having in his area. I was a third-year graduate student and I figured that Taube knew about 1000 times more than I did (conservative estimate). If he didn’t have the answer, nobody did.

That’s when it hit me: nobody did. That’s why it was a research problem. And being my research problem, it was up to me to solve.

This quote is talking about academic research, but the same thing applies elsewhere. I’ve done programming, I’ve done research, and I’ve done art (in the form of PC demos). What all three have in common is that most of the people I know and respect in those disciplines spend the majority of their time feeling like idiots and talentless hacks. Impostor syndrome is the norm.

Being “smart” is not actually about knowing all the answers. One of the biggest parts is being aware of the limits of your knowledge and not running around like a headless chicken when you don’t know what to do. And it’s about being wrong a lot of the time, realizing the fact, and taking steps to be slightly less wrong next time round.