While in the middle of writing “Reading bits in far too many ways, part 3”, I realized that I had written a lot of background material that had absolutely nothing to do with bit I/O and really was worth putting in its own post. This is that post.

The problem I’m concerned with is fairly easy to state: say we have some piece of C++ code that we’re trying to understand (and perhaps improve) the performance of. A good first step is to profile it, which will give us some hints which parts are slow, but not necessarily why. On a fundamental level, any kind of profiling (or other measurement) is descriptive, not predictive: it can tell you how an existing system is behaving, but if you’re designing something that’s more than a few afternoons worth of work, you probably don’t have the time or resources to implement 5 or 6 completely different design alternatives, pick whichever one happens to work best, and throw the rest away. You should be able to make informed decisions up front from an algorithm sketch without having to actually write a fleshed-out implementation.

One thing I want to emphasize particularly here is that experiments coupled with before/after measurements are no adequate substitute for a useful performance model. These kinds of measurements can tell you how much you’ve improved, but not if you are where you should be: if I tell you that by tweaking some config files, I managed to double the number of requests served per second by the web server, that sounds great. It sounds less good if I give you the additional piece of information that with this fix deployed, we’re now at a whopping 1.5 requests per second; having an absolute scale of reference matters!

This goes especially for microbenchmarks. With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.

### Machine models

So, if we want to go deeper than just squinting at C/C++ code and doing some hand-waving, we need to start looking at a somewhat lower abstraction level and define a machine model that is more sophisticated than “statements execute one by one”. If you’re only interested in a single specific processor, one option is to use whatever documentation and tools you can find for the chip in question and analyze your code in detail for that specific machine. And if you’re willing to go all-out on microarchitectural tweaking, that’s indeed the way to go, but it’s a giant step from looking at C++ code, and complete overkill in most cases.

Instead, what I’m going to do is use a simplified machine model that allows us to make quantitative predictions about the behavior of straightforward compute-bound loops, which is simple to describe but still gives us a lot of useful groundwork for more complex scenarios. Here’s what I’ll use:

• We have an unlimited set of 64-bit integer general-purpose registers, which I’ll refer to by names like rSomething. Any “identifiers” that aren’t prefixed with a lowercase r are either symbolic constants or things like labels.
• We have the usual 64-bit integer arithmetic and logic operations. All operations can either be performed between two registers or a register and an immediate constant, and the result is placed in another register. All arithmetic uses two’s complement. For simplicity, all 64-bit values are permitted as immediate constants.
• There’s a flat, byte-granular 64-bit address space, and pointers are just represented as integers.
• All memory accesses require explicit load and store operations. Memory accesses are either 8, 16, 32, or 64 bits in size and can use (for my convenience) both little-endian or big-endian byte ordering, when requested. One of these is the default, but both are the same cost. Narrow stores store the least significant bits of the register in question; narrow loads zero-extend to 64 bits. Loads and stores have a few common addressing modes (that I’ll introduce as I use them). Unaligned loads and stores are supported.
• There’s unconditional branches, which just jump to a given location, and conditional branches, which compare a register to either another register or an immediate constant, and branch to a given destination if the condition is true.

Code will be written in a pseudo-C form, at most one instruction per line. Here’s a brief example showing what kind of thing I have in mind:

loop:                            // label
rFoo = rBar | 1;               // bitwise logical OR
rFoo = lsl(rFoo, 3);           // logical shift left
rBar = asr(rBar, rBaz);        // arithmetic shift right
store16BE(rDest + 3, rMem);    // big-endian store
rCount = rCount - 1;           // basic arithmetic
if rCount != 0 goto loop;      // branch


Shifts use explicit mnemonics because there’s different types of right shifts and at this level of abstraction, registers are generally treated as untyped bags of bits. I’ll introduce other operations and addressing modes as we get to them. What we’ve seen so far is quite close to classic RISC instruction sets, although I’ll allow a larger set of addressing modes than some of the more minimalist designs, and require support for unaligned access on all loads and stores. It’s also close in spirit to an IR (Intermediate Representation) you’d expect to see early in the backend of a modern compiler: somewhat lower-level than LLVM IR, and comparable to early-stage LLVM Machine IR or GCC RTL.

This model requires us to make the distinction between values kept in registers and memory accesses explicit, and flattens down control flow to basic blocks connected by branches. But it’s still relatively easy to look at a small snippet of C++ and e.g. figure out how many arithmetic instructions it boils down to: just count the number of operations.

As a next step, we could now specify a virtual processor to go with our instruction set, but I don’t want to really get into that level of detail; instead of specifying the actual processor, I’ll work the same way actual architectures do: we require that the end result (eventual register and memory contents in our model) of running a program must be as if we had executed the instructions sequentially one by one (as-if rule). Beyond that, an aggressive implementation is free to cut corners as much as it wants provided it doesn’t get caught. We’ll assume we’re in an environment—the combination of compilers/tools and the processor itself—that uses pipelining and tries to extract instruction-level parallelism to achieve higher performance, in particular:

• Instructions can launch independent from each other, and take some number of clock cycles to complete. For an instruction to start executing, all the operands it depends on need to have been computed. As long as the dependencies are respected, all reorderings are valid.
• There is some limit W (“width”) on how many new instructions we can start per clock cycle. In-flight instructions don’t interfere with each other; as long as we have enough independent work, we can start W new instructions every cycle. We’re going to treat W as variable.
• Memory operations have a latency of 4 cycles, meaning that the result of a load is available 4 cycles after the load issued, and a load reading the bytes written by a prior store can issue 4 cycles after the store. That’s a fairly typical latency for a load that hits in the L1 cache, in case you were wondering.
• Branches (conditional or not) count as a single instruction, but their latency is variable. Unconditional branches or easily predicted branches such as the loop counter in along-running loop have an effective latency of 0 cycles, meaning the instructions being branched to can issue at the same time as the branch itself. Unpredictable branches have a nonzero cost that depends on how unpredictable they are—I won’t even try to be more precise here.
• Every other instruction has a latency of 1 clock cycle, meaning the result is available in the next cycle.

This model can be understood as approximating either a dataflow architecture, an out-of-order machine with a very large issue window (and infinitely fast front-end), or a statically scheduled in-order machine running code compiled with a Sufficiently Smart Scheduler. (The kind that actually exists; e.g. a compiler implementing software pipelining).

Furthermore, I’m assuming that while there is explicit control flow (unlike a pure dataflow machine), there is a branch prediction mechanism in place that allows the machine to guess the control flow path taken arbitrarily far in advance. When these guesses are correct, the branches are effectively free other than still taking an instruction slot, during which time the machine checks whether its prediction was correct. When the guess was incorrect, the machine reverts all computations that were down the incorrectly guessed path, and takes some number of clock cycles to recover. If this idea of branch prediction is new to you, I’ll refer you to Dan Luu’s excellent article on the subject, which explains both how and why computers would be doing this.

The end result of these model assumptions is that while control flow exists, it’s on the sidelines: its only observable effect is that it sometimes causes us to throw away a bunch of work and take a brief pause to recover when we guessed wrong. Dataflow, on the other hand—the dependencies between instructions, and how long it takes for these dependencies to be satisfied—is front and center.

### Dataflow graphs

Why this emphasis? Because dataflow and data dependencies is because they can be viewed as the fundamental expression of the structure of a particular computation, whether it’s done on a small sequential machine, a larger superscalar out-of-order CPU, a GPU, or in hardware (be it a hand-soldered digital circuit, a FPGA, or an ASIC). Dataflow and keeping track of the shape of data dependencies is an organizing principle of both the machines themselves and the compilers that target them.

And these dependencies are naturally expressed in graph form, with individual operations being the nodes and data dependencies denoted by directed edges. In this post, I’ll have dependent operations point towards the operations they depend on, with the directed edges labeled with their latency. To reduce clutter, I’ll only write latency numbers when they’re not 1.

With all that covered, and to see what the point of this all is, let’s start with a simple, short toy program that just sums the 64-bit integers in some array delineated by two pointers stored in rCurPtr (which starts pointing to the first element) and rEndPtr (which points to one past the last element), idiomatic C++ iterator-style.

loop:
rSum = rSum + rCurInt;            // Sum
rCurPtr = rCurPtr + 8;            // Advance
if rCurPtr != rEndPtr goto loop;  // Done?


We load a 64-bit integer from the current pointer, add it to our current running total in register rSum, increment the pointer by 8 bytes (since we grabbed a 64-bit integer), and then loop until we’re done. Now let’s say we run this program for a short 6 iterations and draw the corresponding dataflow graph (click to see full-size version):

Note I group nodes into ranks by which cycle they can execute in, at the earliest, assuming we can issue as many instructions in parallel as we want, purely constrained by the data dependencies. The “Load” and “Advance” from the first iteration can execute immediately; the “Done?” check from the first iteration looks at the updated rCurPtr, which is only known one cycle later; and “Sum” from the first iteration needs to wait for the load to finish, which means it can only start a full 4 cycles later.

As we can see, during the first four cycles, all we do is keep issuing more loads and advancing the pointer. It takes until cycle 4 for the results of the first load to become available, so we can actually do some summing. After that, one more load completes every cycle, allowing us to add one more integer to the running sum in turn. If we let this process continue for longer, all the middle iterations would look the way cycles 4 and 5 do: in our state state, we’re issuing a copy of all four instructions in the loop every cycle, but from different iterations.

There’s a few conclusions we can draw from this: first, we can see that this four-instruction loop achieves a steady-state throughput of one integer added to the sum in every clock cycle. We take a few cycles to get into the steady state, and then a few more cycles at the end to drain out the pipeline, but if we start in cycle 0 and keep running N iterations, then the final sum will be completed by cycle N+4. Second, even though I said that our model has infinite lookahead and is free to issue as many instructions per cycle as it wants, we “only” end up using at most 4 instructions per cycle. The limiter here ends up being the address increment (“Advance”); we increment the pointer after every load, per our cost model this increment takes a cycle of latency, and therefore the load in the next iteration of the loop (which wants to use the updated pointer) can start in the next cycle at the earliest.

This is a crucial point: the longest-latency instruction in this loop is definitely the load, at 4 cycles. But that’s not a limiting factor; we can schedule around the load and do the summing later. The actual problem here is with the pointer advance; every single instruction that comes after it in program order depends on it either directly or indirectly, and therefore, its 1 cycle of latency determines when the next loop iteration can start. We say it’s on the critical path. In loops specifically, we generally distinguish between intra-iteration dependencies (between instructions within the same iteration, say “Sum 0” depending on “Load 0”) and inter-iteration or loop-carried dependencies (say “Sum 1” depending on “Sum 0”, or “Load 1” depending on “Advance 0”). Intra-iteration dependencies may end up delaying instructions within that iteration quite a lot, but it’s inter-iteration dependencies that determine how soon we can start working on the next iteration of the loop, which is usually more important because it tends to open up more independent instructions to work on.

The good news is that W=4 is actually a fairly typical number of instructions decoded/retired per cycle in current (as of this writing in early 2018) out-of-order designs, and the instruction mixture here (1 load, 1 branch, 2 arithmetic instructions) is also one that is quite likely to be able to issue in parallel on a realistic 4-wide decode/retire design. While many machines can issue a lot more instructions than that in short bursts, a steady state of 4 instructions per cycle is definitely good. So even though we’re not making much of the infinite parallel computing power of our theoretical machine, in practical terms, we’re doing OK, although on real machines we might want to apply some more transforms to the loop; see below.

Because these real-world machines can’t start an arbitrary number of instructions at the same time, we have another concern: throughput. Say we’re running the same loop on a processor that has W=2, i.e. only two instructions can start every cycle. Because our loop has 4 instructions, that means that we can’t possibly start a new loop iteration more often than once every two clock cycles, and the limiter aren’t the data dependencies, but the number of instructions our imaginary processor can execute in a clock cycle; we’re throughput-bound. We would also be throughput-bound on a machine with W=3, with a steady state of 3 new instructions issued per clock cycle, where we can start working on a new iteration every 4/3≈1.33 cycles.

### A different example

For the next example, we’re going to look at what’s turned into everyone’s favorite punching-bag of a data structure, the linked list. Let’s do the exact same task as before, only this time, the integers are stored in a singly-linked list instead of laid out as an array. We store first a 64-bit integer and then a 64-bit pointer to the next element, with the end of the list denoted by a special value stored in rEndPtr as before. We also assume the list has at least 1 element. The corresponding program looks like this:

loop:
rSum = rSum + rCurInt;            // Sum
if rCurPtr != rEndPtr goto loop;  // Done?


Very similar to before, only this time, instead of incrementing the pointer, we do another load to grab the “next” pointer. And here’s what happens to the dataflow graph if we make this one-line change:

Switching from a contiguous array to a linked list means that we have to wait for the load to finish before we can start the next iteration. Because loads have a latency of 4 cycles in our model, that means we can’t start a new iteration any more often than once every 4 cycles. With our 4-instruction loop, we don’t even need any instruction-level parallelism to reach that target; we might as well just execute one instruction per cycle and still hit the same overall throughput.

Now, this example, with its short 4-instruction loop, is fairly extreme; if our loop had say a total of 12 instructions that worked out nicely, the same figure might well end up averaging 3 instructions per clock cycle, and that’s not so bad. But the underlying problem here is a nasty one: because our longest-latency instruction is on the critical path between iterations, it ends up determining the overall loop throughput.

In our model, we’re still primarily focused on compute-bound code, and memory access is very simple: there’s no memory hierarchy with different cache levels, all memory accesses take the same time. If we instead had a more realistic model, we would also have to deal with the fact that some memory accesses take a whole lot longer than 4 cycles to complete. For example, suppose we have three cache levels and, at the bottom, DRAM. Sticking with the powers-of-4 theme, let’s say that a L1 cache hit takes 4 cycles (i.e. our current memory access latency), a L2 hit takes 16 cycles, a L3 hit takes 64 cycles, and an actual memory access takes 256 cycles—for what it’s worth, all these numbers are roughly in the right ballpark for high-frequency desktop CPUs under medium memory subsystem load as of this writing.

Finding work to keep the machine otherwise occupied for the next 4 cycles (L1 hit) is usually not that big a deal, unless we have a very short loop with unfavorable dependency structure, as in the above example. Fully covering the 16 cycles for a L1 miss but L2 hit is a bit trickier and requires a larger out-of-order window, but current out-of-order CPUs have those, and as long as there’s enough other independent work and not too many hard-to-predict branches along the way, things will work out okay. With a L3 cache hit, we’ll generally be hard-pressed to find enough independent work to keep the core usefully busy during the wait for the result, and if we actually miss all the way to DRAM, then in our current model, the machine is all but guaranteed to stall; that is, to have many cycles with no instructions executed at all, just like the gaps in the diagram above.

Because linked lists have this nasty habit of putting memory access latencies on the critical path, they have a reputation of being slow “because they’re bad for the cache”. Now while it’s definitely true that most CPUs with a cache would much rather have you iterate sequentially over an array, we have to be careful how we think about it. To elaborate, suppose we have yet another sum kernel, this time processing an array of pointers to integers, to compute the sum of the pointed-to values.

loop:
rSum = rSum + rCurInt;             // Sum
rCurPtr = rCurPtr + 8;             // Advance
if rCurPtr != rEndPtr goto loop;   // Done?


And this time, I’ll prune the dataflow graph to show only the current iteration and its direct dependency relationships with earlier and later iterations, because otherwise these more complicated graphs will get cluttered and unreadable quickly:

A quick look over that graph shows us that copies of the same instruction from different iterations are all spaced 1 cycle apart; this means that in the steady state, we will again execute one iteration of the loop per clock cycle, this time issuing 5 instructions instead of 4 (because there are 5 instructions in the loop). Just like in the linked list case, the pointer indirection here allows us to jump all over memory (potentially incurring cache misses along the way) if we want to, but there’s a crucial difference: in this setup, we can keep setting up future iterations of the loop and get more loads started while we’re waiting for the first memory access to complete.

To explain what I mean, let’s pretend that every single of the “LoadInt”s misses the L1 cache, but hits in the L2 cache, so its actual latency is 16 cycles, not 4. But a latency of 16 cycles just means that it takes 16 cycles between issuing the load and getting the result; we can keep issuing other loads for the entire time. So the only thing that ends up happening is that the “Sum k” in the graph above happens 12 cycles later. We still start two new loads every clock cycle in the steady state; some of them end up taking longer, but that does not keep us from starting work on a new iteration of the loop in every cycle.

Both the linked list and the indirect-sum examples have the opportunity to skip all over memory if they want to; but in the linked-list case, we need to wait for the result of the previous load until we can get started on the next one, whereas in the indirect-sum case, we get to overlap the wait times from the different iterations nicely. As a result, in the indirect-sum case, the extra latency towards reaching the final sum is essentially determined by the worst single iteration we had, whereas in the linked-list case, every single cache miss makes our final result later (and costs us throughput).

The fundamental issue isn’t that the linked-list traversal might end up missing the cache a lot; while this isn’t ideal (and might cost us in other ways), the far more serious issue is that any such cache miss prevents us from making progress elsewhere. Having a lot of cache misses isn’t necessarily a problem if we get to overlap them; having long stretches of time were we can’t do anything else, because everything else we could do depends on that one cache-missing load, is.

In fact, when we hit this kind of problem, our best bet is to just switch to doing something else entirely. This is what CPUs with simultaneous multithreading/hardware threads (“hyperthreads”) and essentially all GPUs do: build the machine so that it can process instructions from multiple instruction streams (threads), and then if one of the threads isn’t really making progress right now because it’s waiting for something, just work on something else for a while. If we have enough threads, then we can hopefully fill those gaps and always have something useful to work on. This trade-off is worthwhile if we have many threads and aren’t really worried about the extra latency caused by time-slicing, which is why this approach is especially popular in throughput-centric architectures that don’t worry about slight latency increases.

### Unrolling

But let’s get back to our original integer sum code for a second:

loop:
rSum = rSum + rCurInt;            // Sum
rCurPtr = rCurPtr + 8;            // Advance
if rCurPtr != rEndPtr goto loop;  // Done?


We have a kernel with four instructions here. Out of these four, two (“Load” and “Sum”) do the actual work we want done, whereas “Advance” and “Done?” just implement the loop itself and are essentially overhead. This type of loop is a prime target for unrolling, where we collapse two or more iterations of the loop into one to decrease the overhead fraction. Let’s not worry about the setup or what to do when the number of elements in the array is odd right now, and only focus on the “meat” of the loop. Then a 2× unrolled version might look like this:

loop:
rSum = rSum + rCurInt;            // SumEven
rSum = rSum + rCurInt;            // SumOdd
rCurPtr = rCurPtr + 16;           // Advance
if rCurPtr != rEndPtr goto loop;  // Done?


which has this dataflow graph:

Note that even though I’m writing to rCurInt twice in an iteration, which constitutes a write-after-write (WAW) or “output dependency”, there’s no actual dataflow between the loads and sums for the first and second version of rCurInt, so the loads can issue in parallel just fine.

This isn’t bad: we now have two loads every iteration and spend 6N instructions to sum 2N integers, meaning we take 3 instructions per integer summed, whereas our original kernel took 4. That’s an improvement, and (among other things) means that while our original integer-summing loop needed a machine that sustained 4 instructions per clock cycle to hit full throughput, we can now hit the same throuhgput on a smaller machine that only does 3 instructions per clock. This is definitely progress.

However, there’s a problem: if we look at the diagram, we can see that we can indeed start a new pair of loads every clock cycle, but there’s a problem with the summing: we have two dependent adds in our loop, and as we can see from the relationship between “SumEven k” and “SumEven k+1”, the actual summing part of the computation still takes 2 cycles per iteration. On our idealized dataflow machine with infinite lookahead, that just means that all the loads will get front-loaded, and then the adds computing the final sum proceed at their own pace; the result will eventually be available, but it will still take a bit more than 2N cycles, no faster than we were in the original version of the loop. On a more realistic machine (which can only look ahead by a limited number of instructions), we would eventually stop being able to start new loop iterations until some of the old loop iterations have completed. No matter how we slice it, we’ve gone from adding one integer to the sum per cycle to adding two integers to the sum every two cycles. We might take fewer instructions to do so, which is a nice consolation prize, but this is not what we wanted!

What’s happened is that unrolling shifted the critical path. Before, the critical path between iterations went through the pointer advance (or, to be more precise, there were two critical paths, one through the pointer advance and one through the sum, and they were both the same length). Now that we do half the number of advances per item, that isn’t a problem anymore; but the fact that we’re summing these integers sequentially is now the limiter.

A working solution is to change the algorithm slightly: instead of keeping a single sum of all integers, we keep two separate sums. One for the integers at even-numbered array positions, and one for the integers at odd-numberd positions. Then we need to sum those two values at the end. This is the algorithm:

loop:
rSumEven = rSumEven + rCurInt;    // SumEven
rSumOdd = rSumOdd + rCurInt;      // SumOdd
rCurPtr = rCurPtr + 16;           // Advance
if rCurPtr != rEndPtr goto loop;  // Done?

rSum = rSumEven + rSumOdd;        // FinalSum


And the dataflow graph for the loop kernel looks as follows:

Where before all the summing was in what’s called the same dependency chain (the name should be self-explanatory by now, I hope), we have now split the summation into two dependency chains. And this is enough to make a sufficiently-wide machine that can sustain 6 instructions per cycle complete our integer-summing task in just slightly more than half a cycle per integer being summed. Progress!

On a somewhat narrower 4-wide design, we are now throughput-bound, and take around 6/4=1.5 cycles per two integers summed, or 0.75 cycles per integer. That’s still a good improvement from the 1 cycle per integer we would have gotten on the same machine from the non-unrolled version; this gain is purely from reduction the loop overhead fraction, and further unrolling could reduce it even further. (That said, unless your loop really is as tiny as our example, you don’t generally want to go overboard with unrolling.)

### Tying it all together

In the introduction, I talked about the need for a model detailed enough to make quantitative, not just qualitative, predictions; and at least for very simple compute-bound loops, that is exactly what we have now. At this point, you should know enough to look at the dependency structure of simple loops, and have some idea for how much (or how little) latent parallelism there is, and be able to compute a coarse upper bound on their “speed of light” on various machines with different peak instructions/cycle rates.

Of course, there are many simplifications here, most of which have been already noted in the text; we’re mostly ignoring the effects of the memory hierarchy, we’re not worrying at all about where the decoded instructions come from and how fast they can possibly be delivered, we’ve been flat-out assuming that our branch prediction oracle is perfect, and we’ve been pretending that while there may be a limit on the total number of instructions we can issue per cycle, it doesn’t matter what these instructions are. None of these are true. And even if we’re still compute-bound, we need to worry at least about that latter constraint: sometimes it can make a noticeable difference to tweak the “instruction mix” so it matches better what the hardware can actually do in a given clock cycle.

But all these caveats aside, the basic concepts introduced here are very general, and even just sketching out the dependency graph of a loop like this and seeing it in front of you should give you useful ideas about what potential problems are and how you might address them. If you’re interested in performance optimization, it is definitely worth your time practicing this so you can look at loops and get a “feel” for how they execute, and how the shape of your algorithm (or your data structures, in the linked list case) aids or constrains the compiler and processor.

UPDATE: Some additional clarifications in answer to some questions: paraphrasing one, “if you have to first write C code, translate it to some pseudo-assembly, and then look at the graph, how can this possibly be a better process than just measuring the code in the first place?” Well, the trick here is that to measure anything, you actually need a working program. You don’t to draw a dataflow graph. For example, a common scenario is that there are many ways you could structure some task, and they all want their data structured differently. Actually implementing and testing multiple variants like this requires you to write a lot of plumbing to massage data from one format into another (all of which can be buggy). Drawing a graph can be done from a brief description of the inner loop alone, and you can leave out the parts that you don’t currently care about, or “dummy them out” by replacing them with a coarse approximation (“random work here, maybe 10 cycles latency?”). You only need to make these things precise when they become close to the critical path (or you’re throughput-bound).

And I think that’s enough material for today. Next up, I’ll continue my “Reading bits in far too many ways” series with the third part, where I’ll be using these techniques to get some insight into what kind of difference the algorithm variants make. Until then!

(Continued from part 1.)

Last time, I established the basic problem and went through various ways of doing shifting and masking, and the surprising difficulties inherent therein. The “bit extract” style I picked is based on a stateless primitive, which made it convenient to start with because there’s no loop invariants involved.

This time, we’re going to switch to the stateful style employed by most bit readers you’re likely to encounter (because it ends up cheaper). We’re also going to switch from a monolithic getbits function to something a bit more fine-grained. But let’s start at the beginning.

### Variant 1: reading the input one byte at a time

Our “extract”-style reader assumed the entire bitstream was available in memory ahead of time. This is not always possible or desirable; so let’s investigate the other extreme, a bit reader that requests additional bytes one at a time, and only when they’re needed.

In general, our stateful variants will dribble in input a few bytes at a time, and have partially processed bytes lying around. We need to store that data in a variable that I will call the “bit buffer”:

// Again, this is understood to be per-bit-reader state or local
// variables, not globals.
uint64_t bitbuf = 0;   // value of bits in buffer
int      bitcount = 0; // number of bits in buffer


While processing input, we will always be seesawing between putting more bits into that buffer when we’re running low, and then consuming bits from that buffer while we can.

Without further ado, let’s do our first stateful getbits implementation, reading one byte at a time, and starting with MSB-first this time:

// Invariant: there are "bitcount" bits in "bitbuf", stored from the
// MSB downwards; the remaining bits in "bitbuf" are 0.

uint64_t getbits1_msb(int count) {
assert(count >= 1 && count <= 57);

// Keep reading extra bytes until we have enough bits buffered
// Big endian; our current bits are at the top of the word,
// new bits get added at the bottom.
while (bitcount < count) {
bitbuf |= (uint64_t)getbyte() << (56 - bitcount);
bitcount += 8;
}

// We now have enough bits present; the most significant
// "count" bits in "bitbuf" are our result.
uint64_t result = bitbuf >> (64 - count);

// Now remove these bits from the buffer
bitbuf <<= count;
bitcount -= count;

return result;
}


As before, we can get rid of the count≥1 requirement by changing the way we grab the result bits, as explained in the last part. This is the last time I’ll mention this; just keep in mind that whenever I show any algorithm variant here, the variations from last time automatically apply.

The idea here is quite simple: first, we check whether there’s enough bits in our buffer to satisfy the request immediately. If not, we dribble in extra bytes one at a time until we have enough. getbyte() here is understood to ideally use some buffered IO mechanism that just boils down to dereferencing and incrementing a pointer on the hot path; it should not be a function call or anything expensive if you can avoid it. Because we insert 8 bits at a time, the maximum number of bits we can read in a single call is 57 bits; that’s the largest number of bits we can refill the buffer to without risking anything dropping out.

After that, we grab the top count bits from our buffer, then shift them out. The invariant we maintain here is that the first unconsumed bit is kept at the MSB of the buffer.

The other thing I want you to notice is that this process breaks down neatly into three separate smaller operations, which I’m going to call “refill”, “peek” and “consume”, respectively. The “refill” phase ensures that a certain given minimum number of bits is present in the buffer; “peek” returns the next few bits in the buffer, without discarding them; and “consume” removes bits without looking at them. These all turn out to be individually useful operations; to show how things shake out, here’s the equivalent LSB-first algorithm, factored into smaller parts:

// Invariant: there are "bitcount" bits in "bitbuf", stored from the
// LSB upwards; the remaining bits in "bitbuf" are 0.
void refill1_lsb(int count) {
assert(count >= 0 && count <= 57);
// New bytes get inserted at the top end.
while (bitcount < count) {
bitbuf |= (uint64_t)getbyte() << bitcount;
bitcount += 8;
}
}

uint64_t peekbits1_lsb(int count) {
assert(bit_count >= count);
return bitbuf & ((1ull << count) - 1);
}

void consume1_lsb(int count) {
assert(bit_count >= count);
bitbuf >>= count;
bitcount -= count;
}

uint64_t getbits1_lsb(int count) {
refill1_lsb(count);
uint64_t result = peekbits1_lsb(count);
consume1_lsb(count);
return result;
}


Writing getbits as the composition of these three smaller primitives is not always optimal. For example, if you use the rotate method for MSB-first bit buffers, you really want to have only rotate shared by the peekbits and consume phases; an optimal implementation shares that work between the two. However, breaking it down into these individual steps is still a useful thing to do, because once you conceptualize these three phases as distinct things, you can start putting them together differently.

The most important such transform is amortizing refills over multiple decode operations. Let’s start with a simple toy example: say we want to read our three example bit fields from the last part:

    a = getbits(4);
b = getbits(3);
c = getbits(5);


With getbits implemented as above, this will do the refill check (and potentially some actual refilling) up to 3 times. But this is silly; we know in advance that we’re going to be reading 4+3+5=12 bits in one go, so we might as well grab them all at once:

    refill(4+3+5);
a = getbits_no_refill(4);
b = getbits_no_refill(3);
c = getbits_no_refill(5);


where getbits_no_refill is yet another getbits variant that does peekbits and consume, but, as the name suggests, no refilling. And once you get rid of the refill loop between the individual getbits invocations, you’re just left with straight-line integer code, which compilers are good at optimizing further. That said, the all-fixed-length case is a bit of a cheap shot; it gets far more interesting when we’re not sure exactly how many bits we’re actually going to consume, like in this example:

    temp = getbits(5);
if (temp < 28)
result = temp;
else
result = 28 + (temp - 28)*16 + getbits(4);


This is a simple variable-length code where values from 0 through 27 are sent in 5 bits, and values from 28 through 91 are sent in 9 bits. The point being, in this case, we don’t know in advance how many bits we’re eventually going to consume. We do know that it’s going to be no more than 9 bits, though, so we can still make sure we only refill once:

    refill(9);
temp = getbits_no_refill(5);
if (temp < 28)
result = temp;
else
result = 28 + (temp - 28)*16 + getbits_no_refill(4);


In fact, if you want to, you can go wild and split operations even more, so that both execution paths only consume bits once. For example, assuming a MSB-first bit buffer, we could write this small decoder as follows:

    refill(9);
temp = peekbits(5);
if (temp < 28) {
result = temp;
consume(5);
} else {
// The "upper" and "lower" code are back-to-back,
// and combine exactly the way we want! Synergy!
result = getbits_no_refill(9) - 28*16 + 28;
}


This kind of micro-tweaking is really not recommended outside very hot loops, but as I mentioned in the previous part, some of these decoder loops are quite hot indeed, and in that case saving a few instructions here and a few instructions there adds up. One particularly important technique for decoding variable-length codes (e.g. Huffman codes) is to peek ahead by some fixed number of bits and then do a table lookup based on the result. The table entry then lists what the decoded symbol should be, and how many bits to consume (i.e. how many of the bits we just peeked at really belonged to the symbol). This is significantly faster than reading the code a bit or two at a time and consulting a Huffman tree at every step (the method sadly taught in many textbooks and other introductory texts.)

There’s a problem, though. The technique of peeking ahead a bit (no pun intended) and then later deciding how many bits you actually want to consume is quite powerful, but we’ve just changed the rules: the getbits implementation above is careful to only read extra bytes when it’s strictly necessary. But our modified variable-length code reader example above always refills so the buffer contains at least 9 bits, even when we’re only going to consume 5 bits in the end. Depending on where that refill happens, it might even cause us to read past the end of the actual data stream.

In short, we’ve introduced lookahead. The modified code reader starts grabbing extra input bytes before it’s sure whether it needs them. This has many advantages, but the trade-off is that it may cause us to read past the logical end of a bit stream; it certainly implies that we have to make sure this case is handled correctly. It certainly should never crash or read out of bounds; but beyond that, it implies certain thing about the way input buffering or the framing layer have to work.

Namely, if we’re going to do any lookahead, we need to figure out a way to handle this. The primary options are as follows:

• We can punt and make it someone else’s problem by just requiring that everyone hand us valid data with some extra padding bytes after the end. This makes our lives easier but is an inconvenience for everyone else.
• We can arrange things so the outer framing layer that feeds bytes to our getbits routine knows when the real data stream is over (either due to some escape mechanism or because the size is sent explicitly); then we can either stop doing any lookahead and switch to a more careful decoder when we’re getting close to the end, or pad the stream with some dummy value after its end (zeroes being the most popular choice).
• We can make sure that whatever bytes we might grab during lookahead while decoding a valid stream are still part of our overall byte stream that’s being processed by our code. For example, if you use a 64-bit bit buffer, we can finagle our way around the problem by requiring that some 8-byte footer (say a checksum or something) be present right after a valid bit stream. So while our bit buffer might overshoot, it’s still data that’s ultimately going to be consumed by the decoder, which simplifies the logistics considerably.
• Barring that, whatever I/O buffering layer we’re using needs to allow us to return some extra bytes we didn’t actually consume into the buffer. Namely, whatever lookahead bytes we have left in our bit buffer after we’re done decoding need to be returned to the buffer for whoever is going to read it next. This is essentially what the C standard library function ungetc does, except you’re not allowed to call ungetc more than once, and we might need to. So going along this route essentially dooms you to taking charge of IO buffer management.

I won’t sugarcoat it, all of these options are a pain in the neck, some more so than others: hand-waving it away by putting something else at the end is easiest, handling it in some outer framing layer isn’t too bad, and taking over all IO buffering so you can un-read multiple bytes is positively hideous, but you don’t have great options when you don’t control your framing. In the past, I’ve written posts about handy techniques that might help you in this context; and in some implementations you can work around it, for example by setting bitcount to a huge value just after inserting the final byte from the stream. But in general, if you want lookahead, you’re going to have to put some amount of work into it. That said, the winnings tend to be fairly good, so it’s not all for nothing.

### Variant 2: when you really want to read 64 bits at once

The methods I’ve discussed so far both have some “slop” from working in byte granularity. The extract-style bit reader started with a full 64-bit read but then had to shift by up to 7 positions to discard the part of the current byte that’s already consumed, and the getbits1 above inserts one byte at a time into the bit buffer; if there’s 57 bits already in the buffer, there’s no space for another byte (because that would make 65 bits, more than the width of our buffer), so that’s the maximum width the getbits1 method supports. Now 57 bits is a useful amount; but if you’re doing this on a 32-bit platform, the equivalent magic number is 25 bits (32-7), and that’s definitely on the low side, enough so to be inconvenient sometimes.

Luckily, if you want the full width, there’s a way to do it (like the rotate-and-mask technique for MSB-first bit buffers, I learned this at RAD). At this point, I think you get the correspondence between the MSB-first and LSB-first methods, so I’m only going to show one per variant. Let’s do LSB-first for this one:

// Invariant: "bitbuf" contains "bitcount" bits, starting from the
// LSB up; 1 <= bitcount <= 64
uint64_t bitbuf = 0;     // value of bits in buffer
int      bitcount = 0;   // number of bits in buffer
uint64_t lookahead = 0;  // next 64 bits

// Must do this to prime the pump!
void initialize() {
bitbuf = get_uint64LE();
bitcount = 64;
}

// grabs the lookahead word, if we don't have
}
}

uint64_t peekbits2_lsb(int count) {
assert(bitcount >= 1);
assert(count >= 0 && count <= 64);

if (count <= bitcount) { // enough bits in buf
} else {

// combine current bitbuf with lookahead
// (lookahead bits go above end of current buf)
uint64_t next_bits = bitbuf;
}
}

void consume2_lsb(int count) {
assert(bitcount >= 1);
assert(count >= 0 && count <= 64);

if (count < bitcount) { // still in current buf
// just shift the bits out
bitbuf >>= count;
bitcount -= count;
} else { // all of current buf consumed

int lookahead_consumed = count - bitcount;
}

assert(bitcount >= 1);
}

uint64_t getbits2_lsb(int count) {
uint64_t result = peekbits2_lsb(count);
consume2_lsb(count);
return result;
}


This one is a bit more complicated than the ones we’ve seen before, and needs an explicit initialization step to make the invariants work out just right. It also involves several extra branches compared to the variants we’ve seen before, which makes it less than ideal for deeply pipelined machines, which includes desktop PCs, and note that I’m using the width_to_mask_table again, and not just for show: none of the arithmetic expressions we saw last time to compute the mask for a given width work for the full 0-64 range of allowed widths on any common 64-bit architecture that’s not POWER, and even that only if we ignore them invoking undefined behavior, which we really shouldn’t.

The underlying idea is fairly simple: instead of just one bit buffer, we keep track of two values. We have however many bits are left of the last 64-bit value we read, and when that’s not enough for a peekbits, we grab the next 64-bit value from the input stream (via some externally-implemented get_uint64LE()) to give us the bits we’re missing. Likewise, consume checks whether there will still be any bits left in the current input buffer after consuming width bits. If not, we switch over to the bits from the lookahead value (shifting out however many of them we consumed) and clear the have_lookahead flag to indicate that what used to be our lookahead value is now just the contents of our bit buffer.

There are some contortions in this code to ensure we don’t do out-of-range (undefined-behavior-inducing) shifts. For example, note how peekbits tests whether count <= bitcount to detect the bits-present-in-buffer case, whereas consume uses count < bitcount. This is not an accident: in peekbits, the next_bits calculation involves a right-shift by bitcount. Since it only happens in the path where bitcount < count ≤ 64, that means that bitcount < 64, and we’re safe. In consume, the situation is reversed: we shift by lookahead_consumed = count - bitcount. The condition around the block guarantees that lookahead_consumed ≥ 0; in the other direction, because count is at most 64 and bitcount is at least 1, we have lookahead_consumed ≤ 64 – 1 = 63. That said, to paraphrase Knuth: beware of bugs in the above code; I’ve only proved it correct, not tried it.

This technique has another thing going for it besides supporting bigger bit field widths: note how it always reads full 64-bit uints at a time. Variant 1 above only reads bytes at a time, but requires a refill loop; the various branchless variants we’ll see later implicitly rely on the target CPU supporting fast unaligned reads. This version alone has the distinction of doing reads of a single size and with consistent alignment, which makes it more attractive on targets that don’t support fast unaligned reads, such as many old-school RISC CPUs.

Finally, as usual, there’s several more variations here that I’m not showing. For example, if you happen to have the data you’re decoding fully in memory, there’s no reason to bother with the boolean have_lookahead flag; just keep a pointer to the current lookahead word, and bump that pointer up whenever the current lookahead is consumed.

### Variant 3: bit extraction redux

The original bit extraction-based bit reader from the previous part was a bit on the expensive side. But as long as we’re OK with the requirement that the entire input stream be in memory at once, we can wrangle it into the refill/peek/consume pattern to get something useful. It still gives us a bit reader that looks ahead (and hence has the resulting difficulties), but such is life. For this one, let’s do MSB again:

const uint8_t *bitptr; // Pointer to current byte
uint64_t       bitbuf = 0; // last 64 bits we read
int            bitpos = 0; // how many of those bits we've consumed

void refill3_msb() {
assert(bitpos <= 64);

// Advance the pointer by however many full bytes we consumed
bitptr += bitpos >> 3;

// Refill

// Number of bits in the current byte we've already consumed
// (we took care of the full bytes; these are the leftover
// bits that didn't make a full byte.)
bitpos &= 7;
}

uint64_t peekbits3_msb(int count) {
assert(bitpos + count <= 64);
assert(count >= 1 && count <= 64 - 7);

// Shift out the bits we've already consumed
uint64_t remaining = bitbuf << bitpos;

// Return the top "count" bits
return remaining >> (64 - count);
}

void consume3_msb(int count) {
bitpos += count;
}


This time, I’ve also left out the getbits built from refill / peek / consume calls, because that’s yet another pattern that should be pretty clear by now.

It’s a pretty sweet variant. Once we break the bit extraction logic into separate “refill” and “peek”/”consume” pieces, it becomes clear how all of the individual pieces are fairly small and clean. It’s also completely branchless! It does expect unaligned 64-bit big-endian reads to exist and be reasonably cheap (not a problem on mainstream x86s or ARMs), and of course a realistic implementation needs to include handling of the end-of-buffer cases; see the discussion in the “lookahead” section.

### Variant 4: a different kind of lookahead

And now that we’re here, let’s do another branchless lookahead variant. This exact variant is, to the best of my knowledge, another RAD special – discovered by my colleague Charles Bloom and me while working on Kraken (UPDATE: as Yann points out in the comments, this basic idea was apparently used in Eric Biggers’ “Xpack” long before Kraken was launched; I wasn’t aware of this and I don’t think Charles was either, but that means we’re definitely not the first ones to come up with the idea. Our variant has an interesting wrinkle though – details in my reply). Now all branchless (well, branchless if you ignore end-of-buffer checking in the refill etc.) bit readers look very much alike, but this particular variant has a few interesting properties (some of which I’ll only discuss later because we’re lacking a bit of necessary background right now), and that I haven’t seen anywhere else in this combination; if someone else did it first, feel free to inform me in the comments, and I’ll add the proper attribution! Here goes; back to LSB-first again, because I’m committed to hammering home just how similar and interchangeable LSB-first/MSB-first are at this level, holy wars notwithstanding.

const uint8_t *bitptr;   // Pointer to next byte to insert into buf
uint64_t bitbuf = 0;     // value of bits in buffer
int      bitcount = 0;   // number of bits in buffer

void refill4_lsb() {
// Grab the next few bytes and insert them right above
// the current top.

bitptr += (63 - bitcount) >> 3;

// Update the available bit count
bitcount |= 56; // now bitcount is in [56,63]
}

uint64_t peekbits4_lsb(int count) {
assert(count >= 0 && count <= 56);
assert(count <= bitcount);

return bitbuf & ((1ull << count) - 1);
}

void consume4_lsb(int count) {
assert(count <= bitcount);

bitbuf >>= count;
bitcount -= count;
}


The peek and consume phases are nothing we haven’t seen before, although this time the maximum permissible bit width seems to have shrunk by one more bit down to 56 bits for some reason.

That reason is in the refill phase, which works slightly differently from the ones we’ve seen so far. Reading 64 little-endian bits and shifting them up to align with the top of our current bit buffer should be straightforward at this point. But the bitptr / bitcount manipulation needs some explanation.

It’s actually easier to start with the bitcount part. The variants we’ve seen so far generally have between 57 and 64 bits in the buffer after refill. This version instead targets having between 56 and 63 bits in the buffer (which is also why the limit on count went down by one). But why? Well, inserting some integer number of bytes means bitcount is going to be incremented by some multiple of 8 during the refill; that means that bitcount & 7 (the low 3 bits) won’t change. And if we refill to a target of [56,63] bits in the buffer, we can compute the updated bit count with a single binary OR operation.

Which brings me to the question of how many bytes we should advance the pointer by. Well, let’s just look at the values of the original bitcount:

• If 56 ≤ bitcount ≤ 63, we were already in our target range and don’t want to advance by another byte.
• If 48 ≤ bitcount ≤ 55, we’re adding exactly 1 byte (and so want to advance bit_ptr by that much).
• If 40 ≤ bitcount ≤ 47, we’re adding exactly 2 bytes.

and so forth. This works out to the (63 - bitcount) >> 3 bytes we’re adding to bitptr.

Now, the bits in bitbuf above bitcount can end up getting ORed over multiple times. However, when that happens, we OR over the same value every time, so it doesn’t change the result. Therefore, once they later travel downwards (from the right-shift in the consume function), they’re fine; no need to worry about garbage.

Okay. So what’s interesting, but what’s so special about this particular variant? When would you choose this over, say, variant 3 above?

One simple reason: in this variant, the address the refill is loading from does not have a dependency on the current value of bitcount. In fact, the next load address is known as soon as the previous refill is complete. This is a subtle distinction that turns out to be a fairly major advantage on an out-of-order CPU. Among integer operations, even when hitting the L1 cache, loads are on the high latency side (typically somewhere between 3 and 5 cycles, whereas most integer operations take a single cycle), and the exact value of bitcount at the end of some loop iteration is often only known late (consider the simple variable-length code example I gave above).

Having the load address not depend on bitcount means the load can potentially issue as soon as the previous refill is complete; then we have plenty of time to complete the load, potentially byte-swap the value if the endianness of our load doesn’t match the target ISA (say because we’re using a MSB-first bit buffer on a little-endian CPU), and then the only thing that depends on the previous value of bitcount is the shift, which is a regular ALU operation and generally takes a single cycle.

In short, this somewhat obscure form of refill looks weird, but provides a tangible increase in available instruction-level parallelism. It was good for about a 10% throughput improvement on desktop PCs (vs. the earlier branchless refill it replaced) in the then-current version of the Kraken Huffman decoder when I tested it in early 2016.

Consider this a teaser for the next (and hopefully last) part of this series, in which I won’t introduce many more variants (maybe one more), and will instead talk a lot more about the performance of bit stream decoders and what kinds of things to watch out for.

Until then!

Resurrecting an old tradition of this blog, let’s take a simple problem, go over a way too long list of alternative solutions, and evaluate their merits.

Our simple problem is this: we want to read data encoded using some variable-bit-length code from a byte stream. Reading individual bytes, machine words etc. is directly supported by most CPUs and many programming languages, but for bit-granularity IO, you generally need to implement it yourself.

This sounds simple enough, and in some sense it is. The first source of problems is that this operation tends to be a hot spot in codecs—-and yes, compute bound, not memory or I/O bound. So we’d like not just an implementation that works; we’d like it to be efficient as well. And along the way we’ll run into many other complications: interactions with IO buffering, end-of-buffer handling, corner cases in the way bit shifts are specified both in C/C++ and in various processor architectures, as well as other bit shift peculiarities.

I’ll mainly focus on many different ways to handle the reader side in this post; essentially all techniques covered in here apply equally to the writer side, but I don’t want to double the number of algorithm variations I’m presenting. There will be plenty as it is.

### Degrees of freedom

“Read a variable number of bits” is not a sufficient problem specification. There are lots of plausible ways to pack bits into bytes, and all have their strengths and weaknesses that I’ll go into later. For now, let’s just cover the differences.

The first major decision to make is whether fields are packed MSB-first or LSB-first (“most significant bit” and “least significant bit”, respectively). That is, if we call our to-be-implemented function getbits and run the code sequence

a = getbits(4);
b = getbits(3);


on some bitstream that we just opened, we expect both values to come from the same byte, but how are they arranged in that byte? If bits are packed MSB-first, then “a” occupies 4 bits starting at the MSB, and “b” is below “a”, leading to this arrangement:

I’m numbering bits with the LSB being bit 0 and increasing towards the MSB, which is the conventional ordering in most contexts. LSB-first packing is the opposite, where the first field occupies bit 0 and later fields proceed upwards:

Both conventions are used in everyday file formats. For example, JPEG uses MSB-first bit packing for its bitstream, and DEFLATE (zip) uses LSB-first.

The next question we have to settle is what’s supposed to happen when a value ends up spanning multiple bytes. Say we have another value, “c”, that we want to encode in 5 bits. What do we want to end up with? We can postpone the issue slightly be declaring that we’re packing values into 32-bit words or 64-bit words and not bytes, but ultimately we’ll have to settle on something. This is where we suddenly get a whole lot of different variants, and I’m only going to cover the main contenders.

We can think of MSB-first bit packing as iterating over our bitfield “c” from its MSB towards the LSB, inserting one bit at a time. And once we fill up one byte, we start with the next byte. Following these rules for our new bitfield c, this is where the bits end up in our stream:

Note that by following these rules, we end up with the same two bytes we would have gotten from MSB-first bit packing into a big integer and then stored it in big-endian byte order. If we had instead decided to split c such that its LSB goes into the first byte and the four higher-order bits into the second byte, this wouldn’t have worked. I’m going to call bit-packing rules that are self-consistent like that “natural” rules.

LSB-first bit-packing of course also has its corresponding natural rule, which is to insert the new value bit by bit starting from the LSB upwards, and if we do that here, we end up with this bit stream:

LSB-first natural packing gives us the same bytes as LSB-first packing into a big integer then storing it in little-endian byte order. Also, we’re clearly running into some awkwardness with the drawing here: the logically contiguous packing of c into multiple bytes looks discontinuous when drawing it like this, whereas the drawing for the MSB-first packing looked the way you’d expect. But there’s trouble brewing there too: in the MSB-first drawing, we’re numbering the bits in increasing order from right to left (as is common) and the bytes in increasing order from left to right (as is also common).

Here’s what happens with the LSB-first bit diagram if we draw bit 0 (the LSB) in each byte at the left and bit 7 (the MSB) at the right:

If you draw it this way, it looks the way you’d expect. Putting the MSB on the right feels weird when thinking of a byte as a number and much less weird when you just think about it as an array of 8 bits (which is, effectively, what we’re treating it as when we’re doing bit-wise IO).

Incidentally, some big-endian architectures (e.g. IBM POWER) do number bits this way – bit 0 is the MSB, bit 31 (or 63) is the LSB. Diagramming MSB-first bit packing on such a machine with bit 0=MSB, and also numbering our own bit fields so that their bit 0 corresponds to the MSB, we’d get the exact same diagram (but it would mean something slightly different). This convention makes bit and byte ordering consistent (nice) but breaks the handy convention of bit k corresponding to a value of 2k (less nice).

And if the idea of renumbering bits so bit 0 is the MSB breaks your head, you can also leave the bit numbering as it is but be a rebel and draw byte addresses increasing to the left. Or alternatively keep drawing increasing addresses to the right but be a slightly different kind of rebel and write your byte stream backwards. If you do either of those, you end up with this layout:

I realize this is getting confusing, and I’ll stop now, but I’m trying to make an actual point here: you should not get too attached to the way this stuff is drawn. It’s easy to fool yourself into thinking that one variant is better than another because it looks prettier, but the conventions of drawing bytes left to right and bits within them right to left are completely arbitrary, and no matter which one you pick, they always end up having that One Weird Thing about them.

It turns out that MSB-first and LSB-first packing conventions both have advantages and disadvantages, and it’s much more useful to think of them as tools with different areas of application than it is to designate one as the “right way” and the other as the “wrong way”. As for byte order, and whether to pack values into bytes, words, or something larger, I highly recommend that you use whatever the natural order for your bit-packing convention is: MSB-first corresponds naturally to big-endian style byte ordering, and LSB-first corresponds naturally to little-endian byte ordering. Unless you’re writing byte streams in reverse—believe it or not, there’s good reasons to do that too—in which case we have reverse-order MSB-first corresponding to little-endian and reverse-order LSB-first corresponding to big-endian.

The reason to prefer “natural” orders is that they tend to give you more freedom for different implementations. A natural-order stream admits a variety of different decoders (and encoders), all with different trade-offs (and good on different targets). “Unnatural” orders are usually designed with exactly one implementation in mind and tend to very awkward to decode any other way.

### Our first getbits (bit extract)

Now that we’ve specified the problem sufficiently, we can implement a solution. A particularly simple version is possible if we assume the entire bit stream is sequential in memory (as a byte array) and also completely ignore, for the time being, such pesky issues as running into the end of the array. Let’s just pretend it’s infinite! (Or large enough and zero-padded, anyway.)

In that case, we can base everything on a purely functional “bit extract” function, which I am then going to illustrate a number of issues that come up in all kind of bit-reading functions. Let’s start with LSB-first bit packing:

// Treating buf[] as a giant little-endian integer, grab "width"
// bits starting at bit number "pos" (LSB=bit 0).
uint64_t bit_extract_lsb(const uint8_t *buf, size_t pos, int width) {
assert(width >= 0 && width <= 64 - 7);

// Read a 64-bit little-endian number starting from the byte
// containing bit number "pos" (relative to "buf").
uint64_t bits = read64LE(&buf[pos / 8]);

// Shift out the bits inside the first byte that we've
// After this, the LSB of our bit field is in the LSB of bits.
bits >>= pos % 8;

// Return the low "width" bits, zeroing the rest via bit mask.
return bits & ((1ull << width) - 1);
}

// State variable, assumed to be local variables, or factored
// into some object; hopefully not actual globals.
const uint8_t *bitstream; // The input bistream
size_t bit_pos; // Current position in the stream, in bits.

uint64_t getbits_extract_lsb(int width) {
uint64_t result = bit_extract_lsb(bitstream, bit_pos, width);
bit_pos += width;
return result;
}


We’re just using the fact I noted earlier that a LSB-first bit stream is just a big little-endian number. We first grab 64 contiguous byte-aligned bits starting from the first byte containing any bits we care about, do a right shift to get rid of the remaining 0-7 extra bits below the first bit we care about, and then return the result masked to the desired width.

Depending on the value of pos, that right shift can cost us an extra 7 bits. So even though we read a full 64 bits, the maximum number of bits we can read in one go with the code above is 64-7=57 bits.

With bitextract in hand, getbits is straightforward; we just keep track of the current position in the bitstream (in bits), and increment it after reading. Easy.

The corresponding MSB-first variant works out quite similarly, except for one annoying issue I’ll explain after showing the code:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_msb(const uint8_t *buf, size_t pos, int width) {
assert(width >= 1 && width <= 64 - 7);

// Read a 64-bit big-endian number starting from the byte
// containing bit number "pos" (relative to "buf").
uint64_t bits = read64BE(&buf[pos / 8]);

// Shift out the bits we've already consumed.
// After this, the MSB of our bit field is in the MSB of bits.
bits <<= pos % 8;

// Return the top "width" bits.
return bits >> (64 - width);
}

uint64_t getbits_extract_msb(int width) {
uint64_t result = bit_extract_msb(bitstream, bit_pos, width);
bit_pos += width;
return result;
}


This works very similar to the one before: read 64 contiguous byte-aligned bits (big-endian this time), do a left shift to align the top of the bit field we want with the MSB of bits (where before, we did a right shift to align the bottom of our bit field with the LSB of bits), and then do a right shift to place the top width bits at the bottom to return them, because when somebody calls getbits(3), they generally expect to see a value between 0 and 7 inclusive.

### Shift boundary cases

So what’s the problem? Well, this version doesn’t allow width to be zero. The issue is that if we allow width == 0, then the final shift will try to right-shift a 64-bit value by 64 bits, and that’s undefined behavior in C/C++! In this case, we may only shift by 0 through 63.

Now, in some cases, C/C++ leave details unspecified for backwards-compatibility with machines essentially nobody cares about at this point. Not requiring that the representation of signed numbers use two’s complement being a famous example; while non-two’s complement architectures exist, they’re all museum pieces at this point.

Sadly, this is not one of these cases. Here’s what happens on different widespread CPU architectures when the shift amount is out of range:

• For 32-bit x86 and x86-64, shift amounts are interpreted mod 32 for operand widths of 32 bits and lower, and mod 64 for 64-bit operands. So right-shifting a 64-bit value by 64 will yield the same as shifting by 0, i.e. a no-op.
• In 32-bit ARM (A32/T32 instruction sets), shift amounts are taken mod 256. Right-shifting a 32-bit value by 32 (or 64) will hence yield 0, as will right-shifting it by 255, but right-shifting it by 256 will leave the value untouched.
• In 64-bit ARM (A64 ISA), shift amounts are taken mod 32 for 32-bit shifts and mod 64 for 64-bit shifts (essentially the same as x86-64).
• RISC-V also follows the same rule: 32-bit shifts distances are mod 32, 64-bit shift distances are mod 64.
• For POWER/PowerPC, 32-bit shifts take the shift amount mod 64 and 64-bit shifts take the shift amount mod 128.

To make matters even more confusing, in most of these instruction sets with SIMD extensions, the SIMD integer instructions have different out-of-range shift behavior
than the non-SIMD instructions. In short, this is one of those cases where there’s actual architectural differences between mainstream platforms; POWER and RISC-V might be a bit obscure for most of you, but say 32-bit vs. 64-bit ARM both have hundreds of millions of devices sold at this point.

Therefore, even if all the compiler does with that right-shift is to emit the corresponding right-shift instruction for the target architecture (which is generally what happens), you will still see different behavior on different architectures, and on both ARM A64 and x86-64, the result of a shift by 64 will be effectively a no-op, and getbits(0) will therefore (generally) return a non-0 value, where one would expect that it’s always zero.

Why does it matter in the first place? Having a hard-coded getbits(0) is indeed not an interesting use case; but sometimes you might want to perform a getbits(x) for some variable x, where x can be zero in some cases, and it’s nice for that case to just work and not require some special-case testing.

If you really need that case to work, one option is to explicitly test for width == 0 and handle that specially; another is to use an branch-less expression that works for zero widths, for example this variant used in Yann Collet‘s FSE:

    // Return the top "width" bits, avoiding width==0 edge cases.
return (bits >> 1) >> (63 - width);


This particular case is easier to handle with LSB-first bit streams. And while I’m mentioning them, let’s talk about the masking operation I used to isolate the lowest width bits:

    // Return the low "width" bits, zeroing the rest via bit mask.
return bits & ((1ull << width) - 1);


There’s an equivalent form that is slightly cheaper on architectures with a three-operand and-with-complement (AND-NOT) instruction. This includes many RISC CPUs as well as x86s with BMI1. Namely, we can take a mask of all-1 bits, left-shift it to introduce width zeroes at the bottom, then complement the whole thing:

    return bits & ~(~0ull << width);


If you’re on an x86 with not just BMI1, but also BMI2 support, you can also use the BZHI instruction that is tailor-made for this use case, if you can figure out how to make your compiler emit it (or use assembly). Yet another option that is advantageous in some cases is to simply prepare a small look-up table of bit masks, which then simplifies the code into

    return bits & width_to_mask_table[width];


Another interesting property is that the LSB-first version using the mask table lookup only performs one shift by a variable amount (to shift out the already-consumed bits). This matters because, for a variety of (often banal) reasons, integer shifts by a variable amount are more expensive than shifts by a compile-time constant amount on many micro-architectures. For example, the Cell PPU/Xbox 360 Xenon CPU were infamous for having variable-distance shifts that stalled the CPU core for a whopping 12 cycles, whereas regular shifts were pipelined and would complete in two cycles. Less drastically, on many Intel x86 microarchitectures, “traditional” variable x86 shifts (SHR reg, cl and friends) are about three times as expensive as shifts by a compile-time constant.

This seems like another way in which the MSB-first variant is behind: the variants we’ve seen so far perform either two or three shifts per extract operation, two of which are variable-distance. But there’s a trick to get down to a single shift-type operation, namely by using a bit rotate (cyclical shift) instead of a regular shift:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_rot(const uint8_t *buf, size_t pos, int width) {
assert(width >= 0 && width <= 64 - 7);

// Read a 64-bit big-endian number starting from the byte
// containing bit number "pos" (relative to "buf").
uint64_t bits = read64BE(&buf[pos >> 3]);

// Rotate left to align the bottom of our bit field with the LSB.
bits = rotate_left64(bits, (pos & 7) + width);

// Return the bottom "width" bits.
// (Using a table here, but the other ways of masking discussed
// for LSB-first bit IO also work.)
}


Here, the rotate-left (which you need to figure out how to best access in your C compiler yourself; I want to avoid further digressions) first takes up the work of the original left-shift, and then rotates by an extra “width” bits to make our bit field wrap around from the most-significant bits of the value down to the least-significant, where we can then mask it the same way as in the LSB-first variant.

Oh, and I’ve also started writing the division by 8 as shift-right by 3, and the mod-8 of our unsigned position as a binary AND with 7. These are equivalent substitutions, and you’ll see both forms in real-world code, so I figured I’d mix it up a little.

This rotate-style masking operation was how we (being RAD Game Tools) used to perform bit reading on the Cell PPU and Xbox 360, purely because the variable-distance shifts were so dismal on that particular core. And I’ll note that this version also has no problems whatsoever with width == 0; the only problem is its dependence on rotate instructions, which are available (and fast) in most architectures, but tend to be somewhat inconvenient to access from C code.

At this point, I’ve talked a lot about many, many different ways to do a bit shifting and masking, and I’ll be referring back to these later. What I haven’t shown you yet is any actual practical bit IO logic—the bit extraction form is a useful starting point, and convenient to explain these concepts in without having to worry about state, but you’re unlikely to ever use it as presented in any production code, not least because it assumes that the entire bit stream is in memory and will gleefully read past the end of the buffer during normal operation!

However, this post is already long enough that WordPress editing is starting to get sluggish on me. So what I’ll do is split this up into a multi-part series; you get to take a breather, and we’ll meet back in the next part. Until then!

In this short post I’m going to attempt to convince you that current network (Internet) latencies are here to stay, because they are already within a fairly small factor of what is possible under known physics, and getting much closer to that limit – say, another 2x gain – requires heroics of civil and network engineering as well as massive capital expenditures that are very unlikely to be used for general internet links in the foreseeable future.

This is a conversation I’ve had a few times in my life, usually with surprised conversation partners; last year I happened to write a mail about it that I’m now going to recycle and turn into this blog post so that in the future, I can just link to this if it ever comes up again. :)

When I originally wrote said mail in October 2017, I started by pinging csail.mit.edu, a machine that as far as I know (and Geo-IP services are with me on this one) is actually at MIT, so in Cambridge, MA. I did this sitting at my Windows work machine in Seattle, WA, and got this result:

Pinging csail.mit.edu [128.30.2.121] with 32 bytes of data:
Reply from 128.30.2.121: bytes=32 time=71ms TTL=48
Reply from 128.30.2.121: bytes=32 time=71ms TTL=48
Reply from 128.30.2.121: bytes=32 time=70ms TTL=48
Reply from 128.30.2.121: bytes=32 time=71ms TTL=48


Ping times are round-trip times and include the time it takes for the network packet to go from my machine to the MIT server, the time it takes for the server to prepare a reply, and the time for said reply to make it back to my machine and get delivered to the running ping process. The best guess for a single-way trip is to just divide the RTT by 2, giving me an estimate of about 35ms for my ping packet to make it from Seattle to Boston.

Google Maps tells me the great circle distance from my office to Cambridge MA is about 4000km (2500 miles, if a kilometer means nothing to you). Any network packets I’m sending these days over normal network infrastructure are likely to use either optical fiber (especially for long-distance links) or copper cable as a transmission medium. The rule of thumb I learned in university that effective signal transmission speed over both is about 2/3rds of the speed of light in vacuum. This is called the velocity factor; that article has some actual numbers, which work out to 0.65c for Cat-6A twisted pair cable (used for 10Gbit Ethernet), 0.64c for Cat-5e (1Gbit Ethernet), and 0.67c for optical fiber, all of which are close enough to each other and to the aforementioned 2/3c rule of thumb that I won’t bother differentiating between different types of cables in the rest of this post.

Divide our distance of 2500mi by 2/3c, we get about 4×106m / (2×108 m/s) = 2 × 10-2 s = 20ms. That is the transmission delay we would have for a hypothetical optical fiber strung along the great circle between Seattle and Cambridge, the shortest distance between two points on a sphere; I’m neglecting height differences and Earth being not quite spherical here, but I’m only doing a back-of-the-envelope estimate. Note that the actual measured one-way latency I quoted above is already well below twice that. Hence my earlier comment about even a factor-of-2 improvement being unlikely.

That means that of our approximately 35ms one-way trip, 24ms is just from physics (speed of light, index of refraction of optical fiber) and logistics (not having a full mesh of minimum-distance point-to-point links between any two points in the network). The remaining 11ms of the transit time are, presumably, spent with the packets briefly enqueued in some routers, in signal boosters and network stacks. These are the parts that could be improved with advances in network technology. But even reducing that part of the cost to zero still wouldn’t give us even a full 1.5× speed-up. And that’s why I think mainstream network tech isn’t going to get that much faster anytime soon.

What if we are willing to pay up and lay a private dedicated fiber link for that distance (or use some dark fiber going the right direction that’s already there)? That’s still using mainstream tech, just spending money to reduce the fraction of time spent on sub-optimal routing (physical route that is) and in the network, and it seems likely that you could get the RTT to somewhere between 45ms and 50ms using regular fiber if you were willing to spend the money to set it up (and maintain it).

But that’s still assuming using something as pedestrian as fiber (or copper cable), and actually sticking to the surface of the Earth. Going even further along the “money is no object” scale, and edging a teensy bit into Bond Villain territory, we can upgrade our signal velocity to full light speed and also get rid of pesky detours like the curvature of the Earth by digging a perfectly straight tunnel between the two points, keeping a vacuum inside (not strictly necessary, but might as well while we’re at it) and then establishing a really high-powered microwave link; it would have to be to make it along a distance of a few thousand kilometers, given beam divergence. Keeping in theme, such a link would then also be attractive because the transceivers at either end because a sufficiently high-powered microwave transmitter should make for a serviceable death ray, in a pinch. More seriously, high-powered point-to-point microwave links are a thing, and are used in very latency-sensitive applications such as high-frequency trading. However, as far as I know (which might be wrong – feel free to correct me!), individual segments typically span distances of a few miles, not tens of miles, and definitely not hundreds. Longer links are then built by chaining multiple segments together (effectively a sequence of repeaters), adding a small delay on every split, so the effective transmission speed is definitely lower than full speed of light, though I don’t know by how much. And of course, such systems require uninterrupted lines of sight, which (under non-Bond-villain conditions) means they tend to not work, or only work in a diminished capacity, under bad weather conditions with poor visibility.

Anyway, for that level of investment, you should be able to get one-way trip times down to about 12ms or so, and round-trips of around 24ms. That is about 3× faster than current mainstream network tech, and gets us fairly close to the limits of known physics, but it’s also quite expensive to set up and operate, not as reliable, and has various other problems.

So, summarizing:

• If you have a good consumer/business-level Internet connection, ping someone who does likewise, and there are no major issues on the network in between, the RTTs you will get right now are within about a factor of 3 of the best they can possibly be as per currently known physics: unless we figure out a way around Special Relativity, it’s not getting better than straight-line line-of-sight light-speed communication.
• If you’re willing to spend a bunch of money to buy existing dark fiber
capacity that goes in the right direction and lay down some of your own
where there isn’t any, and you build it as a dedicated link, you should be able to get within 2× of the speed of light limit using fairly mainstream tech. (So about a 1.5× improvement over just using existing networks.)
• Getting substantially better than that with current tech requires
major civil engineering as well as a death ray. Upside: a death ray is its own reward.

Continued from part 6.

### 27. Qureshi, Jaleel et al. – “Adaptive Insertion Policies for High Performance Caching” (2007; computer architecture/caches)

I have written about caches before, so I’ll skip explaining what exactly a cache is, why you would want them and how they interact with the rest of the system and go straight to (some of) the details.

Caches in hardware are superficially similar to hash tables/maps/dictionaries on the software side: they’re associative data structures, meaning that instead of being accessed using something like an array index that (more or less) directly corresponds to a memory location, they’re indexed using a key, and then there’s an extra step where these keys are converted into locations of the corresponding memory cells. But unlike most software hash tables, caches in HW usually have a fixed size and hence “forget” old data as new data is being inserted.

There are a few ways this can work. On one extreme, you have what’s called “direct-mapped” caches. These compute a “hash function” of the key (the key often being a physical or virtual memory address, and the “hash function” most often being simply “take some of the address bits”, hence the quotation marks) and then
have a 1:1 mapping from the hash to a corresponding memory location in some dedicated local memory (mostly SRAM). The important point being that each key (address) has exactly one location it can go in the cache. Either the corresponding value is in that location, or it’s not in the cache at all. You need to store some extra metadata (the “tags”) so you know what keys are stored in what location, so you can verify whether the key in that location is the key you wanted, or just another key that happens to have the same hash code. Direct-mapped caches are very fast and have low power consumption, but they perform poorly when access patterns lead to lots of collisions (which often happens by accident).

At the other extreme, there’s “fully associative” caches. These allow any key to be stored in any location in the cache. Finding the corresponding location for a key means comparing that key against all the tags – that is, basically a linear search, except it’s usually done all in parallel. This is slower and a lot more power-hungry than direct-mapped caches; it also really doesn’t scale to large sizes. The primary advantage is that fully associative caches don’t have pathological cases caused by key hash collisions. When they’re used (mostly for things like TLBs), are rarely bigger than 40 entries or so.

In the middle between the two, there’s “set associative” caches (usually denoted N-way associative, with some number N). In a N-way associative cache, each key can be stored in one of N locations, which collectively make up the “set”. The set index can be determined directly from the key hash, and then there’s a (parallel) linear search over the remaining N items using the tags to determine which item in the set, if any, matches the key. This is the way most caches you’re likely to encounter work. Note that a direct-mapped cache is effectively a 1-way associative cache, and a fully associative cache is a cache where the degree of associativity is the same as the number of entries.

Because caches have a fixed size, inserting a new entry means that (in general) an older entry needs be evicted first (or replaced, anyway). In a direct-mapped cache, we have no choice: there is a single location where each key can go, so whatever item was in that slot gets evicted. When we have a higher number of ways, though, there’s multiple possible locations where that key could go, and we get to decide what to evict to free up space.

This decision is the “insertion policy” (also “replacement policy” or “eviction policy” in the literature – these are not quite the same, but very closely related) in the paper title. The problem to be solved here is essentially the same problem solved by operating systems that are trying to decide which portions of a larger virtual memory pool to keep resident in physical memory, and therefore classic page replacement algorithms apply.

This is a classic CS problem with a known, provably optimal solution: Bélády’s OPT. The only problem with OPT is that it heavily relies on time travel, which is permissible for research purposes but illegal to use in production worldwide, as per the 1953 Lucerne Convention on Causality and Timeline Integrity. Therefore, mass-market hardware relies on other heuristics, most commonly LRU (Least-Recently Used), meaning that the item that gets evicted is whatever was used least recently. (In practice, essentially nobody does true LRU for N>2 either, but rather one of various LRU approximations that are cheaper to implement in HW.)

LRU works quite well in practice, but has one big weakness: it’s not “scan resistant”, meaning it performs poorly on workloads that iterate over a working set larger than the size of the cache – the problem being that doing so will tend to wipe out most of the prior contents of the cache, some of which is still useful, with data that is quite unlikely to be referenced again before it gets evicted (“thrashing”).

One option to avoid this problem is to not use LRU (or an approximation) at all. For example, several of the ARM Cortex CPU cores use (pseudo-)random replacement policies in their caches, which is not as good as well-working LRU, but at the same time better than a thrashing LRU.

This paper instead proposes a sequence of modifications to standard LRU that require very little in the way of extra hardware but substantially improve the performance for LRU-thrashing workloads.

The first step is what they call LIP (“LRU Insertion Policy”): when thrashing is occurring, it’s not useful to keep cache lines around unless we think they’re going to be reused. The idea is simple: replace the least-recently-used cache line as before, but then don’t mark the newly-inserted cache line as recently used, meaning that cache line stays flagged as least-recently used. If there is a second access to this cache line before the next insertion, it will get flagged as recently used; but if there isn’t, that cache line will get replaced immediately. In a N-way associative cache, this means that at most 1/N’th of the cache will get wiped on a scan-like workload that iterates over a bunch of data that is not referenced again in the near future. (It is worth noting here that updating the recently-used state as described makes sense for L2 and deeper cache levels, where the primary type of transactions involves full-size cache lines, but is less useful in L1 caches, since most processors don’t have load instructions that are as wide as a full cache line, so even a pure sequential scan will often show up as multiple accesses to the same cache line at the L1 level.)

LIP is scan-resistant, but also very aggressive at discarding data that isn’t quickly re-referenced, and hence doesn’t deal well with scenarios where the working set changes (which happens frequently). The next step is what the paper calls BIP (“bimodal insertion policy”): this is like LIP, except that the just-inserted cache line is flagged as recently used some of the time, with low probability. The paper just uses a small counter (5 bits) and performs a recency update on insertion whenever the counter is zero.

BIP is a good policy for workloads that would thrash with pure LRU, but not as good as LRU on workloads that have high locality of reference (which many do). What we’d really like to do is have the best of both worlds: use LRU when it’s working well, and only switch to BIP when we’re running into thrashing, which is what the authors call DIP (“Dynamic Insertion Policy”), the “adaptive insertion policy” from the paper title. The way they end up recommending (DIP via “Set Dueling”) is really simple and cool: a small fraction of the sets in the cache always performs updates via pure LRU, and another small fraction (of the same size) always performs updates via pure BIP. These two fractions make up the “dedicated sets”. The cache then keeps track of how many misses occur in the pure-LRU and pure-BIP sets, using a single saturating 10-bit counter, “PSEL”: when there’s a miss in one of the pure-LRU sets, PSEL is incremented. When one of the pure-BIP sets misses, PSEL is decremented.

In effect, the algorithm is continuously running experiments in a small fraction of the cache. When PSEL is small (below 512), LRU is “winning”: there are fewer misses in the LRU sets than in the BIP sets. When PSEL is larger, BIP is in the lead, producing fewer misses than LRU. The rest of the sets (which makes up the majority of them) are “follower sets”, and use whatever insertion policy is currently winning. If PSEL is below 512, the follower sets use LRU, else they use BIP.

And… that’s it. The implementation of this in hardware is a really simple modification to an existing cache module; it requires the two counters (15 bits of storage total), a tiny bit of glue logic, and the only extra decision that needs to happen is whether the LRU status should be updated on newly-inserted cache lines (entries) or not. In particular, the timing-critical regular cache access path (read with no new insertion) stays exactly the same. Most of the relevant bits of the design fits into a tiny block diagram in figure 16 of the paper.

What’s amazing is that despite this relative simplicity, DIP results in a significant improvement on many SPEC CPU2000 benchmarks, bridging two thirds of the gap between regular LRU and the unimplementable OPT, and performing significantly better than all other evaluated competing schemes (including significantly more complex algorithms such as using ARC), despite only having a fraction of the hardware overhead.

Even if you’re not designing hardware (and most of use are not), the ideas in here are pretty useful in designing software-managed caches of various types.

### 28. Thompson – “Reflections on Trusting Trust” (1983 Turing Award lecture; security)

Ken Thompson and Dennis Ritchie got the 1983 ACM Turing Award for their work on Unix. Ken Thompson delivered this acceptance speech, which turned into a classic paper on its own right.

In it, he describes what is now known as the “Thompson hack”. The idea is simple: suppose you put a really blatant backdoor in some important system program like the “login” command. This is easy to do given the source, but also really blatant – why would anyone run your modified login program of their own volition?

So for the next stage, suppose you don’t actually put the backdoor in the login program itself. Instead, you hijack some program that is used to produce such system programs, like the C compiler. Say, instead of inserting your backdoor directly in “login”, you insert some evil code in the system C compiler that checks whether you’re compiling the login program, and if so, inserts your backdoor code itself. Now, a security review of the login source code will not turn up anything suspicious, but the compiler will still insert the backdoor on every compilation. However, you now put some (even more highly suspicious) code into the C compiler!

For the third and final stage, we can make this disappear as well: the important thing to note is that all major C compilers are, themselves, written in C (“bootstrapping”). Therefore, we can now add some more logic to the backdoor we added to the C compiler in the previous step: we also add some code to the C compiler that adds in our backdoors (including modification of the login program, as well as the changes we’re making to the C compiler itself) whenever the C compiler itself is compiled.

Having done this successfully, we can remove all traces of the backdoor from the C compiler source code: we now have a binary of a modified. evil C compiler that backdoors “login”, and also detects when we’re trying to compile a clean C compiler, producing another evil C compiler with the same backdoors instead. This kind of trickery is impossible to detect via source code review; and if you’re thinking that you would spot it in the binary, well, you can imagine more involved versions of the same idea that also hijack your operating system, file system drivers etc. to make it impossible to see the surreptitious modifications to the original binary.

That is not just idle musing; 33 years after Ken Thompson’s acceptance speech, this problem is more current than ever. Persistent malware that burrows into trusted components of the machine (like firmware) and is undetectable by the rest of the firmware, the OS and everything that runs on the machine is a reality, and the recent security vulnerabilities in the Intel Management Engine present on most processors Intel has shipped in the past 10 years are a reality.

Over the past few years, there have been efforts towards reproducible builds of open-source OSes and trustable (trustworthy?) end-to-end verifiable hardware design, but this is a real problem as more and more of our lives and data moves to computers we have no reason to trust (and sometimes every reason not to). We’re not getting out of this any time soon.

### 29. Dijkstra, Lamport et. al-“On-the-fly Garbage Collection: an Exercise in Cooperation” (1978; garbage collection)

This is the paper that introduced concurrent Garbage Collection via the tri-color marking invariant. It forms the basis for most non-concurrent incremental collectors as well. No matter where you stand on garbage collection, I think it’s useful (and interesting!) to know how collectors work, and this is one paper you’ll be hard-pressed to avoid when delving into the matter; despite the claim in the paper’s introduction that “it has hardly been our purpose to contribute specifically to the art of garbage collection, and consequently no practical significance is claimed for our solution”, this is definitely one of the most important and influential papers on GC ever written.

I won’t go into the details here; but as with anything Lamport has written or co-authored, his notes on the paper are worth checking out.

### 30. Lamport – “Multiple Byte Processing with Full-Word Instructions” (1975; bit twiddling)

In my original list on Twitter, I summarized this by saying that “I thought I wouldn’t need to do this anymore when we got byte-wise SIMD instructions on CPUs, and then CUDA came along and it was suddenly profitable on GPUs.”

This style of technique (often called “SWAR” for SIMD-within-a-register) is not new (nor was it in 1975, as far as I can tell), but Lamport’s exposition is better than most, since most sources just give a few examples and expect you to pick up on the underlying ideas themselves. Fundamentally, it’s about synthesizing at least some basic integer SIMD operations from non-SIMD integer arithmetic.

I used to consider this kind of thing a cute hack back when I was first doing graphics with software rendering back in the mid-90s. As my earlier quote illustrates, I thought these tricks were mostly obsolete when most CPUs started getting some form of packed SIMD support.

However, I used this technique in my DXT1 compressor back in 2007 (primarily because I only need a few narrow bit fields and SIMD intrinsic support in most compilers was in a relatively sorry state back then), published the source code, and was soon told by Ignacio Castaño (then at NVidia) that this technique was good a for about a 15% increase (if I remember correctly) in throughput for the NVidia Texture Tools CUDA encoder.

Since then, I have used SWAR techniques in GPU kernels myself, and I’ve also used them to paper over missing instructions in some SIMD instruction sets for multi-platform code – some of this quite recently (within the past year).

Within the last few years, PC GPUs have started adding support for narrower-than-32-bit data types and packed arithmetic, and mobile GPUs have always had them. And no matter what kind of computing device you’re writing code for, sometimes you just need to execute a huge number of very simple operations on tons of data, and a 2x-4x (or even more) density increase over standard data types is worth the extra hassle.

Separately, I just keep bumping into ways to map certain kinds of parallel-scan type operations to integer adder carry logic. It’s never as good as dedicated hardware would be, but unlike specialized hardware, integer adders are available everywhere right now and are not going anywhere.

In short, I’ve changed my mind on this one. It’s probably always going to be a niche technique, but it’s worth checking out if you’re interested in low-level bit hacks, and the wins can be fairly substantial sometimes. You might never get to use it (and if you do, please please write proper comments with references!), but I’ve come back to it often enough by now to consider it a useful part of my tool chest.

### The end

And that’s it! It took me a while to get through the full list of my recommendations and write the extended summaries. That said, spacing the posts out like this is probably better to read than having a giant dump of 30+ papers recommended all at once.

Either way, with this list done, regular blogging will resume in the near future. At least that’s the plan.

Continued from part 5. (Sorry for the long delay, I was busy with other things.)

### 24. O’Neill – “PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation” (2014; pseudo-random number generation)

This paper (well, more like a technical report really) introduces a new family of pseudo-random number generators, but unlike most such papers, it’s essentially self-contained and features a great introduction to much of the underlying theory. Highly recommended if you’re interested in how random number generators work.

Unlike most of the papers on the list, there’s not much reason for me to write a lot about the background here, since essentially everything you need to know is already in there!

That said, the section I think everyone using random-number generators should have a look at is section 3, “Quantifying Statistical Performance”. Specifically, the key insight is that any pseudorandom number generator with a finite amount of state must, of necessity, have certain biases in the sequences of numbers it outputs. A RNG’s period is bounded by the amount of state it has – a RNG with N bits can’t possibly have a period longer than 2N, per the pigeonhole principle – and if we also expect the distribution of values over that full period to be uniform (as is usually desired for RNGs), that means that as we draw more and more random numbers (and get closer to exhausting a full period of the RNG), the distribution of numbers drawn must – of necessity – become distinguishable from “random”. (Note that how exactly to measure “randomness” is a thorny problem; you can’t really prove that something is random, but you can prove that something isn’t random, namely by showing that is has some kind of discernible structure or bias. Random number generators are generally evaluated by running them through a barrage of statistical tests; passing these tests is no guarantee that the sequence produced by a RNG is “sufficiently random”, but failing even one test is strong evidence that it isn’t.)

What the paper then does is instead of comparing random number generators with N bits of internal state to an idealized random source, they instead compare them to the performance of an ideal random number generator with N bits of internal state – ideal here meaning that its internal state evolves according to a randomly chosen permutation $f : 2^N \rightarrow 2^N$ with a single cycle. That is to say, instead of picking a new “truly random” number every time the RNG is called, we now require that it has a limited amount of state; there is a (randomly chosen!) list of 2N uniformly distributed random numbers that the generator cycles through, and because the generator only has N bits of state, after 2N evaluations the list must repeat. (A side effect of this is that any RNG that outputs its entire state as its return value, e.g. many classic LCG and LFSR generators, will be easy to distinguish from true randomness because it never repeats any value until looping around a whole period, which is statistically highly unlikely.) Moreover, after going through a full cycle of 2N random numbers, we expect all possible outputs to have occurred with (as close as possible to) uniform probability.

This shift of perspective of comparing not to an ideal random number generator, but to the best any random number generator can do with a given amount of state, is crucial. The same insight also underlies another fairly important discovery of the last few years, cryptographic sponge functions. Cryptographic sponges started as a theoretical construction for hash function construction, starting from the desire to compare not to an idealized “random oracle”, but instead a “random sponge” – which is like a random oracle except it has bounded internal state (as any practical cryptographic hash has, and must).

Armed with this insight, this paper then goes on to compare the performance of various published random number generators against the theoretical performance of an ideal RNG with the same amount of state. No generator with a given amount of state can, in the long term and on average, do better than such an ideal RNG with the same amount of state. Therefore, checking how close any particular algorithm with a given state size comes to the statistical performance of a same-state-size ideal RNG is a decent measure of the algorithm quality – and conversely, reducing the state size must make any RNG, no matter how good otherwise, eventually fail statistical tests. How much the state size can be reduced before statistical deficiencies become apparent is thus an interesting metric to collect. The paper goes on to do just that, testing using the well-respected TestU01 test suite by L’Ecuyer and Simard. It turns out that:

• Many popular random number generators fail certain randomness tests, regardless of their state size. For example, the Mersenne Twister, being essentially a giant Generalized Feedback Shift Register (a generalization of Linear-Feedback Shift Registers, one of the “classic” pseudorandom number generation algorithms), has easily detectable biases despite its giant state size of nearly 2.5 kilobytes.
• Linear Congruential Generators (LCGs), the other big “classic” family of PRNGs, don’t perform particularly great for small to medium state sizes (i.e. the range most typical LCGs fall into), but improve fairly rapidly and consistently as their state size increases.
• “Hybrid” generators that combine a simple basic generator to evolve the internal state with a different function to produce the output bits from the state empirically have much better statistical performance than using the basic generators directly does.

From there on, the paper constructs a new family of pseudorandom number generators, PCGs (“permuted congruential generators”), by combining basic LCGs (with sufficient amount of state bits) for the internal state evolution with systematically constructed output functions based on certain types of permutations, and shows that they combine state-of-the-art statistical performance with relatively small state sizes (i.e. low memory usage) and high speed.

### 25. Cook, Podelski, Rybalchenko – “Termination Proofs for Systems Code (2006; theoretical CS/program analysis)”

This paper is about an algorithm that can automatically prove whether programs terminate. If you’ve learned about the halting problem, this might give you pause.

I’ll not talk much about the actual algorithm here, and instead use this space to try and explain how to reconcile results such as the halting problem and Rice’s theorem with the existence of software tools that can prove termination, do static code analysis/linting, or even just advanced optimizing compilers!

The halting problem, informally stated, is the problem of deciding whether a given program with a given input will eventually complete, or whether it will run forever. Alan Turing famously proved that this problem is undecidable, meaning that there can be no algorithm that will always give a correct yes-or-no answer to any instance of this problem after running for a finite amount of time.

Note that I’m stating this quite carefully, because it turns out that all of the small restrictions in the above paragraph are important. It is very easy to describe an algorithm that always gives a yes-or-no answer to any halting problem instance if the answer isn’t required to be correct (duh); the problem is also tractable if you don’t require the algorithm to work on any program, since there are absolutely sets of programs for which it is easy to prove whether they halt or not. And finally, if we relax the “finite run time” requirement, we can get by on a technicality: just run the program. If it has terminated yet, return “yes”; if not, well, it might terminate further on, so let’s not commit to an answer yet and keep going!

Rice’s theorem builds on the halting problem construction to show that there can be no algorithm that can, with certainty, produce the answer to any “interesting” (meaning it is not true or false for all possible programs) yes-or-no question in finite time.

That sounds like pretty bad news for program analysis. But it turns out that even if we want something that is correct, works on any program, and answers in a finite amount of time, we have one last resort left: we can relax the “yes-or-no” requirement, and give our algorithm a third possible return value, “not sure”. Once we allow for that option, it’s obvious that there is a (trivial) solution: the easiest such solver can just return “not sure” for absolutely every program. This is technically correct (the best kind!) but not very useful; however, it gives us something to build on. Namely, instead of having to come up with a definitive yes-or-no answer on the unwieldy set of all possible programs given all possible inputs (which, as stated before, is undecidable), we lower our aim slightly and merely try to be able to give termination proofs (or answer other program analysis-type questions) for some programs – hopefully practically relevant ones. And if we get a program that we can’t say anything with certainty about, we always have our “not sure” escape hatch.

To explain, let me show you three functions, this time using a Python 3-esque instead of my usual C/C++-esque pseudocode, to show a few different ways this can play out:

def func1():
while True:
print('Hello world!')

def func2(n):
n = int(n)
while n > 0:
print('All good things must end!')
n -= 1

def func3(n):
n = int(n)
if n <= 0:
return

while n != 1:
print(n)
if (n % 2) == 0:
n = n/2
else:
n = 3*n + 1


The first function, func1, contains an obvious infinite loop. It’s easy to see (and prove) that it never terminates.

The second function, func2, contains a loop that is definitely terminating. (The conversion to integer at the start is an important technicality here, since this kind of loop is not guaranteed to terminate if n is a sufficiently large floating-point number.) The argument here is straightforward: n keeps steadily decreasing throughout the loop, and any integer, no matter how big, can only be decremented a finite number of times before it will be ≤0. Very similar arguments apply for loops iterating over lists that aren’t modified inside the loop, and other typical iteration constructs – in short, a very common class of iteration in programs can be shown to terminate quite easily. (Simplifying a lot, the Terminator paper uses a generalized version of this type of argument to prove termination, when it can.)

Finally, func3 shows a case that is legitimately hard. Showing that this program always terminates is equivalent to proving the Collatz conjecture, a long-standing open problem. Currently, it is conjectured that it will always terminate, and it has been verified that there are no “small” counterexamples (the sequence definitely terminates for any number up to around 260, for example), but nobody knows the answer for sure – or if they do, they aren’t talking.

This is also the reason for the underlying difficulty here: the set of “all possible programs” is extremely big, and many other complex problems – like, in this case, mathematical theorems – can be converted into termination proofs for certain programs.

However, the second example shows that there is hope: while some simple programs are hard to prove anything about, many programs – even fairly complex ones – only use constructs that are fairly easy to show are terminating, and even when we can’t prove anything about a whole program, being able to automatically prove certain properties about say parts of the program is a lot better than not having anything at all.

The same thing applies to other types of program analysis, like for example used in optimizing compilers: if the compiler can figure out some non-trivial property of the program that is useful for some tricky transformation, great! If not, it’s stuck with leaving the code in the form that it’s written, but that’s still no worse than we started out with.

### 26. O’Neill – “The Genuine Sieve of Eratosthenes” (2009; programming)

As a bit of a palate cleanser after the previous paper (which is rather technical), this is a paper about a pet peeve of mine. Specifically, describing something like this piece of Haskell code (using the version from the paper) as the “Sieve of Eratosthenes” (a standard algorithm to find prime numbers):

  primes = sieve [2..]

sieve (p : xs) = p : sieve [x | x <- xs, x mod p > 0]


The operation of this type of program (which is not limited to lazy functional languages, by the way) can indeed be understood as a kind of “sieving” process, but the underlying algorithm is not, in fact, the Sieve of Eratosthenes; rather, it is equivalent to a rather inefficient implementation of basic trial division, that ends up with significantly worse asymptotic complexity than either “proper” trial division or the true Sieve of Eratosthenes (see section 2.1 of the paper). Its run time is, for large n, worse than the true sieve by about a factor of $n/(\log(n))^2$ – not quite “accidentally quadratic”, but close! (I’m neglecting a $\log \log n$ term for the true sieve, but iterated logs grow so slowly that they’re essentially negligible for algorithms that need memory proportional to n, since in that case, the possible values for $\log \log n$ on any physically realizable machine are bounded by a very small constant.)

Now, the interesting part here isn’t that a given Haskell one-liner (or two-liner) isn’t the most efficient way to solve this problem; that’s hardly surprising. But what I do find interesting is that the “unfaithful sieving” algorithm implemented by the short program above is frequently claimed to be the true Sieve of Eratosthenes, despite being not even close.

It goes to show that what feels like small implementation details can often make very significant differences, and in ways that even seasoned experts miss for years. It’s not a bad idea to get into the habit of plotting the run times of any algorithm you implement for a few problem sizes, just to check whether it looks like it has the asymptotic runtime you think it should have or not. In this case, it hardly matters, but for more serious applications, it’s a lot better to discover this kind of problem during testing that it is in production.

Continued from part 4.

### 19. Curtsinger, Berger – “Stabilizer: Statistically sound performance evaluation” (2013; performance evaluation)

Current CPUs are chock-full of associative data structures that are indexed by memory addresses, either virtual or physical. In no particular order, we have: multiple cache levels (for instructions, data, or both), multiple TLB levels, page entry caches, branch target buffers (multiple levels), branch direction history, snoop filters, cache directories and more. (Not all of these exist everywhere or are necessarily distinct.)

All of these implement a kind of “forgetful” dictionary/map structure, and they are usually implemented as set-associative caches. Without going into too much detail, that means that addresses are “hashed” (some caches use relatively decent hash functions, but most that are on time-critical paths just use a few bits from the middle of the address as their “hash”). For a given hash value, there are then typically somewhere between 2 and 16 locations in the cache that values with that hash can be stored in (that’s the “set” in “set-associative”). Generally, to free up a slot, something else has to be thrown out (“evicted”) first; there’s a lot of design freedom in how caches pick what gets evicted. Most commonly, it’s either one of several LRU approximations (“real” LRU for more than 3-4 elements is relatively expensive in hardware) or just a (pseudo-)random eviction policy.

Even when there is no randomness in the cache policy, the addresses themselves are also somewhat random. Virtual addresses within a process used to be mostly the same between two runs of the same program (this is possible because each process gets its own address space), but this made it easy to write reproducible security exploits, so we got ASLR (address space layout randomization), which deliberately shuffles the memory map to make writing reliable exploits harder. Even without ASLR, memory maps for separate runs of the same process weren’t always exactly the same; for example, changing the size of the environment variables or command line can also have a big knock-on effect on the memory map as various areas get moved to make space.

Modifying code in any way whatsoever can have significant effects on memory layout too. If a single function changes size, the linker is going to end up moving everything after it (for a randomly picked function, on average that’s going to be half the program) to a different location.

For anything that is indexed with physical memory addresses (typically L2 and below cache levels, and everything associated with them), things are even less predictable. Two back-to-back runs of the same process may have the same virtual address map if they have the same environment, command line and ASLR is disabled, but they’re unlikely to get assigned the exact same physical memory as the previous run, because physical memory is a globally shared resource between all the processes running on a system, and the OS is servicing physical memory allocations and deallocations all the time.

The end result is that even if you manage to carefully control for other spanners in the works such as background tasks and services (which have a tendency to start intensive tasks like indexing when you least expect it), network traffic (lots of incoming or outgoing packets that are not for your program can have a serious effect), available memory, and GPU/CPU voltage/frequency scaling, it’s possible to end up with dramatic execution time differences between either two runs of the same program, or two runs of two versions of a program that haven’t touched any code in the inner loops.

How bad does this get for real-world code? Now don’t get me wrong, many programs aren’t significantly affected by such issues at all. But some are; and microscopic layout differences can have decidedly macroscopic consequences. The SPECint 2000 “perl” benchmark (which is just the Perl interpreter running some – completely deterministic! – code) is infamous for having +-5% swings (this happens for multiple architectures) whenever anything in the executable changes. “Causes of Performance Instability due to Code Placement in X86” (presentation given at the LLVM US Dev Meeting 2016) has some particularly drastic examples.

This kind of thing is particularly insidious when you’re spending time tinkering with some hot loop. Ever made a change that really should have made things faster but instead was a noticeable slow-down? Often these things are explained if you look at the generated machine code (if you are the kind of person who does that, that is), but sometimes they really are completely mysterious and apparently random – or at least “twitchy”, meaning there are big swings whenever you make any change, with no discernible rhyme or reason. It’s really frustrating to chase this type of problem down if you’re ever stuck with it (it involves a lot of staring at CPU performance counters between different versions of a program), but probably worse is when you don’t notice it’s happening at all, and think that what ends up being a “random” (or at least accidental/unintended) fluctuation is due to an intentional change working as expected.

That’s where this paper comes in. Stabilizer uses various tricks to (to the extent possible) re-randomize memory layout in LLVM-compiled code periodically while the app is running. It can’t move existing allocations, but it can add random padding on thread stacks, make new heap allocations return addresses in a different range, and shuffle machine code around. While any individual memory layout has its own biases, sampling over many independent memory layouts during a test run allows systematically controlling for memory layout effects. Ideally (if the memory layout gets sufficiently randomized and there are no other confounding factors), it results in the different runs being independent and identically distributed, meaning the Central Limit Theorem applies and the resulting sum distribution is Gaussian. That’s very good news because Gaussians are easy to do statistical hypothesis testing with, which the paper then proceeds to do.

I wish this kind of thing was standard on every platform I’m developing for, but sadly the paper’s implementation never seems to have made it past a research prototype (as far as I can tell).

### 20. Tomasulo – “An efficient algorithm for exploiting multiple arithmetic units” (1967; computer architecture)

Many important computer architecture ideas are way older than you might expect. The primary reason is that there was intensive research and fierce competition in the mainframe/supercomputer market in the 1960s and 70s. Decades later, after advances in semiconductor manufacturing made microprocessors (entire processors on a single chip!) first possible and then kept the available transistor budget at consumer-relevant prices increasing exponentially, more and more ideas that used to be limited to top-of-the-line, room-sized computing devices became relevant for mass-market electronics.

Tomasulo’s paper talks about the floating-point unit in the very top of the line of the famous System/360 series, the Model 91. Sources disagree on how many of these were ever built; the count is somewhere between 10 and 20. Worldwide. Just the CPU set you back about $5.5 million, in 1966 dollars (which would be$41.5 million at the time I’m writing this, September 2017). But it wouldn’t do you much good by itself; you’d also want to shop for a console, a couple punch-card readers, a tape drive or two, maybe even a disk drive, some line printers… you get the idea. My point being, the computer market was different in the 1960s.

System/360 is extremely influential. It pioneered the idea of an Instruction Set Architecture (ISA). In the 60s, every CPU model had its own instruction set, operating system, and compilers. It was considered normal that you’d have to rewrite all your software when you got a newer machine. System/360 had a different idea: it specified a single ISA that had multiple implementations. At the low end, there was the Model 30. It executed 34500 instructions per second and had, depending on configuration, somewhere between 8KB and 64KB of core memory (and I do mean core memory). At the very top, there was the Model 91, running 16.7 million instructions per second and with several megabytes of RAM. Of course, there were several models in between. And all of them would run the exact same programs. Note I wrote is earlier; the direct descendants of System/360 are still around, albeit re-branded as “z/Architecture” (slashes still going strong after 50 years!) when it was extended to 64 bits. IBM announced a new iteration, the z14, with custom CPU just a few weeks ago, and yes, they are still backwards compatible and will run 1960s System/360 code if you want them to.

Snicker all you want about mainframes, but if you manage to design a computer architecture in the 1960s that is still a multi-billion-dollar-a-year business (and still gets new implementations 50 years on), you certainly did something right.

Anyway: in retrospect, we can clearly say that this whole ISA idea was a good one. But it comes with a new problem: if the whole pitch for your product line is that you can always upgrade to a bigger machine if you need higher performance, then these bigger machines better deliver increased performance on whatever code you’re running, without a programmer having to modify the programs.

Which brings us to the Model 91 floating-point unit and Tomasulo’s algorithm. The model 91 FPU had separate, pipelined floating-point adders and multipliers. This is now completely standard, but was novel and noteworthy then (there’s a separate paper on the FPU design that’s fascinating if you’re interested in that sort of thing; among other things, that FPU was the source of Goldschmidt’s division algorithm).

The Model 91’s instruction unit could deliver one instruction per clock cycle. FP additions took 2 cycles (pipelined), single-precision multiplies 3 cycles (pipelined), and single-precision divisions 12 cycles (blocking, since they are executed as a sequence of multiplies).

Now if you were writing code for that specific machine by hand, unrolling loops where necessary etc., it would not be very hard to write code that actually achieves a rate of 1 new instruction per clock cycle – or at least, it wouldn’t be if the original System/360 architecture had more than 4 floating-point registers. But combine the 4 floating-point registers available, the relatively slow storage access (judging by the paper, loads had a latency of about 10 clock cycles), and the desire to be “plug-in compatible” and not require hand-tuned code for good performance if possible, a different solution was required.

Thus, Tomasulo’s algorithm, the first “shipping” instance of out-of-order execution, albeit only for the FPU portion of the Model 91, and only issuing one instruction per cycle. (The first attempt at superscalar out-of-order execution I’m aware of was designed by Lynn Conway for the IBM ACS-1 around the same time, but the project got canned for political reasons).

What I particularly like about Tomasulo’s paper is that it illustrates the process of incrementally modifying the initial in-order floating point unit design to its out-of-order equivalent. Textbook treatments of out-of-order execution generally deal with more complicated full out-of-order machines; focusing it on a single FPU makes it easy to see what is going on, and how the changes are relatively incremental.

The underlying ideas for out-of-order execution are closely related to the local value numbering I talked about last time in the context of SSA construction – namely, eliminating unnecessary dependencies that are just the result of name collisions rather than true dataflow between operations. CPUs don’t have the same difficulty as compilers in that they don’t need to build a representation valid for all possible control flow sequences through a given function; instead, they can either wait until control flow is decided (as in the Model 91 variant) or use speculation to, effectively, bet on a single likely control flow outcome when encountering a branch. (If that bet turns out to be wrong, all instructions thus started have to be cancelled.)

While out-of-order execution as a concept has been around for a long time, as far as I can tell, the first mass-produced fully out-of-order machines – in the sense of fully committing to out-of-order and speculative execution for the majority of the processor core, not just at the “periphery” (say for the FPU) – came out all within a few years of each other in the mid-90s. There’s IBM’s PowerPC 604 (December 1994), Intel’s Pentium Pro (November 1995), the MIPS R10000 (January 1996), and the DEC Alpha 21264 (October 1996).

Of these, the first two use a variant of Tomasulo’s algorithm using reservation stations, and the last two used a different approach based on explicit register renaming. Some processors even mixed the two styles: several of the early AMD Athlons used Tomasulo in the integer pipe and explicit register renaming for floating-point/SIMD.

Newer processors seem to tend towards explicit renaming; it usually has less data movement than Tomasulo’s algorithm, because for the most part, the signals passed through the pipeline are indices into a physical register file instead of actual values. The difference is especially pronounced with wide SIMD registers. But Tomasulo’s algorithm was very popular for quite a long time.

### 21. Wilson, Johnstone, Neely, Boles – “Dynamic storage allocation: A survey and critical review” (1995; memory allocation)

This one is a survey of memory allocation algorithms as of 1995; that is to say, it collects almost everything you need to know about how to implement memory (de)allocation for single-threaded programs, but doesn’t cover multi-threaded environments, which only really became a serious concern later. That’s not as big a gap as it may sound, because the main adaptations necessary to get a useful multi-threaded allocator are, at least conceptually, relatively clean. (one-sentence summary of proven-out approaches: thread-local caches, deferring of cross-thread frees, and multiple arenas).

But back to 1995: why is this paper on my list? Well, if you started programming in the early 90s on home computers or PCs (like me), then your early experiences with dynamic memory allocation pretty much sucked. (As far as I can tell, many late-80s era Unices were not all that much better.)

The key problems were this:

1. These allocators tended to be buggy, slow, or both at the same time. For example, it was fairly common for realloc implementation to be buggy and corrupt the heap in certain circumstances.
2. They tended to have high levels of memory fragmentation, meaning that programs that did many allocations and deallocations would tend to end up reserving a lot more memory space than they were actively using, because there were lots of awkwardly-sized holes in the middle of the address space that were not large enough to satisfy normal allocation requests.
3. All of this would be exacerbated by running in environment without virtual memory (this one did not apply to the aforementioned Unices, obviously). With virtual memory and swap space, growing memory use over time is more an inconvenience than a show-stopper. The program’s performance (and system performance in general) slowly degrades. Without virtual memory, the moment you want to allocate even one byte more than there is memory in the machine, you get an “out of memory” condition and, more likely than not, your program crashes.

The reason I’m citing this paper is because it turns out that the first two issues are related. In particular, it turns out that there were many issues with the way memory allocators were usually designed and evaluated. Most significantly, the “standard procedure” to evaluate allocators in the literature for a long time was to generate statistics for the sizes of memory allocations from real programs (or sometimes by building Markov models for the allocation/deallocation patterns), and then evaluate allocators using synthetic traces generated to match those statistics, rather than using actual allocation traces recorded from full runs of some test programs.

It turns out that this is a serious mistake. One of the most important characteristics of “real” allocation traces is that allocation and deallocation patterns tend to be bursty, and the statistics used didn’t capture any of those patterns.

This survey looks critically at this evaluation methodology, points out its flaws, and also points out the necessity of systematically analyzing the effect of individual allocation policies, rather than just testing complete allocators against each other.

In the follow-up paper “The memory fragmentation problem: solved?” the same authors analyze various allocator implementations and show that roving-pointer “next-fit”, one of the more popular algorithms at the time (and recommended by Knuth’s “The Art of Computer Programming”), has particularly bad fragmentation behavior, and shows that address-ordered first fit and best fit perform quite well, having very low overall fragmentation. In short, the high fragmentation produced by many late-80s and early-90s allocators was primarily because they used an algorithm that exacerbates the problem, which went unnoticed for a long time because the accepted analysis methodology was deeply flawed.

The authors of both papers collaborated with Doug Lea, who improved his allocator dlmalloc in response; dlmalloc implements a combination of segregated-storage (for smaller sizes) and best-fit (based on a radix tree, for larger requests). The resulting allocator is quite famous and was the basis for many later follow-up allocators.

### 22. Meyer, Tischer – “GLICBAWLS – Grey Level Image Compression By Adaptive Weighed Least Squares” (2001; image compression)

This paper is not like the other papers on this list.

For one thing, the name is a pretty obvious backronym. For another, the paper comes with a reference implementation – because the algorithm was originally developed as an entry for the International Obfuscated C Code Contest and fits in 1839 bytes of C / ASCII art.

Nevertheless, the algorithm is quite elegant and was, at the time it was released, easily among the best lossless grayscale image compressors. I wouldn’t use it today because it’s just way too serial internally, but I still have a soft spot for it.

### 23. Hutton et al. – “Improving FPGA Performance and Area Using an Adaptive Logic Module” (2004; FPGAs)

This one, I don’t have that much to write about.

Here’s the short version: the combinational logic elements in FPGAs are internally realized as small lookup tables (LUTs). Arbitrary Boolean functions are thus described by their truth tables, which is stored in small SRAMs: 16 bits of SRAM for a 4-input binary logic function.

On the first page, the paper has a picture of a simple logic cell: a 4-bit LUT plus a D flip-flop. This used to be the standard building block of FPGAs, because it yields a good trade-off between area and delay for synthesized designs.

What this paper does is derive the design of a different logic module cell that can express either multiple 4-input LUTs, arbitrary 6-input LUTs, certain pairs of two 6-input logic functions with two outputs, and certain 7-input logic functions.

This is purely geeking out but it’s pretty cool if you’re interested in that sort of thing!

Continued from part 3.

### 16. O’Donoghue et al. – “Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding” (2016; numerical math/mathematical optimization)

This is a very neat first-order method to solve cone programs that brings together several key advances in the theory over the past 25 years to produce a conceptually simple (and quite short!) yet powerful solver.

Unfortunately, this puts me in a bit of a pickle here, because I expect most of you don’t know what cone programs are, what “operator splitting” is, or for that matter, what is meant by “homogeneous self-dual embedding” or “first-order method” here.

So I’m going to do the same thing I did in part 2 when talking about matrix multiplies and solving linear systems of equations, and back up a whole lot.

We saw linear systems of equations; in matrix form, we were trying to solve
$Ax = b$
where $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^n$. If A is regular (nonzero determinant), this problem has exactly one solution, and barring potential numerical issues if A is ill-conditioned (just ignore that part if you don’t know what it means), we can solve this with standard methods like LU decomposition with partial pivoting – like the non-pivoted LU we saw in part 2, but now we’re allowed to swap rows to avoid certain problems and increase numerical accuracy (still not gonna go into it here). I hope that after reading part 2, you have at least some idea of how that process goes.

On the next step of the ladder up, we have linear least-squares problems. These naturally occur when we have linear systems with more equations than unknowns (more common), or linear systems with more variable than equations (less common, and I’ll ignore that case in the following), and a few others. These types of problems appear in approximation, or when trying to recover parameters of a linear model from noisy measurements where errors are uncorrelated, have expectation 0 and the same variance (as per the Gauss-Markov theorem). They also tend to get applied to problems where they’re really not well-suited at all, because linear least-squares is by far the easiest and most widely-known mathematical optimization method. But I’m getting ahead of myself.

Under the original conditions above, we could actually achieve $Ax=b$ (in exact arithmetic anyway). If we have more equations than variables, we can’t expect to hit an exact solution always; we now have $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $m \ge n$ (at least as many equations as variables), and the best we can hope for is $Ax \approx b$. We rewrite this into $Ax-b \approx 0$, because in normed vector spaces, we have good machinery to express that some vector is “close” to zero: we want its norm to be small. That is, we get a problem of the type

$\displaystyle \min_{x \in \mathbb{R}^n} \| Ax - b \|$

for some norm $\| \cdot \|$ we get to choose. Not surprisingly given the name “least squares”, we choose the 2-norm. I’m going to go over this fairly quickly because least-squares isn’t the point of this post: for a vector x, we have $\|x\|_2 = \sqrt{x^T x}$ where $x^T x$ is just a dot product of x with itself written as a matrix product, if you haven’t seen that notation before. We can get rid of the square root by just squaring everything (standard trick) and get:

$\displaystyle \min_{x \in \mathbb{R}^n} \| Ax - b \|_2^2 = \min_x (Ax-b)^T (Ax-b) \\ = \min_x x^T A^T Ax - 2 x^T A^T b + b^T b$

That’s a quadratic function. Multi-dimensional, but still quadratic. Basic calculus tells us that we can find the extrema of a quadratic function by finding the spot where its derivative is zero. The derivative of the above expression with respect to x is

$\displaystyle 2 (A^T Ax - A^T b)$

and setting it to zero brings us to the linear system called the “normal equation”

$A^T A x = A^T b$

If you followed along with that, great! If not, don’t worry; the only thing you should remember here is that 2-norm means the function we’re trying to minimize is quadratic, which therefore has a linear derivative, thus we can find the minimum by solving a linear system, which we know how to do. Caveat: the normal equations are not generally the preferred way to solve such problems, because squaring A (the $A^T A$ part) greatly amplifies any numerical problems that the matrix might have. Unless you know in advance that $A^T A$ is well-behaved (which you do in several types of problems), you should generally solve such problems with other non-squaring approaches based on say the QR decomposition or SVD, but again, out of scope for this post.

So far, we’ve covered elimination methods for solving linear systems (known to the Chinese back in the 2nd century CE) and linear least-squares; depending on whether you trust Gauss’s assertion that he came up with it before Legendre published it, the latter is either late 18th or early 19th century. The third of the trifecta of fundamental linear numerical algorithms is Linear Programming; this puts us either in the 19th century (if you’re feeling charitable and consider Fourier-Motzkin elimination a viable way to solve them) on in the mid-20th century for the first practical algorithm, the simplex method.

But what is a linear program? A canonical-form linear program is a mathematical optimization problem of the form

maximize: $c^T x$
subject to: $Ax \le b$ (componentwise), $x \ge 0$ (also componentwise)

But as the “canonical-form” name suggests, LPs come in many forms. In general, a linear program:

• must have a linear objective function. It doesn’t need to be an “interesting” objective function; for example, the sub-class of linear feasibility problems “maximizes” the objective function 0, and it’s really just about finding a point that satisfies the constraints.
• can be either minimization or maximization. They’re trivial to turn into each other: to minimize $c^T x$, just maximize $-c^T x$ and vice versa.
• can have a set of linear equality constraints. These are not strictly necessary and hence absent in the canonical form because you can rewrite a linear equality constraint $Ex = d$ into twice the number of inequality constraints $d \le Ex \le d$, but basically all solvers naturally support them directly and will prefer you to just pass them in unmodified.
• must have some linear inequality constraints. Either direction works: to get say $Ax \ge b$, just multiply everything by -1 to yield $-Ax \le -b$. That’s why the canonical form just has one direction.
• may or may not require x (the variable vector) to be non-negative. The canonical form demands it, but as with the lack of equality constraints, this one’s really not essential. If you’re forced to work with a solver that insists on non-negative x, you can split each variable that can be negative into two variables $x = x^+ - x^-$ with the constraint $x^+ \ge 0, x^- \ge 0$. If you have a solver that defaults to unconstrained x and you want non-negative, you just add a bunch of inequality constraints $-Ix \le 0$ (i.e. you just append a negated identity matrix at the bottom of your constraint matrix A). Either way, still a linear program.

As you can see, while say a linear system is quite easy to recognize (is it a bunch of equations? Are they all linear in the variables of interest? If so, you’ve got a linear system!), LPs can come in quite different-looking forms that are nevertheless equivalent.

At this point, it would be customary to show a made-up linear problem. A factory can produce two kinds of widgets that have different profit margins and consume some number of machines for some amount of time, how many widgets of each kind of type should we be producing to maximize profit, and so forth. I’m not going to do that (check out a textbook on linear programming if you want to get the spiel). Instead, I’m going to show you an example with a very different flavor to hammer the point home that LPs are not at all easy to recognize (or at the very least, that it takes a lot of practice).

Earlier, we went over linear least-squares and saw that the problem reduces to a linear system of equations, because the objective function is quadratic and thus has a linear derivative. Therefore, that trick only works for the 2-norm. What if we’d really rather minimize the residuals (that’s the Ax-b) as measured in another norm? Say we want to minimize the largest deviation from 0 in any of the rows; this corresponds to the infinity-norm (or supremum norm), which is defined as $\|x\|_\infty = \max(|x_1|,\dots,|x_n|)$. So our new problem is to minimize the maximum error in any of the components:

$\displaystyle \min_{x \in \mathbb{R}^n} \| Ax - b \|_\infty$

Turns out that this is a linear program, even if it really doesn’t look like one. Don’t see it? I don’t blame you. Here’s how it goes:

minimize: $t \in \mathbb{R}$
subject to: $-t \le Ax - b \le t$ (componentwise).

To make this clear, this is a linear program with our original vector x as its variables, plus an additional auxiliary variable t that we made up. We require that t bound the size of every component of Ax-b; those inequalities are really just saying that if we named the residuals $r=Ax-b$, we’d have $|r_1| \le t$, $|r_2| \le t$ and so on. And then as our objective function, we just want the smallest t possible (i.e the smallest upper bound on the absolute value of the residuals), which is how we encode that we’re minimizing our error bound. We don’t constrain x at all; the solver gets to pick x however it wants to make that work. And yeah, I know this feels like a really cheap trick. Get used to it. Mathematical optimization is all about the cheap shots. (Also note that if we had other linear inequality constraints, like say wanting x to be non-negative, we could just throw these in as well. Still a linear program.)

And while we’re on the subject of cheap shots: we have the probably two most important norms in applications, the 2-norm and the infinity-norm; can we get the next important one, the 1-norm as well? Sure we can. Buckle up! The 1-norm of a vector is defined as $\|x\|_1 = |x_1| + \cdots + |x_n|$, i.e. the sum of the absolute values of the components. What happens if we try to minimize $\|Ax - b\|_1$? Yup, we again get a linear program, and this time we need a whole bunch of auxiliary variables and it’s even cheesier than the last one:

minimize: $r_1^+ + r_1^- + \cdots + r_m^+ + r_m^-$
subject to: $r^+ - r^- = Ax - b$, $r^+ \ge 0$, $r^- \ge 0$

This combines the auxiliary variables trick from last time with the “split a vector into positive and negative parts” trick we saw when I talked about how to get rid of unwanted $x \ge 0$ constraints. The absolute value of residuals split that way is just the $r^+ + r^-$ we see in the objective function. And to top it all off, after introducing 2m auxiliary variables with the two m-element vectors $r^+$ and $r^-$, we discuss them all away by adding the linear equality constraint $r^+ - r^- = Ax-b$ that forces them to sum to our residuals.

There are two points I’m trying to make here: first, that a surprising number of problems turns out to be LPs in disguise. And second, that they really don’t need to look like it.

That much for LPs. I still haven’t talked about cone programs or the actual paper, though! But now that I’ve shown how these things crop up, I’m going to reduce the amount of detail a lot.

First: cone programs. What we’ve seen so far are LPs. Cone programs are a generalization, and there’s a whole zoo of them. Cone linear programs are equivalent to regular LPs – effectively just a different canonical form. Then there’s Second-Order Cone Programs (SOCPs) which are a superset of LPs and can also express a whole bunch of other optimization problems including (convex) Quadratic Programming (quadratic objective function, linear inequality constraints) and positive semidefinite quadratically constrained quadratic programming (quadratic objective function and quadratic inequality constraints, all quadratic forms involved positive-semidefinite). This is somewhat surprising at first because SOCPs have a linear objective function and quadratic constraints, but the solution to “I want to express a quadratic objective function using a linear objective function and quadratic constraints” turns out to be yet more cheap shots. (I’ll let you figure out how this one goes yourself, you’ve seen enough by now.)

The next step up from SOCPs is semidefinite programs (SDPs), which can express everything I’ve mentioned so far, and more. And then there’s a couple more cone types that I’m not going to cover here.

What all these cone programs have in common is very similar mathematical foundations (though the neat theory for this is, to my knowledge; relatively recent; we’re in the mid-1990s now!) and very closely related solvers, traditionally interior point methods (IPMs).

IPMs work by encoding the constraints using smooth (differentiable) barrier functions. The idea is that such functions increase rapidly near the bounds of the feasible region (i.e. when you’re getting close to constraints being violated), and are relatively small otherwise. Then use Newton’s method to minimize the resulting smooth objective function. And Newton iteration is a “second-order” method, which means that once the iteration gets close to a solution, it will roughly double the number of correct digits in every step. In practice, this means that you perform a bunch of iterations to get close enough to a solution, and once you do, you’ll be converged to within machine precision in another four to six iterations.

And that’s where we finally get to the paper itself: O’Donoghue et al.’s method is not a second-order method; instead, it’s a first-order method that uses a less sophisticated underlying iteration (ADMM) that nevertheless has some practical advantages. It’s not as accurate as second-order methods, but it needs a lot less memory and individual iterations are much faster than in second-order methods, so it scales to large problems (millions of variables and constraints) that are currently infeasible to solve via IPMs.

The way it works uses a lot of other techniques that were refined over the past 25 years or so; for example, the titular homogeneous self-dual embedding goes back to the mid-90s and is a general way to encode cone programs into a form where initialization is easy (always a tricky problem with iterative methods), there’s no need to do a separate iteration to find a feasible point first, and boundary cases such as infeasible or unbounded problems are easy to detect.

The end result is a fundamentally really simple iteration (equation 17 in the paper) that alternates between solving a linear system – always the same matrix, for what it’s worth – and a projection step. For linear programs, all the latter does is clamp negative values in the iteration variables to zero.

If you’ve made it this far, have at least some understanding of convex optimization and the underlying duality theory, and are interested in the details, definitely check out the paper! If you don’t but want to learn, I recommend checking out Boyd and Vandenberghe’s book Convex Optimization; there’s also a bunch of (very good!) videos of the course online on YouTube. Another good source if you’re at least somewhat comfortable with numerical linear algebra but don’t want to spend the time on the Convex Optimization course is Paul Khuong’s small tutorial “So You Want to Write an LP Solver”.

And this concludes the parts of this series where I info-dump you to death about numerical math. I won’t cover non-linear optimization or anything related here; maybe some other time. :)

### 17. Braun et al. – Simple and Efficient Construction of Static Single Assignment Form (2013; compilers)

Complete change of pace here; no more math for now, I promise. There will be greek symbols though, because that’s kind of required here. Sorry.

All right, Static Single Assignment form. It’s used in all kinds of compilers these days, but why? Let’s use a small, nonsensical fragment of code in some arbitrary imperative language as demonstration:

  x = z + 5;
y = 2*x - 1;
x = 3;
y = y - x;


We have 3 integer variables x, y, and z, and are doing some random computation. The important thing to note here is that we’re assigning to x and y twice, and we can’t just move statements around without changing the meaning of the program. For example, the second assignment to x has to go after the first assignment to y.

But computationally, that’s just silly. The second assignment to x doesn’t depend on anything; the only problem is that the names x and y refer to different values depending on where we are in the problem. For a compiler, it turns out to be much more useful to separate the (somewhat arbitrary, programmer-assigned) names from the values they correspond to. Any dependencies between the actual operations are an intrinsic property of the computation the program is trying to specify. The names, not so much. So instead, we make a pass over the program where we give a variable a new name whenever we’re assigning to it:

  x1 = z + 5;
y1 = 2*x1 - 1;
x2 = 3;
y2 = y1 - x2;


And now we can reorder the code if we want to; the only requirement is that a variable must be assigned to before it’s used. So for example, this ordering

  x2 = 3;
x1 = z + 5;
y1 = 2*x1 - 1;
y2 = y1 - x2;


has the same meaning – even though we’re now “computing” the second value of x before we compute the first.

This process can be done systematically and straightforwardly for any straight-line block of code without control flow. It’s called “Local Value Numbering” (because we’re numbering values of a variable) and is literally older than digital computers themselves – for example, the idea appears in Ada Lovelace’s writings, penned 1842:

Whenever a Variable has only zeros upon it, it is called 0V; the moment a value appears on it (whether that value be placed there arbitrarily, or appears in the natural course of a calculation), it becomes 1V. If this value gives place to another value, the Variable becomes 2V, and so forth. [..] There are several advantages in having a set of indices of this nature; but these advantages are perhaps hardly of a kind to be immediately perceived, unless by a mind somewhat accustomed to trace the successive steps by means of which the [analytical] engine accomplishes its purposes. We have only space to mention in a general way, that the whole notation of the tables is made more consistent by these indices, for they are able to mark a difference in certain cases, where there would otherwise be an apparent identity confusing in its tendency. In such a case as Vn=Vp+Vn there is more clearness and more consistency with the usual laws of algebraical notation, in being able to write m+1Vn=qVp+mVn.

However, we don’t have SSA yet. That took a bit longer. The problem is what to do with control flow, such as branches and loops. How do we apply value numbering to a program like this?

  x = z;
y = 0;
while (x > 0) {
y = y + x;
x = x - 1;
}
x = 2*y;


We can certainly apply local value numbering to each basic block of straight-line code, but that doesn’t get us very far in this case. So what do we do? Well, before I do anything else, let me first rewrite that loop slightly to a somewhat less idiomatic but equivalent form that will avoid some trouble with notation in a moment:

  x = z;
y = 0;
loop {
if (x <= 0)
break;
y = y + x;
x = x - 1;
}
x = 2*y;


The problem we have here is that operations such as y = y + x reference different versions of y depending on where we are in the control flow! In the first iteration of the loop, y refers to the initial value set by y = 0; but subsequent iterations reference the y computed by the previous iteration of the loop. We can’t just decide on a single version of any particular variable up-front; it depends on how we got into the current block. However, that’s really all it depends on.

So here’s the key trick that gives us SSA: we introduce a construct called φ functions (phi functions) that returns one of its several arguments, depending on where we entered the current block from. Which of these values correspond to which way to enter the current block needs to be kept track of; since I’m going with pseudo-code here, I’ll just write it in comments. With φ functions, we can transform the whole example program into SSA form:

  x1 = z;
y1 = 0;
loop {
// when entering from outside loop, return first arg
// otherwise, return second arg.
x2 = φ(x1, x3);
y2 = φ(y1, y3);
if (x2 <= 0)
break;
y3 = y2 + x2;
x3 = x2 - 1;
}
x4 = 2*y2;


Note that things get a bit tricky now: the φ functions for x2 and y2 have to reference the values x3 and y3 that are defined later in program order, and the calculation for x4 needs to pick up y2 (and not say y3), because that’s always the most recent definition of y when control flow reaches the computation of x4.

Why do we care about forming SSA? It turns out that this representation is very convenient for all kinds of program analysis, and well-suited to various transforms compilers wish to perform.

In this simple example, it’s not hard to construct the SSA form by hand. But for it to be useful in compilers, we need an algorithm, and it better be efficient – both in the sense that it doesn’t run too long, and in the sense that it shouldn’t increase the size of the program too much. In particular, we’d like to only insert phi functions that are truly necessary, instead of say inserting phis for all visible variables in every single block.

And that’s where this paper comes in. The “standard” SSA construction algorithm is due to Cytron et al., from a 1991 paper; it’s efficient and works fine, and is used in many production compilers, including LLVM. But it needs some fairly complicated machinery to do its job, and is not really suited to incremental construction.

That’s why I like this paper. It uses only elementary techniques, is simple to describe and understand, gives good results, and is suitable for quickly patching up an existing SSA-form program are modifications that would otherwise violate SSA form.

### 18. Lamport, Palais – “On the Glitch Phenomenon” (1976; hardware/physics)

This link goes to Lamport’s publication page, where he writes a few notes on the paper (and his difficulties in getting it published).

Both the notes and the paper are relatively short and unlike several of the other papers I’ve covered, I don’t have any background information to add; the problem statement here is relatively elementary, it’s just the answer that is surprising.

Lamport later wrote another paper, Buridan’s principle, about other instances of the same problem outside of CS. Like the Glitch paper, he ran into problems getting it published, so again I’m linking to the publications page, which has his notes.

I will quote this part of the notes on “Buridan’s principle”, if you’re not already curious:

My problems in trying to publish this paper and [“On The Glitch Phenomenon”] are part of a long tradition. According to one story I’ve heard (but haven’t verified), someone at G. E. discovered the phenomenon in computer circuits in the early 60s, but was unable to convince his managers that there was a problem. He published a short note about it, for which he was fired. Charles Molnar, one of the pioneers in the study of the problem, reported the following in a lecture given on February 11, 1992, at HP Corporate Engineering in Palo Alto, California:

One reviewer made a marvelous comment in rejecting one of the early papers, saying that if this problem really existed it would be so important that everybody knowledgeable in the field would have to know about it, and “I’m an expert and I don’t know about it, so therefore it must not exist.”

Continued from part 2.

### 12. Chazelle-“Filtering search: a new approach to query-answering” (1986; computational geometry/information retrieval)

The part I specifically recommend is sections 1 through 3. For most of these problems, there are simpler solutions using different data structures by now (especially via the persistent search trees we saw earlier in part 1), so don’t hurt yourself with the parts dealing with the hive graph; but it’s definitely worth your time to grok these first few sections.

Filtering search deals with “retrieval”-type problems: you have a set of objects, and you want to run a query that finds the subset of those objects that matches some given criterion. In the interval overlap problem the paper uses as its first example, the set of objects is a bunch of 1-dimensional intervals:

I’m drawing each interval in its own line so overlaps don’t get messy, but that’s just a visualization aid; this is a 1D problem. Our queries are going to be “report all intervals in the database that intersect a given query interval”.

How would you set up a data structure that answers these queries efficiently?

A natural enough approach is to chop our 1D x-axis into slices, creating a new “window” (that’s the terminology the paper uses) whenever an interval begins or ends:

Because the begin and end points of intervals are the only locations when the answer to “which intervals overlap a given x-coordinate” change, the answer is the same within each window. Therefore, if we compute this partition in advance and store it in say a search tree (or an external search tree like a database for bigger sets), and each window stores a list of which intervals overlap it, we can answer one of our original questions directly: given a 1D point, we can find all the intervals overlapping it by finding out which of the windows it overlaps by looking it up our search tree, and then reporting the stored list.

Furthermore, given a solution for point queries, we can take a first stab at solving it for interval queries: find all the windows overlapping our query interval, and then return the union of all the lists for the individual windows. Computing unions of those sets can be done in linear time if we store the data appropriately: for example, if we give each interval stored in our database some unique integer ID and make sure all our lists of “intervals overlapping this window” are sorted by the interval IDs, then the union between a m-element list and a n-element list can be computed in $O(n+m)$ time (using a variant of the standard list merging algorithm used in mergesort).

It’s fairly easy to convince yourself that this works, in the sense that it gives the correct answer; but is it efficient? How big can a database for n intervals get? This matters not just for storage efficiency, but also for queries: if the user queries a gigantic interval (say encompassing all intervals in the image above), we end up computing the union of all interval lists stored in the entire database. We know the final answer can only contain at most n intervals (because that’s how many are in the database), but do we have any nice bounds on how long it will take to determine that fact?

Alas, it turns out this approach doesn’t really work for our problem. Without further ado, here’s an example set that kills this initial approach for good:

Here, we have 5 intervals with the same center: the first one has a width of 1 unit, the second is 3 units wide, the third 5 units wide, and so on. The middle window overlaps all 5 intervals; the two windows to either side overlap 4 intervals each, all but the first; the next windows on either side overlap 3 intervals each, and so forth.

Therefore, the bottom-most interval gets stored in 1 list; the one up from it gets stored in 3 lists; the next one up gets stored in 5 lists, and so on. For our 5 input intervals, the total size of the lists sums to $1+3+5+7+9 = 25$, and in general, $1+3+5+\hdots+(2n-3)+(2n-1) = n^2$. So our database will end up with $O(n^2)$ size just to store the sorted lists, and if someone executes a query overlapping all n intervals, we will iterate over that entire database and try to merge n2 list elements to produce a final result set with n intervals in it. No good. And note that even if the query part was nice and fast, having to build a $O(n^2)$-sized search structure for n elements would make this completely impractical for large data sets.

Now, on to the actual paper: Chazelle points out that in the entire class of retrieval problems, if the input size is n, and the number of elements reported for a particular query is k, the worst-case time for that query will in general be of the form $O(k + f(n))$ where f is some (hopefully slow-growing, say logarithmic) function of n. This is because reporting a given result is not free; it’s generally an $O(1)$ operation.

Consider extreme cases like our “return me the entire universe” range query: in that case, we have k=n, and if we have say $f=\log$, the resulting time complexity for that query is going to be $O(n + \log(n)) = O(n)$; if we ask for the entire database, it really doesn’t make a big difference how smart our indexing structure is (or isn’t). The total operation cost is going to be dominated by the “reporting” portion.

This revised cost estimate tells us where we were going wrong with the structure we were building before. It’s important to be able to locate regions of the database overlapped by only a few intervals quickly. But when we have many “active” intervals, the cost of reporting them exceeds the cost of finding them anyway, and caching all those pre-made lists does more harm than good!

Instead, what we want to do is adapt the data structure to the size of the answer (the number of results given). In regions where there are only a few active intervals, we want to locate them quickly; where there’s many active intervals, we don’t want to make a new list of intervals every time a single one of them begins or ends, because when we do, we’ll waste most of our query time merging almost-identical lists.

And that, finally, brings us to filtering search: instead of storing a new list every time the answer changes, we adapt the frequency with which we store lists to the number of active intervals. Once we do this, we need to do a bit more processing per list entry: we need to check whether it actually overlaps our query point (or interval) first – but that’s an $O(1)$ check per list entry (this is the “filtering” part of “filtering search” – rejecting the false positives).

In return, we get a significantly smaller database. Using the scheme described in section 3 of the paper (briefly: make the number of intervals per window proportional to the lowest number of active intervals at any point in the window), our data structure for n intervals takes $O(n)$ space and gives $O(k + \log n)$ query time (and with simple construction and search algorithms too, no hidden huge constant factors). And in particular, if we know that the entire database has size proportional to n, not only do we know that this will scale just fine to large data sets, it also means we won’t hit the “lots of time spent pointlessly merging lists” problem in our original approach.

That’s why I recommend this paper: the underlying insight is easy to understand – we can afford to be “sloppy” in areas where we’ll spend a lot of time reporting the results anyway – and the resulting algorithm is still quite simple, but the gains are significant. And as Chazelle notes, even if you don’t care about the particular problems he talks about, the overall approach is applicable to pretty much all retrieval problems.

### 13. McIllroy – “A Killer Adversary for Quicksort” (1999; algorithms/testing)

This paper is short (4 pages only!) and sweet. It’s not news that Quicksort will go quadratic for some inputs. Nor is it news how to write a mostly-Quicksort that avoids the quadratic case altogether; that was solved by Musser’s Introsort in 1997 (if you don’t know the trick: limit the recursion depth in Quicksort to some constant factor of a logarithm of the input size, and when you get too deep, switch to Heapsort).

The cool thing about this paper is that, given any Quicksort implementation that conforms to the C standard library qsort interface (the relevant part being that the sorting function doesn’t compare the data directly and instead asks a user-provided comparison function), it will produce a sequence that makes it go quadratic. Guaranteed, on the first try, under fairly mild assumptions, as long as the function in question actually implements some variant of Quicksort.

As-is, this paper is obviously useful to implementers of Quicksort. If you write one, you generally know that there are inputs that result in quadratic run time, but you might not know which.

But more generally, I think this paper is worth reading and understanding because it shows the value of adversarial testing: the solution isn’t hard; it relies on little more than knowing that a Quicksort proceeds as a sequence of partitioning operations, and an individual partition will compare all remaining elements in the current list against the chosen pivot. Treating the rest of the code as a black box, it turns out that knowing this much is basically enough to defeat it.

The details are different every time, but I found the paper quite inspiring when I first read it, and I’ve gotten a lot of mileage since out of building test cases by trying to make an algorithm defeat itself. Even (particularly?) when you don’t succeed, it tends to deepen your understanding.

### 14. Knuth – “Structured Programming with go to Statements” (1974; programming languages)

I’ll write “go to statements” as gotos in the following, since that’s the spelling we mostly have these days.

This paper is the primary source of the (in)famous “premature optimization is the root of all evil” quote. So to get that out of the way first, it’s on page 8 of the PDF, printed page number 268, right column, second paragraph. I recommend reading at least the preceding 2 paragraphs as well, as well as the paragraph that follows, before quoting it at others. It’s definitely different in context.

Zooming further out, this is a long paper covering a wide range of topics, a lot of which might not interest you, so I’ll go over the structure. It starts by recounting a lot of 1970s arguing back and forth about structured programming that’s mainly of historical interest these days; structured control flow constructs are a done deal these days.

After that, he goes over several algorithms and discusses their use of goto statements, which might or might not interest you.

One section that is definitely worthwhile if you haven’t seen it before is the section on “Systematic Elimination” starting on page 13 in the PDF (printed page number 273) which explains ways of converting any goto-ridden program into one that uses purely structured control flow and gives pointers to the underlying theory – note that some of these ways are quite anticlimactic!

The sections after are, again, interesting archaeology from a time before break and continue statements existed (the “structured” gotos we now generally consider unproblematic).

Then we get into “2. Introduction of goto statements”, which again is nowadays mostly standard compiler fare (covering certain control flow transformations), but worth checking out if you haven’t seen it before. Note that most modern compilers eventually transform a program into a set of basic blocks with an associated control-flow graph, which is essentially lists of statements connected by gotos; it turns out that such a (more expressive) notation makes control flow transformations simpler, and passes that need more structure generally recover it from the control flow graph. Thus, contrary to popular belief, such compilers don’t care all that much whether you write say loops using built-in loop constructs or via ifs and gotos. It does, however, make a difference semantically for objects with block-scoped lifetimes in languages that have them.

The final section I’ll point out is “Program Manipulation Systems”, starting on page 22 in the PDF (printed page number 282). Modern optimizers implement many more transforms than 1970s-era compilers do, yet still I feel like the problem Knuth observed in the 1970s still exists today (admittedly, speaking as a fairly low-level programmer): many of the transforms I wish to apply are quite mechanical, yet inexpressible in the source language. Various LISP dialects probably get closest; the metaprogramming facilities of more mainstream languages generally leave a lot to be desired. I still feel like there’s a lot of potential in that direction.

### 15. Bryant-“Graph-Based Algorithms for Boolean Function Manipulation” (1986; data structures)

For a long time, this used to be one of the widest-cited papers in all of CS. Not sure if that’s still the case.

It introduces a data structure properly called a Reduced Ordered Binary Decision Diagram (ROBDD), often just called BDDs, although there exist other variants that are not ordered or reduced, so there’s some potential for confusion. Watch out! That said, I’ll just be writing BDD for the remainder; note that I’m always talking about the ROBDD variant here.

So what is a BDD? It’s a data structure that encodes a Boolean truth table as a directed graph. It’s “reduced” because identical subgraphs are eliminated during construction, and “ordered” because the resulting BDD is specific to a particular ordering of input variables. The paper gives some examples.

Why does anyone care? Because truth tables have exponential size, whereas BDDs for many functions of interest have very tractable size. For example, an important result is that BDDs for the outputs of binary adders of arbitrary widths (with the right variable ordering) have a linear number of nodes in the width of the adder. And a BDD shares many of the properties of a truth table: it’s easy to evaluate a function given as a BDD, it’s easy to build them incrementally out of a description of some logic function, and they’re canonical – that is, with a given variable ordering, there is exactly one ROBDD that represents a given function. Which in turn means that we can check whether two binary functions are identical by checking whether their BDDs agree.

This last property is why BDDs were such a hot topic when they were introduced (leading to the aforementioned high citation count): they are very handy for formal verification of combinational logic and other problems arising in hardware design.

For example, say you have a clever 64-bit adder design you wish to implement, but you need to prove that it gives the correct result for all possible pairs of 64-bit input numbers. Checking the proposed design by exhaustively testing all 2128 possible inputs is out of the question; even if had a machine that managed to verify over a trillion combination per second, say 240 of them, and we had over a million such machines, say 220 of them, we’d still have to wait 268 seconds to get the result – that’s about 9.36 trillion years, 680 times the age of the universe.

This is not going to work. But luckily, it doesn’t need to: we can build a BDD for the new adder design, another BDD for a simple known-good adder design (say a ripple-carry adder), and check whether they agree. That validation, written in interpreted Python and run on a single workstation, takes just a moment.

BDDs aren’t a panacea; for example, BDDs for multiplier circuits have exponential size, so formally verifying those isn’t as easy. As far as I know, modern tools combine different (non-canonical) representations such as and-inverter graphs with SMT solvers to tackle more difficult Boolean equivalence problems.

Nonetheless, BDDs are a simple, elegant data structure that immediately solved a serious practical problem, and they’re worth knowing about.

Continued from part 1.

Once I was about a thousand words into describing background for GEMM, it became pretty clear that it made more sense to group the numerical math papers into one post, so here goes the (out-of-order) numerical linear algebra special issue.

### 11. Goto, van de Geijn-“Anatomy of high-performance matrix multiplication” (2008; numerical Linear Algebra / HPC)

You might wonder: why do we care about matrix multiplication in particular so much? Who is it who’s doing these giant matrix multiplies? If you’re a customer of a linear algebra library, it’s not unlikely that you’re never calling GEMM (GEneral Matrix Multiply, the customary name for matrix multiply kernels, thanks to FORTRAN 77’s 6-character limit for function names) at all. So what gives?

Well, if you’re calling into a linear algebra library, odds are you want to solve a linear system of equations (which usually turns into a pivoted LU decomposition plus a solve step), a linear least-squares problem (depending on the problem and your accuracy requirements, this might turn either into a Cholesky decomposition or a QR decomposition, again followed by a solve step), or you want something fancier like the SVD (yet another matrix decomposition, and you probably still eventually want to solve a linear system – but you’re using the SVD because it’s noisy or ill-conditioned and you want to munge around with the singular values a bit).

What’s with all the focus on “decompositions”? Are numerical analysts secretly the world’s most pocket-protected goth band? No: a matrix decomposition (or factorization) expresses a more general matrix $A$ as the product of several special matrices that have some desirable structure. For example, the LU decomposition turns our general matrix into a product $A=LU$ where $L$ is a unit lower triangular matrix and $U$ is upper triangular (note: I’ll be ignoring pivoting in this post for simplicity). The LU decomposition is the industrial-strength counterpart of the Gaussian Elimination process you might have learned in school, but with some side benefits: you can decompose the matrix once and then reuse the factorization multiple times if you’re solving the same system many times with different right-hand sides (this is common in applications), and it also happens to be really nice to have the whole process in a form that can be manipulated algebraically, which we’ll be using in a moment.

But why does this decomposition help? Well, suppose we have a toy system with 3 equations and 3 unknowns, which can be written as a matrix equation $Ax = b$ where $A$ is a 3×3 matrix of coefficients and $x$ and $b$ are 3-element column vectors. If we have a LU decomposition for A, this turns into

$Ax = LUx = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$

How does this help? Well, we can’t solve the full thing yet, but we now have two fairly simply matrices. For now, let’s focus on the left matrix and treat $y = Ux$ as an unknown:

$LUx = Ly = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$

Well, this one’s easy: the first row just states that $y_1 = b_1$. The second row states that $l_{21} y_1 + y_2 = b_2$, and we know everything but $y_2$, so we can rearrange this to $y_2 = b_2 - l_{21} y_1$. With this, the final row $l_{31} y_1 + l_{32} y_2 + y_3 = b_3$ poses no problems either, yielding $y_3 = b_3 - l_{31} y_1 - l_{32} y_2$. So $y_1$ just falls out, given $y_1$ we can compute $y_2$, and given both we can compute $y_3$. This is called “forward substitution”. Note that we’re just computing $y = L^{-1} b$ here. However, we’re never forming the inverse of L explicitly! This is important. In numerical LA, when you see an inverse, that means you’re supposed to use the corresponding “solve” routine. Actually computing the inverse matrix is generally both inefficient and inaccurate and to be avoided whenever possible.

Anyway, now that we have $y$, we can write out the $y = Ux$ we defined it as, and use that to solve for the $x$ we actually wanted:

$Ux = \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}$

This time, we’re going backwards: $u_{33} x_3 = y_3 \Leftrightarrow x_3 = y_3 / u_{33}$, $u_{22} x_2 + u_{23} x_3 = y_2 \Leftrightarrow x_2 = (y_2 - u_{23} x_3) / u_{22}$, and $u_{11} x_1 + u_{12} x_2 + u_{13} x_3 = y_1 \Leftrightarrow x_1 = (y_1 - u_{12} x_2 - u_{13} x_3) / u_{11}$. You will not be surprised to learn that this is called “backwards substitution”. Again, we’re just calculating $x = U^{-1} y$, which does not actually use a matrix inversion when U is triangular.

And that’s how you solve a linear system given a LU decomposition. In BLAS-ese, solving a triangular system using forwards or backwards substitution for one right-hand side is called a TRSV (TRiangular Solve for a single Vector) – that single routine handles both types. It’s what’s called a level-2 BLAS operation. Level-1 operations are between two vectors, level-2 operations work on a matrix and a vector, and level-3 operations work on two matrices. More about “levels” in a bit.

That’s all dandy, but what does any of this have to do with GEMM? Hang on, we’re getting close. Let’s first generalize slightly: what if we want to solve multiple systems with the same A all at once? Say we want to solve two systems

$A x^1 = b^1 \\ A x^2 = b^2$

at once (using superscripts to denote the separate vectors, since I’m already using subscripts to denote components of a vector or matrix). It turns out that you can just write this as a single matrix equation

$AX = A \left(\begin{array}{c|c} x^1 & x^2 \end{array}\right) = \left(\begin{array}{c|c} b^1 & b^2 \end{array}\right) = B$

where we just group the column vectors for x into one matrix X, and the column vectors for b into another matrix B. Again we can solve this for a LU-decomposed A by forward and back substitution (remember, still not actually forming inverses!)

$Y = L^{-1} B \\ X = U^{-1} Y$

note that we already know one direct way to do this type of equation: loop over the columns of X (and B) and solve them one by one, as above. This kind of operation is called a TRSM: TRianguler Solve for Multiple right-hand sides, or TRiangular Solve for Matrix, our first level-3 BLAS operation.

Just to get used to the idea of dealing with multiple right-hand sides at once, let’s write down the full matrix equation form for a 6 equations, 6 unknowns unit lower triangular system with two separate right-hand sides explicitly:

$\begin{pmatrix} 1 & & & & & \\ l_{21} & 1 & & & & \\ l_{31} & l_{32} & 1 & & & \\ l_{41} & l_{42} & l_{43} & 1 & & \\ l_{51} & l_{52} & l_{53} & l_{54} & 1 & \\ l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1 \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \\ x_{61} & x_{62} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \\ b_{41} & b_{42} \\ b_{51} & b_{52} \\ b_{61} & b_{62} \end{pmatrix}$

As before, the first row tells us that $\begin{pmatrix} x_{11}& x_{12} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \end{pmatrix}$; the second row mutliplied out gives $l_{21} \begin{pmatrix} x_{11} & x_{12} \end{pmatrix} + \begin{pmatrix} x_{21} & x_{22} \end{pmatrix} = \begin{pmatrix} b_{21} & b_{22} \end{pmatrix}$, and so forth, which we solve the exact same way as before, only now we’re always multiplying (and summing) short row vectors instead of single scalars.

But 6×6 is still really small as far as real-world systems of equations go and this is already getting really unwieldy. It’s time to chop the matrix into pieces! (You can always do that, and then work on blocks instead of scalars. This is really important to know.) Let’s just draw some lines and then cut up the matrices involved into parts:

$\left(\begin{array}{cc|cccc} 1 & & & & & \\ l_{21} & 1 & & & & \\ \hline l_{31} & l_{32} & 1 & & & \\ l_{41} & l_{42} & l_{43} & 1 & & \\ l_{51} & l_{52} & l_{53} & l_{54} & 1 & \\ l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1 \end{array}\right) \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \hline x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \\ x_{61} & x_{62} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ \hline b_{31} & b_{32} \\ b_{41} & b_{42} \\ b_{51} & b_{52} \\ b_{61} & b_{62} \end{pmatrix}$

turns into the matrix equation

$\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$

where $L_{11}$ and $L_{22}$ are unit lower triangular, and $L_{21}$ is just a general matrix. If we just multiply the matrix product out by blocks (again, the blocks behave like they’re scalars in a larger matrix, but you need to make sure the matrix product sizes match and be careful about order of multiplication because matrix multiplication doesn’t commute) we get two matrix equations:

$L_{11} X_1 = B_1 \\ L_{21} X_1 + L_{22} X_2 = B_2$

The first of these is just a smaller TRSM with a 2×2 system, and in the second we can bring the $X_1$ term to the right-hand side, yielding

$L_{22} X_2 = B_2 - L_{21} X_1$

On the right-hand side, we have a matrix multiply of values we already know (we computed $X_1$ with the smaller TRSM, and everything else is given). Compute the result of that, and we have another TRSM, this time with a 4×4 system.

The matrix multiply here is one instance of a GEneral Matrix Multiply (GEMM). The corresponding BLAS function computes $C \leftarrow \alpha AB + \beta C$, where the left arrow denotes assignment, A, B, and C are matrices, and α and β are scalar values. In this particular case, we would have $A=L_{21}$, $B=X_1$, $C=B_2$, $\alpha=-1$ and $\beta=1$.

So we cut the matrix into two parts, did a bit of algebra, and saw that our TRSM with a 6×6 L can be turned into a 2×2 TRSM, a GEMM of a 4×2 by a 2×2 matrix, and finally a 4×4 TRSM. Note the function of the matrix multiply: once we’ve computed two unknowns, we need to subtract out their contributions from every single equation that follows. That’s what the GEMM does. It’s the first matrix multiply we’ve seen, but does it matter?

Well, the next thing to realize is that we can do the splitting trick again for the 4×4 TRSM, turning it into 2 even smaller TRSM, plus another GEMM. But now that we’ve establishes the idea of using blocks, let’s skip to a somewhat more serious problem size, so it becomes clear why this is interesting.

Let’s say our A is 1000×1000 (so 1000 equations in 1000 unknowns); its LU factors are the same size. This time, let’s say we’re doing 20 right-hand sides at once, and working in blocks of 30×30. We have the same equation as before:

$\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$

but this time $L_{11}$ is 30×30 unit lower triangular, $L_{21}$ is 970×30, $L_{22}$ is 970×970 unit lower triangular, $X_1$ and $B_1$ are 30×20, and $B_1$ and $B_2$ are 970×20. Again we do the same 3 steps:

1. $X_1 \leftarrow L_{11}^{-1} B_1$ (TRSM, 30×30 matrix times 30×20 RHS)
2. $B_2 \leftarrow B_2 - L_{21} X_1$ (GEMM, 970×30 times 30×30)
3. $X_2 \leftarrow L_{22}^{-1} B_2$ (TRSM, 970×970 times 970×30)

Now the computational cost of both a m×n-by-n×p TRSM and a m×n-by-n×p GEMM (the middle dimensions always have to match) are both roughly 2×m×n×p floating-point operations (flops, not to be confused with all-uppercase FLOPS, which conventionally denote flops/s, because nomenclature is stupid sometimes). Which means the first step above (the medium TRSM) has on the order of 1800 flops, while the second step (the GEMM) takes 873000 flops. In short, of these two steps, step 1 is basically a rounding error in terms of execution time.

And note that we’re splitting a large × TRSM into a medium × medium TRSM, a large × small GEMM, and a final large × large (but smaller than the original) TRSM. And we can keep doing the same splitting process to that remaining large TRSM, until it becomes small as well. In short, this process allows us to turn a large TRSM into a sequence of medium-size TRSM (always the same size), alternating with large-size GEMMs (which keep getting smaller as we proceed down). And what happens if you look at the matrix as a whole is that we end up doing a small amount of actual TRSM work near the diagonal, while the rest of the matrix gets carpet-bombed with GEMMs.

In short, even though what we wanted to do was solve a pre-factored linear systems for a bunch of different right-hand sides, what the computer actually ended up spending its time computing was mostly matrix multiplies. The GEMM calls are coming from inside the solver! (Cue scary music.)

Alright. At this point you might say, “fair enough, that may indeed be what happens if you use this TRSM thing that for all we know you just made up, but I for one am not ever asking the computer to solve the same equation with 50 different right-hand sides in a batch, so how does this apply to me?” Okay then, let’s have a look at how LU factorizations (which so far I’ve assumed we just have lying around) are actually computed, shall we? (And again, note I’m ignoring pivoting here, for simplicity.)

What we want to do is factor our matrix A into a unit lower triangular and an upper triangular factor:

$A = LU$

So, how do we do that? Just keep staring at that equation for a minute longer, see if it flinches first! It doesn’t? Bugger. Okay, plan B: apply our new favorite trick, splitting a matrix into blocks, to play for time until we get a brilliant idea:

$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}$

Our top-left block $A_{11}$ needs to be square (same number of rows as columns), else this isn’t right, but it can be any size. This makes $A_{22}$ square as well, and the other blocks are rectangular. The zeros are there because we want L and U to be lower and upper triangular respectively, and their entire top-right respectively bottom-left blocks better be all zero. Furthermore, $L_{11}$ and $L_{22}$ are also unit lower triangular (like the bigger L we carved them out of), and likewise $U_{11}$ and $U_{22}$ are upper triangular. About the remaining $L_{21}$ and $U_{12}$, we can’t say much.

Still drawing blanks on the ideas front. But let’s just keep going for now: if we multiply out that matrix equation, we get

$A_{11} = L_{11} U_{11} \\ A_{12} = L_{11} U_{12} \\ A_{21} = L_{21} U_{11} \\ A_{22} = L_{21} U_{12} + L_{22} U_{22}$

Wait a second. That first line is a smaller LU decomposition, which we’re trying to figure out how to compute. But suppose we knew for now, and we had something that gave us $L_{11}$ and $U_{11}$. Then that second line is really just $U_{12} = L_{11}^{-1} A_{12}$. That’s a TRSM, we just went over that. And the third line is $L_{21} = A_{21} U_{11}^{-1}$, which is also a TRSM (of a shape we haven’t seen before, but it works the same way). Once we have $L_{11}$ and $U_{11}$, our hand is forced with regards to these two matrices; for the factorization to multiply to the correct result A, we need them to be the things I just wrote down. And if we know these two matrices, we can attack that last equation by moving their product to the left-hand side:

$A_{22} - L_{21} U_{12} = L_{22} U_{22}$

Hey look, we do a big GEMM and then resume with computing a LU decomposition of the remainder – we’ve seen that kind of structure before! Great. This is how to do a block LU decomposition: compute a LU decomposition of the top-left block, two TRSMs, one GEMM to update the bottom-right part, then keep decomposing that. And this time the TRSMs are on medium × medium × large problems, while the GEMM is on large × medium × large, so again the bulk of the computation is going to be spent in the GEMM part.

But we still don’t know how to compute the LU decomposition of that top-left block. No worries: if in doubt, go for a cheap shot. We don’t know how to do this for an arbitrary block. But what if we make our partition really silly? For $A_{11}$ is a 1×1 element “matrix” levels of silly? (That is, we’re just splitting off one row and one column at the top left.)

Then $A_{11} = L_{11} U_{11}$ is a no-brainer; all three of these matrices are 1×1, and we require $L_{11}$ to be “unit” (ones on the diagonal), which for a 1×1 matrix just means that $L_{11}=1$. Therefore $A_{11} = U_{11}$. Ta-daa! We “solved” a 1×1 LU decomposition. But that’s all we really need. Because once we have that one value $U_{11}$ determined, we can crank through our other 3 formulas, which give us $U_{12}$ (the rest of the top row of U), $L_{21}$ (the rest of the left column of L), and updates the rest of the matrix to eliminate the one variable we just computed. To compute a LU decomposition of a block, we simply keep peeling off 1×1 sub-blocks until we run out of matrix to decompose.

This description covers both “regular” and block LU decompositions (in fact we just do blocks and then get the regular decomposition as a special case when we do 1×1 blocks, at which point the problem becomes trivial), and not a single index or elementary row operation was harmed in the preceding text.

Note that this time, we turned LU decomposition (i.e. Gaussian elimination) into mostly-GEMMs-and-some-block-TRSMs, and we already saw earlier that block TRSMs turn into mostly-GEMMs-and-some-small-TRSMs. Therefore, the entire process of factoring a linear system and then solving it turns into… mostly GEMMs.

And that’s why everyone cares about GEMMs so much. (And also, you may now see why even if you don’t use TRSMs, math libraries still include them, because the solvers your app code calls want to call them internally!)

This pattern is not just specific to Gaussian Elimination-type algorithms, either. Block Householder for QR decompositions? Heavy on GEMMs. Hessenberg reduction for Eigenvalue problems? Basically Householder, which is mostly GEMMs. Computation of the Singular Value Decomposition (either for solvers or to get PCAs)? Generally starts with Golub-Kahan Bidiagonalization or one of its relatives, which is a somewhat fancier version of the QR decomposition algorithm, and yup, lots of GEMMs again. Then the actual singular value computation is iterative on that bidiagonal matrix, but that part tends to take less time than the non-iterative parts surrounding it, because the iteration is only iterating on a matrix reduced to 2 diagonals, whereas everything else works with the whole matrix.

In fact, what we’ve seen so far is a pattern of various matrix operations turning into smaller versions of themselves, plus maybe some other matrix operations, plus a GEMM. Guess what happens with a GEMM itself? If your guess was “GEMMs all the way down”, you’re right. It’s like a weed. (And turning GEMMs into smaller GEMMs is, in fact, quite important – but that’s in the paper, so I won’t talk about it here.)

This concludes our brief foray into dense numerical LA and why HPC people are so obsessed about GEMM performance. Note that dense problems are basically the easy case, at least from a high-level point of view; many of the things that are really interesting are huge (millions of equations and variables) but sparse and with exploitable structure, and these take a lot more care from the user, as well as more sophisticated algorithms. (That will nevertheless usually end up calling into a dense LA library for their bulk computations.)

Now that I’ve hopefully satisfyingly answered why GEMM, let’s talk a bit about the actual paper. The presentation I gave you of splitting up a matrix into blocks wasn’t just for notational convenience; that’s how these algorithms tend to work internally. The reason is that large matrices are, well, large. There’s an inherent 2D structure to these algorithms and completely looping over one axis of a giant matrix tends to thrash the cache, which in turn means there are suddenly lots of main memory accesses, and at that point you lose, because current computers can do way more computation per unit of time than they can do memory-to-cache transfers. If you truly want to do high-performance computation, then you have to worry about memory access patterns. (In fact, that’s most of what you do.)

That is something I pushed on the stack earlier in this post: different BLAS levels. This is an old chestnut, but it’s worth repeating: level-1 BLAS operations are vector-vector; something like say a big dot product (DOT in BLAS). Doing a dot product between two N-element vectors is on the order of 2N flops, and 2N memory operations (memops) to load the elements of the two vectors. 1:1 flops to memops – no good! Level-2 BLAS operations are matrix-vector; take say M×N matrix by M-element vector multiplication (GEMV, GEneral Matrix times Vector). Does 2MN flops, M×(N+2) memops (load all matrix elements once, load each vector element once, store each vector element once); closer to 2:1 flops to memops, which is an improvement, but still bad if you have the SIMD units to compute 32 single-precision flops per cycle and core and the main memory bandwidth to load half a float per cycle and core (the size of this gap is similar for CPUs and GPUs, for what it’s worth). It just means that your performance drops off a cliff once you’re working on data larger than your cache size.

Level-3 BLAS operations like GEMM, however, have 2MNP flops to MN+NP+MP necessary memops (load each matrix element once, store the result). That means the flops to memops ratio can in principle get arbitrarily large if only the matrices are big enough. Which in turn means that high-performance GEMMs are all about making sure that they do, in fact, reuse matrix elements extensively once they’re in the cache, and making sure that all the different cache levels are happy.

The way this works is interesting and worth studying, and that’s why that paper was on my list. Whew.

### 31. Bientinesi, van de Geijn-“Formal Correctness and Stability of Dense Linear Algebra Algorithms” (2005; numerical LA)

According to my headings, all of the above was about the matrix multiplication paper. Well, what can I say? I was lying.

That whole business with deriving our LU decomposition by partitioning our matrix into blocks, writing down equations for the individual block elements, and then winging our way towards a solution? That’s, essentially, this paper. Except the real paper is a lot more rigorous and consequently with a lot less “winging it”.

Partitioning matrices into blocks for fun and profit is a numerical linear algebra mainstay. I saw it done for a few algorithms at university. The work of this group at UT Austin (especially the stuff following from this paper) is what made me realize just how general and systematic it can be when it’s done right.

For a large class of dense LA algorithms, this procedure is well-specified enough to derive a working algorithm automatically from a problem description, complete with correctness and numerical stability analysis, within seconds; no inspiration required. It’s an algorithm-derivation algorithm. For a very limited (and fairly rigidly structured) problem domain, but still!

This is really cool and I like it a lot.