Continued from part 4.

### 19. Curtsinger, Berger – “Stabilizer: Statistically sound performance evaluation” (2013; performance evaluation)

Current CPUs are chock-full of associative data structures that are indexed by memory addresses, either virtual or physical. In no particular order, we have: multiple cache levels (for instructions, data, or both), multiple TLB levels, page entry caches, branch target buffers (multiple levels), branch direction history, snoop filters, cache directories and more. (Not all of these exist everywhere or are necessarily distinct.)

All of these implement a kind of “forgetful” dictionary/map structure, and they are usually implemented as set-associative caches. Without going into too much detail, that means that addresses are “hashed” (some caches use relatively decent hash functions, but most that are on time-critical paths just use a few bits from the middle of the address as their “hash”). For a given hash value, there are then typically somewhere between 2 and 16 locations in the cache that values with that hash can be stored in (that’s the “set” in “set-associative”). Generally, to free up a slot, something else has to be thrown out (“evicted”) first; there’s a lot of design freedom in how caches pick what gets evicted. Most commonly, it’s either one of several LRU approximations (“real” LRU for more than 3-4 elements is relatively expensive in hardware) or just a (pseudo-)random eviction policy.
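To make the "few bits from the middle of the address" idea concrete, here is a minimal sketch of a set-associative cache model. The sizes (64-byte lines, 64 sets, 8 ways) and the use of true LRU are assumptions picked for illustration, not a description of any particular CPU, and the names `split_address` and `SetAssociativeCache` are made up for this example.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 64    # number of sets (assumed)
WAYS = 8         # slots per set, i.e. the associativity (assumed)

def split_address(addr):
    """Split an address into (tag, set_index, line_offset)."""
    offset = addr % LINE_SIZE
    line_number = addr // LINE_SIZE
    # The "hash" is just the low bits of the line number,
    # i.e. bits from the middle of the full address.
    set_index = line_number % NUM_SETS
    tag = line_number // NUM_SETS
    return tag, set_index, offset

class SetAssociativeCache:
    def __init__(self):
        # One list of tags per set, ordered from least to most recently used.
        self.sets = [[] for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        tag, set_index, _ = split_address(addr)
        ways = self.sets[set_index]
        if tag in ways:
            ways.remove(tag)
            ways.append(tag)   # mark as most recently used
            return True
        if len(ways) == WAYS:
            ways.pop(0)        # set is full: evict the least recently used tag
        ways.append(tag)
        return False
```

In this model, addresses that are `LINE_SIZE * NUM_SETS` bytes apart all land in the same set. Eight of them coexist happily; touch a ninth and the oldest one gets evicted, even if the rest of the cache is empty. That is the mechanism by which a small shift in memory layout can turn hits into misses.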

Even when there is no randomness in the cache policy, the addresses themselves are also somewhat random. Virtual addresses within a process used to be mostly the same between two runs of the same program (this is possible because each process gets its own address space), but this made it easy to write reproducible security exploits, so we got ASLR (address space layout randomization), which deliberately shuffles the memory map to make writing reliable exploits harder. Even without ASLR, memory maps for separate runs of the same process weren’t always exactly the same; for example, changing the size of the environment variables or command line can also have a big knock-on effect on the memory map as various areas get moved to make space.

Modifying code in any way whatsoever can have significant effects on memory layout too. If a single function changes size, the linker is going to end up moving everything after it (for a randomly picked function, on average that’s going to be half the program) to a different location.

For anything that is indexed with *physical* memory addresses (typically L2 and below cache levels, and everything associated with them), things are even less predictable. Two back-to-back runs of the same process may have the same virtual address map if they have the same environment, command line and ASLR is disabled, but they’re unlikely to get assigned the exact same physical memory as the previous run, because physical memory is a globally shared resource between all the processes running on a system, and the OS is servicing physical memory allocations and deallocations all the time.

The end result is that even if you manage to carefully control for other spanners in the works such as background tasks and services (which have a tendency to start intensive tasks like indexing when you least expect it), network traffic (lots of incoming or outgoing packets that are not for your program can have a serious effect), available memory, and GPU/CPU voltage/frequency scaling, it’s possible to end up with dramatic execution time differences between either two runs of the same program, or two runs of two versions of a program that haven’t touched any code in the inner loops.

How bad does this get for real-world code? Now don’t get me wrong, many programs aren’t significantly affected by such issues at all. But some are; and microscopic layout differences can have decidedly macroscopic consequences. The SPECint 2000 “perl” benchmark (which is just the Perl interpreter running some – completely deterministic! – code) is infamous for having ±5% swings (this happens for multiple architectures) whenever *anything* in the executable changes. “Causes of Performance Instability due to Code Placement in X86” (presentation given at the LLVM US Dev Meeting 2016) has some particularly drastic examples.

This kind of thing is particularly insidious when you’re spending time tinkering with some hot loop. Ever made a change that really should have made things faster but instead was a noticeable slow-down? Often these things are explained if you look at the generated machine code (if you are the kind of person who does that, that is), but sometimes they really are completely mysterious and apparently random – or at least “twitchy”, meaning there are big swings whenever you make *any* change, with no discernible rhyme or reason. It’s really frustrating to chase this type of problem down if you’re ever stuck with it (it involves a lot of staring at CPU performance counters between different versions of a program), but probably worse is when you don’t notice it’s happening at all, and think that what ends up being a “random” (or at least accidental/unintended) fluctuation is due to an intentional change working as expected.

That’s where this paper comes in. Stabilizer uses various tricks to periodically re-randomize memory layout (to the extent possible) in LLVM-compiled code while the app is running. It can’t move existing allocations, but it can add random padding on thread stacks, make new heap allocations return addresses in a different range, and shuffle machine code around. While any *individual* memory layout has its own biases, sampling over *many* independent memory layouts during a test run allows systematically controlling for memory layout effects. Ideally (if the memory layout gets sufficiently randomized and there are no other confounding factors), it results in the different runs being independent and identically distributed, meaning the Central Limit Theorem applies and the distribution of the resulting sum is approximately Gaussian. That’s very good news because Gaussians are easy to do statistical hypothesis testing with, which the paper then proceeds to do.

I wish this kind of thing was standard on every platform I’m developing for, but sadly the paper’s implementation never seems to have made it past a research prototype (as far as I can tell).

### 20. Tomasulo – “An efficient algorithm for exploiting multiple arithmetic units” (1967; computer architecture)

Many important computer architecture ideas are *way* older than you might expect. The primary reason is that there was intensive research and fierce competition in the mainframe/supercomputer market in the 1960s and 70s. Decades later, after advances in semiconductor manufacturing made microprocessors (entire processors *on a single chip*!) first possible and then kept the available transistor budget at consumer-relevant prices increasing exponentially, more and more ideas that used to be limited to top-of-the-line, room-sized computing devices became relevant for mass-market electronics.

Tomasulo’s paper talks about the floating-point unit in the very top of the line of the famous System/360 series, the Model 91. Sources disagree on how many of these were ever built; the count is somewhere between 10 and 20. Worldwide. Just the CPU set you back about $5.5 million, in 1966 dollars (which would be $41.5 million at the time I’m writing this, September 2017). But it wouldn’t do you much good by itself; you’d also want to shop for a console, a couple punch-card readers, a tape drive or two, maybe even a disk drive, some line printers… you get the idea. My point being, the computer market was *different* in the 1960s.

System/360 is extremely influential. It pioneered the idea of an *Instruction Set Architecture* (ISA). In the 60s, every CPU model had its own instruction set, operating system, and compilers. It was considered normal that you’d have to rewrite all your software when you got a newer machine. System/360 had a different idea: it specified a single ISA that had multiple implementations. At the low end, there was the Model 30. It executed 34500 instructions per second and had, depending on configuration, somewhere between 8KB and 64KB of core memory (and I do mean core memory). At the very top, there was the Model 91, running 16.7 million instructions per second and with several megabytes of RAM. Of course, there were several models in between. And all of them would run the exact same programs. Note I wrote *is* earlier; the direct descendants of System/360 are still around, albeit re-branded as “z/Architecture” (slashes still going strong after 50 years!) when it was extended to 64 bits. IBM announced a new iteration, the z14, with custom CPU just a few weeks ago, and yes, they are still backwards compatible and will run 1960s System/360 code if you want them to.

Snicker all you want about mainframes, but if you manage to design a computer architecture in the 1960s that is still a multi-billion-dollar-a-year business (and still gets new implementations 50 years on), you certainly did *something* right.

Anyway: in retrospect, we can clearly say that this whole ISA idea was a good one. But it comes with a new problem: if the whole pitch for your product line is that you can always upgrade to a bigger machine if you need higher performance, then these bigger machines better deliver increased performance on whatever code you’re running, without a programmer having to modify the programs.

Which brings us to the Model 91 floating-point unit and Tomasulo’s algorithm. The Model 91 FPU had separate, pipelined floating-point adders and multipliers. This is now completely standard, but was novel and noteworthy then (there’s a separate paper on the FPU design that’s fascinating if you’re interested in that sort of thing; among other things, that FPU was the source of Goldschmidt’s division algorithm).

The Model 91’s instruction unit could deliver one instruction per clock cycle. FP additions took 2 cycles (pipelined), single-precision multiplies 3 cycles (pipelined), and single-precision divisions 12 cycles (blocking, since they are executed as a sequence of multiplies).

Now if you were writing code for that specific machine by hand, unrolling loops where necessary etc., it would not be very hard to write code that actually achieves a rate of 1 new instruction per clock cycle – or at least, it wouldn’t be if the original System/360 architecture had more than 4 floating-point registers. But given the 4 floating-point registers available, the relatively slow storage access (judging by the paper, loads had a latency of about 10 clock cycles), and the desire to be “plug-in compatible” and not require hand-tuned code for good performance if possible, a different solution was required.

Thus, Tomasulo’s algorithm, the first “shipping” instance of out-of-order execution, albeit only for the FPU portion of the Model 91, and only issuing one instruction per cycle. (The first attempt at *superscalar* out-of-order execution I’m aware of was designed by Lynn Conway for the IBM ACS-1 around the same time, but the project got canned for political reasons).

What I particularly like about Tomasulo’s paper is that it illustrates the process of incrementally modifying the initial in-order floating point unit design to its out-of-order equivalent. Textbook treatments of out-of-order execution generally deal with more complicated full out-of-order machines; focusing it on a single FPU makes it easy to see what is going on, and how the changes are relatively incremental.

The underlying ideas for out-of-order execution are closely related to the local value numbering I talked about last time in the context of SSA construction – namely, eliminating unnecessary dependencies that are just the result of name collisions rather than true dataflow between operations. CPUs don’t have the same difficulty as compilers in that they don’t need to build a representation valid for *all possible* control flow sequences through a given function; instead, they can either wait until control flow is decided (as in the Model 91 variant) or use speculation to, effectively, bet on a single likely control flow outcome when encountering a branch. (If that bet turns out to be wrong, all instructions thus started have to be cancelled.)

While out-of-order execution as a concept has been around for a long time, as far as I can tell, the first mass-produced *fully* out-of-order machines – in the sense of fully committing to out-of-order and speculative execution for the majority of the processor core, not just at the “periphery” (say for the FPU) – came out all within a few years of each other in the mid-90s. There’s IBM’s PowerPC 604 (December 1994), Intel’s Pentium Pro (November 1995), the MIPS R10000 (January 1996), and the DEC Alpha 21264 (October 1996).

Of these, the first two used a variant of Tomasulo’s algorithm using reservation stations, and the last two used a different approach based on explicit register renaming. Some processors even mixed the two styles: several of the early AMD Athlons used Tomasulo in the integer pipe and explicit register renaming for floating-point/SIMD.

Newer processors seem to tend towards explicit renaming; it usually has less data movement than Tomasulo’s algorithm, because for the most part, the signals passed through the pipeline are indices into a physical register file instead of actual values. The difference is especially pronounced with wide SIMD registers. But Tomasulo’s algorithm was very popular for quite a long time.
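To illustrate how renaming removes dependencies that come purely from name collisions, here is a minimal sketch in the explicit-renaming style. It is not Tomasulo's reservation-station scheme, and it assumes an unlimited supply of physical registers that never get freed, which no real machine has. The `(dst, src1, src2)` instruction format and the function name `rename` are made up for this example.

```python
def rename(instructions, num_arch_regs):
    """Rename architectural registers to physical ones.

    instructions: list of (dst, src1, src2) architectural register numbers.
    Returns the same instructions expressed in physical register numbers.
    """
    # Initially, architectural register i lives in physical register i.
    mapping = list(range(num_arch_regs))
    next_free = num_arch_regs
    renamed = []
    for dst, src1, src2 in instructions:
        # Sources read whichever physical register currently holds the value.
        p1, p2 = mapping[src1], mapping[src2]
        # Every write gets a fresh physical register, so it can never clobber
        # a value that an earlier, still-pending instruction wants to read.
        mapping[dst] = next_free
        next_free += 1
        renamed.append((mapping[dst], p1, p2))
    return renamed

program = [
    (0, 1, 2),  # r0 = r1 op r2
    (3, 0, 1),  # r3 = r0 op r1   (true dependency on the first instruction)
    (0, 2, 2),  # r0 = r2 op r2   (reuses the name r0, but is unrelated)
    (1, 0, 3),  # r1 = r0 op r3
]
print(rename(program, num_arch_regs=4))
```

After renaming, the third instruction writes a different physical register than the first, so it no longer has to wait for the second instruction to read the old r0. Only the true dataflow dependencies remain.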

### 21. Wilson, Johnstone, Neely, Boles – “Dynamic storage allocation: A survey and critical review” (1995; memory allocation)

This one is a survey of memory allocation algorithms as of 1995; that is to say, it collects almost everything you need to know about how to implement memory (de)allocation for single-threaded programs, but doesn’t cover multi-threaded environments, which only really became a serious concern later. That’s not as big a gap as it may sound, because the main adaptations necessary to get a useful multi-threaded allocator are, at least conceptually, relatively clean. (One-sentence summary of proven-out approaches: thread-local caches, deferring of cross-thread frees, and multiple arenas.)

But back to 1995: why is this paper on my list? Well, if you started programming in the early 90s on home computers or PCs (like me), then your early experiences with dynamic memory allocation pretty much sucked. (As far as I can tell, many late-80s era Unices were not all that much better.)

The key problems were these:

- These allocators tended to be buggy, slow, or both at the same time. For example, it was fairly common for `realloc` implementations to be buggy and corrupt the heap in certain circumstances.
- They tended to have high levels of memory fragmentation, meaning that programs that did many allocations and deallocations would tend to end up reserving a lot more memory space than they were actively using, because there were lots of awkwardly-sized holes in the middle of the address space that were not large enough to satisfy normal allocation requests.
- All of this would be exacerbated by running in an environment without virtual memory (this one did not apply to the aforementioned Unices, obviously). With virtual memory and swap space, growing memory use over time is more an inconvenience than a show-stopper. The program’s performance (and system performance in general) slowly degrades. Without virtual memory, the moment you want to allocate even one byte more than there is memory in the machine, you get an “out of memory” condition and, more likely than not, your program crashes.

The reason I’m citing this paper is because it turns out that the first two issues are related. In particular, it turns out that there were many issues with the way memory allocators were usually designed and evaluated. Most significantly, the “standard procedure” to evaluate allocators in the literature for a long time was to generate statistics for the sizes of memory allocations from real programs (or sometimes by building Markov models for the allocation/deallocation patterns), and then evaluate allocators using synthetic traces generated to match those statistics, rather than using actual allocation traces recorded from full runs of some test programs.

It turns out that this is a serious mistake. One of the most important characteristics of “real” allocation traces is that allocation and deallocation patterns tend to be bursty, and the statistics used didn’t capture any of those patterns.

This survey looks critically at this evaluation methodology, points out its flaws, and also points out the necessity of systematically analyzing the effect of individual allocation *policies*, rather than just testing complete allocators against each other.

In the follow-up paper “The memory fragmentation problem: solved?” the same authors analyze various allocator implementations and show that roving-pointer “next-fit”, one of the more popular algorithms at the time (and recommended by Knuth’s “The Art of Computer Programming”), has particularly bad fragmentation behavior, and that address-ordered first fit and best fit perform quite well, having very low overall fragmentation. In short, the high fragmentation produced by many late-80s and early-90s allocators was primarily because they used an algorithm that exacerbates the problem, which went unnoticed for a long time because the accepted analysis methodology was deeply flawed.
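For readers who haven't seen these placement policies spelled out, here is a minimal sketch of the two that come out well, address-ordered first fit and best fit. It assumes the free list is a plain Python list of `(address, size)` pairs kept sorted by address, and it leaves out freeing and coalescing entirely. All names are made up for this example; this is not code from either paper, and roving-pointer next fit is not shown.

```python
def first_fit(free_blocks, size):
    """Address-ordered first fit: the lowest-addressed block that is big enough."""
    for i, (_addr, block_size) in enumerate(free_blocks):
        if block_size >= size:
            return i
    return None

def best_fit(free_blocks, size):
    """Best fit: the smallest block that is big enough (lowest address wins ties)."""
    best = None
    for i, (_addr, block_size) in enumerate(free_blocks):
        if block_size >= size and (best is None or block_size < free_blocks[best][1]):
            best = i
    return best

def allocate(free_blocks, size, policy):
    """Carve `size` bytes out of the block chosen by `policy`.

    Returns the allocated address, or None if no block is large enough.
    """
    i = policy(free_blocks, size)
    if i is None:
        return None
    addr, block_size = free_blocks[i]
    if block_size == size:
        del free_blocks[i]                                 # block is used up exactly
    else:
        free_blocks[i] = (addr + size, block_size - size)  # keep the remainder
    return addr
```

Given free blocks `[(0, 100), (200, 30), (400, 50)]` and a 30-byte request, first fit carves the request out of the 100-byte block at address 0, while best fit uses up the 30-byte hole at address 200 exactly and leaves the large block intact.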

The authors of both papers collaborated with Doug Lea, who improved his allocator dlmalloc in response; dlmalloc implements a combination of segregated-storage (for smaller sizes) and best-fit (based on a radix tree, for larger requests). The resulting allocator is quite famous and was the basis for many later follow-up allocators.

### 22. Meyer, Tischer – “GLICBAWLS – Grey Level Image Compression By Adaptive Weighted Least Squares” (2001; image compression)

This paper is not like the other papers on this list.

For one thing, the name is a pretty obvious backronym. For another, the paper comes with a reference implementation – because the algorithm was originally developed as an entry for the International Obfuscated C Code Contest and fits in 1839 bytes of C / ASCII art.

Nevertheless, the algorithm is quite elegant and was, at the time it was released, easily among the best lossless grayscale image compressors. I wouldn’t use it today because it’s just way too serial internally, but I still have a soft spot for it.

### 23. Hutton et al. – “Improving FPGA Performance and Area Using an Adaptive Logic Module” (2004; FPGAs)

This one, I don’t have that much to write about.

Here’s the short version: the combinational logic elements in FPGAs are internally realized as small lookup tables (LUTs). Arbitrary Boolean functions are thus described by their truth tables, which are stored in small SRAMs: 16 bits of SRAM for a 4-input binary logic function.
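If the "truth table in a tiny SRAM" idea is new to you, here is a minimal software sketch of it: the truth table of a 4-input function is packed into a 16-bit integer, and evaluating the function is just using the inputs as an address into that table. The names `make_lut` and `eval_lut` are made up for this example.

```python
def make_lut(func, num_inputs=4):
    """Pack the truth table of `func` into an integer, one bit per input combination."""
    table = 0
    for index in range(1 << num_inputs):
        # Bit i of the index is the value of input i.
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        if func(*bits):
            table |= 1 << index
    return table

def eval_lut(table, *inputs):
    """Look up the output bit: the inputs together form the address into the table."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> index) & 1

# Any 4-input Boolean function fits in 16 bits, no matter how convoluted it is.
lut = make_lut(lambda a, b, c, d: (a and b) ^ (c or d))
print(f"{lut:016b}", eval_lut(lut, 1, 1, 0, 0))
```

The point is that the cost of a LUT depends only on the number of inputs, not on how complicated the logic function is.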

On the first page, the paper has a picture of a simple logic cell: a 4-input LUT plus a D flip-flop. This used to be the standard building block of FPGAs, because it yields a good trade-off between area and delay for synthesized designs.

What this paper does is derive the design of a different logic module cell that can express multiple 4-input LUTs, arbitrary 6-input LUTs, certain pairs of two 6-input logic functions with two outputs, or certain 7-input logic functions.

This is purely geeking out but it’s pretty cool if you’re interested in that sort of thing!

Continued from part 3.

### 16. O’Donoghue et al. – “Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding” (2016; numerical math/mathematical optimization)

This is a very neat first-order method to solve cone programs that brings together several key advances in the theory over the past 25 years to produce a conceptually simple (and quite short!) yet powerful solver.

Unfortunately, this puts me in a bit of a pickle here, because I expect most of you don’t know what cone programs are, what “operator splitting” is, or for that matter, what is meant by “homogeneous self-dual embedding” or “first-order method” here.

So I’m going to do the same thing I did in part 2 when talking about matrix multiplies and solving linear systems of equations, and back up a whole lot.

We saw linear systems of equations; in matrix form, we were trying to solve

$$Ax = b$$

where $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^n$. If A is regular (nonzero determinant), this problem has exactly one solution, and barring potential numerical issues if A is ill-conditioned (just ignore that part if you don’t know what it means), we can solve this with standard methods like LU decomposition with partial pivoting – like the non-pivoted LU we saw in part 2, but now we’re allowed to swap rows to avoid certain problems and increase numerical accuracy (still not gonna go into it here). I hope that after reading part 2, you have at least some idea of how that process goes.

On the next step of the ladder up, we have linear least-squares problems. These naturally occur when we have linear systems with more equations than unknowns (more common), or linear systems with more variables than equations (less common, and I’ll ignore that case in the following), and a few others. These types of problems appear in approximation, or when trying to recover parameters of a linear model from noisy measurements where errors are uncorrelated, have expectation 0 and the same variance (as per the Gauss-Markov theorem). They also tend to get applied to problems where they’re really not well-suited at all, because linear least-squares is by far the easiest and most widely-known mathematical optimization method. But I’m getting ahead of myself.

Under the original conditions above, we could actually achieve $Ax = b$ (in exact arithmetic anyway). If we have more equations than variables, we can’t expect to hit an exact solution always; we now have $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $m \geq n$ (at least as many equations as variables), and the best we can hope for is $Ax \approx b$. We rewrite this into $Ax - b \approx 0$, because in normed vector spaces, we have good machinery to express that some vector is “close” to zero: we want its norm to be small. That is, we get a problem of the type

$$\min_x \|Ax - b\|$$

for some norm we get to choose. Not surprisingly given the name “least squares”, we choose the 2-norm. I’m going to go over this fairly quickly because least-squares isn’t the point of this post: for a vector x, we have $\|x\|_2 = \sqrt{x^T x}$ where $x^T x$ is just a dot product of x with itself written as a matrix product, if you haven’t seen that notation before. We can get rid of the square root by just squaring everything (standard trick) and get:

$$\min_x \|Ax - b\|_2^2 = \min_x (Ax - b)^T (Ax - b) = \min_x \left( x^T A^T A x - 2 x^T A^T b + b^T b \right)$$

That’s a quadratic function. Multi-dimensional, but still quadratic. Basic calculus tells us that we can find the extrema of a quadratic function by finding the spot where its derivative is zero. The derivative of the above expression with respect to x is

$$2 A^T A x - 2 A^T b$$

and setting it to zero brings us to the linear system called the “normal equation”

$$A^T A x = A^T b$$

If you followed along with that, great! If not, don’t worry; the only thing you should remember here is that 2-norm means the function we’re trying to minimize is quadratic, which therefore has a linear derivative, thus we can find the minimum by solving a linear system, which we know how to do. Caveat: the normal equations are not generally the preferred way to solve such problems, because squaring A (the $A^T A$ part) greatly amplifies any numerical problems that the matrix might have. Unless you know in advance that $A^T A$ is well-behaved (which you do in several types of problems), you should generally solve such problems with other non-squaring approaches based on say the QR decomposition or SVD, but again, out of scope for this post.
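As a small worked example of "minimize a quadratic by solving a linear system", here is a sketch that fits a straight line $y \approx c_0 + c_1 t$ to data points by forming the normal equation $A^T A x = A^T b$ by hand. Each row of $A$ is $(1, t_i)$ and $b$ holds the $y_i$, so $A^T A$ is just a 2×2 matrix that we can solve with Cramer's rule. The function name `fit_line` is made up for this example, and per the caveat above, this is for illustration and not how you'd want to solve ill-conditioned problems.

```python
def fit_line(ts, ys):
    """Least-squares fit of y = c0 + c1*t via the 2x2 normal equations."""
    m = len(ts)
    # Entries of A^T A = [[m, st], [st, stt]] and A^T b = [sy, sty].
    st = sum(ts)
    stt = sum(t * t for t in ts)
    sy = sum(ys)
    sty = sum(t * y for t, y in zip(ts, ys))
    # Solve the 2x2 system with Cramer's rule.
    det = m * stt - st * st
    c0 = (sy * stt - st * sty) / det
    c1 = (m * sty - st * sy) / det
    return c0, c1

ts = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.9, 5.2, 6.8]
c0, c1 = fit_line(ts, ys)
residuals = [c0 + c1 * t - y for t, y in zip(ts, ys)]
```

A handy sanity check is that at the solution, the residual vector is orthogonal to the columns of $A$: here that means both `sum(residuals)` and `sum(t * r for t, r in zip(ts, residuals))` come out as zero up to rounding. That's just the normal equation rearranged as $A^T (Ax - b) = 0$.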

So far, we’ve covered elimination methods for solving linear systems (known to the Chinese back in the 2nd century CE) and linear least-squares; depending on whether you trust Gauss’s assertion that he came up with it before Legendre published it, the latter is either late 18th or early 19th century. The third of the trifecta of fundamental linear numerical algorithms is Linear Programming; this puts us either in the 19th century (if you’re feeling charitable and consider Fourier-Motzkin elimination a viable way to solve them) or in the mid-20th century for the first practical algorithm, the simplex method.

But what *is* a linear program? A canonical-form linear program is a mathematical optimization problem of the form

maximize: $c^T x$

subject to: $Ax \leq b$ (componentwise), $x \geq 0$ (also componentwise)

But as the “canonical-form” name suggests, LPs come in many forms. In general, a linear program:

- **must** have a linear objective function. It doesn’t need to be an “interesting” objective function; for example, the sub-class of linear feasibility problems “maximizes” the objective function 0, and it’s really just about finding a point that satisfies the constraints.
- can be either minimization or maximization. They’re trivial to turn into each other: to minimize $c^T x$, just maximize $-c^T x$ and vice versa.
- **can** have a set of linear equality constraints. These are not strictly necessary and hence absent in the canonical form because you can rewrite a linear equality constraint $a^T x = \beta$ into twice the number of inequality constraints $a^T x \leq \beta$, $-a^T x \leq -\beta$, but basically all solvers naturally support them directly and will prefer you to just pass them in unmodified.
- **must** have some linear inequality constraints. Either direction works: to get say $a^T x \geq \beta$, just multiply everything by -1 to yield $-a^T x \leq -\beta$. That’s why the canonical form just has one direction.
- **may or may not** require x (the variable vector) to be non-negative. The canonical form demands it, but as with the lack of equality constraints, this one’s really not essential. If you’re forced to work with a solver that insists on non-negative x, you can split each variable that can be negative into two variables $x = x^+ - x^-$ with the constraint $x^+ \geq 0$, $x^- \geq 0$. If you have a solver that defaults to unconstrained x and you want non-negative, you just add a bunch of inequality constraints $-x \leq 0$ (i.e. you just append a negated identity matrix at the bottom of your constraint matrix A). Either way, still a linear program.

As you can see, while say a linear system is quite easy to recognize (is it a bunch of equations? Are they all linear in the variables of interest? If so, you’ve got a linear system!), LPs can come in quite different-looking forms that are nevertheless equivalent.

At this point, it would be customary to show a made-up linear problem. A factory can produce two kinds of widgets that have different profit margins and consume some number of machines for some amount of time, how many widgets of each kind of type should we be producing to maximize profit, and so forth. I’m not going to do that (check out a textbook on linear programming if you want to get the spiel). Instead, I’m going to show you an example with a very different flavor to hammer the point home that *LPs are not at all easy to recognize* (or at the very least, that it takes a lot of practice).

Earlier, we went over linear least-squares and saw that the problem reduces to a linear system of equations, because the objective function is quadratic and thus has a linear derivative. Therefore, that trick *only* works for the 2-norm. What if we’d really rather minimize the residuals (that’s the Ax-b) as measured in another norm? Say we want to minimize the largest deviation from 0 in any of the rows; this corresponds to the infinity-norm (or supremum norm), which is defined as $\|x\|_\infty = \max_{1 \leq i \leq n} |x_i|$. So our new problem is to minimize the maximum error in any of the components:

$$\min_x \|Ax - b\|_\infty$$

Turns out that this is a linear program, even if it really doesn’t look like one. Don’t see it? I don’t blame you. Here’s how it goes:

minimize: $t$

subject to: $-t \leq Ax - b \leq t$ (componentwise).

To make this clear, this is a linear program with our original vector x as its variables, plus an additional auxiliary variable t that we made up. We require that t bound the size of every component of Ax-b; those inequalities are really just saying that if we named the residuals $r = Ax - b$, we’d have $-t \leq r_1 \leq t$, $-t \leq r_2 \leq t$, and so on. And then as our objective function, we just want the smallest t possible (i.e. the smallest upper bound on the absolute value of the residuals), which is how we encode that we’re minimizing our error bound. We don’t constrain x at all; the solver gets to pick x however it wants to make that work. And yeah, I know this feels like a really cheap trick. Get used to it. Mathematical optimization is *all about* the cheap shots. (Also note that if we had other linear inequality constraints, like say wanting x to be non-negative, we could just throw these in as well. Still a linear program.)
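If you want to see what that looks like as actual data you'd hand to a solver, here is a minimal sketch that builds the LP in the generic form "minimize $c^T z$ subject to $Gz \leq h$", with the stacked variable vector $z = (x_1, \dots, x_n, t)$. The names `minimax_lp` and `is_feasible` and the $G$, $h$, $z$ notation are assumptions made up for this example; no solver is involved, we only check feasibility of a given point.

```python
def minimax_lp(A, b):
    """Build (c, G, h) for: minimize t subject to -t <= (A x - b)_i <= t.

    The LP variable vector is z = [x_1, ..., x_n, t].
    """
    n = len(A[0])
    c = [0.0] * n + [1.0]            # the objective picks out t and ignores x
    G, h = [], []
    for row, rhs in zip(A, b):
        # (A x - b)_i <= t   rewritten as    row . x - t <=  b_i
        G.append(list(row) + [-1.0])
        h.append(rhs)
        # -(A x - b)_i <= t  rewritten as   -row . x - t <= -b_i
        G.append([-a for a in row] + [-1.0])
        h.append(-rhs)
    return c, G, h

def is_feasible(G, h, z, tol=1e-12):
    """Check G z <= h row by row (with a small tolerance for rounding)."""
    return all(sum(g * v for g, v in zip(row, z)) <= rhs + tol
               for row, rhs in zip(G, h))

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 4.0]
c, G, h = minimax_lp(A, b)
```

For $x = (1, 2)$ the residuals are $(0, 0, -1)$, so the point $z = (1, 2, 1)$ is feasible, while $z = (1, 2, 0.5)$ is not: for a fixed x, the smallest feasible t is exactly $\|Ax - b\|_\infty$, and the solver then gets to shop around for the x that makes that as small as possible.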

And while we’re on the subject of cheap shots: we have the probably two most important norms in applications, the 2-norm and the infinity-norm; can we get the next important one, the 1-norm as well? Sure we can. Buckle up! The 1-norm of a vector is defined as $\|x\|_1 = \sum_{i=1}^n |x_i|$, i.e. the sum of the absolute values of the components. What happens if we try to minimize $\|Ax - b\|_1$? Yup, we again get a linear program, and this time we need a whole bunch of auxiliary variables and it’s even cheesier than the last one:

minimize: $\sum_{i=1}^m (r^+_i + r^-_i)$

subject to: $r^+ \geq 0$, $r^- \geq 0$, $Ax - b = r^+ - r^-$

This combines the auxiliary variables trick from last time with the “split a vector into positive and negative parts” trick we saw when I talked about how to get rid of unwanted constraints. The absolute value of residuals split that way is just the $r^+ + r^-$ we see in the objective function. And to top it all off, after introducing 2m auxiliary variables with the two m-element vectors $r^+$ and $r^-$, we discuss them all away by adding the linear equality constraint that forces their difference to equal our residuals.
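A quick sketch of why the positive/negative split gives you absolute values: for any residual vector, the "tight" split has at most one of the two parts nonzero in each component, and then the sum of the parts is exactly the 1-norm. Any other split that satisfies the same constraints only makes the objective larger, which is why the minimizer ends up at the tight one. The name `split_residuals` and the example numbers are made up for illustration.

```python
def split_residuals(r):
    """Split r into non-negative parts with r = r_plus - r_minus."""
    r_plus = [max(v, 0.0) for v in r]
    r_minus = [max(-v, 0.0) for v in r]
    return r_plus, r_minus

r = [3.0, -1.5, 0.0]
r_plus, r_minus = split_residuals(r)

# The tight split: the objective equals the 1-norm of r.
tight_objective = sum(r_plus) + sum(r_minus)
one_norm = sum(abs(v) for v in r)

# A sloppier split that still satisfies all constraints, but costs more.
loose_plus = [4.0, 0.0, 0.0]
loose_minus = [1.0, 1.5, 0.0]
loose_objective = sum(loose_plus) + sum(loose_minus)
```

Here `tight_objective` equals `one_norm`, and `loose_objective` is strictly larger even though `loose_plus` minus `loose_minus` still reproduces `r`.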

There are two points I’m trying to make here: first, that a surprising number of problems turn out to be LPs in disguise. And second, that they really don’t need to look like it.

That much for LPs. I still haven’t talked about cone programs or the actual paper, though! But now that I’ve shown how these things crop up, I’m going to reduce the amount of detail a lot.

First: cone programs. What we’ve seen so far are LPs. Cone programs are a generalization, and there’s a whole zoo of them. Cone linear programs are equivalent to regular LPs – effectively just a different canonical form. Then there’s Second-Order Cone Programs (SOCPs) which are a superset of LPs and can also express a whole bunch of other optimization problems including (convex) Quadratic Programming (quadratic objective function, linear inequality constraints) and positive semidefinite quadratically constrained quadratic programming (quadratic objective function and quadratic inequality constraints, all quadratic forms involved positive-semidefinite). This is somewhat surprising at first because SOCPs have a linear objective function and quadratic constraints, but the solution to “I want to express a quadratic objective function using a linear objective function and quadratic constraints” turns out to be yet more cheap shots. (I’ll let you figure out how this one goes yourself, you’ve seen enough by now.)

The next step up from SOCPs is semidefinite programs (SDPs), which can express everything I’ve mentioned so far, and more. And then there’s a couple more cone types that I’m not going to cover here.

What all these cone programs have in common is very similar mathematical foundations (though the neat theory for this is, to my knowledge, relatively recent; we’re in the mid-1990s now!) and very closely related solvers, traditionally interior point methods (IPMs).

IPMs work by encoding the constraints using smooth (differentiable) barrier functions. The idea is that such functions increase rapidly near the bounds of the feasible region (i.e. when you’re getting close to constraints being violated), and are relatively small otherwise. Then use Newton’s method to minimize the resulting smooth objective function. And Newton iteration is a “second-order” method, which means that once the iteration gets close to a solution, it will roughly double the number of correct digits in every step. In practice, this means that you perform a bunch of iterations to get close enough to a solution, and once you do, you’ll be converged to within machine precision in another four to six iterations.
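To give a rough feel for the barrier idea, here is a toy one-dimensional sketch: minimize $x$ subject to $1 \leq x \leq 3$ (the answer is obviously $x = 1$). The constraints are replaced by the log barrier $-\log(x - 1) - \log(3 - x)$, and Newton's method minimizes $t x$ plus the barrier for increasing weights $t$. The problem, the schedule for $t$, the crude step-halving safeguard and the name `barrier_minimize` are all assumptions picked for illustration; real IPMs are considerably more careful about all of this.

```python
def barrier_minimize(lo=1.0, hi=3.0, x=2.0):
    """Minimize x subject to lo <= x <= hi using a log barrier and Newton's method."""
    t = 1.0
    while t <= 1e6:
        # Minimize t*x - log(x - lo) - log(hi - x) for the current weight t,
        # starting from the solution for the previous t.
        for _ in range(100):
            grad = t - 1.0 / (x - lo) + 1.0 / (hi - x)
            hess = 1.0 / (x - lo) ** 2 + 1.0 / (hi - x) ** 2
            step = -grad / hess          # Newton step for a 1D function
            # Shrink the step until the new point is strictly inside the bounds.
            while not (lo < x + step < hi):
                step *= 0.5
            x += step
            if abs(step) < 1e-12:
                break
        t *= 10.0                        # weight the real objective more heavily
    return x

print(barrier_minimize())
```

As $t$ grows, the barrier matters less and less except right at the boundary, so the minimizer creeps towards the true solution at $x = 1$ while always staying strictly inside the feasible region.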

And that’s where we finally get to the paper itself: O’Donoghue et al.’s method is not a second-order method; instead, it’s a first-order method that uses a less sophisticated underlying iteration (ADMM) that nevertheless has some practical advantages. It’s not as accurate as second-order methods, but it needs a lot less memory and individual iterations are *much* faster than in second-order methods, so it scales to large problems (millions of variables and constraints) that are currently infeasible to solve via IPMs.

The way it works uses a lot of other techniques that were refined over the past 25 years or so; for example, the titular homogeneous self-dual embedding goes back to the mid-90s and is a general way to encode cone programs into a form where initialization is easy (always a tricky problem with iterative methods), there’s no need to do a separate iteration to find a feasible point first, and boundary cases such as infeasible or unbounded problems are easy to detect.

The end result is a fundamentally *really simple* iteration (equation 17 in the paper) that alternates between solving a linear system – always the same matrix, for what it’s worth – and a projection step. For linear programs, all the latter does is clamp negative values in the iteration variables to zero.

If you’ve made it this far, have at least some understanding of convex optimization and the underlying duality theory, and are interested in the details, definitely check out the paper! If you don’t but want to learn, I recommend checking out Boyd and Vandenberghe’s book Convex Optimization; there’s also a bunch of (very good!) videos of the course online on YouTube. Another good source if you’re at least somewhat comfortable with numerical linear algebra but don’t want to spend the time on the Convex Optimization course is Paul Khuong’s small tutorial “So You Want to Write an LP Solver”.

And this concludes the parts of this series where I info-dump you to death about numerical math. I won’t cover non-linear optimization or anything related here; maybe some other time. :)

### 17. Braun et al. – “Simple and Efficient Construction of Static Single Assignment Form” (2013; compilers)

Complete change of pace here; no more math for now, I promise. There will be Greek symbols though, because that’s kind of required here. Sorry.

All right, Static Single Assignment form. It’s used in all kinds of compilers these days, but why? Let’s use a small, nonsensical fragment of code in some arbitrary imperative language as demonstration:

    x = z + 5;
    y = 2*x - 1;
    x = 3;
    y = y - x;

We have 3 integer variables x, y, and z, and are doing some random computation. The important thing to note here is that we’re assigning to x and y twice, and we can’t just move statements around without changing the meaning of the program. For example, the second assignment to x *has* to go after the first assignment to y.

But computationally, that’s just silly. The second assignment to x doesn’t depend on anything; the only problem is that the *names* x and y refer to different *values* depending on where we are in the program. For a compiler, it turns out to be much more useful to separate the (somewhat arbitrary, programmer-assigned) names from the values they correspond to. Any dependencies between the actual operations are an intrinsic property of the computation the program is trying to specify. The names, not so much. So instead, we make a pass over the program where we give a variable a new name whenever we’re assigning to it:

    x1 = z + 5;
    y1 = 2*x1 - 1;
    x2 = 3;
    y2 = y1 - x2;

And now we can reorder the code if we want to; the only requirement is that a variable must be assigned to before it’s used. So for example, this ordering

    x2 = 3;
    x1 = z + 5;
    y1 = 2*x1 - 1;
    y2 = y1 - x2;

has the same meaning – even though we’re now “computing” the second value of x before we compute the first.

This process can be done systematically and straightforwardly for any straight-line block of code without control flow. It’s called “Local Value Numbering” (because we’re numbering values of a variable) and is literally older than digital computers themselves – for example, the idea appears in Ada Lovelace’s writings, penned 1842:

Whenever a Variable has only zeros upon it, it is called ^{0}V; the moment a value appears on it (whether that value be placed there arbitrarily, or appears in the natural course of a calculation), it becomes ^{1}V. If this value gives place to another value, the Variable becomes ^{2}V, and so forth. [..] There are several advantages in having a set of indices of this nature; but these advantages are perhaps hardly of a kind to be immediately perceived, unless by a mind somewhat accustomed to trace the successive steps by means of which the [analytical] engine accomplishes its purposes. We have only space to mention in a general way, that the whole notation of the tables is made more consistent by these indices, for they are able to mark a difference in certain cases, where there would otherwise be an apparent identity confusing in its tendency. In such a case as Vn=Vp+Vn there is more clearness and more consistency with the usual laws of algebraical notation, in being able to write ^{m+1}Vn=^{q}Vp+^{m}Vn.
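The renaming pass for straight-line code is short enough to sketch. This is a toy version under a few assumptions of mine: statements are given as `(destination, expression)` string pairs, identifiers are found with a simple regular expression, and source variable names don’t already end in digits. The function name `rename_straight_line` is made up for this illustration.

```python
import re


def rename_straight_line(statements):
    """Give every assignment in a straight-line block a fresh, numbered name.

    statements: list of (destination, expression) string pairs, in program order.
    Variables that are never assigned in the block (inputs) keep their names.
    """
    version = {}   # variable name -> number of its most recent assignment
    renamed = []
    for dest, expr in statements:
        def current_name(match):
            name = match.group(0)
            return f"{name}{version[name]}" if name in version else name

        # Uses refer to the versions that are current *before* this assignment.
        new_expr = re.sub(r"[A-Za-z_]\w*", current_name, expr)
        version[dest] = version.get(dest, 0) + 1
        renamed.append((f"{dest}{version[dest]}", new_expr))
    return renamed


if __name__ == "__main__":
    block = [("x", "z + 5"), ("y", "2*x - 1"), ("x", "3"), ("y", "y - x")]
    for dest, expr in rename_straight_line(block):
        print(f"{dest} = {expr};")
```

Run on the four-statement fragment from earlier, this reproduces the hand-renamed version with x1, y1, x2 and y2.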

However, we don’t have SSA yet. That took a bit longer. The problem is what to do with control flow, such as branches and loops. How do we apply value numbering to a program like this?

    x = z;
    y = 0;
    while (x > 0) {
        y = y + x;
        x = x - 1;
    }
    x = 2*y;

We can certainly apply local value numbering to each *basic block* of straight-line code, but that doesn’t get us very far in this case. So what do we do? Well, before I do anything else, let me first rewrite that loop slightly to a somewhat less idiomatic but equivalent form that will avoid some trouble with notation in a moment:

    x = z;
    y = 0;
    loop {
        if (x <= 0)
            break;
        y = y + x;
        x = x - 1;
    }
    x = 2*y;

The problem we have here is that operations such as `y = y + x` reference different versions of y depending on where we are in the control flow! In the first iteration of the loop, y refers to the initial value set by `y = 0`; but subsequent iterations reference the y computed by the previous iteration of the loop. We can’t just decide on a single version of any particular variable up-front; it depends on how we got into the current block. However, that’s really *all* it depends on.

So here’s the key trick that gives us SSA: we introduce a construct called a φ function (phi function) that returns one of its several arguments, depending on where we entered the current block *from*. Which of these values corresponds to which way of entering the current block needs to be kept track of; since I’m going with pseudo-code here, I’ll just write it in comments. With φ functions, we can transform the whole example program into SSA form:

    x1 = z;
    y1 = 0;
    loop {
        // when entering from outside loop, return first arg
        // otherwise, return second arg.
        x2 = φ(x1, x3);
        y2 = φ(y1, y3);
        if (x2 <= 0)
            break;
        y3 = y2 + x2;
        x3 = x2 - 1;
    }
    x4 = 2*y2;

Note that things get a bit tricky now: the φ functions for x2 and y2 have to reference the values x3 and y3 that are defined later in program order, and the calculation for x4 needs to pick up y2 (and not say y3), because that’s always the most recent definition of y when control flow reaches the computation of x4.
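If the φ functions feel like magic, it can help to execute the SSA version by hand. The sketch below does that in Python: `phi` is an ordinary helper of my own that picks an argument based on a `came_from` marker, which stands in for “which edge did we enter the block through”. Python rebinds x2, y2 and friends on every trip around the loop, so this only mirrors the static shape of the SSA program (one assignment statement per name, apart from the placeholders); it is not how a compiler represents φ nodes.

```python
def original(z):
    """The example program as written, before any renaming."""
    x = z
    y = 0
    while x > 0:
        y = y + x
        x = x - 1
    x = 2 * y
    return x


def ssa_form(z):
    """The same program in SSA shape, with phi selecting by incoming edge."""
    def phi(came_from, from_entry, from_back_edge):
        return from_entry if came_from == "entry" else from_back_edge

    x1 = z
    y1 = 0
    x3 = y3 = None            # placeholders: not defined yet on the first trip
    came_from = "entry"
    while True:
        x2 = phi(came_from, x1, x3)
        y2 = phi(came_from, y1, y3)
        if x2 <= 0:
            break
        y3 = y2 + x2
        x3 = x2 - 1
        came_from = "back_edge"
    x4 = 2 * y2               # y2, not y3: y2 is the definition that reaches the exit
    return x4


if __name__ == "__main__":
    print(all(original(z) == ssa_form(z) for z in range(-5, 50)))
```

Swapping `y2` for `y3` in the last assignment breaks the program for non-positive z, where the loop body never runs and y3 is still undefined.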

Why do we care about forming SSA? It turns out that this representation is very convenient for all kinds of program analysis, and well-suited to various transforms compilers wish to perform.

In this simple example, it’s not hard to construct the SSA form by hand. But for it to be useful in compilers, we need an algorithm, and it better be efficient – both in the sense that it doesn’t run too long, and in the sense that it shouldn’t increase the size of the program too much. In particular, we’d like to only insert phi functions that are truly necessary, instead of say inserting phis for all visible variables in every single block.

And that’s where this paper comes in. The “standard” SSA construction algorithm is due to Cytron et al., from a 1991 paper; it’s efficient and works fine, and is used in many production compilers, including LLVM. But it needs some fairly complicated machinery to do its job, and is not really suited to incremental construction.

That’s why I like this paper. It uses only elementary techniques, is simple to describe and understand, gives good results, and is suitable for quickly patching up an existing SSA-form program after modifications that would otherwise violate SSA form.

### 18. Lamport, Palais – “On the Glitch Phenomenon” (1976; hardware/physics)

This link goes to Lamport’s publication page, where he writes a few notes on the paper (and his difficulties in getting it published).

Both the notes and the paper are relatively short and unlike several of the other papers I’ve covered, I don’t have any background information to add; the problem statement here is relatively elementary, it’s just the answer that is surprising.

Lamport later wrote another paper, Buridan’s principle, about other instances of the same problem outside of CS. Like the Glitch paper, he ran into problems getting it published, so again I’m linking to the publications page, which has his notes.

In case you’re not already curious, I will quote this part of the notes on “Buridan’s principle”:

My problems in trying to publish this paper and [“On The Glitch Phenomenon”] are part of a long tradition. According to one story I’ve heard (but haven’t verified), someone at G. E. discovered the phenomenon in computer circuits in the early 60s, but was unable to convince his managers that there was a problem. He published a short note about it, for which he was fired. Charles Molnar, one of the pioneers in the study of the problem, reported the following in a lecture given on February 11, 1992, at HP Corporate Engineering in Palo Alto, California:

One reviewer made a marvelous comment in rejecting one of the early papers, saying that if this problem really existed it would be so important that everybody knowledgeable in the field would have to know about it, and “I’m an expert and I don’t know about it, so therefore it must not exist.”

Continued from part 2.

### 12. Chazelle – “Filtering search: a new approach to query-answering” (1986; computational geometry/information retrieval)

The part I specifically recommend is sections 1 through 3. For most of these problems, there are simpler solutions using different data structures by now (especially via the persistent search trees we saw earlier in part 1), so don’t hurt yourself with the parts dealing with the hive graph; but it’s definitely worth your time to grok these first few sections.

Filtering search deals with “retrieval”-type problems: you have a set of objects, and you want to run a query that finds the subset of those objects that matches some given criterion. In the interval overlap problem the paper uses as its first example, the set of objects is a bunch of 1-dimensional intervals:

I’m drawing each interval in its own line so overlaps don’t get messy, but that’s just a visualization aid; this is a 1D problem. Our queries are going to be “report all intervals in the database that intersect a given query interval”.

How would you set up a data structure that answers these queries efficiently?

A natural enough approach is to chop our 1D x-axis into slices, creating a new “window” (that’s the terminology the paper uses) whenever an interval begins or ends:

Because the begin and end points of intervals are the only locations where the answer to “which intervals overlap a given x-coordinate” changes, the answer is the same within each window. Therefore, if we compute this partition in advance and store it in say a search tree (or an external search tree like a database for bigger sets), and each window stores a list of which intervals overlap it, we can answer one of our original questions directly: given a 1D point, we can find all the intervals overlapping it by finding out which of the windows it overlaps by looking it up in our search tree, and then reporting the stored list.
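Here is a minimal sketch of that window structure, with a few simplifications of my own: intervals are half-open `[start, end)` so endpoint cases stay simple, a sorted array plus binary search stands in for the search tree, and the per-window lists are built by brute force. The names `build_windows` and `query_point` are made up for this illustration.

```python
from bisect import bisect_right


def build_windows(intervals):
    """Cut the axis at every interval endpoint and store, per window, who overlaps it.

    intervals: list of half-open (start, end) pairs; an interval's ID is its index.
    Returns (cuts, lists): window k spans [cuts[k], cuts[k+1]) and lists[k] holds
    the IDs of the intervals overlapping it, in increasing ID order.
    """
    cuts = sorted({point for interval in intervals for point in interval})
    lists = [[i for i, (start, end) in enumerate(intervals) if start <= left < end]
             for left in cuts]
    return cuts, lists


def query_point(windows, x):
    """Report the IDs of all intervals containing the point x."""
    cuts, lists = windows
    k = bisect_right(cuts, x) - 1      # index of the window containing x
    return lists[k] if k >= 0 else []


if __name__ == "__main__":
    windows = build_windows([(0, 4), (2, 6), (5, 8)])
    for x in (3, 4, 5.5, 9):
        print(x, query_point(windows, x))
```

The lookup is a single binary search followed by reporting a precomputed list, which is what makes point queries on this structure fast.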

Furthermore, given a solution for point queries, we can take a first stab at solving it for interval queries: find all the windows overlapping our query interval, and then return the union of all the lists for the individual windows. Computing unions of those sets can be done in linear time if we store the data appropriately: for example, if we give each interval stored in our database some unique integer ID and make sure all our lists of “intervals overlapping this window” are sorted by the interval IDs, then the union between an m-element list and an n-element list can be computed in $O(m+n)$ time (using a variant of the standard list merging algorithm used in mergesort).
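The merge-style union is the usual two-finger walk over both lists. A minimal sketch, assuming each input list is sorted by ID and free of duplicates (the function name `union_sorted` is mine):

```python
def union_sorted(a, b):
    """Union of two sorted, duplicate-free ID lists in O(len(a) + len(b)) time."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:                      # same ID in both lists: emit it once
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])              # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


if __name__ == "__main__":
    print(union_sorted([0, 1, 4], [1, 2, 4, 7]))
```

Each step advances at least one of the two indices, which is where the linear bound comes from.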

It’s fairly easy to convince yourself that this works, in the sense that it gives the correct answer; but is it efficient? How big can a database for n intervals get? This matters not just for storage efficiency, but also for queries: if the user queries a gigantic interval (say encompassing all intervals in the image above), we end up computing the union of all interval lists stored in the entire database. We know the final answer can only contain at most n intervals (because that’s how many are in the database), but do we have any nice bounds on how long it will take to determine that fact?

Alas, it turns out this approach doesn’t really work for our problem. Without further ado, here’s an example set that kills this initial approach for good:

Here, we have 5 intervals with the same center: the first one has a width of 1 unit, the second is 3 units wide, the third 5 units wide, and so on. The middle window overlaps all 5 intervals; the two windows to either side overlap 4 intervals each, all but the first; the next windows on either side overlap 3 intervals each, and so forth.

Therefore, the bottom-most interval gets stored in 1 list; the one up from it gets stored in 3 lists; the next one up gets stored in 5 lists, and so on. For our 5 input intervals, the total size of the lists sums to $1+3+5+7+9 = 25 = 5^2$, and in general, $\sum_{i=1}^{n} (2i-1) = n^2$. So our database will end up with size $O(n^2)$ just to store the sorted lists, and if someone executes a query overlapping all n intervals, we will iterate over that entire database and try to merge n^{2} list elements to produce a final result set with n intervals in it. No good. And note that even if the query part was nice and fast, having to build an $O(n^2)$-sized search structure for n elements would make this completely impractical for large data sets.
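The quadratic blow-up is easy to check numerically. The sketch below builds n nested intervals with a common center and counts the total number of list entries the window scheme would store. Two assumptions of mine: intervals are half-open, and coordinates are scaled by 2 so that everything stays integral (widths 2, 6, 10, ... instead of 1, 3, 5, ...), which doesn’t change the overlap structure.

```python
def nested_intervals(n):
    """n intervals sharing the center 0; interval i spans [-(2i+1), 2i+1)."""
    return [(-(2 * i + 1), 2 * i + 1) for i in range(n)]


def total_list_size(intervals):
    """Total number of (window, interval) entries the window structure would store."""
    cuts = sorted({point for interval in intervals for point in interval})
    return sum(1 for left in cuts
                 for (start, end) in intervals if start <= left < end)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 100):
        print(n, total_list_size(nested_intervals(n)))
```

For n = 5 this counts 1+2+3+4+5+4+3+2+1 entries window by window, which is the same 25 as summing 1+3+5+7+9 interval by interval.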

Now, on to the actual paper: Chazelle points out that in the entire class of retrieval problems, if the input size is n, and the number of elements reported for a particular query is k, the worst-case time for that query will in general be of the form $O(k + f(n))$ where f is some (hopefully slow-growing, say logarithmic) function of n. This is because reporting a given result is not free; it’s generally an $O(1)$ operation.

Consider extreme cases like our “return me the entire universe” range query: in that case, we have k=n, and if we have say $f(n) = \log n$, the resulting time complexity for that query is going to be $O(n + \log n) = O(n)$; if we ask for the entire database, it really doesn’t make a big difference how smart our indexing structure is (or isn’t). The total operation cost is going to be dominated by the “reporting” portion.

This revised cost estimate tells us where we were going wrong with the structure we were building before. It’s important to be able to locate regions of the database overlapped by only a few intervals quickly. But when we have many “active” intervals, the cost of reporting them exceeds the cost of finding them anyway, and caching all those pre-made lists does more harm than good!

Instead, what we want to do is adapt the data structure to the size of the answer (the number of results given). In regions where there are only a few active intervals, we want to locate them quickly; where there’s many active intervals, we don’t want to make a new list of intervals every time a single one of them begins or ends, because when we do, we’ll waste most of our query time merging almost-identical lists.

And that, finally, brings us to filtering search: instead of storing a new list every time the answer changes, we adapt the frequency with which we store lists to the number of active intervals. Once we do this, we need to do a bit more processing per list entry: we need to check whether it actually overlaps our query point (or interval) first – but that’s an $O(1)$ check per list entry (this is the “filtering” part of “filtering search” – rejecting the false positives).

In return, we get a *significantly* smaller database. Using the scheme described in section 3 of the paper (briefly: make the number of intervals per window proportional to the lowest number of active intervals at any point in the window), our data structure for n intervals takes $O(n)$ space and gives $O(k + \log n)$ query time (and with simple construction and search algorithms too, no hidden huge constant factors). And in particular, if we know that the *entire database* has size proportional to n, not only do we know that this will scale just fine to large data sets, it also means we won’t hit the “lots of time spent pointlessly merging lists” problem in our original approach.

That’s why I recommend this paper: the underlying insight is easy to understand – we can afford to be “sloppy” in areas where we’ll spend a lot of time reporting the results anyway – and the resulting algorithm is still quite simple, but the gains are significant. And as Chazelle notes, even if you don’t care about the particular problems he talks about, the overall approach is applicable to pretty much all retrieval problems.

### 13. McIlroy – “A Killer Adversary for Quicksort” (1999; algorithms/testing)

This paper is short (4 pages only!) and sweet. It’s not news that Quicksort will go quadratic for some inputs. Nor is it news how to write a mostly-Quicksort that avoids the quadratic case altogether; that was solved by Musser’s Introsort in 1997 (if you don’t know the trick: limit the recursion depth in Quicksort to some constant factor of a logarithm of the input size, and when you get too deep, switch to Heapsort).
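For reference, the depth-limiting trick looks roughly like this. This is a minimal sketch of the idea rather than Musser’s actual algorithm: the middle-element pivot, the Lomuto partition, the budget of two times the base-2 logarithm, and the use of `heapq` for the fallback are all choices I made to keep it short.

```python
import heapq
from math import log2


def introsort(items):
    """Sort a list in place: quicksort until the recursion gets too deep, then heapsort."""
    if len(items) > 1:
        _sort(items, 0, len(items), 2 * int(log2(len(items))))


def _sort(items, lo, hi, depth_budget):
    while hi - lo > 1:
        if depth_budget == 0:
            # Too many partitioning levels: finish this range with a heap instead.
            chunk = items[lo:hi]
            heapq.heapify(chunk)
            items[lo:hi] = [heapq.heappop(chunk) for _ in range(len(chunk))]
            return
        depth_budget -= 1
        mid = _partition(items, lo, hi)
        _sort(items, lo, mid, depth_budget)   # recurse into the left part...
        lo = mid + 1                          # ...and loop on the right part


def _partition(items, lo, hi):
    """Lomuto partition of items[lo:hi] around the middle element; returns the pivot index."""
    middle = (lo + hi) // 2
    items[middle], items[hi - 1] = items[hi - 1], items[middle]
    pivot = items[hi - 1]
    store = lo
    for i in range(lo, hi - 1):
        if items[i] < pivot:
            items[store], items[i] = items[i], items[store]
            store += 1
    items[store], items[hi - 1] = items[hi - 1], items[store]
    return store


if __name__ == "__main__":
    data = [5] * 5000          # all-equal keys degrade this particular partition scheme
    introsort(data)
    print(data == sorted(data))
```

With all-equal keys every partition here is maximally lopsided, so the budget runs out after a couple dozen levels and the heap takes over for the rest.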

The cool thing about this paper is that, given *any* Quicksort implementation that conforms to the C standard library `qsort` interface (the relevant part being that the sorting function doesn’t compare the data directly and instead asks a user-provided comparison function), it will produce a sequence that makes it go quadratic. Guaranteed, on the first try, under fairly mild assumptions, as long as the function in question actually implements some variant of Quicksort.

As-is, this paper is obviously useful to implementers of Quicksort. If you write one, you generally know that there are inputs that result in quadratic run time, but you might not know which.

But more generally, I think this paper is worth reading and understanding because it shows the value of adversarial testing: the solution isn’t hard; it relies on little more than knowing that a Quicksort proceeds as a sequence of partitioning operations, and an individual partition will compare all remaining elements in the current list against the chosen pivot. Treating the rest of the code as a black box, it turns out that knowing this much is basically enough to defeat it.
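To show the flavor of the construction, here is a sketch of such an adversary in Python. It follows my reading of the paper’s idea (values stay undecided “gas” until a comparison forces one of them to be frozen to a concrete value), but the bookkeeping details are reconstructed rather than copied, so check the paper before relying on them. The victim `quicksort` is a deliberately naive first-element-pivot implementation I wrote for the demo, and all function names are made up.

```python
import random


def quicksort(items, cmp):
    """Naive first-element-pivot quicksort that only sees the data through cmp()."""
    stack = [(0, len(items) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot, store = items[lo], lo
        for j in range(lo + 1, hi + 1):
            if cmp(items[j], pivot) < 0:
                store += 1
                items[store], items[j] = items[j], items[store]
        items[lo], items[store] = items[store], items[lo]
        stack.append((lo, store - 1))
        stack.append((store + 1, hi))


def killer_input(sort, n):
    """Let `sort` order n items whose values are decided lazily by the comparator."""
    gas = n - 1                       # placeholder meaning "not decided yet, but large"
    val = [gas] * n
    state = {"nsolid": 0, "candidate": 0}

    def freeze(i):
        val[i] = state["nsolid"]
        state["nsolid"] += 1

    def cmp(x, y):
        if val[x] == gas and val[y] == gas:
            # Freeze the likely pivot to the smallest value still available.
            freeze(x if x == state["candidate"] else y)
        if val[x] == gas:
            state["candidate"] = x
        elif val[y] == gas:
            state["candidate"] = y
        return val[x] - val[y]

    sort(list(range(n)), cmp)         # the sort shuffles indices; val records the answers
    return val


def count_comparisons(sort, data):
    """Sort a copy of data with an ordinary comparator and count the calls."""
    count = 0

    def cmp(a, b):
        nonlocal count
        count += 1
        return (a > b) - (a < b)

    sort(list(data), cmp)
    return count


if __name__ == "__main__":
    n = 200
    shuffled = list(range(n))
    random.Random(1).shuffle(shuffled)
    print("shuffled:", count_comparisons(quicksort, shuffled))
    print("killer:  ", count_comparisons(quicksort, killer_input(quicksort, n)))
```

The adversary never looks inside `quicksort`; it only answers comparison queries, which is the black-box property that makes the approach so general.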

The details are different every time, but I found the paper quite inspiring when I first read it, and I’ve gotten a lot of mileage since out of building test cases by trying to make an algorithm defeat itself. Even (particularly?) when you don’t succeed, it tends to deepen your understanding.

### 14. Knuth – “Structured Programming with go to Statements” (1974; programming languages)

I’ll write “go to statements” as gotos in the following, since that’s the spelling we mostly have these days.

This paper is the primary source of the (in)famous “premature optimization is the root of all evil” quote. So to get that out of the way first, it’s on page 8 of the PDF, printed page number 268, right column, second paragraph. I recommend reading at least the preceding 2 paragraphs, as well as the paragraph that follows, before quoting it at others. It’s definitely different in context.

Zooming further out, this is a long paper covering a wide range of topics, a lot of which might not interest you, so I’ll go over the structure. It starts by recounting a lot of 1970s arguing back and forth about structured programming that’s mainly of historical interest now; structured control flow constructs are a done deal these days.

After that, he goes over several algorithms and discusses their use of goto statements, which might or might not interest you.

One section that is definitely worthwhile if you haven’t seen it before is the section on “Systematic Elimination” starting on page 13 in the PDF (printed page number 273) which explains ways of converting any goto-ridden program into one that uses purely structured control flow and gives pointers to the underlying theory – note that some of these ways are quite anticlimactic!

The sections after are, again, interesting archaeology from a time before `break` and `continue` statements existed (the “structured” gotos we now generally consider unproblematic).

Then we get into “2. Introduction of goto statements”, which again is nowadays mostly standard compiler fare (covering certain control flow transformations), but worth checking out if you haven’t seen it before. Note that most modern compilers eventually transform a program into a set of basic blocks with an associated control-flow graph, which is essentially lists of statements connected by gotos; it turns out that such a (more expressive) notation makes control flow transformations simpler, and passes that need more structure generally recover it from the control flow graph. Thus, contrary to popular belief, such compilers don’t care all that much whether you write say loops using built-in loop constructs or via ifs and gotos. It does, however, make a difference semantically for objects with block-scoped lifetimes in languages that have them.

The final section I’ll point out is “Program Manipulation Systems”, starting on page 22 in the PDF (printed page number 282). Modern optimizers implement many more transforms than 1970s-era compilers did, yet I feel like the problem Knuth observed in the 1970s still exists today (admittedly, speaking as a fairly low-level programmer): many of the transforms I wish to apply are quite mechanical, yet inexpressible in the source language. Various LISP dialects probably get closest; the metaprogramming facilities of more mainstream languages generally leave a lot to be desired. I still feel like there’s a lot of potential in that direction.

### 15. Bryant – “Graph-Based Algorithms for Boolean Function Manipulation” (1986; data structures)

For a long time, this used to be one of the most widely cited papers in all of CS. Not sure if that’s still the case.

It introduces a data structure properly called a Reduced Ordered Binary Decision Diagram (ROBDD), often just called BDDs, although there exist other variants that are not ordered or reduced, so there’s some potential for confusion. Watch out! That said, I’ll just be writing BDD for the remainder; note that I’m always talking about the ROBDD variant here.

So what is a BDD? It’s a data structure that encodes a Boolean truth table as a directed graph. It’s “reduced” because identical subgraphs are eliminated during construction, and “ordered” because the resulting BDD is specific to a particular ordering of input variables. The paper gives some examples.

Why does anyone care? Because truth tables have exponential size, whereas BDDs for many functions of interest have very tractable size. For example, an important result is that BDDs for the outputs of binary adders of arbitrary widths (with the right variable ordering) have a linear number of nodes in the width of the adder. And a BDD shares many of the properties of a truth table: it’s easy to evaluate a function given as a BDD, it’s easy to build them incrementally out of a description of some logic function, and they’re canonical – that is, with a given variable ordering, there is exactly one ROBDD that represents a given function. Which in turn means that we can check whether two binary functions are identical by checking whether their BDDs agree.

This last property is why BDDs were such a hot topic when they were introduced (leading to the aforementioned high citation count): they are very handy for formal verification of combinational logic and other problems arising in hardware design.

For example, say you have a clever 64-bit adder design you wish to implement, but you need to prove that it gives the correct result for all possible pairs of 64-bit input numbers. Checking the proposed design by exhaustively testing all 2^{128} possible inputs is out of the question; even if we had a machine that managed to verify over a trillion combinations per second, say 2^{40} of them, and we had over a million such machines, say 2^{20} of them, we’d still have to wait 2^{68} seconds to get the result – that’s about 9.36 trillion years, 680 times the age of the universe.
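The back-of-the-envelope numbers are quick to reproduce. The two constants below are my assumptions: a 365-day year and a round 13.8 billion years for the age of the universe.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60      # assumption: plain 365-day years
AGE_OF_UNIVERSE_YEARS = 13.8e9             # assumption: commonly quoted round figure

input_pairs = 2 ** 128                     # every pair of 64-bit operands
checks_per_second = 2 ** 40 * 2 ** 20      # 2^40 checks/s per machine, 2^20 machines

seconds = input_pairs // checks_per_second
years = seconds / SECONDS_PER_YEAR
universe_ages = years / AGE_OF_UNIVERSE_YEARS

if __name__ == "__main__":
    print(f"{seconds} seconds")
    print(f"{years:.3e} years")
    print(f"{universe_ages:.0f} times the age of the universe")
```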

This is not going to work. But luckily, it doesn’t need to: we can build a BDD for the new adder design, another BDD for a simple known-good adder design (say a ripple-carry adder), and check whether they agree. That validation, written in interpreted Python and run on a single workstation, takes just a moment.
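To give an idea of how little code that check needs, here is a minimal ROBDD sketch in pure Python. Everything in it is my own toy construction, not the paper’s code or any library’s API: nodes are integers, 0 and 1 are the terminals, `mk` does the reduction and sharing, and `apply` combines two BDDs by recursing on the topmost variable. The two adders compute carries with different formulas (ripple-carry versus a majority function), and I’m assuming the interleaved variable order a0, b0, a1, b1, ... as the “right” ordering.

```python
import operator


class BDD:
    """Tiny reduced ordered BDD manager. Nodes are ints; 0 and 1 are the terminals."""

    def __init__(self):
        self.nodes = [None, None]   # node id -> (var, low, high); ids 0/1 are terminals
        self.unique = {}            # (var, low, high) -> node id, so equal subgraphs are shared
        self.cache = {}             # memo table for apply()

    def mk(self, var, low, high):
        if low == high:             # both branches agree: the test on var is redundant
            return low
        key = (var, low, high)
        if key not in self.unique:
            self.unique[key] = len(self.nodes)
            self.nodes.append(key)
        return self.unique[key]

    def var(self, index):
        return self.mk(index, 0, 1)

    def _cofactors(self, u, var):
        if u > 1 and self.nodes[u][0] == var:
            return self.nodes[u][1], self.nodes[u][2]
        return u, u                 # u does not test var, so both cofactors are u itself

    def apply(self, op, u, v):
        """Combine two BDDs with a binary Boolean operator (a function on 0/1 ints)."""
        if u < 2 and v < 2:
            return op(u, v)
        key = (op, u, v)
        if key not in self.cache:
            top = min(self.nodes[w][0] for w in (u, v) if w > 1)
            u0, u1 = self._cofactors(u, top)
            v0, v1 = self._cofactors(v, top)
            self.cache[key] = self.mk(top, self.apply(op, u0, v0), self.apply(op, u1, v1))
        return self.cache[key]

    def evaluate(self, u, assignment):
        """Follow one path from node u to a terminal; assignment[var] is 0 or 1."""
        while u > 1:
            var, low, high = self.nodes[u]
            u = high if assignment[var] else low
        return u


AND, OR, XOR = operator.and_, operator.or_, operator.xor


def ripple_carry_adder(bdd, width):
    """Reference design. Variable order interleaves the operands: a0, b0, a1, b1, ..."""
    carry, outputs = 0, []
    for i in range(width):
        a, b = bdd.var(2 * i), bdd.var(2 * i + 1)
        a_xor_b = bdd.apply(XOR, a, b)
        outputs.append(bdd.apply(XOR, a_xor_b, carry))
        carry = bdd.apply(OR, bdd.apply(AND, a, b), bdd.apply(AND, a_xor_b, carry))
    return outputs + [carry]


def majority_carry_adder(bdd, width, bug_at=None):
    """Same function via different formulas; bug_at optionally injects a wrong carry."""
    carry, outputs = 0, []
    for i in range(width):
        a, b = bdd.var(2 * i), bdd.var(2 * i + 1)
        outputs.append(bdd.apply(XOR, a, bdd.apply(XOR, b, carry)))
        ab, ac, bc = (bdd.apply(AND, p, q) for p, q in ((a, b), (a, carry), (b, carry)))
        carry = bdd.apply(OR, a, b) if i == bug_at else bdd.apply(OR, ab, bdd.apply(OR, ac, bc))
    return outputs + [carry]


if __name__ == "__main__":
    bdd = BDD()
    reference = ripple_carry_adder(bdd, 64)
    # Canonicity turns equivalence checking into comparing node ids, output by output.
    print("designs agree:", majority_carry_adder(bdd, 64) == reference)
    print("buggy design agrees:", majority_carry_adder(bdd, 64, bug_at=10) == reference)
```

Because both designs are built inside the same manager, equal functions end up as the very same node, so the comparison is a plain list equality.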

BDDs aren’t a panacea; for example, BDDs for multiplier circuits have exponential size, so formally verifying those isn’t as easy. As far as I know, modern tools combine different (non-canonical) representations such as and-inverter graphs with SMT solvers to tackle more difficult Boolean equivalence problems.

Nonetheless, BDDs are a simple, elegant data structure that immediately solved a serious practical problem, and they’re worth knowing about.

Continued from part 1.

Once I was about a thousand words into describing background for GEMM, it became pretty clear that it made more sense to group the numerical math papers into one post, so here goes the (out-of-order) numerical linear algebra special issue.

### 11. Goto, van de Geijn – “Anatomy of high-performance matrix multiplication” (2008; numerical Linear Algebra / HPC)

You might wonder: why do we care about matrix multiplication in particular so much? Who is it who’s doing these giant matrix multiplies? If you’re a customer of a linear algebra library, it’s not unlikely that you’re never calling GEMM (GEneral Matrix Multiply, the customary name for matrix multiply kernels, thanks to FORTRAN 77’s 6-character limit for function names) at all. So what gives?

Well, if you’re calling into a linear algebra library, odds are you want to solve a linear system of equations (which usually turns into a pivoted LU decomposition plus a solve step), a linear least-squares problem (depending on the problem and your accuracy requirements, this might turn either into a Cholesky decomposition or a QR decomposition, again followed by a solve step), or you want something fancier like the SVD (yet another matrix decomposition, and you probably still eventually want to solve a linear system – but you’re using the SVD because it’s noisy or ill-conditioned and you want to munge around with the singular values a bit).

What’s with all the focus on “decompositions”? Are numerical analysts secretly the world’s most pocket-protected goth band? No: a matrix decomposition (or factorization) expresses a more general matrix as the product of several special matrices that have some desirable structure. For example, the LU decomposition turns our general matrix $A$ into a product $A = LU$ where $L$ is a unit lower triangular matrix and $U$ is upper triangular (note: I’ll be ignoring pivoting in this post for simplicity). The LU decomposition is the industrial-strength counterpart of the Gaussian Elimination process you might have learned in school, but with some side benefits: you can decompose the matrix once and then reuse the factorization multiple times if you’re solving the same system many times with different right-hand sides (this is common in applications), and it also happens to be really nice to have the whole process in a form that can be manipulated algebraically, which we’ll be using in a moment.

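But why does this decomposition help? Well, suppose we have a toy system with 3 equations and 3 unknowns, which can be written as a matrix equation $Ax = b$ where $A$ is a 3×3 matrix of coefficients and $x$ and $b$ are 3-element column vectors. If we have a LU decomposition for A, this turns into

$$Ax = LUx = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$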

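How does this help? Well, we can’t solve the full thing yet, but we now have two fairly simple matrices. For now, let’s focus on the left matrix and treat $y = Ux$ as an unknown:

$$Ly = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$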

Well, this one’s easy: the first row just states that $y_1 = b_1$. The second row states that $l_{21} y_1 + y_2 = b_2$, and we know everything but $y_2$, so we can rearrange this to $y_2 = b_2 - l_{21} y_1$. With this, the final row poses no problems either, yielding $y_3 = b_3 - l_{31} y_1 - l_{32} y_2$. So $y_1$ just falls out, given $y_1$ we can compute $y_2$, and given both we can compute $y_3$. This is called “forward substitution”. Note that we’re just computing $y = L^{-1} b$ here. However, we’re never forming the inverse of L explicitly! This is important. *In numerical LA, when you see an inverse, that means you’re supposed to use the corresponding “solve” routine*. Actually computing the inverse matrix is generally both inefficient and inaccurate and to be avoided whenever possible.

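Anyway, now that we have $y$, we can write out the $y = Ux$ we defined it as, and use that to solve for the $x$ we actually wanted:

$$Ux = \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}$$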

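This time, we’re going backwards: $x_3 = y_3 / u_{33}$, $x_2 = (y_2 - u_{23} x_3) / u_{22}$, and $x_1 = (y_1 - u_{12} x_2 - u_{13} x_3) / u_{11}$. You will not be surprised to learn that this is called “backwards substitution”. Again, we’re just calculating $x = U^{-1} y$, which does not actually use a matrix inversion when U is triangular.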
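Both substitution passes fit in a few lines each. Here is a minimal sketch using plain Python lists of rows (no pivoting, no error handling); the function names are made up for this illustration and are not BLAS or LAPACK routines.

```python
def forward_substitute(L, b):
    """Solve L y = b for unit lower triangular L (the diagonal is assumed to be 1)."""
    y = []
    for i in range(len(b)):
        # Row i: y_i = b_i - sum over j < i of l_ij * y_j
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y


def back_substitute(U, y):
    """Solve U x = y for upper triangular U with nonzero diagonal entries."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - tail) / U[i][i]
    return x


def lu_solve(L, U, b):
    """Solve (L U) x = b without ever forming an inverse."""
    return back_substitute(U, forward_substitute(L, b))


if __name__ == "__main__":
    L = [[1, 0, 0], [2, 1, 0], [-1, 3, 1]]
    U = [[2, 1, 1], [0, 1, -2], [0, 0, 4]]
    b = [7, 10, -7]           # equals (L U) applied to the vector (1, 2, 3)
    print(lu_solve(L, U, b))
```

Note that `forward_substitute` never reads the diagonal of L at all, which is one practical payoff of the “unit” in unit lower triangular.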

And that’s how you solve a linear system given a LU decomposition. In BLAS-ese, solving a triangular system using forwards or backwards substitution for one right-hand side is called a TRSV (TRiangular Solve for a single Vector) – that single routine handles both types. It’s what’s called a level-2 BLAS operation. Level-1 operations are between two vectors, level-2 operations work on a matrix and a vector, and level-3 operations work on two matrices. More about “levels” in a bit.

That’s all dandy, but what does any of this have to do with GEMM? Hang on, we’re getting close. Let’s first generalize slightly: what if we want to solve multiple systems with the same A all at once? Say we want to solve two systems

$$Ax^1 = b^1, \qquad Ax^2 = b^2$$

at once (using superscripts to denote the separate vectors, since I’m already using subscripts to denote components of a vector or matrix). It turns out that you can just write this as a single matrix equation

$$AX = A \begin{pmatrix} x^1 & x^2 \end{pmatrix} = \begin{pmatrix} b^1 & b^2 \end{pmatrix} = B$$

where we just group the column vectors for x into one matrix X, and the column vectors for b into another matrix B. Again we can solve this for a LU-decomposed A by forward and back substitution (remember, still not actually forming inverses!)

$$Y = L^{-1} B, \qquad X = U^{-1} Y$$

Note that we already know one direct way to solve this type of equation: loop over the columns of X (and B) and solve them one by one, as above. This kind of operation is called a TRSM: TRiangular Solve for Multiple right-hand sides, or TRiangular Solve for Matrix, our first level-3 BLAS operation.

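Just to get used to the idea of dealing with multiple right-hand sides at once, let’s write down the full matrix equation form for a 6 equations, 6 unknowns unit lower triangular system with two separate right-hand sides explicitly:

$$\begin{pmatrix} 1 & & & & & \\ l_{21} & 1 & & & & \\ l_{31} & l_{32} & 1 & & & \\ l_{41} & l_{42} & l_{43} & 1 & & \\ l_{51} & l_{52} & l_{53} & l_{54} & 1 & \\ l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1 \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \\ x_{61} & x_{62} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \\ b_{41} & b_{42} \\ b_{51} & b_{52} \\ b_{61} & b_{62} \end{pmatrix}$$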

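As before, the first row tells us that $\begin{pmatrix} x_{11} & x_{12} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \end{pmatrix}$; the second row multiplied out gives $l_{21} \begin{pmatrix} x_{11} & x_{12} \end{pmatrix} + \begin{pmatrix} x_{21} & x_{22} \end{pmatrix} = \begin{pmatrix} b_{21} & b_{22} \end{pmatrix}$, and so forth, which we solve the exact same way as before, only now we’re always multiplying (and summing) short row vectors instead of single scalars.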

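But 6×6 is still really small as far as real-world systems of equations go and this is already getting really unwieldy. It’s time to chop the matrix into pieces! (You can always do that, and then work on blocks instead of scalars. This is really important to know.) Let’s just draw some lines and then cut up the matrices involved into parts:

$$\left(\begin{array}{cc|cccc} 1 & & & & & \\ l_{21} & 1 & & & & \\ \hline l_{31} & l_{32} & 1 & & & \\ l_{41} & l_{42} & l_{43} & 1 & & \\ l_{51} & l_{52} & l_{53} & l_{54} & 1 & \\ l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1 \end{array}\right) \left(\begin{array}{cc} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \hline x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \\ x_{61} & x_{62} \end{array}\right) = \left(\begin{array}{cc} b_{11} & b_{12} \\ b_{21} & b_{22} \\ \hline b_{31} & b_{32} \\ b_{41} & b_{42} \\ b_{51} & b_{52} \\ b_{61} & b_{62} \end{array}\right)$$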

turns into the matrix equation

where and are unit lower triangular, and is just a general matrix. If we just multiply the matrix product out by blocks (again, the blocks behave like they’re scalars in a larger matrix, but you need to make sure the matrix product sizes match and be careful about order of multiplication because matrix multiplication doesn’t commute) we get two matrix equations:

The first of these is just a smaller TRSM with a 2×2 system, and in the second we can bring the term to the right-hand side, yielding

On the right-hand side, we have a matrix multiply of values we already know (we computed with the smaller TRSM, and everything else is given). Compute the result of that, and we have another TRSM, this time with a 4×4 system.

The matrix multiply here is one instance of a GEneral Matrix Multiply (GEMM). The corresponding BLAS function computes , where the left arrow denotes assignment, A, B, and C are matrices, and α and β are scalar values. In this particular case, we would have , , , and .

So we cut the matrix into two parts, did a bit of algebra, and saw that our TRSM with a 6×6 L can be turned into a 2×2 TRSM, a GEMM of a 4×2 by a 2×2 matrix, and finally a 4×4 TRSM. Note the function of the matrix multiply: once we’ve computed two unknowns, we need to subtract out their contributions from every single equation that follows. That’s what the GEMM does. It’s the first matrix multiply we’ve seen, but does it matter?

Well, the next thing to realize is that we can do the splitting trick again for the 4×4 TRSM, turning *it* into two even smaller TRSMs, plus another GEMM. But now that we’ve established the idea of using blocks, let’s skip to a somewhat more serious problem size, so it becomes clear why this is interesting.

Let’s say our A is 1000×1000 (so 1000 equations in 1000 unknowns); its LU factors are the same size. This time, let’s say we’re doing 20 right-hand sides at once, and working in blocks of 30×30. We have the same equation as before:

$$\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$$

but this time $L_{11}$ is 30×30 unit lower triangular, $L_{21}$ is 970×30, $L_{22}$ is 970×970 unit lower triangular, $X_1$ and $B_1$ are 30×20, and $X_2$ and $B_2$ are 970×20. Again we do the same 3 steps:

- $X_1 \leftarrow L_{11}^{-1} B_1$ (TRSM, 30×30 matrix times 30×20 RHS)
- $B_2 \leftarrow B_2 - L_{21} X_1$ (GEMM, 970×30 times 30×20)
- $X_2 \leftarrow L_{22}^{-1} B_2$ (TRSM, 970×970 times 970×20)

Now the computational cost of both a m×n-by-n×p TRSM and a m×n-by-n×p GEMM (the middle dimensions always have to match) is roughly 2×m×n×p floating-point operations (flops, not to be confused with all-uppercase FLOPS, which conventionally denote flops/s, because nomenclature is stupid sometimes). Which means the first step above (the medium TRSM) has on the order of 36,000 flops (2×30×30×20), while the second step (the GEMM) takes 1,164,000 flops (2×970×30×20). In short, of these two steps, step 1 is basically a rounding error in terms of execution time.

And note that we’re splitting a large × large TRSM into a medium × medium TRSM, a large × small GEMM, and a final large × large (but smaller than the original) TRSM. And we can keep doing the same splitting process to that remaining large TRSM, until it becomes small as well. In short, this process allows us to turn a large TRSM into a sequence of medium-size TRSMs (always the same size), alternating with large-size GEMMs (which keep getting smaller as we proceed down). And what happens if you look at the matrix as a whole is that we end up doing a small amount of actual TRSM work near the diagonal, while the rest of the matrix gets carpet-bombed with GEMMs.
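Here’s a rough sketch of that loop structure: a small TRSM on the diagonal block, then a GEMM-shaped update of everything below it, repeated down the diagonal. It’s plain Python with invented helper names (`trsm_small`, `gemm_update`, `blocked_trsm`), handles unit lower triangular matrices only, and makes no attempt at being fast; the point is just to show where the work goes.

```python
def trsm_small(L, X, start, end):
    """Forward substitution on rows start..end-1 of X (in place), using the
    unit lower triangular diagonal block of L covering those rows/columns."""
    p = len(X[0])
    for i in range(start, end):
        for j in range(start, i):
            for c in range(p):
                X[i][c] -= L[i][j] * X[j][c]


def gemm_update(L, X, start, end):
    """Rows below the block: X[end:] -= L[end:, start:end] * X[start:end]."""
    n, p = len(X), len(X[0])
    for i in range(end, n):
        for k in range(start, end):
            lik = L[i][k]
            for c in range(p):
                X[i][c] -= lik * X[k][c]


def blocked_trsm(L, B, block=2):
    """Solve L X = B for unit lower triangular L. Returns X; B is not modified."""
    X = [row[:] for row in B]
    n = len(X)
    for start in range(0, n, block):
        end = min(start + block, n)
        trsm_small(L, X, start, end)   # small TRSM near the diagonal
        gemm_update(L, X, start, end)  # remove the new unknowns from all later rows
    return X


# A 6x6 unit lower triangular system with two right-hand sides.
L = [[1.0 if i == j else (float(i - j) if i > j else 0.0) for j in range(6)]
     for i in range(6)]
X_true = [[float(i + 1), float(2 * i - 3)] for i in range(6)]
B = [[sum(L[i][k] * X_true[k][c] for k in range(6)) for c in range(2)]
     for i in range(6)]
X = blocked_trsm(L, B, block=2)
print(X == X_true)
```

With a fixed block size, `trsm_small` always touches a block-sized triangle, while `gemm_update` touches everything below it; for a large matrix, nearly all the multiply-adds end up in the latter.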

In short, even though what we wanted to do was solve a pre-factored linear system for a bunch of different right-hand sides, what the computer actually ended up spending its time computing was mostly matrix multiplies. The GEMM calls are coming from *inside the solver*! (Cue scary music.)

Alright. At this point you might say, “fair enough, that may indeed be what happens if you use this TRSM thing that for all we know you just made up, but I for one am *not* ever asking the computer to solve the same equation with 50 different right-hand sides in a batch, so how does this apply to me?” Okay then, let’s have a look at how LU factorizations (which so far I’ve assumed we just have lying around) are actually computed, shall we? (And again, note I’m ignoring pivoting here, for simplicity.)

What we want to do is factor our matrix A into a unit lower triangular and an upper triangular factor:

$$A = L U$$

So, how do we do that? Just keep staring at that equation for a minute longer, see if it flinches first! It doesn’t? Bugger. Okay, plan B: apply our new favorite trick, splitting a matrix into blocks, to play for time until we get a brilliant idea:

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}$$

Our top-left block $A_{11}$ needs to be square (same number of rows as columns), else this isn’t right, but it can be any size. This makes $A_{22}$ square as well, and the other blocks are rectangular. The zeros are there because we want L and U to be lower and upper triangular respectively, and their entire top-right respectively bottom-left blocks *better* be all zero. Furthermore, $L_{11}$ and $L_{22}$ are also unit lower triangular (like the bigger L we carved them out of), and likewise $U_{11}$ and $U_{22}$ are upper triangular. About the remaining $L_{21}$ and $U_{12}$, we can’t say much.

Still drawing blanks on the ideas front. But let’s just keep going for now: if we multiply out that matrix equation, we get

$$\begin{aligned} A_{11} &= L_{11} U_{11} \\ A_{12} &= L_{11} U_{12} \\ A_{21} &= L_{21} U_{11} \\ A_{22} &= L_{21} U_{12} + L_{22} U_{22} \end{aligned}$$

Wait a second. That first line is a smaller LU decomposition, which we’re trying to figure out how to compute. But suppose we knew how for now, and we had something that gave us $L_{11}$ and $U_{11}$. Then that second line is really just $U_{12} = L_{11}^{-1} A_{12}$. That’s a TRSM, we just went over that. And the third line is $L_{21} = A_{21} U_{11}^{-1}$, which is also a TRSM (of a shape we haven’t seen before, but it works the same way). Once we have $L_{11}$ and $U_{11}$, our hand is forced with regards to these two matrices; for the factorization to multiply to the correct result A, we *need* them to be the things I just wrote down. And if we know these two matrices, we can attack that last equation by moving their product to the left-hand side:

$$A_{22} - L_{21} U_{12} = L_{22} U_{22}$$

Hey look, we do a big GEMM and then resume with computing a LU decomposition of the remainder – we’ve seen that kind of structure before! Great. This is how to do a block LU decomposition: compute a LU decomposition of the top-left block, two TRSMs, one GEMM to update the bottom-right part, then keep decomposing that. And this time the TRSMs are on medium × medium × large problems, while the GEMM is on large × medium × large, so again the bulk of the computation is going to be spent in the GEMM part.

But we still don’t know how to compute the LU decomposition of that top-left block. No worries: if in doubt, go for a cheap shot. We don’t know how to do this for an arbitrary block. But what if we make our partition really silly? As in “$A_{11}$ is a 1×1 matrix” levels of silly? (That is, we’re just splitting off one row and one column at the top left.)

Then $A_{11} = L_{11} U_{11}$ is a no-brainer; all three of these matrices are 1×1, and we require $L_{11}$ to be “unit” (ones on the diagonal), which for a 1×1 matrix just means that $L_{11} = 1$. Therefore $U_{11} = A_{11}$. Ta-daa! We “solved” a 1×1 LU decomposition. But that’s all we really need. Because once we have that one value determined, we can crank through our other 3 formulas, which give us $U_{12}$ (the rest of the top row of U), $L_{21}$ (the rest of the left column of L), and the update to the rest of the matrix that eliminates the one variable we just computed. To compute a LU decomposition of a block, we simply keep peeling off 1×1 sub-blocks until we run out of matrix to decompose.

This description covers both “regular” and block LU decompositions (in fact we just do blocks and then get the regular decomposition as a special case when we do 1×1 blocks, at which point the problem becomes trivial), and not a single index or elementary row operation was harmed in the preceding text.
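Written as code, the “keep peeling off 1×1 blocks” recipe might look like the following sketch. It assumes no pivoting is needed (so every pivot it meets is non-zero), and `lu_no_pivot` is a name invented for this example.

```python
def lu_no_pivot(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.
    No pivoting: raises ZeroDivisionError if a zero pivot shows up."""
    n = len(A)
    S = [row[:] for row in A]  # the part of the matrix still to be decomposed
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # 1x1 top-left block: its L factor is 1, so its U factor is the entry itself.
        U[k][k] = S[k][k]
        # Rest of the top row of U (a trivial TRSM, since we "divide" by 1).
        for j in range(k + 1, n):
            U[k][j] = S[k][j]
        # Rest of the left column of L (the other TRSM: divide by the pivot).
        for i in range(k + 1, n):
            L[i][k] = S[i][k] / U[k][k]
        # Update the remainder: subtract (left column of L) times (top row of U).
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                S[i][j] -= L[i][k] * U[k][j]
    return L, U


A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu_no_pivot(A)
print(L)
print(U)
```

The innermost double loop is the GEMM of the block version, shrunk down to a column-times-row product; that’s where essentially all of the time goes.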

Note that this time, we turned LU decomposition (i.e. Gaussian elimination) into mostly-GEMMs-and-some-block-TRSMs, and we already saw earlier that block TRSMs turn into mostly-GEMMs-and-some-small-TRSMs. Therefore, the entire process of factoring a linear system and then solving it turns into… mostly GEMMs.

And *that’s* why everyone cares about GEMMs so much. (And also, you may now see why even if *you* don’t use TRSMs, math libraries still include them, because the solvers your app code calls want to call them internally!)

This pattern is not just specific to Gaussian Elimination-type algorithms, either. Block Householder for QR decompositions? Heavy on GEMMs. Hessenberg reduction for Eigenvalue problems? Basically Householder, which is mostly GEMMs. Computation of the Singular Value Decomposition (either for solvers or to get PCAs)? Generally starts with Golub-Kahan Bidiagonalization or one of its relatives, which is a somewhat fancier version of the QR decomposition algorithm, and yup, lots of GEMMs again. Then the actual singular value computation is iterative on that bidiagonal matrix, but that part tends to take less time than the non-iterative parts surrounding it, because the iteration is only iterating on a matrix reduced to 2 diagonals, whereas everything else works with the whole matrix.

In fact, what we’ve seen so far is a pattern of various matrix operations turning into smaller versions of themselves, plus maybe some other matrix operations, plus a GEMM. Guess what happens with a GEMM itself? If your guess was “GEMMs all the way down”, you’re right. It’s like a weed. (And turning GEMMs into smaller GEMMs is, in fact, quite important – but that’s in the paper, so I won’t talk about it here.)
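To give a flavor of “GEMMs all the way down” without spoiling the paper: a blocked matrix multiply is just a triple loop over blocks with a small GEMM in the middle. The sketch below uses an arbitrary block size and a made-up function name (`gemm_blocked`); real implementations pick block sizes per cache level and are far more careful than this.

```python
def gemm_blocked(A, B, C, alpha=1.0, beta=1.0, block=2):
    """C <- alpha * A B + beta * C, computed one block at a time (in place)."""
    m, n, p = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(p):
            C[i][j] *= beta
    for i0 in range(0, m, block):
        for j0 in range(0, p, block):
            for k0 in range(0, n, block):
                # One small GEMM on a block of C, a block of A and a block of B.
                for i in range(i0, min(i0 + block, m)):
                    for k in range(k0, min(k0 + block, n)):
                        a = alpha * A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += a * B[k][j]
    return C


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # 2x3
B = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]               # 3x2
C = [[1.0, 1.0],
     [1.0, 1.0]]               # 2x2
gemm_blocked(A, B, C, alpha=-1.0, beta=1.0, block=2)
print(C)
```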

This concludes our brief foray into dense numerical LA and why HPC people are so obsessed about GEMM performance. Note that dense problems are basically the easy case, at least from a high-level point of view; many of the things that are really interesting are huge (millions of equations and variables) but sparse and with exploitable structure, and these take a lot more care from the user, as well as more sophisticated algorithms. (That will nevertheless usually end up calling into a dense LA library for their bulk computations.)

Now that I’ve hopefully satisfyingly answered *why* GEMM, let’s talk a bit about the actual paper. The presentation I gave you of splitting up a matrix into blocks wasn’t just for notational convenience; that’s how these algorithms tend to work internally. The reason is that large matrices are, well, large. There’s an inherent 2D structure to these algorithms and completely looping over one axis of a giant matrix tends to thrash the cache, which in turn means there are suddenly lots of main memory accesses, and at that point you lose, because current computers can do way more computation per unit of time than they can do memory-to-cache transfers. If you truly want to do high-performance computation, then you *have* to worry about memory access patterns. (In fact, that’s most of what you do.)

That is something I pushed on the stack earlier in this post: different BLAS levels. This is an old chestnut, but it’s worth repeating: level-1 BLAS operations are vector-vector; something like say a big dot product (DOT in BLAS). Doing a dot product between two N-element vectors is on the order of 2N flops, and 2N memory operations (memops) to load the elements of the two vectors. 1:1 flops to memops – no good! Level-2 BLAS operations are matrix-vector; take say M×N matrix by N-element vector multiplication (GEMV, GEneral Matrix times Vector). Does 2MN flops, MN+N+M memops (load all matrix elements once, load each input vector element once, store each output vector element once); closer to 2:1 flops to memops, which is an improvement, but still bad if you have the SIMD units to compute 32 single-precision flops per cycle and core and the main memory bandwidth to load half a float per cycle and core (the size of this gap is similar for CPUs and GPUs, for what it’s worth). It just means that your performance drops off a cliff once you’re working on data larger than your cache size.

Level-3 BLAS operations like GEMM, however, have 2MNP flops to MN+NP+MP necessary memops (load each matrix element once, store the result). That means the flops to memops ratio can in principle get arbitrarily large if only the matrices are big enough. Which in turn means that high-performance GEMMs are all about making sure that they do, in fact, reuse matrix elements extensively once they’re in the cache, and making sure that all the different cache levels are happy.
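To put rough numbers on those ratios, here’s a small calculation. It assumes square N×N matrices and N-element vectors to keep the formulas short, and `flops_per_memop` is just a helper name for this example.

```python
def flops_per_memop(n):
    """Flops-to-memops ratios for N-element vectors and N x N matrices."""
    dot = (2 * n) / (2 * n)              # level 1: 2N flops, 2N loads
    gemv = (2 * n * n) / (n * (n + 2))   # level 2: 2N^2 flops, N^2 + 2N memops
    gemm = (2 * n ** 3) / (3 * n * n)    # level 3: 2N^3 flops, 3N^2 memops
    return dot, gemv, gemm


for n in (10, 100, 1000):
    dot, gemv, gemm = flops_per_memop(n)
    print(f"N={n:5d}  DOT {dot:.2f}  GEMV {gemv:.2f}  GEMM {gemm:.1f}")
```

DOT stays at 1, GEMV creeps up towards 2 and stops there, and GEMM grows linearly with N; that growth is the whole reason blocking for the cache can pay off.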

The way this works is interesting and worth studying, and *that’s* why that paper was on my list. Whew.

### 31. Bientinesi, van de Geijn – “Formal Correctness and Stability of Dense Linear Algebra Algorithms” (2005; numerical LA)

According to my headings, all of the above was about the matrix multiplication paper. Well, what can I say? I was lying.

That whole business with deriving our LU decomposition by partitioning our matrix into blocks, writing down equations for the individual block elements, and then winging our way towards a solution? That’s, essentially, this paper. Except the real paper is a lot more rigorous and consequently involves a lot less “winging it”.

Partitioning matrices into blocks for fun and profit is a numerical linear algebra mainstay. I saw it done for a few algorithms at university. The work of this group at UT Austin (especially the stuff following from this paper) is what made me realize just how general and systematic it can be when it’s done right.

For a large class of dense LA algorithms, this procedure is well-specified enough to derive a working algorithm automatically from a problem description, complete with correctness and numerical stability analysis, within seconds; no inspiration required. It’s an algorithm-derivation algorithm. For a very limited (and fairly rigidly structured) problem domain, but still!

This is really cool and I like it a lot.

This started out on Twitter and I expected to have to write up maybe 50 or so, but then it made the rounds. I don’t think I have 1200+ papers that I’d recommend reading, and even if I did, I don’t think a reading list of that length is actually useful to anyone.

The original list had tweet-length comments. I have no such space limitation on this blog, so I’ll go ahead and write a few paragraphs on each paper. That does mean that we’re now back to lots of text, which means I’ll split it into parts. Such is life. In this post, I’ll do the first ten papers from my list.

One thing I noticed while writing the list is that the papers I like most are those that combine solid theory with applications to concrete problems. You won’t find a lot of pure theory or pure engineering papers here. This is also emphatically not a list of “papers everyone should read” or any such nonsense. These are papers I happen to like, and if you share some of my interests, I think you’ll find a lot to like in them too. Beyond that, no guarantees.

### 1. Lamport – “State the Problem Before Describing the Solution” (1978; general)

It’s a one-page memo, and yeah, the gist of it’s already in the title. Read it anyway. This applies to papers and code both.

This is not just a matter of style. If you don’t have a description of the problem independent of your proposed solution, usually the best you can say about your solution (in terms of proofs or invariants) is “it does what I think it does”. In code, the equivalent is a long, complicated process that everybody is afraid of touching and nobody is really sure what it does – just that things tend to break when you touch it.

### 2. Herlihy – “Wait-free synchronization” (1991; concurrency)

This is probably the most important single paper ever written about shared-memory multiprocessing. When it was written, such systems were nowhere near as common as they are now and each architecture had its own set of atomic synchronization primitives – chosen primarily by whatever the kernel programmers at the respective companies thought might come in handy. Pretty much everyone had a decent way to implement a basic spinlock, but beyond that, it was the wild west.

Herlihy had, in joint work with Wing, formally defined linearizability a few years earlier, in 1987 – and if it feels odd to you that we first got an “industrial-strength” definition of, effectively, atomicity a few decades after the first multi-processor supercomputers appeared, that should tell you everything you need to know about how far theory was lagging behind practice at the time.

That’s the setting for this paper. Mutual exclusion primitives exist and are well-known, but they’re *blocking*: if a thread (note: this description is anachronistic; threads weren’t yet a mainstream concept) holding a lock is delayed, say by waiting for IO, then until that thread is done, nobody else who wants to hold that lock will make any progress.

Herlihy starts by defining wait-freedom: “A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of other processors”. (This is stronger than nonblocking aka “lock-free”, which guarantees that *some* thread will always be making progress, but can have any particular thread stuck in a retry loop for arbitrarily long).

Given that definition, he asks the question “with some atomic operations as primitives, what kinds of concurrent objects (think data structures) can we build out of them?”, and answers it definitively via consensus numbers (short version: virtually all primitives in use at the time can only solve 2-processor consensus wait-free, but compare-and-swap has a consensus number of infinity, i.e. it can be used to achieve consensus between an arbitrary number of processors wait-free).
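To see why compare-and-swap is so powerful, here’s a sketch of the classic consensus protocol built on it. Python has no CAS instruction, so the `CasRegister` class below fakes one with a lock; that lock only stands in for hardware atomicity and is an assumption of this example, as are all the names.

```python
import threading

_EMPTY = object()  # sentinel meaning "nobody has decided yet"


class CasRegister:
    """One shared cell with compare-and-swap. Real hardware does this in a
    single atomic instruction; the lock here merely simulates that."""

    def __init__(self):
        self._value = _EMPTY
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

    def load(self):
        with self._lock:
            return self._value


def decide(register, proposal):
    """Every participant calls this once; all calls return the same value."""
    register.compare_and_swap(_EMPTY, proposal)  # only the first CAS succeeds
    return register.load()


register = CasRegister()
results = [None] * 8


def worker(i):
    results[i] = decide(register, f"proposal-{i}")


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Note that `decide` has no retry loop: each participant does one CAS and one load, no matter what the others are up to, and that works for any number of participants.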

The paper then goes on to show that, if the consensus number of available primitives is at least as high as the number of processors in a system, then in fact *any* operation can be done wait-free (*universality*). Note that this type of universal construction has high overhead and is therefore not particularly efficient, but it’s always possible.

This is a great paper and if you’re interested in concurrency, you should read it.

### 3. Cook – “How complex systems fail” (1998; complex systems)

This is about all kinds of complex systems, not just computer-related ones. Cook is an MD researching incidents in emergency medicine; this four-page write-up (the fifth page is bibliography) summarizes some of the most salient findings. It’s short, non-technical and relevant to everyone interacting with complex systems in their daily lives (which is, in modern society, everyone).

### 4. Moffat, Turpin – “On the Implementation of Minimum Redundancy Prefix Codes” (1997; data compression)

My tweet-length summary was “Much has been written about Huffman coding, a lot of it wrong. Almost everything worth knowing is in this (short!) paper”. I have no space limit here, so let me explain what I mean by that: the “folk understanding” of Huffman coding is that you start by building a data structure, the Huffman tree, using Huffman’s algorithm, then you transfer a description of the tree “somehow” (imagine vigorous hand-waving here), and then the encoder produces the bit stream by encoding paths from the root to the leaves (corresponding to individual symbols) in the Huffman tree, and the decoder works by reading a bit at a time and walking down the appropriate branch of the tree, producing the right symbol and going back to the root when it hits a leaf node.

Sounds about right to you? It shouldn’t, because every single part of what I just said is either a bad idea or outright wrong in practice. Let’s start with the trees. You don’t actually want a binary tree data structure for your code – because neither encoder nor decoder should ever be working a bit at a time. In fact, the preferred approach is generally to use canonical Huffman coding which not only never builds a tree, but also doesn’t care about the codes assigned during the original Huffman tree-building process. All it cares about is the code *lengths*, which it uses to build an equivalent (in terms of compression ratio) encoding that is easier to decode (and also easier to transmit efficiently in the bit stream, because it has fewer redundant degrees of freedom).
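As an illustration of how little canonical Huffman coding needs, this sketch assigns codes from nothing but the code lengths. It uses one common convention (shorter codes first, ties broken by symbol, codes counting upward, as DEFLATE does); other formats order things differently, and `canonical_codes` is a name made up here. It assumes the lengths describe a valid prefix code.

```python
def canonical_codes(lengths):
    """Map {symbol: code length} to {symbol: code as a bit string}.
    Symbols with length 0 are treated as unused and get no code."""
    used = sorted((length, sym) for sym, length in lengths.items() if length > 0)
    codes = {}
    code = 0
    prev_len = used[0][0] if used else 0
    for length, sym in used:
        code <<= length - prev_len   # moving to a longer length appends zero bits
        codes[sym] = format(code, f"0{length}b")
        code += 1                    # the next code of the same length is just +1
        prev_len = length
    return codes


codes = canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3, "e": 0})
print(codes)
```

Since encoder and decoder can both rebuild the exact same codes from the lengths alone, the lengths are all that has to go into the bit stream.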

You might also not want to use Huffman’s algorithm directly; actual Huffman code words for rare symbols can become really long. In practice, you want to limit them to something reasonable (I don’t know any format offhand that allows codes longer than about 20 bits; 16 bits is a typical limit) to simplify bit IO in the encoder and decoder. So to avoid over-long codes, there’s either some post-processing that essentially “rotates” parts of the tree (making some longer codes shorter and some shorter codes longer), or code lengths are constructed using a different algorithm altogether.

This paper gets all these things right, and that’s why it’s the one I would recommend to anyone who wants to learn how to do Huffman coding right. As a final note, let me quote a passage from the conclusion of the paper:

> In particular, we observe that explicitly tree-based decoding is an anachronism and usually best avoided, despite the attention such methods have received in textbooks, in the research literature, and in postings to the various network news groups.

It’s been 20 years; this is just as true now as it was then.

### 5. Dybvig, Hieb, Butler – “Destination-Driven Code Generation” (1990; compilers)

I also linked to Millikin’s “One-pass Code Generation in V8” as a companion piece.

Code generation for a simple stack-based virtual machine is really easy; it’s little more than a post-order traversal of the abstract syntax tree and some label management. If execution speed isn’t much of a concern, at that point you write a bytecode interpreter for said VM and call it a day.
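For reference, the “really easy” version looks something like this. The tuple-based AST, the instruction names and both function names are assumptions made for this sketch; there’s no control flow in it, so no label management either.

```python
def compile_expr(node, out):
    """Post-order traversal: emit code for both operands, then the operator.
    A node is either a number or a tuple (op, left, right)."""
    if isinstance(node, tuple):
        op, left, right = node
        compile_expr(left, out)
        compile_expr(right, out)
        out.append((op,))
    else:
        out.append(("push", node))
    return out


def run(code):
    """The matching bytecode interpreter for the stack VM."""
    ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
    }
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[instr[0]](a, b))
    return stack.pop()


ast = ("add", 1, ("mul", 2, 3))   # 1 + 2 * 3
code = compile_expr(ast, [])
print(code)
print(run(code))
```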

But what if you do care about speed? Well, the next step is to generate actual machine code that still maintains the explicit stack (to get rid of the interpreter loop). Once you do that, it tends to become clear how much redundant stack manipulation and data movement you’re doing, so if you’re still not satisfied with the speed, what are you to do?

The most common answers are either to add some local smarts to the code generator, which gets messy alarmingly quickly (see the Millikin presentation for examples), or to give up and go for a “real” optimizing compiler back end, which is orders of magnitude more code and complexity (and has, not surprisingly, much higher compile times).

DDCG is a great compromise. It’s good at avoiding most of the redundant data movement and unnecessary control flow, while still fitting nicely within a simple one-pass code generator. That makes it a popular technique for JITs that need to generate passable machine code on tight deadlines.

### 6. Valmari – “Fast brief practical DFA minimization” (2012; theoretical CS/automata theory)

Take an $n$-state DFA with $m$ transitions and an input alphabet of size $k$ as input, and produce the corresponding minimal (by number of states) DFA – i.e. an equivalent DFA that is smaller (or at worst the same size as the input).

This is a classic automata theory problem; it’s relatively straightforward to design an algorithm that works in $O(kn^2)$ time (usually credited to Moore), and there’s a 1971 algorithm by Hopcroft that solves the problem in $O(kn \log n)$ time. So why am I linking to a 2012 paper?

Hopcroft’s algorithm is *complicated*. Textbooks tend to omit it entirely because its correctness proof and run-time analysis are considered too hard. Hopcroft’s 1971 paper was followed by a 1973 paper by Gries, “Describing an algorithm by Hopcroft”, that aims to improve the exposition. Nearly 30 years later, Knuutila wrote another follow-up paper, “Re-describing an algorithm by Hopcroft”, that still bemoans the lack of a readable exposition and gives it another go, spending 31 pages on it, and showing along the way that Hopcroft and Gries do not actually describe quite the same algorithm – and that his new algorithm is slightly different still (intentionally this time, to make it somewhat easier to analyze).

This is not what successful replication looks like, and a profoundly unsatisfying state of affairs for such a relatively fundamental (and clean) problem.

And *that’s* why I’m linking to Valmari’s 2012 paper. This paper, unlike most of the previous ones, works correctly with partial DFAs (i.e. those that don’t have transitions defined for every (state,input) pair), includes working code in the body of the paper (about 130 lines of C++, formatted for 42-char lines to fit in the two-column limit, resulting in one-letter variable names for everything; correspondingly it’s quite dense, but readable enough once you get used to the names), and takes a total of 5 pages for the algorithm description, correctness proof, and run-time analysis – the latter is $O(n + m \log m)$ for $n$ states and $m$ transitions; since $m \le kn$ for an alphabet of size $k$, this is as good as Hopcroft’s algorithm.

The algorithm itself is rather pretty too, working by alternatingly refining a partition on the states and the transitions of the input DFA in a pleasantly symmetric way. The data structures are simple but clever.
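For contrast, the straightforward quadratic-time refinement is only a few lines. To be clear, this is a sketch of the simple Moore-style approach, *not* Valmari’s algorithm: it assumes a complete DFA (every state has a transition for every symbol), ignores unreachable states, and `moore_classes` is a name made up for this example.

```python
def moore_classes(num_states, alphabet, delta, accepting):
    """Return a list mapping each state to its equivalence class index.
    delta[(state, symbol)] must be defined for every state and symbol."""
    block = [1 if s in accepting else 0 for s in range(num_states)]
    while True:
        # A state's signature: its current class plus the classes it reaches.
        signatures = {}
        new_block = []
        for s in range(num_states):
            sig = (block[s],) + tuple(block[delta[(s, a)]] for a in alphabet)
            new_block.append(signatures.setdefault(sig, len(signatures)))
        if len(signatures) == len(set(block)):
            return new_block  # nothing got split: the partition is stable
        block = new_block


# "Strings over {a, b} ending in a", with two redundant states.
delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 3, (1, "b"): 0,
    (2, "a"): 3, (2, "b"): 2,
    (3, "a"): 1, (3, "b"): 0,
}
classes = moore_classes(4, ["a", "b"], delta, accepting={1, 3})
print(classes)
```

States that end up in the same class can be merged into one state of the minimal DFA. Each round costs time proportional to the number of states times the alphabet size, and there can be about as many rounds as states, which is where the quadratic bound comes from.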

It’s taken an extra 41 years, but I think we can actually consider this problem truly solved now.

### 7. Sarnak, Tarjan – “Planar Point Location Using Persistent Search Trees” (1986; computational geometry/data structures)

Planar point location (locating which polygon of a polygonal subdivision of the plane a query point falls in) is an interesting problem in its own right, but in retrospect, it’s just a side dish. What this paper really does is introduce space-efficient persistent search trees, using the “fat node” method. It’s a simple and very versatile idea (section 3 of the paper).

With persistent search trees, building the point location structure itself is a straightforward sweep: do a plane sweep from left to right, keeping track of line segments of the polygonal subdivision currently intersecting the sweep line. Their relative ordering only changes at vertices. The algorithm simply keeps track of the currently active line segments in a persistent red-black tree. Once the sweep is done, the persistent tree encodes all “slabs” between line segments (and thus the polygons between them) for arbitrary x coordinates, and can be queried in logarithmic time. It’s a textbook example of good data structures making for good algorithms.

### 8. Porter, Duff – “Compositing Digital Images” (1984; computer graphics)

The paper that introduced, well, digital image compositing, premultiplied (also known as “associated”) alpha, and what is now known as the “Porter-Duff operators” (“over” being the most important one by far). All these concepts are fairly important in image processing; sadly the paper under-sells them a bit.

In particular, premultiplied alpha is introduced as little more than a hack that saves a fair amount of arithmetic per pixel. While true, this is one of those cases where applications were ahead of the theory.

It turns out that a much stronger argument for premultiplication isn’t that it saves a bit of arithmetic, but that it solves some important other problems: when image data is stored in pre-multiplied form, (linear) filtering and compositing operations commute – that is, filtering and compositing can be exchanged. This is not the case with non-premultiplied alpha. In particular, scaling non-premultiplied images up or down before compositing tends to introduce fringing at the edges, as color values of transparent pixels (which should not matter) bleed in. Many try to mitigate this by preprocessing say sprite data so that the color values for transparent pixels adjacent to opaque or partially translucent pixels are “close” to minimize the error, but premultiplied alpha is cheaper, simpler to implement and completely free of errors. It’s generally the representation of choice for any linear filtering/interpolation tasks. And it also turns out that compositing operations themselves (if you care about the alpha value of the result) are a lot simpler in premultiplied alpha, whereas non-premultiplied ends up with various complications and special cases.
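A tiny example shows the fringing problem. This is a sketch with float RGBA tuples and a two-pixel box filter standing in for “scaling”; all function names are made up for the example.

```python
def premultiply(rgba):
    r, g, b, a = rgba
    return (r * a, g * a, b * a, a)


def unpremultiply(rgba):
    r, g, b, a = rgba
    if a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)
    return (r / a, g / a, b / a, a)


def average(p, q):
    """The simplest possible linear filter: a 2-pixel box filter."""
    return tuple((x + y) / 2 for x, y in zip(p, q))


def over(src, dst):
    """Porter-Duff "over" for premultiplied pixels."""
    a = src[3]
    return tuple(s + (1.0 - a) * d for s, d in zip(src, dst))


opaque_white = (1.0, 1.0, 1.0, 1.0)
transparent_green = (0.0, 1.0, 0.0, 0.0)  # non-premultiplied; its color "should not matter"

# Filtering non-premultiplied data: the invisible green bleeds into the result.
naive = average(opaque_white, transparent_green)

# Filtering premultiplied data: the result is plain white at half coverage.
proper = unpremultiply(average(premultiply(opaque_white), premultiply(transparent_green)))

# With premultiplied pixels, filter-then-composite matches composite-then-filter
# (here over a constant opaque blue background).
blue = (0.0, 0.0, 1.0, 1.0)
p1, p2 = premultiply(opaque_white), premultiply(transparent_green)
filter_first = over(average(p1, p2), blue)
composite_first = average(over(p1, blue), over(p2, blue))

print(naive, proper, filter_first == composite_first)
```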

A good read on these issues is Jim Blinn’s series of columns in *IEEE Computer Graphics and Applications*, specifically those on Image Compositing, which are unfortunately pay-walled so I can’t link to them. They are collected in his book “Dirty Pixels” though, which is a recommended read if you’re interested in the topic.

### 9. Brandt – “Hard Sync Without Aliasing” (2001; DSP)

Probably the most popular class of synthesizers employs what is called “subtractive synthesis”. The idea is fairly simple: start out with an oscillator generating an “interesting” waveform (one with lots of overtones) at a specified frequency, and then proceed to shape the result using filters, which emphasize some of the overtones and de-emphasize others. It’s called subtractive synthesis because we start out with lots of overtones and then remove some of them; the converse (additive synthesis) builds up the overtone series manually by summing individual sine waves.

Typical good waveforms to use are sawtooth or pulse waves. Generating these seems really simple in code, but there’s a catch: in DSP, we generally represent waveforms by storing their values at regularly spaced discrete sample times; the original continuous signal can be recovered from these samples if and only if it is bandlimited. (This is the famous sampling theorem).

Neither sawtooth waves nor pulse waves are bandlimited. To get a bandlimited (and thus representable) version of them, you have to cut off the overtone series at some point; for sampling, this is the Nyquist frequency (half the sampling rate).

If you don’t, and just sample the continuous non-bandlimited waveforms, the signal you get is similar to what a “proper” bandlimited sawtooth/pulse wave would look like, but it’s got a lot of other noisy junk in it, corresponding to frequencies above Nyquist that weren’t representable and got reflected back into the representable part of the spectrum – this is “aliasing”.
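To make the difference concrete, here’s a sketch of both ways to generate a sawtooth: sampling the ideal waveform directly (which aliases), and summing only the harmonics below Nyquist. The function names are invented for this example, and the additive version is far too slow for real use; it’s only here as a reference.

```python
import math


def naive_saw(freq, sample_rate, num_samples):
    """Sample the ideal (non-bandlimited) rising sawtooth directly. This aliases."""
    out = []
    for n in range(num_samples):
        phase = (n * freq / sample_rate) % 1.0
        out.append(2.0 * phase - 1.0)
    return out


def harmonics_below_nyquist(freq, sample_rate):
    """Number of harmonics k*freq that lie strictly below half the sample rate."""
    k = 0
    while (k + 1) * freq < sample_rate / 2:
        k += 1
    return k


def bandlimited_saw(freq, sample_rate, num_samples):
    """Same sawtooth via its Fourier series, cut off below Nyquist."""
    harmonics = harmonics_below_nyquist(freq, sample_rate)
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        total = sum(math.sin(2.0 * math.pi * k * freq * t) / k
                    for k in range(1, harmonics + 1))
        out.append(-2.0 / math.pi * total)
    return out


naive = naive_saw(1000.0, 48000.0, 48)
clean = bandlimited_saw(1000.0, 48000.0, 48)
print(harmonics_below_nyquist(1000.0, 48000.0))
```

The two signals look similar when plotted, but only the second one is free of the reflected overtones. The cost is a sum over every harmonic for every sample, and it gets a lot messier once parameters start changing over time.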

If you’re doing sound synthesis, you want clean signals without such aliasing noise. The problem is that generating them directly is hard in general. Restricted cases (perfect sawtooth or square waves where you never change the frequency or other parameters) are reasonably easy, but that doesn’t help you if you want to tweak knobs describing the shape of the waveform in real time, doing pulse-width modulation, or various other tricks.

One such other trick is “hard sync”; two oscillators, where the first oscillator resets the phase of the second whenever it wraps around. This is done in analog synthesizers and the resulting sound is harmonically quite interesting, but the resulting waveform is complex enough to defeat straightforward attempts at description using additive synthesis or similar.

It would all be much simpler if you could work with the convenient non-bandlimited saw/pulse waves, where everything is easy, and just subtract the aliasing later. Well, turns out that because sampling and everything else involved in this chain is linear, you can in fact do *exactly that*.

What this paper does is introduce what’s called “MinBLEPs” – minimum-phase BandLimited stEPs. You simply work with the convenient continuous functions, and whenever a discontinuity is introduced, you insert a corresponding MinBLEP, which cancels out the aliasing. It’s a beautiful solution to a long-standing practical problem.

### 10. Veach – “Robust Monte Carlo Methods for Light Transport Simulation” (1997; computer graphics)

This one’s a PhD thesis, not a paper. I’ve been active in computer graphics for a long time, but there aren’t many graphics papers linked in here. That’s partly because I have many interests outside of CG, but also due to the nature of computer graphics research, which tends to be rapid-paced and very incremental. That makes it hard to single out individual papers, for the most part.

Veach’s PhD thesis sticks out, though. When it was initially published, Veach’s research was groundbreaking. Now, nearing its 20th anniversary (it was published in December of 1997), it’s probably better summarized as “classic”, with important contributions to both theory and algorithms.

This is not light reading, but it’s definitely a must-read for anyone interested in unbiased rendering.

Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. This is deceptive. It’s much more useful to relate memory bandwidth to, say, the number of clock cycles or instructions being executed, to get a feel for what you can (and can’t) get away with.

Let’s start with a historical example: the MOS 6502, first released in 1975 – 42 years ago, and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline. A typical 6502 instruction took 3-5 cycles; some instructions with more complex addressing modes took longer, a few were quicker, and there was some overlapping of the fetch of the next instruction with execution of the current instruction, but no full pipelining like you’d see in later (and more complex) workstation and then microcomputer CPUs, starting around the mid-80s. That gives us a baseline of 1 byte/cycle and let’s say about 4 bytes/instruction memory bandwidth on a 40-year-old CPU. A large fraction of that bandwidth went simply into fetching instruction bytes.

Next, let’s look at a recent (as of this writing) and relatively high-end desktop CPU. An Intel Core i7-7700K has about 50GB/s of memory bandwidth and 4 cores, so if all 4 cores are under equal load, they get about 12.5GB/s each. They also clock at about 4.2GHz (it’s safe to assume that with all 4 cores active and hitting memory, none of them is going to be in “turbo boost” mode), so they come in just under 3 bytes per cycle of memory bandwidth. Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle. So well-optimized code running with all cores busy has about 1 byte/instruction of available memory bandwidth. Note that we’re 40 years of Moore’s law scaling later and the available memory bandwidth per instruction has gone *down* substantially. And while the 6502 is an 8-bit microprocessor doing 8-bit operations, these modern cores can execute multiple (again, usually up to three) 256-bit SIMD operations in one cycle; if we treat the CPU like a GPU and count each 32-bit vector lane as a separate “thread” (appropriate when running SIMT/SPMD-style code), then we get 24 “instructions” executed per cycle and a memory bandwidth of about 0.125 bytes per cycle per “SIMT thread”, or less unwieldy, one byte every 8 “instructions”.

It gets even worse if we look at GPUs. Now, GPUs generally look like they have insanely high memory bandwidths. But they also have a lot of compute units and (by CPU standards) extremely small amounts of cache per “thread” (invocation, lane, CUDA core, pick your terminology of choice). Let’s take the (again quite recent as of this writing) NVIDIA GeForce GTX 1080 Ti as an example. It has (as per Wikipedia) a memory bandwidth of 484GB/s, with a stock core clock of about 1.48GHz, for an overall memory bandwidth of about 327 bytes/cycle for the whole GPU. However, this GPU has 28 “Streaming Multiprocessors” (roughly comparable to CPU cores) and 3584 “CUDA cores” (SIMT lanes). We get about 11.7 bytes/cycle per SM, so about 4x what the i7-7700K core gets; that sounds good, but each SM drives 128 “CUDA cores”, each corresponding to a thread in the SIMT programming model. Per *thread*, we get about 0.09 bytes of memory bandwidth per cycle – or perhaps less awkward at this scale, one byte every 11 instructions.
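The arithmetic in the last three paragraphs is easy to redo for other chips. Here is a minimal sketch of it, using only the round figures quoted above; the helper name `bytes_per_cycle` and the dictionary layout are made up for illustration.

```python
def bytes_per_cycle(bandwidth_bytes_per_s, clock_hz, units=1):
    """Memory bandwidth per clock cycle, split evenly across `units` cores or SMs."""
    return bandwidth_bytes_per_s / clock_hz / units


# MOS 6502: ~1 MB/s at 1 MHz, roughly 4 cycles per instruction.
mos6502 = bytes_per_cycle(1e6, 1e6)
mos6502_per_instr = mos6502 * 4

# Core i7-7700K: ~50 GB/s shared by 4 cores at ~4.2 GHz.
i7_core = bytes_per_cycle(50e9, 4.2e9, units=4)
i7_per_instr = i7_core / 3            # well-optimized code, ~3 instructions/cycle
i7_per_lane = i7_core / (3 * 8)       # 3 SIMD ops/cycle, 8 32-bit lanes each

# GeForce GTX 1080 Ti: 484 GB/s at ~1.48 GHz, 28 SMs with 128 lanes each.
gpu_total = bytes_per_cycle(484e9, 1.48e9)
gpu_sm = gpu_total / 28
gpu_lane = gpu_sm / 128

results = {
    "6502 bytes/instruction": mos6502_per_instr,
    "i7 bytes/cycle/core": i7_core,
    "i7 bytes/instruction": i7_per_instr,
    "i7 bytes/cycle/lane": i7_per_lane,
    "GPU bytes/cycle total": gpu_total,
    "GPU bytes/cycle/SM": gpu_sm,
    "GPU bytes/cycle/lane": gpu_lane,
}

if __name__ == "__main__":
    for name, value in results.items():
        print(f"{name}: {value:.3f}")
```

Swapping in the bandwidth, clock and unit counts of a different part gives the same kind of per-cycle and per-lane figures.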

That, in short, is why everything keeps getting more and larger caches, and why even desktop GPUs have quietly started using tile-based rendering approaches (or just openly announced it). Absolute memory bandwidths in consumer devices have gone up by several orders of magnitude from the ~1MB/s of early 80s home computers, but available compute resources have grown much faster still, and the only way to stop bumping into bandwidth limits all the time is to make sure your workloads have reasonable locality of reference so that the caches can do their job.

Final disclaimer: bandwidth is only one part of the equation. Not considered here is memory latency (and that’s a topic for a different post). The good news is absolute DRAM latencies have gone down since the 80s – by a factor of about 4-5 or so. The bad news is that clock rates have increased by about a factor of 3000 since then – oops. CPUs generally hit much lower memory latencies than GPUs (and are designed for low-latency operation in general) whereas GPUs are all about throughput. When CPU code is limited by memory, it is more commonly due to latency than bandwidth issues (running out of independent work to run while waiting for a memory access). GPU kernels have tons of runnable warps at the same time, and are built to schedule something else during the wait; running on GPUs, it’s much easier to run into bandwidth issues.

It’s fairly well-known (among programmers anyway) that, say, rounding up x to the nearest multiple of 8 can be accomplished using the formula `(x + 7) & ~7`, and that in general rounding up to the nearest multiple of N (where N is a power of 2) can be accomplished using `(x + N - 1) & ~(N - 1)`. But sometimes you need a slightly generalized version: round up to the nearest value that is congruent to some k (mod N); for example, this crops up in boundary tag-using memory allocators when the user requests aligned memory. Such allocators put a header before allocated blocks (just before the address returned to the caller). For the user-visible pointer to be aligned by say 32, that header needs to fall at an address that’s *off* alignment by a specified distance (which brings us to our problem).

It’s not immediately obvious how to adapt the original formula to this case (there is a way; I’ll get to it in a second). Now this is not exactly a frequent problem, nor is there any real need for a clever solution, but it turns out there is a very nice, satisfying solution anyway, and I wanted to write a few words about it. The solution is simply `x + ((k - x) & (N - 1))` for power-of-2 N. The basic approach works in principle for arbitrary N, but `x + ((k - x) % N)` will not work properly in environments using truncated division where taking the modulus of a negative argument can return negative results, which sadly is most of them. That said, in the remainder of this short post I’ll write `% N` instead of `& (N - 1)` with an “N needs to be a power of 2” disclaimer anyway, since there’s nothing about the method that really requires it. Finally, this expression works fine even in overflowing unsigned integer arithmetic when N is a power of 2, but not for non-power-of-2 N.
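Here is a minimal sketch of the round-up formula, checked against a brute-force search. It is written in Python, where `%` is floored rather than truncated, so the `% N` form happens to work for any positive N. The function names, the emulated 8-bit wraparound and the 8-byte header in the usage note are illustrative assumptions.

```python
def round_up_congruent(x, k, n):
    """Smallest y >= x with y congruent to k (mod n); relies on Python's floored %."""
    return x + ((k - x) % n)


def round_up_congruent_pow2(x, k, n):
    """Same result using a bit mask; n must be a power of 2."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    return x + ((k - x) & (n - 1))


def round_up_congruent_u8(x, k, n):
    """The % form evaluated in emulated wrapping 8-bit unsigned arithmetic."""
    diff = (k - x) & 0xFF            # k - x wraps around modulo 256
    return (x + diff % n) & 0xFF


def round_up_brute_force(x, k, n):
    """Reference: step upwards from x until the residue matches."""
    y = x
    while y % n != k % n:
        y += 1
    return y


if __name__ == "__main__":
    for n in (1, 2, 8, 32):
        for k in range(n):
            for x in range(-40, 100):
                expected = round_up_brute_force(x, k, n)
                assert round_up_congruent(x, k, n) == expected
                assert round_up_congruent_pow2(x, k, n) == expected
    # Wrapping arithmetic: fine for N = 8, wrong for N = 6.
    print(round_up_congruent_u8(5, 3, 8), round_up_congruent_u8(5, 3, 6))
```

For the allocator case, a hypothetical 8-byte header in front of a 32-byte-aligned user pointer has to sit at an address congruent to -8 (mod 32): `round_up_congruent(100, -8, 32)` gives 120, and 120 + 8 = 128 is aligned as required. The wraparound demo returns 11 for N = 8 (correct) and 7 for N = 6, where the right answer would be 9.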

What I like about this solution is that, once you see it written down, it’s fairly clear both that it works and why (unlike many bit manipulation tricks), provided you know the rules of modular arithmetic: x + ((k - x) mod N) ≡ x + (k - x) = k (mod N). We’re adding a non-negative value to x, so it’s clear that the result is ≥ x (provided there is no overflow). And we’re adding the smallest possible value we can to get to a value that’s congruent to k (mod N); I wrote about similar things before in my post “Intervals in modular arithmetic”.

There’s an equivalent expression for rounding down to the nearest value congruent to k (mod N): `x - ((x - k) % N)` that works (and is easy to prove) the same way.

It’s interesting to consider the case k=0. The round-down variant, `x - (x % N)`, feels fairly natural and is something I’ve seen in “real-world” code more than once. The round-up variant, `x + (-x % N)`, is something I’ve never seen anywhere. Once you throw the k in there, it all makes sense, but without it the expression looks quite odd.
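The round-down variant and the k=0 special cases can be sketched the same way. The `trunc_mod` helper below is a made-up emulation of a C-style truncating `%`, included only to show why the odd-looking round-up form breaks in environments that use truncated division.

```python
def round_down_congruent(x, k, n):
    """Largest y <= x with y congruent to k (mod n); relies on Python's floored %."""
    return x - ((x - k) % n)


def round_up_multiple(x, n):
    """The k = 0 round-up variant: x + (-x % n)."""
    return x + (-x % n)


def trunc_mod(a, n):
    """Emulates a truncating remainder (result takes the sign of a), for n > 0."""
    r = abs(a) % n
    return r if a >= 0 else -r


if __name__ == "__main__":
    print(round_down_congruent(12, 3, 8))   # 11
    print(round_down_congruent(13, 0, 8))   # 8, same as 13 - (13 % 8)
    print(round_up_multiple(13, 8))         # 16
    # With a truncating remainder, -13 % 8 is -5, so the result goes the wrong way.
    print(13 + trunc_mod(-13, 8))           # 8, which is less than 13
```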

Finally, here’s the aforementioned way to adapt the “regular” round-up formula to produce a value that’s congruent to k (instead of 0) mod N (and we’re back to requiring power-of-2 N here): `((x - k + N - 1) & ~(N - 1)) + k`. This uses a different trick from the “Intervals in modular arithmetic” post: shift the origin around. In this case, we don’t have a formula for arbitrary k, but we do have a formula to round up to the nearest multiple of N. So we first subtract k; in this new shifted coordinate system, we want to round up to the next-larger multiple of N, which we know how to do. And finally, we add back k. It gets the job done, but it’s not as pretty as the other variant (in my opinion anyway), and it takes some thinking to convince yourself that it works at all.
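If you'd rather not do the thinking, a brute-force comparison is a cheap way to gain confidence that the shifted-origin formula and the direct one agree. This is a small sketch with made-up function names; it leans on Python's arbitrary-precision integers, where `~(n - 1)` acts as an infinitely wide mask.

```python
def round_up_shifted(x, k, n):
    """Shift the origin by k, round up to a multiple of n, shift back (n a power of 2)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    return ((x - k + n - 1) & ~(n - 1)) + k


def round_up_direct(x, k, n):
    """The direct formula: x + ((k - x) & (n - 1))."""
    return x + ((k - x) & (n - 1))


def formulas_agree(max_log2_n=6, x_range=range(-100, 200)):
    """Exhaustively compare both formulas over small power-of-2 n, all k in [0, n) and x."""
    for shift in range(max_log2_n + 1):
        n = 1 << shift
        for k in range(n):
            for x in x_range:
                if round_up_shifted(x, k, n) != round_up_direct(x, k, n):
                    return False
    return True


if __name__ == "__main__":
    print(round_up_shifted(5, 3, 8), round_up_direct(5, 3, 8))
    print(formulas_agree())
```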