Skip to content

Papers I like (part 2)

Continued from part 1.

Once I was about a thousand words into describing background for GEMM, it became pretty clear that it made more sense to group the numerical math papers into one post, so here goes the (out-of-order) numerical linear algebra special issue.

11. Goto, van de Geijn-“Anatomy of high-performance matrix multiplication” (2008; numerical Linear Algebra / HPC)

You might wonder: why do we care about matrix multiplication in particular so much? Who is it who’s doing these giant matrix multiplies? If you’re a customer of a linear algebra library, it’s not unlikely that you’re never calling GEMM (GEneral Matrix Multiply, the customary name for matrix multiply kernels, thanks to FORTRAN 77’s 6-character limit for function names) at all. So what gives?

Well, if you’re calling into a linear algebra library, odds are you want to solve a linear system of equations (which usually turns into a pivoted LU decomposition plus a solve step), a linear least-squares problem (depending on the problem and your accuracy requirements, this might turn either into a Cholesky decomposition or a QR decomposition, again followed by a solve step), or you want something fancier like the SVD (yet another matrix decomposition, and you probably still eventually want to solve a linear system – but you’re using the SVD because it’s noisy or ill-conditioned and you want to munge around with the singular values a bit).

What’s with all the focus on “decompositions”? Are numerical analysts secretly the world’s most pocket-protected goth band? No: a matrix decomposition (or factorization) expresses a more general matrix A as the product of several special matrices that have some desirable structure. For example, the LU decomposition turns our general matrix into a product A=LU where L is a unit lower triangular matrix and U is upper triangular (note: I’ll be ignoring pivoting in this post for simplicity). The LU decomposition is the industrial-strength counterpart of the Gaussian Elimination process you might have learned in school, but with some side benefits: you can decompose the matrix once and then reuse the factorization multiple times if you’re solving the same system many times with different right-hand sides (this is common in applications), and it also happens to be really nice to have the whole process in a form that can be manipulated algebraically, which we’ll be using in a moment.

But why does this decomposition help? Well, suppose we have a toy system with 3 equations and 3 unknowns, which can be written as a matrix equation Ax = b where A is a 3×3 matrix of coefficients and x and b are 3-element column vectors. If we have a LU decomposition for A, this turns into

Ax = LUx = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}

How does this help? Well, we can’t solve the full thing yet, but we now have two fairly simply matrices. For now, let’s focus on the left matrix and treat y = Ux as an unknown:

LUx = Ly = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}

Well, this one’s easy: the first row just states that y_1 = b_1. The second row states that l_{21} y_1 + y_2 = b_2, and we know everything but y_2, so we can rearrange this to y_2 = b_2 - l_{21} y_1. With this, the final row l_{31} y_1 + l_{32} y_2 + y_3 = b_3 poses no problems either, yielding y_3 = b_3 - l_{31} y_1 - l_{32} y_2. So y_1 just falls out, given y_1 we can compute y_2, and given both we can compute y_3. This is called “forward substitution”. Note that we’re just computing y = L^{-1} b here. However, we’re never forming the inverse of L explicitly! This is important. In numerical LA, when you see an inverse, that means you’re supposed to use the corresponding “solve” routine. Actually computing the inverse matrix is generally both inefficient and inaccurate and to be avoided whenever possible.

Anyway, now that we have y, we can write out the y = Ux we defined it as, and use that to solve for the x we actually wanted:

Ux = \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}

This time, we’re going backwards: u_{33} x_3 = y_3 \Leftrightarrow x_3 = y_3 / u_{33}, u_{22} x_2 + u_{23} x_3 = y_2 \Leftrightarrow x_2 = (y_2 - u_{23} x_3) / u_{22}, and u_{11} x_1 + u_{12} x_2 + u_{13} x_3 = y_1 \Leftrightarrow x_1 = (y_1 - u_{12} x_2 - u_{13} x_3) / u_{11}. You will not be surprised to learn that this is called “backwards substitution”. Again, we’re just calculating x = U^{-1} y, which does not actually use a matrix inversion when U is triangular.

And that’s how you solve a linear system given a LU decomposition. In BLAS-ese, solving a triangular system using forwards or backwards substitution for one right-hand side is called a TRSV (TRiangular Solve for a single Vector) – that single routine handles both types. It’s what’s called a level-2 BLAS operation. Level-1 operations are between two vectors, level-2 operations work on a matrix and a vector, and level-3 operations work on two matrices. More about “levels” in a bit.

That’s all dandy, but what does any of this have to do with GEMM? Hang on, we’re getting close. Let’s first generalize slightly: what if we want to solve multiple systems with the same A all at once? Say we want to solve two systems

A x^1 = b^1 \\ A x^2 = b^2

at once (using superscripts to denote the separate vectors, since I’m already using subscripts to denote components of a vector or matrix). It turns out that you can just write this as a single matrix equation

AX = A \left(\begin{array}{c|c} x^1 & x^2 \end{array}\right) = \left(\begin{array}{c|c} b^1 & b^2 \end{array}\right) = B

where we just group the column vectors for x into one matrix X, and the column vectors for b into another matrix B. Again we can solve this for a LU-decomposed A by forward and back substitution (remember, still not actually forming inverses!)

Y = L^{-1} B \\ X = U^{-1} Y

note that we already know one direct way to do this type of equation: loop over the columns of X (and B) and solve them one by one, as above. This kind of operation is called a TRSM: TRianguler Solve for Multiple right-hand sides, or TRiangular Solve for Matrix, our first level-3 BLAS operation.

Just to get used to the idea of dealing with multiple right-hand sides at once, let’s write down the full matrix equation form for a 6 equations, 6 unknowns unit lower triangular system with two separate right-hand sides explicitly:

\begin{pmatrix} 1 & & & & & \\ l_{21} & 1 & & & & \\ l_{31} & l_{32} & 1 & & & \\ l_{41} & l_{42} & l_{43} & 1 & & \\ l_{51} & l_{52} & l_{53} & l_{54} & 1 & \\ l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1 \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \\ x_{61} & x_{62} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \\ b_{41} & b_{42} \\ b_{51} & b_{52} \\ b_{61} & b_{62} \end{pmatrix}

As before, the first row tells us that \begin{pmatrix} x_{11}&  x_{12} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \end{pmatrix}; the second row mutliplied out gives l_{21} \begin{pmatrix} x_{11} & x_{12} \end{pmatrix} + \begin{pmatrix} x_{21} & x_{22} \end{pmatrix} = \begin{pmatrix} b_{21} & b_{22} \end{pmatrix}, and so forth, which we solve the exact same way as before, only now we’re always multiplying (and summing) short row vectors instead of single scalars.

But 6×6 is still really small as far as real-world systems of equations go and this is already getting really unwieldy. It’s time to chop the matrix into pieces! (You can always do that, and then work on blocks instead of scalars. This is really important to know.) Let’s just draw some lines and then cut up the matrices involved into parts:

\left(\begin{array}{cc|cccc} 1 & & & & & \\ l_{21} & 1 & & & & \\ \hline l_{31} & l_{32} & 1 & & & \\ l_{41} & l_{42} & l_{43} & 1 & & \\ l_{51} & l_{52} & l_{53} & l_{54} & 1 & \\ l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1 \end{array}\right) \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \hline x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \\ x_{61} & x_{62} \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ \hline b_{31} & b_{32} \\ b_{41} & b_{42} \\ b_{51} & b_{52} \\ b_{61} & b_{62} \end{pmatrix}

turns into the matrix equation

\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}

where L_{11} and L_{22} are unit lower triangular, and L_{21} is just a general matrix. If we just multiply the matrix product out by blocks (again, the blocks behave like they’re scalars in a larger matrix, but you need to make sure the matrix product sizes match and be careful about order of multiplication because matrix multiplication doesn’t commute) we get two matrix equations:

L_{11} X_1 = B_1 \\ L_{21} X_1 + L_{22} X_2 = B_2

The first of these is just a smaller TRSM with a 2×2 system, and in the second we can bring the X_1 term to the right-hand side, yielding

L_{22} X_2 = B_2 - L_{21} X_1

On the right-hand side, we have a matrix multiply of values we already know (we computed X_1 with the smaller TRSM, and everything else is given). Compute the result of that, and we have another TRSM, this time with a 4×4 system.

The matrix multiply here is one instance of a GEneral Matrix Multiply (GEMM). The corresponding BLAS function computes C \leftarrow \alpha AB + \beta C, where the left arrow denotes assignment, A, B, and C are matrices, and α and β are scalar values. In this particular case, we would have A=L_{21}, B=X_1, C=B_2, \alpha=-1 and \beta=1.

So we cut the matrix into two parts, did a bit of algebra, and saw that our TRSM with a 6×6 L can be turned into a 2×2 TRSM, a GEMM of a 4×2 by a 2×2 matrix, and finally a 4×4 TRSM. Note the function of the matrix multiply: once we’ve computed two unknowns, we need to subtract out their contributions from every single equation that follows. That’s what the GEMM does. It’s the first matrix multiply we’ve seen, but does it matter?

Well, the next thing to realize is that we can do the splitting trick again for the 4×4 TRSM, turning it into 2 even smaller TRSM, plus another GEMM. But now that we’ve establishes the idea of using blocks, let’s skip to a somewhat more serious problem size, so it becomes clear why this is interesting.

Let’s say our A is 1000×1000 (so 1000 equations in 1000 unknowns); its LU factors are the same size. This time, let’s say we’re doing 20 right-hand sides at once, and working in blocks of 30×30. We have the same equation as before:

\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}

but this time L_{11} is 30×30 unit lower triangular, L_{21} is 970×30, L_{22} is 970×970 unit lower triangular, X_1 and B_1 are 30×20, and B_1 and B_2 are 970×20. Again we do the same 3 steps:

  1. X_1 \leftarrow L_{11}^{-1} B_1 (TRSM, 30×30 matrix times 30×20 RHS)
  2. B_2 \leftarrow B_2 - L_{21} X_1 (GEMM, 970×30 times 30×30)
  3. X_2 \leftarrow L_{22}^{-1} B_2 (TRSM, 970×970 times 970×30)

Now the computational cost of both a m×n-by-n×p TRSM and a m×n-by-n×p GEMM (the middle dimensions always have to match) are both roughly 2×m×n×p floating-point operations (flops, not to be confused with all-uppercase FLOPS, which conventionally denote flops/s, because nomenclature is stupid sometimes). Which means the first step above (the medium TRSM) has on the order of 1800 flops, while the second step (the GEMM) takes 873000 flops. In short, of these two steps, step 1 is basically a rounding error in terms of execution time.

And note that we’re splitting a large × TRSM into a medium × medium TRSM, a large × small GEMM, and a final large × large (but smaller than the original) TRSM. And we can keep doing the same splitting process to that remaining large TRSM, until it becomes small as well. In short, this process allows us to turn a large TRSM into a sequence of medium-size TRSM (always the same size), alternating with large-size GEMMs (which keep getting smaller as we proceed down). And what happens if you look at the matrix as a whole is that we end up doing a small amount of actual TRSM work near the diagonal, while the rest of the matrix gets carpet-bombed with GEMMs.

In short, even though what we wanted to do was solve a pre-factored linear systems for a bunch of different right-hand sides, what the computer actually ended up spending its time computing was mostly matrix multiplies. The GEMM calls are coming from inside the solver! (Cue scary music.)

Alright. At this point you might say, “fair enough, that may indeed be what happens if you use this TRSM thing that for all we know you just made up, but I for one am not ever asking the computer to solve the same equation with 50 different right-hand sides in a batch, so how does this apply to me?” Okay then, let’s have a look at how LU factorizations (which so far I’ve assumed we just have lying around) are actually computed, shall we? (And again, note I’m ignoring pivoting here, for simplicity.)

What we want to do is factor our matrix A into a unit lower triangular and an upper triangular factor:

A = LU

So, how do we do that? Just keep staring at that equation for a minute longer, see if it flinches first! It doesn’t? Bugger. Okay, plan B: apply our new favorite trick, splitting a matrix into blocks, to play for time until we get a brilliant idea:

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}

Our top-left block A_{11} needs to be square (same number of rows as columns), else this isn’t right, but it can be any size. This makes A_{22} square as well, and the other blocks are rectangular. The zeros are there because we want L and U to be lower and upper triangular respectively, and their entire top-right respectively bottom-left blocks better be all zero. Furthermore, L_{11} and L_{22} are also unit lower triangular (like the bigger L we carved them out of), and likewise U_{11} and U_{22} are upper triangular. About the remaining L_{21} and U_{12}, we can’t say much.

Still drawing blanks on the ideas front. But let’s just keep going for now: if we multiply out that matrix equation, we get

A_{11} = L_{11} U_{11} \\ A_{12} = L_{11} U_{12} \\ A_{21} = L_{21} U_{11} \\ A_{22} = L_{21} U_{12} + L_{22} U_{22}

Wait a second. That first line is a smaller LU decomposition, which we’re trying to figure out how to compute. But suppose we knew for now, and we had something that gave us L_{11} and U_{11}. Then that second line is really just U_{12} = L_{11}^{-1} A_{12}. That’s a TRSM, we just went over that. And the third line is L_{21} = A_{21} U_{11}^{-1}, which is also a TRSM (of a shape we haven’t seen before, but it works the same way). Once we have L_{11} and U_{11}, our hand is forced with regards to these two matrices; for the factorization to multiply to the correct result A, we need them to be the things I just wrote down. And if we know these two matrices, we can attack that last equation by moving their product to the left-hand side:

A_{22} - L_{21} U_{12} = L_{22} U_{22}

Hey look, we do a big GEMM and then resume with computing a LU decomposition of the remainder – we’ve seen that kind of structure before! Great. This is how to do a block LU decomposition: compute a LU decomposition of the top-left block, two TRSMs, one GEMM to update the bottom-right part, then keep decomposing that. And this time the TRSMs are on medium × medium × large problems, while the GEMM is on large × medium × large, so again the bulk of the computation is going to be spent in the GEMM part.

But we still don’t know how to compute the LU decomposition of that top-left block. No worries: if in doubt, go for a cheap shot. We don’t know how to do this for an arbitrary block. But what if we make our partition really silly? For A_{11} is a 1×1 element “matrix” levels of silly? (That is, we’re just splitting off one row and one column at the top left.)

Then A_{11} = L_{11} U_{11} is a no-brainer; all three of these matrices are 1×1, and we require L_{11} to be “unit” (ones on the diagonal), which for a 1×1 matrix just means that L_{11}=1. Therefore A_{11} = U_{11}. Ta-daa! We “solved” a 1×1 LU decomposition. But that’s all we really need. Because once we have that one value U_{11} determined, we can crank through our other 3 formulas, which give us U_{12} (the rest of the top row of U), L_{21} (the rest of the left column of L), and updates the rest of the matrix to eliminate the one variable we just computed. To compute a LU decomposition of a block, we simply keep peeling off 1×1 sub-blocks until we run out of matrix to decompose.

This description covers both “regular” and block LU decompositions (in fact we just do blocks and then get the regular decomposition as a special case when we do 1×1 blocks, at which point the problem becomes trivial), and not a single index or elementary row operation was harmed in the preceding text.

Note that this time, we turned LU decomposition (i.e. Gaussian elimination) into mostly-GEMMs-and-some-block-TRSMs, and we already saw earlier that block TRSMs turn into mostly-GEMMs-and-some-small-TRSMs. Therefore, the entire process of factoring a linear system and then solving it turns into… mostly GEMMs.

And that’s why everyone cares about GEMMs so much. (And also, you may now see why even if you don’t use TRSMs, math libraries still include them, because the solvers your app code calls want to call them internally!)

This pattern is not just specific to Gaussian Elimination-type algorithms, either. Block Householder for QR decompositions? Heavy on GEMMs. Hessenberg reduction for Eigenvalue problems? Basically Householder, which is mostly GEMMs. Computation of the Singular Value Decomposition (either for solvers or to get PCAs)? Generally starts with Golub-Kahan Bidiagonalization or one of its relatives, which is a somewhat fancier version of the QR decomposition algorithm, and yup, lots of GEMMs again. Then the actual singular value computation is iterative on that bidiagonal matrix, but that part tends to take less time than the non-iterative parts surrounding it, because the iteration is only iterating on a matrix reduced to 2 diagonals, whereas everything else works with the whole matrix.

In fact, what we’ve seen so far is a pattern of various matrix operations turning into smaller versions of themselves, plus maybe some other matrix operations, plus a GEMM. Guess what happens with a GEMM itself? If your guess was “GEMMs all the way down”, you’re right. It’s like a weed. (And turning GEMMs into smaller GEMMs is, in fact, quite important – but that’s in the paper, so I won’t talk about it here.)

This concludes our brief foray into dense numerical LA and why HPC people are so obsessed about GEMM performance. Note that dense problems are basically the easy case, at least from a high-level point of view; many of the things that are really interesting are huge (millions of equations and variables) but sparse and with exploitable structure, and these take a lot more care from the user, as well as more sophisticated algorithms. (That will nevertheless usually end up calling into a dense LA library for their bulk computations.)

Now that I’ve hopefully satisfyingly answered why GEMM, let’s talk a bit about the actual paper. The presentation I gave you of splitting up a matrix into blocks wasn’t just for notational convenience; that’s how these algorithms tend to work internally. The reason is that large matrices are, well, large. There’s an inherent 2D structure to these algorithms and completely looping over one axis of a giant matrix tends to thrash the cache, which in turn means there are suddenly lots of main memory accesses, and at that point you lose, because current computers can do way more computation per unit of time than they can do memory-to-cache transfers. If you truly want to do high-performance computation, then you have to worry about memory access patterns. (In fact, that’s most of what you do.)

That is something I pushed on the stack earlier in this post: different BLAS levels. This is an old chestnut, but it’s worth repeating: level-1 BLAS operations are vector-vector; something like say a big dot product (DOT in BLAS). Doing a dot product between two N-element vectors is on the order of 2N flops, and 2N memory operations (memops) to load the elements of the two vectors. 1:1 flops to memops – no good! Level-2 BLAS operations are matrix-vector; take say M×N matrix by M-element vector multiplication (GEMV, GEneral Matrix times Vector). Does 2MN flops, M×(N+2) memops (load all matrix elements once, load each vector element once, store each vector element once); closer to 2:1 flops to memops, which is an improvement, but still bad if you have the SIMD units to compute 32 single-precision flops per cycle and core and the main memory bandwidth to load half a float per cycle and core (the size of this gap is similar for CPUs and GPUs, for what it’s worth). It just means that your performance drops off a cliff once you’re working on data larger than your cache size.

Level-3 BLAS operations like GEMM, however, have 2MNP flops to MN+NP+MP necessary memops (load each matrix element once, store the result). That means the flops to memops ratio can in principle get arbitrarily large if only the matrices are big enough. Which in turn means that high-performance GEMMs are all about making sure that they do, in fact, reuse matrix elements extensively once they’re in the cache, and making sure that all the different cache levels are happy.

The way this works is interesting and worth studying, and that’s why that paper was on my list. Whew.

31. Bientinesi, van de Geijn-“Formal Correctness and Stability of Dense Linear Algebra Algorithms” (2005; numerical LA)

According to my headings, all of the above was about the matrix multiplication paper. Well, what can I say? I was lying.

That whole business with deriving our LU decomposition by partitioning our matrix into blocks, writing down equations for the individual block elements, and then winging our way towards a solution? That’s, essentially, this paper. Except the real paper is a lot more rigorous and consequently with a lot less “winging it”.

Partitioning matrices into blocks for fun and profit is a numerical linear algebra mainstay. I saw it done for a few algorithms at university. The work of this group at UT Austin (especially the stuff following from this paper) is what made me realize just how general and systematic it can be when it’s done right.

For a large class of dense LA algorithms, this procedure is well-specified enough to derive a working algorithm automatically from a problem description, complete with correctness and numerical stability analysis, within seconds; no inspiration required. It’s an algorithm-derivation algorithm. For a very limited (and fairly rigidly structured) problem domain, but still!

This is really cool and I like it a lot.

Papers I like (part 1)

This started out on Twitter and I expected having to write up maybe 50 or so, but then it made the rounds. I don’t think I have 1200+ papers that I’d recommend reading, and even if I did, I don’t think a reading list of that length is actually useful to anyone.

The original list had tweet-length comments. I have no such space limitation on this blog, so I’ll go ahead and write a few paragraphs on each paper. That does mean that we’re now back to lots of text, which means I’ll split it into parts. Such is life. In this post, I’ll do the first ten papers from my list.

One thing I noticed while writing the list is that the kind of paper I like most is those that combine solid theory with applications to concrete problems. You won’t find a lot of pure theory or pure engineering papers here. This is also emphatically not a list of “papers everyone should read” or any such nonsense. These are papers I happen to like, and if you share some of my interests, I think you’ll find a lot to like in them too. Beyond that, no guarantees.

1. Lamport-“State the Problem Before Describing the Solution” (1978; general)

It’s a one-page memo, and yeah, the gist of it’s already in the title. Read it anyway. This applies to papers and code both.

This is not just a matter of style. If you don’t have a description of the problem independent of your proposed solution, usually the best you can say about your solution (in terms of proofs or invariants) is “it does what I think it does”. In code, the equivalent is a long, complicated process that everybody is afraid of touching and nobody is really sure what it does – just that things tend to break when you touch it.

2. Herlihy-“Wait-free synchronization” (1991; concurrency)

This is probably the most important single paper ever written about shared-memory multiprocessing. When it was written, such systems were nowhere near as common as they are now and each architecture had its own set of atomic synchronization primitives – chosen primarily by whatever the kernel programmers at the respective companies thought might come in handy. Pretty much everyone had a decent way to implement a basic spinlock, but beyond that, it was the wild west.

Herlihy had, in joint work with Wing, formally defined linearizability a few years earlier, in 1987 – and if it feels odd to you that we first got an “industrial-strength” definition of, effectively, atomicity a few decades after the first multi-processor supercomputers appeared, that should tell you everything you need to know about how far theory was lagging behind practice at the time.

That’s the setting for this paper. Mutual exclusion primitives exist and are well-known, but they’re blocking: if a thread (note: this description is anachronistic; threads weren’t yet a mainstream concept) holding a lock is delayed, say by waiting for IO, then until that thread is done, nobody else who wants to hold that lock will make any progress.

Herlihy starts by defining wait-freedom: “A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of other processors”. (This is stronger than nonblocking aka “lock-free”, which guarantees that some thread will always be making progress, but can have any particular thread stuck in a retry loop for arbitrarily long).

Given that definition, he asks the question “with some atomic operations as primitives, what kinds of concurrent objects (think data structures) can we build out of them?”, and answers it definitively via consensus numbers (short version: virtually all primitives in use at the time can only solve 2-processor consensus wait-free, but compare-and-swap has a consensus number of infinity, i.e. it can be used to achieve consensus between an arbitrary number of processors wait-free).

The paper then goes on to show that, if the consensus number of available primitives is at least as high as the number of processor in a system, then in fact any operation can be done wait-free (universality). Note that this type of universal construction has high overhead and is therefore not particularly efficient, but it’s always possible.

This is a great paper and if you’re interested in concurrency, you should read it.

3. Cook – “How complex systems fail” (1998; complex systems)

This is about all kinds of complex systems, not just computer-related ones. Cook is a MD researching incidents in emergency medicine; this four-page write-up (the fifth page is bibliography) summarizes some of the most salient findings. It’s short, non-technical and relevant to everyone interacting with complex systems in their daily lives (which is, in modern society, everyone).

4. Moffat, Turpin – “On the Implementation of Minimum Redundancy Prefix Codes” (1997; data compression)

My tweet-length summary was “Much has been written about Huffman coding, a lot of it wrong. Almost everything worth knowing is in this (short!) paper”. I have no space limit here, so let me explain what I mean by that: the “folk understanding” of Huffman coding is that you start by building a data structure, the Huffman tree, using Huffman’s algorithm, then you transfer a description of the tree “somehow” (imagine vigorous hand-waving here), and then the encoder produces the bit stream by encoding paths from the root to the leaves (corresponding to individual symbols) in the Huffman tree, and the decoder works by reading a bit at a time and walking down the appropriate branch of the tree, producing the right symbol and going back to the root when it hits a leaf node.

Sounds about right to you? It shouldn’t, because every single part of what I just said is either a bad idea or outright wrong in practice. Let’s start with the trees. You don’t actually want a binary tree data structure for your code – because neither encoder nor decoder should ever be working a bit at a time. In fact, the preferred approach is generally to use canonical Huffman coding which not only never builds a tree, but also doesn’t care about the codes assigned during the original Huffman tree-building process. All it cares about is the code lengths, which it uses to build an equivalent (in terms of compression ratio) encoding that is easier to decode (and also easier to transmit efficiently in the bit stream, because it has fewer redundant degrees of freedom).

You might also not want to use Huffman’s algorithm directly; actual Huffman code words for rare symbols can become really long. In practice, you want to limit them to something reasonable (I don’t know any format offhand that allows codes longer than about 20 bits; 16 bits is a typical limit) to simplify bit IO in the encoder and decoder. So to avoid over-long codes, there’s eiher some post-processing that essentially “rotates” parts of the tree (making some longer codes shorter and some shorter codes longer), or code lengths are constructed using a different algorithm altogether.

This paper gets all these things right, and that’s why it’s the one I would recommend to anyone who wants to learn how to do Huffman coding right. As a final note, let me quote a passage from the conclusion of the paper:

In particular, we observe that explicitly tree-based decoding is an anachronism and usually best avoided, despite the attention such methods have received in textbooks, in the research literature, and in postings to the various network news groups.

It’s been 20 years; this is just as true now as it was then.

5. Dybvig, Hieb, Butler – “Destination-Driven Code Generation” (1990; compilers)

I also linked to Millikin’s “One-pass Code Generation in V8” as a companion piece.

Code generation for a simple stack-based virtual machine is really easy; it’s little more than a post-order traversal of the abstract syntax tree and some label management. If execution speed isn’t much of a concern, at that point you write a bytecode interpreter for said VM and call it a day.

But what if you do care about speed? Well, the next step is to generate actual machine code that still maintains the explicit stack (to get rid of the interpreter loop). Once you do that, it tends to become clear how much redundant stack manipulation and data movement you’re doing, so if you’re still not satisfied with the speed, what are you to do?

The most common answers are either to add some local smarts to the code generator, which gets messy alarmingly quickly (see the Milliken presentation for examples), or to give up and go for a “real” optimizing compiler back end, which is orders of magnitude more code and complexity (and has, not surprisingly, much higher compile times).

DDCG is a great compromise. It’s good at avoiding most of the redundant data movement and unnecessary control flow, while still fitting nicely within a simple one-pass code generator. That makes it a popular technique for JITs that need to generate passable machine code on tight deadlines.

6. Valmari – “Fast brief practical DFA minimization” (2012; theoretical CS/automata theory)

Take a n-state DFA with m transitions and an input alphabet of size k as input, and produce the corresponding minimal (by number of states) DFA – i.e. an equivalent DFA that is smaller (or at worst the same size as the input).

This is a classic automata theory problem; it’s relatively straightforward to design an algorithm that works in O(kn^2) time (usually credited to Moore), and there’s a 1971 algorithm by Hopcroft that solves the problem in O(nk \log n) time. So why am I linking to a 2012 paper?

Hopcroft’s algorithm is complicated. Textbooks tend to omit it entirely because its correctness proof and run-time analysis are considered too hard. Hopcroft’s 1971 paper was followed by a 1973 paper by Gries, “Describing an algorithm by Hopcroft”, that aims to improve the exposition. Nearly 30 years later, Knuutila wrote another follow-up paper, “Re-describing an algorithm by Hopcroft”, that still bemoans the lack of a readable exposition and gives it another go, spending 31 pages on it, and showing along the way that Hopcroft and Gries do not actually describe quite the same algorithm – and that his new algorithm is slightly different still (intentionally this time, to make it somewhat easier to analyze).

This is not what successful replication looks like, and a profoundly unsatisfying state of affairs for such a relatively fundamental (and clean) problem.

And that’s why I’m linking to Valmari’s 2012 paper. This paper, unlike most of the previous ones, works correctly with partial DFAs (i.e. those that don’t have transitions defined for every (state,input) pair), includes working code in the body of the paper (about 130 lines of C++, formatted for 42-char lines to fit in the two-column limit, resulting in one-letter variable names for everything; correspondingly it’s quite dense, but readable enough once you get used to the names), and takes a total of 5 pages for the algorithm description, correctness proof, and run-time analysis – the latter is O(n + m \log m); since m \le nk, this is as good as Hopcroft’s algorithm.

The algorithm itself is rather pretty too, working by alternatingly refining a partition on the states and the transitions of the input DFA in a pleasantly symmetric way. The data structures are simple but clever.

It’s taken an extra 41 years, but I think we can actually consider this problem truly solved now.

7. Sarnak, Tarjan – “Planar Point Location Using Persistent Search Trees” (1986; computational geometry/data structures)

Planar point location (locating which polygon of a polygonal subdivision of the plane a query point falls in) is an interesting problem in its own right, but in retrospect, it’s just side dish. What this paper really does is introduce space-efficient persistent search trees, using the “fat node” method. It’s a simple and very versatile idea (section 3 of the paper).

With persistent search trees, building the point location structure itself is a straightforward sweep: do a plane sweep from left to right, keeping track of line segments of the polygonal subdivision currently intersecting the sweep line. Their relative ordering only changes at vertices. The algorithm simply keeps track of the currently active line segments in a persistent red-black tree. Once the sweep is done, the persistent tree encodes all “slabs” between line segments (and thus the polygons between them) for arbitrary x coordinates, and can be queried in logarithmic time. It’s a textbook example of good data structures making for good algorithms.

8. Porter, Duff – “Compositing Digital Images” (1984; computer graphics)

The paper that introduced, well, digital image compositing, premultiplied (also known as “associated”) alpha, and what is now known as the “Porter-Duff operators” (“over” being the most important one by far). All these concepts are fairly important in image processing; sadly the paper under-sells them a bit.

In particular, premultiplied alpha is introduced as little more than a hack that saves a fair amount of arithmetic per pixel. While true, this is one of those cases where applications were ahead of the theory.

It turns out that a much stronger argument for premultiplication isn’t that it saves a bit of arithmetic, but that it solves some important other problems: when image data is stored in pre-multiplied form, (linear) filtering and compositing operations commute – that is, filtering and compositing can be exchanged. This is not the case with non-premultiplied alpha. In particular, scaling non-premultiplied images up or down before compositing tends to introduce fringing at the edges, as color values of transparent pixels (which should not matter) bleed in. Many try to mitigate this by preprocessing say sprite data so that the color values for transparent pixels adjacent to opaque or partially translucent pixels are “close” to minimize the error, but premultiplied alpha is cheaper, simpler to implement and completely free of errors. It’s generally the representation of choice for any linear filtering/interpolation tasks. And it also turns out that compositing operations themselves (if you care about the alpha value of the result) are a lot simpler in premultiplied alpha, whereas non-premultiplied ends up with various complications and special cases.

A good read on these issues are Jim Blinn’s columns in IEEE Computer Graphics and Applications, specifically those on Image Compositing, which are unfortunately pay-walled so I can’t link to them . They are collected in his book “Dirty Pixels” though, which is a recommended read if you’re interested in the topic.

9. Brandt – “Hard Sync Without Aliasing” (2001; DSP)

Probably the most popular class of synthesizers employs what is called “subtractive synthesis”. The idea is fairly simple: start out with an oscillator generating an “interesting” waveform (one with lots of overtones) at a specified frequency, and then proceed to shape the result using filters, which emphasize some of the overtones and de-emphasize others. It’s called subtractive synthesis because we start out with lots of overtones and then remove some of them; the converse (additive synthesis) builds up the overtone series manually by summing individual sine waves.

Typical good waveforms to use are sawtooth or pulse waves. Generating these seems really simple in code, but there’s a catch: in DSP, we generally represent waveforms by storing their values at regularly spaced discrete sample times; the original continuous signal can be recovered from these samples if and only if it is bandlimited. (This is the famous sampling theorem).

Neither sawtooth waves nor pulse waves are bandlimited. To get a bandlimited (and thus representable) version of them, you have to cut off the overtone series at some point; for sampling, this is the Nyquist frequency (half the sampling rate).

If you don’t, and just sample the continuous non-bandlimited waveforms, the signal you get is similar to what a “proper” bandlimited sawtooth/pulse wave would look like, but it’s got a lot of other noisy junk in it, corresponding to frequencies above Nyquist that weren’t representable and got reflected back into the representable part of the spectrum – this is “aliasing”.

If you’re doing sound synthesis, you want clean signals without such aliasing noise. The problem is that generating them directly is hard in general. Restricted cases (perfect sawtooth or square waves where you never change the frequency or other parameters) are reasonably easy, but that doesn’t help you if you want to tweak knobs describing the shape of the waveform in real time, doing pulse-width modulation, or various other tricks.

One such other trick is “hard sync”; two oscillators, where the first oscillator resets the phase of the second whenever it wraps around. This is done in analog synthesizers and the resulting sound is harmonically quite interesting, but the resulting waveform is complex enough to defeat straightforward attempts at description using additive synthesis or similar.

It would all be much simpler if you could work with the convenient non-bandlimited saw/pulse waves, where everything is easy, and just subtract the aliasing later. Well, turns out that because sampling and everything else involved in this chain is linear, you can in fact do exactly that.

What this paper does is introduce what’s called “MinBLEPs” – minimum-phase BandLimited stEPs. You simply work with the convenient continuous functions, and whenever a discontinuity is introduced, you insert a corresponding MinBLEP, which cancels out the aliasing. It’s a beautiful solution to a long-standing practical problem.

10. Veach – “Robust Monte Carlo Methods for Light Transport Simulation” (1997; computer graphics)

This one’s a PhD thesis, not a paper. I’ve been active in computer graphics for a long time, but there aren’t many graphics papers linked in here. That’s partly because I have many interests outside of CG, but also due to the nature of computer graphics research, which tends to be rapid-paced and very incremental. That makes it hard to single out individual papers, for the most part.

Veach’s PhD thesis sticks out, though. When it was initially published, Veach’s research was groundbreaking. Now, nearing its 20th anniversary (it was published in December of 1997), it’s probably better summarized as “classic”, with important contributions to both theory and algorithms.

This is not light reading, but it’s definitely a must-read for anyone interested in unbiased rendering.

Memory bandwidth

Absolute memory bandwidth figures tend to look fairly large, especially for GPUs. This is deceptive. It’s much more useful to relate memory bandwidth to say the number of clock cycles or instructions being executed, to get a feel for what you can (and can’t) get away with.

Let’s start with a historical example: the MOS 6502, first released in 1975 – 42 years ago, and one of the key chips in the microcomputer revolution. A 6502 was typically clocked at 1MHz and did a 1-byte memory access essentially every clock cycle, which are nice round figures to use as a baseline. A typical 6502 instruction took 3-5 cycles; some instructions with more complex addressing modes took longer, a few were quicker, and there was some overlapping of the fetch of the next instruction with execution of the current instruction, but no full pipelining like you’d see in later (and more complex) workstation and then microcomputer CPUs, starting around the mid-80s. That gives us a baseline of 1 byte/cycle and let’s say about 4 bytes/instruction memory bandwidth on a 40-year old CPU. A large fraction of that bandwidth went simply into fetching instruction bytes.

Next, let’s look at a recent (as of this writing) and relatively high-end desktop CPU. An Intel Core i7-7700K, has about 50GB/s and 4 cores, so if all 4 cores are under equal load, they get about 12.5GB/s each. They also clock at about 4.2GHz (it’s safe to assume that with all 4 cores active and hitting memory, none of them is going to be in “turbo boost” mode), so they come in just under 3 bytes per cycle of memory bandwidth. Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle. So well-optimized code running with all cores busy has about 1 byte/instruction of available memory bandwidth. Note that we’re 40 years of Moore’s law scaling later and the available memory bandwidth per instruction has gone down substantially. And while the 6502 is a 8-bit microprocessor doing 8-bit operations, these modern cores can execute multiple (again, usually up to three) 256-bit SIMD operations in one cycle; if we treat the CPU like a GPU and count each 32-bit vector lane as a separate “thread” (appropriate when running SIMT/SPMD-style code), then we get 24 “instructions” executed per cycle and a memory bandwidth of about 0.125 bytes per cycle per “SIMT thread”, or less unwieldy, one byte every 8 “instructions”.

It gets even worse if we look at GPUs. Now, GPUs generally look like they have insanely high memory bandwidths. But they also have a lot of compute units and (by CPU standards) extremely small amounts of cache per “thread” (invocation, lane, CUDA core, pick your terminology of choice). Let’s take the (again quite recent as of this writing) NVidia GeForce GTX 1080Ti as an example. It has (as per Wikipedia) a memory bandwidth of 484GB/s, with a stock core clock of about 1.48GHz, for an overall memory bandwidth of about 327 bytes/cycle for the whole GPU. However, this GPU has 28 “Shading Multiprocessors” (roughly comparable to CPU cores) and 3584 “CUDA cores” (SIMT lanes). We get about 11.7 bytes/cycle per SM, so about 4x what the i7-7700K core gets; that sounds good, but each SM drives 128 “CUDA cores”, each corresponding to a thread in the SIMT programming model. Per thread, we get about 0.09 bytes of memory bandwidth per cycle – or perhaps less awkward at this scale, one byte every 11 instructions.

That, in short, is why everything keeps getting more and larger caches, and why even desktop GPUs have quietly started using tile-based rendering approaches (or just announced so openly). Absolute memory bandwidths in consumer devices have gone up by several orders of magnitude from the ~1MB/s of early 80s home computers, but available compute resources have grown much faster still, and the only way to stop bumping into bandwidth limits all the time is to make sure your workloads have reasonable locality of reference so that the caches can do their job.

Final disclaimer: bandwidth is only one part of the equation. Not considered here is memory latency (and that’s a topic for a different post). The good news is absolute DRAM latencies have gone down since the 80s – by a factor of about 4-5 or so. The bad news is that clock rates have increased by about a factor of 3000 since then – oops. CPUs generally hit much lower memory latencies than GPUs (and are designed for low-latency operation in general) whereas GPUs are all about throughput. When CPU code is limited by memory, it is more commonly due to latency than bandwidth issues (running out of independent work to run while waiting for a memory access). GPU kernels have tons of runnable warps at the same time, and are built to schedule something else during the wait; running on GPUs, it’s much easier to run into bandwidth issues.

Rounding up to the nearest integer that’s congruent to k mod N

It’s fairly well-known (among programmers anyway) that say rounding up x to the nearest multiple of 8 can be accomplished using the formula (x + 7) & ~7, and that in general rounding up to the nearest multiple of N (where N is a power of 2) can be accomplished using (x + N - 1) & ~(N - 1). But sometimes you need a slightly generalized version: round up to the nearest value that is congruent to some k \pmod N; for example, this crops up in boundary tag-using memory allocators when the user requests aligned memory. Such allocators put a header before allocated blocks (just before the address returned to the caller). For the user-visible pointer to be aligned by say 32, that header needs to fall at an address that’s off alignment by a specified distance (which brings us to our problem).

It’s not immediately obvious how to adapt the original formula to this case (there is a way; I’ll get to it in a second). Now this is not exactly a frequent problem, nor is there any real need for a clever solution, but it turns out there is a very nice, satisfying solution anyway, and I wanted to write a few words about it. The solution is simply x + ((k - x) & (N - 1)) for power-of-2 N. The basic approach works in principle for arbitrary N, but x + ((k - x) % N) will not work properly in environments using truncated division where taking the modulus of a negative argument can return negative results, which sadly is most of them. That said, in the remainder of this short post I’ll write % N instead of & (N - 1) with a “N needs to be a power of 2” disclaimer anyway, since there’s really nothing about the method that really requires it. Finally, this expression works fine even in overflowing unsigned integer arithmetic when N is a power of 2, but not for non-power-of-2 N.

What I like about this solution is that, once you see it written down, it’s fairly clear that and why it works (unlike many bit manipulation tricks), provided you know the rules of modular arithmetic: x + ((k - x) \bmod N) \equiv x + (k - x) = k \pmod N. We’re adding a non-negative value to x, so it’s clear that the result is ≥ x (provided there is no overflow). And we’re adding the smallest possible value we can to get to a value that’s congruent to k (mod N); I wrote about similar things before in my post “Intervals in modular arithmetic”.

There’s an equivalent expression for rounding down to the nearest value congruent to k (mod N): x - ((x - k) % N) that works (and is easy to prove) the same way.

It’s interesting to consider the case k=0. The round-down variant, x - (x % N), feels fairly natural and is something I’ve seen in “real-world” code more than once. The round-up variant, x + (-x % N) is something I’ve never seen anywhere. Once you throw the k in there, it all makes sense, but without it the expression looks quite odd.

Finally, here’s the aforementioned way to adapt the “regular” round-up formula to produce a value that’s congruent to k (instead of 0) mod N (and we’re back to requiring power-of-2 N here): ((x - k + N - 1) & ~(N - 1)) + k. This uses a different trick from the intervals in modular arithmetic paper: shift the origin around. In this case, we don’t have a formula for arbitrary k, but we do have a formula to round up to the nearest multiple of N. So we first subtract k; in this new shifted coordinate system, we want to round up to the next-larger multiple of N, which we know how to do. And finally, we add back k. It gets the job done, but it’s not as pretty as the other variant (in my opinion anyway), and it takes some thinking to convince yourself that it works at all.

How many x86 instructions are there?

It’s surprisingly hard to give a good answer (the question was raised in this article). It depends on how you count, and the details are interesting (to me anyway).

To not leave you hanging: Intel has an official x86 encoder/decoder library called XED. According to Intel’s XED, as of this writing, there are 1503 defined x86 instructions (“iclasses” in XED lingo), from AAA to XTEST (this includes AMD-specific extensions too, by the way). Straightforward, right?

Well, it depends on what you wanted to count. For example, as per XED, ADD and LOCK ADD are different “instruction classes”. Many assembly programmers would consider LOCK a prefix and LOCK ADD an addition with said prefix, not a distinct instruction, but XED disagrees. And in fact, for the purposes of execution, so do current x86s. An atomic add does very different things from a regular add. The prefix thing crops up elsewhere: is say MOVSD (copy a single 32-bit word) a different “instruction class” from REP MOVSD (block copy several 32-bit words)? XED says yes. But it doesn’t handle all prefixes this way in all contexts. For example, the operand-size prefix (0x66) turns integer instructions operating on 32-bit registers into the equivalent instruction operating on their lower 16-bit halves, but unlike with the REP or LOCK prefixes, XED does not count these as separate instruction classes. If you disagree about any of these choices, your count will come out different.


It all depends on how precisely we define an instruction. Is it something with a distinct mnemonic? Let’s first look at what the article I quoted above says is by far the most common x86 instruction, at 33% of the total sample set: MOV. So let’s look up MOV in the Intel Architecture manuals. And… there are 3 different top-level entries? “MOV—Move”, “MOV—Move to/from Control registers”, “MOV—Move to/from Debug Registers”. The latter are sufficiently distinct from “regular” MOV to rate their own documentation pages, they have completely different instruction encodings (not even in the same encoding block as regular MOV), and they’re privileged instructions, meaning lowly user-mode code isn’t even allowed to execute them. Consequently they’re also extremely rare, and are likely to account for approximately 0% of the test sample. And, sure enough, XED counts them as separate instruction classes (MOV_CR and MOV_DR).

So these instructions may be called MOV, but they’re weird, special snowflakes, and from the processor’s point of view they’re entirely different instructions in a different part of the encoding space and with different rules. Calling them MOV is essentially nothing but syntactic sugar in the official Intel assembly language.

And on the subject of syntactic sugar: some mnemonics are just aliases. For example, SAL (shift arithmetic left) is a long-standing alias for SHL (shift left). Both are just bit shifts; there is no distinction between “arithmetic” and “logical” left shifts like there is between arithmetic and logical right shifts, but the Intel manuals list SAL (with an encoding that happens to be the same as SHL) and all x86 assemblers I’ve ever used accept it. Hilariously, in official Intel syntax, we’re simultaneously miscounting in the other direction, since at least two mnemonics got assigned twice: we already saw the “copy” variant of MOVSD (which has no explicit operands), but there’s also MOVSD as in “move scalar double” (which always has two explicit operands) which is an entirely different instruction (XED calls it MOVSD_XMM to disambiguate, and the same problem happens with CMPSD).

There’s also SSE compares like CMPSD (the two-operand one!) and CMPPS. XED counts these as one instruction each. But they have an 8-bit immediate constant byte that specifies what type of comparison to perform. But disassemblers usually won’t produce the hard-to-read CMPSD xmm0, xmm1, 2; they’ll disassemble that instruction as the pseudo-instruction CMPLESD (compare scalar doubles for lesser-than-or-equal) instead. So is CMPSD one instruction (just the base opcode with an immediate operand), is it 8 (for the 8 different standard compare modes), or something else?

This is getting messy. AT&T syntax to the rescue? Well, it solves some of our problems but also introduces new ones. For example, AT&T adds suffixes to the mnemonics to distinguish different operation widths. What Intel calls just ADD turns into ADDB (8-bit bytes), ADDW (16-bit “words”), ADDL (32-bit “long words”) and ADDQ (64-bit “quadwords”) in x86-64 AT&T syntax. Do we count these as separate? As per Intel syntax, no. As per XED instruction classes, also no. But maybe we consider these distinct enough to count separately after all? Or maybe we decide that if our definition depends on the choice of assembly syntax, of which there are several, then maybe it’s not a very natural one. What does the machine do?

Instruction bytes

Note I haven’t specified what part of the machine yet. This is thorny too. We’ll get there in a bit.

But first, instruction bytes. Let’s look at the aforementioned manual entry for real now: “MOV—Move”. If you check that page out in the current Intel Architecture Software Developer’s Manual, you’ll find it lists no less than thirty-four encodings (not all of them distinct; I’ll get to that). Some of these are more special, privileged operations with special encodings (namely, moves to and from segment registers). This time, XED doesn’t seem to consider segment register loads and stores to be special and lumps them into plain old MOV, but I consider them distinct, and the machine considers them distinct enough to give them a special opcode byte in the encoding that’s not used for anything else, so let’s call those distinct.

That leaves us with 30 “regular” moves. Which are… somewhat irregular: 10 of them are doing their own thing and involve moves between memory and different parts of the RAX (in 64-bit mode) register, all with a special absolute addressing mode (“moffs”) that shows up in these instructions and, to my knowledge, nowhere else. These instructions exist, and again, pretty much nothing uses them. They were useful on occasion in 16-bit mode but not anymore.

This specialness of the accumulator register is a recurring theme in x86. “op (AL/AX/EAX/RAX), something” has its own encoding (usually smaller) and various quirks for a lot of the instructions that go back to the 8086 days. So even though an asssembly programmer might consider say TEST ebx, 128 and TEST eax, 128 the same instruction (and the XED instruction class list agrees here!), these have different opcodes and different sizes. So a lot of things that look the same in an assembly listing are actually distinct for this fairly random reason. Keep that in mind. But back to our MOV!

The remaining 20 listed MOV variants fall into four distinct categories, each of which has 5 entries. These four categories are:

  • “Load-ish” – move from memory or another same-sized register to a 8/16/32/64-bit register.
  • “Store-ish” – move from a 8/16/32/64-bit register to either another register of the same size, or memory.
  • “Load-immediate-ish” – load an integer constant into a 8/16/32/64-bit register.
  • “Store-immediate-ish” – store an integer constant to either a 8/16/32/64-bit memory location, or a register.

All processor have some equivalent of the first three (the “store immediate” exists in some CPU architectures, but there’s also many that don’t have it). Load/store architectures generally have explicit load and store instructions (hence the name), and everyone has some way to load immediates (large immediate constants often require multiple instructions, but not on x86) and to move the content of one register to another. (Though the latter is not always a dedicated instruction.) So other than the fact that our “load-ish” and “store-ish” instructions also support “storing to” and “loading from” a register (in particular, there’s two distinct ways to encode register-register MOVs), this is not that remarkable. It does explain why MOVs are so common in x86 code: “load”, “store” and “load immediate” in particular are all very common instruction, and MOV subsumes all of them, so of course you see plenty of them.

Anyway, we have four operand sizes, and four categories. So why are there five listed encodings per category? Okay, so this is a bit awkward. x86-64 has 16 general-purpose registers. You can access them as 16 full 64-bit registers. For all 16 registers, you can read from (or write to) their low 32-bit halves. Writing to the low 32-bit half zero-extends (i.e. it sets the high half to zero). For all 16 register, you can read from (or write to) their low 16-bit quarter. Writing to the low 16-bit quarter of a register does not zero-extend; the remaining bits of the register are preserved, because that’s what 32-bit code used to do and AMD decided to preserve that behavior when they specced 64-bit x86 for some reason. And for all 16 registers, you can read from (or write to) their low 8-bit eighth (the lowest byte). Writing the low byte again preserves all the higher bytes, because that’s what 32-bit mode did. With me so far? Great. Because now is when it gets weird. In 16-bit and 32-bit mode, you can also access bits 8 through 15 of the A, B, C and D registers as AH, BH, CH and DH. And x86-64 mode still lets you do that! But due to a quirk of the encoding, that works only if there’s no REX prefix (which is the prefix that is used to extend the addressable register count from 8 to 16) on the instruction.

So x86-64 actually has a total of 20 addressable 8-bit registers, in 3 disjoint sets: AL through DL, which can be used in any encoding. AH through DH, which can only be accessed if no REX prefix is present on the instruction. And the low 8 bits of the remaining 12 registers, which can only be accessed if a REX prefix is present.

This quirk is why Intel lists all 8-bit variants twice: once without REX and one with REX, because they can access slightly different parts of the register space! Alright, but surely, other than that, we must have 4 different opcodes, right? One each for move byte, word, doubleword, quadword?

Nope. Of course not. In fact, in each of these categories, there are two different opcode bytes: one used for 8-bit accesses, and one for “larger than 8-bit”. This dates back to the 8086, which was a 16-bit machine: “8-bit” and “16-bit” was all the distinction needed. Then the 386 came along and needed a way to encode 32-bit destinations, and we got the already mentioned operand size prefix byte. In 32-bit mode (handwaving here, the details are a bit more complicated), the instructions that used to mean 16-bit now default to 32-bit, and getting actual 16-bit instrutions requires an operand size prefix. And I already mentioned that 64-bit mode added its own set of prefixes (REX), and this REX prefix is used to upgrade the now default-32-bit “word” instructions to 64-bit width.

So even though Intel lists 5 different encodings of the instructions in each group, all of which have somewhat different semantics, there’s only 2 opcodes each associated to them: “8-bit” or “not 8-bit”. The rest is handled via prefix bytes. And as we (now) know, there’s lots of different types of MOVs that do very different things, all of which fall under the same XED “instruction class”.

Maybe instruction classes is the wrong metric to use? XED has another, finer-grained thing called “iforms” that considers the different subtypes of instructions separately. For example, for the just-discussed MOV, we get this list:


As you can see, that list basically matches the way the instruction encoding works, where 8-bit anything is considered a separate instruction, but size overrides by way of prefixes are not. So that’s basically the rule for XED iforms: if it’s a separate instruction (or a separate encoding), it gets a new iform. But just modifying the size of an existing instruction (for example, widening MMX instructions to SSE, or changing the size of a MOV via prefix bytes) doesn’t.

So how many x86 instructions are there if we count distinct iforms as distinct? Turns out, an even 6000. Is that all of them? No. There are some undocumented instructions that XED doesn’t include (in addition to the several formerly undocumented instructions that Intel at some point just decided to make official). If you look at the Intel manuals, you’ll find the curious “UD2”, the defined “Undefined instruction” which is architecturally guaranteed to produce an “invalid opcode” exception. As the name suggests, it’s not the first of its kind. Its older colleague “UD1” half-exists, but not officially so. Since the semantics of UD1 are exactly the same as if it was never defined to begin with. Does a non-instruction that is non-defined and unofficially guaranteed to non-execute exactly as if it had never been in the instruction set to begin with count as an x86 instruction? For that matter, does UD2 itself, the defined undefined instruction, count as an instruction?

Instruction decoders

But back to those iforms: 6000 instructions, huh? And these must all be handled in the decoder? That must be terrible.

Well, no. Not really. I mean, it’s not pleasant, but it’s not the end of the world.

First off, let’s talk about how x86 is decoded in the first place: all x86 CPUs you’re likely to interact with can decode (and execute) multiple instructions per cycle. Think about what that means: we have an (aggressively!) variable-length encoding, and we’re continually fetching instructions. These chips can decode (given the right code) 4 instructions per clock cycle. How does that work? They’re variable-length! We may know where the first instruction we’re looking at in this cycle starts, but how does the CPU know where to start decoding the second, third, and fourth instructions? That’s straightforward when your instructions are fixed-size, but for x86 they are most certainly not. And we do need to decide this quickly (within a single cycle), because if we take longer, we don’t know where the last instruction in our current “bundle” ends, and we don’t know where to resume decoding in the next cycle!

You do not have enough time in a 4GHz clock cycle (all 0.25ns of it) to fully decode 4 x86 instructions. For that matter, you don’t even have close to enough time to “fully
decode” (what exactly that means is fuzzy, and I won’t try to make it precise here) one. Two basic ways to proceed: the first is simply, don’t do that! Try to avoid it at all cost. Keep extra predecoding information (such as marking the locations where instructions start) in your instruction cache, or keep a separate decoded cache altogether, like Intels uOp caches. This works, but it doesn’t help you the first time round when you’re running code that isn’t currently cached.

Which brings us to option two: deal with it. And the way to do it is pretty much brute force. Keep a queue of upcoming instruction bytes (this ties in with branch target prediction and other things). As long as there’s enough space in there, you just keep fetching another 16 (or whatever) instruction bytes and throw them into the queue.

Then, for every single byte position in that queue, you pretend that an x86 instruction starts at that byte, and determine how long it is. Just the length. No need to know what the instruction is. No need to know what the operands are, or where the bytes denoting these operands are stored, or whether it’s an invalid encoding, or if it’s a privileged instruction that we’re not allowed to execute. None of that matters at this stage. We just want to know “supposing that this is a valid instruction, what is it’s length?”. But if we add 16 bytes per cycle to the queue, we need 16 of these predecoders in parallel to make sure that we keep up and get an instruction length for every single possible starting location. We can pipeline these predecoders over multiple cycles if necessary; we just keep fetching ahead.

Once our queue is sufficiently full and we know that size estimate for every single location in it, then we decide where the instruction boundaries are. That’s the stage that keeps track. It grabs 16 queue entries (or whatever) starting at the location for the current instruction, and then it just needs to “switch through”. “First instruction says size starting from there is 5 bytes, okay; that means second instruction is at byte 5, and the queue entry says that one’s 3 bytes; okay, third instruction starts at byte 8, 6 bytes”. No computation in that stage, just “table lookups” in the small size table we just spent a few cycles computing.

That’s one way to do it. As said, very much brute force, but it works. However, if you need 16 predecoders (as you do to sustain a fetch rate of 16 bytes/cycle), then you really want these to be as dumb and simple as you can possibly get away with. These things most certainly don’t care about 6000 different iforms. They just squint at the instruction just enough to figure out the size, and leave the rest for later.

Luckily, if you look at the actual opcode map, you’ll see that this is not all that bad. There’s large groups of opcodes that all have basically the same size and operands, just with different operations – which we don’t care about at this stage at all.

And this kind of pattern exists pretty much everywhere. For example, look at that conspicuous, regular block of integer ALU instructions near the top of the opcode map. These all look (and work) pretty similar to the CPU. Most of them have essentially the same encodings (except for a few opcode bits that are different) and the same operand patterns. In fact, the decoder really doesn’t care whether it’s an OR, an ADD, a CMP, or a XOR. To an assembly-language programmer, a compiler, or a disassembler, these are very different instructions. To the CPU instruction decoder, these are all pretty much the same instruction: “ALU something-or-other mumble-mumble don’t care”. Which one of these gets performed will only be decided way later (and probably only after that operation make it to the ALU itself). What the decoder cares about is whether it’s an ALU instruction with an immediate operand, or if it has a memory operand, and what that memory operand looks like. And the instructions are conveniently organized in groups where the answers to these questions are always the same. With plenty of exceptions of course, because this is still x86, but evidently it can be made to work.

Further down the pipe

Instructions really don’t get decoded all at once, in one big “switch statement”, and after that they go to disjoint parts of the chip never to meet again. That’s not how these things are built. There’s plenty of similarity between different instructions, and the “understanding” of what an instruction does is distributed, not centralized.

For example, for the purposes of most of the instruction decoder, the SSE2 instructions ADDPS, SUBPS, MULSD and DIVPD are all pretty much the same thing. They’re FP ALU instructions, they accept the same types of operands, all of which are in the same place.

Some of these instructions are so similar that they’re almost certain to never fully get “decoded”. For example, for IEEE floats, a subtraction is literally just an addition where the sign bit of the second operand is flipped. If you look at the opcode table, the difference between the encoding for ADDPS and SUBPS is precisely one flipped bit: that bit is clear for ADDPS and set for SUBPS. Literally all you need to do to support both instructions is to decode them the same, make sure to grab that one bit from the instruction, and then feed it (along with the original second operand sign bit) into a single XOR gate in front of the FP adder. That’s it. You now support both floating point addition and subtraction.

Some of these differences matter more. For example, FP multiplies go to a different functional unit than FP adds, and they have a different latency. So the data needs a different routing, and the latency for say an add and a multiply is different, which the schedulers care about. If there’s a memory load involved, then the load unit needs to know what size of access, and what part of the operand bypass network to send the results to (integer, float/SIMD?). So there’s a bunch of control signals computed eventually that express the differences between all these instructions. But a lot of it happens really late. There’s certainly no big monolithic 6000-case “switch statement” anywhere.

And then there’s further differences still. For example, MOV elimination. Many x86s can in many cases avoid real execution of register-register MOVs altogether. They just resolve it as part of their register renaming. Likewise, zeroing a register by XORing it with itself (something the author of the original article I linked to) gets resolved by renaming that register to point to a hard-wired zero register and likewise doesn’t actually take any execution resources (even though it still needs to decode).

Where does that fit in our taxonomy? MOV rax, rbx will most often take 0 cycles, but sometimes take 1 cycle due to various reasons. Does the 0-cycle version, when it happens, count as a special instruction? Is XOR rax, rax (which goes down the magic implicit zeroing path and takes 0 cycles to execute) a different instruction from XOR rax, rcx which is encoded essentially the same way? These two instructions differ by exactly 1 bit in both the assembly-language source file and the assembled object code, yet execute in drastically different ways and with different latencies. Should that make them a separate instruction or not? The most useful answer really depends on what part of the pipeline you’re interested in. If you’re designing a CPU core, they pretty much are separate instructions. If you’re writing a disassembler, they are not.

In conclusion…

So, is there a point to all this? I wrote it because I think it’s fun, but is there something to learn here?

I think so. It makes a wonderful example for a general phenomenon I’ve encountered in a lot of different situations: questions to which a ballpark answer is fairly easy to give, but that keeps getting gnarlier the more you try to pin it down. It’s essentially an instance of the “coastline paradox“: the closer you look, the more detail you see, and the more the answer changes.

Suppose I ask you “where am I?”, and I’m okay with getting an answer that’s within about 10 meters or so. If you have a handheld GPS unit, you can just hand it to me, and if I look at the display I’ll get an answer. If I ask “where am I, down to the millimeter?”, things get a lot more complicated. Specifying the position of a person down to a meter or so makes sense, but specifying it down to a millimeter does not. Position of what exactly? My center of gravity? The position of my center of gravity, projected onto the ground? The position of the tip of my nose? The center point of the hangnail on my left pinky? You can’t answer that question precisely when the uncertainty inherent in the question is so much larger than the level of precision you’re aiming at.

And by the way, I used x86 as an example here, but don’t believe for a second the same thing doesn’t apply to, say, the ARM chip in your phone. Modern ARM chips support multiple encodings and also rank over 1000 instructions if you count them at the same level of granularity as XEDs “iforms”. In fact it’s pretty easy to get high instruction counts on any architecture as soon as any kind of vector/SIMD instruction set is involved, since most of them basically come in the form of “instantiate these 40 instructions for 10 different data types” (with lots of special magic that is either typeless only works on certain types, of course). And yeah, x86 has plenty of historical warts in its encoding, but so does ARM – many of them on display in the current generation of chips, where chip makers have the pleasurable task of designing 3 distinct instruction decoders: old-school 32-bit ARM or “A32”, the more compact but variable-size Thumb-2 or “T32”, and the fixed-size-again 64-bit “A64” encoding.

Why do CPUs have multiple cache levels?

This is a reader question from “jlforrest” that seems worth answering in more detail than just a single sentence:

I understand the need for a cache but I don’t understand why there are multiple levels of cache instead of having just one larger level. In other words, let’s say the L1 cache is 32K, the L2 cache is 256K, and the L3 cache is 2M, why not have a single 32K + 256K + 2M L1 cache?

The short version is that the various cache levels have very large variations in how they are designed; they are subject to different constraints and fulfill different purposes. As a rule of thumb, as you climb up the levels of the cache hierarchy, the caches get larger, slower, higher density (more bits stored per unit area), consume less power per bit stored, and gain additional tasks.

In the hopes of building some intuition here, let’s start with an elaborate and somewhat quaint analogy. That’s right, it is…

Cache story time!

Suppose you’re a white-collar office worker in some unnamed sprawling 1960s bureaucracy, with no computers in sight, and your job involves a lot of looking at and cross-referencing case files (here being folders containing sheets of paper).

You have a desk (the L1 data cache). On your desk are the files (cache lines) you’re working on right now, and some other files you recently pulled and are either done with or expect to be looking at again. Working with a file generally means looking at the individual pages it contains (corresponding to bytes in a cache line). But unless they’re on your desk, files are just treated as a unit. You always grab whole files at a time, even if there’s only one page in them that you actually care about right now.

Also in the office is a filing cabinet (L2 cache). That cabinet contains files you’ve handled recently, but aren’t using right now. When you’re done with what’s on your desk, most of these files will go back into that filing cabinet. Grabbing something from the cabinet isn’t immediate – you need to walk up there, open the right drawer and thumb through a few index cards to find the right file – but it’s still pretty quick.

Sometimes other people need to look at a file that’s in your cabinet. There’s a guy with a cart, Buster (representing a sort of ring bus) who just keeps doing his rounds through all the offices. When an office worker needs a file they don’t have in their private cabinet, they just write a small request slip and hand it to Buster. For simplicity’s sake, let’s just say Buster knows where everything is. So the next time he comes by your office, Buster will check if someone requested any files that are in your cabinet, and if so will just silently pull these files out of the cabinet and put them on his cart. The next time he comes by the office of whoever requested the file, he’ll drop the file in their cabinet, and leave a receipt on the desk.

Every once in a while, Buster notices a requested file isn’t in the cabinet, but on the desk instead. In that case, he can’t just silently grab it; he needs to ask the worker at the desk whether they’re done with it, and if no, that worker and the one who put in the request need to agree on what to do. There are tediously elaborate corporate protocols on what to do in that situation (meetings will be called for sure!).

The filing cabinets are usually full. That means Buster can’t just put a new file in; he has to make space first, which he does by grabbing another file, preferably one that hasn’t been used in a while. Those files, Buster takes to the archive in the basement (L3 cache). In the basement, groups of files are kept densely packed in cardboard boxes on industrial shelving units. The regular office workers don’t get to come down here at all; it’s well out of their way and they’re not familiar with the details of the filing system. We leave that to the archival clerks.

Whenever Buster gets down here, he drops all the old files he grabbed for archival in the “in” tray at the front desk. He also drops off all the request slips for files that aren’t in any of the filing cabinets upstairs, but he doesn’t wait around until the clerks bring the files back, since it takes a while. They’re just going to take the request slips, pick up the file in question, and drop it off in the “out” tray whenever they’re done. So every time Buster comes around, he grabs whatever is in the “out” tray, puts it on his cart and takes it to the recipient when he next comes by.

Now, the problem is, there’s a lot of these files, and even with the efficient packing, they’re not even close to fitting in the basement. Most of the files are kept off-site; this is an office building in a nice part of town, and rents over here are way too high to spend that much space on storage. Instead, the company rents warehouse space 30 minutes out of town, where most of the old files are kept (this corresponds to DRAM). At the front desk of the basement sits Megan, the Head Archivist. Megan keeps track of which files are kept in the basement, and which are warehoused. So when Buster drops his request slips in the “in” tray, she checks which of them correspond to files in the basement (to be handled by the archival clerks) and which aren’t on-site. The latter just get added to a big pile of requests. Maybe once or twice a day, they send a van to the warehouse to grab the requested files, along with a corresponding number of old files to be mothballed (as said, the basement is full; they need to make space before they can store any more files from the warehouse).

Buster doesn’t know or care about the details of the whole warehousing operation; that’s Megan’s job. All he knows is that usually, those request slips he hands to the archive are handled pretty quickly, but sometimes they take hours.

Back to the original question

So, what’s the point of this whole elaborate exercise? Briefly, to establish a more concrete model than an opaque “magic cache” that allows us to think more clearly about the logistics involved. The logistics are just as important in designing a chip as they are in running an efficient office.

The original question was “why don’t we build a single large cache, instead of several small ones”. So if you have say a quad-core machine with 4 cores that each have 32KB L1 data cache, 256KB L2 cache, plus 2MB of shared L3 cache, why don’t we just have a ~3MB shared cache to begin with?

In our analogy: for pretty much the same reason we have individual office desks that are maybe 1.50m wide, instead of seating four different people at a single enormous desk that is 150m wide.

The point of having something on the desk is that it’s within easy reach. If we make the desk too big, we defeat that purpose: if we need to walk 50m to get the file we want, the fact that it’s technically “right here on our desk” isn’t really helping. The same goes for L1 caches! Making them bigger makes them (physically) larger. Among other things, this makes accessing them slower and consume more energy (for various reasons). L1 caches are sized so they’re large enough to be useful, but small enough so they’re still fast to access.

A second point is that L1 caches deal with different types of accesses than other levels in the cache hierarchy. First off, there’s several of them: there’s the L1 data cache, but there’s also a L1 instruction cache, and e.g. Intel Core CPUs also have another instruction cache, the uOp cache, which is (depending on your point of view) either a parallel L1 instruction cache or a “L0 instruction cache”.

L1 data caches gets asked to read and write individual items that are most commonly between 1 and 8 bytes in size, somewhat more rarely larger (for SIMD instructions). Cache levels higher up in the hierarchy don’t generally bother with individual bytes. In our office analogy, everything that’s not on a desk somewhere is just handled at the granularity of individual files (or larger), corresponding to cache lines. The same is true in memory subsystems. When a core is performing a memory access, you deal in individual bytes; the higher-level caches generally handle data wholesale, one cache line at a time.

L1 instruction caches have quite different access patterns than data caches do, and unlike the L1 data cache, they’re read-only as far as the core is concerned. (Writing into the instruction cache generally happens indirectly, by putting the data in one of the higher-level unified caches, and then having the instruction cache reload their data from there). For these (and other) reasons, instruction caches are generally built quite differently from data caches; using a single unified L1 cache means that the resulting design needs to meet several conflicting design criteria, forcing compromises that make it worse at either purpose. A unified L1 cache also needs to handle both the instruction and data traffic, which is quite the load!

As an aside: as a programmer, it’s easy to ignore how much cache bandwidth is needed to fetch instructions, but it’s quite a lot. For example, when not running code from the uOp cache, all Intel Core i7 CPU cores can fetch 16 bytes worth of instructions from the L1 instruction cache, every cycle, and will in fact keep doing so as long as instruction execution is keeping up with decoding. At 3GHz, we’re talking on the order of 50GB/s per core here, just for instruction fetches – though, granted, only if the instruction fetch unit is busy all the time, instead of being stalled for some reason or other. In practice, the L2 cache usually only sees a small fraction of this, because L1 instruction caches work quite well. But if you’re designing a unified L1 cache, you need to anticipate at least bursts of both high instruction and high data traffic (think something like a fast memcpy of a few kilobytes of data with both source and destination in the L1 data caches).

This is a general point, by the way. CPU cores can handle many memory accesses per cycle, as long as they all hit within the L1 caches. For a “Haswell” or later Core i7 at 3GHz, we’re talking aggregate code+data L1 bandwidths well over 300GB/s per core if you have just the right instruction mix; very unlikely in practice, but you still get bursts of very hot activity sometimes.

L1 caches are designed to be as fast as possible to handle these bursts of activity when they occur. Only what misses L1 needs to be passed on to higher cache levels, which don’t need to be nearly as fast, nor have as much bandwidth. They can worry more about power efficiency and density instead.

Third point: sharing. A big part of the point of having individual desks in the office analogy, or per-core L1 caches, is that they’re private. If it’s on your private desk, you don’t need to ask anyone; you can just grab it.

This is crucial. In the office analogy, if you were sharing a giant desk with 4 people, you can’t just grab a file. That’s not just your desk, and one of your other 3 co-workers might need that file right now (maybe they’re trying to cross-reference it with another file they just picked up at the other end of the desk!). Every single time you want to pick up something, you need to yell out “everyone OK if I grab this?”, and if someone else wanted it first, you have to wait. Or you can have some scheme where everyone needs to grab a ticket and wait in line until it’s their turn if there’s a conflict. Or something else; the details don’t matter much here, but anything you do requires coordination with others.

The same thing applies with multiple cores sharing a cache. You can’t just start stomping over data unannounced; anything you do in a shared cache needs to be coordinated with all the others you’re sharing it with.

That’s why we have private L1 caches. The L1 cache is your “desk”. While you’re sitting there, you can just go ahead and work. The L2 cache (“filing cabinet”) handles most of the coordination with others. Most of the time, the worker (the CPU core) is sitting at the desk. Buster can just come by, pick up a list of new requests, and put previously requested files into the filing cabinet without interrupting the worker at all.

It’s only when the worker and Buster want to access the filing cabinet at the same time, or when someone else has requested a file that’s lying on the worker’s desk, that they need to stop and talk to each other.

In short, the L1 cache’s job is to serve its CPU core first and foremost. Because it’s private, it requires very little coordination. The L2 cache is still private to the CPU core, but along with just caching, it has an extra responsibility: deal with most bus traffic and communication instead of interrupting the core (which has better things to do).

The L3 cache is a shared resource, and access to it does need to be coordinated globally. In the office analogy, this worked by the workers only having a single way to access it: namely, by going through Buster (the bus). The bus is a choke point; the hope is that the preceding two cache levels have winnowed down the number of memory accesses far enough that this doesn’t end up being a performance bottleneck.


This article covers one particular cache topology that is matches current desktop (and notebook) x86 CPUs: per-core split L1I/L1D caches, per-core unified L2 cache, shared unified L3 cache, with the cores connected via a ring bus.

Not every system looks like this. Some (primarily older) systems don’t have split instruction and data caches; some have full Harvard architectures that treat instruction and data memory as completely separate all the way through. Often L2s are shared between multiple cores (think one office with one filing cabinet and multiple desks). In this type of configuration, the L2 caches effectively act as part of the bus between cores. Many systems don’t have L3 caches, and some have both L3 and L4 caches! I also didn’t talk about systems with multiple CPU sockets etc.

I stuck with a ring bus because it fits in nicely with the analogy. Ring buses are reasonably common. Sometimes (especially when only two or three blocks need to be connected) it’s a full mesh; sometimes it is multiple ring buses connected with a crossbar (which maps reasonably to the office analogy: a multi-story office building with one “Buster” making his rounds per floor, and an elevator connecting the floors).

As a software developer, there’s a natural tendency to assume that you can magically connect module A to module B and the data just teleports from one end to the other. The way memory actually works is incredibly complicated by now, but the abstraction presented to the programmer is just one of a large, flat array of bytes.

Hardware doesn’t work like that. Pieces aren’t magically connected through an invisible ether. Modules A and B aren’t abstract concepts; they’re physical devices, actual tiny machines, that take up actual physical area on a silicon die. Chips have a “floor plan”, and that’s not a vague nod or an in-joke; it’s an actual 2D map of what goes where. If you want to connect A to B, you need to run an actual, physical wire between them. Wires take up space, and driving them takes power (more the longer they are). Running a bunch of wires between A and B means it’s physically blocking area that could be used to connect other things (yes, chips use wires on multiple layers, but it’s still a serious problem; google “routing congestion” if you’re interested). Moving data around on a chip is an actual logistical problem, and a fiendishly complicated one at that.

So while the office thing is tongue-in-cheek, “who needs to talk to whom” and “how does the geometry of this system look – does it admit a sensible layout?” are very relevant questions for both hardware and overall system design that can have a huge impact. Spatial metaphors are a useful way of conceptualizing the underlying reality.

SSE: mind the gap!

SSE and SSE2 are available in every single x86-family CPU with 64-bit support. You too can play around with SIMD, which is great fun! Unfortunately, SSE2 level in particular also happens to be what is probably the most maddeningly non-orthogonal SIMD instruction set in the world, where operations are either available or not available for particular data types with little rhyme or reason, especially where integers are involved. Later revisions (especially starting around SSE4.1) fill in some of the more annoying gaps, but plenty of us are stuck with supporting the older CPUs for at least a few more years, and besides – not to mess with the authentic SSE experience – even on AVX2-supporting CPUs, there’s still a few of the classic gaps remaining.

So, here’s a list of tricks to get you around some of the more common, eh, “idiosyncrasies” of SSE and its descendants. This happens to be mostly focused on the integer side; the floating-point side is generally less, well, weird. I’ll keep the individual descriptions relatively brief since the whole point of this post is to collect lots of tricks. The assumption here is that you’re already somewhat familiar with the instructions, so I’ll not explain the basics (maybe another time). I’ll use the official Intel intrinsics (as exposed in C/C++) since that’s probably the most common way people interact with these instructions intentionally (awkward glance in the direction of auto-vectorization here. No making eye contact. Moving on.)

Branchless “select” (cond ? a : b)

The natural mode of operation in SIMD computations is to do things branchlessly. If some part of a computation is conditional, rather than doing the equivalent of an if, it’s more typical to do both the computation for the “if” and “else” forks, and then merge the results based on the condition. The “select” I mean is the operation which takes the condition and both results and performs the rough equivalent of C’s ternary operator cond ? a : b. You first evaluate both sides, giving a and b. You then evaluate the condition using a SIMD compare, which returns a vector containing a bit mask that is has all bits set for lanes that meet cond, and all bits clear for lanes that don’t.

This select operation can always be done using a few bitwise operations (which is well known), but starting in SSE 4.1 we get slightly more efficient variants too (less well known, and the reason I mention this):

  • Integer (all vers): _mm_or_si128(_mm_and_si128(a, cond), _mm_andnot_si128(cond, b)).
  • 32-bit float (all vers): _mm_or_ps(_mm_and_ps(a, cond), _mm_andnot_ps(cond, b)).
  • 64-bit float (all vers): _mm_or_pd(_mm_and_pd(a, cond), _mm_andnot_pd(cond, b)).
  • Integer (SSE4.1+): _mm_blendv_epi8(a, b, cond).
  • 32-bit float (SSE4.1+): _mm_blendv_ps(a, b, cond).
  • 64-bit float (SSE4.1+): _mm_blendv_pd(a, b, cond).

The andnot operations don’t come in handy very often, but they’re the best choice here (pre-SSE4.1).

If you don’t want to use cond but its logical negation, just switch the positions of a and b, since (!cond) ? a : b is the same as cond ? b : a.

Unsigned integer compares

SSE, in all incarnations, offers precisely two types of integer comparisons: test for equality (PCMPEQt, _mm_cmpeq_T, where t and T stand for various type suffixes) and test for signed greater-than (PCMPGTt, _mm_cmpgt_T). Most other comparison types can be produced using nothing but logical negation and standard identities:

  • a == b is supported directly.
  • a != b is !(a == b).
  • a > b (signed) is supported directly.
  • a < b (signed) is the same as b > a (swap a and b).
  • a >= b (signed) is !(a < b) (which in turn is !(b > a)).
  • a <= b (signed) is !(a > b).

See previous note on selection operations on how to get rid of the NOT in the most common use case. Conspicuously absent from that list is any type of unsigned ordered comparison. However, a trick that works is to bias both integers so that signed comparison does the right thing:

  • a > b (unsigned, 8-bit) is the same as (a - 0x80) > (b - 0x80) (signed, 8-bit).
  • a > b (unsigned, 16-bit) is the same as (a - 0x8000) > (b - 0x8000) (signed, 16-bit).
  • a > b (unsigned, 32-bit) is the same as (a - 0x80000000) > (b - 0x80000000) (signed, 32-bit).

The same argument-swapping and NOT-ing tricks as above still apply to give you the other compare types. In general, the trick is to add (or subtract, or XOR – they all do the same thing in this particular case) the INT_MIN for the respective type from both operands before doing the compare. This turns the smallest possible unsigned integer, 0, into the smallest possible signed integer for the given type; after that, the ordering works out. In particular, when comparing against a constant, this addition (or subtraction, or XOR) can be baked into the constant operand, so the unsigned compare “only” ends up doing one more operation than a signed compare (instead of two).

A completely different approach is to use the unsigned integer min/max instructions (more about those in a second) to build less-or-equal or greater-or-equal comparisons:

  • a <= b if and only if max(a, b) == b.
  • a >= b if and only if min(a, b) == b.

The good news is that this reduces unsigned comparisons to either an unsigned min or a max, followed by an equality comparison, which is only 2 instead of 3 operations. The bad news is that the requisite unsigned min/max operations only exist for uint8s in SSE2. The uint16/uint32 variants were finally added with SSE4.1; if your minimum target is earlier, you’re stuck with the bias-then-compare variants above.

Integer min and max

SSE4.1 has the full set of integer min/max for 8-, 16- and 32-bit types, both signed and unsigned. So if you’re targeting SSE4.1 or later, good for you!

If you’re stuck with anything earlier, you’re decidedly more limited. In SSE2, you get integer min/max for uint8 and int16. If you need min/max for int8, uint16, or anything 32-bit, you’re on your own.

Luckily, we can just combine some of the techniques above to derive a solution. The general patterns here are:

  max(a, b) == (a > b) ? a : b;
  min(a, b) == (a > b) ? b : a;

So this is just a combination of a compare and a “select” operation. When the compare is signed (the int8 and int32 cases), the comparison maps to a single SSE intrinsic. The unsigned compares (uint16 and uint32) can be solved using the bias-then-signed-compare trick which in turn gives us an unsigned min/max.

32-bit and 64-bit loads/stores

This one has nothing to do with the actual instruction set and everything to do with the intrinsics: yes, SSE2 has 32-bit (MOVD) and 64-bit (MOVQ) loads and stores, the standard intrinsics just do their best to confuse you about it:

  • 64-bit loads are _mm_loadl_epi64. This intrinsic takes a __m128i * as an argument. Don’t take that seriously. The actual load is 64-bit sized, not 128-bit sized, and there is no alignment requirement.
  • 64-bit stores are _mm_storel_epi64. Again, the __m128i * is confusing and does not mean that the actual store is 128-bit or that there are alignment requirements. It isn’t and there are not.
  • 32-bit loads are even more hidden! Namely, you write _mm_cvtsi32_si128(*x) where x is a pointer to a 32-bit integer. No direct load intrinsic, but compilers will turn this into a MOVD with memory operand where applicable.
  • 32-bit stores, likewise: *x = _mm_cvtsi128_si32(value). Now you know.


There’s lots of different ways to provide multiplies in a SIMD instruction set, and by now SSE has tried most of them in one form or another.

Let’s start with the (historically) first variant: multiplying 16-bit numbers. The relevant instructions originated in the Pentium MMX and compute the low and high halves (bottom and top 16 bits) of a signed 16-bit×16-bit product. MMX only has signed multiplies, but SSE also added a “high half of unsigned 16-bit times 16-bit product” instruction (the low halves of signed and unsigned products are identical), so we’re not gonna have to worry about that particular problem, not yet anyway.

These instructions are fine if you want the low or high halves of the product. What if you want the full 32-bit product of vectors of 16-bit values? You compute the low and high halves and then merge them using the “unpack” instructions. This is the standard approach, but not very obvious if you haven’t deal with this kind of thing before. So for a full 16×16→32-bit product (note this produces two vectors worth of results), we get:

  // EITHER: a*b (16-bit lanes), signed
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epi16(a, b);

  // OR: a*b (16-bit lanes), unsigned
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epu16(a, b);

  // THEN: merge results
  __m128i res0 = _mm_unpacklo_epi16(lo16, hi16); // result lanes 0..3
  __m128i res1 = _mm_unpackhi_epi16(lo16, hi16); // result lanes 4..7

But what if you’re working with 32-bit values? There is a 32×32→32-bit product (PMULLD / _mm_mullo_epi32), but it was only added with SSE4.1, and it’s significantly slower than the other SSE2 multiplies in many implementations. So you might either not want to set your minimum target that high, or you might be looking for something quicker.

There’s full 32×32→64-bit products, which are available from SSE2 on as
PMULUDQ/_mm_mul_epu32 (unsigned). SSE4.1 adds the signed equivalent PMULDQ/_mm_mul_epi32 (UPDATE: An older version of this post incorrectly stated that PMULDQ was SSE2. Thanks Exophase for pointing it out!). These ones only compute two products (between the even lanes of the two sources) and place them in a 128-bit result. The odd 32-bit lanes are ignored completely, so if you want four 32×32→32-bit products, you need at least two of these multiplies and a lot of wrangling:

  // res = _mm_mullo_epi32(a, b) equivalent using SSE2, via PMULUDQ.

  // even and odd lane products
  __m128i evnp = _mm_mul_epu32(a, b);
  __m128i odda = _mm_srli_epi64(a, 32);
  __m128i oddb = _mm_srli_epi64(b, 32);
  __m128i oddp = _mm_mul_epu32(odda, oddb);

  // merge results
  __m128i evn_mask = _mm_setr_epi32(-1, 0, -1, 0);
  __m128i evn_result = _mm_and_si128(evnp, evn_mask);
  __m128i odd_result = _mm_slli_epi64(oddp, 32);
  __m128i res = _mm_or_si128(evn_result, odd_result);

It works, but it’s a mouthful.

But what if you’re using 32-bit vector lanes, but happen to know that the numbers we’re trying to multiply are in fact in the range [-32768,32767] (i.e. representable as signed 16-bit integers)? We could try narrowing the 32-bit lanes into 16 bits then using the 16×16→32 sequences above, but is that really the best we can do?

It is not: PMADDWD (_mm_madd_epi16), MMX/SSE2’s amazing and strange (but mostly amazing) dot product operation, has our back, for we can do this:

   // a and b have 32-bit lanes with values that fit in int16s.
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // clears high 16 bits of every 32-bit lane
   __m128i bm = _mm_and_si128(b, _mm_set1_epi32(0xffff));

   // after this, madd_epi16 does what we want!
   __m128i res = _mm_madd_epi16(a, bm);

   // can swap role of a and b above too, when convenient.

That’s a lot shorter than narrowing to 16-bit first would be! Alas, it only works for int16 (signed). What if we’re working in 32-bit lanes with values that fit inside a uint16 (unsigned)? It’s not quite as slick, but still, better than narrowing to 16-bit first or dealing with the logistics when synthesizing 32×32→32-bit muls from PMULDQ/PMULUDQ:

   // a and b have 32-bit lanes with values that fit in uint16s,
   // i.e. a[i] == (uint16)a[i] and same for b[i].
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // compute low and high 16-bit products
   __m128i lop = _mm_mullo_epi16(a, b);
   __m128i hip = _mm_mulhi_epu16(a, b);

   // merge results
   __m128i res = _mm_or_si128(lop, _mm_slli_epi32(hip, 16));

Horizontal adds, dot products etc. (float)

SSE3 adds horizontal adds HADDPS (_mm_hadd_ps) and HADDPD (_mm_hadd_pd) and SSE4.1 throws in the dot-product instructions DPPS (_mm_dp_ps) and DPPD (_mm_dp_pd).

Generally, don’t expect these operations to be magic. They exist in the instruction set but are fast precisely nowhere; in all x86 implementations I’m familiar with, they just turn into a canned sequence of more basic (SSE2-level) operations. So more often that not, you will end up requiring a higher minimum CPU target for little to no speed gain. Caveat: these instructions are a smaller than their replacement instruction sequence, so using them can reduce code size slightly. But still, don’t expect this to be fast.

If you want good SIMD performance, don’t lean on horizontal and dot-product style operations; process data in batches (not just one vec4 at a time) and transpose on input, or use a SoA layout to begin with.

The other kind of horizontal adds, dot products etc. (integer)

SSE does have a bunch of horizontal add and dot product-style operations that don’t suck, but they’re on the integer pipe, and not what you’d expect.

Nope, not PHADDW/PHADDD (_mm_hadd_epi16/_mm_hadd_epi32). These are SSSE3 and later only and OK but not particularly great (similar disclaimer as for the floating-point horizontal adds applies).

No, I’m talking about PMADDWD (_mm_madd_epi16, SSE2 with its ancestor around since the original MMX), PSADBW (_mm_sad_epu8, SSE2) and PMADDUBSW (_mm_maddubs_epi16, SSSE3). The official manuals describe what these instructions do, so I won’t bother going into too much detail, but here’s the basic idea: PMADDWD and PMADDUBSW are 2-element dot-product instructions between pairs of adjacent SIMD lanes. PMADDWD computes two int16 by int16 products for each pair of 16-bit lanes and sums the 32-bit integer products to yield the 32-bit result lanes. PMADDUBSW computes two uint8 by int8 products for each pair of 8-bit lanes and sums the 16-bit integer products to yield the 16-bit result lanes. These can be used to compute dot products of this problem; but they also have “degenerate” configurations that are very useful:

  • _mm_madd_epi16(x, _mm_set1_epi16(1)) sums the 16-bit even and odd lanes of x in pairs to yield 32-bit results.
  • _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), _mm_setr_epi8(1, -1, 1, -1, ..., 1, -1)) happens to be the fastest way to compute the 16-bit signed differences between 8-bit unsigned vectors a and b on processors that support SSSE3.
  • The 16-bit multiply example above shows another special configuration.

Long story short, these dot product instructions are surprisingly versatile in decidedly non-obvious ways.

Finally, PSADBW (_mm_sad_epu8, SSE2). This one is intended for motion estimation in video codecs, but it also happens to be the one actually really fast horizontal add you get on x86. In particular, _mm_sad_epu8(x, _mm_setzero_si128()) computes two 16-bit horizontal sums of groups of 8 uint8 lanes in a single, and quite fast, operation. We can do the same trick we did for compares in reverse to compute the sum of 8 int8s instead: add (or subtract, or XOR) _mm_set1_epi8(-128) to x (before the PSADBW), the subtract 128×8 from the resulting 16-bit sums.

To be continued!

There’s a lot more of these, but this feels like enough to chew on for a single blog post. So there will be a sequel covering, at least, integer widening/narrowing and variable shifts. Until then!