I originally intended to write something different for part 2 of this series, but since I’ve started rewriting that article no less than 3 times at this point, I’m just gonna switch the order of topics around a bit.

### Transpose from even/odd interleaves

I already showed one instance of this last time (“variant 2”) for the 4×4 case, but let’s go at it a bit more systematically. Since we’ve already beaten this size to death, let’s spice things up a bit and do 8×8 this time:

a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
d0 d1 d2 d3 d4 d5 d6 d7
e0 e1 e2 e3 e4 e5 e6 e7
f0 f1 f2 f3 f4 f5 f6 f7
g0 g1 g2 g3 g4 g5 g6 g7
h0 h1 h2 h3 h4 h5 h6 h7


“Variant 2” from last time was the version where we started by interleaving rows not with their immediate neighbors, but with the rows that were two steps away. The key here is that we have to do multiple interleave steps to complete the transpose, and with every even-odd interleave, we space the original elements of a vector further apart. So if we interleaved rows A and B in the first step, we would have a0 b0 after the first step, but a0 xx b0 xx as soon as we interleaved that with something else. That’s not what we want. So instead, we start by interleaving rows A and E – after a total of 3 passes, that should put e0 where it’s supposed to go, 4 elements from a0.

The same argument goes for the other rows, too—so let’s just do an even-odd interleave between the entire first half and the last half of our rows:

a0 e0 a1 e1 a2 e2 a3 e3
a4 e4 a5 e5 a6 e6 a7 e7
b0 f0 b1 f1 b2 f2 b3 f3
b4 f4 b5 f5 b6 f6 b7 f7
c0 g0 c1 g1 c2 g2 c3 g3
c4 g4 c5 g5 c6 g6 c7 g7
d0 h0 d1 h1 d2 h2 d3 h3
d4 h4 d5 h5 d6 h6 d7 h7


For the next step, we want to interleave the row containing a0 with the row containing the element that needs to end up 2 places away from a0 in the final result – namely, c0 – and similarly for the other rows. Which again boils down to doing an even-odd interleave between the entire first and second halves:

a0 c0 e0 g0 a1 c1 e1 g1
a2 c2 e2 g2 a3 c3 e3 g3
a4 c4 e4 g4 a5 c5 e5 g5
a6 c6 e6 g6 a7 c7 e7 g7
b0 d0 f0 h0 b1 d1 f1 h1
b2 d2 f2 h2 b3 d3 f3 h3
b4 d4 f4 h4 b5 d5 f5 h5
b6 d6 f6 h6 b7 d7 f7 h7


At this point we’re just one turn of the crank away from the result we want, so let’s go for it and do one more round of interleaves…

a0 b0 c0 d0 e0 f0 g0 h0
a1 b1 c1 d1 e1 f1 g1 h1
a2 b2 c2 d2 e2 f2 g2 h2
a3 b3 c3 d3 e3 f3 g3 h3
a4 b4 c4 d4 e4 f4 g4 h4
a5 b5 c5 d5 e5 f5 g5 h5
a6 b6 c6 d6 e6 f6 g6 h6
a7 b7 c7 d7 e7 f7 g7 h7


and that’s it, 8×8 matrix successfully transposed.
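If you’d rather not check this with pen and paper, here’s a quick Python sketch of the whole process (mine, not code from this series) – flatten the matrix row-major into a 64-element array and run three whole-array even/odd interleave passes:

```python
# Simulate the 8x8 transpose: three whole-array even/odd interleaves.
def interleave(seq):
    """Even/odd interleave the first half of seq with the second half."""
    m = len(seq) // 2
    out = []
    for i in range(m):
        out.append(seq[i])      # even slots: first half
        out.append(seq[m + i])  # odd slots: second half
    return out

rows = "abcdefgh"
mat = [r + str(c) for r in rows for c in range(8)]  # a0 a1 ... h7, row-major

state = mat
for _ in range(3):
    state = interleave(state)

# row-major layout of the transposed matrix: a0 b0 ... h0, a1 b1 ... h7
assert state == [rows[r] + str(c) for c in range(8) for r in range(8)]
```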

This is nothing I haven’t shown you already, although in a different order than before. This form here makes the underlying algorithm much clearer, and also generalizes in the obvious way to larger sizes, should you ever need them. But that’s not the reason I’m talking about this. Time to get to the fun part!

### Rotations

Let’s look a bit closer at how the elements move during every pass. For this purpose, let’s just treat all of the elements as a 64-element array. The first row contains the first 8 elements, the second row contains the second 8 elements, and so forth. a0 starts out in row zero and column zero, the first element of the array, and stays there for the entire time (boring!). b3 starts out in row 1, column 3 – that’s element number 11 (1×8 + 3). Now, the algorithm above simply applies the same permutation to the overall array 3 times. Let’s look at how the array elements move in one step—for reasons that will become clear in a second, I’ll give the array indices in binary:

From     To
000 000 → 000 000
000 001 → 000 010
000 010 → 000 100
000 011 → 000 110
⋮
001 000 → 010 000
001 001 → 010 010
001 010 → 010 100
⋮
100 000 → 000 001
100 001 → 000 011
100 010 → 000 101
100 011 → 000 111
⋮
101 010 → 010 101
⋮

Since we have 8 rows and 8 columns, the first 3 and last 3 bits in each index correspond to the row and column indices, respectively. Anyway, can you see the pattern? Even-odd interleaving the first half of the array with the second half in effect performs a bitwise rotate left on the element indices!

Among other things, this instantly explains why it takes exactly three passes to transpose an 8×8 matrix with this approach: the row and column indices take 3 bits each. So after 3 rotate-lefts, we’ve swapped the rows and columns—which is exactly what a matrix transpose does. Another salient point is that repeating even-odd interleaves like this will return us to our starting arrangement after 6 passes. This is easy to see once we know that such a step effectively rotates the bits of the element index; it’s not at all obvious when looking at the permutation by itself.
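The rotate-left claim is easy to verify mechanically; here’s a small Python sketch (again mine, not from the original code):

```python
# Check "interleave = rotate left on element indices" for 64 elements.
def interleave(seq):
    m = len(seq) // 2
    out = [None] * len(seq)
    out[0::2] = seq[:m]   # first half lands in the even slots
    out[1::2] = seq[m:]   # second half lands in the odd slots
    return out

def rotl6(i):
    """Rotate a 6-bit index left by one bit."""
    return ((i << 1) | (i >> 5)) & 63

ident = list(range(64))
once = interleave(ident)
for i in range(64):
    assert once[rotl6(i)] == i  # element i moved to position rotl6(i)

# ...and six passes rotate the 6 index bits all the way around:
state = ident
for _ in range(6):
    state = interleave(state)
assert state == ident
```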

But it goes further than that. For one thing, the even/odd interleave construction really works for any even number of elements; it certainly works for all powers of two. So we’re not strictly limited to square matrices here. Say we have a 4×8 matrix (4 rows, 8 columns). That’s 32 elements total, or a 5-bit element index, laid out in binary like r1r0c2c1c0, where r are row index bits and c are column index bits. After two interleave passes, we’re at c2c1c0r1r0, corresponding to the transposed layout—an 8×4 matrix. To go from there back to the original layout, we would have to run three more passes, rotating everything by another 3 bits, back to where we started.

Which illustrates another interesting point: for non-square matrices, going in one direction using this method can be much cheaper than going in the other direction. That’s because the even/odd interleave step only gives us a “rotate left” operation, and sometimes the shorter path would be to “rotate right” instead. On some architectures the corresponding deinterleave operation is available as well (e.g. ARM NEON), but often it’s not, at least not directly.
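Here’s the 4×8 example as an executable Python check (a sketch of mine): two passes transpose it, and undoing that costs three more.

```python
# 4x8 matrix, flattened row-major: element index bits are r1 r0 c2 c1 c0.
def interleave(seq):
    m = len(seq) // 2
    out = [None] * len(seq)
    out[0::2] = seq[:m]
    out[1::2] = seq[m:]
    return out

mat = [(r, c) for r in range(4) for c in range(8)]  # row-major 4x8

state = interleave(interleave(mat))  # two passes: rotate index left by 2
assert state == [(r, c) for c in range(8) for r in range(4)]  # row-major 8x4

# going back means rotating through the remaining 3 bits: 3 more passes
for _ in range(3):
    state = interleave(state)
assert state == mat
```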

### Groups

Let’s step back for a moment, though. We have an operation: even/odd interleave between the first and second halves of sequences with $2^k$ elements, which is really just a particular permutation. And now we know that we can get the inverse operation via repeated interleaving. Which means that our even-odd interleave generates a group. Now, as long as we really only do even-odd interleaves on the complete sequence, this group is a cyclic group of order k—it has to be: it’s a finite group generated by a single element, ergo a cyclic group, and we already know how long the cycle is, based on the “rotate left” property.

So to make matters a bit more interesting, let’s get back to the original topic of this article! Namely, the SIMD bit. While it’s convenient to build a complete matrix transpose out of a single operation on all elements simultaneously, namely a very wide even/odd interleave, that’s not how the actual code looks. We have fixed-width registers, and to synthesize anything wider than that, we have to break the data down into smaller chunks and process them individually anyway. However, we do need to have full even/odd interleaves to get a permutation group structure, so we can’t allow using single “low” or “high” interleave instructions without their opposite half.

What kind of model do we end up with? Let’s list the ingredients. We have:

• A set of n SIMD registers, each k “elements” wide. We’ll assume that k is a power of two. What an “element” is depends on context; a 128-bit SIMD register might be viewed as consisting of 16 byte-sized elements, or 8 16-bit elements, or 4 32-bit elements… you get the idea.
• An even-odd interleave (perfect shuffle) operation between a pair of SIMD registers. For convenience, we’ll assume that the results are returned in the same 2 registers. Implementing this might require another temporary register depending on the architecture, but let’s ignore that for the purposes of this article; we only care about the state of the registers before and after interleaves, not during them.
• Finally, we assume that registers are completely interchangeable, and that we can “rename” them at will; that is, we’ll consider all permutations that can be performed by just renumbering the registers to be equivalent.

To explain the latter, we would consider an arrangement like this:

r0 = a0 b0 c0 d0
r1 = a1 b1 c1 d1


(where r0 and r1 correspond to SIMD registers) to be equivalent to:

r0 = a1 b1 c1 d1
r1 = a0 b0 c0 d0


or even

r3 = a0 b0 c0 d0
r5 = a1 b1 c1 d1


To rephrase it, we don’t care about differences in “register allocation”: as long as we get all the individual rows we need, any order will do.

### What permutations can be generated using interleaves?

This is a question I’ve been wondering about ever since I first saw the original MMX instructions, but I didn’t really spend much time thinking about it until fairly recently, when I got curious. So I don’t have a full answer, but I do have some interesting partial results.

Let’s get the trivial case out of the way first: if k=1 (that is, each register contains exactly one element), then clearly we can reach any permutation we want, and without doing any work to boot—every register contains one value, and as explained above, our model treats register-level permutations as free. However, as soon as we have multiple elements inside a single register, things start to get interesting.

### Permutations generated for n=2, k=2

We’re only permuting n×k = 4 elements here, so the groups in question are all small enough to enumerate their elements on a piece of paper, which makes this a good place to start. Also, with n=2, the “register permutation” side of things is really simple (we either swap the two registers or we don’t). For the even-odd interleave, we would normally have to specify which two registers to interleave—but since we only have two, we can just agree that we always want to interleave register 0 with register 1. Should we want the opposite order (interleave register 1 with register 0), we can simply swap the two registers beforehand. So our available operations boil down to just two permutations on 4 elements:

• Swap registers 0 and 1—this swaps the first two and the last two elements, so it corresponds to the permutation $0123 \mapsto 2301$ or (02)(13) in cycle notation. This is an involution: applying it twice swaps the elements back.
• Even/odd interleave between registers 0 and 1. This boils down to the permutation $0123 \mapsto 0213$ or (12) which swaps the two middle elements and is also an involution.

These are the only operations we permit, so we have a finite group that is generated by involutions: it must be a dihedral group. In fact, it turns out to be Dih4, the symmetry group of a square, which is of order 8. So using 2 registers, we can reach only 8 of the $4! = 24$ permutations of 4 elements. So what happens when we have more registers at our disposal?
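For the skeptical, the group is small enough to enumerate by brute force. A little Python sketch (my own, not part of the original search program):

```python
# Enumerate the group generated by the two involutions for n=2, k=2.
# States are tuples of 4 elements; each op lists the source index per slot.
swap = (2, 3, 0, 1)        # swap registers 0 and 1: 0123 -> 2301
interleave = (0, 2, 1, 3)  # even/odd interleave:    0123 -> 0213

start = (0, 1, 2, 3)
seen = {start}
stack = [start]
while stack:
    s = stack.pop()
    for op in (swap, interleave):
        n = tuple(s[i] for i in op)
        if n not in seen:
            seen.add(n)
            stack.append(n)

assert len(seen) == 8  # Dih4: only 8 of the 24 permutations of 4 elements
```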

### Permutations generated for n>2, k=2

The next smallest case is n=3, k=2, which gives us permutations of 6 elements. $6! = 720$, so this is still small enough to simply run a search, and that’s what I did. Or, to be more precise, I wrote a program that did the searching for me. It turns out that the even-odd interleave of the first two registers combined with arbitrary permutations on the registers (of which there are $3! = 6$) is enough to reach all 720 permutations in S6, the symmetric group on 6 elements. Beyond this, I can’t say much about how this representation of the group works out; it would be nice if there was an easy way to find shortest paths between permutations for example (which would have uses for code generation), but if there is, I don’t know how. That said, my understanding of group theory is fairly basic; I’d really appreciate input from someone with more experience dealing with finite groups here.
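In case you want to reproduce the search, here’s a Python version of it (my reconstruction, not the original program): generators are the even/odd interleave of registers 0 and 1 plus the six free register renames.

```python
from itertools import permutations

# n=3 registers of k=2 elements each; states are 6-tuples.
start = tuple(range(6))
seen = {start}
stack = [start]
while stack:
    s = stack.pop()
    regs = [s[0:2], s[2:4], s[4:6]]
    succs = [(s[0], s[2], s[1], s[3], s[4], s[5])]  # interleave regs 0 and 1
    for p in permutations(range(3)):                # free register renames
        succs.append(regs[p[0]] + regs[p[1]] + regs[p[2]])
    for n in succs:
        if n not in seen:
            seen.add(n)
            stack.append(n)

assert len(seen) == 720  # all of S6 is reachable
```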

I can tell you what happens for n>3, though: we already know we can produce all permutations using only 3 registers. And using the exact same steps, we can reach any permutation of the first 3 registers for n>3, leaving the other registers untouched. That is in fact enough to generate arbitrary permutations of all n registers’ contents, as follows: Say we have n=4, and we start with

r0 = e0 e1
r1 = e2 e3
r2 = e4 e5
r3 = e6 e7


and we want to end up with the (arbitrarily chosen) permutation

r0 = e3 e7
r1 = e0 e5
r2 = e1 e4
r3 = e6 e2


To begin, we try to generate the value we want to end up in r3 (namely, e6 e2). First, we swap rows around so that we have the source values we need in rows 0 and 1 (that is, registers r0 and r1). In our example, that just requires swapping r0 with r3:

r0 = e6 e7
r1 = e2 e3
r2 = e4 e5
r3 = e0 e1


Next, we know that we can reach arbitrary permutations of the first three rows (registers). In particular, we can shuffle things around so that r0 contains the value we’d like to be in r3:

r0 = e6 e2
r1 = e7 e3
r2 = e4 e5
r3 = e0 e1


This is one way of doing it, but really any permutation that has e6 e2 in the first row would work. Anyway, now that we have produced the value we wanted, we can swap it back into r3:

r0 = e0 e1
r1 = e7 e3
r2 = e4 e5
r3 = e6 e2


We now have the value we want in r3, and all the remaining unused source elements remain in rows 0–2. And again, since we know we can achieve arbitrary permutations of 6 elements using 3 registers (the n=3 case), we’re done! For n>4, the proof works in the same way: we first generate rows 3, 4, …, n-1 one by one; once a row is done, we never touch it again (it can’t contain any source elements we need for the remaining rows, since we only allow permutations). In the end we will always arrive at a configuration that has all the remaining “unfinished” elements in rows 0–2, which is a configuration we can solve.

I don’t mean to suggest that this is an efficient way to solve this problem; quite the opposite, in fact. But it’s an easy way to prove that once we have n=3 solved, higher n’s don’t add any substantial difficulty.
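The n=4 case is still just small enough to verify by brute force, too – same Python sketch as before, with one more register:

```python
from itertools import permutations

# n=4 registers of k=2 elements each; states are 8-tuples.
start = tuple(range(8))
seen = {start}
stack = [start]
while stack:
    s = stack.pop()
    regs = [s[2*i:2*i + 2] for i in range(4)]
    succs = [(s[0], s[2], s[1], s[3]) + s[4:]]  # interleave regs 0 and 1
    for p in permutations(range(4)):            # free register renames
        succs.append(regs[p[0]] + regs[p[1]] + regs[p[2]] + regs[p[3]])
    for n in succs:
        if n not in seen:
            seen.add(n)
            stack.append(n)

assert len(seen) == 40320  # 8! -- every permutation of 8 elements
```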

### Summary

This covers the easiest cases, k=1 and k=2, and answers the question I originally wondered about in the positive: using only interleaves, you can produce arbitrary permutations of the input elements in registers, as long as you only have k=2 elements per register (for example, 32-bit values inside the 64-bit MMX registers) and at least 3 SIMD registers worth of storage. Without any nice bounds on the number of operations required or an algorithmic way to compute an optimal interleave/rename sequence, I’m the first one to admit that this has little practical relevance, but it’s cool stuff nonetheless! Coming up, I’ll talk a bit about the (more interesting) k=4 case, but I think this is enough material for a single blog post. Until next time!

This one tends to show up fairly frequently in SIMD code: Matrix transposes of one sort or another. The canonical application is transforming data from AoS (array of structures) to the more SIMD-friendly SoA (structure of arrays) format: For concreteness, say we have 4 float vertex positions in 4-wide SIMD registers

p0 = { x0, y0, z0, w0 }
p1 = { x1, y1, z1, w1 }
p2 = { x2, y2, z2, w2 }
p3 = { x3, y3, z3, w3 }


and would really like them in transposed order instead:

X = { x0, x1, x2, x3 }
Y = { y0, y1, y2, y3 }
Z = { z0, z1, z2, z3 }
W = { w0, w1, w2, w3 }


Note that here and in the following, I’m writing SIMD 4-vectors as arrays of 4 elements – none of this nonsense that some authors tend to do where they write vectors as “w, z, y, x” on Little Endian platforms. Endianness is a concept that makes sense for numbers and no sense at all for arrays, which SIMD vectors are, but that’s a rant for another day, so just be advised that I’m always writing things in the order that they’re stored in memory.

Anyway, transposing vectors like this is one application, and the one I’m gonna stick with for the moment because it “only” requires 4×4 values, which are the smallest “interesting” size in a certain sense. Keep in mind there are other applications though. For example, when implementing 2D separable filters, the “vertical” direction (filtering between rows) is usually easy, whereas “horizontal” (filtering between columns within the same register) is trickier – to the point that it’s often faster to transpose, perform a vertical filter, and then transpose back. Anyway, let’s not worry about applications right now, just trust me that it tends to come up more frequently than you might expect. So how do we do this?

### One way to do it

The method I see most often is to try and group increasingly larger parts of the result together. For example, we’d like to get “x0” and “x1” adjacent to each other, and the same with “x2” and “x3”, “y0” and “y1” and so forth. The canonical way to do this is using the “unpack” (x86), “merge” (PowerPC) or “zip” (ARM NEON) intrinsics. So to bring “x0” and “x1” together in the right order, we would do:

a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }


where interleave32_lft (“interleave 32-bit words, left half”) corresponds to UNPCKLPS (x86, floats), PUNPCKLDQ (x86, ints), or vmrghw (PowerPC). And to be symmetric, we do the same thing with the other half, giving us:

a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
a1 = interleave32_rgt(p0, p1) = { z0, z1, w0, w1 }


where interleave32_rgt corresponds to UNPCKHPS (x86, floats), PUNPCKHDQ (x86, ints), or vmrglw (PowerPC). The reason I haven’t mentioned the individual opcodes for NEON is that its “zips” always work on pairs of registers and handle both the “left” and “right” halves at once, forming a combined

(a0, a1) = interleave32(p0, p1)


operation (VZIP.32) that also happens to be a good way to think about the whole operation on other architectures – even though it’s not the ideal way to perform transposes on NEON, but I’m getting ahead of myself here. Anyway, again by symmetry we then do the same process with the other two rows, yielding:

// (a0, a1) = interleave32(p0, p1)
// (a2, a3) = interleave32(p2, p3)
a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
a1 = interleave32_rgt(p0, p1) = { z0, z1, w0, w1 }
a2 = interleave32_lft(p2, p3) = { x2, x3, y2, y3 }
a3 = interleave32_rgt(p2, p3) = { z2, z3, w2, w3 }


And presto, we now have all even-odd pairs nicely lined up. Now we can build X by combining the left halves from a0 and a2. Their respective right halves also combine into Y. So we do a similar process like before, only this time we’re working on groups that are pairs of 32-bit values – in other words, we’re really dealing with 64-bit groups:

// (X, Y) = interleave64(a0, a2)
// (Z, W) = interleave64(a1, a3)
X = interleave64_lft(a0, a2) = { x0, x1, x2, x3 }
Y = interleave64_rgt(a0, a2) = { y0, y1, y2, y3 }
Z = interleave64_lft(a1, a3) = { z0, z1, z2, z3 }
W = interleave64_rgt(a1, a3) = { w0, w1, w2, w3 }


This time, interleave64_lft (interleave64_rgt) correspond to MOVLHPS (MOVHLPS) for floats on x86, PUNPCKLQDQ (PUNPCKHQDQ) for ints on x86, or VSWP of d registers on ARM NEON. PowerPCs have no dedicated instruction for this but can synthesize it using vperm. The variety here is why I use my own naming scheme in this article, by the way.
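If you’d like something executable rather than pseudocode, the same sequence can be simulated in Python with registers as plain 4-element lists (the helper functions are my stand-ins for the intrinsics above):

```python
# Model the pseudo-intrinsics above on lists of 4 elements.
def interleave32_lft(a, b): return [a[0], b[0], a[1], b[1]]
def interleave32_rgt(a, b): return [a[2], b[2], a[3], b[3]]
def interleave64_lft(a, b): return a[:2] + b[:2]
def interleave64_rgt(a, b): return a[2:] + b[2:]

p = [["x%d" % i, "y%d" % i, "z%d" % i, "w%d" % i] for i in range(4)]

a0 = interleave32_lft(p[0], p[1])  # x0 x1 y0 y1
a1 = interleave32_rgt(p[0], p[1])  # z0 z1 w0 w1
a2 = interleave32_lft(p[2], p[3])  # x2 x3 y2 y3
a3 = interleave32_rgt(p[2], p[3])  # z2 z3 w2 w3

X = interleave64_lft(a0, a2)
Y = interleave64_rgt(a0, a2)
Z = interleave64_lft(a1, a3)
W = interleave64_rgt(a1, a3)

assert X == ["x0", "x1", "x2", "x3"]
assert Y == ["y0", "y1", "y2", "y3"]
assert Z == ["z0", "z1", "z2", "z3"]
assert W == ["w0", "w1", "w2", "w3"]
```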

Anyway, that’s one way to do it with interleaves. There’s more than one, however!

### Interleaves, variant 2

What if, instead of interleaving p0 with p1, we pair it with p2 instead? By process of elimination, that means we have to pair p1 with p3. Where does that lead us? Let’s find out!

// (b0, b2) = interleave32(p0, p2)
// (b1, b3) = interleave32(p1, p3)
b0 = interleave32_lft(p0, p2) = { x0, x2, y0, y2 }
b1 = interleave32_lft(p1, p3) = { x1, x3, y1, y3 }
b2 = interleave32_rgt(p0, p2) = { z0, z2, w0, w2 }
b3 = interleave32_rgt(p1, p3) = { z1, z3, w1, w3 }


Can you see it? We have four nice little squares in each of the quadrants now, and are in fact just one more set of interleaves away from our desired result:

// (X, Y) = interleave32(b0, b1)
// (Z, W) = interleave32(b2, b3)
X = interleave32_lft(b0, b1) = { x0, x1, x2, x3 }
Y = interleave32_rgt(b0, b1) = { y0, y1, y2, y3 }
Z = interleave32_lft(b2, b3) = { z0, z1, z2, z3 }
W = interleave32_rgt(b2, b3) = { w0, w1, w2, w3 }


This one uses just one type of interleave instruction, which is preferable if the 64-bit interleaves don’t exist on your target platform (PowerPC) or would require loading a different permutation vector (SPUs, which have to do the whole thing using shufb).
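And the same kind of executable Python check for this variant (my sketch; note that it really only ever calls the 32-bit interleaves):

```python
# Variant 2, with registers modeled as 4-element lists.
def interleave32_lft(a, b): return [a[0], b[0], a[1], b[1]]
def interleave32_rgt(a, b): return [a[2], b[2], a[3], b[3]]

p = [["x%d" % i, "y%d" % i, "z%d" % i, "w%d" % i] for i in range(4)]

b0 = interleave32_lft(p[0], p[2])  # x0 x2 y0 y2
b1 = interleave32_lft(p[1], p[3])  # x1 x3 y1 y3
b2 = interleave32_rgt(p[0], p[2])  # z0 z2 w0 w2
b3 = interleave32_rgt(p[1], p[3])  # z1 z3 w1 w3

X = interleave32_lft(b0, b1)
Y = interleave32_rgt(b0, b1)
Z = interleave32_lft(b2, b3)
W = interleave32_rgt(b2, b3)

assert X == ["x0", "x1", "x2", "x3"]
assert Y == ["y0", "y1", "y2", "y3"]
assert Z == ["z0", "z1", "z2", "z3"]
assert W == ["w0", "w1", "w2", "w3"]
```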

### It gets a bit weird

So far we’ve always started with 32-bit interleaves. Well, let’s just plunge ahead and start by 64-bit interleaving p0 and p1 instead, then see whether it leads anywhere.

// (c0, c1) = interleave64(p0, p1)
// (c2, c3) = interleave64(p2, p3)
c0 = interleave64_lft(p0, p1) = { x0, y0, x1, y1 }
c1 = interleave64_rgt(p0, p1) = { z0, w0, z1, w1 }
c2 = interleave64_lft(p2, p3) = { x2, y2, x3, y3 }
c3 = interleave64_rgt(p2, p3) = { z2, w2, z3, w3 }


Okay. For this one, we can’t continue with our regular interleaves, but we still have the property that each of our target vectors (X, Y, Z, and W) can be built using elements from only two of the c’s. In fact, the low half of each target vector comes from one c and the high half from another, which means that on x86, we can combine the two using SHUFPS. On PPC, there’s always vperm, SPUs have shufb, and NEON has VTBL, all of which are much more general, so again, it can be done there as well:

// 4 SHUFPS on x86
X = { c0[0], c0[2], c2[0], c2[2] } = { x0, x1, x2, x3 }
Y = { c0[1], c0[3], c2[1], c2[3] } = { y0, y1, y2, y3 }
Z = { c1[0], c1[2], c3[0], c3[2] } = { z0, z1, z2, z3 }
W = { c1[1], c1[3], c3[1], c3[3] } = { w0, w1, w2, w3 }


As said, this one is a bit weird, but it’s the method used for _MM_TRANSPOSE4_PS in Microsoft’s version of Intel’s xmmintrin.h (SSE intrinsics header) to this day, and it used to be the standard implementation in GCC’s version as well until it got replaced with the first method I discussed.

Anyway, that was starting by 64-bit interleaving p0 and p1. Can we do something similar if we pair p0 with p2 instead?

### The plot thickens

Again, let’s just try it!

// (c0, c2) = interleave64(p0, p2)
// (c1, c3) = interleave64(p1, p3)
c0 = interleave64_lft(p0, p2) = { x0, y0, x2, y2 }
c1 = interleave64_lft(p1, p3) = { x1, y1, x3, y3 }
c2 = interleave64_rgt(p0, p2) = { z0, w0, z2, w2 }
c3 = interleave64_rgt(p1, p3) = { z1, w1, z3, w3 }


Huh. This one leaves the top left and bottom right 2×2 blocks alone and swaps the other two. But we still got closer to our goal – if we swap the top right and bottom left element in each of the four 2×2 blocks, we have a full transpose as well. And NEON happens to have an instruction for that (VTRN.32). As usual, the other platforms can try to emulate this using more general shuffles:

// 2 VTRN.32 on NEON:
// (X, Y) = vtrn.32(c0, c1)
// (Z, W) = vtrn.32(c2, c3)
X = { c0[0], c1[0], c0[2], c1[2] } = { x0, x1, x2, x3 }
Y = { c0[1], c1[1], c0[3], c1[3] } = { y0, y1, y2, y3 }
Z = { c2[0], c3[0], c2[2], c3[2] } = { z0, z1, z2, z3 }
W = { c2[1], c3[1], c2[3], c3[3] } = { w0, w1, w2, w3 }


Just like NEON’s “zip” instructions, VTRN both reads and writes two registers, so it is in essence doing the work of two instructions on the other architectures. Which means that we now have 4 different methods to do the same thing that are essentially the same cost in terms of computational complexity. Sure, some methods end up faster than others on different architectures due to various implementation choices, but really, in essence none of these are fundamentally more difficult (or easier) than the others.

Nor are these the only ones – for the last variant, we started by swapping the 2×2 blocks within the 4×4 matrix and then transposing the individual 2×2 blocks, but doing it the other way round works just as well (and is again the same cost). In fact, this generalizes to arbitrary power-of-two sized square matrices – you can just partition it into differently sized block transposes which can run in any order. This even works with rectangular matrices, with some restrictions. (A standard way to perform “chunky to planar” conversion for old bit plane-based graphics architectures uses this general approach to good effect).

### And now?

Okay, so far, we have a menagerie of different matrix transpose techniques, all of which essentially have the same complexity. If you’re interested in SIMD coding, I suppose you can just use this as a reference. However, that’s not the actual reason I’m writing this; the real reason is that the whole “why are these things all essentially the same complexity” thing intrigued me, so a while back I looked into this and found out a whole bunch of cool properties that are probably not useful for coding at all, but which I nevertheless found interesting. In other words, I’ll write a few more posts on this topic, which I will spend gleefully nerding out with no particular goal whatsoever. If you don’t care, just stop reading now. You’re welcome!

This post is outside the usual pure technical focus of this blog, but the subject matter is important to me, and I wanted to write down my thoughts on it while they were still fresh.

This is a tweet by Anita Sarkeesian, along with a disturbingly large number of misogynistic (and outright hateful) replies she got in response. This would be a shameful display in any context, but it is especially bad considering that her “objectionable” statement is both a factual one and completely right. There were no games with female protagonists shown at the Xbox One E3 reveal.

So why the hate? Presumably because of things like her “Tropes vs. Women in Video Games” series (which is excellent, and which you should watch immediately if you haven’t done so already. Go ahead. This post will still be there when you come back). As a player of video games, it’s hard not to notice how obviously, painfully right the points she makes are. I think Anita is doing the game developer community a great service by pointing out how badly we’re currently screwing up in this regard; not only can we do better, we should be collectively ashamed that we don’t.

However, that is not the point I’m trying to make here. Instead, I want to address a rebuttal I hear frequently in cases like this: “Relax, you’re taking it far too seriously, it’s just a game!”, or alternatively, “Games don’t affect people like that”. (This statement also makes frequent appearance in discussions on the role of violence in games).

When coming from game developers, I find this statement extremely baffling. And here’s why: most games referenced in this context are interactive experiences that a player will spend at least a few hours with. Considering how the human brain works, it is exceedingly unlikely that something, anything a player spends that amount of time on will leave no trace whatsoever in her brain. That does not mean that violent games make people violent, or rampant misogyny in video games makes players hate women. That kind of thinking is far too simplistic and reductionist. But still: players will spend multiple hours of their life in a virtual world of your creation. That experience will, in some way or other, influence their thinking. Independent of whether it’s actually “harmful” to submerge players in a world of casual misogyny and ultraviolence – is there any point to it? Or is it just something we do out of laziness and without consciously thinking about it? And if so, what does that say about us?

But let’s take a step back. Say you really, truly believe that games leave no lasting impression of any kind on the player. That whether they play it or not makes no difference at all, for better or for worse, to their lives – they might as well have been sleeping. Because if you really think that is the case – then why are you spending years of your life working on the damn things?

Sometimes, you need to turn a linearly independent set of vectors into an orthonormal basis – or, equivalently, take a matrix that is “close” to orthogonal (for example, an orthogonal matrix that has been updated multiple times and might have started to drift due to round-off error) and make it properly orthogonal again.

The standard way to solve this problem is called Gram-Schmidt orthogonalization. The idea is pretty simple. Say we have a list of (linearly independent) input vectors v1, …, vn. For the first vector in our output orthogonal basis, we just normalize the first input vector:

$\displaystyle \mathbf{u}_1 = \mathrm{normalize}(\mathbf{v}_1) := \frac{\mathbf{v}_1}{\|\mathbf{v}_1\|}$

For the second vector u2 to be orthogonal to the first, we need to remove the component of v2 parallel to u1, which is a simple projection:

$\displaystyle \mathbf{u}_2 = \mathrm{normalize}(\mathbf{v}_2 - (\mathbf{v}_2 \cdot \mathbf{u}_1) \mathbf{u}_1)$

We now have two orthonormal vectors; for the third vector, we now need to remove the components that are parallel to either of them:

$\displaystyle \mathbf{u}_3 = \mathrm{normalize}(\mathbf{v}_3 - (\mathbf{v}_3 \cdot \mathbf{u}_1) \mathbf{u}_1 - (\mathbf{v}_3 \cdot \mathbf{u}_2) \mathbf{u}_2)$

and so forth. You get the idea. This is the “classical” Gram-Schmidt process, or “CGS”. It’s simple and easy to derive, and works just fine in exact arithmetic. However, when performed using floating-point arithmetic, it is numerically unstable – badly so. Let me give an example: consider the matrix

$\displaystyle \mathbf{A} := \begin{pmatrix} 1 & 1 & 1 \\ \epsilon & \epsilon & 0 \\ \epsilon & 0 & \epsilon \end{pmatrix}$

where ε is any value small enough so that (1 + ε²) rounds to 1 in the given floating-point type. I’m using single-precision IEEE floating point and ε = 10⁻⁴ for this example. Let’s run this through the classic Gram-Schmidt process:

static void classic_gram_schmidt(Mat33f &out, const Mat33f &in)
{
    out.col[0] = normalize(in.col[0]);
    out.col[1] = normalize(in.col[1] -
                           dot(in.col[1], out.col[0])*out.col[0]);
    out.col[2] = normalize(in.col[2] -
                           dot(in.col[2], out.col[0])*out.col[0] -
                           dot(in.col[2], out.col[1])*out.col[1]);
}


which produces this result (rounded to 4 decimal digits):

$\displaystyle \mathrm{CGS}(\mathbf{A}) = \left( \begin{array}{rrr} 1.0000 & 0.0000 & 0.0000 \\ 0.0001 & 0.0000 & -0.7071 \\ 0.0001 & -1.0000 & -0.7071 \end{array} \right)$

Ouch. The first column not being “perfectly” normalized is expected – after all, we explicitly chose ε so that (1 + ε²) rounded to 1 – but the third column is not at all orthogonal to the second column; in fact, there’s a perfect 45 degree angle between the two. For an orthogonalization algorithm, that’s a pretty serious failure.

It turns out that there’s a really simple fix though: “modified” Gram-Schmidt. Instead of computing all the dot products from the original vectors, perform the projections one by one, using the result of the previous projection as the input to the next. In exact arithmetic, this is equivalent, but using floating-point arithmetic, this version:

static void modified_gram_schmidt(Mat33f &out, const Mat33f &in)
{
    out.col[0] = normalize(in.col[0]);
    out.col[1] = normalize(in.col[1] -
                           dot(in.col[1], out.col[0])*out.col[0]);

    out.col[2] = in.col[2] -
                 dot(in.col[2], out.col[0])*out.col[0];
    // note the second dot product is computed from the partial
    // result out.col[2], not the input vector in.col[2]!
    out.col[2] -= dot(out.col[2], out.col[1])*out.col[1];
    out.col[2] = normalize(out.col[2]);
}


produces (again rounded to 4 decimal digits):

$\displaystyle \mathrm{MGS}(\mathbf{A}) = \left( \begin{array}{rrr} 1.0000 & 0.0000 & 0.0000 \\ 0.0001 & 0.0000 & -1.0000 \\ 0.0001 & -1.0000 & 0.0000 \end{array} \right)$

Much better. Now, by itself, better results on a single matrix don’t tell us anything, but it turns out that the MGS algorithm comes with a bounded error guarantee: The orthogonalized matrix Q satisfies the inequality

$\displaystyle \| \mathbf{I} - \mathbf{Q}^T \mathbf{Q} \|_2 \le \frac{c_1}{1 - c_2 \kappa u} \kappa u$

where c1 and c2 are constants, u is the machine precision (the “epsilon” for the given floating point type), and κ = κ(A) is the condition number of the input matrix. And in fact, orthogonalizing a matrix using MGS is numerically equivalent to performing a Householder QR decomposition (a known stable algorithm) on the matrix A augmented with a few extra zero rows at the top – which also means that, in addition to the above error bound on the orthogonality of Q, MGS is also backwards stable with a nice error bound. (Both claims are proven here).

Long story short: this is a nice example of a numerical algorithm where two approaches that are identical in exact arithmetic yield dramatically different results when computed in floating point. And it comes with an action item: if you have code that orthogonalizes a matrix (or orthonormalizes a tuple of basis vectors) using a Gram-Schmidt-like method, you should check whether it corresponds to the classical or modified GS algorithm. The modified algorithm has roughly the same cost (albeit with a different dependency structure that is slightly less amenable to vectorization) and is numerically much superior – even for something as small as a 3×3 matrix, as the example above shows. And if you want to play around with it, feel free to check out the code I used for the numerical experiments.

UPDATE: And before I get angry comments: in 3D, the cheapest way to compute the third basis vector is to not look at the third source vector at all, and instead simply use the cross product of the first two (assuming they’re normalized). This is cheaper, stable, and guarantees that the result will be a right-handed orthonormal basis. It does not, however, generalize to higher dimensions, so knowing about MGS is still useful.
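For completeness, here’s a minimal sketch of that cross-product construction. The `Vec3` type and the `dot`/`cross`/`normalize` helpers are illustrative stand-ins, not the types from the original experiment code:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(float s, Vec3 v) { return {s*v.x, s*v.y, s*v.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3 cross(Vec3 a, Vec3 b)
{
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
static Vec3 normalize(Vec3 v)
{
    float len = std::sqrt(dot(v, v));
    return (1.0f / len) * v;
}

// Build a right-handed orthonormal basis from two (roughly independent)
// input vectors: orthonormalize the first two with a single Gram-Schmidt
// step, then take their cross product as the third axis. A third input
// vector is never needed.
static void orthonormal_basis(Vec3 &u, Vec3 &v, Vec3 &w, Vec3 a, Vec3 b)
{
    u = normalize(a);
    v = normalize(b - dot(b, u)*u); // remove the component along u
    w = cross(u, v);                // unit length and orthogonal by construction
}
```

Since `u` and `v` are unit length and orthogonal, `cross(u, v)` is already normalized; no third projection or normalize is required.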

There’s tons of useful trig identities. You could spend the time to learn them by heart, or just look them up on Wikipedia when necessary. But I’ve always had problems remembering where the signs and such go when trying to memorize this directly. At least for me, what worked way better is this: spend a few hours familiarizing yourself with complex numbers if you haven’t done so already; after that, most identities that you need in practice are easy to derive from Euler’s formula:

$e^{ix} = \exp(ix) = \cos(x) + i \sin(x)$

Let’s do the basic addition formulas first. Euler’s formula gives:

$\cos(x+y) + i \sin(x+y) = \exp(i(x+y)) = \exp(ix) \exp(iy)$

and applying Euler’s formula again to each factor on the right-hand side gives:

$(\cos(x) + i \sin(x)) (\cos(y) + i \sin(y))$

multiplying out:

$(\cos(x) \cos(y) - \sin(x) \sin(y)) + i (\sin(x) \cos(y) + \cos(x) \sin(y))$

The terms in parentheses are all real numbers; equating them with our original expression yields the result

$\cos(x+y) = \cos(x) \cos(y) - \sin(x) \sin(y)$
$\sin(x+y) = \sin(x) \cos(y) + \cos(x) \sin(y)$

Both addition formulas for the price of one. (In fact, this exploits that the addition formulas for trigonometric functions and the addition formula for exponents are really the same thing.) The main point is that if you know complex multiplication, you never have to remember how the factors group or where the signs go – exactly the parts I always used to get wrong when memorizing these directly.
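If you want to sanity-check the derivation numerically, `std::complex` makes it nearly a one-liner; this little check is purely illustrative, not part of the original derivation:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Verify the addition formulas by comparing cos(x+y) and sin(x+y)
// against the real and imaginary parts of exp(i(x+y)) = exp(ix)*exp(iy).
static bool addition_formulas_hold(double x, double y)
{
    const std::complex<double> i(0.0, 1.0);
    const std::complex<double> lhs = std::exp(i * (x + y));
    const std::complex<double> rhs = std::exp(i * x) * std::exp(i * y);

    const double eps = 1e-12;
    bool cos_ok = std::abs(lhs.real() -
        (std::cos(x)*std::cos(y) - std::sin(x)*std::sin(y))) < eps;
    bool sin_ok = std::abs(lhs.imag() -
        (std::sin(x)*std::cos(y) + std::cos(x)*std::sin(y))) < eps;
    return cos_ok && sin_ok && std::abs(lhs - rhs) < eps;
}
```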

Plugging in x=y into the above also immediately gives the double-angle formulas:

$\cos(2x) = \cos(x)^2 - \sin(x)^2$
$\sin(2x) = 2 \sin(x) \cos(x)$

so if you know the addition formulas there’s really no reason to learn these separately.

Then there’s the well-known

$\cos(x)^2 + \sin(x)^2 = 1$

but it’s really just the Pythagorean theorem in disguise (since cos(x) and sin(x) are the side lengths of a right-angled triangle). So not really a new formula either!

Moving either the cosine or sine terms to the right-hand side gives the two immensely useful equations:

$\cos(x)^2 = 1 - \sin(x)^2$
$\sin(x)^2 = 1 - \cos(x)^2$

In particular, that second one is perfect if you need the sine squared of an angle that you only have the cosine of (usually because you’ve determined it using a dot product). Judicious application of these two tends to be a great way to simplify superfluous math in shaders (and elsewhere), one of my pet peeves.

For practice, let’s apply these two identities to the cosine double-angle formula:

$\cos(2x) = \cos(x)^2 - \sin(x)^2 = 2 \cos(x)^2 - 1 \Leftrightarrow \cos(x)^2 = (\cos(2x) + 1) / 2$
$\cos(2x) = \cos(x)^2 - \sin(x)^2 = 1 - 2 \sin(x)^2 \Leftrightarrow \sin(x)^2 = (1 - \cos(2x)) / 2$

why, it’s the half-angle formulas! Fancy meeting you here!

Can we do something with the sine double-angle formula too? Well, it’s not too fancy, but we can get this:

$\sin(2x) = 2 \sin(x) \cos(x) \Leftrightarrow \sin(x) \cos(x) = \sin(2x) / 2$

Now, let’s go back to the original addition formulas and let’s see what happens when we plug in negative values for y. Using $\sin(-x) = -\sin(x)$ and $\cos(-x) = \cos(x)$, we get:

$\cos(x-y) = \cos(x) \cos(y) + \sin(x) \sin(y)$
$\sin(x-y) = \sin(x) \cos(y) - \cos(x) \sin(y)$

Hey look, flipped signs! This means that we can now add these to (or subtract them from) the original formulas to get even more identities!

$\cos(x+y) + \cos(x-y) = 2 \cos(x) \cos(y)$
$\cos(x-y) - \cos(x+y) = 2 \sin(x) \sin(y)$
$\sin(x+y) + \sin(x-y) = 2 \sin(x) \cos(y)$
$\sin(x+y) - \sin(x-y) = 2 \cos(x) \sin(y)$

It’s the product-to-sum identities this time. I got one more! We’ve deliberately flipped signs and then added/subtracted the addition formulas to get the above set. What if we do the same trick in reverse to get rid of those x+y and x-y terms? Let’s set $x = (a + b)/2$ and $y = (b - a)/2$, so that $x + y = b$ and $x - y = a$; plugging that into the identities above, we get:

$\cos(b) + \cos(a) = 2 \cos((a+b)/2) \cos((b-a)/2)$
$\cos(a) - \cos(b) = 2 \sin((a + b)/2) \sin((b - a)/2)$
$\sin(b) + \sin(a) = 2 \sin((a + b)/2) \cos((b - a)/2)$

Ta-dah, it’s the sum-to-product identities. Now, admittedly, we’ve taken quite a few steps to get here, and looking these up when you need them is going to be faster than walking through the derivation (if you ever need them in the first place – I don’t think I’ve ever used the product/sum identities in practice). But still, working these out is a good exercise, and a lot less likely to go wrong (at least for me) than memorizing lots of similar formulas. (I never can get the signs right that way)

Bonus exercise: work out general expressions for $\cos(x)^n$ and $\sin(x)^n$. Hint:

$\cos(x) = (\exp(ix) + \exp(-ix))/2$
$\sin(x) = (\exp(ix) - \exp(-ix))/(2i)$.
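To show how this goes, here’s the $n = 2$ case of the cosine worked out (this worked step is mine, not from the original exercise). Squaring the hint and multiplying out:

$\cos(x)^2 = \frac{(\exp(ix) + \exp(-ix))^2}{4} = \frac{\exp(2ix) + 2 + \exp(-2ix)}{4} = \frac{\cos(2x) + 1}{2}$

which is exactly the half-angle formula from before; higher powers work the same way via the binomial theorem, pairing up $\exp(ikx)$ and $\exp(-ikx)$ terms into cosines.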

And I think that’s enough for now. (At some later point, I might do an extra post about one of the sneakier trig techniques: the Weierstrass substitution).

One interesting thing about x86 is that it’s changed two major architectural “magic values” in the past 10 years. The first is the addition of 64-bit mode, which not only widens all general-purpose registers and gives a much larger virtual address space, it also increases the number of general-purpose and XMM registers from 8 to 16. The second is AVX, which allows all SSE (and other SIMD) instructions to be encoded using non-destructive 3-operand forms instead of the original 2-operand forms.

Since modern x86 processors are trying really hard to run both 32- and 64-bit code well (and same for SSE vs. AVX), this gives us an opportunity to compare the relative performance of these choices in a reasonably level playing field, when running the same (C++) code. Of course, this is nowhere near a perfect comparison, especially since switching from 32 to 64 bits also changes the sizes of pointers and (at the very least) the code generator used by the compiler, but it’s still interesting to be able to do the experiment on a single machine with no fuss. So, without further ado, here’s a quick comparison using the Software Occlusion Culling demo I’ve been writing about for the past month – a fairly SIMD-heavy workload.

| Version         | Occlusion cull | Render scene   |
|-----------------|----------------|----------------|
| x86 (baseline)  | 2.88ms         | 1.39ms         |
| x86, /arch:SSE2 | 2.88ms (+0.2%) | 1.48ms (+5.8%) |
| x86, /arch:AVX  | 2.77ms (-3.8%) | 1.43ms (+2.7%) |
| x64             | 2.71ms (-5.7%) | 1.29ms (-7.2%) |
| x64, /arch:AVX  | 2.63ms (-8.7%) | 1.28ms (-8.5%) |

Note that /arch:AVX makes VC++ use AVX forms of SSE vector instructions (i.e. 3-operand), but it’s all still 4-wide SIMD, not the new 8-wide SIMD floating point. Getting that would require changes to the code. And of course the code uses SSE2 (and, in fact, even SSE4.1) instructions whether we turn on /arch:SSE2 on x86 or not – this only affects how “regular” floating-point code is generated. Also, the speedup percentages are computed from the full-precision values, not the truncated values I put in the table. (Which doesn’t mean much, since I truncated the values to about their level of accuracy)

So what does this tell us? Hard to be sure. It’s very few data points and I haven’t done any work to eliminate the effects of e.g. memory layout and code placement, which can be quite significant. And of course switching targets also changes the code generator used by the compiler. That said, a few observations:

• Not much of a win turning on /arch:SSE2 on the regular x86 code. If anything, the rendering part of the code gets worse from the “enhanced instruction set” usage. I did not investigate further.
• The 3-operand AVX instructions provide a solid win of a few percentage points in both 32-bit and 64-bit mode. Considering I’m not using any 8-wide instructions, this is almost exclusively the impact of having fewer register-register move instructions.
• Yes, going to 64 bits does make a noticeable difference. Note in particular the dip in rendering time. Whether it’s due to the overhead of 32-bit thunks on a 64-bit system, better code generation on the app side, better code on the D3D runtime/driver side, or most likely a combination of all these factors, the D3D rendering code sure gets a lot faster. And similarly, the SIMD-heavy occlusion cull code sees a good speed-up too. I have not investigated whether this is primarily due to the extra registers, or due to code generation improvements.

I don’t think there’s any particular lesson here, but it’s definitely interesting.