A lot of bit manipulation operations exist basically everywhere: AND, OR, XOR/EOR, shifts, and so forth. Some exist only on some architectures but have obvious implementations everywhere else – NAND, NOR, equivalence/XNOR and so forth.

Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they’re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).

And then there’s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it’s not always obvious how.

### x86 bit scanning operations

On x86, there’s `BSF` (bit scan forward), `BSR` (bit scan reverse), `TZCNT` (trailing zero count) and `LZCNT` (leading zero count). The first two are “old” (having existed since 386), the latter are very new. `BSF` and `TZCNT` do basically the same thing, with one difference: `BSF` has “undefined” results on a zero input (what is the index of the first set bit in the integer 0?), whereas the “trailing zeros” definition for `TZCNT` provides an obvious answer: the width of the register. `BSR` returns the index of the “last” set bit (again undefined on 0 input), `LZCNT` returns the number of leading zero bits, so these two actually produce different results.

• `x86_bsf(x) = x86_tzcnt(x)` unless x=0 (in which case `bsf` is undefined).
• `x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x)` unless x=0 (in which case `bsr` is undefined).

In the following, I’ll only use `LZCNT` and `TZCNT` (use the identities above to convert where necessary), simply to avoid that nasty edge case at 0.
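For concreteness, here’s how zero-robust versions might look in C++ using GCC/Clang builtins (the helper names are made up for illustration; the builtins themselves are undefined at 0, hence the explicit special cases):

```cpp
#include <cstdint>

// Mirrors TZCNT semantics: index of lowest set bit, or register width at 0.
static inline uint32_t tzcnt32(uint32_t x) {
    return x ? (uint32_t)__builtin_ctz(x) : 32;
}

// Mirrors LZCNT semantics: number of leading zero bits, register width at 0.
static inline uint32_t lzcnt32(uint32_t x) {
    return x ? (uint32_t)__builtin_clz(x) : 32;
}

// BSR's result (index of highest set bit) for nonzero x, per the identity above.
static inline uint32_t bsr32(uint32_t x) { // requires x != 0
    return 31 - lzcnt32(x);
}
```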

### PowerPC bit scanning operations

PPC has `cntlzw` (Count Leading Zeros Word) and `cntlzd` (Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there’s the old trick `x & -x` which clears all but the lowest set bit in `x`. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.

• `x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x & -x)` unless x=0 (in which case the PPC expression returns -1).
• `x86_lzcnt(x) = ppc_cntlz(x)`.
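A quick sketch of the trick in C++, with `__builtin_clz` standing in for `cntlzw` (the helper name is hypothetical):

```cpp
#include <cstdint>

// Trailing zero count via leading zero count: x & -x isolates the lowest
// set bit of x (two's complement), and a CLZ then locates it.
static inline int32_t tzcnt_via_clz(uint32_t x) {
    uint32_t lowest = x & (0u - x);              // only the lowest set bit survives
    uint32_t clz = lowest ? (uint32_t)__builtin_clz(lowest) : 32;
    return 31 - (int32_t)clz;                    // returns -1 for x == 0, as noted above
}
```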

### ARM bit scanning operations

ARM also has `CLZ` (Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned `RBIT` which reverses the bits in a word, which makes a trailing zero count easy to accomplish:

• `x86_tzcnt(x) = arm_clz(arm_rbit(x))`.
• `x86_lzcnt(x) = arm_clz(x)`.
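On non-ARM targets (or just to convince yourself it works), `RBIT` can be emulated with the classic bit-swap network; a sketch:

```cpp
#include <cstdint>

// Software stand-in for ARM RBIT: reverse the bits of a 32-bit word by
// swapping progressively larger groups (pairs, nibbles, bytes, halves).
static inline uint32_t rbit32(uint32_t x) {
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0f0f0f0fu) << 4) | ((x >> 4) & 0x0f0f0f0fu);
    x = ((x & 0x00ff00ffu) << 8) | ((x >> 8) & 0x00ff00ffu);
    return (x << 16) | (x >> 16);
}

// tzcnt(x) = clz(rbit(x)); zero-safe here because we define clz(0) = 32.
static inline uint32_t tzcnt_via_rbit(uint32_t x) {
    uint32_t r = rbit32(x);
    return r ? (uint32_t)__builtin_clz(r) : 32;
}
```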

### Bonus

Finally, ARM’s NEON also offers `VCLS` (Vector Count Leading Sign Bits), which (quoting from the documentation) “counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit”. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have: `arm_cls(x) = x86_lzcnt(x ^ (x >> 1)) - 1` (the shift here is an arithmetic shift). The expression `y = x ^ (x >> 1)` gives a value that has bit `n` set if and only if bits `n` and `n + 1` of x are the same. By induction, the number of leading zeros in `y` is thus exactly the number of leading bits in `x` that match the sign bit. This count includes the topmost (sign) bit, so it’s always at least 1, and the instruction definition I just quoted requires us to return the number of bits following the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we’re good.
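A sketch of that expression in C++ (assuming an arithmetic right shift for signed values, which pre-C++20 standards leave implementation-defined but all mainstream compilers provide):

```cpp
#include <cstdint>

// Count leading sign bits: lzcnt(x ^ (x >> 1)) - 1, as derived above.
// The shift must be arithmetic, so x is signed here.
static inline int32_t cls32(int32_t x) {
    uint32_t y = (uint32_t)(x ^ (x >> 1));       // bit n set iff bits n, n+1 differ
    uint32_t lz = y ? (uint32_t)__builtin_clz(y) : 32;
    return (int32_t)lz - 1;
}
```

Note the `y == 0` case (x is 0 or -1) maps to 31, matching the “all 31 bits below the top bit agree with it” reading.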

Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.

Several long posts in the works; in the meantime, have this:

Bob, just out of the shower and still drowsy, gets a call: his dad just died. Stunned, he calls a few of his other relatives. 30 minutes later, he shows up at work, to meet his boss: “You’re late again, Bob. You had two warnings. That’s it – you’re fired.”

And so forth. If this is a story about Bob, these are high emotional stakes: Bob’s probably having one of the worst days of his life right now. This is his personal apocalypse, and we’re immersed in it, feeling every blow.

Zoom out. A thousand people in the city got the same call that morning. It’s a virus outbreak! Bob might still be one of the protagonists – the character the audience is supposed to identify with. He’s having a bad day, maybe the worst day of his life, sure – but he’s expected to get over it and somehow help the trained virologists through a series of contrived plot twists and stop the epidemic from spreading further.

Zoom out. Maybe there’s no virus outbreak. But there’s a giant asteroid heading for earth! In this kind of story, Bob might still be the everyman character who’s having a really bad day. And he will still get drawn into the plot somehow. We may be supposed to feel sorry for him, or the story might play what’s happening to Bob for laughs, but either way, there are bigger things at stake here! Get a grip, Bob!

Drama is not scale-invariant. Depending on the scope, an individual goes from the center of the universe in the story, to a piece on a chess board, to a bacterium under the microscope, and ultimately becomes completely invisible.

This, in short, is the problem I have with most time-travel stories.

Once your story widens its scope so far that the very substance of space and time takes center stage, and the plot becomes a series of delicate (or not-so-delicate) manipulations to make the “right” universe – the one where everything goes the way it “should” (whatever that is supposed to mean) – win out, the characters and the world they inhabit become set dressing. You have put the entire universe in a neat little box, labeled it “the universe where Tim dies”, and then you zoom out and look at the tree of all possible timelines and how they branch, and you start rearranging things until you can’t see “the universe where Tim dies” in the part of the timeline that you care about anymore.

In short, you are now writing a story about rebalancing a tree data structure. There happen to be people and so on in it, and you may even care about them, but at this point the scale and the stakes are so all-encompassing that it’s hard to really care, and all that’s left is a bunch of neat abstractions.

I like abstractions. I like writing abstractions, and writing about abstractions. I have written a multi-part series on multiple variants of the half-edge data structure on this very blog! And these kinds of stories tickle the part of my brain that likes abstractions and manipulating them.

But they also leave the part of my brain that is social and thinks about humans and their relationships oddly cold and unsatisfied, and I think that’s an inherent problem of this type of story, and I don’t think it can be fixed.

Another Android post. As you can tell by the previous post, I’ve been doing some testing on Android lately, and before that came building, deploying, and testing/debugging. As part of the latter, I was trying to get `ndk-gdb` to work, on which I spent about one and a half days full-time (without success), and then later about as much time waiting (and sometimes answering questions) when some Android devs on Twitter took pity on me and helped me figure the problem out. Since I found no mention of this issue in the usual places (Stack Overflow etc.), I’m writing it up here in case someone else runs into the same issues later.

### First problem: Android 4.3

The symptom here is that `ndk-gdb` won’t be able to successfully start the target app at all, and dies pretty early with an error: “Package is unknown”, “data directory not found”, or something similar. This is apparently something that got broken in the update to Android 4.3 – see this issue in the Android bug tracker. If you run into this problem with Android 4.3 running on an “official” Google device (i.e. the Nexus-branded ones), it can be fixed by flashing the device to use the Google-provided Factory Images with Android 4.3. If you’re not on a Nexus device – well, sucks to be you; there are some other workarounds mentioned in the issue, and Google says that the problem will be fixed in a “future release”, but for now you’re out of luck.

### Second problem: “Waiting for Debugger” – NDK r9 vs. JDK 7

The second problem occurs a bit further down the line: `ndk-gdb` actually manages to launch the app on the phone and gets into gdb, but the app then freezes showing a “Waiting for Debugger” screen and won’t continue no matter what you do. Note that there are lots of ways to get stuck at that screen, see Stack Overflow and the like; in particular, if you see that screen even when launching the app directly on the Android device (instead of starting it via `ndk-gdb --start` or `ndk-gdb --launch` on the host), this is a completely different problem and what I’m describing here doesn’t apply.

Anyway, this one took ages to figure out. After about two days (when I had managed to find the original problem I was trying to debug on a colleague’s machine, where `ndk-gdb` worked), I realized that everything seemed to work fine on his machine, which had an older Android NDK, but did not work on two of my machines, which were both using NDK r9. So I went over the change log for r9 to check if there was anything related to `ndk-gdb`, and indeed, there was this item:

• Updated ndk-gdb script so that the `--start` or `--launch` actions now wait for the GNU Debug Server, so that it can more reliably hit breakpoints set early in the execution path (such as breakpoints in JNI code). (Issue 41278)

Note: This feature requires jdb and produces warning about pending breakpoints. Specify the `--nowait` option to restore previous behavior.

Aha! Finally, a clue. So I tried running my projects with `ndk-gdb --start --nowait`, and indeed, that worked just fine (in retrospect, I should have searched for a way to disable the wait sooner, but hindsight is always 20/20). That was good enough for me, although it meant I didn’t get to enjoy the fix for the Android issue cited in the change log. This is annoying, but not hard to work around: just `sleep` for a few seconds early in your JNI main function to give the debugger time to attach. I was still curious about what’s going on though, but I had absolutely no clue how to proceed from there – digging into it any further would’ve required knowledge of internals that I just didn’t have.

This is when an Android developer on Twitter offered to step in and see if he could figure it out for me; all I had to do was give him some debug logs. Fair enough! And today around noon, he hit paydirt.

Okay, so here’s the problem: as the change log entry notes, the “wait for debugger” is handled in the Java part of the app and goes through jdb (the Java debugger), whereas the native code side is handled by gdbserver and gdb. And the problem was on the Java side of the fence, which I really don’t know anything about. Anyway, I could attach jdb just fine (and run jdb commands successfully), but the wait dialog on the Android phone just wouldn’t go away no matter what I did.

It turns out that the problem was caused by me using JDK 7, when Android only officially supports JDK 6. Everything else that I’ve tried worked fine, and none of the build (or other) Android SDK scripts complained about the version mismatch, but apparently on the Android side, things won’t work correctly if a JDK 7 version of jdb tries to connect. And while you’re at it, make sure you’re using a 32-bit JDK too, even if you’re on a 64-bit machine; I didn’t make that mistake, but apparently that one can cause problems too.

After I switched to the 32-bit JDK6u38 from here (the old Java JDK site, which unlike the new Oracle-hosted site won’t make you create a user account if you want to download old versions) things started working: I can now use `ndk-gdb` just fine, and it properly waits for the debugger to attach so I can set breakpoints as early as I like without resorting to the `sleep` hack.

### Summary (aka TL;DR)

Use Android 4.2 or older, or flash to the “factory image” if you want native debugging to even start.

If you’re using the NDK r9, make sure you’re using a 32-bit JDK 6 (not 7) or you might get stuck at the “Waiting for Debugger” prompt indefinitely.

Thanks to Branimir Karadžić for pointing out the first issue to me (if I hadn’t known that it was a general Android 4.3 thing, I would’ve wasted a lot of time on this), and huge thanks to Justin Webb for figuring out the second one!

Well done, Android. The Enrichment Center once again reminds you that Android Hell is a real place where you will be sent at the first sign of defiance.

RAD Game Tools, my employer, recently (version 2.2e) started shipping Bink 2.2 on Android, and we decided it was time to go over the example texture upload code in our SDK and see if there were any changes we should make – the original code was written years ago. So far, the answer seems to be “no”: what we’re doing seems to be about as good as we can expect when sticking to “mainstream” GL ES, reasonably widely-deployed extensions and the parts of Android officially exposed in the NDK. That said, I have done a bunch of performance measurements over the past few days, and they paint an interesting (if somewhat depressing) picture. A lot of people on Twitter seemed to be interested in my initial findings, so I asked my boss if it was okay if I published the “proper” results here and he said yes – hence this post.

### Setting

Okay, here’s what we’re measuring: we’re playing back a 1280×720 29.97fps Bink 2 video – namely, an earlier cut of this trailer that we have a very high-quality master version of (it’s one of our standard test videos); I’m sure the exact version of the video we use is on the Internet somewhere too, but I didn’t find it within the 2 minutes of googling, so here goes. We’re only using the first 700 frames of the video to speed up testing (700 frames is enough to get a decent sample).

Like most popular video codecs, Bink 2 produces output data in planar YUV, with the U/V color planes sub-sampled 2x both horizontally and vertically. These three planes get uploaded as 3 separate textures (which together form a “texture set”): one 1280×720 texture for luminance (Y) and two 640×360 textures for chrominance (Cb/Cr). (Bink and Bink 2 also support encoding an alpha channel, which adds another 1280×720 texture to the set). All three textures use the `GL_LUMINANCE` pixel format by default, with `GL_UNSIGNED_BYTE` data; that is, one byte per texel. This data is converted to RGB using a simple fragment shader.

Every frame, we upload new data for all 3 textures in a set using `glTexSubImage2D` from the internal video frame buffers, uploading the entire image (we could track dirty regions fairly easily, but with slow uploads this increases frame rate variance, which is a bad thing). We then draw a single quad using the 3 textures and our fragment shader. All very straightforward stuff.

Furthermore, we actually keep two texture sets around – everything is double-buffered. You will see why this is a good idea despite the increased memory consumption in a second.

### A wrinkle

One problem with GL ES targets is that the original GL ES went a bit overboard in removing core GL features. One important feature they removed was `GL_UNPACK_ROW_LENGTH` – this parameter sets the distance between adjacent rows in a client-specified image, counted in pixels. Why would you care about this? Simple: Say you have a 256×256 texture that you want to update from a system memory copy, but you know that you only changed the lower-left 128×128 pixels. By default, `glTexSubImage2D` with `width = height = 128` will assume that the rows of the source image are 128 pixels wide and densely packed. Thus, to update just a 128×128 pixel region, you would have to either copy the lower left 128×128 pixels of your system memory texture into a smaller array that is densely packed, or call `glTexSubImage2D` 128 times, uploading a row at a time. Neither of these is very appealing from a performance perspective. But if you have `GL_UNPACK_ROW_LENGTH`, you can just set it to 256 and upload everything with a single call. Much nicer.
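Without that parameter, the dense-repacking fallback is just a strided row-by-row copy into a staging buffer before the single `glTexSubImage2D` call; a CPU-side sketch (function name hypothetical, byte-sized texels assumed):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy a w×h sub-rectangle at (x, y) out of a larger image whose rows are
// src_pitch bytes apart, into a densely packed dst buffer suitable for a
// single glTexSubImage2D call without GL_UNPACK_ROW_LENGTH.
static void repack_subrect(uint8_t* dst, const uint8_t* src,
                           size_t src_pitch, size_t x, size_t y,
                           size_t w, size_t h) {
    for (size_t row = 0; row < h; ++row)
        std::memcpy(dst + row * w, src + (y + row) * src_pitch + x, w);
}
```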

The reason Bink 2 needs this is because we support arbitrary-width videos, but like most video codecs, the actual coding is done in terms of larger units. For example, MPEG 1 through to H.264 all use 16×16-pixel “macroblocks”, and any video that is not a multiple of 16 pixels will get padded out to a multiple-of-16-size internally. Even if you didn’t need the extra data in the codec, you would still want adjacent rows in the plane buffers to be multiples of at least 16 pixels, simply so that every row is 16-byte aligned (an important magic number for a lot of SIMD instruction sets). Bink 2’s equivalent of macroblocks is 32×32 pixels in size, so we internally want rows to be a multiple of 32 pixels wide.

What this all means is that if you decide you really want a 65×65 pixel video, that’s fine, but we’re going to allocate our internal buffers as if it was 96 pixels wide (and 80 pixels tall – we can omit storage for the last 16 rows in the last macroblock). Which is where the unpack row length comes into play – if we have it, we can support “odd-sized” videos efficiently; if we don’t, we have to use the slower fallback, i.e. call `glTexSubImage2D` for every scan line individually. Luckily, there is the GL_EXT_unpack_subimage GL ES extension that adds this feature back in and is available on most recent devices; but for “odd” sizes on older devices, we’re stuck with uploading a row at a time.
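The rounding rule itself is trivial; a sketch of the pitch computation (helper name hypothetical; the multiple-of-32 constant is Bink 2’s macroblock size as described above):

```cpp
#include <cstddef>

// Round a video width up to the next multiple of 32 pixels, Bink 2's
// macroblock size. For byte-sized texels this also keeps every row
// 32-byte aligned, as long as the base pointer itself is aligned.
static inline size_t padded_pitch(size_t width_pixels) {
    return (width_pixels + 31) & ~(size_t)31;
}
```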

That said, none of this affects our test video, since 1280 pixels width is a multiple of 32; I just thought I’d mention it anyway since it’s one of the random, non-obvious API compatibility issues you run into. Anyway, back to the subject.

Okay, so here’s what I did: Bink 2 decodes the video on another (or multiple other) threads. Periodically – ideally, 30 times a second – we upload the current frame and draw it to the screen. My test program will never drop any frames; in other words, we may run slower than 30fps, but we will always upload and render all 700 frames, and we will never run faster than 30fps (well, 29.97fps, but close enough).

Around the texture upload, my test program does this:

```
// update the GL textures
clock_t start = clock();
Update_Bink_textures( &TextureSet, Bink );
clock_t total = clock() - start;

upload_stats.record( (float) ( 1000.0 * total / CLOCKS_PER_SEC ) );
```

where `upload_stats` is an instance of the `RunStatistics` class I used in the Optimizing Software Occlusion Culling series. This gives me order statistics, mean and standard deviation for the texture update times, in milliseconds.
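The actual `RunStatistics` class isn’t shown here, but a minimal stand-in that provides the same kinds of numbers might look like this:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal running-statistics recorder: min/max, mean, standard deviation
// and nearest-rank percentiles over recorded samples. Not the actual
// RunStatistics class from the occlusion culling series, just a sketch.
class RunStats {
    std::vector<float> samples;
public:
    void record(float v) { samples.push_back(v); }
    float minimum() const { return *std::min_element(samples.begin(), samples.end()); }
    float maximum() const { return *std::max_element(samples.begin(), samples.end()); }
    float mean() const {
        double s = 0.0;
        for (float v : samples) s += v;
        return (float)(s / samples.size());
    }
    float stddev() const { // population standard deviation
        double m = mean(), s = 0.0;
        for (float v : samples) s += (v - m) * (v - m);
        return (float)std::sqrt(s / samples.size());
    }
    float percentile(float p) { // p in [0,100]; sorts the samples in place
        std::sort(samples.begin(), samples.end());
        size_t idx = (size_t)(p / 100.0f * (samples.size() - 1) + 0.5f);
        return samples[idx];
    }
};
```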

I also have several different test variants that I run:

• `GL_LUMINANCE` tests upload the texture data as `GL_LUMINANCE` as explained above. This is the “normal” path.
• `GL_RGBA` tests upload the same bytes as a `GL_RGBA` texture, with all X coordinates (and the texture width) divided by 4. In other words, they transfer the same amount of data (and in fact the same data), just interpreted differently. This was done to check whether RGBA textures enjoy special optimizations in the drivers (spoiler: it seems like they do).
• `use1x1` tests force all `glTexSubImage2D` calls to upload just 1×1 pixels – in other words, this gives us the cost of API overhead, possible synchronization and texture ghosting while virtually removing any per-pixel costs (such as CPU color space conversion, swizzling, DMA transfers or memory bandwidth).
• `nodraw` tests do all of the texture uploading, but then don’t actually draw the quad. This still measures processing time for the texture upload, but since the texture isn’t actually used, no synchronization or ghosting is ever necessary.
• `uploadall` uses `glTexImage2D` instead of `glTexSubImage2D` to upload the whole texture. In theory, this will guarantee to the driver that all existing texture data is overwritten – so while texture ghosting might still have to allocate memory for a new texture, it won’t have to copy the old contents at least. In practice, it’s not clear if the drivers actually make use of that fact. For obvious reasons, this and `use1x1` are mutually exclusive, and I only ran this test on the PowerVR device.

### Results

So, without further ado, here’s the results on the 4 devices I tested: (apologies for the tiny font size, but that was the only way to squeeze it into the blog layout)

| Device / GPU | Format | min | 25th | med | 75th | max | avg | sdev |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| 2010 Droid X (PowerVR SGX 530) | GL_LUMINANCE | 14.190 | 15.472 | 17.700 | 20.233 | 70.893 | 19.704 | 5.955 |
| | GL_RGBA | 11.139 | 13.245 | 14.221 | 14.832 | 28.412 | 14.382 | 1.830 |
| | GL_LUMINANCE use1x1 | 0.061 | 38.269 | 39.398 | 41.077 | 93.750 | 41.905 | 6.517 |
| | GL_RGBA use1x1 | 0.061 | 30.761 | 32.348 | 32.837 | 59.906 | 33.165 | 4.305 |
| | GL_LUMINANCE nodraw | 9.979 | 12.726 | 13.427 | 14.985 | 29.632 | 13.854 | 1.788 |
| | GL_RGBA nodraw | 5.188 | 10.376 | 11.291 | 12.024 | 26.215 | 10.864 | 2.013 |
| | GL_LUMINANCE use1x1 nodraw | 0.030 | 0.061 | 0.061 | 0.092 | 0.733 | 0.086 | 0.058 |
| | GL_RGBA use1x1 nodraw | 0.030 | 0.061 | 0.061 | 0.091 | 0.916 | 0.082 | 0.081 |
| | GL_LUMINANCE uploadall | 13.611 | 15.106 | 17.822 | 19.653 | 73.944 | 19.312 | 6.145 |
| | GL_RGBA uploadall | 7.171 | 12.543 | 13.489 | 14.282 | 34.119 | 13.751 | 1.854 |
| | GL_LUMINANCE uploadall nodraw | 9.491 | 12.756 | 13.702 | 14.862 | 33.966 | 13.994 | 2.176 |
| | GL_RGBA uploadall nodraw | 5.158 | 9.796 | 10.956 | 11.718 | 22.735 | 10.465 | 2.135 |
| 2012 Nexus 7 (Nvidia Tegra 3) | GL_LUMINANCE | 6.659 | 7.706 | 8.710 | 10.627 | 18.842 | 9.597 | 2.745 |
| | GL_RGBA | 3.278 | 3.600 | 4.128 | 4.906 | 9.244 | 4.395 | 1.011 |
| | GL_LUMINANCE use1x1 | 0.298 | 0.361 | 0.421 | 0.567 | 1.843 | 0.468 | 0.151 |
| | GL_RGBA use1x1 | 0.297 | 0.354 | 0.422 | 0.561 | 1.687 | 0.468 | 0.152 |
| | GL_LUMINANCE nodraw | 6.690 | 7.674 | 8.669 | 9.815 | 24.035 | 9.495 | 2.929 |
| | GL_RGBA nodraw | 3.208 | 3.501 | 3.973 | 5.974 | 12.059 | 4.737 | 1.589 |
| | GL_LUMINANCE use1x1 nodraw | 0.295 | 0.360 | 0.413 | 0.676 | 1.569 | 0.520 | 0.204 |
| | GL_RGBA use1x1 nodraw | 0.270 | 0.327 | 0.404 | 0.663 | 1.946 | 0.506 | 0.234 |
| 2013 Nexus 7 (Qualcomm Adreno 320) | GL_LUMINANCE | 0.732 | 0.976 | 1.190 | 3.907 | 22.249 | 2.383 | 1.879 |
| | GL_RGBA | 0.610 | 0.824 | 0.977 | 3.510 | 13.368 | 2.163 | 1.695 |
| | GL_LUMINANCE use1x1 | 0.030 | 0.061 | 0.061 | 0.091 | 3.143 | 0.080 | 0.187 |
| | GL_RGBA use1x1 | 0.030 | 0.061 | 0.091 | 0.092 | 4.303 | 0.104 | 0.248 |
| | GL_LUMINANCE nodraw | 0.793 | 1.098 | 3.570 | 4.425 | 25.760 | 3.001 | 2.076 |
| | GL_RGBA nodraw | 0.732 | 0.916 | 1.038 | 3.937 | 26.370 | 2.416 | 2.190 |
| | GL_LUMINANCE use1x1 nodraw | 0.030 | 0.061 | 0.091 | 0.092 | 4.181 | 0.090 | 0.204 |
| | GL_RGBA use1x1 nodraw | 0.030 | 0.061 | 0.091 | 0.122 | 4.272 | 0.114 | 0.292 |
| 2012 Nexus 10 (ARM Mali T604) | GL_LUMINANCE | 1.292 | 2.782 | 3.590 | 4.439 | 16.893 | 3.656 | 1.256 |
| | GL_RGBA | 1.451 | 2.782 | 3.432 | 4.358 | 8.517 | 3.551 | 0.982 |
| | GL_LUMINANCE use1x1 | 0.193 | 0.284 | 0.369 | 0.670 | 17.598 | 0.862 | 2.230 |
| | GL_RGBA use1x1 | 0.100 | 0.147 | 0.199 | 0.313 | 20.896 | 0.656 | 2.349 |
| | GL_LUMINANCE nodraw | 1.314 | 2.179 | 2.320 | 2.823 | 10.677 | 2.548 | 0.700 |
| | GL_RGBA nodraw | 1.209 | 2.101 | 2.196 | 2.539 | 5.008 | 2.414 | 0.553 |
| | GL_LUMINANCE use1x1 nodraw | 0.190 | 0.294 | 0.365 | 0.601 | 2.113 | 0.456 | 0.228 |
| | GL_RGBA use1x1 nodraw | 0.094 | 0.119 | 0.162 | 0.288 | 2.771 | 0.217 | 0.162 |

Yes, a bunch of raw data and no fancy graphs – not this time. Here are my observations:

• `GL_RGBA` textures are indeed a good deal faster than luminance ones on most devices. However, the ratio is not big enough to make CPU-side color space conversion to RGB (or even just interleaving the planes into an RGBA layout on the CPU side) a win, so there’s not much to do about it.
• Variability between devices is huge. Hooray for fragmentation.
• Newer devices tend to have fairly reasonable texture upload times, but there’s still lots of variation.
• Holy crap does the Droid X show badly in this test – it has both really slow upload times and horrible texture ghosting costs, and that despite us already alternating between a pair of texture sets! I hope that’s a problem that’s been fixed in the meantime, but since I don’t have any newer PowerVR devices here to test with, I can’t be sure.

So, to summarize it in one word: Ugh.

I originally intended to write something different for part 2 of this series, but since I’ve started rewriting that article no less than 3 times at this point, I’m just gonna switch the order of topics around a bit.

### Transpose from even/odd interleaves

I already showed one instance of this last time (“variant 2”) for the 4×4 case, but let’s go at it a bit more systematically. Since we’ve already beaten that size to death, let’s spice things up a bit and do 8×8 this time:

```a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
d0 d1 d2 d3 d4 d5 d6 d7
e0 e1 e2 e3 e4 e5 e6 e7
f0 f1 f2 f3 f4 f5 f6 f7
g0 g1 g2 g3 g4 g5 g6 g7
h0 h1 h2 h3 h4 h5 h6 h7
```

“Variant 2” from last time was the version where we started by interleaving rows not with their immediate neighbors, but with the rows that were two steps away. The key here is that we have to do multiple interleave steps to complete the transpose, and with every even-odd interleave, we space the original elements of a vector further apart. So if we interleaved rows A and B in the first step, we would have `a0 b0` after the first step, but `a0 xx b0 xx` as soon as we interleaved that with something else. That’s not what we want. So instead, we start by interleaving rows A and E – after a total of 3 passes, that should put `e0` where it’s supposed to go, 4 elements from `a0`.

The same argument goes for the other rows, too—so let’s just do an even-odd interleave between the entire first half and the last half of our rows:

```a0 e0 a1 e1 a2 e2 a3 e3
a4 e4 a5 e5 a6 e6 a7 e7
b0 f0 b1 f1 b2 f2 b3 f3
b4 f4 b5 f5 b6 f6 b7 f7
c0 g0 c1 g1 c2 g2 c3 g3
c4 g4 c5 g5 c6 g6 c7 g7
d0 h0 d1 h1 d2 h2 d3 h3
d4 h4 d5 h5 d6 h6 d7 h7
```

For the next step, we want to interleave the row containing `a0` with the row containing the elements that need to end up 2 places away from `a0` in the final result – namely, `c0` – and similarly for the other rows. Which again boils down to doing an even-odd interleave between the entire first and second halves:

```a0 c0 e0 g0 a1 c1 e1 g1
a2 c2 e2 g2 a3 c3 e3 g3
a4 c4 e4 g4 a5 c5 e5 g5
a6 c6 e6 g6 a7 c7 e7 g7
b0 d0 f0 h0 b1 d1 f1 h1
b2 d2 f2 h2 b3 d3 f3 h3
b4 d4 f4 h4 b5 d5 f5 h5
b6 d6 f6 h6 b7 d7 f7 h7
```

At this point we’re just one turn of the crank away from the result we want, so let’s go for it and do one more round of interleaves…

```a0 b0 c0 d0 e0 f0 g0 h0
a1 b1 c1 d1 e1 f1 g1 h1
a2 b2 c2 d2 e2 f2 g2 h2
a3 b3 c3 d3 e3 f3 g3 h3
a4 b4 c4 d4 e4 f4 g4 h4
a5 b5 c5 d5 e5 f5 g5 h5
a6 b6 c6 d6 e6 f6 g6 h6
a7 b7 c7 d7 e7 f7 g7 h7
```

and that’s it, 8×8 matrix successfully transposed.
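The whole construction is easy to verify in code. Here’s a sketch that performs one even/odd interleave pass over a flat 64-element array (rows stored back to back); three passes yield the transpose:

```cpp
#include <array>

// One even/odd interleave pass over 64 elements: interleave the first half
// (in[0..31]) with the second half (in[32..63]). Element index i moves to
// rotate-left-by-one of i within 6 bits, as described in the text.
static std::array<int, 64> interleave_pass(const std::array<int, 64>& in) {
    std::array<int, 64> out{};
    for (int i = 0; i < 32; ++i) {
        out[2 * i]     = in[i];       // "even" slots come from the first half
        out[2 * i + 1] = in[32 + i];  // "odd" slots come from the second half
    }
    return out;
}
```

Applying the pass three times swaps the 3 row-index bits with the 3 column-index bits; three more passes (six total) get back to the original arrangement.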

This is nothing I haven’t shown you already, although in a different order than before. This form here makes the underlying algorithm much clearer, and also generalizes in the obvious way to larger sizes, should you ever need them. But that’s not the reason I’m talking about this. Time to get to the fun part!

### Rotations

Let’s look a bit closer at how the elements move during every pass. For this purpose, let’s just treat all of the elements as a 64-element array. The first row contains the first 8 elements, the second row contains the second 8 elements, and so forth. `a0` starts out in row zero and column zero, the first element of the array, and stays there for the entire time (boring!). `b3` starts out in row 1, column 3 – that’s element number 11 (1×8 + 3). Now, the algorithm above simply applies the same permutation to the overall array 3 times. Let’s look at how the array elements move in one step—for reasons that will become clear in a second, I’ll give the array indices in binary:

```
From      To
000 000 → 000 000
000 001 → 000 010
000 010 → 000 100
000 011 → 000 110
    ⋮
001 000 → 010 000
001 001 → 010 010
001 010 → 010 100
    ⋮
100 000 → 000 001
100 001 → 000 011
100 010 → 000 101
100 011 → 000 111
    ⋮
101 010 → 010 101
    ⋮
```

Since we have 8 rows and 8 columns, the first 3 and last 3 bits in each index correspond to the row and column indices, respectively. Anyway, can you see the pattern? Even-odd interleaving the first half of the array with the second half in effect performs a bitwise rotate left on the element indices!

Among other things, this instantly explains why it takes exactly three passes to transpose a 8×8 matrix with this approach: the row and column indices take 3 bits each. So after 3 rotate-lefts, we’ve swapped the rows and columns—which is exactly what a matrix transpose does. Another salient point is that repeating even-odd interleaves like this will return us to our starting arrangement after 6 passes. This is easy to see once we know that such a step effectively rotates the bits of the element index; it’s not at all obvious when looking at the permutation by itself.

But it goes further than that. For one thing, the even/odd interleave construction really works for any even number of elements; it certainly works for all powers of two. So we’re not strictly limited to square matrices here. Say we have a 4×8 matrix (4 rows, 8 columns). That’s 32 elements total, or a 5-bit element index, laid out in binary like `r1r0c2c1c0`, where `r` are row index bits and `c` are column index bits. After two interleave passes, we’re at `c2c1c0r1r0`, corresponding to the transposed layout—a 8×4 matrix. To go from there to the original layout, we would have to run three more passes, rotating everything by another 3 bits, back to where we started.

Which illustrates another interesting point: for non-square matrices, going in one direction using this method can be much cheaper than going in the other direction. That’s because the even/odd interleave step only gives us a “rotate left” operation, and sometimes the shorter path would be to “rotate right” instead. On some architectures the corresponding deinterleave operation is available as well (e.g. ARM NEON), but often it’s not, at least not directly.
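This asymmetry is easy to demonstrate with a generic interleave pass (a sketch: two passes transpose a 4×8 matrix stored as a flat 32-element array, but three more passes are needed to get back):

```cpp
#include <cstddef>
#include <vector>

// Even/odd interleave of the first and second halves of a sequence of 2*h
// elements; each pass rotates the bits of every element index left by one.
static std::vector<int> interleave(const std::vector<int>& in) {
    size_t h = in.size() / 2;
    std::vector<int> out(in.size());
    for (size_t i = 0; i < h; ++i) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = in[h + i];
    }
    return out;
}
```

For 32 elements the index has 5 bits (layout r1r0c2c1c0 for 4 rows of 8), so two passes reach the transposed c2c1c0r1r0 layout and five passes total return to the start.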

### Groups

Let’s step back for a moment, though. We have an operation: the even/odd interleave between the first and second halves of a sequence of 2^k elements, which is really just a particular permutation. And now we know that we can get the inverse operation via repeated interleaving. Which means that our even-odd interleave generates a group. Now, as long as we really only do even-odd interleaves on the complete sequence, this group is a cyclic group of order k—it has to be: it’s a finite group generated by a single element, ergo a cyclic group, and we already know how long the cycle is, based on the “rotate left” property.

So to make matters a bit more interesting, let’s get back to the original topic of this article! Namely, the SIMD bit. While it’s convenient to build a complete matrix transpose out of a single operation on all elements simultaneously, namely a very wide even/odd interleave, that’s not how the actual code looks. We have fixed-width registers, and to synthesize anything wider than that, we have to break the data down into smaller chunks and process them individually anyway. However, we do need to have full even/odd interleaves to get a permutation group structure, so we can’t allow using single “low” or “high” interleave instructions without their opposite half.

What kind of model do we end up with? Let’s list the ingredients. We have:

• A set of n SIMD registers, each k “elements” wide. We’ll assume that k is a power of two. What an “element” is depends on context; a 128-bit SIMD register might be viewed as consisting of 16 byte-sized elements, or 8 16-bit elements, or 4 32-bit elements… you get the idea.
• An even-odd interleave (perfect shuffle) operation between a pair of SIMD registers. For convenience, we’ll assume that the results are returned in the same 2 registers. Implementing this might require another temporary register depending on the architecture, but let’s ignore that for the purposes of this article; we only care about the state of the registers before and after interleaves, not during them.
• Finally, we assume that registers are completely interchangeable, and that we can “rename” them at will; that is, we’ll consider all permutations that can be performed by just renumbering the registers to be equivalent.

To explain the latter, we would consider an arrangement like this:

```r0 = a0 b0 c0 d0
r1 = a1 b1 c1 d1
```

(where `r0` and `r1` correspond to SIMD registers) to be equivalent to:

```r0 = a1 b1 c1 d1
r1 = a0 b0 c0 d0
```

or even

```r3 = a0 b0 c0 d0
r5 = a1 b1 c1 d1
```

To rephrase it, we don’t care about differences in “register allocation”: as long as we get all the individual rows we need, any order will do.

### What permutations can be generated using interleaves?

This is a question I’ve been wondering about ever since I first saw the original MMX instructions, but I didn’t really spend much time thinking about it until fairly recently, when I got curious. So I don’t have a full answer, but I do have some interesting partial results.

Let’s get the trivial case out of the way first: if k=1 (that is, each register contains exactly one element), then clearly we can reach any permutation we want, and without doing any work to boot—every register contains one value, and as explained above, our model treats register-level permutations as free. However, as soon as we have multiple elements inside a single register, things start to get interesting.

### Permutations generated for n=2, k=2

We’re only permuting n×k = 4 elements here, so the groups in question are all small enough to enumerate their elements on a piece of paper, which makes this a good place to start. Also, with n=2, the “register permutation” side of things is really simple (we either swap the two registers or we don’t). For the even-odd interleave, we would normally have to specify which two registers to interleave—but since we only have two, we can just agree that we always want to interleave register 0 with register 1. Should we want the opposite order (interleave register 1 with register 0), we can simply swap the two registers beforehand. So our available operations boil down to just two permutations on 4 elements:

• Swap registers 0 and 1—this swaps the first two and the last two elements, so it corresponds to the permutation $0123 \mapsto 2301$ or (02)(13) in cycle notation. This is an involution: applying it twice swaps the elements back.
• Even/odd interleave between registers 0 and 1. This boils down to the permutation $0123 \mapsto 0213$ or (12) which swaps the two middle elements and is also an involution.

These are the only operations we permit, so we have a finite group generated by two involutions: it must be a dihedral group. In fact, it turns out to be Dih4, the symmetry group of a square, which has order 8. So using 2 registers, we can reach only 8 of the $4! = 24$ permutations of 4 elements. So what happens when we have more registers at our disposal?
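The order-8 claim is easy to check by machine. Here’s a small Python sketch (my own code, names mine) that computes the closure of the two generators:

```python
# Generators for n=2 registers, k=2 elements, written as permutations
# of the positions 0..3 (result[i] = input[perm[i]]).
SWAP_REGS  = (2, 3, 0, 1)   # 0123 -> 2301, i.e. (02)(13)
INTERLEAVE = (0, 2, 1, 3)   # 0123 -> 0213, i.e. (12)

def then(p, g):
    # permutation that applies p first, then g
    return tuple(p[i] for i in g)

# breadth-first closure under the two generators
group = {(0, 1, 2, 3)}
frontier = [(0, 1, 2, 3)]
while frontier:
    p = frontier.pop()
    for g in (SWAP_REGS, INTERLEAVE):
        q = then(p, g)
        if q not in group:
            group.add(q)
            frontier.append(q)

print(len(group))   # 8: Dih4, out of the 24 permutations in S4
```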

### Permutations generated for n>2, k=2

The next smallest case is n=3, k=2, which gives us permutations of 6 elements. $6! = 720$, so this is still small enough to simply run a search, and that’s what I did. Or, to be more precise, I wrote a program that did the searching for me. It turns out that the even-odd interleave of the first two registers combined with arbitrary permutations on the registers (of which there are $3! = 6$) is enough to reach all 720 permutations in S6, the symmetric group on 6 elements. Beyond this, I can’t say much about how this representation of the group works out; it would be nice if there was an easy way to find shortest paths between permutations for example (which would have uses for code generation), but if there is, I don’t know how. That said, my understanding of group theory is fairly basic; I’d really appreciate input from someone with more experience dealing with finite groups here.
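The search itself is tiny. Here’s my own Python reconstruction (not the author’s original program) that builds the closure and confirms all 720 permutations are reachable:

```python
from itertools import permutations

N_REGS, K = 3, 2                    # 3 registers, 2 elements each
IDENT = tuple(range(N_REGS * K))

# Even/odd interleave of registers 0 and 1, as a position permutation:
# [0,1 | 2,3 | 4,5] -> [0,2 | 1,3 | 4,5]
INTERLEAVE01 = (0, 2, 1, 3, 4, 5)

# The 3! = 6 register renamings: permute the element pairs wholesale.
renames = []
for regs in permutations(range(N_REGS)):
    perm = []
    for r in regs:
        perm.extend(range(r * K, r * K + K))
    renames.append(tuple(perm))

def then(p, g):
    # permutation that applies p first, then g
    return tuple(p[i] for i in g)

group = {IDENT}
frontier = [IDENT]
while frontier:
    p = frontier.pop()
    for g in [INTERLEAVE01] + renames:
        q = then(p, g)
        if q not in group:
            group.add(q)
            frontier.append(q)

print(len(group))   # 720 = 6!, so every permutation in S6 is reachable
```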

I can tell you what happens for n>3, though: we already know we can produce all permutations using only 3 registers. And using the exact same steps, we can reach any permutation of the first 3 registers for n>3, leaving the other registers untouched. But that’s in fact enough to generate an arbitrary permutation of n elements, as follows: Say we have n=4, and we start with

```r0 = e0 e1
r1 = e2 e3
r2 = e4 e5
r3 = e6 e7
```

and we want to end up with the (arbitrarily chosen) permutation

```r0 = e3 e7
r1 = e0 e5
r2 = e1 e4
r3 = e6 e2
```

To begin, we try to generate the value we want to end up in `r3` (namely, `e6 e2`). First, we swap rows around so that we have the source values we need in rows 0 and 1 (that is, registers `r0` and `r1`). In our example, that just requires swapping `r0` with `r3`:

```r0 = e6 e7
r1 = e2 e3
r2 = e4 e5
r3 = e0 e1
```

Next, we know that we can reach arbitrary permutations of the first three rows (registers). In particular, we can shuffle things around so that `r0` contains the value we’d like to be in `r3`:

```r0 = e6 e2
r1 = e7 e3
r2 = e4 e5
r3 = e0 e1
```

This is one way of doing it, but really any permutation that has `e6 e2` in the first row would work. Anyway, now that we have produced the value we wanted, we can swap it back into `r3`:

```r0 = e0 e1
r1 = e7 e3
r2 = e4 e5
r3 = e6 e2
```

We now have the value we want in `r3`, and all the remaining unused source elements remain in rows 0–2. And again, since we know we can achieve arbitrary permutations of 6 elements using 3 registers using the n=3 case, we’re done! For n>4, the proof works in the same way: we first generate rows 3, 4, …, n-1 one by one; once a row is done, we never touch it again (it can’t contain any source elements we need for the remaining rows, since we only allow permutations). In the end we will always arrive at a configuration that has all the remaining “unfinished” elements in rows 0–2, which is a configuration we can solve.

I don’t mean to suggest that this is an efficient way to solve this problem; quite the opposite, in fact. But it’s an easy way to prove that once we have n=3 solved, higher n’s don’t add any substantial difficulty.

### Summary

This covers the easiest cases, k=1 and k=2, and answers the question I originally wondered about in the positive: using only interleaves, you can produce arbitrary permutations of the input elements in registers, as long as you only have k=2 elements per register (for example, 32-bit values inside the 64-bit MMX registers) and at least 3 SIMD registers worth of storage. Without any nice bounds on the number of operations required or an algorithmic way to compute an optimal interleave/rename sequence, I’m the first one to admit that this has little practical relevance, but it’s cool stuff nonetheless! Coming up, I’ll talk a bit about the (more interesting) k=4 case, but I think this is enough material for a single blog post. Until next time!

This one tends to show up fairly frequently in SIMD code: Matrix transposes of one sort or another. The canonical application is transforming data from AoS (array of structures) to the more SIMD-friendly SoA (structure of arrays) format: For concreteness, say we have 4 float vertex positions in 4-wide SIMD registers

```  p0 = { x0, y0, z0, w0 }
p1 = { x1, y1, z1, w1 }
p2 = { x2, y2, z2, w2 }
p3 = { x3, y3, z3, w3 }
```

and would really like them in transposed order instead:

```  X = { x0, x1, x2, x3 }
Y = { y0, y1, y2, y3 }
Z = { z0, z1, z2, z3 }
W = { w0, w1, w2, w3 }
```

Note that here and in the following, I’m writing SIMD 4-vectors as arrays of 4 elements – none of this nonsense that some authors tend to do where they write vectors as “w, z, y, x” on Little Endian platforms. Endianness is a concept that makes sense for numbers and no sense at all for arrays, which SIMD vectors are, but that’s a rant for another day, so just be advised that I’m always writing things in the order that they’re stored in memory.

Anyway, transposing vectors like this is one application, and the one I’m gonna stick with for the moment because it “only” requires 4×4 values, which are the smallest “interesting” size in a certain sense. Keep in mind there are other applications though. For example, when implementing 2D separable filters, the “vertical” direction (filtering between rows) is usually easy, whereas “horizontal” (filtering between columns within the same register) is trickier – to the point that it’s often faster to transpose, perform a vertical filter, and then transpose back. Anyway, let’s not worry about applications right now, just trust me that it tends to come up more frequently than you might expect. So how do we do this?

### One way to do it

The method I see most often is to try and group increasingly larger parts of the result together. For example, we’d like to get “x0” and “x1” adjacent to each other, and the same with “x2” and “x3”, “y0” and “y1” and so forth. The canonical way to do this is using the “unpack” (x86), “merge” (PowerPC) or “unzip” (ARM NEON) intrinsics. So to bring “x0” and “x1” together in the right order, we would do:

```  a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
```

where `interleave32_lft` (“interleave 32-bit words, left half”) corresponds to `UNPCKLPS` (x86, floats), `PUNPCKLDQ` (x86, ints), or `vmrghw` (PowerPC). And to be symmetric, we do the same thing with the other half, giving us:

```  a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
a1 = interleave32_rgt(p0, p1) = { z0, z1, w0, w1 }
```

where `interleave32_rgt` corresponds to `UNPCKHPS` (x86, floats), `PUNPCKHDQ` (x86, ints), or `vmrglw` (PowerPC). The reason I haven’t mentioned the individual opcodes for NEON is that their “unzips” always work on pairs of registers and handle both the “left” and “right” halves at once, forming a combined

```  (a0, a1) = interleave32(p0, p1)
```

operation (`VUZP.32`) that also happens to be a good way to think about the whole operation on other architectures – even though it is not the ideal way to perform transposes on NEON, but I’m getting ahead of myself here. Anyway, again by symmetry we then do the same process with the other two rows, yielding:

```  // (a0, a1) = interleave32(p0, p1)
// (a2, a3) = interleave32(p2, p3)
a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
a1 = interleave32_rgt(p0, p1) = { z0, z1, w0, w1 }
a2 = interleave32_lft(p2, p3) = { x2, x3, y2, y3 }
a3 = interleave32_rgt(p2, p3) = { z2, z3, w2, w3 }
```

And presto, we now have all even-odd pairs nicely lined up. Now we can build `X` by combining the left halves from `a0` and `a2`. Their respective right halves also combine into `Y`. So we do a similar process like before, only this time we’re working on groups that are pairs of 32-bit values – in other words, we’re really dealing with 64-bit groups:

```  // (X, Y) = interleave64(a0, a2)
// (Z, W) = interleave64(a1, a3)
X = interleave64_lft(a0, a2) = { x0, x1, x2, x3 }
Y = interleave64_rgt(a0, a2) = { y0, y1, y2, y3 }
Z = interleave64_lft(a1, a3) = { z0, z1, z2, z3 }
W = interleave64_rgt(a1, a3) = { w0, w1, w2, w3 }
```

This time, `interleave64_lft` (`interleave64_rgt`) correspond to `MOVLHPS` (`MOVHLPS`) for floats on x86, `PUNPCKLQDQ` (`PUNPCKHQDQ`) for ints on x86, or `VSWP` of `d` registers on ARM NEON. PowerPCs have no dedicated instruction for this but can synthesize it using `vperm`. The variety here is why I use my own naming scheme in this article, by the way.
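The whole two-step sequence is easy to model on plain Python lists. In the sketch below (my own test harness, not real intrinsics), the functions follow the article’s naming, and the element values are just strings:

```python
def interleave32_lft(a, b):   # UNPCKLPS-style: low halves, interleaved
    return [a[0], b[0], a[1], b[1]]

def interleave32_rgt(a, b):   # UNPCKHPS-style: high halves, interleaved
    return [a[2], b[2], a[3], b[3]]

def interleave64_lft(a, b):   # MOVLHPS-style: low 64-bit halves
    return a[:2] + b[:2]

def interleave64_rgt(a, b):   # MOVHLPS-style: high 64-bit halves
    return a[2:] + b[2:]

p0 = ['x0', 'y0', 'z0', 'w0']
p1 = ['x1', 'y1', 'z1', 'w1']
p2 = ['x2', 'y2', 'z2', 'w2']
p3 = ['x3', 'y3', 'z3', 'w3']

a0 = interleave32_lft(p0, p1)   # x0 x1 y0 y1
a1 = interleave32_rgt(p0, p1)   # z0 z1 w0 w1
a2 = interleave32_lft(p2, p3)   # x2 x3 y2 y3
a3 = interleave32_rgt(p2, p3)   # z2 z3 w2 w3

X = interleave64_lft(a0, a2)
Y = interleave64_rgt(a0, a2)
Z = interleave64_lft(a1, a3)
W = interleave64_rgt(a1, a3)
print(X, Y, Z, W)   # each row now holds one component across all 4 vertices
```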

Anyway, that’s one way to do it with interleaves. There’s more than one, however!

### Interleaves, variant 2

What if, instead of interleaving `p0` with `p1`, we pair it with `p2` instead? By process of elimination, that means we have to pair `p1` with `p3`. Where does that lead us? Let’s find out!

```  // (b0, b2) = interleave32(p0, p2)
// (b1, b3) = interleave32(p1, p3)
b0 = interleave32_lft(p0, p2) = { x0, x2, y0, y2 }
b1 = interleave32_lft(p1, p3) = { x1, x3, y1, y3 }
b2 = interleave32_rgt(p0, p2) = { z0, z2, w0, w2 }
b3 = interleave32_rgt(p1, p3) = { z1, z3, w1, w3 }
```

Can you see it? We have four nice little squares in each of the quadrants now, and are in fact just one more set of interleaves away from our desired result:

```  // (X, Y) = interleave32(b0, b1)
// (Z, W) = interleave32(b2, b3)
X = interleave32_lft(b0, b1) = { x0, x1, x2, x3 }
Y = interleave32_rgt(b0, b1) = { y0, y1, y2, y3 }
Z = interleave32_lft(b2, b3) = { z0, z1, z2, z3 }
W = interleave32_rgt(b2, b3) = { w0, w1, w2, w3 }
```

This one uses just one type of interleave instruction, which is preferable if the 64-bit interleaves don’t exist on your target platform (PowerPC) or would require loading a different permutation vector (SPUs, which have to do the whole thing using `shufb`).
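The same list-based sketch as before works here too (again my own modeling of the article’s naming, not real intrinsics), this time using only the 32-bit interleaves:

```python
def interleave32_lft(a, b):   # low halves, interleaved
    return [a[0], b[0], a[1], b[1]]

def interleave32_rgt(a, b):   # high halves, interleaved
    return [a[2], b[2], a[3], b[3]]

p0 = ['x0', 'y0', 'z0', 'w0']
p1 = ['x1', 'y1', 'z1', 'w1']
p2 = ['x2', 'y2', 'z2', 'w2']
p3 = ['x3', 'y3', 'z3', 'w3']

b0 = interleave32_lft(p0, p2)   # x0 x2 y0 y2
b1 = interleave32_lft(p1, p3)   # x1 x3 y1 y3
b2 = interleave32_rgt(p0, p2)   # z0 z2 w0 w2
b3 = interleave32_rgt(p1, p3)   # z1 z3 w1 w3

X = interleave32_lft(b0, b1)
Y = interleave32_rgt(b0, b1)
Z = interleave32_lft(b2, b3)
W = interleave32_rgt(b2, b3)
print(X, Y, Z, W)
```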

### It gets a bit weird

Well, let’s just plunge ahead and start by 64-bit interleaving `p0` and `p1`, then see whether it leads anywhere.

```  // (c0, c1) = interleave64(p0, p1)
// (c2, c3) = interleave64(p2, p3)
c0 = interleave64_lft(p0, p1) = { x0, y0, x1, y1 }
c1 = interleave64_rgt(p0, p1) = { z0, w0, z1, w1 }
c2 = interleave64_lft(p2, p3) = { x2, y2, x3, y3 }
c3 = interleave64_rgt(p2, p3) = { z2, w2, z3, w3 }
```

Okay. For this one, we can’t continue with our regular interleaves, but we still have the property that each of our target vectors (X, Y, Z, and W) can be built using elements from only two of the c’s. In fact, the low half of each target vector comes from one c and the high half from another, which means that on x86, we can combine the two using `SHUFPS`. On PPC, there’s always `vperm`, SPUs have `shufb`, and NEON has `VTBL`, all of which are much more general, so again, it can be done there as well:

```  // 4 SHUFPS on x86
X = { c0[0], c0[2], c2[0], c2[2] } = { x0, x1, x2, x3 }
Y = { c0[1], c0[3], c2[1], c2[3] } = { y0, y1, y2, y3 }
Z = { c1[0], c1[2], c3[0], c3[2] } = { z0, z1, z2, z3 }
W = { c1[1], c1[3], c3[1], c3[3] } = { w0, w1, w2, w3 }
```

As said, this one is a bit weird, but it’s the method used for `_MM_TRANSPOSE4_PS` in Microsoft’s version of Intel’s `xmmintrin.h` (SSE intrinsics header) to this day, and it used to be the standard implementation in GCC’s version as well until it got replaced with the first method I discussed.
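In the same list-based model, the final combine step can be written with a four-index selector standing in for `SHUFPS` (my sketch of the pattern, not Intel’s macro):

```python
def shufps_like(a, b, i0, i1, j0, j1):
    # models SHUFPS: picks two elements from a, then two from b
    return [a[i0], a[i1], b[j0], b[j1]]

# the c's after the 64-bit interleave stage described above
c0 = ['x0', 'y0', 'x1', 'y1']
c1 = ['z0', 'w0', 'z1', 'w1']
c2 = ['x2', 'y2', 'x3', 'y3']
c3 = ['z2', 'w2', 'z3', 'w3']

X = shufps_like(c0, c2, 0, 2, 0, 2)   # x0 x1 x2 x3
Y = shufps_like(c0, c2, 1, 3, 1, 3)   # y0 y1 y2 y3
Z = shufps_like(c1, c3, 0, 2, 0, 2)   # z0 z1 z2 z3
W = shufps_like(c1, c3, 1, 3, 1, 3)   # w0 w1 w2 w3
print(X, Y, Z, W)
```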

Anyway, that was what we get by starting with a 64-bit interleave of `p0` and `p1`. Can we pull off something similar if we pair `p0` with `p2` instead?

### The plot thickens

Again, let’s just try it!

```  // (c0, c2) = interleave64(p0, p2)
// (c1, c3) = interleave64(p1, p3)
c0 = interleave64_lft(p0, p2) = { x0, y0, x2, y2 }
c1 = interleave64_lft(p1, p3) = { x1, y1, x3, y3 }
c2 = interleave64_rgt(p0, p2) = { z0, w0, z2, w2 }
c3 = interleave64_rgt(p1, p3) = { z1, w1, z3, w3 }
```

Huh. This one leaves the top left and bottom right 2×2 blocks alone and swaps the other two. But we still got closer to our goal – if we swap the top right and bottom left element in each of the four 2×2 blocks, we have a full transpose as well. And NEON happens to have an instruction for that (`VTRN.32`). As usual, the other platforms can try to emulate this using more general shuffles:

```  // 2 VTRN.32 on NEON:
// (X, Y) = vtrn.32(c0, c1)
// (Z, W) = vtrn.32(c2, c3)
X = { c0[0], c1[0], c0[2], c1[2] } = { x0, x1, x2, x3 }
Y = { c0[1], c1[1], c0[3], c1[3] } = { y0, y1, y2, y3 }
Z = { c2[0], c3[0], c2[2], c3[2] } = { z0, z1, z2, z3 }
W = { c2[1], c3[1], c2[3], c3[3] } = { w0, w1, w2, w3 }
```

Just like NEON’s “unzip” instructions, `VTRN` both reads and writes two registers, so it is in essence doing the work of two instructions on the other architectures. Which means that we now have 4 different methods to do the same thing that are essentially the same cost in terms of computational complexity. Sure, some methods end up faster than others on different architectures due to various implementation choices, but really, in essence none of these are fundamentally more difficult (or easier) than the others.
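Modeled on lists like the earlier variants (my sketch; `vtrn32` below mimics what `VTRN.32` does to a register pair):

```python
def vtrn32(a, b):
    # models NEON VTRN.32: within each 2x2 block spanning the two
    # registers, swap the off-diagonal elements; returns both results
    return [a[0], b[0], a[2], b[2]], [a[1], b[1], a[3], b[3]]

# the c's after interleaving (p0, p2) and (p1, p3) as described above
c0 = ['x0', 'y0', 'x2', 'y2']
c1 = ['x1', 'y1', 'x3', 'y3']
c2 = ['z0', 'w0', 'z2', 'w2']
c3 = ['z1', 'w1', 'z3', 'w3']

X, Y = vtrn32(c0, c1)
Z, W = vtrn32(c2, c3)
print(X, Y, Z, W)   # the full transpose, finished by two VTRN-style ops
```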

Nor are these the only ones – for the last variant, we started by swapping the 2×2 blocks within the 4×4 matrix and then transposing the individual 2×2 blocks, but doing it the other way round works just as well (and is again the same cost). In fact, this generalizes to arbitrary power-of-two sized square matrices – you can just partition it into differently sized block transposes which can run in any order. This even works with rectangular matrices, with some restrictions. (A standard way to perform “chunky to planar” conversion for old bit plane-based graphics architectures uses this general approach to good effect).

### And now?

Okay, so far, we have a menagerie of different matrix transpose techniques, all of which essentially have the same complexity. If you’re interested in SIMD coding, I suppose you can just use this as a reference. However, that’s not the actual reason I’m writing this; the real reason is that the whole “why are these things all essentially the same complexity” thing intrigued me, so a while back I looked into this and found out a whole bunch of cool properties that are probably not useful for coding at all, but which I nevertheless found interesting. In other words, I’ll write a few more posts on this topic, which I will spend gleefully nerding out with no particular goal whatsoever. If you don’t care, just stop reading now. You’re welcome!

This post is outside the usual purely technical focus of this blog, but the subject matter is important to me, and I wanted to write down my thoughts on it while they were still fresh.

This is a tweet by Anita Sarkeesian, and a disturbingly large number of misogynistic (and outright hateful) replies she got in response. This would be a shameful display in any context, but it is especially bad considering that her “objectionable” statement is both a factual one and completely right. There were no games with female protagonists shown at the Xbox One E3 reveal.

So why the hate? Presumably because of things like her “Tropes vs. Women in Video Games” series (which is excellent, and which you should watch immediately if you haven’t done so already. Go ahead. This post will still be there when you come back). As a player of video games, it’s hard not to notice how obviously, painfully right the points she makes are. I think Anita is doing the game developer community a great service by pointing out how badly we’re currently screwing up in this regard; not only can we do better, we should be collectively ashamed that we don’t.

However, that is not the point I’m trying to make here. Instead, I want to address a rebuttal I hear frequently in cases like this: “Relax, you’re taking it far too seriously, it’s just a game!”, or alternatively, “Games don’t affect people like that”. (This statement also makes frequent appearance in discussions on the role of violence in games).

When coming from game developers, I find this statement extremely baffling. And here’s why: most games referenced in this context are interactive experiences that a player will spend at least a few hours with. Considering how the human brain works, it is exceedingly unlikely that something, anything a player spends that amount of time on will leave no trace whatsoever in her brain. That does not mean that violent games make people violent, or that rampant misogyny in video games makes players hate women. That kind of thinking is far too simplistic and reductionist. But still: players will spend multiple hours of their life in a virtual world of your creation. That experience will, in some way or other, influence their thinking. Independent of whether it’s actually “harmful” to submerge players in a world of casual misogyny and ultraviolence – is there any point to it? Or is it just something we do out of laziness and without consciously thinking about it? And if so, what does that say about us?

But let’s take a step back. Say you really, truly believe that games leave no lasting impression of any kind on the player. That whether they play it or not makes no difference at all, for better or for worse, to their lives – they might as well have been sleeping. Because if you really think that is the case – then why are you spending years of your life working on the damn things?