Another Android post. As you can tell by the previous post, I’ve been doing some testing on Android lately, and before that came building, deploying, and testing/debugging. As part of the latter, I was trying to get ndk-gdb to work, on which I spent about one and a half days full-time (without success), and then later about as much time waiting (and sometimes answering questions) when some Android devs on Twitter took pity on me and helped me figure the problem out. Since I found no mention of this issue in the usual places (Stack Overflow etc.), I’m writing it up here in case someone else runs into the same issues later.
First problem: Android 4.3
The symptom here is that ndk-gdb won’t be able to successfully start the target app at all, and dies pretty early with an error – “package is unknown”, “data directory not found”, or something similar. This is apparently something that got broken in the update to Android 4.3 – see this issue in the Android bug tracker. If you run into this problem with Android 4.3 running on an “official” Google device (i.e. the Nexus-branded ones), it can be fixed by flashing the device to use the Google-provided Factory Images with Android 4.3. If you’re not on a Nexus device – well, sucks to be you; there are some other workarounds mentioned in the issue, and Google says that the problem will be fixed in a “future release”, but for now you’re out of luck.
Second problem: “Waiting for Debugger” – NDK r9 vs. JDK 7
The second problem occurs a bit further down the line: ndk-gdb actually manages to launch the app on the phone and gets into gdb, but the app then freezes showing a “Waiting for Debugger” screen and won’t continue no matter what you do. Note that there are lots of ways to get stuck at that screen, see Stack Overflow and the like; in particular, if you see that screen even when launching the app directly on the Android device (instead of starting it via ndk-gdb --start or ndk-gdb --launch on the host), this is a completely different problem and what I’m describing here doesn’t apply.
Anyway, this one took ages to figure out. After about two days (when I had managed to find the original problem I was trying to debug on a colleague’s machine, where ndk-gdb worked), I realized that everything seemed to work fine on his machine, which had an older Android NDK, but did not work on two of my machines, which were both using NDK r9. So I went over the change log for r9 to check if there was anything related to ndk-gdb, and indeed, there was this item:
- Updated ndk-gdb script so that the --start or --launch actions now wait for the GNU Debug Server, so that it can more reliably hit breakpoints set early in the execution path (such as breakpoints in JNI code). (Issue 41278)
  Note: This feature requires jdb and produces warning about pending breakpoints. Specify the --nowait option to restore previous behavior.
Aha! Finally, a clue. So I tried running my projects with ndk-gdb --start --nowait, and indeed, that worked just fine (in retrospect, I should have searched for a way to disable the wait sooner, but hindsight is always 20/20). That was good enough for me, although it meant I didn’t get to enjoy the fix for the Android issue cited in the change log. This is annoying, but not hard to work around: just sleep for a few seconds early in your JNI main function to give the debugger time to attach. I was still curious about what’s going on though, but I had absolutely no clue how to proceed from there – digging into it any further would’ve required knowledge of internals that I just didn’t have.
This is when an Android developer on Twitter offered to step in and see if he could figure it out for me; all I had to do was give him some debug logs. Fair enough! And today around noon, he hit paydirt.
Okay, so here’s the problem: as the change log entry notes, the “wait for debugger” is handled in the Java part of the app and goes through jdb (the Java debugger), whereas the native code side is handled by gdbserver and gdb. And the problem was on the Java side of the fence, which I really don’t know anything about. Anyway, I could attach jdb just fine (and run jdb commands successfully), but the wait dialog on the Android phone just wouldn’t go away no matter what I did. It turns out that the problem was caused by me using JDK 7, when Android only officially supports JDK 6. Everything else that I’ve tried worked fine, and none of the build (or other) Android SDK scripts complained about the version mismatch, but apparently on the Android side, things won’t work correctly if a JDK 7 version of jdb tries to connect. And while you’re at it, make sure you’re using a 32-bit JDK too, even if you’re on a 64-bit machine; I didn’t make that mistake, but apparently that one can cause problems too. After I switched to the 32-bit JDK6u38 from here (the old Java JDK site, which unlike the new Oracle-hosted site won’t make you create a user account if you want to download old versions) things started working: I can now use ndk-gdb just fine, and it properly waits for the debugger to attach so I can set breakpoints as early as I like without resorting to the sleep hack.
Summary (aka TL;DR)
- Use Android 4.2 or older, or flash to the “factory image” if you want native debugging to even start.
- If you’re using NDK r9, make sure you’re using a 32-bit JDK 6 (not 7) or you might get stuck at the “Waiting for Debugger” prompt indefinitely.
Thanks to Branimir Karadžić for pointing out the first issue to me (if I hadn’t known that it was a general Android 4.3 thing, I would’ve wasted a lot of time on this), and huge thanks to Justin Webb for figuring out the second one!
Well done, Android. The Enrichment Center once again reminds you that Android Hell is a real place where you will be sent at the first sign of defiance.
RAD Game Tools, my employer, recently (version 2.2e) started shipping Bink 2.2 on Android, and we decided it was time to go over the example texture upload code in our SDK and see whether there were any changes we should make – the original code was written years ago. So far, the answer seems to be “no”: what we’re doing seems to be about as good as we can expect when sticking to “mainstream” GL ES, reasonably widely-deployed extensions and the parts of Android officially exposed in the NDK. That said, I have done a bunch of performance measurements over the past few days, and they paint an interesting (if somewhat depressing) picture. A lot of people on Twitter seemed to be interested in my initial findings, so I asked my boss if it was okay if I published the “proper” results here and he said yes – hence this post.
Setting
Okay, here’s what we’re measuring: we’re playing back a 1280×720 29.97fps Bink 2 video – namely, an earlier cut of this trailer that we have a very high-quality master version of (it’s one of our standard test videos); I’m sure the exact version of the video we use is on the Internet somewhere too, but I didn’t find it within the 2 minutes of googling, so here goes. We’re only using the first 700 frames of the video to speed up testing (700 frames is enough to get a decent sample).
Like most popular video codecs, Bink 2 produces output data in planar YUV, with the U/V color planes sub-sampled 2x both horizontally and vertically. These three planes get uploaded as 3 separate textures (which together form a “texture set”): one 1280×720 texture for luminance (Y) and two 640×360 textures for chrominance (Cb/Cr). (Bink and Bink 2 also support encoding an alpha channel, which adds another 1280×720 texture to the set). All three textures use the GL_LUMINANCE pixel format by default, with GL_UNSIGNED_BYTE data; that is, one byte per texel. This data is converted to RGB using a simple fragment shader.
Every frame, we upload new data for all 3 textures in a set using glTexSubImage2D from the internal video frame buffers, uploading the entire image (we could track dirty regions fairly easily, but with slow uploads this increases frame rate variance, which is a bad thing). We then draw a single quad using the 3 textures and our fragment shader. All very straightforward stuff.
Furthermore, we actually keep two texture sets around – everything is double-buffered. You will see why this is a good idea despite the increased memory consumption in a second.
A wrinkle
One problem with GL ES targets is that the original GL ES went a bit overboard in removing core GL features. One important feature they removed was GL_UNPACK_ROW_LENGTH – this parameter sets the distance between adjacent rows in a client-specified image, counted in pixels. Why would you care about this? Simple: Say you have a 256×256 texture that you want to update from a system memory copy, but you know that you only changed the lower-left 128×128 pixels. By default, glTexSubImage2D with width = height = 128 will assume that the rows of the source image are 128 pixels wide and densely packed. Thus, to update just a 128×128 pixel region, you would have to either copy the lower left 128×128 pixels of your system memory texture into a smaller array that is densely packed, or call glTexSubImage2D 128 times, uploading a row at a time. Neither of these is very appealing from a performance perspective. But if you have GL_UNPACK_ROW_LENGTH, you can just set it to 256 and upload everything with a single call. Much nicer.
The reason Bink 2 needs this is because we support arbitrary-width videos, but like most video codecs, the actual coding is done in terms of larger units. For example, MPEG 1 through to H.264 all use 16×16-pixel “macroblocks”, and any video that is not a multiple of 16 pixels will get padded out to a multiple-of-16-size internally. Even if you didn’t need the extra data in the codec, you would still want adjacent rows in the plane buffers to be multiples of at least 16 pixels, simply so that every row is 16-byte aligned (an important magic number for a lot of SIMD instruction sets). Bink 2’s equivalent of macroblocks is 32×32 pixels in size, so we internally want rows to be a multiple of 32 pixels wide.
What this all means is that if you decide you really want a 65×65 pixel video, that’s fine, but we’re going to allocate our internal buffers as if it was 96 pixels wide (and 80 pixels tall – we can omit storage for the last 16 rows in the last macroblock). Which is where the unpack row length comes into play – if we have it, we can support “odd-sized” videos efficiently; if we don’t, we have to use the slower fallback, i.e. call glTexSubImage2D for every scan line individually. Luckily, there is the GL_EXT_unpack_subimage GL ES extension that adds this feature back in and is available on most recent devices; but for “odd” sizes on older devices, we’re stuck with uploading a row at a time.
That said, none of this affects our test video, since 1280 pixels width is a multiple of 32; I just thought I’d mention it anyway since it’s one of those random, non-obvious API compatibility issues you run into. Anyway, back to the subject.
Measuring texture updates
Okay, so here’s what I did: Bink 2 decodes the video on another (or multiple other) threads. Periodically – ideally, 30 times a second – we upload the current frame and draw it to the screen. My test program will never drop any frames; in other words, we may run slower than 30fps, but we will always upload and render all 700 frames, and we will never run faster than 30fps (well, 29.97fps, but close enough).
Around the texture upload, my test program does this:
// update the GL textures
clock_t start = clock();
Update_Bink_textures( &TextureSet, Bink );
clock_t total = clock() - start;
upload_stats.record( (float) ( 1000.0 * total / CLOCKS_PER_SEC ) );
where upload_stats is an instance of the RunStatistics class I used in the Optimizing Software Occlusion Culling series. This gives me order statistics, mean and standard deviation for the texture update times, in milliseconds.
I also have several different test variants that I run:
- GL_LUMINANCE tests upload the texture data as GL_LUMINANCE as explained above. This is the “normal” path.
- GL_RGBA tests upload the same bytes as a GL_RGBA texture, with all X coordinates (and the texture width) divided by 4. In other words, they transfer the same amount of data (and in fact the same data), just interpreted differently. This was done to check whether RGBA textures enjoy special optimizations in the drivers (spoiler: it seems like they do).
- use1x1 tests force all glTexSubImage2D calls to upload just 1×1 pixels – in other words, this gives us the cost of API overhead, possible synchronization and texture ghosting while virtually removing any per-pixel costs (such as CPU color space conversion, swizzling, DMA transfers or memory bandwidth).
- nodraw tests do all of the texture uploading, but then don’t actually draw the quad. This still measures processing time for the texture upload, but since the texture isn’t actually used, no synchronization or ghosting is ever necessary.
- uploadall uses glTexImage2D instead of glTexSubImage2D to upload the whole texture. In theory, this will guarantee to the driver that all existing texture data is overwritten – so while texture ghosting might still have to allocate memory for a new texture, it won’t have to copy the old contents at least. In practice, it’s not clear if the drivers actually make use of that fact. For obvious reasons, this and use1x1 are mutually exclusive, and I only ran this test on the PowerVR device.
Results
So, without further ado, here are the results on the 4 devices I tested (apologies for the tiny font size, but that was the only way to squeeze it into the blog layout):
| Device / GPU | Format | min | 25th | med | 75th | max | avg | sdev |
|---|---|---|---|---|---|---|---|---|
| 2010 Droid X (PowerVR SGX 530) | GL_LUMINANCE | 14.190 | 15.472 | 17.700 | 20.233 | 70.893 | 19.704 | 5.955 |
| | GL_RGBA | 11.139 | 13.245 | 14.221 | 14.832 | 28.412 | 14.382 | 1.830 |
| | GL_LUMINANCE use1x1 | 0.061 | 38.269 | 39.398 | 41.077 | 93.750 | 41.905 | 6.517 |
| | GL_RGBA use1x1 | 0.061 | 30.761 | 32.348 | 32.837 | 59.906 | 33.165 | 4.305 |
| | GL_LUMINANCE nodraw | 9.979 | 12.726 | 13.427 | 14.985 | 29.632 | 13.854 | 1.788 |
| | GL_RGBA nodraw | 5.188 | 10.376 | 11.291 | 12.024 | 26.215 | 10.864 | 2.013 |
| | GL_LUMINANCE use1x1 nodraw | 0.030 | 0.061 | 0.061 | 0.092 | 0.733 | 0.086 | 0.058 |
| | GL_RGBA use1x1 nodraw | 0.030 | 0.061 | 0.061 | 0.091 | 0.916 | 0.082 | 0.081 |
| | GL_LUMINANCE uploadall | 13.611 | 15.106 | 17.822 | 19.653 | 73.944 | 19.312 | 6.145 |
| | GL_RGBA uploadall | 7.171 | 12.543 | 13.489 | 14.282 | 34.119 | 13.751 | 1.854 |
| | GL_LUMINANCE uploadall nodraw | 9.491 | 12.756 | 13.702 | 14.862 | 33.966 | 13.994 | 2.176 |
| | GL_RGBA uploadall nodraw | 5.158 | 9.796 | 10.956 | 11.718 | 22.735 | 10.465 | 2.135 |
| 2012 Nexus 7 (Nvidia Tegra 3) | GL_LUMINANCE | 6.659 | 7.706 | 8.710 | 10.627 | 18.842 | 9.597 | 2.745 |
| | GL_RGBA | 3.278 | 3.600 | 4.128 | 4.906 | 9.244 | 4.395 | 1.011 |
| | GL_LUMINANCE use1x1 | 0.298 | 0.361 | 0.421 | 0.567 | 1.843 | 0.468 | 0.151 |
| | GL_RGBA use1x1 | 0.297 | 0.354 | 0.422 | 0.561 | 1.687 | 0.468 | 0.152 |
| | GL_LUMINANCE nodraw | 6.690 | 7.674 | 8.669 | 9.815 | 24.035 | 9.495 | 2.929 |
| | GL_RGBA nodraw | 3.208 | 3.501 | 3.973 | 5.974 | 12.059 | 4.737 | 1.589 |
| | GL_LUMINANCE use1x1 nodraw | 0.295 | 0.360 | 0.413 | 0.676 | 1.569 | 0.520 | 0.204 |
| | GL_RGBA use1x1 nodraw | 0.270 | 0.327 | 0.404 | 0.663 | 1.946 | 0.506 | 0.234 |
| 2013 Nexus 7 (Qualcomm Adreno 320) | GL_LUMINANCE | 0.732 | 0.976 | 1.190 | 3.907 | 22.249 | 2.383 | 1.879 |
| | GL_RGBA | 0.610 | 0.824 | 0.977 | 3.510 | 13.368 | 2.163 | 1.695 |
| | GL_LUMINANCE use1x1 | 0.030 | 0.061 | 0.061 | 0.091 | 3.143 | 0.080 | 0.187 |
| | GL_RGBA use1x1 | 0.030 | 0.061 | 0.091 | 0.092 | 4.303 | 0.104 | 0.248 |
| | GL_LUMINANCE nodraw | 0.793 | 1.098 | 3.570 | 4.425 | 25.760 | 3.001 | 2.076 |
| | GL_RGBA nodraw | 0.732 | 0.916 | 1.038 | 3.937 | 26.370 | 2.416 | 2.190 |
| | GL_LUMINANCE use1x1 nodraw | 0.030 | 0.061 | 0.091 | 0.092 | 4.181 | 0.090 | 0.204 |
| | GL_RGBA use1x1 nodraw | 0.030 | 0.061 | 0.091 | 0.122 | 4.272 | 0.114 | 0.292 |
| 2012 Nexus 10 (ARM Mali T604) | GL_LUMINANCE | 1.292 | 2.782 | 3.590 | 4.439 | 16.893 | 3.656 | 1.256 |
| | GL_RGBA | 1.451 | 2.782 | 3.432 | 4.358 | 8.517 | 3.551 | 0.982 |
| | GL_LUMINANCE use1x1 | 0.193 | 0.284 | 0.369 | 0.670 | 17.598 | 0.862 | 2.230 |
| | GL_RGBA use1x1 | 0.100 | 0.147 | 0.199 | 0.313 | 20.896 | 0.656 | 2.349 |
| | GL_LUMINANCE nodraw | 1.314 | 2.179 | 2.320 | 2.823 | 10.677 | 2.548 | 0.700 |
| | GL_RGBA nodraw | 1.209 | 2.101 | 2.196 | 2.539 | 5.008 | 2.414 | 0.553 |
| | GL_LUMINANCE use1x1 nodraw | 0.190 | 0.294 | 0.365 | 0.601 | 2.113 | 0.456 | 0.228 |
| | GL_RGBA use1x1 nodraw | 0.094 | 0.119 | 0.162 | 0.288 | 2.771 | 0.217 | 0.162 |
Yes, a bunch of raw data and no fancy graphs – not this time. Here are my observations:
- GL_RGBA textures are indeed a good deal faster than luminance ones on most devices. However, the ratio is not big enough to make CPU-side color space conversion to RGB (or even just interleaving the planes into an RGBA layout on the CPU side) a win, so there’s not much to do about it.
- Variability between devices is huge. Hooray for fragmentation.
- Newer devices tend to have fairly reasonable texture upload times, but there’s still lots of variation.
- Holy crap does the Droid X show badly in this test – it has both really slow upload times and horrible texture ghosting costs, and that despite us already alternating between a pair of texture sets! I hope that’s a problem that’s been fixed in the meantime, but since I don’t have any newer PowerVR devices here to test with, I can’t be sure.
So, to summarize it in one word: Ugh.
I originally intended to write something different for part 2 of this series, but since I’ve started rewriting that article no less than 3 times at this point, I’m just gonna switch the order of topics around a bit.
Transpose from even/odd interleaves
I already showed one instance of this last time (“variant 2”) for the 4×4 case, but let’s go at it a bit more systematically. Since we’ve already beaten this size to death, let’s spice things up a bit and do 8×8 this time:
a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7 c0 c1 c2 c3 c4 c5 c6 c7 d0 d1 d2 d3 d4 d5 d6 d7 e0 e1 e2 e3 e4 e5 e6 e7 f0 f1 f2 f3 f4 f5 f6 f7 g0 g1 g2 g3 g4 g5 g6 g7 h0 h1 h2 h3 h4 h5 h6 h7
“Variant 2” from last time was the version where we started by interleaving rows not with their immediate neighbors, but with the rows that were two steps away. The key here is that we have to do multiple interleave steps to complete the transpose, and with every even-odd interleave, we space the original elements of a vector further apart. So if we interleaved rows A and B in the first step, we would have a0 b0 after the first step, but a0 xx b0 xx as soon as we interleaved that with something else. That’s not what we want. So instead, we start by interleaving rows A and E – after a total of 3 passes, that should put e0 where it’s supposed to go, 4 elements from a0.
The same argument goes for the other rows, too—so let’s just do an even-odd interleave between the entire first half and the last half of our rows:
a0 e0 a1 e1 a2 e2 a3 e3 a4 e4 a5 e5 a6 e6 a7 e7 b0 f0 b1 f1 b2 f2 b3 f3 b4 f4 b5 f5 b6 f6 b7 f7 c0 g0 c1 g1 c2 g2 c3 g3 c4 g4 c5 g5 c6 g6 c7 g7 d0 h0 d1 h1 d2 h2 d3 h3 d4 h4 d5 h5 d6 h6 d7 h7
For the next step, we want to interleave the row containing a0 with the row containing the element that needs to end up 2 places away from a0 in the final result – namely, c0, and similarly for the other rows. Which again boils down to doing an even-odd interleave between the entire first and second halves:
a0 c0 e0 g0 a1 c1 e1 g1 a2 c2 e2 g2 a3 c3 e3 g3 a4 c4 e4 g4 a5 c5 e5 g5 a6 c6 e6 g6 a7 c7 e7 g7 b0 d0 f0 h0 b1 d1 f1 h1 b2 d2 f2 h2 b3 d3 f3 h3 b4 d4 f4 h4 b5 d5 f5 h5 b6 d6 f6 h6 b7 d7 f7 h7
At this point we’re just one turn of the crank away from the result we want, so let’s go for it and do one more round of interleaves…
a0 b0 c0 d0 e0 f0 g0 h0 a1 b1 c1 d1 e1 f1 g1 h1 a2 b2 c2 d2 e2 f2 g2 h2 a3 b3 c3 d3 e3 f3 g3 h3 a4 b4 c4 d4 e4 f4 g4 h4 a5 b5 c5 d5 e5 f5 g5 h5 a6 b6 c6 d6 e6 f6 g6 h6 a7 b7 c7 d7 e7 f7 g7 h7
and that’s it, 8×8 matrix successfully transposed.
This is nothing I haven’t shown you already, although in a different order than before. This form here makes the underlying algorithm much clearer, and also generalizes in the obvious way to larger sizes, should you ever need them. But that’s not the reason I’m talking about this. Time to get to the fun part!
Rotations
Let’s look a bit closer at how the elements move during every pass. For this purpose, let’s just treat all of the elements as a 64-element array. The first row contains the first 8 elements, the second row contains the second 8 elements, and so forth. a0 starts out in row zero and column zero, the first element of the array, and stays there for the entire time (boring!). b3 starts out in row 1, column 3 – that’s element number 11 (1×8 + 3). Now, the algorithm above simply applies the same permutation to the overall array 3 times. Let’s look at how the array elements move in one step—for reasons that will become clear in a second, I’ll give the array indices in binary:
| From | | To |
|---|---|---|
| 000 000 | → | 000 000 |
| 000 001 | → | 000 010 |
| 000 010 | → | 000 100 |
| 000 011 | → | 000 110 |
| ⋮ | ||
| 001 000 | → | 010 000 |
| 001 001 | → | 010 010 |
| 001 010 | → | 010 100 |
| ⋮ | ||
| 100 000 | → | 000 001 |
| 100 001 | → | 000 011 |
| 100 010 | → | 000 101 |
| 100 011 | → | 000 111 |
| ⋮ | ||
| 101 010 | → | 010 101 |
| ⋮ |
Since we have 8 rows and 8 columns, the first 3 and last 3 bits in each index correspond to the row and column indices, respectively. Anyway, can you see the pattern? Even-odd interleaving the first half of the array with the second half in effect performs a bitwise rotate left on the element indices!
Among other things, this instantly explains why it takes exactly three passes to transpose an 8×8 matrix with this approach: the row and column indices take 3 bits each. So after 3 rotate-lefts, we’ve swapped the rows and columns—which is exactly what a matrix transpose does. Another salient point is that repeating even-odd interleaves like this will return us to our starting arrangement after 6 passes. This is easy to see once we know that such a step effectively rotates the bits of the element index; it’s not at all obvious when looking at the permutation by itself.
But it goes further than that. For one thing, the even/odd interleave construction really works for any even number of elements; it certainly works for all powers of two. So we’re not strictly limited to square matrices here. Say we have a 4×8 matrix (4 rows, 8 columns). That’s 32 elements total, or a 5-bit element index, laid out in binary like r1r0c2c1c0, where r are row index bits and c are column index bits. After two interleave passes, we’re at c2c1c0r1r0, corresponding to the transposed layout—an 8×4 matrix. To go from there to the original layout, we would have to run three more passes, rotating everything by another 3 bits, back to where we started.
Which illustrates another interesting point: for non-square matrices, going in one direction using this method can be much cheaper than going in the other direction. That’s because the even/odd interleave step only gives us a “rotate left” operation, and sometimes the shorter path would be to “rotate right” instead. On some architectures the corresponding deinterleave operation is available as well (e.g. ARM NEON), but often it’s not, at least not directly.
Groups
Let’s step back for a moment, though. We have an operation: even/odd interleave between the first and second halves of sequences with 2^k elements, which is really just a particular permutation. And now we know that we can get the inverse operation via repeated interleaving. Which means that our even-odd interleave generates a group. Now, as long as we really only do even-odd interleaves on the complete sequence, this group is a cyclic group of order k—it has to be: it’s a finite group generated by a single element, ergo a cyclic group, and we already know how long the cycle is, based on the “rotate left” property.
So to make matters a bit more interesting, let’s get back to the original topic of this article! Namely, the SIMD bit. While it’s convenient to build a complete matrix transpose out of a single operation on all elements simultaneously, namely a very wide even/odd interleave, that’s not how the actual code looks. We have fixed-width registers, and to synthesize anything wider than that, we have to break the data down into smaller chunks and process them individually anyway. However, we do need to have full even/odd interleaves to get a permutation group structure, so we can’t allow using single “low” or “high” interleave instructions without their opposite half.
What kind of model do we end up with? Let’s list the ingredients. We have:
- A set of n SIMD registers, each k “elements” wide. We’ll assume that k is a power of two. What an “element” is depends on context; a 128-bit SIMD register might be viewed as consisting of 16 byte-sized elements, or 8 16-bit elements, or 4 32-bit elements… you get the idea.
- An even-odd interleave (perfect shuffle) operation between a pair of SIMD registers. For convenience, we’ll assume that the results are returned in the same 2 registers. Implementing this might require another temporary register depending on the architecture, but let’s ignore that for the purposes of this article; we only care about the state of the registers before and after interleaves, not during them.
- Finally, we assume that registers are completely interchangeable, and that we can “rename” them at will; that is, we’ll consider all permutations that can be performed by just renumbering the registers to be equivalent.
To explain the latter, we would consider an arrangement like this:
r0 = a0 b0 c0 d0 r1 = a1 b1 c1 d1
(where r0 and r1 correspond to SIMD registers) to be equivalent to:
r0 = a1 b1 c1 d1 r1 = a0 b0 c0 d0
or even
r3 = a0 b0 c0 d0 r5 = a1 b1 c1 d1
To rephrase it, we don’t care about differences in “register allocation”: as long as we get all the individual rows we need, any order will do.
What permutations can be generated using interleaves?
This is a question I’ve been wondering about ever since I first saw the original MMX instructions, but I didn’t really spend much time thinking about it until fairly recently, when I got curious. So I don’t have a full answer, but I do have some interesting partial results.
Let’s get the trivial case out of the way first: if k=1 (that is, each register contains exactly one element), then clearly we can reach any permutation we want, and without doing any work to boot—every register contains one value, and as explained above, our model treats register-level permutations as free. However, as soon as we have multiple elements inside a single register, things start to get interesting.
Permutations generated for n=2, k=2
We’re only permuting n×k = 4 elements here, so the groups in question are all small enough to enumerate their elements on a piece of paper, which makes this a good place to start. Also, with n=2, the “register permutation” side of things is really simple (we either swap the two registers or we don’t). For the even-odd interleave, we would normally have to specify which two registers to interleave—but since we only have two, we can just agree that we always want to interleave register 0 with register 1. Should we want the opposite order (interleave register 1 with register 0), we can simply swap the two registers beforehand. So our available operations boil down to just two permutations on 4 elements:
- Swap registers 0 and 1—this swaps the first two and the last two elements, which corresponds to the permutation (02)(13) in cycle notation. This is an involution: applying it twice swaps the elements back.
- Even/odd interleave between registers 0 and 1. This boils down to the permutation (12), which swaps the two middle elements and is also an involution.
These are the only operations we permit, so we have a finite group that is generated by involutions: it must be a dihedral group. In fact, it turns out to be Dih4, the symmetry group of a square, which is of order 8. So using 2 registers, we can reach only 8 of the 4! = 24 permutations of 4 elements. So what happens when we have more registers at our disposal?
Permutations generated for n>2, k=2
The next smallest case is n=3, k=2, which gives us permutations of 6 elements. 6! = 720, so this is still small enough to simply run a search, and that’s what I did. Or, to be more precise, I wrote a program that did the searching for me. It turns out that the even-odd interleave of the first two registers combined with arbitrary permutations on the registers (of which there are 3! = 6) is enough to reach all 720 permutations in S6, the symmetric group on 6 elements. Beyond this, I can’t say much about how this representation of the group works out; it would be nice if there was an easy way to find shortest paths between permutations for example (which would have uses for code generation), but if there is, I don’t know how. That said, my understanding of group theory is fairly basic; I’d really appreciate input from someone with more experience dealing with finite groups here.
I can tell you what happens for n>3, though: we already know we can produce all permutations using only 3 registers. And using the exact same steps, we can reach any permutation of the first 3 registers for n>3, leaving the other registers untouched. But that’s in fact enough to generate an arbitrary permutation of n elements, as follows: Say we have n=4, and we start with
r0 = e0 e1 r1 = e2 e3 r2 = e4 e5 r3 = e6 e7
and we want to end up with the (arbitrarily chosen) permutation
r0 = e3 e7 r1 = e0 e5 r2 = e1 e4 r3 = e6 e2
To begin, we try to generate the value we want to end up in r3 (namely, e6 e2). First, we swap rows around so that we have the source values we need in rows 0 and 1 (that is, registers r0 and r1). In our example, that just requires swapping r0 with r3:
r0 = e6 e7 r1 = e2 e3 r2 = e4 e5 r3 = e0 e1
Next, we know that we can reach arbitrary permutations of the first three rows (registers). In particular, we can shuffle things around so that r0 contains the value we’d like to be in r3:
r0 = e6 e2 r1 = e7 e3 r2 = e4 e5 r3 = e0 e1
This is one way of doing it, but really any permutation that has e6 e2 in the first row would work. Anyway, now that we have produced the value we wanted, we can swap it back into r3:
r0 = e0 e1 r1 = e7 e3 r2 = e4 e5 r3 = e6 e2
We now have the value we want in r3, and all the remaining unused source elements remain in rows 0–2. And again, since we know we can achieve arbitrary permutations of 6 elements using 3 registers using the n=3 case, we’re done! For n>4, the proof works in the same way: we first generate rows 3, 4, …, n-1 one by one; once a row is done, we never touch it again (it can’t contain any source elements we need for the remaining rows, since we only allow permutations). In the end we will always arrive at a configuration that has all the remaining “unfinished” elements in rows 0–2, which is a configuration we can solve.
I don’t mean to suggest that this is an efficient way to solve this problem; quite the opposite, in fact. But it’s an easy way to prove that once we have n=3 solved, higher n’s don’t add any substantial difficulty.
Summary
This covers the easiest cases, k=1 and k=2, and answers the question I originally wondered about in the positive: using only interleaves, you can produce arbitrary permutations of the input elements in registers, as long as you only have k=2 elements per register (for example, 32-bit values inside the 64-bit MMX registers) and at least 3 SIMD registers worth of storage. Without any nice bounds on the number of operations required or an algorithmic way to compute an optimal interleave/rename sequence, I’m the first one to admit that this has little practical relevance, but it’s cool stuff nonetheless! Coming up, I’ll talk a bit about the (more interesting) k=4 case, but I think this is enough material for a single blog post. Until next time!
This one tends to show up fairly frequently in SIMD code: Matrix transposes of one sort or another. The canonical application is transforming data from AoS (array of structures) to the more SIMD-friendly SoA (structure of arrays) format: For concreteness, say we have 4 float vertex positions in 4-wide SIMD registers
p0 = { x0, y0, z0, w0 }
p1 = { x1, y1, z1, w1 }
p2 = { x2, y2, z2, w2 }
p3 = { x3, y3, z3, w3 }
and would really like them in transposed order instead:
X = { x0, x1, x2, x3 }
Y = { y0, y1, y2, y3 }
Z = { z0, z1, z2, z3 }
W = { w0, w1, w2, w3 }
Note that here and in the following, I’m writing SIMD 4-vectors as arrays of 4 elements – none of this nonsense that some authors tend to do where they write vectors as “w, z, y, x” on Little Endian platforms. Endianness is a concept that makes sense for numbers and no sense at all for arrays, which is what SIMD vectors are, but that’s a rant for another day, so just be advised that I’m always writing things in the order that they’re stored in memory.
Anyway, transposing vectors like this is one application, and the one I’m gonna stick with for the moment because it “only” requires 4×4 values, which are the smallest “interesting” size in a certain sense. Keep in mind there are other applications though. For example, when implementing 2D separable filters, the “vertical” direction (filtering between rows) is usually easy, whereas “horizontal” (filtering between columns within the same register) is trickier – to the point that it’s often faster to transpose, perform a vertical filter, and then transpose back. Anyway, let’s not worry about applications right now, just trust me that it tends to come up more frequently than you might expect. So how do we do this?
One way to do it
The method I see most often is to try and group increasingly larger parts of the result together. For example, we’d like to get “x0” and “x1” adjacent to each other, and the same with “x2” and “x3”, “y0” and “y1” and so forth. The canonical way to do this is using the “unpack” (x86), “merge” (PowerPC) or “zip” (ARM NEON) intrinsics. So to bring “x0” and “x1” together in the right order, we would do:
a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
where interleave32_lft (“interleave 32-bit words, left half”) corresponds to UNPCKLPS (x86, floats), PUNPCKLDQ (x86, ints), or vmrghw (PowerPC). And to be symmetric, we do the same thing with the other half, giving us:
a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
a1 = interleave32_rgt(p0, p1) = { z0, z1, w0, w1 }
where interleave32_rgt corresponds to UNPCKHPS (x86, floats), PUNPCKHDQ (x86, ints), or vmrglw (PowerPC). The reason I haven’t mentioned the individual opcodes for NEON is that its “zips” always work on pairs of registers and handle both the “left” and “right” halves at once, forming a combined
(a0, a1) = interleave32(p0, p1)
operation (VZIP.32) that also happens to be a good way to think about the whole operation on other architectures – even though it’s not the ideal way to actually perform transposes on NEON, but I’m getting ahead of myself here. Anyway, again by symmetry we then do the same process with the other two rows, yielding:
// (a0, a1) = interleave32(p0, p1)
// (a2, a3) = interleave32(p2, p3)
a0 = interleave32_lft(p0, p1) = { x0, x1, y0, y1 }
a1 = interleave32_rgt(p0, p1) = { z0, z1, w0, w1 }
a2 = interleave32_lft(p2, p3) = { x2, x3, y2, y3 }
a3 = interleave32_rgt(p2, p3) = { z2, z3, w2, w3 }
And presto, we now have all even-odd pairs nicely lined up. Now we can build X by combining the left halves from a0 and a2. Their respective right halves also combine into Y. So we do a similar process as before, only this time we’re working on groups that are pairs of 32-bit values – in other words, we’re really dealing with 64-bit groups:
// (X, Y) = interleave64(a0, a2)
// (Z, W) = interleave64(a1, a3)
X = interleave64_lft(a0, a2) = { x0, x1, x2, x3 }
Y = interleave64_rgt(a0, a2) = { y0, y1, y2, y3 }
Z = interleave64_lft(a1, a3) = { z0, z1, z2, z3 }
W = interleave64_rgt(a1, a3) = { w0, w1, w2, w3 }
This time, interleave64_lft (interleave64_rgt) correspond to MOVLHPS (MOVHLPS) for floats on x86, PUNPCKLQDQ (PUNPCKHQDQ) for ints on x86, or VSWP of d registers on ARM NEON. PowerPCs have no dedicated instruction for this but can synthesize it using vperm. The variety here is why I use my own naming scheme in this article, by the way.
Anyway, that’s one way to do it with interleaves. There’s more than one, however!
Interleaves, variant 2
What if, instead of interleaving p0 with p1, we pair it with p2 instead? By process of elimination, that means we have to pair p1 with p3. Where does that lead us? Let’s find out!
// (b0, b2) = interleave32(p0, p2)
// (b1, b3) = interleave32(p1, p3)
b0 = interleave32_lft(p0, p2) = { x0, x2, y0, y2 }
b1 = interleave32_lft(p1, p3) = { x1, x3, y1, y3 }
b2 = interleave32_rgt(p0, p2) = { z0, z2, w0, w2 }
b3 = interleave32_rgt(p1, p3) = { z1, z3, w1, w3 }
Can you see it? We have four nice little squares in each of the quadrants now, and are in fact just one more set of interleaves away from our desired result:
// (X, Y) = interleave32(b0, b1)
// (Z, W) = interleave32(b2, b3)
X = interleave32_lft(b0, b1) = { x0, x1, x2, x3 }
Y = interleave32_rgt(b0, b1) = { y0, y1, y2, y3 }
Z = interleave32_lft(b2, b3) = { z0, z1, z2, z3 }
W = interleave32_rgt(b2, b3) = { w0, w1, w2, w3 }
This one uses just one type of interleave instruction, which is preferable if the 64-bit interleaves don’t exist on your target platform (PowerPC) or would require loading a different permutation vector (SPUs, which have to do the whole thing using shufb).
Okay, both of these methods start with a 32-bit interleave. What if we were to start with a 64-bit interleave instead?
It gets a bit weird
Well, let’s just plunge ahead and start by 64-bit interleaving p0 and p1, then see whether it leads anywhere.
// (c0, c1) = interleave64(p0, p1)
// (c2, c3) = interleave64(p2, p3)
c0 = interleave64_lft(p0, p1) = { x0, y0, x1, y1 }
c1 = interleave64_rgt(p0, p1) = { z0, w0, z1, w1 }
c2 = interleave64_lft(p2, p3) = { x2, y2, x3, y3 }
c3 = interleave64_rgt(p2, p3) = { z2, w2, z3, w3 }
Okay. For this one, we can’t continue with our regular interleaves, but we still have the property that each of our target vectors (X, Y, Z, and W) can be built using elements from only two of the c’s. In fact, the low half of each target vector comes from one c and the high half from another, which means that on x86, we can combine the two using SHUFPS. On PPC, there’s always vperm, SPUs have shufb, and NEON has VTBL, all of which are much more general, so again, it can be done there as well:
// 4 SHUFPS on x86
X = { c0[0], c0[2], c2[0], c2[2] } = { x0, x1, x2, x3 }
Y = { c0[1], c0[3], c2[1], c2[3] } = { y0, y1, y2, y3 }
Z = { c1[0], c1[2], c3[0], c3[2] } = { z0, z1, z2, z3 }
W = { c1[1], c1[3], c3[1], c3[3] } = { w0, w1, w2, w3 }
As said, this one is a bit weird, but it’s the method used for _MM_TRANSPOSE4_PS in Microsoft’s version of Intel’s xmmintrin.h (SSE intrinsics header) to this day, and it used to be the standard implementation in GCC’s version as well until it got replaced with the first method I discussed.
Anyway, that was starting by 64-bit interleaving p0 and p1. Can we pair p0 with p2 instead, like we did in the 32-bit case?
The plot thickens
Again, let’s just try it!
// (c0, c2) = interleave64(p0, p2)
// (c1, c3) = interleave64(p1, p3)
c0 = interleave64_lft(p0, p2) = { x0, y0, x2, y2 }
c1 = interleave64_lft(p1, p3) = { x1, y1, x3, y3 }
c2 = interleave64_rgt(p0, p2) = { z0, w0, z2, w2 }
c3 = interleave64_rgt(p1, p3) = { z1, w1, z3, w3 }
Huh. This one leaves the top left and bottom right 2×2 blocks alone and swaps the other two. But we still got closer to our goal – if we swap the top right and bottom left element in each of the four 2×2 blocks, we have a full transpose as well. And NEON happens to have an instruction for that (VTRN.32). As usual, the other platforms can try to emulate this using more general shuffles:
// 2 VTRN.32 on NEON:
// (X, Y) = vtrn.32(c0, c1)
// (Z, W) = vtrn.32(c2, c3)
X = { c0[0], c1[0], c0[2], c1[2] } = { x0, x1, x2, x3 }
Y = { c0[1], c1[1], c0[3], c1[3] } = { y0, y1, y2, y3 }
Z = { c2[0], c3[0], c2[2], c3[2] } = { z0, z1, z2, z3 }
W = { c2[1], c3[1], c2[3], c3[3] } = { w0, w1, w2, w3 }
Just like NEON’s “zip” instructions, VTRN both reads and writes two registers, so it is in essence doing the work of two instructions on the other architectures. Which means that we now have 4 different methods to do the same thing that are essentially the same cost in terms of computational complexity. Sure, some methods end up faster than others on different architectures due to various implementation choices, but really, in essence none of these are fundamentally more difficult (or easier) than the others.
Nor are these the only ones – for the last variant, we started by swapping the 2×2 blocks within the 4×4 matrix and then transposing the individual 2×2 blocks, but doing it the other way round works just as well (and is again the same cost). In fact, this generalizes to arbitrary power-of-two sized square matrices – you can just partition it into differently sized block transposes which can run in any order. This even works with rectangular matrices, with some restrictions. (A standard way to perform “chunky to planar” conversion for old bit plane-based graphics architectures uses this general approach to good effect).
And now?
Okay, so far, we have a menagerie of different matrix transpose techniques, all of which essentially have the same complexity. If you’re interested in SIMD coding, I suppose you can just use this as a reference. However, that’s not the actual reason I’m writing this; the real reason is that the whole “why are these things all essentially the same complexity” thing intrigued me, so a while back I looked into this and found out a whole bunch of cool properties that are probably not useful for coding at all, but which I nevertheless found interesting. In other words, I’ll write a few more posts on this topic, which I will spend gleefully nerding out with no particular goal whatsoever. If you don’t care, just stop reading now. You’re welcome!
This post is outside the usual pure technical focus of this blog, but the subject matter is important to me, and I wanted to write down my thoughts on it while they were still fresh.
This is a tweet by Anita Sarkeesian, and a disturbingly large number of misogynistic (and outright hateful) replies she got in response. This would be a shameful display in any context, but it is especially bad considering that her “objectionable” statement is both a factual one and completely right. There were no games with female protagonists shown at the Xbox One E3 reveal.
So why the hate? Presumably because of things like her “Tropes vs. Women in Video Games” series (which is excellent, and which you should watch immediately if you haven’t done so already. Go ahead. This post will still be there when you come back). As a player of video games, it’s hard not to notice how obviously, painfully right the points she makes are. I think Anita is doing the game developer community a great service by pointing out how badly we’re currently screwing up in this regard; not only can we do better, we should be collectively ashamed that we don’t.
However, that is not the point I’m trying to make here. Instead, I want to address a rebuttal I hear frequently in cases like this: “Relax, you’re taking it far too seriously, it’s just a game!”, or alternatively, “Games don’t affect people like that”. (This statement also makes frequent appearance in discussions on the role of violence in games).
When coming from game developers, I find this statement extremely baffling. And here’s why: most games referenced in this context are interactive experiences that a player will spend at least a few hours with. Considering how the human brain works, it is exceedingly unlikely that something, anything a player spends that amount of time on will leave no trace whatsoever in her brain. That does not mean that violent games make people violent, or that rampant misogyny in video games makes players hate women. That kind of thinking is far too simplistic and reductionist. But still: players will spend multiple hours of their life in a virtual world of your creation. That experience will, in some way or other, influence their thinking. Independent of whether it’s actually “harmful” to submerge players in a world of casual misogyny and ultraviolence – is there any point to it? Or is it just something we do out of laziness and without consciously thinking about it? And if so, what does that say about us?
But let’s take a step back. Say you really, truly believe that games leave no lasting impression of any kind on the player. That whether they play it or not makes no difference at all, for better or for worse, to their lives – they might as well have been sleeping. Because if you really think that is the case – then why are you spending years of your life working on the damn things?
See also: Brian Moriarty’s lecture “Entrain”.
Sometimes, you need to turn a linearly independent set of vectors into an orthonormal basis – or, equivalently, take a matrix that is “close” to orthogonal (for example, an orthogonal matrix that has been updated multiple times and might have started to drift due to round-off error) and make it properly orthogonal again.
The standard way to solve this problem is called Gram-Schmidt orthogonalization. The idea is pretty simple. Say we have a list of (linearly independent) input vectors v1, …, vn. For the first vector in our output orthogonal basis, we just normalize the first input vector:
For the second vector u2 to be orthogonal to the first, we need to remove the component of v2 parallel to u1, which is a simple projection:
We now have two orthonormal vectors; for the third vector, we now need to remove the components that are parallel to either of them:
and so forth. You get the idea. This is the “classical” Gram-Schmidt process, or “CGS”. It’s simple and easy to derive, and works just fine in exact arithmetic. However, when performed using floating-point arithmetic, it is numerically unstable – badly so. Let me give an example: consider the matrix
where ε is any value small enough so that (1 + ε²) rounds to 1 in the given floating-point type. I’m using single-precision IEEE floating point and ε = 10⁻⁴ for this example. Let’s run this through the classic Gram-Schmidt process:
static void classic_gram_schmidt(Mat33f &out, const Mat33f &in)
{
    out.col[0] = normalize(in.col[0]);
    out.col[1] = normalize(in.col[1] -
                 dot(in.col[1], out.col[0])*out.col[0]);
    out.col[2] = normalize(in.col[2] -
                 dot(in.col[2], out.col[0])*out.col[0] -
                 dot(in.col[2], out.col[1])*out.col[1]);
}
which produces this result (rounded to 4 decimal digits):
Ouch. The first column not being “perfectly” normalized is expected – after all, we explicitly chose ε so that (1 + ε²) rounded to 1 – but the third column is not at all orthogonal to the second column; in fact, there’s a perfect 45 degree angle between the two. For an orthogonalization algorithm, that’s a pretty serious failure.
It turns out that there’s a really simple fix though: “modified” Gram-Schmidt. Instead of computing all the dot products from the original vectors, perform the projections one by one, using the result of the previous projection as the input to the next. In exact arithmetic, this is equivalent, but using floating-point arithmetic, this version:
static void modified_gram_schmidt(Mat33f &out, const Mat33f &in)
{
    out.col[0] = normalize(in.col[0]);
    out.col[1] = normalize(in.col[1] -
                 dot(in.col[1], out.col[0])*out.col[0]);
    out.col[2] = in.col[2] -
                 dot(in.col[2], out.col[0])*out.col[0];
    // note the second dot product is computed from the partial
    // result out.col[2], not the input vector in.col[2]!
    out.col[2] -= dot(out.col[2], out.col[1])*out.col[1];
    out.col[2] = normalize(out.col[2]);
}
produces (again rounded to 4 decimal digits):
Much better. Now, by itself, better results on a single matrix don’t tell us anything, but it turns out that the MGS algorithm comes with a bounded error guarantee: the orthogonalized matrix Q satisfies the inequality ‖I − QᵀQ‖ ≤ c₁κu / (1 − c₂κu)
where c1 and c2 are constants, u is the machine precision (the “epsilon” for the given floating point type), and κ = κ(A) is the condition number of the input matrix. And in fact, orthogonalizing a matrix using MGS is numerically equivalent to performing a Householder QR decomposition (a known stable algorithm) on the matrix A augmented with a few extra zero rows at the top – which also means that, in addition to the above error bound on the orthogonality of Q, MGS is also backwards stable with a nice error bound. (Both claims are proven here).
Long story short: this is a nice example of a numerical algorithm where two approaches identical in exact arithmetic yield dramatically different results when computed using floating-point. And it comes with an action item: if you have code that orthogonalizes a matrix (or orthonormalizes a tuple of basis vectors) using a Gram-Schmidt-like method, you should check whether it corresponds to the classical or modified GS algorithm. The modified algorithm has roughly the same cost (albeit with a different dependency structure that is slightly less amenable to vectorization) and is numerically much superior. Even for something as small as 3×3 matrices, as the example above shows. And if you want to play around with it, feel free to check out the code I used for the numerical experiments.
UPDATE: And before I get angry comments: in 3D, the cheapest way to compute the third basis vector is to not look at the third source vector at all, and instead simply use the cross product of the first two (assuming they’re normalized). This is cheaper, stable, and guarantees that the result will be a right-handed orthonormal basis. It does not, however, generalize to higher dimensions, so knowing about MGS is still useful.
There’s tons of useful trig identities. You could spend the time to learn them by heart, or just look them up on Wikipedia when necessary. But I’ve always had problems remembering where the signs and such go when trying to memorize this directly. At least for me, what worked way better is this: spend a few hours familiarizing yourself with complex numbers if you haven’t done so already; after that, most identities that you need in practice are easy to derive from Euler’s formula:
Let’s do the basic addition formulas first. Euler’s formula gives:
and once we apply the identity again we get:
multiplying out:
The terms in parentheses are all real numbers; equating them with our original expression yields the result
Both addition formulas for the price of one. (In fact, this exploits that the addition formulas for trigonometric functions and the addition formula for exponents are really the same thing). The main point being that if you know complex multiplication, you never have to remember what the grouping of factors and the signs are, something I used to have trouble remembering.
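For reference, here's the derivation written out in full (a standard Euler's-formula manipulation, matching the steps described above):

```latex
\begin{aligned}
\cos(x+y) + i\sin(x+y) &= e^{i(x+y)} = e^{ix}\,e^{iy} \\
  &= (\cos x + i\sin x)(\cos y + i\sin y) \\
  &= (\cos x \cos y - \sin x \sin y) + i\,(\sin x \cos y + \cos x \sin y)
\end{aligned}
```

Equating real and imaginary parts gives both formulas at once:

```latex
\begin{aligned}
\cos(x+y) &= \cos x \cos y - \sin x \sin y \\
\sin(x+y) &= \sin x \cos y + \cos x \sin y
\end{aligned}
```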
Plugging in x=y into the above also immediately gives the double-angle formulas:
so if you know the addition formulas there’s really no reason to learn these separately.
Then there’s the well-known
but it’s really just the Pythagorean theorem in disguise (since cos(x) and sin(x) are the side lengths of a right-angled triangle). So not really a new formula either!
Moving either the cosine or sine terms to the right-hand side gives the two immensely useful equations:
In particular, that second one is perfect if you need the sine squared of an angle that you only have the cosine of (usually because you’ve determined it using a dot product). Judicious application of these two tends to be a great way to simplify superfluous math in shaders (and elsewhere), one of my pet peeves.
For practice, let’s apply these two identities to the cosine double-angle formula:
why, it’s the half-angle formulas! Fancy meeting you here!
Can we do something with the sine double-angle formula too? Well, it’s not too fancy, but we can get this:
Now, let’s go back to the original addition formulas and let’s see what happens when we plug in negative values for y. Using cos(−y) = cos(y) and sin(−y) = −sin(y), we get:
Hey look, flipped signs! This means that we can now add these to (or subtract them from) the original formulas to get even more identities!
It’s the product-to-sum identities this time. I got one more! We’ve deliberately flipped signs and then added/subtracted the addition formulas to get the above set. What if we do the same trick in reverse to get rid of those x+y and x-y terms? Let’s set x = (a+b)/2 and y = (a−b)/2, so that x+y = a and x−y = b, and plug that into the identities above and we get:
Ta-dah, it’s the sum-to-product identities. Now, admittedly, we’ve taken quite a few steps to get here, and looking these up when you need them is going to be faster than walking through the derivation (if you ever need them in the first place – I don’t think I’ve ever used the product/sum identities in practice). But still, working these out is a good exercise, and a lot less likely to go wrong (at least for me) than memorizing lots of similar formulas. (I never can get the signs right that way)
Bonus exercise: work out general expressions for and
. Hint:
.
And I think that’s enough for now. (At some later point, I might do an extra post about one of the sneakier trig techniques: the Weierstrass substitution).
One interesting thing about x86 is that it’s changed two major architectural “magic values” in the past 10 years. The first is the addition of 64-bit mode, which not only widens all general-purpose registers and gives a much larger virtual address space, it also increases the number of general-purpose and XMM registers from 8 to 16. The second is AVX, which allows all SSE (and other SIMD) instructions to be encoded using non-destructive 3-operand forms instead of the original 2-operand forms.
Since modern x86 processors are trying really hard to run both 32- and 64-bit code well (and same for SSE vs. AVX), this gives us an opportunity to compare the relative performance of these choices in a reasonably level playing field, when running the same (C++) code. Of course, this is nowhere near a perfect comparison, especially since switching from 32 to 64 bits also changes the sizes of pointers and (at the very least) the code generator used by the compiler, but it’s still interesting to be able to do the experiment on a single machine with no fuss. So, without further ado, here’s a quick comparison using the Software Occlusion Culling demo I’ve been writing about for the past month – a fairly SIMD-heavy workload.
| Version | Occlusion cull | Render scene |
|---|---|---|
| x86 (baseline) | 2.88ms | 1.39ms |
| x86, /arch:SSE2 | 2.88ms (+0.2%) | 1.48ms (+5.8%) |
| x86, /arch:AVX | 2.77ms (-3.8%) | 1.43ms (+2.7%) |
| x64 | 2.71ms (-5.7%) | 1.29ms (-7.2%) |
| x64, /arch:AVX | 2.63ms (-8.7%) | 1.28ms (-8.5%) |
Note that /arch:AVX makes VC++ use AVX forms of SSE vector instructions (i.e. 3-operand), but it’s all still 4-wide SIMD, not the new 8-wide SIMD floating point. Getting that would require changes to the code. And of course the code uses SSE2 (and, in fact, even SSE4.1) instructions whether we turn on /arch:SSE2 on x86 or not – this only affects how “regular” floating-point code is generated. Also, the speedup percentages are computed from the full-precision values, not the truncated values I put in the table. (Which doesn’t mean much, since I truncated the values to about their level of accuracy)
So what does this tell us? Hard to be sure. It’s very few data points and I haven’t done any work to eliminate the effect of e.g. memory layout / code placement, which can be very much significant. And of course I’ve also changed the compiler. That said, a few observations:
- Not much of a win turning on /arch:SSE2 on the regular x86 code. If anything, the rendering part of the code gets worse from the “enhanced instruction set” usage. I did not investigate further.
- The 3-operand AVX instructions provide a solid win of a few percentage points in both 32-bit and 64-bit mode. Considering I’m not using any 8-wide instructions, this is almost exclusively the impact of having fewer register-register move instructions.
- Yes, going to 64 bits does make a noticeable difference. Note in particular the dip in rendering time. Whether it’s due to the overhead of 32-bit thunks on a 64-bit system, better code generation on the app side, better code on the D3D runtime/driver side, or most likely a combination of all these factors, the D3D rendering code sure gets a lot faster. And similarly, the SIMD-heavy occlusion cull code sees a good speed-up too. I have not investigated whether this is primarily due to the extra registers, or due to code generation improvements.
I don’t think there’s any particular lesson here, but it’s definitely interesting.
I’ve been asked several times about this, so I wanted to make an “official” statement:
No, I will not prepare ePub/PDF (“book”) versions of posts, particularly the “A trip through the Graphics Pipeline 2011” and “Optimizing Software Occlusion Culling” series. However, should someone be willing to prepare such a thing, I’d be very happy to provide them with a WordPress extended RSS dump of the site contents (with your comments and all other emails / personal data removed, don’t worry) and host the results. If you’re interested in helping, please write a comment with a valid email address and I’ll get in touch with you.
To clarify the legal situation, I have put both these series into the public domain (using the CC-0 “license” waiver). This means you may do with these posts whatever you want. You may edit them, update them or add additional information; you may turn them into an eBook, PDF, or hardcopy book; you may use it as a starting point for a graphics pipeline Wiki, if you are so inclined – I don’t have the energy or web development chops to set that kind of thing up, but I’d be happy to contribute to it if it existed! You may also claim that you wrote them yourself, sell it to a publisher for a million bucks, and invest the proceeds in land mines you bury in a public park. I would rather that you not do these things, but it boils down to this: if you were to do it, would I want to make the whole affair even more unpleasant for myself than it would already be by engaging in complicated and expensive legal proceedings? And my answer to that question is a clear “no”.
In fact, my reasons for not preparing eBook versions and for releasing the texts in the public domain are basically the same: I enjoy writing these posts, and I enjoy seeing people read them. I do not enjoy wrestling with publication formats or blogging frameworks, and I certainly don’t enjoy dealing with legal issues. The reason I can manage to write a few thousand words of technical content a week despite having a full-time job is because I’ve structured the experience to be as enjoyable and low-friction for me as possible. Last year, I tried editing the “A trip through the Graphics Pipeline 2011” series into a book format, and progress was excruciatingly slow, because ultimately it was not a fun task for me; it felt like an unpaid part-time job, so at some point I just stopped.
So this is the deal: I’m a professional software developer that happens to like writing. But the writing is a pure “bonus”; I do it because I enjoy it, but only as long as it’s on my terms – I write what I feel like writing, on whatever schedule pleases me, and without any additional process beyond hitting “Publish” once I’m done. I’ll be happy to help anyone who wants to do more than that, but I’m not going to do it myself.
This post is part of a series – go here for the index.
Welcome back! Last time, I promised to end the series with a bit of reflection on the results. So, time to find out how far we’ve come!
The results
Without further ado, here’s the breakdown of per-frame work at the end of the respective posts (names abbreviated), in order:
| Post | Cull / setup | Render depth | Depth test | Render scene | Total |
|---|---|---|---|---|---|
| Initial | 1.988 | 3.410 | 2.091 | 5.567 | 13.056 |
| Write Combining | 1.946 | 3.407 | 2.058 | 3.497 | 10.908 |
| Sharing | 1.420 | 3.432 | 1.829 | 3.490 | 10.171 |
| Cache issues | 1.045 | 3.485 | 1.980 | 3.420 | 9.930 |
| Frustum culling | 0.735 | 3.424 | 1.812 | 3.495 | 9.466 |
| Depth buffers 1 | 0.740 | 3.061 | 1.791 | 3.434 | 9.026 |
| Depth buffers 2 | 0.739 | 2.755 | 1.484 | 3.578 | 8.556 |
| Workers 1 | 0.418 | 2.134 | 1.354 | 3.553 | 7.459 |
| Workers 2 | 0.197 | 2.217 | 1.191 | 3.463 | 7.068 |
| Dataflows | 0.180 | 2.224 | 0.831 | 3.589 | 6.824 |
| Speculation | 0.169 | 1.972 | 0.766 | 3.655 | 6.562 |
| Mopping up | 0.183 | 1.940 | 0.797 | 1.389 | 4.309 |
| Total diff. | -90.0% | -43.1% | -61.9% | -75.0% | -67.0% |
| Speedup | 10.86x | 1.76x | 2.62x | 4.01x | 3.03x |
What, you think that doesn’t tell you much? Okay, so did I. Have a graph instead:
The image is a link to the full-size version that you probably want to look at. Note that in both the table and the image, updating the depth test pass to use the rasterizer improvements is chalked up to “Depth buffers done quick, part 2”, not “The care and feeding of worker threads, part 1” where I mentioned it in the text.
From the graph, you should clearly see one very interesting fact: the two biggest individual improvements – the write combining fix at 2.1ms and “Mopping up” at 2.2ms – both affect the D3D rendering code, and don’t have anything to do with the software occlusion culling code. In fact, it wasn’t until “Depth buffers done quick” that we actually started working on that part of the code. Which makes you wonder…
What-if machine
Is the software occlusion culling actually worth it? That is, how much do we actually get for the CPU time we invest in occlusion culling? To help answer this, I ran a few more tests:
| Test | Cull / setup | Render depth | Depth test | Render scene | Total |
|---|---|---|---|---|---|
| Initial | 1.988 | 3.410 | 2.091 | 5.567 | 13.056 |
| Initial, no occ. | 1.433 | 0.000 | 0.000 | 25.184 | 26.617 |
| Cherry-pick | 1.548 | 3.462 | 1.977 | 2.084 | 9.071 |
| Cherry-pick, no occ. | 1.360 | 0.000 | 0.000 | 10.124 | 11.243 |
| Final | 0.183 | 1.940 | 0.797 | 1.389 | 4.309 |
| Final, no occ. | 0.138 | 0.000 | 0.000 | 6.866 | 7.004 |
Yes, the occlusion culling was a solid win both before and after. But the interesting value is the “cherry-pick” one. This is the original code, with only the following changes applied: (okay, and also with the timekeeping code added, in case you feel like nitpicking)
- Don’t read back from the constant buffers we’re writing. Total diff: 3 lines.
- Don’t update debug counters in CPUTFrustum. Total diff: 2 lines.
- Use only one dynamic constant buffer. Total diff: 10 lines changed, 8 added.
- Load materials only once. Total diff: 7 lines changed, 1 added.
- Share materials instead of cloning them. Total diff: 3 lines changed.
- AABBoxRasterizer traversal fix – keep list of models instead of going over whole database every time. Total diff: 15 lines added, 18 deleted.
In other words, “Cherry-pick” is within a few dozen lines of the original code, all of the changes are to “framework” code not the actual sample, and none of them do anything fancy. Yet it makes the difference between occlusion culling enabled and disabled shrink to about a 1.24x speedup, down from the 2x it was before!
A brief digression
This kind of thing is, in a nutshell, the reason why graphics papers really need to come with source code. Anything GPU-related in particular is full of performance cliffs like this. In this case, I had the source code, so I could investigate what was going on, fix a few problems, and get a much more realistic assessment of the gain to expect from this kind of technique. Had it just been a paper claiming a “2x improvement”, I would certainly not have been able to reproduce that result – note that in the “final” version, the speedup goes back to about 1.63x, but that’s with a considerable amount of extra work.
I mention this because it’s a very common problem: whatever technique the author of a paper is proposing is well-optimized and tweaked to look good, whereas the things it’s being compared with are often very sloppy implementations. The end result is lots of papers that claim “substantial gains” over the prior state of the art that somehow never materialize for anyone else. At one extreme, I’ve had one of my professors state outright at one point that he just stopped giving out source code to his algorithms because the effort invested in getting other people to successfully replicate his old results “distracted” him from producing new ones. (I’m not going to name names here, but he later said several other things along the same lines, and he’s probably the number one reason I decided against pursuing a career in academia.)
To that kind of attitude, I have only one thing to say: If you care only about producing results and not independent verification, then you may be brilliant, but you are not a scientist, and there’s a very good chance that your life’s work is useless to anyone but yourself.
Conversely, exposing your code to outside eyes might not be the optimal way to stroke your ego in case somebody finds an obvious mistake :), but it sure makes your approach a lot more likely to actually become relevant in practice. Anyway, let’s get back to the subject at hand.
Observations
The number one lesson from all of this probably is that there’s lots of ways to shoot yourself in the foot in graphics, and that it’s really easy to do so without even noticing it. So don’t assume, profile. I’ve used a fancy profiler with event-based sampling (VTune), but even a simple tool like Sleepy will tell you when a small piece of code takes a disproportionate amount of time. You just have to be on the lookout for these things.
Which brings me to the next point: you should always have an expectation of how long things should take. A common misconception is that profilers are primarily useful to identify the hot spots in an application, so you can focus your efforts there. Let’s have another look at the very first profiler screenshot I posted in this series:
If I had gone purely by what takes the largest amount of time, I’d have started with the depth buffer rasterization pass; as you should well recall, it took me several posts to explain what’s even going on in that code, and as you can see from the chart above, while we got a good win out of improving it (about 1.1ms total), doing so took lots of individual changes. Compare with what I actually worked on first – namely, the Write Combining issue, which gave us a 2.1ms win for a three-line change.
So what’s the secret? Don’t use a profile exclusively to look for hot spots. In particular, if your profile has the hot spots you expected (like the depth buffer rasterizer in this example), they’re not worth more than a quick check to see if there’s any obvious waste going on. What you really want to look for are anomalies: code that seems to be running into execution issues (like SetRenderStates with the read-back from write-combined memory running at over 9 cycles per instruction), or things that just shouldn’t take as much time as they seem to (like the frustum culling code we looked at for the next few posts). If used correctly, a profiler is a powerful tool not just for performance tuning, but also to find deeper underlying architectural issues.
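The read-back-from-write-combined-memory anomaly deserves a concrete illustration, since it’s such an easy trap. The sketch below is hypothetical (the struct and names are mine, not the actual CPUT code), but it shows the shape of both the bug and the fix: memory you’ve mapped for writing is typically write-combined, and every read from it stalls badly, so you assemble the data in a regular cached local and write the mapped buffer exactly once:

```cpp
#include <cstring>

// Hypothetical constant-buffer layout, standing in for the real structs.
struct FrameConstants {
    float viewProj[16];
    float lightDir[4];
};

// The fix, in spirit: "mappedDst" points into a mapped dynamic buffer
// (write-combined memory). Never read from it -- not even implicitly via
// read-modify-write ops like "mappedDst->x *= s". Instead, build a cached
// local copy and stream it out in one shot.
inline void updateConstants(FrameConstants* mappedDst,
                            const float viewProj[16],
                            const float lightDir[4]) {
    FrameConstants local;                                   // cached memory
    std::memcpy(local.viewProj, viewProj, sizeof(local.viewProj));
    std::memcpy(local.lightDir, lightDir, sizeof(local.lightDir));
    std::memcpy(mappedDst, &local, sizeof(local));          // write-only pass
}
```

The “bad” version looks almost identical in source code, which is exactly why it only shows up when you profile and notice instructions taking an order of magnitude longer than they should.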
While you’re at it…
Anyway, once you’ve picked a suitable target, I recommend that you do more than just the necessary work to knock it out of the top 10 (or some other arbitrary cut-off). After “Frustum culling: turning the crank“, a commenter asked why I would spend the extra time optimizing a function that was, at the time, only at the #10 spot in the profile. A perfectly valid question, and one I have three separate answers to:
First, the answer I gave in the comments at the time: code is not isolated from everything else; it exists in a context. A lot of the time in optimizing code (or even just reading it, for that matter) is spent building up a mental model of what’s going on and how it relates to the rest of the system. The best time to make changes to code is while that mental model is still current; if you drop the topic and work somewhere else for a bit, you’ll have to redo at least part of that work. So if you have ideas for further improvements while you’re working on code, that’s a good time to try them out (once you’ve finished your current task, anyway). If you run out of ideas, or if you notice you’re starting to micro-optimize where you really shouldn’t, then stop. But by all means continue while the going is good; even if you don’t need that code to be faster now, you might want it later.
Second, never mind the relative position. As you can see in the table above, the “advanced” frustum culling changes reduced the total frame time by about 0.4ms. That’s about as much as we got out of our first set of depth buffer rendering changes, even though it was much simpler work. Particularly for games, where you usually have a set frame rate target, you don’t particularly care where exactly you get the gains from; 0.3ms less is 0.3ms less, no matter whether it’s done by speeding up one of the Top 10 functions slightly or something else substantially!
Third, relating to my comment about looking for anomalies above: unless there’s a really stupid mistake somewhere, it’s fairly likely that the top 10, or top 20, or top whatever hot spots are actually code that does substantial work – certainly so for code that other people have already optimized. However, most people do tend to work on the hot spots first when looking to improve performance. My favorite sport when optimizing code is starting in the middle ranks: while everyone else is off banging their head against the hard problems, I will casually snipe at functions in the 0.05%-1.0% total run time range. This has two advantages: first, you can often get rid of a lot of these functions entirely. Even if it’s only 0.2% of your total time, if you manage to get rid of it, that’s 0.2% that are gone. It’s usually a lot easier to get rid of a 0.2% function than it is to squeeze an extra 2% out of a 10%-run time function that 10 people have already looked at. And second, the top hot spots are usually in leafy code. But down in the middle ranks is “middle management” – code that’s just passing data around, maybe with some minor reformatting. That’s your entry point to re-designing data flows: this is the code where subsystems meet – the place where restructuring will make a difference. When optimizing interfaces, it’s crucial to be working on the interfaces that actually have problems, and this is how you find them.
Ground we’ve covered
Throughout this series, my emphasis has been on changes that are fairly high-yield while causing as little disruption as possible. I also made no substantial algorithmic changes. That was fully intentional, but it might be surprising; after all, as any (good) text covering optimization will tell you, it’s much more important to get your algorithms right than it is to fine-tune your code. So why this bias?
Again, I did this for a reason: while algorithmic changes are indeed the ticket when you need large speed-ups, they’re also very context-sensitive. For example, instead of optimizing the frustum culling code the way I did – by making the code more SIMD- and cache-friendly – I could have just switched to a bounding volume hierarchy instead. And normally, I probably would have. But there’s plenty of material on bounding volume hierarchies out there, and I trust you to be able to find it yourself; by now, there’s also a good amount of Google-able material on “Data-oriented Design” (I dislike the term; much like “Object-oriented Design”, it means everything and nothing) and designing algorithms and data structures from scratch for good SIMD and cache efficiency.
But I found that there’s a distinct lack of material for the actual problem most of us actually face when optimizing: how do I make existing code faster without breaking it or rewriting it from scratch? So my point with this series is that there’s a lot you can accomplish purely using fairly local and incremental changes. And while the actual changes are specific to the code, the underlying ideas are very much universal, or at least I hope so. And I couldn’t resist throwing in some low-level architectural material too, which I hope will come in handy. :)
Changes I intentionally did not make
So finally, here’s a list of things I did not discuss in this series, because they were either too invasive, too tricky or changed the algorithms substantially:
- Changing the way the binner works. We don’t need that much information per triangle, and currently we gather vertices both in the binner and the rasterizer, which is a fairly expensive step. I did implement a variant that writes out signed 16-bit coordinates and the set-up Z plane equation; it saves roughly another 0.1ms in the final rasterizer, but it’s a fairly invasive change. Code is here for those who are interested. (I may end up posting other stuff to that branch later, hence the name).
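For a rough idea of what a compact binner record might look like, here’s a hedged sketch. The layout and the fixed-point format are my guesses for illustration, not the actual code from the branch; the gist is snapping screen coordinates to signed 16-bit fixed point and storing the already set-up Z plane equation, so the rasterizer never has to re-gather vertices:

```cpp
#include <cstdint>

// Hypothetical binned-triangle record: 16-bit fixed-point screen coords
// (here 12.4, i.e. 4 bits of subpixel precision) plus the set-up plane
// equation z = zPlane[0]*x + zPlane[1]*y + zPlane[2].
struct BinnedTri {
    int16_t x[3], y[3];
    float   zPlane[3];
};

// Round-to-nearest quantization to 12.4 fixed point, clamped to the
// representable int16_t range.
inline int16_t quantize(float v) {
    int i = (int)(v * 16.0f + (v >= 0.0f ? 0.5f : -0.5f));
    if (i >  32767) i =  32767;
    if (i < -32768) i = -32768;
    return (int16_t)i;
}
```

At 12 bytes of coordinates-plus-plane per triangle (versus three full transformed vertices), it’s easy to see where the bandwidth savings in the final rasterizer would come from.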
- A hierarchical rasterizer for the larger triangles. Another thing I implemented (note this branch is based off a pre-blog version of the codebase) but did not feel like writing about because it took a lot of effort to deliver, ultimately, fairly little gain.
- Other rasterizer techniques or tweaks. I could have implemented a scanline rasterizer, or a different traversal strategy, or a dozen other things. I chose not to; I wanted to write an introduction to edge-function rasterizers, since they’re cool, simple to understand and less well-known than they should be, and this series gave me a good excuse. I did not, however, want to spend more time on actual rasterizer optimization than the two posts I wrote; it’s easy to spend years of your life on that kind of stuff (I’ve seen it happen!), but there’s a point to be made that this series was already too long, and I did not want to stretch it even further.
- Directly rasterizing quads in the depth test rasterizer. The depth test rasterizer only handles boxes, which are built from 6 quads. It’s possible to build an edge function rasterizer that directly traverses quads instead of triangles. Again, I wrote the code (not on Github this time) but decided against writing about it; while the basic idea is fairly simple, it turned out to be really ugly to make it work in a “drop-in” fashion with the rest of the code. See this comment and my reply for a few extra details.
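For readers who haven’t met an edge-function rasterizer before, the core idea really does fit in a few lines. This is a bare scalar sketch (not the SIMD code from the series, and triangles rather than the quad variant discussed above — a convex quad simply tests four edge functions instead of three):

```cpp
// Edge function: twice the signed area of triangle (a, b, p).
// Positive when p lies to the left of the directed edge a->b.
inline int orient2d(int ax, int ay, int bx, int by, int px, int py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// A point is inside a counter-clockwise triangle iff all three edge
// functions are non-negative. A full rasterizer evaluates these over a
// bounding box, stepping them incrementally (they're linear in x and y).
inline bool insideTri(int x0, int y0, int x1, int y1, int x2, int y2,
                      int px, int py) {
    return orient2d(x0, y0, x1, y1, px, py) >= 0 &&
           orient2d(x1, y1, x2, y2, px, py) >= 0 &&
           orient2d(x2, y2, x0, y0, px, py) >= 0;
}
```

Because each edge function is linear, stepping one pixel right or down is a single add — which is also what makes the approach so SIMD-friendly.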
- Ray-trace the boxes in the test pass instead of rasterizing them. Another suggestion by Doug. It’s a cool idea and I think it has potential, but I didn’t try it.
- Render a lower-res depth buffer using very low-poly, conservative models. This is how I’d actually use this technique for a game; I think bothering with a full-size depth buffer is just a waste of memory bandwidth and processing time, and we do spend a fair amount of our total time just transforming vertices too. Nor is there a big advantage to using the more detailed models for culling. That said, changing this would have required dedicated art for the low-poly occluders (which I didn’t want to do); it also would’ve violated my “no-big-changes” rule for this series. Both these changes are definitely worth looking into if you want to ship this in a game.
- Try other occlusion culling techniques. Out of the (already considerably bloated) scope of this series.
And that’s it! I hope you had as much fun reading these posts as I did writing them. But for now, it’s back to your regularly scheduled, piecemeal blog fare! Should I feel the urge to write another novella-sized series of posts in the near future, I’ll be sure to let you all know by the point I’m, oh, nine posts in or so.

