There are tons of useful trig identities. You could spend the time to learn them by heart, or just look them up on Wikipedia when necessary. But I’ve always had problems remembering where the signs and such go when trying to memorize them directly. At least for me, what worked far better is this: spend a few hours familiarizing yourself with complex numbers if you haven’t done so already; after that, most identities you need in practice are easy to derive from Euler’s formula:

$e^{ix} = \exp(ix) = \cos(x) + i \sin(x)$

Let’s do the basic addition formulas first. Euler’s formula gives:

$\cos(x+y) + i \sin(x+y) = \exp(i(x+y)) = \exp(ix) \exp(iy)$

and applying Euler’s formula again to each of the two factors gives:

$(\cos(x) + i \sin(x)) (\cos(y) + i \sin(y))$

multiplying out:

$(\cos(x) \cos(y) - \sin(x) \sin(y)) + i (\sin(x) \cos(y) + \cos(x) \sin(y))$

The terms in parentheses are all real; equating real and imaginary parts with our original expression yields the result

$\cos(x+y) = \cos(x) \cos(y) - \sin(x) \sin(y)$
$\sin(x+y) = \sin(x) \cos(y) + \cos(x) \sin(y)$

Both addition formulas for the price of one. (In fact, this works because the addition formulas for trigonometric functions and the addition formula for exponents really are the same thing.) The main point is that if you know complex multiplication, you never have to remember how the factors group or where the signs go – which is exactly the part I always used to get wrong.
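If you’d rather have the computer double-check the algebra than trust your own sign juggling, a few throwaway lines of C++ do the job (this uses nothing but std::complex; the values of x and y are arbitrary):

#include <complex>
#include <cmath>
#include <cstdio>

int main()
{
    double x = 0.7, y = 1.3;
    std::complex<double> i(0.0, 1.0);
    std::complex<double> product = std::exp(i * x) * std::exp(i * y);
    // Real part should match cos(x+y), imaginary part sin(x+y).
    std::printf("%f %f\n", product.real(), std::cos(x + y));
    std::printf("%f %f\n", product.imag(), std::sin(x + y));
    return 0;
}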

Plugging in x=y into the above also immediately gives the double-angle formulas:

$\cos(2x) = \cos(x)^2 - \sin(x)^2$
$\sin(2x) = 2 \sin(x) \cos(x)$

so if you know the addition formulas there’s really no reason to learn these separately.

Then there’s the well-known

$\cos(x)^2 + \sin(x)^2 = 1$

but it’s really just the Pythagorean theorem in disguise (since cos(x) and sin(x) are the legs of a right triangle with hypotenuse 1). So not really a new formula either!
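And if you prefer to stay entirely inside the complex-number picture, it also drops straight out of Euler’s formula, since $\exp(ix)$ and $\exp(-ix)$ are complex conjugates:

$\cos(x)^2 + \sin(x)^2 = (\cos(x) + i \sin(x)) (\cos(x) - i \sin(x)) = \exp(ix) \exp(-ix) = \exp(0) = 1$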

Moving either the cosine or sine terms to the right-hand side gives the two immensely useful equations:

$\cos(x)^2 = 1 - \sin(x)^2$
$\sin(x)^2 = 1 - \cos(x)^2$

In particular, that second one is perfect if you need the sine squared of an angle that you only have the cosine of (usually because you’ve determined it using a dot product). Judicious application of these two tends to be a great way to get rid of superfluous math in shaders (and elsewhere) – one of my pet peeves.

For practice, let’s apply these two identities to the cosine double-angle formula:

$\cos(2x) = \cos(x)^2 - \sin(x)^2 = 2 \cos(x)^2 - 1 \Leftrightarrow \cos(x)^2 = (\cos(2x) + 1) / 2$
$\cos(2x) = \cos(x)^2 - \sin(x)^2 = 1 - 2 \sin(x)^2 \Leftrightarrow \sin(x)^2 = (1 - \cos(2x)) / 2$

why, it’s the half-angle formulas! Fancy meeting you here!

Can we do something with the sine double-angle formula too? Well, it’s not too fancy, but we can get this:

$\sin(2x) = 2 \sin(x) \cos(x) \Leftrightarrow \sin(x) \cos(x) = \sin(2x) / 2$

Now, let’s go back to the original addition formulas and let’s see what happens when we plug in negative values for y. Using $\sin(-x) = -\sin(x)$ and $\cos(-x) = \cos(x)$, we get:

$\cos(x-y) = \cos(x) \cos(y) + \sin(x) \sin(y)$
$\sin(x-y) = \sin(x) \cos(y) - \cos(x) \sin(y)$

Hey look, flipped signs! This means that we can now add these to (or subtract them from) the original formulas to get even more identities!

$\cos(x+y) + \cos(x-y) = 2 \cos(x) \cos(y)$
$\cos(x-y) - \cos(x+y) = 2 \sin(x) \sin(y)$
$\sin(x+y) + \sin(x-y) = 2 \sin(x) \cos(y)$
$\sin(x+y) - \sin(x-y) = 2 \cos(x) \sin(y)$

It’s the product-to-sum identities this time. I got one more! We’ve deliberately flipped signs and then added/subtracted the addition formulas to get the above set. What if we do the same trick in reverse to get rid of those x+y and x-y terms? Let’s set $x = (a + b)/2$ and $y = (b - a)/2$ and plug that into the identities above and we get:

$\cos(b) + \cos(a) = 2 \cos((a+b)/2) \cos((b-a)/2)$
$\cos(a) - \cos(b) = 2 \sin((a + b)/2) \sin((b - a)/2)$
$\sin(b) + \sin(a) = 2 \sin((a + b)/2) \cos((b - a)/2)$

Ta-dah, it’s the sum-to-product identities. Now, admittedly, we’ve taken quite a few steps to get here, and looking these up when you need them is going to be faster than walking through the derivation (if you ever need them in the first place – I don’t think I’ve ever used the product/sum identities in practice). But still, working these out is a good exercise, and a lot less likely to go wrong (at least for me) than memorizing lots of similar formulas. (I never can get the signs right that way)

Bonus exercise: work out general expressions for $\cos(x)^n$ and $\sin(x)^n$. Hint:

$\cos(x) = (\exp(ix) + \exp(-ix))/2$
$\sin(x) = (\exp(ix) - \exp(-ix))/(2i)$.
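To give a taste of how the hint plays out, here’s the worked case for $\cos(x)^3$ (the general formula is still your exercise): cube the right-hand side, expand with the binomial theorem, and pair up terms with opposite exponents:

$\cos(x)^3 = ((\exp(ix) + \exp(-ix))/2)^3 = (\exp(3ix) + 3 \exp(ix) + 3 \exp(-ix) + \exp(-3ix))/8 = (\cos(3x) + 3 \cos(x))/4$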

And I think that’s enough for now. (At some later point, I might do an extra post about one of the sneakier trig techniques: the Weierstrass substitution).

One interesting thing about x86 is that it’s changed two major architectural “magic values” in the past 10 years. The first is the addition of 64-bit mode, which not only widens all general-purpose registers and gives a much larger virtual address space, but also increases the number of general-purpose and XMM registers from 8 to 16. The second is AVX, which allows all SSE (and other SIMD) instructions to be encoded using non-destructive 3-operand forms instead of the original 2-operand forms.

Since modern x86 processors are trying really hard to run both 32- and 64-bit code well (and same for SSE vs. AVX), this gives us an opportunity to compare the relative performance of these choices in a reasonably level playing field, when running the same (C++) code. Of course, this is nowhere near a perfect comparison, especially since switching from 32 to 64 bits also changes the sizes of pointers and (at the very least) the code generator used by the compiler, but it’s still interesting to be able to do the experiment on a single machine with no fuss. So, without further ado, here’s a quick comparison using the Software Occlusion Culling demo I’ve been writing about for the past month – a fairly SIMD-heavy workload.

| Version         | Occlusion cull | Render scene   |
|-----------------|----------------|----------------|
| x86 (baseline)  | 2.88ms         | 1.39ms         |
| x86, /arch:SSE2 | 2.88ms (+0.2%) | 1.48ms (+5.8%) |
| x86, /arch:AVX  | 2.77ms (-3.8%) | 1.43ms (+2.7%) |
| x64             | 2.71ms (-5.7%) | 1.29ms (-7.2%) |
| x64, /arch:AVX  | 2.63ms (-8.7%) | 1.28ms (-8.5%) |

Note that /arch:AVX makes VC++ use AVX forms of SSE vector instructions (i.e. 3-operand), but it’s all still 4-wide SIMD, not the new 8-wide SIMD floating point. Getting that would require changes to the code. And of course the code uses SSE2 (and, in fact, even SSE4.1) instructions whether we turn on /arch:SSE2 on x86 or not – this only affects how “regular” floating-point code is generated. Also, the speedup percentages are computed from the full-precision values, not the truncated values I put in the table. (Which doesn’t mean much, since I truncated the values to about their level of accuracy)
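To make the 3-operand point concrete, here’s a schematic example (hand-written for illustration, not actual compiler output from the demo): computing a sum while keeping both inputs alive needs an extra register-register move with the destructive SSE encoding, but not with the AVX encoding.

; SSE, 2-operand form: addps overwrites its first operand,
; so preserving xmm0 costs an extra move
movaps  xmm2, xmm0
addps   xmm2, xmm1        ; xmm2 = xmm0 + xmm1

; AVX, 3-operand form: separate destination register, no move needed
vaddps  xmm2, xmm0, xmm1  ; xmm2 = xmm0 + xmm1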

So what does this tell us? Hard to be sure. It’s very few data points and I haven’t done any work to eliminate the effect of e.g. memory layout / code placement, which can be very much significant. And of course I’ve also changed the compiler. That said, a few observations:

• Not much of a win turning on /arch:SSE2 on the regular x86 code. If anything, the rendering part of the code gets worse from the “enhanced instruction set” usage. I did not investigate further.
• The 3-operand AVX instructions provide a solid win of a few percentage points in both 32-bit and 64-bit mode. Considering I’m not using any 8-wide instructions, this is almost exclusively the impact of having fewer register-register move instructions.
• Yes, going to 64 bits does make a noticeable difference. Note in particular the dip in rendering time. Whether it’s due to the overhead of 32-bit thunks on a 64-bit system, better code generation on the app side, better code on the D3D runtime/driver side, or most likely a combination of all these factors, the D3D rendering code sure gets a lot faster. And similarly, the SIMD-heavy occlusion cull code sees a good speed-up too. I have not investigated whether this is primarily due to the extra registers, or due to code generation improvements.

I don’t think there’s any particular lesson here, but it’s definitely interesting.

I’ve been asked several times about this, so I wanted to make an “official” statement:

No, I will not prepare ePub/PDF (“book”) versions of posts, particularly the “A trip through the Graphics Pipeline 2011” and “Optimizing Software Occlusion Culling” series. However, should someone be willing to prepare such a thing, I’d be very happy to provide them with a WordPress extended RSS dump of the site contents (with your comments and all other emails / personal data removed, don’t worry) and host the results. If you’re interested in helping, please write a comment with a valid email address and I’ll get in touch with you.

To clarify the legal situation, I have put both these series into the public domain (using the CC-0 “license” waiver). This means you may do with these posts whatever you want. You may edit them, update them or add additional information; you may turn them into an eBook, PDF, or hardcopy book; you may use it as a starting point for a graphics pipeline Wiki, if you are so inclined – I don’t have the energy or web development chops to set that kind of thing up, but I’d be happy to contribute to it if it existed! You may also claim that you wrote them yourself, sell it to a publisher for a million bucks, and invest the proceeds in land mines you bury in a public park. I would rather that you not do these things, but it boils down to this: if you were to do it, would I want to make the whole affair even more unpleasant for myself than it would already be by engaging in complicated and expensive legal proceedings? And my answer to that question is a clear “no”.

In fact, my reasons for not preparing eBook versions and for releasing the texts in the public domain are basically the same: I enjoy writing these posts, and I enjoy seeing people read them. I do not enjoy wrestling with publication formats or blogging frameworks, and I certainly don’t enjoy dealing with legal issues. The reason I can manage to write a few thousand words of technical content a week despite having a full-time job is because I’ve structured the experience to be as enjoyable and low-friction for me as possible. Last year, I tried editing the “A trip through the Graphics Pipeline 2011” series into a book format, and progress was excruciatingly slow, because ultimately it was not a fun task for me; it felt like an unpaid part-time job, so at some point I just stopped.

So this is the deal: I’m a professional software developer that happens to like writing. But the writing is a pure “bonus”; I do it because I enjoy it, but only as long as it’s on my terms – I write what I feel like writing, on whatever schedule pleases me, and without any additional process beyond hitting “Publish” once I’m done. I’ll be happy to help anyone who wants to do more than that, but I’m not going to do it myself.

This post is part of a series – go here for the index.

Welcome back! Last time, I promised to end the series with a bit of reflection on the results. So, time to find out how far we’ve come!

### The results

Without further ado, here’s the breakdown of per-frame work at the end of the respective posts (names abbreviated), in order:

| Post            | Cull / setup | Render depth | Depth test | Render scene | Total  |
|-----------------|--------------|--------------|------------|--------------|--------|
| Initial         | 1.988        | 3.410        | 2.091      | 5.567        | 13.056 |
| Write Combining | 1.946        | 3.407        | 2.058      | 3.497        | 10.908 |
| Sharing         | 1.420        | 3.432        | 1.829      | 3.490        | 10.171 |
| Cache issues    | 1.045        | 3.485        | 1.980      | 3.420        | 9.930  |
| Frustum culling | 0.735        | 3.424        | 1.812      | 3.495        | 9.466  |
| Depth buffers 1 | 0.740        | 3.061        | 1.791      | 3.434        | 9.026  |
| Depth buffers 2 | 0.739        | 2.755        | 1.484      | 3.578        | 8.556  |
| Workers 1       | 0.418        | 2.134        | 1.354      | 3.553        | 7.459  |
| Workers 2       | 0.197        | 2.217        | 1.191      | 3.463        | 7.068  |
| Dataflows       | 0.180        | 2.224        | 0.831      | 3.589        | 6.824  |
| Speculation     | 0.169        | 1.972        | 0.766      | 3.655        | 6.562  |
| Mopping up      | 0.183        | 1.940        | 0.797      | 1.389        | 4.309  |
| Total diff.     | -90.0%       | -43.1%       | -61.9%     | -75.0%       | -67.0% |
| Speedup         | 10.86x       | 1.76x        | 2.62x      | 4.01x        | 3.03x  |

What, you think that doesn’t tell you much? Okay, so did I. Have a graph instead:

The image is a link to the full-size version that you probably want to look at. Note that in both the table and the image, updating the depth test pass to use the rasterizer improvements is chalked up to “Depth buffers done quick, part 2”, not “The care and feeding of worker threads, part 1” where I mentioned it in the text.

From the graph, you should clearly see one very interesting fact: the two biggest individual improvements – the write combining fix at 2.1ms and “Mopping up” at 2.2ms – both affect the D3D rendering code, and don’t have anything to do with the software occlusion culling code. In fact, it wasn’t until “Depth buffers done quick” that we actually started working on that part of the code. Which makes you wonder…

### What-if machine

Is the software occlusion culling actually worth it? That is, how much do we actually get for the CPU time we invest in occlusion culling? To help answer this, I ran a few more tests:

| Test                 | Cull / setup | Render depth | Depth test | Render scene | Total  |
|----------------------|--------------|--------------|------------|--------------|--------|
| Initial              | 1.988        | 3.410        | 2.091      | 5.567        | 13.056 |
| Initial, no occ.     | 1.433        | 0.000        | 0.000      | 25.184       | 26.617 |
| Cherry-pick          | 1.548        | 3.462        | 1.977      | 2.084        | 9.071  |
| Cherry-pick, no occ. | 1.360        | 0.000        | 0.000      | 10.124       | 11.243 |
| Final                | 0.183        | 1.940        | 0.797      | 1.389        | 4.309  |
| Final, no occ.       | 0.138        | 0.000        | 0.000      | 6.866        | 7.004  |

Yes, the occlusion culling was a solid win both before and after. But the interesting value is the “cherry-pick” one. This is the original code, with only the following changes applied: (okay, and also with the timekeeping code added, in case you feel like nitpicking)

In other words, “Cherry-pick” is within a few dozen lines of the original code, all of the changes are to “framework” code not the actual sample, and none of them do anything fancy. Yet it makes the difference between occlusion culling enabled and disabled shrink to about a 1.24x speedup, down from the 2x it was before!

### A brief digression

This kind of thing is, in a nutshell, the reason why graphics papers really need to come with source code. Anything GPU-related in particular is full of performance cliffs like this. In this case, I had the source code, so I could investigate what was going on, fix a few problems, and get a much more realistic assessment of the gain to expect from this kind of technique. Had it just been a paper claiming a “2x improvement”, I would certainly not have been able to reproduce that result – note that in the “final” version, the speedup goes back to about 1.63x, but that’s with a considerable amount of extra work.

I mention this because it’s a very common problem: whatever technique the author of a paper is proposing is well-optimized and tweaked to look good, whereas the things that it’s being compared with are often very sloppy implementations. The end result is lots of papers that claim “substantial gains” over the prior state of the art that somehow never materialize for anyone else. At one extreme, I’ve had one of my professors state outright at one point that he just stopped giving out source code to their algorithms because the effort invested in getting other people to successfully replicate his old results “distracted” him from producing new ones. (I’m not going to name names here, but he later stated several other things along the same lines, and he’s probably the number one reason for me deciding against pursuing a career in academia.)

To that kind of attitude, I have only one thing to say: If you care only about producing results and not independent verification, then you may be brilliant, but you are not a scientist, and there’s a very good chance that your life’s work is useless to anyone but yourself.

Conversely, exposing your code to outside eyes might not be the optimal way to stroke your ego in case somebody finds an obvious mistake :), but it sure makes your approach a lot more likely to actually become relevant in practice. Anyway, let’s get back to the subject at hand.

### Observations

The number one lesson from all of this probably is that there’s lots of ways to shoot yourself in the foot in graphics, and that it’s really easy to do so without even noticing it. So don’t assume, profile. I’ve used a fancy profiler with event-based sampling (VTune), but even a simple tool like Sleepy will tell you when a small piece of code takes a disproportionate amount of time. You just have to be on the lookout for these things.

Which brings me to the next point: you should always have an expectation of how long things should take. A common misconception is that profilers are primarily useful to identify the hot spots in an application, so you can focus your efforts there. Let’s have another look at the very first profiler screenshot I posted in this series:

If I had gone purely by what takes the largest amount of time, I’d have started with the depth buffer rasterization pass; as you should well recall, it took me several posts to explain what’s even going on in that code, and as you can see from the chart above, while we got a good win out of improving it (about 1.1ms total), doing so took lots of individual changes. Compare with what I actually worked on first – namely, the Write Combining issue, which gave us a 2.1ms win for a three-line change.

So what’s the secret? Don’t use a profile exclusively to look for hot spots. In particular, if your profile has the hot spots you expected (like the depth buffer rasterizer in this example), they’re not worth more than a quick check to see if there’s any obvious waste going on. What you really want to look for are anomalies: code that seems to be running into execution issues (like SetRenderStates with the read-back from write-combined memory running at over 9 cycles per instruction), or things that just shouldn’t take as much time as they seem to (like the frustum culling code we looked at for the next few posts). If used correctly, a profiler is a powerful tool not just for performance tuning, but also to find deeper underlying architectural issues.

### While you’re at it…

Anyway, once you’ve picked a suitable target, I recommend that you not do just the bare minimum of work needed to knock it out of the top 10 (or some other arbitrary cut-off). After “Frustum culling: turning the crank”, a commenter asked why I would spend the extra time optimizing a function that was, at the time, only at the #10 spot in the profile. A perfectly valid question, but one I have three separate answers to:

First, the answer I gave in the comments at the time: code does not exist in isolation; it exists in a context. A lot of the time in optimizing code (or even just reading it, for that matter) is spent building up a mental model of what’s going on and how it relates to the rest of the system. The best time to make changes to code is while that mental model is still current; if you drop the topic and work somewhere else for a bit, you’ll have to redo at least part of that work again. So if you have ideas for further improvements while you’re working on code, that’s a good time to try them out (once you’ve finished your current task, anyway). If you run out of ideas, or if you notice you’re starting to micro-optimize where you really shouldn’t, then stop. But by all means continue while the going is good; even if you don’t need that code to be faster now, you might want it later.

Second, never mind the relative position. As you can see in the table above, the “advanced” frustum culling changes reduced the total frame time by about 0.4ms. That’s about as much as we got out of our first set of depth buffer rendering changes, even though it was much simpler work. Particularly for games, where you usually have a set frame rate target, you don’t particularly care where exactly you get the gains from; 0.3ms less is 0.3ms less, no matter whether it’s done by speeding up one of the Top 10 functions slightly or something else substantially!

Third, relating to my comment about looking for anomalies above: unless there’s a really stupid mistake somewhere, it’s fairly likely that the top 10, or top 20, or top whatever hot spots are actually code that does substantial work – certainly so for code that other people have already optimized. However, most people do tend to work on the hot spots first when looking to improve performance. My favorite sport when optimizing code is starting in the middle ranks: while everyone else is off banging their head against the hard problems, I will casually snipe at functions in the 0.05%-1.0% total run time range. This has two advantages: first, you can often get rid of a lot of these functions entirely. Even if it’s only 0.2% of your total time, if you manage to get rid of it, that’s 0.2% that are gone. It’s usually a lot easier to get rid of a 0.2% function than it is to squeeze an extra 2% out of a 10%-run time function that 10 people have already looked at. And second, the top hot spots are usually in leafy code. But down in the middle ranks is “middle management” – code that’s just passing data around, maybe with some minor reformatting. That’s your entry point to re-designing data flows: this is the code where subsystems meet – the place where restructuring will make a difference. When optimizing interfaces, it’s crucial to be working on the interfaces that actually have problems, and this is how you find them.

### Ground we’ve covered

Throughout this series, my emphasis has been on changes that are fairly high-yield but have low impact in terms of how much disruption they cause. I also made no substantial algorithmic changes. That was fully intentional, but it might be surprising; after all, as any (good) text covering optimization will tell you, it’s much more important to get your algorithms right than it is to fine-tune your code. So why this bias?

Again, I did this for a reason: while algorithmic changes are indeed the ticket when you need large speed-ups, they’re also very context-sensitive. For example, instead of optimizing the frustum culling code the way I did – by making the code more SIMD- and cache-friendly – I could have just switched to a bounding volume hierarchy instead. And normally, I probably would have. But there’s plenty of material on bounding volume hierarchies out there, and I trust you to be able to find it yourself; by now, there’s also a good amount of Google-able material on “Data-oriented Design” (I dislike the term; much like “Object-oriented Design”, it means everything and nothing) and designing algorithms and data structures from scratch for good SIMD and cache efficiency.

But I found that there’s a distinct lack of material for the actual problem most of us actually face when optimizing: how do I make existing code faster without breaking it or rewriting it from scratch? So my point with this series is that there’s a lot you can accomplish purely using fairly local and incremental changes. And while the actual changes are specific to the code, the underlying ideas are very much universal, or at least I hope so. And I couldn’t resist throwing in some low-level architectural material too, which I hope will come in handy. :)

### Changes I intentionally did not make

So finally, here’s a list of things I did not discuss in this series, because they were either too invasive, too tricky or changed the algorithms substantially:

• Changing the way the binner works. We don’t need that much information per triangle, and currently we gather vertices both in the binner and the rasterizer, which is a fairly expensive step. I did implement a variant that writes out signed 16-bit coordinates and the set-up Z plane equation; it saves roughly another 0.1ms in the final rasterizer, but it’s a fairly invasive change. Code is here for those who are interested. (I may end up posting other stuff to that branch later, hence the name).
• A hierarchical rasterizer for the larger triangles. Another thing I implemented (note this branch is based off a pre-blog version of the codebase) but did not feel like writing about because it took a lot of effort to deliver, ultimately, fairly little gain.
• Other rasterizer techniques or tweaks. I could have implemented a scanline rasterizer, or a different traversal strategy, or a dozen other things. I chose not to; I wanted to write an introduction to edge-function rasterizers, since they’re cool, simple to understand and less well-known than they should be, and this series gave me a good excuse. I did not, however, want to spend more time on actual rasterizer optimization than the two posts I wrote; it’s easy to spend years of your life on that kind of stuff (I’ve seen it happen!), but there’s a point to be made that this series was already too long, and I did not want to stretch it even further.
• Directly rasterizing quads in the depth test rasterizer. The depth test rasterizer only handles boxes, which are built from 6 quads. It’s possible to build an edge function rasterizer that directly traverses quads instead of triangles. Again, I wrote the code (not on Github this time) but decided against writing about it; while the basic idea is fairly simple, it turned out to be really ugly to make it work in a “drop-in” fashion with the rest of the code. See this comment and my reply for a few extra details.
• Ray-trace the boxes in the test pass instead of rasterizing them. Another suggestion by Doug. It’s a cool idea and I think it has potential, but I didn’t try it.
• Render a lower-res depth buffer using very low-poly, conservative models. This is how I’d actually use this technique for a game; I think bothering with a full-size depth buffer is just a waste of memory bandwidth and processing time, and we do spend a fair amount of our total time just transforming vertices too. Nor is there a big advantage to using the more detailed models for culling. That said, changing this would have required dedicated art for the low-poly occluders (which I didn’t want to do); it also would’ve violated my “no-big-changes” rule for this series. Both these changes are definitely worth looking into if you want to ship this in a game.
• Try other occlusion culling techniques. Out of the (already considerably bloated) scope of this series.

And that’s it! I hope you had as much fun reading these posts as I did writing them. But for now, it’s back to your regularly scheduled, piece-meal blog fare, at least for the time being! Should I feel the urge to write another novella-sized series of posts again in the near future, I’ll be sure to let you all know by the point I’m, oh, nine posts in or so.

This post is part of a series – go here for the index.

Welcome back! This post is going to be slightly different from the others. So far, I’ve attempted to group the material thematically, so that each post has a coherent theme (to a first-order approximation, anyway). Well, this one doesn’t – this is a collection of everything that didn’t fit anywhere else. But don’t worry, there’s still some good stuff in here! That said, one warning: there’s a bunch of poking around in the framework code this time, and it didn’t come with docs, so I’m honestly not quite sure how some of the internals are supposed to work. So the code changes referenced this time are definitely on the hacky side of things.

### The elephant in the room

Featured quite near the top of all the profiles we’ve seen so far are two functions I haven’t talked about before:

In case you’re wondering, the VIDMM_Global::ReferenceDmaBuffer is what used to be just “[dxgmms1.sys]” in the previous posts; I’ve set up VTune to use the symbol server to get debug symbols for this DLL. Now, I haven’t talked about this code before because it’s part of the GPU rendering, not the software rasterizer, but let’s broaden our scope one final time.

What you can see here is the video memory manager going over the list of resources (vertex/index buffers, constant buffers, textures, and so forth) referenced by a DMA buffer (which is what WDDM calls GPU command buffers in the native format) and completely blowing out the cache; each resource has some amount of associated metadata that the memory manager needs to look at (and possibly update), and it turns out there’s many of them. The cache is not amused.

So, what can we do to use less resources? There’s lots of options, but one thing I had noticed while measuring loading time is that there’s one dynamic constant buffer per model:

// Create the model constant buffer.
HRESULT hr;
D3D11_BUFFER_DESC bd = {0};
bd.ByteWidth = sizeof(CPUTModelConstantBuffer);
bd.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
bd.Usage = D3D11_USAGE_DYNAMIC;
bd.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
hr = (CPUT_DX11::GetDevice())->CreateBuffer( &bd, NULL,
                                             &mpModelConstantBuffer );
ASSERT( !FAILED( hr ), _L("Error creating constant buffer.") );


Note that they’re all the same size, and it turns out that all of them get updated (using a Map with DISCARD) immediately before they get used for rendering. And because there’s about 27000 models in this example, we’re talking about a lot of constant buffers here.

What if we instead just created one dynamic model constant buffer, and shared it between all the models? It’s a fairly simple change to make, if you’re willing to do it in a hacky fashion (as said, not very clean code this time). For this test, I took the liberty of adding some timing around the actual D3D rendering code as well, so we can compare. It’s probably gonna make a difference, but how much can it be, really?

Change: Single shared dynamic model constant buffer

| Render scene   | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|----------------|-------|-------|-------|-------|-------|-------|-------|
| Original       | 3.392 | 3.501 | 3.551 | 3.618 | 4.155 | 3.586 | 0.137 |
| One dynamic CB | 2.474 | 2.562 | 2.600 | 2.644 | 3.043 | 2.609 | 0.068 |

It turns out that reducing the number of distinct constant buffers referenced per frame by several thousand is a pretty big deal. Drivers work hard to make constant buffer DISCARD really, really fast, and they make sure that the underlying allocations get handled quickly. And discarding a single constant buffer a thousand times in a frame works out to be a lot faster than discarding a thousand constant buffers once each.
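For illustration, the shared-buffer version boils down to something like this – a rough sketch with made-up function and parameter names (assuming <d3d11.h> and <string.h>), not the actual patch: map the single buffer with DISCARD, write the per-model constants, and bind it before each draw.

// Sketch only: one dynamic constant buffer reused for every model.
void DrawModelWithSharedCB(ID3D11DeviceContext *ctx, ID3D11Buffer *sharedCB,
                           const CPUTModelConstantBuffer &constants)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    if(SUCCEEDED(ctx->Map(sharedCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        memcpy(mapped.pData, &constants, sizeof(constants));
        ctx->Unmap(sharedCB, 0);
    }
    ctx->VSSetConstantBuffers(0, 1, &sharedCB);
    // ... issue the actual draw call for this model here ...
}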

Lesson learned: for “throwaway” constant buffers, it’s a good idea to design your renderer so it only allocates one underlying D3D constant buffer per size class. More are not necessary and can (evidently) induce a substantial amount of overhead. D3D11.1 adds a few features that allow you to further reduce that count down to a single constant buffer that’s used the same way that dynamic vertex/index buffers are; as you can see, there’s a reason. Here’s the profile after this single fix:

Still a lot of time spent in the driver and the video memory manager, but if you compare the raw cycle counts with the previous image, you can see that this change really made quite a dent.

This was (for the most part) something I worked on just to make my life easier – as you can imagine, while writing this series, I’ve recorded lots of profiling and tests runs, and the loading time is a fixed cost I pay every time. I won’t go in depth here, but I still want to give a brief summary of the changes I made and why. If you want to follow along, the changes in the source code start at the “Track loading time” commit.

#### Initial: 9.29s

First, I simply added a timer and code to print the loading time to the debug output window.

#### Load materials once, not once per model: 4.54s

One thing I noticed way back in January when I did my initial testing was that most materials seem to get loaded multiple times; there seems to be logic in the asset library code to avoid loading materials multiple times, but it didn’t appear to work for me. So I modified the code to actually load each material only once and then create copies when requested. As you can see, this change by itself roughly cut loading times in half.

#### FindAsset optimizations: 4.32s

FindAsset is the function used in the asset manager to actually look up resources by name. With two simple changes to avoid unnecessary path name resolution and string compares, the loading time drops by another 200ms.

I mentioned this in “A string processing rant”, but hadn’t actually merged the changes into the blog branch so far. Well, here you go: with these three commits that together rewrite a substantial portion of the config file reading, we lose almost another 2 seconds. Yes, that was 2 whole seconds’ worth of unnecessary allocations and horribly inefficient string handling. I wrote that rant for a reason.

#### Improve shader input layout cache: 2.03s

D3D11 wants shader input layouts to be created with a pointer to the bytecode of the shader they’re going to be used with, to handle the binding of vertex formats to shaders. The “shader input layout cache” is just an internal cache that produces such input layouts for all unique combinations of vertex formats and shaders we use. The original implementation of this cache was fairly inefficient, but the code already contained a “TODO” comment with instructions on how to fix it. In this commit, I implemented that fix.
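To give an idea of what such a cache looks like, here’s a generic sketch (not the CPUT code and not the specific fix from that commit; type and variable names are made up, and it assumes <d3d11.h>, <map> and <utility>):

// Sketch: create each (vertex format, shader) input layout at most once.
typedef std::pair<UINT, ID3D11VertexShader*> LayoutKey;
std::map<LayoutKey, ID3D11InputLayout*> gLayoutCache;

ID3D11InputLayout *GetOrCreateLayout(ID3D11Device *dev, UINT vertexFormatId,
                                     const D3D11_INPUT_ELEMENT_DESC *elems, UINT numElems,
                                     ID3D11VertexShader *vs,
                                     const void *vsBytecode, SIZE_T bytecodeSize)
{
    LayoutKey key(vertexFormatId, vs);
    std::map<LayoutKey, ID3D11InputLayout*>::iterator it = gLayoutCache.find(key);
    if(it != gLayoutCache.end())
        return it->second; // cache hit: reuse the existing layout

    ID3D11InputLayout *layout = NULL;
    dev->CreateInputLayout(elems, numElems, vsBytecode, bytecodeSize, &layout);
    gLayoutCache[key] = layout;
    return layout;
}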

#### Reduce temporary strings: 1.88s

There were still a bunch of unnecessary string temporaries being created, which I found simply by looking at the call stack profiles of free calls during the loading phase (yet another useful application for profilers)! Two commits later, this problem was resolved too.

#### Actually share materials: 1.46s

Finally, this commit goes one step further than just loading the materials once, it also actually shares the same material instance between all its users (the previous version created copies). This is not necessarily a safe change to make. I have no idea what invariants the asset manager tries to enforce, if any. Certainly, this would cause problems if someone were to start modifying materials after loading – you’d need to introduce copy-on-write or something similar. But in our case (i.e. the Software Occlusion Culling demo), the materials do not get modified after loading, and sharing them is completely safe.

Not only does this reduce loading time by another 400ms, it also makes rendering a lot faster, because suddenly there are a lot fewer cache misses when setting up shaders and render states for the individual models:

Change: Share materials.

| Render scene    | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|-----------------|-------|-------|-------|-------|-------|-------|-------|
| Original        | 3.392 | 3.501 | 3.551 | 3.618 | 4.155 | 3.586 | 0.137 |
| One dynamic CB  | 2.474 | 2.562 | 2.600 | 2.644 | 3.043 | 2.609 | 0.068 |
| Share materials | 1.870 | 1.922 | 1.938 | 1.964 | 2.331 | 1.954 | 0.057 |

Again, this is somewhat extreme because there’s so many different models around, but it illustrates the point: you really want to make sure there’s no unnecessary duplication of data used during rendering; you’re going to be missing the cache enough during regular rendering as it is.

And at that point, I decided that I could live with 1.5 seconds of loading time, so I didn’t pursue the matter any further. :)

### The final rendering tweak

There’s one more function with a high number of cache misses in the profiles I’ve been running, even though it’s never been at the top. That function is AABBoxRasterizerSSE::RenderVisible, which uses the (post-occlusion-test) visibility information to render all visible models. Here’s the code:

void AABBoxRasterizerSSE::RenderVisible(CPUTAssetSet **pAssetSet,
                                        CPUTRenderParametersDX &renderParams,
                                        UINT numAssetSets)
{
    int count = 0;

    for(UINT assetId = 0, modelId = 0; assetId < numAssetSets; assetId++)
    {
        for(UINT nodeId = 0; nodeId < GetAssetCount(); nodeId++)
        {
            CPUTRenderNode* pRenderNode = NULL;
            CPUTResult result = pAssetSet[assetId]->GetAssetByIndex(nodeId, &pRenderNode);
            ASSERT((CPUT_SUCCESS == result), _L("Failed getting asset by index"));
            if(pRenderNode->IsModel())
            {
                if(mpVisible[modelId])
                {
                    CPUTModelDX11* model = (CPUTModelDX11*)pRenderNode;
                    model = (CPUTModelDX11*)pRenderNode;
                    model->Render(renderParams);
                    count++;
                }
                modelId++;
            }
            pRenderNode->Release();
        }
    }
    mNumCulled = mNumModels - count;
}


This code first enumerates all RenderNodes (a base class) in the active asset libraries, asks each of them “are you a model?”, and if so renders it. This is a construct I’ve seen several times before – but from a performance standpoint, it’s a terrible idea. We walk over the whole scene database and do a virtual function call (which means we have to, at the very least, load the cache line containing the vtable pointer) just to check whether the current item is a model, and only then check whether it is culled – in which case we simply ignore it.

That is a stupid game and we should stop playing it.

Luckily, it’s easy to fix: at load time, we traverse the scene database once to make a list of all the models. Note the code already does such a pass to initialize the bounding boxes etc. for the occlusion culling pass; all we have to do is set up an extra array that maps modelIds to the corresponding models. Then the actual rendering code turns into:

void AABBoxRasterizerSSE::RenderVisible(CPUTAssetSet **pAssetSet,
                                        CPUTRenderParametersDX &renderParams,
                                        UINT numAssetSets)
{
    int count = 0;

    for(UINT modelId = 0; modelId < mNumModels; modelId++)
    {
        if(mpVisible[modelId])
        {
            mpModels[modelId]->Render(renderParams);
            count++;
        }
    }

    mNumCulled = mNumModels - count;
}


That already looks much better. But how much does it help?

Change: Cull before accessing models

| Render scene      | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|-------------------|-------|-------|-------|-------|-------|-------|-------|
| Original          | 3.392 | 3.501 | 3.551 | 3.618 | 4.155 | 3.586 | 0.137 |
| One dynamic CB    | 2.474 | 2.562 | 2.600 | 2.644 | 3.043 | 2.609 | 0.068 |
| Share materials   | 1.870 | 1.922 | 1.938 | 1.964 | 2.331 | 1.954 | 0.057 |
| Fix RenderVisible | 1.321 | 1.358 | 1.371 | 1.406 | 1.731 | 1.388 | 0.047 |

I rest my case.

And I figure that this nice 2.59x cumulative speedup on the rendering code is a good stopping point for the coding part of this series – quit while you’re ahead and all that. There’s a few more minor fixes (both for actual bugs and speed problems) on Github, but it’s all fairly small change, so I won’t go into the details.

This series is not yet over, though; we’ve covered a lot of ground, and every case study should spend some time reflecting on the lessons learned. I also want to explain why I covered what I did, what I left out, and a few notes on the way I tend to approach performance problems. So all that will be in the next and final post of this series. Until then!

This post is part of a series – go here for the index.

Welcome back! Today, it’s time to take a closer look at the triangle binning code, which we’ve only seen mentioned briefly so far, and we’re going to see a few more pitfalls that all relate to speculative execution.

### Loads blocked by what?

There’s one more micro-architectural issue this program runs into that I haven’t talked about before. Here’s the obligatory profiler screenshot:

The full column name reads “Loads Blocked by Store Forwarding”. So, what’s going on there? For this one, I’m gonna have to explain a bit first.

So let’s talk about stores in an out-of-order processor. In this series, we already saw how conditional branches and memory sharing between cores get handled on modern x86 cores: namely, with speculative execution. For branches, the core tries to predict which direction they will go, and automatically starts fetching and executing the corresponding instructions. Similarly, memory accesses are assumed to not conflict with what other cores are doing at the same time, and just march on ahead. But if it later turns out that the branch actually went in the other direction, that there was a memory conflict, or that some exception / hardware interrupt occurred, all the instructions that were executed in the meantime are invalid and their results must be discarded – the speculation didn’t pan out. The implicit assumption is that our speculation (branches behave as predicted, memory accesses generally don’t conflict and CPU exceptions/interrupts are rare) is right most of the time, so it generally pays off to forge ahead, and the savings are worth the occasional extra work of undoing a bunch of instructions when we turned out to be wrong.

But wait, how does the CPU “undo” instructions? Well, conceptually it takes a “snapshot” of the current machine state every time it’s about to start an operation that it might later have to undo. If that instruction makes it all the way through the pipeline without incident, it just gets retired normally, the snapshot gets thrown away and we know that our speculation was successful. But if there is a problem somewhere, the machine can just throw away all the work it did in the meantime, rewind back to the snapshot and retry.

Of course, CPUs don’t actually take full snapshots. Instead, they make use of the out-of-order machinery to do things much more efficiently: out-of-order CPUs have more registers internally than are exposed in the ISA (Instruction Set Architecture), and use a technique called “register renaming” to map the small set of architectural registers onto the larger set of physical registers. The “snapshotting” then doesn’t actually need to save register contents; it just needs to keep track of what the current register mapping at the snapshot point was, and make sure that the associated physical registers from the “before” snapshot don’t get reused until the instruction is safely retired.

This takes care of register modifications. We already know what happens with loads from memory – we just run them, and if it later turns out that the memory contents changed between the load instruction’s execution and its retirement, we need to re-run that block of code. Stores are the tricky part: we can’t easily do “memory renaming” since memory (unlike registers) is a shared resource, and also unlike registers rarely gets written in whole “accounting units” (cache lines) at a time.

The solution is store buffers: when a store instruction is executed, we do all the necessary groundwork – address translation, access right checking and so forth – but don’t actually write to memory just yet; rather, the target address and the associated data bits are written into a store buffer, where they just sit around for a while; the store buffers form a log of all pending writes to memory. Only after the core is sure that the store instruction will actually be executed (branch results etc. are known and no exceptions were triggered) will these values actually be written back to the cache.

Buffering stores this way has numerous advantages (beyond just making speculation easier), and is a technique not just used in out-of-order architectures; there’s just one problem though: what happens if I run code like this?

mov  [x], eax
mov  ebx, [x]


Assuming no other threads writing to the same memory at the same time, you would certainly hope that at the end of this instruction sequence, eax and ebx contain the same value. But remember that the first instruction (the store) just writes to a store buffer, whereas the second instruction (the load) normally just references the cache. At the very least, we have to detect that this is happening – i.e., that we are trying to load from an address that currently has a write logged in a store buffer – but there’s numerous things we could do with that information.

One option is to simply stall the core and wait until the store is done before the load can start. This is fairly cheap to implement in hardware, but it does slow down the software running on it. This option was chosen by the in-order cores used in the current generation of game consoles, and the result is the dreaded “Load Hit Store” stall. It’s a way to solve the problem, but let’s just say it won’t win you many friends.

So x86 cores normally use a technique called “store to load forwarding” or just “store forwarding”, where loads can actually read data directly from the store buffers, at least under certain conditions. This is much more expensive in hardware – it adds a lot of wires between the load unit and the store buffers – but it is far less finicky to use on the software side.

So what are the conditions? The details depend on the core in question. Generally, if you store a value to a naturally aligned location in memory, and do a load with the same size as the store, you can expect store forwarding to work. If you do trickier stuff – span multiple cache lines, or use mismatched sizes between the loads and stores, for example – it really does depend. Some of the more recent Intel cores can also forward larger stores into smaller loads (e.g. a DWord read from a location written with MOVDQA) under certain circumstances, for example. The dual case (a large load overlapping with smaller stores) is substantially harder though, because it can involve multiple store buffers at the same time, and I currently know of no processor that implements this. And whenever you hit a case where the processor can’t perform store forwarding, you get the “Loads Blocked by Store Forwarding” stall above (effectively, x86’s version of a Load-Hit-Store).
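As a schematic illustration of the failure case (made-up variable names, assuming the usual SSE intrinsics header; this is not code from the demo):

__declspec(align(16)) float tmp[4];
tmp[0] = x0; // four separate 4-byte stores, each landing in its own store buffer entry...
tmp[1] = x1;
tmp[2] = x2;
tmp[3] = x3;
__m128 v = _mm_load_ps(tmp); // ...then one 16-byte load overlapping all four: no forwarding, stall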

### Revenge of the cycle-eaters

Which brings us back to the example at hand: what’s going on in those functions, BinTransformedTrianglesMT in particular? Some investigation of the compiled code shows that the first sign of blocked loads is near these reads:

Gather(xformedPos, index, numLanes);

vFxPt4 xFormedFxPtPos[3];
for(int i = 0; i < 3; i++)
{
    xFormedFxPtPos[i].X = ftoi_round(xformedPos[i].X);
    xFormedFxPtPos[i].Y = ftoi_round(xformedPos[i].Y);
    xFormedFxPtPos[i].Z = ftoi_round(xformedPos[i].Z);
    xFormedFxPtPos[i].W = ftoi_round(xformedPos[i].W);
}


and looking at the code for Gather shows us exactly what’s going on:

void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
                                UINT numLanes)
{
    for(UINT l = 0; l < numLanes; l++)
    {
        for(UINT i = 0; i < 3; i++)
        {
            UINT index = mpIndices[(triId * 3) + (l * 3) + i];
            pOut[i].X.lane[l] = mpXformedPos[index].m128_f32[0];
            pOut[i].Y.lane[l] = mpXformedPos[index].m128_f32[1];
            pOut[i].Z.lane[l] = mpXformedPos[index].m128_f32[2];
            pOut[i].W.lane[l] = mpXformedPos[index].m128_f32[3];
        }
    }
}


Aha! This is the code that transforms our vertices from the AoS (array of structures) form that’s used in memory into the SoA (structure of arrays) form we use during binning (and also the two rasterizers). Note that the output vectors are written element by element; then, as soon as we try to read the whole vector into a register, we hit a forwarding stall, because the core can’t forward the results from the 4 different stores per vector to a single load. It turns out that the other two instances of forwarding stalls run into this problem for the same reason – during the gather of bounding box vertices and triangle vertices in the rasterizer, respectively.

So how do we fix it? Well, we’d really like those vectors to be written using full-width SIMD stores instead. Luckily, that’s not too hard: converting data from AoS to SoA is essentially a matrix transpose, and our typical use case happens to be 4 separate 4-vectors, i.e. a 4×4 matrix – and a 4×4 matrix transpose is fairly easy to do in SSE; Intel’s intrinsics header file even comes with a macro that implements it. So here’s the updated Gather that uses an SSE transpose:

void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
                                UINT numLanes)
{
    const UINT *pInd0 = &mpIndices[triId * 3];
    const UINT *pInd1 = pInd0 + (numLanes > 1 ? 3 : 0);
    const UINT *pInd2 = pInd0 + (numLanes > 2 ? 6 : 0);
    const UINT *pInd3 = pInd0 + (numLanes > 3 ? 9 : 0);

    for(UINT i = 0; i < 3; i++)
    {
        __m128 v0 = mpXformedPos[pInd0[i]]; // x0 y0 z0 w0
        __m128 v1 = mpXformedPos[pInd1[i]]; // x1 y1 z1 w1
        __m128 v2 = mpXformedPos[pInd2[i]]; // x2 y2 z2 w2
        __m128 v3 = mpXformedPos[pInd3[i]]; // x3 y3 z3 w3
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        // After transpose:
        pOut[i].X = VecF32(v0); // v0 = x0 x1 x2 x3
        pOut[i].Y = VecF32(v1); // v1 = y0 y1 y2 y3
        pOut[i].Z = VecF32(v2); // v2 = z0 z1 z2 z3
        pOut[i].W = VecF32(v3); // v3 = w0 w1 w2 w3
    }
}


Not much to talk about here. The other two instances of this get modified in the exact same way. So how much does it help?

Change: Gather using SSE instructions and transpose

| Total cull time | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|-----------------|-------|-------|-------|-------|-------|-------|-------|
| Initial         | 3.148 | 3.208 | 3.243 | 3.305 | 4.321 | 3.271 | 0.100 |
| SSE Gather      | 2.934 | 3.078 | 3.110 | 3.156 | 3.992 | 3.133 | 0.103 |

| Render depth | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|--------------|-------|-------|-------|-------|-------|-------|-------|
| Initial      | 2.206 | 2.220 | 2.228 | 2.242 | 2.364 | 2.234 | 0.022 |
| SSE Gather   | 2.099 | 2.119 | 2.137 | 2.156 | 2.242 | 2.141 | 0.028 |

| Depth test | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|------------|-------|-------|-------|-------|-------|-------|-------|
| Initial    | 0.813 | 0.830 | 0.839 | 0.847 | 0.886 | 0.839 | 0.013 |
| SSE Gather | 0.773 | 0.793 | 0.802 | 0.809 | 0.843 | 0.801 | 0.012 |

So we’re another 0.13ms down, about 0.04ms of which we gain in the depth testing pass and the remaining 0.09ms in the rendering pass. And a re-run with VTune confirms that the blocked loads are indeed gone:

### Vertex transformation

Last time, we modified the vertex transform code in the depth test rasterizer to get rid of the z-clamping and simplify the clipping logic. We also changed the logic to make better use of the regular structure of our input vertices. We don’t have any special structure we can use to make vertex transforms on regular meshes faster, but we definitely can (and should) improve the projection and near-clip logic, turning this:

mpXformedPos[i] = TransformCoords(&mpVertices[i].position,
                                  cumulativeMatrix);
float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3], 0.0000001f);
mpXformedPos[i] = _mm_mul_ps(mpXformedPos[i],
                             _mm_set1_ps(oneOverW));
mpXformedPos[i].m128_f32[3] = oneOverW;


into this:

__m128 xform = TransformCoords(&mpVertices[i].position,
                               cumulativeMatrix);
__m128 vertZ = _mm_shuffle_ps(xform, xform, 0xaa);
__m128 vertW = _mm_shuffle_ps(xform, xform, 0xff);
__m128 projected = _mm_div_ps(xform, vertW);

// set to all-0 if near-clipped
__m128 mNoNearClip = _mm_cmple_ps(vertZ, vertW);
mpXformedPos[i] = _mm_and_ps(projected, mNoNearClip);


Here, near-clipped vertices are set to the (invalid) x=y=z=w=0, and the binner code can just check for w==0 to test whether a vertex is near-clipped instead of having to use the original w tests (which again had a hardcoded near plane value).

This change doesn’t have any significant impact on the running time, but it does get rid of the hardcoded near plane location for good, so I thought it was worth mentioning.

### Again with the memory ordering

And if we profile again, we notice there’s at least one more surprise waiting for us in the binning code:

Machine clears? We’ve seen them before, way back in “Cores don’t like to share”. And yes, they’re again for memory ordering reasons. What did we do wrong this time? It turns out that the problematic code has been in there since the beginning, and ran just fine for quite a while, but ever since the scheduling optimizations we did in “The care and feeding of worker threads”, we now have binning jobs running tightly packed enough to run into memory ordering issues. So what’s the problem? Here’s the code:

// Add triangle to the tiles or bins that the bounding box covers
int row, col;
for(row = startY; row <= endY; row++)
{
    int offset1 = YOFFSET1_MT * row;
    int offset2 = YOFFSET2_MT * row;
    for(col = startX; col <= endX; col++)
    {
        int idx1 = offset1 + (XOFFSET1_MT * col) + taskId;
        int idx2 = offset2 + (XOFFSET2_MT * col) +
                   (taskId * MAX_TRIS_IN_BIN_MT) + pNumTrisInBin[idx1];
        pBin[idx2] = index + i;
        pBinModel[idx2] = modelId;
        pBinMesh[idx2] = meshId;
        pNumTrisInBin[idx1] += 1;
    }
}


The problem turns out to be the array pNumTrisInBin. Even though it’s accessed as 1D, it is effectively a 3D array like this:

uint16 pNumTrisInBin[TILE_ROWS][TILE_COLS][BINNER_TASKS]

The TILE_ROWS and TILE_COLS parts should be obvious. The BINNER_TASKS needs some explanation though: as you hopefully remember, we try to divide the work between binning tasks so that each of them gets roughly the same amount of triangles. Now, before we start binning triangles, we don’t know which tiles they will go into – after all, that’s what the binner is there to find out.

We could have just one output buffer (bin) per tile; but then, whenever two binner tasks simultaneously end up trying to add a triangle to the same tile, they will end up getting serialized because they try to increment the same counter. And even worse, it would mean that the actual order of triangles in the bins would be different between every run, depending on when exactly each thread was running; while not fatal for depth buffers (we just end up storing the max of all triangles rendered to a pixel anyway, which is ordering-invariant) it’s still a complete pain to debug.

Hence there is one bin per tile per binning worker. We already know that the binning workers get assigned the triangles in the order they occur in the models – with the 32 binning workers we use, the first binning task gets the first 1/32 of the triangles, the second binning task gets the second 1/32, and so forth. And each binner processes triangles in order. This means that the rasterizer tasks can still process triangles in the original order they occur in the mesh – first process all triangles inserted by binner 0, then all triangles inserted by binner 1, and so forth. Since they’re in distinct memory ranges, that’s easily done. And each bin has a separate triangle counter, so they don’t interfere, right? Nothing to see here, move along.

Well, except for the bit where coherency is managed on a cache line granularity. Now, as you can see from the above declaration, the triangle counts for all the binner tasks are stored in adjacent 16-bit words; 32 of them, to be precise, one per binner task. So what was the size of a cache line again? 64 bytes, you say?

Oops.

Yep, even though it’s 32 separate counters, for the purposes of the memory subsystem it’s just the same as if it was all a single counter per tile (well, it might be slightly better than that if the initial pointer isn’t 64-byte aligned, but you get the idea).

Luckily for us, the fix is dead easy: all we have to do is shuffle the order of the array indices around.

uint16 pNumTrisInBin[BINNER_TASKS][TILE_ROWS][TILE_COLS]

We also happen to have 32 tiles total – which means that now, each binner task gets its own cache line by itself (again, provided we align things correctly). So again, it’s a really easy fix. The question being – how much does it help?

Change: Change pNumTrisInBin array indexing

| Total cull time | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|-----------------|-------|-------|-------|-------|-------|-------|-------|
| Initial         | 3.148 | 3.208 | 3.243 | 3.305 | 4.321 | 3.271 | 0.100 |
| SSE Gather      | 2.934 | 3.078 | 3.110 | 3.156 | 3.992 | 3.133 | 0.103 |
| Change bin inds | 2.842 | 2.933 | 2.980 | 3.042 | 3.914 | 3.007 | 0.125 |

| Render depth    | min   | 25th  | med   | 75th  | max   | mean  | sdev  |
|-----------------|-------|-------|-------|-------|-------|-------|-------|
| Initial         | 2.206 | 2.220 | 2.228 | 2.242 | 2.364 | 2.234 | 0.022 |
| SSE Gather      | 2.099 | 2.119 | 2.137 | 2.156 | 2.242 | 2.141 | 0.028 |
| Change bin inds | 1.980 | 2.008 | 2.026 | 2.046 | 2.172 | 2.032 | 0.035 |

That’s right, a 0.1ms difference from changing the memory layout of a 1024-entry, 2048-byte array. You really need to be extremely careful with the layout of shared data when dealing with multiple cores at the same time.

### Once more, with branching

At this point, the binner is starting to look fairly good, but there’s one more thing that springs to eye:

Branch mispredictions. Now, the two rasterizers have legitimate reason to be mispredicting branches some of the time – they’re processing triangles with fairly unpredictable sizes, and the depth test rasterizer also has an early-out that’s hard to predict. But the binner has less of an excuse – sure, the triangles have very different dimensions measured in 2×2 pixel blocks, but the vast majority of our triangles fits inside one of our (generously sized!) 320×90 pixel tiles. So where are all these branches?

for(int i = 0; i < numLanes; i++)
{
    // Skip triangle if area is zero
    if(triArea.lane[i] <= 0) continue;
    if(vEndX.lane[i] < vStartX.lane[i] ||
       vEndY.lane[i] < vStartY.lane[i]) continue;

    float oneOverW[3];
    for(int j = 0; j < 3; j++)
        oneOverW[j] = xformedPos[j].W.lane[i];

    // Reject the triangle if any of its verts are outside the
    // near clip plane
    if(oneOverW[0] == 0.0f || oneOverW[1] == 0.0f ||
       oneOverW[2] == 0.0f) continue;

    // ...
}


Oh yeah, that. In particular, the first test (which checks for degenerate and back-facing triangles) will reject roughly half of all triangles and can be fairly random (as far as the CPU is concerned). Now, last time we had an issue with branch mispredicts, we simply removed the offending early-out. That’s a really bad idea in this case – any triangles we don’t reject here, we’re gonna waste even more work on later. No, these tests really should all be done here.

However, there’s no need for them to be done like this; right now, we have a whole slew of branches that are all over the map. Can’t we consolidate the branches somehow?

Of course we can. The basic idea is to do all the tests on 4 triangles at a time, while we’re still in SIMD form:

// Figure out which lanes are active
VecS32 mFront = cmpgt(triArea, VecS32::zero());
VecS32 mNonemptyX = cmpgt(vEndX, vStartX);
VecS32 mNonemptyY = cmpgt(vEndY, vStartY);
VecF32 mAccept1 = bits2float(mFront & mNonemptyX & mNonemptyY);

// All verts must be inside the near clip volume
VecF32 mW0 = cmpgt(xformedPos[0].W, VecF32::zero());
VecF32 mW1 = cmpgt(xformedPos[1].W, VecF32::zero());
VecF32 mW2 = cmpgt(xformedPos[2].W, VecF32::zero());

VecF32 mAccept = and(and(mAccept1, mW0), and(mW1, mW2));

// Turn the combined mask into an integer bit mask of accepted lanes
// (MOVMSKPS). Reconstructed line: assumes VecF32 exposes its raw
// __m128 as ".simd".
unsigned int triMask = _mm_movemask_ps(mAccept.simd) & laneMask;
// laneMask == (1 << numLanes) - 1; - initialized earlier


Note I change the “is not near-clipped test” from !(w == 0.0f) to w > 0.0f, on account of me knowing that all legal w’s happen to not just be non-zero, they’re positive (okay, what really happened is that I forgot to add a “cmpne” when I wrote VecF32 and didn’t feel like adding it here). Other than that, it’s fairly straightforward. We build a mask in vector registers, then turn it into an integer mask of active lanes using MOVMSKPS.

With this, we could turn all the original branches into a single test in the i loop:

if((triMask & (1 << i)) == 0)
    continue;


However, we can do slightly better than that: it turns out we can iterate pretty much directly over the set bits in triMask, which means we’re now down to one single branch in the outer loop – the loop counter itself. The modified loop looks like this:

while(triMask)
{
    int i = FindClearLSB(&triMask);
    // ...
}


So what does the magic FindClearLSB function do? It better not contain any branches! But lucky for us, it’s quite straightforward:

// Find index of least-significant set bit in mask
// and clear it (mask must be nonzero)
static int FindClearLSB(unsigned int *mask)
{
    unsigned long idx;
    _BitScanForward(&idx, *mask); // idx = index of lowest set bit
    *mask &= *mask - 1;           // clear lowest set bit, e.g. 0b10110100 & 0b10110011 = 0b10110000
    return idx;
}


all it takes is _BitScanForward (the VC++ intrinsic for the x86 BSF instruction) and a really old trick for clearing the least-significant set bit in a value. In other words, this compiles into about 3 integer instructions and is completely branch-free. Good enough. So does it help?

Change: Less branches in binner

| Total cull time (ms) | min | 25th | med | 75th | max | mean | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Initial | 3.148 | 3.208 | 3.243 | 3.305 | 4.321 | 3.271 | 0.100 |
| SSE Gather | 2.934 | 3.078 | 3.110 | 3.156 | 3.992 | 3.133 | 0.103 |
| Change bin inds | 2.842 | 2.933 | 2.980 | 3.042 | 3.914 | 3.007 | 0.125 |
| Less branches | 2.786 | 2.879 | 2.915 | 2.969 | 3.706 | 2.936 | 0.092 |

| Render depth (ms) | min | 25th | med | 75th | max | mean | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Initial | 2.206 | 2.220 | 2.228 | 2.242 | 2.364 | 2.234 | 0.022 |
| SSE Gather | 2.099 | 2.119 | 2.137 | 2.156 | 2.242 | 2.141 | 0.028 |
| Change bin inds | 1.980 | 2.008 | 2.026 | 2.046 | 2.172 | 2.032 | 0.035 |
| Less branches | 1.905 | 1.934 | 1.946 | 1.959 | 2.012 | 1.947 | 0.019 |

That’s another 0.07ms off the total, for about a 10% reduction in median total cull time for this post, and a 12.7% reduction in median rasterizer time. And for our customary victory lap, here’s the VTune results after this change:

The branch mispredictions aren’t gone, but we did make a notable dent. It’s more obvious if you compare the number of clock cycles with the previous image.

And with that, I’ll conclude this journey into both the triangle binner and the dark side of speculative execution. We’re also getting close to the end of this series – the next post will look again at the loading and rendering code we’ve been intentionally ignoring for most of this series :), and after that I’ll finish with a summary and wrap-up – including a list of things I didn’t cover, and why not.

This post is part of a series – go here for the index.

Welcome back! So far, we’ve spent quite some time “zoomed in” on various components of the Software Occlusion Culling demo, looking at various micro-architectural pitfalls and individual loops. In the last two posts, we “zoomed out” and focused on the big picture: what work runs when, and how to keep all cores busy. Now, it’s time to look at what lies in between: the plumbing, if you will. We’ll be looking at the dataflows between subsystems and modules and how to improve them.

This is one of my favorite topics in optimization, and it’s somewhat under-appreciated. There’s plenty of material on how to make loops run fast (although a lot of it is outdated or just wrong, so beware), and at this point there’s plenty of ways of getting concurrency up and running: there’s OpenMP, Intel’s TBB, Apple’s GCD, Windows Thread Pools and ConcRT for CPU, there’s OpenCL, CUDA and DirectCompute for jobs that are GPU-suitable, and so forth; you get the idea. The point being that it’s not hard to find a shrink-wrap solution that gets you up and running, and a bit of profiling (like we just did) is usually enough to tell you what needs to be done to make it all go smoothly.

But back to the topic at hand: improving dataflow. The problem is that, unlike the other two aspects I mentioned, there’s really no recipe to follow; it’s very much context-dependent. It basically boils down to looking at both sides of the interface between systems and functions and figuring out if there’s a better way to handle that interaction. We’ve seen a bit of that earlier when talking about frustum culling; rather than trying to define it in words, I’ll just do it by example, so let’s dive right in!

### A simple example

A good example is the member variable TransformedAABBoxSSE::mVisible, declared like this:

bool *mVisible;


A pointer to a bool. So where does that pointer come from?

inline void SetVisible(bool *visible){mVisible = visible;}


It turns out that the constructor initializes this pointer to NULL, and the only method that ever does anything with mVisible is RasterizeAndDepthTestAABBox, which executes *mVisible = true; if the bounding box is found to be visible. So how does this all get used?

mpVisible[i] = false;
mpTransformedAABBox[i].SetVisible(&mpVisible[i]);
if(...)
{
    mpTransformedAABBox[i].TransformAABBox();
    mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
}


That’s it. Those are the only call sites. There’s really no reason for mVisible to be state – semantically, it’s just a return value for RasterizeAndDepthTestAABBox, so that’s what it should be – always try to get rid of superfluous state. This doesn’t even have anything to do with optimization per se; explicit dataflow is easy for programmers to see and reason about, while implicit dataflow (through pointers, members and state) is hard to follow (both for humans and compilers!) and error-prone.

Anyway, making this return value explicit is really basic, so I’m not gonna walk through the details; you can always look at the corresponding commit. I won’t bother benchmarking this change either.
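The shape of the change is simple enough to sketch, though. Roughly this at the call site (a sketch based on the snippets above, not the literal commit):

if(...)
{
    mpTransformedAABBox[i].TransformAABBox();
    // Visibility is now an explicit return value instead of a write
    // through a stored bool pointer.
    mpVisible[i] = mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
}
else
{
    mpVisible[i] = false;
}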

### A more interesting case

In the depth test rasterizer, right after determining the bounding box, there’s this piece of code:

for(int vv = 0; vv < 3; vv++)
{
    // If W (holding 1/w in our case) is not between 0 and 1,
    // then vertex is behind near clip plane (1.0 in our case).
    // If W < 1 (for W>0), and 1/W < 0 (for W < 0).
    VecF32 nearClipMask0 = cmple(xformedPos[vv].W, VecF32(0.0f));
    VecF32 nearClipMask1 = cmpge(xformedPos[vv].W, VecF32(1.0f));
    // (combining step and early-out below are reconstructed; the exact
    // VecF32 helper names in the original code may differ)
    VecF32 nearClipMask = or(nearClipMask0, nearClipMask1);

    if(!is_all_zeros(nearClipMask))
    {
        // All four vertices are behind the near plane (we're
        // processing four triangles at a time w/ SSE)
        return true;
    }
}


Okay. The transform code sets things up so that the “w” component of the screen-space positions actually contains 1/w; the first part of this code then tries to figure out whether the source vertex was in front of the near plane (i.e. outside the view frustum or not). An ugly wrinkle here is that the near plane is hard-coded to be at 1. Doing this after dividing by w adds extra complications since the code needs to be careful about the signs. And the second comment is outright wrong – it in fact early-outs when any of the four active triangles have vertex number vv outside the near-clip plane, not when all of them do. In other words, if any of the 4 active triangles get near-clipped, the test rasterizer will just punt and return true (“visible”).

So here’s the thing: there’s really no reason to do this check after we’re done with triangle setup. Nor do we even have to gather the 3 triangle vertices to discover that one of them is in front of the near plane. A box has 8 vertices, and we’ll know whether any of them are in front of the near plane as soon as we’re done transforming them, before we even think about triangle setup! So let’s look at the function that transforms the vertices:

void TransformedAABBoxSSE::TransformAABBox()
{
    for(UINT i = 0; i < AABB_VERTICES; i++)
    {
        mpXformedPos[i] = TransformCoords(&mpBBVertexList[i],
            mCumulativeMatrix);
        float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3],
            0.0000001f);
        mpXformedPos[i] = mpXformedPos[i] * oneOverW;
        mpXformedPos[i].m128_f32[3] = oneOverW;
    }
}


As we can see, returning 1/w does in fact take a bit of extra work, so we’d like to avoid it, especially since that 1/w is really only referenced by the near-clip checking code. Also, the code seems to clamp w at some arbitrary small positive value – which means that the part of the near clip computation in the depth test rasterizer that worries about w<0 is actually unnecessary. This is the kind of thing I’m talking about – each piece of code in isolation seems reasonable, but once you look at both sides it becomes clear that the pieces don’t fit together all that well.

It turns out that after TransformCoords, we’re in “homogeneous viewport space”, i.e. we’re still in a homogeneous space, but unlike the homogeneous clip space you might be used to from vertex shaders, this one also has the viewport transform baked in. But our viewport transform leaves z alone (we fixed that in the previous post!), so we still have a D3D-style clip volume for z:

$0 \le z \le w$

Since we’re using a reversed clip volume, the z≤w constraint is the near-plane one. Note that this test doesn’t need any special cases for negative signs and also doesn’t have a hardcoded near-plane location any more: it just automatically uses whatever the projection matrix says, which is the right thing to do!

Even better, if we test for near-clip anyway, there’s no need to clamp w at all. We know that anything with w≤0 is outside the near plane, and if a vertex is outside the near plane we’re not gonna rasterize the box anyway. Now we might still end up dividing by 0, but since we’re dealing with floats, this is a well-defined operation (it might return infinities or NaNs, but that’s fine).

And on the subject of not rasterizing the box: as I said earlier, as soon as one vertex is outside the near-plane, we know we’re going to return true from the depth test rasterizer, so there’s no point even starting the operation. To facilitate this, we just make TransformAABBox return whether the box should be rasterized or not. Putting it all together:

bool TransformedAABBoxSSE::TransformAABBox()
{
    __m128 zAllIn = _mm_castsi128_ps(_mm_set1_epi32(~0));

    for(UINT i = 0; i < AABB_VERTICES; i++)
    {
        __m128 vert = TransformCoords(&mpBBVertexList[i],
            mCumulativeMatrix);

        // We have inverted z; z is inside of near plane iff z <= w.
        __m128 vertZ = _mm_shuffle_ps(vert, vert, 0xaa); //vert.zzzz
        __m128 vertW = _mm_shuffle_ps(vert, vert, 0xff); //vert.wwww
        __m128 zIn = _mm_cmple_ps(vertZ, vertW);
        zAllIn = _mm_and_ps(zAllIn, zIn);

        // project
        mpXformedPos[i] = _mm_div_ps(vert, vertW);
    }

    // return true if and only if all verts inside near plane
    return _mm_movemask_ps(zAllIn) == 0xf;
}


In case you’re wondering why this code uses raw SSE intrinsics and not VecF32, it’s because I’m purposefully trying to keep anything depending on the SIMD width out of VecF32, which makes it a lot easier to go to 8-wide AVX should we want to at some point. But this code really uses 4-vectors of (x,y,z,w) and needs to do shuffles, so it doesn’t fit in that model and I want to keep it separate. But the actual logic is just what I described.

And once we have this return value from TransformAABBox, we get to remove the near-clip test from the depth test rasterizer, and we get to move our early-out for near-clipped boxes all the way to the call site:

if(mpTransformedAABBox[i].TransformAABBox())
    mpVisible[i] = mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
else
    mpVisible[i] = true;


So, the oneOverW hack, the clamping hack and the hard-coded near plane are gone. That’s already a victory in terms of code quality, but did it improve the run time?

Change: Transform/early-out fixes

| Depth test (ms) | min | 25th | med | 75th | max | mean | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Start | 1.109 | 1.152 | 1.166 | 1.182 | 1.240 | 1.167 | 0.022 |
| Transform fixes | 1.054 | 1.092 | 1.102 | 1.112 | 1.146 | 1.102 | 0.016 |

Another 0.06ms off our median depth test time, which may not sound big but is over 5% of what’s left of it at this point.

### Getting warmer

The bounding box rasterizer has one more method that’s called per-box though, and this is one that really deserves some special attention. Meet IsTooSmall:

bool TransformedAABBoxSSE::IsTooSmall(__m128 *pViewMatrix,
    __m128 *pProjMatrix, CPUTCamera *pCamera)
{
    float radius = mBBHalf.lengthSq(); // Use length-squared to
                                       // avoid sqrt(). Relative comparisons hold.

    float fov = pCamera->GetFov();
    float tanOfHalfFov = tanf(fov * 0.5f);

    MatrixMultiply(mWorldMatrix, pViewMatrix, mCumulativeMatrix);
    MatrixMultiply(mCumulativeMatrix, pProjMatrix,
        mCumulativeMatrix);
    MatrixMultiply(mCumulativeMatrix, mViewPortMatrix,
        mCumulativeMatrix);

    __m128 center = _mm_set_ps(1.0f, mBBCenter.z, mBBCenter.y,
        mBBCenter.x);
    __m128 mBBCenterOSxForm = TransformCoords(&center,
        mCumulativeMatrix);
    float w = mBBCenterOSxForm.m128_f32[3];
    if( w > 1.0f )
    {
        float radiusDivW = radius / w;
        float r2DivW2DivTanFov = radiusDivW / tanOfHalfFov;

        return r2DivW2DivTanFov <
            (mOccludeeSizeThreshold * mOccludeeSizeThreshold);
    }

    return false;
}


Note that MatrixMultiply(A, B, C) performs C = A * B; the rest should be easy enough to figure out from the code. Now there are really several problems with this function, so let’s go straight to a list:

• radius (which is really radius squared) only depends on mBBHalf, which is fixed at initialization time. There’s no need to recompute it every time.
• Similarly, fov and tanOfHalfFov only depend on the camera, and absolutely do not need to be recomputed once for every box. This is what gave us the _tan_pentium4 cameo all the way back in “Frustum culling: turning the crank”, by the way.
• The view matrix, projection matrix and viewport matrix are also all camera or global constants. Again, no need to multiply these together for every box – the only matrix that is different between boxes is the very first one, the world matrix, and since matrix multiplication is associative, we can just concatenate the other three once.
• There’s also no need for mOccludeeSizeThreshold to be squared every time – we can do that once.
• Nor is there a need for it to be stored per box, since it’s a global constant owned by the depth test rasterizer.
• (radius / w) / tanOfHalfFov would be better computed as radius / (w * tanOfHalfFov).
• But more importantly, since all we’re doing is a compare and both w and tanOfHalfFov are positive, we can just multiply through by them and get rid of the divide altogether (spelled out right below).
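Using the names from the code above (with threshold standing in for mOccludeeSizeThreshold, and radius being the squared radius as before), the rewritten comparison is:

$\mathrm{radius} / (w \cdot \mathrm{tanOfHalfFov}) < \mathrm{threshold}^2 \Leftrightarrow \mathrm{radius} < w \cdot (\mathrm{tanOfHalfFov} \cdot \mathrm{threshold}^2)$

The parenthesized factor on the right is a pure camera constant, which is exactly what gets precomputed once in the setup code below.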

All these things are common problems that I must have fixed a hundred times, but I have to admit that it’s pretty rare to see so many of them in a single page of code. Anyway, rather than fixing these one by one, let’s just cut to the chase: instead of all the redundant computations, we just move everything that only depends on the camera (or is global) into a single struct that holds our setup, which I dubbed BoxTestSetup. Here’s the code:

struct BoxTestSetup
{
    __m128 mViewProjViewport[4];
    float radiusThreshold; // occludeeSizeThreshold^2 * tan(fov/2), see Init

    void Init(const __m128 viewMatrix[4],
        const __m128 projMatrix[4], CPUTCamera *pCamera,
        float occludeeSizeThreshold);
};

void BoxTestSetup::Init(const __m128 viewMatrix[4],
    const __m128 projMatrix[4], CPUTCamera *pCamera,
    float occludeeSizeThreshold)
{
    // viewportMatrix is a global float4x4; we need a __m128[4]
    __m128 viewPortMatrix[4];
    // (loading the four rows of viewportMatrix into viewPortMatrix is
    // omitted here; the exact code depends on the float4x4 layout)

    MatrixMultiply(viewMatrix, projMatrix, mViewProjViewport);
    MatrixMultiply(mViewProjViewport, viewPortMatrix,
        mViewProjViewport);

    float fov = pCamera->GetFov();
    float tanOfHalfFov = tanf(fov * 0.5f);
    radiusThreshold = occludeeSizeThreshold * occludeeSizeThreshold
        * tanOfHalfFov;
}


This is initialized once we start culling and simply kept on the stack. Then we just pass it to IsTooSmall, which after our surgery looks like this:

bool TransformedAABBoxSSE::IsTooSmall(const BoxTestSetup &setup)
{
    MatrixMultiply(mWorldMatrix, setup.mViewProjViewport,
        mCumulativeMatrix);

    __m128 center = _mm_set_ps(1.0f, mBBCenter.z, mBBCenter.y,
        mBBCenter.x);
    __m128 mBBCenterOSxForm = TransformCoords(&center,
        mCumulativeMatrix);
    float w = mBBCenterOSxForm.m128_f32[3];
    if( w > 1.0f )
    {
        // Divide-free version of the original test (reconstructed; the
        // precomputed squared radius may be stored under another name):
        //   radius < w * (tanOfHalfFov * threshold^2)
        return mRadiusSq < w * setup.radiusThreshold;
    }

    return false;
}
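To illustrate the “initialize once, keep it on the stack” pattern, here’s a hypothetical call site; mNumModels and occludeeSizeThreshold are placeholder names, not necessarily what the demo uses:

BoxTestSetup setup;
setup.Init(viewMatrix, projMatrix, pCamera, occludeeSizeThreshold);

for(UINT i = 0; i < mNumModels; i++)
{
    // All camera-dependent setup now happens once, outside this loop.
    if(mpTransformedAABBox[i].IsTooSmall(setup))
        continue; // skip this box's depth test
    // ... transform and rasterize the box as before ...
}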


Wow, that method sure seems to have lost a few pounds. Let’s run the numbers:

Change: IsTooSmall cleanup

| Depth test (ms) | min | 25th | med | 75th | max | mean | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Start | 1.109 | 1.152 | 1.166 | 1.182 | 1.240 | 1.167 | 0.022 |
| Transform fixes | 1.054 | 1.092 | 1.102 | 1.112 | 1.146 | 1.102 | 0.016 |
| IsTooSmall cleanup | 0.860 | 0.893 | 0.908 | 0.917 | 0.954 | 0.905 | 0.018 |

Another 0.2ms off the median run time, bringing our total reduction for this post to about 22%. So are we done? Not yet!

### The state police

Currently, each TransformedAABBoxSSE still keeps its own copy of the cumulative transform matrix and a copy of its transformed vertices. But it’s not necessary for these to be persistent – we compute them once, use them to rasterize the box, then don’t look at them again until the next frame. So, like mVisible earlier, there’s really no need to keep them around as state; instead, it’s better to just store them on the stack. Fewer pointers per TransformedAABBoxSSE, fewer cache misses, and – perhaps most important of all – it makes the bounding box objects themselves stateless. Granted, that’s the case only because our world is perfectly static and nothing is animated at runtime, but still, stateless is good! Stateless is easier to read, easier to debug, and easier to test.

Again, this is another change that is purely mechanical – just pass in a pointer to cumulativeMatrix and xformedPos to the functions that want them. So this time, I’m just going to refer you directly to the two commits that implement this idea, and skip straight to the results:

Change: Reduce amount of state

| Depth test (ms) | min | 25th | med | 75th | max | mean | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Start | 1.109 | 1.152 | 1.166 | 1.182 | 1.240 | 1.167 | 0.022 |
| Transform fixes | 1.054 | 1.092 | 1.102 | 1.112 | 1.146 | 1.102 | 0.016 |
| IsTooSmall cleanup | 0.860 | 0.893 | 0.908 | 0.917 | 0.954 | 0.905 | 0.018 |
| Reduce state | 0.834 | 0.862 | 0.873 | 0.886 | 0.938 | 0.875 | 0.017 |

Only about 0.03ms this time, but we also save 192 bytes (plus allocator overhead) worth of memory per box, which is a nice bonus. And anyway, we’re not done yet, because I have one more!

### It’s more fun to compute

There’s one more piece of unnecessary data we currently store per bounding box: the vertex list, initialized in CreateAABBVertexIndexList:

float3 min = mBBCenter - bbHalf;
float3 max = mBBCenter + bbHalf;

//Top 4 vertices in BB
mpBBVertexList[0] = _mm_set_ps(1.0f, max.z, max.y, max.x);
mpBBVertexList[1] = _mm_set_ps(1.0f, max.z, max.y, min.x);
mpBBVertexList[2] = _mm_set_ps(1.0f, min.z, max.y, min.x);
mpBBVertexList[3] = _mm_set_ps(1.0f, min.z, max.y, max.x);
// Bottom 4 vertices in BB
mpBBVertexList[4] = _mm_set_ps(1.0f, min.z, min.y, max.x);
mpBBVertexList[5] = _mm_set_ps(1.0f, max.z, min.y, max.x);
mpBBVertexList[6] = _mm_set_ps(1.0f, max.z, min.y, min.x);
mpBBVertexList[7] = _mm_set_ps(1.0f, min.z, min.y, min.x);


This is, in effect, just treating the bounding box as a general mesh. But that’s extremely wasteful – we already store center and half-extent, the min/max corner positions are trivial to reconstruct from that information, and all the other vertices can be constructed by splicing min/max together componentwise using a set of masks that is the same for all bounding boxes. So these 8*16 = 128 bytes of vertex data really don’t pay their way.

But more importantly, note that we only ever use two distinct values for x, y and z each. Now TransformAABBox, which we already saw above, uses TransformCoords to compute the matrix-vector product v*M with the cumulative transform matrix, using the expression

v.x * M.row[0] + v.y * M.row[1] + v.z * M.row[2] + M.row[3] (v.w is assumed to be 1)

and because we know that v.x is either min.x or max.x, we can multiply both by M.row[0] once and store the result. Then the 8 individual vertices can skip the multiplies altogether. Putting it all together leads to the following new code for TransformAABBox:

// 0 = use min corner, 1 = use max corner
static const int sBBxInd[AABB_VERTICES] = { 1, 0, 0, 1, 1, 1, 0, 0 };
static const int sBByInd[AABB_VERTICES] = { 1, 1, 1, 1, 0, 0, 0, 0 };
static const int sBBzInd[AABB_VERTICES] = { 1, 1, 0, 0, 0, 1, 1, 0 };

bool TransformedAABBoxSSE::TransformAABBox(__m128 xformedPos[],
    const __m128 cumulativeMatrix[4])
{
    // w ends up being garbage, but it doesn't matter - we ignore
    // it anyway.
    __m128 vCenter = _mm_loadu_ps(&mBBCenter.x);
    __m128 vHalf   = _mm_loadu_ps(&mBBHalf.x);

    __m128 vMin    = _mm_sub_ps(vCenter, vHalf);
    __m128 vMax    = _mm_add_ps(vCenter, vHalf);

    // transforms
    __m128 xRow[2], yRow[2], zRow[2];
    xRow[0] = _mm_shuffle_ps(vMin, vMin, 0x00) * cumulativeMatrix[0];
    xRow[1] = _mm_shuffle_ps(vMax, vMax, 0x00) * cumulativeMatrix[0];
    yRow[0] = _mm_shuffle_ps(vMin, vMin, 0x55) * cumulativeMatrix[1];
    yRow[1] = _mm_shuffle_ps(vMax, vMax, 0x55) * cumulativeMatrix[1];
    zRow[0] = _mm_shuffle_ps(vMin, vMin, 0xaa) * cumulativeMatrix[2];
    zRow[1] = _mm_shuffle_ps(vMax, vMax, 0xaa) * cumulativeMatrix[2];

    __m128 zAllIn = _mm_castsi128_ps(_mm_set1_epi32(~0));

    for(UINT i = 0; i < AABB_VERTICES; i++)
    {
        // Transform the vertex
        __m128 vert = cumulativeMatrix[3];
        vert += xRow[sBBxInd[i]];
        vert += yRow[sBByInd[i]];
        vert += zRow[sBBzInd[i]];

        // We have inverted z; z is inside of near plane iff z <= w.
        __m128 vertZ = _mm_shuffle_ps(vert, vert, 0xaa); //vert.zzzz
        __m128 vertW = _mm_shuffle_ps(vert, vert, 0xff); //vert.wwww
        __m128 zIn = _mm_cmple_ps(vertZ, vertW);
        zAllIn = _mm_and_ps(zAllIn, zIn);

        // project
        xformedPos[i] = _mm_div_ps(vert, vertW);
    }

    // return true if and only if none of the verts are z-clipped
    return _mm_movemask_ps(zAllIn) == 0xf;
}


Admittedly, quite a bit longer than the original one, but that’s because we front-load a lot of the computation; most of the per-vertex work done in TransformCoords is gone. And here’s our reward:

Change: Get rid of per-box vertex list

| Depth test (ms) | min | 25th | med | 75th | max | mean | sdev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Start | 1.109 | 1.152 | 1.166 | 1.182 | 1.240 | 1.167 | 0.022 |
| Transform fixes | 1.054 | 1.092 | 1.102 | 1.112 | 1.146 | 1.102 | 0.016 |
| IsTooSmall cleanup | 0.860 | 0.893 | 0.908 | 0.917 | 0.954 | 0.905 | 0.018 |
| Reduce state | 0.834 | 0.862 | 0.873 | 0.886 | 0.938 | 0.875 | 0.017 |
| Remove vert list | 0.801 | 0.823 | 0.830 | 0.839 | 0.867 | 0.831 | 0.012 |

This brings our total for this post to nearly a 29% reduction in median depth test time (1.166ms down to 0.830ms), plus about 320 bytes memory reduction per TransformedAABBoxSSE – which, since we have about 27000 of them, works out to well over 8 megabytes. Such are the rewards for widening the scope beyond optimizing functions by themselves.

And as usual, the code for this time (plus some changes I haven’t discussed yet) is up on Github. Until next time!