Mopping up

This post is part of a series – go here for the index.

Welcome back! This post is going to be slightly different from the others. So far, I’ve attempted to group the material thematically, so that each post has a coherent theme (to a first-order approximation, anyway). Well, this one doesn’t – this is a collection of everything that didn’t fit anywhere else. But don’t worry, there’s still some good stuff in here! That said, one warning: there’s a bunch of poking around in the framework code this time, and it didn’t come with docs, so I’m honestly not quite sure how some of the internals are supposed to work. So the code changes referenced this time are definitely on the hacky side of things.

The elephant in the room

Featured quite near the top of all the profiles we’ve seen so far are two functions I haven’t talked about before:

Rendering hot spots

In case you’re wondering, VIDMM_Global::ReferenceDmaBuffer is what used to show up as just “[dxgmms1.sys]” in the previous posts; I’ve set up VTune to use the symbol server to get debug symbols for this driver. Now, I haven’t talked about this code before because it’s part of the GPU rendering path, not the software rasterizer, but let’s broaden our scope one final time.

What you can see here is the video memory manager going over the list of resources (vertex/index buffers, constant buffers, textures, and so forth) referenced by a DMA buffer (which is what WDDM calls GPU command buffers in the native format) and completely blowing out the cache; each resource has some amount of associated metadata that the memory manager needs to look at (and possibly update), and it turns out there are many of them. The cache is not amused.

So, what can we do to use fewer resources? There are lots of options, but one thing I had noticed while measuring loading time is that there’s one dynamic constant buffer per model:

// Create the model constant buffer.
HRESULT hr;
D3D11_BUFFER_DESC bd = {0};
bd.ByteWidth = sizeof(CPUTModelConstantBuffer);
bd.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
bd.Usage = D3D11_USAGE_DYNAMIC;
bd.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
hr = (CPUT_DX11::GetDevice())->CreateBuffer( &bd, NULL,
    &mpModelConstantBuffer );
ASSERT( !FAILED( hr ), _L("Error creating constant buffer.") );

Note that they’re all the same size, and it turns out that all of them get updated (using a Map with DISCARD) immediately before they get used for rendering. And because there’s about 27000 models in this example, we’re talking about a lot of constant buffers here.

What if we instead just created one dynamic model constant buffer and shared it between all the models? It’s a fairly simple change to make if you’re willing to do it in a hacky fashion (as I said, not very clean code this time). For this test, I took the liberty of adding some timing around the actual D3D rendering code as well, so we can compare. It’s probably gonna make a difference, but how much can it be, really?

Change: Single shared dynamic model constant buffer

Render scene min 25th med 75th max mean sdev
Original 3.392 3.501 3.551 3.618 4.155 3.586 0.137
One dynamic CB 2.474 2.562 2.600 2.644 3.043 2.609 0.068

It turns out that reducing the number of distinct constant buffers referenced per frame by several thousand is a pretty big deal. Drivers work hard to make constant buffer DISCARD really, really fast, and they make sure that the underlying allocations get handled quickly. And discarding a single constant buffer a thousand times in a frame works out to be a lot faster than discarding a thousand constant buffers once each.

Lesson learned: for “throwaway” constant buffers, it’s a good idea to design your renderer so it only allocates one underlying D3D constant buffer per size class. More are not necessary and can (evidently) induce a substantial amount of overhead. D3D11.1 adds a few features that allow you to further reduce that count down to a single constant buffer that’s used the same way that dynamic vertex/index buffers are; as you can see, there’s a reason. Here’s the profile after this single fix:

Render after dynamic CB fix

Still a lot of time spent in the driver and the video memory manager, but if you compare the raw cycle counts with the previous image, you can see that this change really made quite a dent.
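The per-size-class idea can be sketched as a tiny cache keyed on the (rounded-up) buffer size. This is a minimal sketch in plain C++, not code from the demo; the `BufferCache` name and the power-of-two rounding are mine, and in a real renderer `DynamicBuffer` would be an `ID3D11Buffer*` created with `D3D11_USAGE_DYNAMIC`:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a dynamic D3D buffer handle.
struct DynamicBuffer { uint32_t byteWidth; };

// One shared dynamic buffer per size class, instead of one per model.
class BufferCache {
public:
    // Round the requested size up to a power of two so that many slightly
    // different sizes map to the same underlying buffer.
    static uint32_t SizeClass(uint32_t bytes) {
        uint32_t cls = 16;
        while (cls < bytes) cls *= 2;
        return cls;
    }

    DynamicBuffer* Get(uint32_t bytes) {
        uint32_t cls = SizeClass(bytes);
        auto it = cache.find(cls);
        if (it == cache.end())
            it = cache.emplace(cls, DynamicBuffer{cls}).first;
        return &it->second; // std::map gives stable addresses
    }

    std::size_t NumBuffers() const { return cache.size(); }

private:
    std::map<uint32_t, DynamicBuffer> cache;
};
```

With 27000 models all asking for `sizeof(CPUTModelConstantBuffer)`, every request lands in the same size class, so the memory manager only ever sees a single dynamic allocation being discarded and renamed instead of thousands of distinct ones.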

Loading time

This was (for the most part) something I worked on just to make my life easier – as you can imagine, while writing this series, I’ve recorded lots of profiling and test runs, and the loading time is a fixed cost I pay every time. I won’t go in depth here, but I still want to give a brief summary of the changes I made and why. If you want to follow along, the changes in the source code start at the “Track loading time” commit.

Initial: 9.29s

First, I simply added a timer and code to print the loading time to the debug output window.

Load materials once, not once per model: 4.54s

One thing I noticed way back in January when I did my initial testing was that most materials seem to get loaded multiple times; there seems to be logic in the asset library code to avoid loading materials multiple times, but it didn’t appear to work for me. So I modified the code to actually load each material only once and then create copies when requested. As you can see, this change by itself roughly cut loading times in half.

FindAsset optimizations: 4.32s

FindAsset is the function used in the asset manager to actually look up resources by name. With two simple changes to avoid unnecessary path name resolution and string compares, the loading time loses another 200ms.

Better config file loading: 2.54s

I mentioned this in “A string processing rant”, but hadn’t actually merged the changes into the blog branch until now. Well, here you go: with these three commits that together rewrite a substantial portion of the config file reading, we lose almost another 2 seconds. Yes, that was 2 whole seconds’ worth of unnecessary allocations and horribly inefficient string handling. I wrote that rant for a reason.

Improve shader input layout cache: 2.03s

D3D11 wants shader input layouts to be created with a pointer to the bytecode of the shader they’re going to be used with, so it can handle the binding of vertex formats to shader inputs. The “shader input layout cache” is just an internal cache that produces such input layouts for all unique combinations of vertex formats and shaders we use. The original implementation of this cache was fairly inefficient, but the code already contained a “TODO” comment with instructions on how to fix it. In this commit, I implemented that fix.

Reduce temporary strings: 1.88s

There were still a bunch of unnecessary string temporaries being created, which I found simply by looking at the call stack profiles of free calls during the loading phase (yet another useful application for profilers)! Two commits later, this problem was resolved too.

Actually share materials: 1.46s

Finally, this commit goes one step further than just loading the materials once, it also actually shares the same material instance between all its users (the previous version created copies). This is not necessarily a safe change to make. I have no idea what invariants the asset manager tries to enforce, if any. Certainly, this would cause problems if someone were to start modifying materials after loading – you’d need to introduce copy-on-write or something similar. But in our case (i.e. the Software Occlusion Culling demo), the materials do not get modified after loading, and sharing them is completely safe.

Not only does this reduce loading time by another 400ms, it also makes rendering a lot faster, because suddenly there are a lot fewer cache misses when setting up shaders and render states for the individual models:

Change: Share materials.

Render scene min 25th med 75th max mean sdev
Original 3.392 3.501 3.551 3.618 4.155 3.586 0.137
One dynamic CB 2.474 2.562 2.600 2.644 3.043 2.609 0.068
Share materials 1.870 1.922 1.938 1.964 2.331 1.954 0.057

Again, this is somewhat extreme because there are so many different models around, but it illustrates the point: you really want to make sure there’s no unnecessary duplication of data used during rendering; you’re going to be missing the cache enough during regular rendering as it is.

And at that point, I decided that I could live with 1.5 seconds of loading time, so I didn’t pursue the matter any further. :)

The final rendering tweak

There’s one more function with a high number of cache misses in the profiles I’ve been running, even though it’s never been at the top. That function is AABBoxRasterizerSSE::RenderVisible, which uses the (post-occlusion-test) visibility information to render all visible models. Here’s the code:

void AABBoxRasterizerSSE::RenderVisible(CPUTAssetSet **pAssetSet,
    CPUTRenderParametersDX &renderParams,
    UINT numAssetSets)
{
    int count = 0;

    for(UINT assetId = 0, modelId = 0; assetId < numAssetSets; assetId++)
    {
        for(UINT nodeId = 0; nodeId < GetAssetCount(); nodeId++)
        {
            CPUTRenderNode* pRenderNode = NULL;
            CPUTResult result = pAssetSet[assetId]->GetAssetByIndex(nodeId, &pRenderNode);
            ASSERT((CPUT_SUCCESS == result), _L ("Failed getting asset by index")); 
            if(pRenderNode->IsModel())
            {
                if(mpVisible[modelId])
                {
                    CPUTModelDX11* model = (CPUTModelDX11*)pRenderNode;
                    model->Render(renderParams);
                    count++;
                }
                modelId++;			
            }
            pRenderNode->Release();
        }
    }
    mNumCulled =  mNumModels - count;
}

This code first enumerates all RenderNodes (a base class) in the active asset libraries, asks each of them “are you a model?”, and if so renders it. This is a construct that I’ve seen several times before – but from a performance standpoint, it’s a terrible idea. We walk over the whole scene database and do a virtual function call (which means we have to load, at the very least, the cache line containing the vtable pointer) just to check if the current item is a model, and only then check whether it is culled – in which case we simply ignore it.

That is a stupid game and we should stop playing it.

Luckily, it’s easy to fix: at load time, we traverse the scene database once to make a list of all the models. Note that the code already does such a pass to initialize the bounding boxes etc. for the occlusion culling pass; all we have to do is set up an extra array that maps modelIds to the corresponding models. Then the actual rendering code turns into:

void AABBoxRasterizerSSE::RenderVisible(CPUTAssetSet **pAssetSet,
    CPUTRenderParametersDX &renderParams,
    UINT numAssetSets)
{
    int count = 0;

    for(UINT modelId = 0; modelId < mNumModels; modelId++)
    {
        if(mpVisible[modelId])
        {
            mpModels[modelId]->Render(renderParams);
            count++;
        }
    }

    mNumCulled =  mNumModels - count;
}

That already looks much better. But how much does it help?

Change: Cull before accessing models

Render scene min 25th med 75th max mean sdev
Original 3.392 3.501 3.551 3.618 4.155 3.586 0.137
One dynamic CB 2.474 2.562 2.600 2.644 3.043 2.609 0.068
Share materials 1.870 1.922 1.938 1.964 2.331 1.954 0.057
Fix RenderVisible 1.321 1.358 1.371 1.406 1.731 1.388 0.047

I rest my case.

And I figure that this nice 2.59x cumulative speedup on the rendering code is a good stopping point for the coding part of this series – quit while you’re ahead and all that. There are a few more minor fixes (both for actual bugs and for speed problems) on GitHub, but they’re all fairly small changes, so I won’t go into the details.

This series is not yet over, though; we’ve covered a lot of ground, and every case study should spend some time reflecting on the lessons learned. I also want to explain why I covered what I did, what I left out, and a few notes on the way I tend to approach performance problems. So all that will be in the next and final post of this series. Until then!

Speculatively speaking

This post is part of a series – go here for the index.

Welcome back! Today, it’s time to take a closer look at the triangle binning code, which we’ve only seen mentioned briefly so far, and we’re going to see a few more pitfalls that all relate to speculative execution.

Loads blocked by what?

There’s one more micro-architectural issue this program runs into that I haven’t talked about before. Here’s the obligatory profiler screenshot:

Store-to-load forwarding issues

The full column name reads “Loads Blocked by Store Forwarding”. So, what’s going on there? For this one, I’m gonna have to explain a bit first.

So let’s talk about stores in an out-of-order processor. In this series, we already saw how conditional branches and memory sharing between cores get handled on modern x86 cores: namely, with speculative execution. For branches, the core tries to predict which direction they will go, and automatically starts fetching and executing the corresponding instructions. Similarly, memory accesses are assumed to not conflict with what other cores are doing at the same time, and just march on ahead. But if it later turns out that the branch actually went in the other direction, that there was a memory conflict, or that some exception / hardware interrupt occurred, all the instructions that were executed in the meantime are invalid and their results must be discarded – the speculation didn’t pan out. The implicit assumption is that our speculation (branches behave as predicted, memory accesses generally don’t conflict and CPU exceptions/interrupts are rare) is right most of the time, so it generally pays off to forge ahead, and the savings are worth the occasional extra work of undoing a bunch of instructions when we turned out to be wrong.

But wait, how does the CPU “undo” instructions? Well, conceptually it takes a “snapshot” of the current machine state every time it’s about to start an operation that it might later have to undo. If that instruction makes it all the way through the pipeline without incident, it just gets retired normally, the snapshot gets thrown away, and we know that our speculation was successful. But if there is a problem somewhere, the machine can just throw away all the work it did in the meantime, rewind back to the snapshot and retry.

Of course, CPUs don’t actually take full snapshots. Instead, they make use of the out-of-order machinery to do things much more efficiently: out-of-order CPUs have more registers internally than are exposed in the ISA (Instruction Set Architecture), and use a technique called “register renaming” to map the small set of architectural registers onto the larger set of physical registers. The “snapshotting” then doesn’t actually need to save register contents; it just needs to keep track of what the current register mapping at the snapshot point was, and make sure that the associated physical registers from the “before” snapshot don’t get reused until the instruction is safely retired.

This takes care of register modifications. We already know what happens with loads from memory – we just run them, and if it later turns out that the memory contents changed between the load instruction’s execution and its retirement, we need to re-run that block of code. Stores are the tricky part: we can’t easily do “memory renaming” since memory (unlike registers) is a shared resource, and also unlike registers rarely gets written in whole “accounting units” (cache lines) at a time.

The solution is store buffers: when a store instruction is executed, we do all the necessary groundwork – address translation, access rights checking and so forth – but don’t actually write to memory just yet; rather, the target address and the associated data bits are written into a store buffer, where they sit around for a while; the store buffers form a log of all pending writes to memory. Only once the core is sure that the store instruction will actually be executed (branch results etc. are known and no exceptions were triggered) do these values actually get written back to the cache.

Buffering stores this way has numerous advantages (beyond just making speculation easier), and is a technique not just used in out-of-order architectures; there’s just one problem though: what happens if I run code like this?

  mov  [x], eax
  mov  ebx, [x]

Assuming no other threads writing to the same memory at the same time, you would certainly hope that at the end of this instruction sequence, eax and ebx contain the same value. But remember that the first instruction (the store) just writes to a store buffer, whereas the second instruction (the load) normally just references the cache. At the very least, we have to detect that this is happening – i.e., that we are trying to load from an address that currently has a write logged in a store buffer – but there’s numerous things we could do with that information.

One option is to simply stall the core and wait until the store is done before the load can start. This is fairly cheap to implement in hardware, but it does slow down the software running on it. This option was chosen by the in-order cores used in the current generation of game consoles, and the result is the dreaded “Load Hit Store” stall. It’s a way to solve the problem, but let’s just say it won’t win you many friends.

So x86 cores normally use a technique called “store to load forwarding” or just “store forwarding”, where loads can actually read data directly from the store buffers, at least under certain conditions. This is much more expensive in hardware – it adds a lot of wires between the load unit and the store buffers – but it is far less finicky to use on the software side.

So what are the conditions? The details depend on the core in question. Generally, if you store a value to a naturally aligned location in memory and do a load of the same size as the store, you can expect store forwarding to work. If you do trickier stuff – span multiple cache lines, or use mismatched sizes between the loads and stores, for example – it really does depend. Some of the more recent Intel cores can also forward larger stores into smaller loads (e.g. a DWord read from a location written with MOVDQA) under certain circumstances. The dual case (a large load overlapping with smaller stores) is substantially harder, though, because it can involve multiple store buffers at the same time, and I currently know of no processor that implements it. And whenever you hit a case where the processor can’t perform store forwarding, you get the “Loads Blocked by Store Forwarding” stall above (effectively, x86’s version of a Load-Hit-Store).
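To make the friendly and unfriendly shapes concrete, here’s a small sketch (my example, not code from this program; both functions compute correct results either way – the difference only shows up in a profiler, as forwarding stalls on the second one):

```cpp
#include <cstdint>
#include <cstring>

// The friendly case: same address, same size, naturally aligned.
// A store like this can typically be forwarded straight to the load.
uint32_t SameSizeRoundTrip(uint32_t v) {
    uint32_t x = v;   // store
    return x;         // load of the same size - forwardable
}

// The unfriendly case (the same shape as the Gather code below): four
// narrow stores followed by one wide load. The wide load overlaps several
// store buffers at once, which the core can't forward, so the load has to
// wait until the stores have made it out of the store buffers.
struct Wide { float f[4]; };

Wide NarrowStoresWideLoad(float a, float b, float c, float d) {
    alignas(16) float tmp[4];
    tmp[0] = a; tmp[1] = b; tmp[2] = c; tmp[3] = d; // 4x 4-byte stores
    Wide w;
    std::memcpy(&w, tmp, sizeof(w)); // one 16-byte load over all four
    return w;
}
```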

Revenge of the cycle-eaters

Which brings us back to the example at hand: what’s going on in those functions, BinTransformedTrianglesMT in particular? Some investigation of the compiled code shows that the first sign of blocked loads is near these reads:

Gather(xformedPos, index, numLanes);
		
vFxPt4 xFormedFxPtPos[3];
for(int i = 0; i < 3; i++)
{
    xFormedFxPtPos[i].X = ftoi_round(xformedPos[i].X);
    xFormedFxPtPos[i].Y = ftoi_round(xformedPos[i].Y);
    xFormedFxPtPos[i].Z = ftoi_round(xformedPos[i].Z);
    xFormedFxPtPos[i].W = ftoi_round(xformedPos[i].W);
}

and looking at the code for Gather shows us exactly what’s going on:

void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
    UINT numLanes)
{
    for(UINT l = 0; l < numLanes; l++)
    {
        for(UINT i = 0; i < 3; i++)
        {
            UINT index = mpIndices[(triId * 3) + (l * 3) + i];
            pOut[i].X.lane[l] = mpXformedPos[index].m128_f32[0];
            pOut[i].Y.lane[l] = mpXformedPos[index].m128_f32[1];
            pOut[i].Z.lane[l] = mpXformedPos[index].m128_f32[2];
            pOut[i].W.lane[l] = mpXformedPos[index].m128_f32[3];
        }
    }
}

Aha! This is the code that transforms our vertices from the AoS (array of structures) form that’s used in memory into the SoA (structure of arrays) form we use during binning (and also the two rasterizers). Note that the output vectors are written element by element; then, as soon as we try to read the whole vector into a register, we hit a forwarding stall, because the core can’t forward the results from the 4 different stores per vector to a single load. It turns out that the other two instances of forwarding stalls run into this problem for the same reason – during the gather of bounding box vertices and triangle vertices in the rasterizer, respectively.

So how do we fix it? Well, we’d really like those vectors to be written using full-width SIMD stores instead. That’s not too hard: converting data from AoS to SoA is essentially a matrix transpose, and our typical use case happens to be 4 separate 4-vectors, i.e. a 4×4 matrix. Luckily, a 4×4 matrix transpose is fairly easy to do in SSE, and Intel’s intrinsics header file even comes with a macro that implements it. So here’s the updated Gather that uses an SSE transpose:

void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
    UINT numLanes)
{
    const UINT *pInd0 = &mpIndices[triId * 3];
    const UINT *pInd1 = pInd0 + (numLanes > 1 ? 3 : 0);
    const UINT *pInd2 = pInd0 + (numLanes > 2 ? 6 : 0);
    const UINT *pInd3 = pInd0 + (numLanes > 3 ? 9 : 0);

    for(UINT i = 0; i < 3; i++)
    {
        __m128 v0 = mpXformedPos[pInd0[i]]; // x0 y0 z0 w0
        __m128 v1 = mpXformedPos[pInd1[i]]; // x1 y1 z1 w1
        __m128 v2 = mpXformedPos[pInd2[i]]; // x2 y2 z2 w2
        __m128 v3 = mpXformedPos[pInd3[i]]; // x3 y3 z3 w3
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        // After transpose:
        pOut[i].X = VecF32(v0); // v0 = x0 x1 x2 x3
        pOut[i].Y = VecF32(v1); // v1 = y0 y1 y2 y3
        pOut[i].Z = VecF32(v2); // v2 = z0 z1 z2 z3
        pOut[i].W = VecF32(v3); // v3 = w0 w1 w2 w3
    }
}
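Since `_MM_TRANSPOSE4_PS` is doing all the heavy lifting here, it’s worth seeing in isolation – this is a self-contained check of what the macro does (my example, not code from the demo):

```cpp
#include <xmmintrin.h>

// Transpose a 4x4 float matrix stored as four row vectors.
// _MM_TRANSPOSE4_PS rewrites its four arguments in place, using a handful
// of unpack/shuffle instructions and no scalar stores at all.
void Transpose4x4(float m[4][4]) {
    __m128 r0 = _mm_loadu_ps(m[0]);
    __m128 r1 = _mm_loadu_ps(m[1]);
    __m128 r2 = _mm_loadu_ps(m[2]);
    __m128 r3 = _mm_loadu_ps(m[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(m[0], r0); // now column 0 of the original matrix
    _mm_storeu_ps(m[1], r1);
    _mm_storeu_ps(m[2], r2);
    _mm_storeu_ps(m[3], r3);
}
```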

Not much to talk about here. The other two instances of this get modified in the exact same way. So how much does it help?

Change: Gather using SSE instructions and transpose

Total cull time min 25th med 75th max mean sdev
Initial 3.148 3.208 3.243 3.305 4.321 3.271 0.100
SSE Gather 2.934 3.078 3.110 3.156 3.992 3.133 0.103
Render depth min 25th med 75th max mean sdev
Initial 2.206 2.220 2.228 2.242 2.364 2.234 0.022
SSE Gather 2.099 2.119 2.137 2.156 2.242 2.141 0.028
Depth test min 25th med 75th max mean sdev
Initial 0.813 0.830 0.839 0.847 0.886 0.839 0.013
SSE Gather 0.773 0.793 0.802 0.809 0.843 0.801 0.012

So we’re another 0.13ms down, about 0.04ms of which we gain in the depth testing pass and the remaining 0.09ms in the rendering pass. And a re-run with VTune confirms that the blocked loads are indeed gone:

Store forwarding fixed

Vertex transformation

Last time, we modified the vertex transform code in the depth test rasterizer to get rid of the z-clamping and simplify the clipping logic. We also changed the logic to make better use of the regular structure of our input vertices. We don’t have any special structure we can use to make vertex transforms on regular meshes faster, but we definitely can (and should) improve the projection and near-clip logic, turning this:

mpXformedPos[i] = TransformCoords(&mpVertices[i].position,
    cumulativeMatrix);
float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3], 0.0000001f);
mpXformedPos[i] = _mm_mul_ps(mpXformedPos[i],
    _mm_set1_ps(oneOverW));
mpXformedPos[i].m128_f32[3] = oneOverW;

into this:

__m128 xform = TransformCoords(&mpVertices[i].position,
    cumulativeMatrix);
__m128 vertZ = _mm_shuffle_ps(xform, xform, 0xaa);
__m128 vertW = _mm_shuffle_ps(xform, xform, 0xff);
__m128 projected = _mm_div_ps(xform, vertW);

// set to all-0 if near-clipped
__m128 mNoNearClip = _mm_cmple_ps(vertZ, vertW);
mpXformedPos[i] = _mm_and_ps(projected, mNoNearClip);

Here, near-clipped vertices are set to the (invalid) x=y=z=w=0, and the binner code can just check for w==0 to test whether a vertex is near-clipped instead of having to use the original w tests (which again had a hardcoded near plane value).

This change doesn’t have any significant impact on the running time, but it does get rid of the hardcoded near plane location for good, so I thought it was worth mentioning.
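In scalar terms, the new convention boils down to something like this (a simplified sketch of the logic; the real code above does it with SSE, four components at a time):

```cpp
struct Vec4 { float x, y, z, w; };

// Project a transformed vertex by w, but flush near-clipped vertices
// (z > w) to all-zero, so downstream code can detect them with a simple
// w == 0 test instead of needing a hardcoded near plane value.
Vec4 ProjectOrZero(Vec4 v) {
    if (!(v.z <= v.w))                        // near-clipped
        return Vec4{0.0f, 0.0f, 0.0f, 0.0f};
    return Vec4{v.x / v.w, v.y / v.w, v.z / v.w, v.w / v.w};
}
```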

Again with the memory ordering

And if we profile again, we notice there’s at least one more surprise waiting for us in the binning code:

Binning Machine Clears

Machine clears? We’ve seen them before, way back in “Cores don’t like to share“. And yes, they’re again for memory ordering reasons. What did we do wrong this time? It turns out that the problematic code has been in there since the beginning, and ran just fine for quite a while, but ever since the scheduling optimizations we did in “The care and feeding of worker threads“, we now have binning jobs running tightly packed enough to run into memory ordering issues. So what’s the problem? Here’s the code:

// Add triangle to the tiles or bins that the bounding box covers
int row, col;
for(row = startY; row <= endY; row++)
{
    int offset1 = YOFFSET1_MT * row;
    int offset2 = YOFFSET2_MT * row;
    for(col = startX; col <= endX; col++)
    {
        int idx1 = offset1 + (XOFFSET1_MT * col) + taskId;
        int idx2 = offset2 + (XOFFSET2_MT * col) +
            (taskId * MAX_TRIS_IN_BIN_MT) + pNumTrisInBin[idx1];
        pBin[idx2] = index + i;
        pBinModel[idx2] = modelId;
        pBinMesh[idx2] = meshId;
        pNumTrisInBin[idx1] += 1;
    }
}

The problem turns out to be the array pNumTrisInBin. Even though it’s accessed as 1D, it is effectively a 3D array like this:

uint16 pNumTrisInBin[TILE_ROWS][TILE_COLS][BINNER_TASKS]

The TILE_ROWS and TILE_COLS parts should be obvious. The BINNER_TASKS needs some explanation though: as you hopefully remember, we try to divide the work between binning tasks so that each of them gets roughly the same amount of triangles. Now, before we start binning triangles, we don’t know which tiles they will go into – after all, that’s what the binner is there to find out.

We could have just one output buffer (bin) per tile; but then, whenever two binner tasks simultaneously end up trying to add a triangle to the same tile, they will end up getting serialized because they try to increment the same counter. And even worse, it would mean that the actual order of triangles in the bins would be different between every run, depending on when exactly each thread was running; while not fatal for depth buffers (we just end up storing the max of all triangles rendered to a pixel anyway, which is ordering-invariant) it’s still a complete pain to debug.

Hence there is one bin per tile per binning worker. We already know that the binning workers get assigned the triangles in the order they occur in the models – with the 32 binning workers we use, the first binning task gets the first 1/32 of the triangles, the second binning task gets the second 1/32, and so forth. And each binner processes its triangles in order. This means that the rasterizer tasks can still process triangles in the original order they occur in the mesh – first all triangles inserted by binner 0, then all triangles inserted by binner 1, and so forth. Since they’re in distinct memory ranges, that’s easily done. And each bin has a separate triangle counter, so they don’t interfere, right? Nothing to see here, move along.

Well, except for the bit where coherency is managed on a cache line granularity. Now, as you can see from the above declaration, the triangle counts for all the binner tasks are stored in adjacent 16-bit words; 32 of them, to be precise, one per binner task. So what was the size of a cache line again? 64 bytes, you say?

Oops.

Yep, even though it’s 32 separate counters, for the purposes of the memory subsystem it’s just the same as if it was all a single counter per tile (well, it might be slightly better than that if the initial pointer isn’t 64-byte aligned, but you get the idea).

Luckily for us, the fix is dead easy: all we have to do is shuffle the order of the array indices around.

uint16 pNumTrisInBin[BINNER_TASKS][TILE_ROWS][TILE_COLS]

We also happen to have 32 tiles total – which means that now, each binner task gets its own cache line by itself (again, provided we align things correctly). So again, it’s a really easy fix. The question being – how much does it help?

Change: Change pNumTrisInBin array indexing

Total cull time min 25th med 75th max mean sdev
Initial 3.148 3.208 3.243 3.305 4.321 3.271 0.100
SSE Gather 2.934 3.078 3.110 3.156 3.992 3.133 0.103
Change bin inds 2.842 2.933 2.980 3.042 3.914 3.007 0.125
Render depth min 25th med 75th max mean sdev
Initial 2.206 2.220 2.228 2.242 2.364 2.234 0.022
SSE Gather 2.099 2.119 2.137 2.156 2.242 2.141 0.028
Change bin inds 1.980 2.008 2.026 2.046 2.172 2.032 0.035

That’s right, a 0.1ms difference from changing the memory layout of a 1024-entry, 2048-byte array. You really need to be extremely careful with the layout of shared data when dealing with multiple cores at the same time.
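The aliasing can be made concrete with a bit of address arithmetic, using the dimensions from this post (32 binner tasks, 32 tiles total; a standalone sketch, not the demo’s code):

```cpp
#include <cstddef>
#include <cstdint>

// Dimensions as in the post: 32 binning tasks and 32 tiles total
// (the exact TILE_ROWS x TILE_COLS split doesn't matter for the math).
enum { BINNER_TASKS = 32, TILES = 32 };

// Old layout: the counters for all tasks of one tile are adjacent.
// Distance between task t and task t+1 for the same tile: 2 bytes,
// so all 32 counters of a tile share a single 64-byte cache line.
std::size_t OldLayoutStride() {
    uint16_t counts[TILES][BINNER_TASKS];
    return (char*)&counts[0][1] - (char*)&counts[0][0];
}

// New layout: each task owns a contiguous run of all its tiles.
// Distance between task t and task t+1 for the same tile:
// TILES * sizeof(uint16_t) = 64 bytes - exactly one cache line apart,
// so (given suitable alignment) no two tasks touch the same line.
std::size_t NewLayoutStride() {
    uint16_t counts[BINNER_TASKS][TILES];
    return (char*)&counts[1][0] - (char*)&counts[0][0];
}
```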

Once more, with branching

At this point, the binner is starting to look fairly good, but there’s one more thing that catches the eye:

Binning branch mispredicts

Branch mispredictions. Now, the two rasterizers have legitimate reason to be mispredicting branches some of the time – they’re processing triangles with fairly unpredictable sizes, and the depth test rasterizer also has an early-out that’s hard to predict. But the binner has less of an excuse – sure, the triangles have very different dimensions measured in 2×2 pixel blocks, but the vast majority of our triangles fits inside one of our (generously sized!) 320×90 pixel tiles. So where are all these branches?

for(int i = 0; i < numLanes; i++)
{
    // Skip triangle if area is zero 
    if(triArea.lane[i] <= 0) continue;
    if(vEndX.lane[i] < vStartX.lane[i] ||
       vEndY.lane[i] < vStartY.lane[i]) continue;
			
    float oneOverW[3];
    for(int j = 0; j < 3; j++)
        oneOverW[j] = xformedPos[j].W.lane[i];
			
    // Reject the triangle if any of its verts are outside the
    // near clip plane
    if(oneOverW[0] == 0.0f || oneOverW[1] == 0.0f ||
        oneOverW[2] == 0.0f) continue;

    // ...
}

Oh yeah, that. In particular, the first test (which checks for degenerate and back-facing triangles) will reject roughly half of all triangles and can be fairly random (as far as the CPU is concerned). Now, last time we had an issue with branch mispredicts, we simply removed the offending early-out. That’s a really bad idea in this case – any triangles we don’t reject here, we’re gonna waste even more work on later. No, these tests really should all be done here.

However, there’s no need for them to be done like this; right now, we have a whole slew of branches that are all over the map. Can’t we consolidate the branches somehow?

Of course we can. The basic idea is to do all the tests on 4 triangles at a time, while we’re still in SIMD form:

// Figure out which lanes are active
VecS32 mFront = cmpgt(triArea, VecS32::zero());
VecS32 mNonemptyX = cmpgt(vEndX, vStartX);
VecS32 mNonemptyY = cmpgt(vEndY, vStartY);
VecF32 mAccept1 = bits2float(mFront & mNonemptyX & mNonemptyY);

// All verts must be inside the near clip volume
VecF32 mW0 = cmpgt(xformedPos[0].W, VecF32::zero());
VecF32 mW1 = cmpgt(xformedPos[1].W, VecF32::zero());
VecF32 mW2 = cmpgt(xformedPos[2].W, VecF32::zero());

VecF32 mAccept = and(and(mAccept1, mW0), and(mW1, mW2));
// laneMask == (1 << numLanes) - 1; - initialized earlier
unsigned int triMask = _mm_movemask_ps(mAccept.simd) & laneMask;

Note that I changed the “is not near-clipped” test from !(w == 0.0f) to w > 0.0f, on account of knowing that all legal w’s are not just non-zero but positive (okay, what really happened is that I forgot to add a “cmpne” when I wrote VecF32 and didn’t feel like adding it here). Other than that, it’s fairly straightforward. We build a mask in vector registers, then turn it into an integer mask of active lanes using MOVMSKPS.
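Stripped of the demo’s `VecF32`/`VecS32` wrappers, the mask construction is just a compare plus a `MOVMSKPS`. Here’s a minimal raw-SSE version of the front-facing test alone (a sketch, not the demo’s code):

```cpp
#include <xmmintrin.h>

// Build a bitmask with one bit per SIMD lane, set where area > 0.
// Bit i of the result corresponds to lane i, like the post's triMask.
unsigned FrontFacingMask(const float area[4]) {
    __m128 a = _mm_loadu_ps(area);
    __m128 m = _mm_cmpgt_ps(a, _mm_setzero_ps()); // all-ones where area > 0
    return (unsigned)_mm_movemask_ps(m);          // sign bits -> integer mask
}
```

The compare produces all-ones or all-zeros per lane, and `_mm_movemask_ps` collects the sign bits into the low 4 bits of an integer – exactly the form the `while(triMask)` loop below wants.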

With this, we could turn all the original branches into a single test in the i loop:

if((triMask & (1 << i)) == 0)
    continue;

However, we can do slightly better than that: it turns out we can iterate pretty much directly over the set bits in triMask, which means we’re now down to one single branch in the outer loop – the loop counter itself. The modified loop looks like this:

while(triMask)
{
    int i = FindClearLSB(&triMask);
    // ...
}

So what does the magic FindClearLSB function do? It better not contain any branches! But lucky for us, it’s quite straightforward:

// Find index of least-significant set bit in mask
// and clear it (mask must be nonzero)
static int FindClearLSB(unsigned int *mask)
{
    unsigned long idx;
    _BitScanForward(&idx, *mask);
    *mask &= *mask - 1;
    return idx;
}

All it takes is _BitScanForward (the VC++ intrinsic for the x86 BSF instruction) and a really old trick for clearing the least-significant set bit in a value. In other words, this compiles into about 3 integer instructions and is completely branch-free. Good enough. So does it help?
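As an aside, the same two ingredients exist on other compilers too; here’s a sketch of a portable variant, assuming GCC/Clang where the BSF equivalent is the __builtin_ctz intrinsic (the name FindClearLSBPortable is mine):

```cpp
// Portable take on FindClearLSB: index of the lowest set bit, then
// clear it. __builtin_ctz is the GCC/Clang counterpart of
// _BitScanForward; mask must be nonzero.
static int FindClearLSBPortable(unsigned int *mask)
{
    int idx = __builtin_ctz(*mask);
    *mask &= *mask - 1; // x & (x - 1) clears the lowest set bit
    return idx;
}
```

Iterating with it over mask 0b101101 visits bits 0, 2, 3, 5 in order, with no branch other than the loop condition.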

Change: Less branches in binner

Total cull time     min    25th   med    75th   max    mean   sdev
Initial             3.148  3.208  3.243  3.305  4.321  3.271  0.100
SSE Gather          2.934  3.078  3.110  3.156  3.992  3.133  0.103
Change bin inds     2.842  2.933  2.980  3.042  3.914  3.007  0.125
Less branches       2.786  2.879  2.915  2.969  3.706  2.936  0.092

Render depth        min    25th   med    75th   max    mean   sdev
Initial             2.206  2.220  2.228  2.242  2.364  2.234  0.022
SSE Gather          2.099  2.119  2.137  2.156  2.242  2.141  0.028
Change bin inds     1.980  2.008  2.026  2.046  2.172  2.032  0.035
Less branches       1.905  1.934  1.946  1.959  2.012  1.947  0.019

That’s another 0.07ms off the total, for about a 10% reduction in median total cull time for this post, and a 12.7% reduction in median rasterizer time. And for our customary victory lap, here’s the VTune results after this change:

Binning with branching improved

The branch mispredictions aren’t gone, but we did make a notable dent. It’s more obvious if you compare the number of clock cycles with the previous image.

And with that, I’ll conclude this journey into both the triangle binner and the dark side of speculative execution. We’re also getting close to the end of this series – the next post will look again at the loading and rendering code we’ve been intentionally ignoring for most of this series :), and after that I’ll finish with a summary and wrap-up – including a list of things I didn’t cover, and why not.

Reshaping dataflows

This post is part of a series – go here for the index.

Welcome back! So far, we’ve spent quite some time “zoomed in” on various components of the Software Occlusion Culling demo, looking at various micro-architectural pitfalls and individual loops. In the last two posts, we “zoomed out” and focused on the big picture: what work runs when, and how to keep all cores busy. Now, it’s time to look at what lies in between: the plumbing, if you will. We’ll be looking at the dataflows between subsystems and modules and how to improve them.

This is one of my favorite topics in optimization, and it’s somewhat under-appreciated. There’s plenty of material on how to make loops run fast (although a lot of it is outdated or just wrong, so beware), and at this point there’s plenty of ways of getting concurrency up and running: there’s OpenMP, Intel’s TBB, Apple’s GCD, Windows Thread Pools and ConcRT for CPU, there’s OpenCL, CUDA and DirectCompute for jobs that are GPU-suitable, and so forth; you get the idea. The point being that it’s not hard to find a shrink-wrap solution that gets you up and running, and a bit of profiling (like we just did) is usually enough to tell you what needs to be done to make it all go smoothly.

But back to the topic at hand: improving dataflow. The problem is that, unlike the other two aspects I mentioned, there’s really no recipe to follow; it’s very much context-dependent. It basically boils down to looking at both sides of the interface between systems and functions and figuring out if there’s a better way to handle that interaction. We’ve seen a bit of that earlier when talking about frustum culling; rather than trying to define it in words, I’ll just do it by example, so let’s dive right in!

A simple example

A good example is the member variable TransformedAABBoxSSE::mVisible, declared like this:

bool *mVisible;

A pointer to a bool. So where does that pointer come from?

inline void SetVisible(bool *visible){mVisible = visible;}

It turns out that the constructor initializes this pointer to NULL, and the only method that ever does anything with mVisible is RasterizeAndDepthTestAABBox, which executes *mVisible = true; if the bounding box is found to be visible. So how does this all get used?

mpVisible[i] = false;
mpTransformedAABBox[i].SetVisible(&mpVisible[i]);
if(...)
{
    mpTransformedAABBox[i].TransformAABBox();
    mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
}

That’s it. Those are the only call sites. There’s really no reason for mVisible to be state – semantically, it’s just a return value for RasterizeAndDepthTestAABBox, so that’s what it should be – always try to get rid of superfluous state. This doesn’t even have anything to do with optimization per se; explicit dataflow is easy for programmers to see and reason about, while implicit dataflow (through pointers, members and state) is hard to follow (both for humans and compilers!) and error-prone.

Anyway, making this return value explicit is really basic, so I’m not gonna walk through the details; you can always look at the corresponding commit. I won’t bother benchmarking this change either.
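Since the commit isn’t reproduced here, a minimal sketch of what the change boils down to (types and names simplified and hypothetical, not the demo’s actual ones):

```cpp
// Before: the box stored a bool* and wrote through it as a side effect.
// After: the test just returns its result, so the dataflow is explicit
// at the call site.
struct BoxSketch
{
    bool occluded; // stand-in for the actual rasterization outcome

    bool RasterizeAndDepthTestAABBox() const
    {
        return !occluded; // visible iff not fully occluded
    }
};
```

At the call site, mpVisible[i] = box.RasterizeAndDepthTestAABBox(); replaces the whole SetVisible dance.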

A more interesting case

In the depth test rasterizer, right after determining the bounding box, there’s this piece of code:

for(int vv = 0; vv < 3; vv++) 
{
    // If W (holding 1/w in our case) is not between 0 and 1,
    // then vertex is behind near clip plane (1.0 in our case).
    // If W < 1 (for W>0), and 1/W < 0 (for W < 0).
    VecF32 nearClipMask0 = cmple(xformedPos[vv].W, VecF32(0.0f));
    VecF32 nearClipMask1 = cmpge(xformedPos[vv].W, VecF32(1.0f));
    VecS32 nearClipMask = float2bits(or(nearClipMask0,
        nearClipMask1));

    if(!is_all_zeros(nearClipMask))
    {
        // All four vertices are behind the near plane (we're
        // processing four triangles at a time w/ SSE)
        return true;
    }
}

Okay. The transform code sets things up so that the “w” component of the screen-space positions actually contains 1/w; the first part of this code then tries to figure out whether the source vertex was in front of the near plane (i.e. outside the view frustum or not). An ugly wrinkle here is that the near plane is hard-coded to be at 1. Doing this after dividing by w adds extra complications since the code needs to be careful about the signs. And the second comment is outright wrong – it in fact early-outs when any of the four active triangles have vertex number vv outside the near-clip plane, not when all of them do. In other words, if any of the 4 active triangles get near-clipped, the test rasterizer will just punt and return true (“visible”).
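The any-vs-all mixup is easy to make because both conditions are tests on the same movemask value; a standalone sketch (helper names invented for illustration):

```cpp
#include <immintrin.h>

// With 4 SSE lanes, _mm_movemask_ps yields one bit per lane. "Any lane
// set" and "all lanes set" are different tests on that integer: the
// code above early-outs on the former while its comment claims the
// latter.
static bool AnyLaneSet(__m128 mask)  { return _mm_movemask_ps(mask) != 0; }
static bool AllLanesSet(__m128 mask) { return _mm_movemask_ps(mask) == 0xf; }
```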

So here’s the thing: there’s really no reason to do this check after we’re done with triangle setup. Nor do we even have to gather the 3 triangle vertices to discover that one of them is in front of the near plane. A box has 8 vertices, and we’ll know whether any of them are in front of the near plane as soon as we’re done transforming them, before we even think about triangle setup! So let’s look at the function that transforms the vertices:

void TransformedAABBoxSSE::TransformAABBox()
{
    for(UINT i = 0; i < AABB_VERTICES; i++)
    {
        mpXformedPos[i] = TransformCoords(&mpBBVertexList[i],
            mCumulativeMatrix);
        float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3],
            0.0000001f);
        mpXformedPos[i] = mpXformedPos[i] * oneOverW;
        mpXformedPos[i].m128_f32[3] = oneOverW;
    }
}

As we can see, returning 1/w does in fact take a bit of extra work, so we’d like to avoid it, especially since that 1/w is really only referenced by the near-clip checking code. Also, the code seems to clamp w at some arbitrary small positive value – which means that the part of the near clip computation in the depth test rasterizer that worries about w<0 is actually unnecessary. This is the kind of thing I’m talking about – each piece of code in isolation seems reasonable, but once you look at both sides it becomes clear that the pieces don’t fit together all that well.

It turns out that after TransformCoords, we’re in “homogeneous viewport space”, i.e. we’re still in a homogeneous space, but unlike the homogeneous clip space you might be used to from vertex shaders, this one also has the viewport transform baked in. But our viewport transform leaves z alone (we fixed that in the previous post!), so we still have a D3D-style clip volume for z:

0 ≤ z ≤ w

Since we’re using a reversed clip volume, the z≤w constraint is the near-plane one. Note that this test doesn’t need any special cases for negative signs and also doesn’t have a hardcoded near-plane location any more: it just automatically uses whatever the projection matrix says, which is the right thing to do!

Even better, if we test for near-clip anyway, there’s no need to clamp w at all. We know that anything with w≤0 is outside the near plane, and if a vertex is outside the near plane we’re not gonna rasterize the box anyway. Now we might still end up dividing by 0, but since we’re dealing with floats, this is a well-defined operation (it might return infinities or NaNs, but that’s fine).

And on the subject of not rasterizing the box: as I said earlier, as soon as one vertex is outside the near-plane, we know we’re going to return true from the depth test rasterizer, so there’s no point even starting the operation. To facilitate this, we just make TransformAABBox return whether the box should be rasterized or not. Putting it all together:

bool TransformedAABBoxSSE::TransformAABBox()
{
    __m128 zAllIn = _mm_castsi128_ps(_mm_set1_epi32(~0));

    for(UINT i = 0; i < AABB_VERTICES; i++)
    {
        __m128 vert = TransformCoords(&mpBBVertexList[i],
            mCumulativeMatrix);

        // We have inverted z; z is inside of near plane iff z <= w.
        __m128 vertZ = _mm_shuffle_ps(vert, vert, 0xaa); //vert.zzzz
        __m128 vertW = _mm_shuffle_ps(vert, vert, 0xff); //vert.wwww
        __m128 zIn = _mm_cmple_ps(vertZ, vertW);
        zAllIn = _mm_and_ps(zAllIn, zIn);

        // project
        mpXformedPos[i] = _mm_div_ps(vert, vertW);
    }

    // return true if and only if all verts inside near plane
    return _mm_movemask_ps(zAllIn) == 0xf;
}

In case you’re wondering why this code uses raw SSE intrinsics and not VecF32, it’s because I’m purposefully trying to keep anything depending on the SIMD width out of VecF32, which makes it a lot easier to go to 8-wide AVX should we want to at some point. But this code really uses 4-vectors of (x,y,z,w) and needs to do shuffles, so it doesn’t fit in that model and I want to keep it separate. But the actual logic is just what I described.

And once we have this return value from TransformAABBox, we get to remove the near-clip test from the depth test rasterizer, and we get to move our early-out for near-clipped boxes all the way to the call site:

if(mpTransformedAABBox[i].TransformAABBox())
    mpVisible[i] = mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
else
    mpVisible[i] = true;

So, the oneOverW hack, the clamping hack and the hard-coded near plane are gone. That’s already a victory in terms of code quality, but did it improve the run time?

Change: Transform/early-out fixes

Depth test          min    25th   med    75th   max    mean   sdev
Start               1.109  1.152  1.166  1.182  1.240  1.167  0.022
Transform fixes     1.054  1.092  1.102  1.112  1.146  1.102  0.016

Another 0.06ms off our median depth test time, which may not sound big but is over 5% of what’s left of it at this point.

Getting warmer

The bounding box rasterizer has one more method that’s called per-box though, and this is one that really deserves some special attention. Meet IsTooSmall:

bool TransformedAABBoxSSE::IsTooSmall(__m128 *pViewMatrix,
    __m128 *pProjMatrix, CPUTCamera *pCamera)
{
    float radius = mBBHalf.lengthSq(); // Use length-squared to
    // avoid sqrt().  Relative comparisons hold.

    float fov = pCamera->GetFov();
    float tanOfHalfFov = tanf(fov * 0.5f);

    MatrixMultiply(mWorldMatrix, pViewMatrix, mCumulativeMatrix);
    MatrixMultiply(mCumulativeMatrix, pProjMatrix,
        mCumulativeMatrix);
    MatrixMultiply(mCumulativeMatrix, mViewPortMatrix,
        mCumulativeMatrix);

    __m128 center = _mm_set_ps(1.0f, mBBCenter.z, mBBCenter.y,
        mBBCenter.x);
    __m128 mBBCenterOSxForm = TransformCoords(&center,
        mCumulativeMatrix);
    float w = mBBCenterOSxForm.m128_f32[3];
    if( w > 1.0f )
    {
        float radiusDivW = radius / w;
        float r2DivW2DivTanFov = radiusDivW / tanOfHalfFov;

        return r2DivW2DivTanFov <
            (mOccludeeSizeThreshold * mOccludeeSizeThreshold);
    }

    return false;
}

Note that MatrixMultiply(A, B, C) performs C = A * B; the rest should be easy enough to figure out from the code. Now there’s really several problems with this function, so let’s go straight to a list:

  • radius (which is really radius squared) only depends on mBBHalf, which is fixed at initialization time. There’s no need to recompute it every time.
  • Similarly, fov and tanOfHalfFov only depend on the camera, and absolutely do not need to be recomputed once for every box. This is what gave us the _tan_pentium4 cameo all the way back in “Frustum culling: turning the crank”, by the way.
  • The view matrix, projection matrix and viewport matrix are also all camera or global constants. Again, no need to multiply these together for every box – the only matrix that is different between boxes is the very first one, the world matrix, and since matrix multiplication is associative, we can just concatenate the other three once.
  • There’s also no need for mOccludeeSizeThreshold to be squared every time – we can do that once.
  • Nor is there a need for it to be stored per box, since it’s a global constant owned by the depth test rasterizer.
  • (radius / w) / tanOfHalfFov would be better computed as radius / (w * tanOfHalfFov).
  • But more importantly, since all we’re doing is a compare and both w and tanOfHalfFov are positive, we can just multiply through by them and get rid of the divide altogether.
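The last two bullet points amount to a simple algebraic rewrite: for positive w and tanOfHalfFov, radius / (w * tanOfHalfFov) < threshold² is the same condition as radius < w * (threshold² * tanOfHalfFov), and the parenthesized factor is a per-camera constant. A sketch of the equivalence (hypothetical helper names):

```cpp
// Original shape: two divides per box.
static bool TooSmallDiv(float radiusSq, float w, float tanHalfFov,
                        float threshold)
{
    return (radiusSq / w) / tanHalfFov < threshold * threshold;
}

// Rewritten shape: radiusThreshold = threshold^2 * tanHalfFov is
// precomputed once per camera, leaving one multiply and a compare.
static bool TooSmallMul(float radiusSq, float w, float radiusThreshold)
{
    return radiusSq < w * radiusThreshold;
}
```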

All these things are common problems that I must have fixed a hundred times, but I have to admit that it’s pretty rare to see so many of them in a single page of code. Anyway, rather than fixing these one by one, let’s just cut to the chase: instead of all the redundant computations, we just move everything that only depends on the camera (or is global) into a single struct that holds our setup, which I dubbed BoxTestSetup. Here’s the code:

struct BoxTestSetup
{
    __m128 mViewProjViewport[4];
    float radiusThreshold;

    void Init(const __m128 viewMatrix[4],
        const __m128 projMatrix[4], CPUTCamera *pCamera,
        float occludeeSizeThreshold);
};

void BoxTestSetup::Init(const __m128 viewMatrix[4],
    const __m128 projMatrix[4], CPUTCamera *pCamera,
    float occludeeSizeThreshold)
{
    // viewportMatrix is a global float4x4; we need a __m128[4]
    __m128 viewPortMatrix[4];
    viewPortMatrix[0] = _mm_loadu_ps((float*)&viewportMatrix.r0);
    viewPortMatrix[1] = _mm_loadu_ps((float*)&viewportMatrix.r1);
    viewPortMatrix[2] = _mm_loadu_ps((float*)&viewportMatrix.r2);
    viewPortMatrix[3] = _mm_loadu_ps((float*)&viewportMatrix.r3);

    MatrixMultiply(viewMatrix, projMatrix, mViewProjViewport);
    MatrixMultiply(mViewProjViewport, viewPortMatrix,
        mViewProjViewport);

    float fov = pCamera->GetFov();
    float tanOfHalfFov = tanf(fov * 0.5f);
    radiusThreshold = occludeeSizeThreshold * occludeeSizeThreshold
        * tanOfHalfFov;
}

This is initialized once we start culling and simply kept on the stack. Then we just pass it to IsTooSmall, which after our surgery looks like this:

bool TransformedAABBoxSSE::IsTooSmall(const BoxTestSetup &setup)
{
    MatrixMultiply(mWorldMatrix, setup.mViewProjViewport,
        mCumulativeMatrix);

    __m128 center = _mm_set_ps(1.0f, mBBCenter.z, mBBCenter.y,
        mBBCenter.x);
    __m128 mBBCenterOSxForm = TransformCoords(&center,
        mCumulativeMatrix);
    float w = mBBCenterOSxForm.m128_f32[3];
    if( w > 1.0f )
    {
        return mRadiusSq < w * setup.radiusThreshold;
    }

    return false;
}

Wow, that method sure seems to have lost a few pounds. Let’s run the numbers:

Change: IsTooSmall cleanup

Depth test          min    25th   med    75th   max    mean   sdev
Start               1.109  1.152  1.166  1.182  1.240  1.167  0.022
Transform fixes     1.054  1.092  1.102  1.112  1.146  1.102  0.016
IsTooSmall cleanup  0.860  0.893  0.908  0.917  0.954  0.905  0.018

Another 0.2ms off the median run time, bringing our total reduction for this post to about 22%. So are we done? Not yet!

The state police

Currently, each TransformedAABBoxSSE still keeps its own copy of the cumulative transform matrix and a copy of its transformed vertices. But it’s not necessary for these to be persistent – we compute them once, use them to rasterize the box, then don’t look at them again until the next frame. So, like mVisible earlier, there’s really no need to keep them around as state; instead, it’s better to just store them on the stack. Less pointers per TransformedAABBoxSSE, less cache misses, and – perhaps most important of all – it makes the bounding box objects themselves stateless. Granted, that’s the case only because our world is perfectly static and nothing is animated at runtime, but still, stateless is good! Stateless is easier to read, easier to debug, and easier to test.

Again, this is another change that is purely mechanical – just pass in a pointer to cumulativeMatrix and xformedPos to the functions that want them. So this time, I’m just going to refer you directly to the two commits that implement this idea, and skip straight to the results:

Change: Reduce amount of state

Depth test          min    25th   med    75th   max    mean   sdev
Start               1.109  1.152  1.166  1.182  1.240  1.167  0.022
Transform fixes     1.054  1.092  1.102  1.112  1.146  1.102  0.016
IsTooSmall cleanup  0.860  0.893  0.908  0.917  0.954  0.905  0.018
Reduce state        0.834  0.862  0.873  0.886  0.938  0.875  0.017

Only about 0.03ms this time, but we also save 192 bytes (plus allocator overhead) worth of memory per box, which is a nice bonus. And anyway, we’re not done yet, because I have one more!

It’s more fun to compute

There’s one more piece of unnecessary data we currently store per bounding box: the vertex list, initialized in CreateAABBVertexIndexList:

float3 min = mBBCenter - bbHalf;
float3 max = mBBCenter + bbHalf;
	
//Top 4 vertices in BB
mpBBVertexList[0] = _mm_set_ps(1.0f, max.z, max.y, max.x);
mpBBVertexList[1] = _mm_set_ps(1.0f, max.z, max.y, min.x); 
mpBBVertexList[2] = _mm_set_ps(1.0f, min.z, max.y, min.x);
mpBBVertexList[3] = _mm_set_ps(1.0f, min.z, max.y, max.x);
// Bottom 4 vertices in BB
mpBBVertexList[4] = _mm_set_ps(1.0f, min.z, min.y, max.x);
mpBBVertexList[5] = _mm_set_ps(1.0f, max.z, min.y, max.x);
mpBBVertexList[6] = _mm_set_ps(1.0f, max.z, min.y, min.x);
mpBBVertexList[7] = _mm_set_ps(1.0f, min.z, min.y, min.x);

This is, in effect, just treating the bounding box as a general mesh. But that’s extremely wasteful – we already store center and half-extent, the min/max corner positions are trivial to reconstruct from that information, and all the other vertices can be constructed by splicing min/max together componentwise using a set of masks that is the same for all bounding boxes. So these 8*16 = 128 bytes of vertex data really don’t pay their way.

But more importantly, note that we only ever use two distinct values each for x, y and z. Now TransformAABBox, which we already saw above, uses TransformCoords to compute the matrix-vector product v*M with the cumulative transform matrix, using the expression

v.x * M.row[0] + v.y * M.row[1] + v.z * M.row[2] + M.row[3] (v.w is assumed to be 1)

and because we know that v.x is either min.x or max.x, we can multiply both by M.row[0] once and store the result. Then the 8 individual vertices can skip the multiplies altogether. Putting it all together leads to the following new code for TransformAABBox:

// 0 = use min corner, 1 = use max corner
static const int sBBxInd[AABB_VERTICES] = { 1, 0, 0, 1, 1, 1, 0, 0 };
static const int sBByInd[AABB_VERTICES] = { 1, 1, 1, 1, 0, 0, 0, 0 };
static const int sBBzInd[AABB_VERTICES] = { 1, 1, 0, 0, 0, 1, 1, 0 };

bool TransformedAABBoxSSE::TransformAABBox(__m128 xformedPos[],
    const __m128 cumulativeMatrix[4])
{
    // w ends up being garbage, but it doesn't matter - we ignore
    // it anyway.
    __m128 vCenter = _mm_loadu_ps(&mBBCenter.x);
    __m128 vHalf   = _mm_loadu_ps(&mBBHalf.x);

    __m128 vMin    = _mm_sub_ps(vCenter, vHalf);
    __m128 vMax    = _mm_add_ps(vCenter, vHalf);

    // transforms
    __m128 xRow[2], yRow[2], zRow[2];
    xRow[0] = _mm_shuffle_ps(vMin, vMin, 0x00) * cumulativeMatrix[0];
    xRow[1] = _mm_shuffle_ps(vMax, vMax, 0x00) * cumulativeMatrix[0];
    yRow[0] = _mm_shuffle_ps(vMin, vMin, 0x55) * cumulativeMatrix[1];
    yRow[1] = _mm_shuffle_ps(vMax, vMax, 0x55) * cumulativeMatrix[1];
    zRow[0] = _mm_shuffle_ps(vMin, vMin, 0xaa) * cumulativeMatrix[2];
    zRow[1] = _mm_shuffle_ps(vMax, vMax, 0xaa) * cumulativeMatrix[2];

    __m128 zAllIn = _mm_castsi128_ps(_mm_set1_epi32(~0));

    for(UINT i = 0; i < AABB_VERTICES; i++)
    {
        // Transform the vertex
        __m128 vert = cumulativeMatrix[3];
        vert += xRow[sBBxInd[i]];
        vert += yRow[sBByInd[i]];
        vert += zRow[sBBzInd[i]];

        // We have inverted z; z is inside of near plane iff z <= w.
        __m128 vertZ = _mm_shuffle_ps(vert, vert, 0xaa); //vert.zzzz
        __m128 vertW = _mm_shuffle_ps(vert, vert, 0xff); //vert.wwww
        __m128 zIn = _mm_cmple_ps(vertZ, vertW);
        zAllIn = _mm_and_ps(zAllIn, zIn);

        // project
        xformedPos[i] = _mm_div_ps(vert, vertW);
    }

    // return true if and only if none of the verts are z-clipped
    return _mm_movemask_ps(zAllIn) == 0xf;
}

Admittedly, quite a bit longer than the original one, but that’s because we front-load a lot of the computation; most of the per-vertex work done in TransformCoords is gone. And here’s our reward:

Change: Get rid of per-box vertex list

Depth test          min    25th   med    75th   max    mean   sdev
Start               1.109  1.152  1.166  1.182  1.240  1.167  0.022
Transform fixes     1.054  1.092  1.102  1.112  1.146  1.102  0.016
IsTooSmall cleanup  0.860  0.893  0.908  0.917  0.954  0.905  0.018
Reduce state        0.834  0.862  0.873  0.886  0.938  0.875  0.017
Remove vert list    0.801  0.823  0.830  0.839  0.867  0.831  0.012

This brings our total for this post to a nearly 25% reduction in median depth test time, plus about 320 bytes memory reduction per TransformedAABBoxSSE – which, since we have about 27000 of them, works out to well over 8 megabytes. Such are the rewards for widening the scope beyond optimizing functions by themselves.

And as usual, the code for this time (plus some changes I haven’t discussed yet) is up on Github. Until next time!

The care and feeding of worker threads, part 2

This post is part of a series – go here for the index.

In the previous post, we took a closer look at what our worker threads were doing and spent some time load-balancing the depth buffer rasterizer to reduce our overall latency. This time, we’ll have a closer look at the rest of the system.

A bug

But first, it’s time to look at a bug that I inadvertently introduced last time: If you tried running the code from last time, you might have noticed that toggling the “Multi Tasking” checkbox off and back on causes a one-frame glitch. I introduced this bug in the changes corresponding to the section “Balancing act”. Since I didn’t get any comments or mails about it, it seems like I got away with it :), but I wanted to rectify it here anyway.

The issue turned out to be that the IsTooSmall computation for occluders, which we moved from the “vertex transform” to the “frustum cull” pass last time, used stale information. The relevant piece of the main loop is this:

mpCamera->SetNearPlaneDistance(1.0f);
mpCamera->SetFarPlaneDistance(gFarClipDistance);
mpCamera->Update();

// If view frustum culling is enabled then determine which occluders
// and occludees are inside the view frustum and run the software
// occlusion culling on only the those models
if(mEnableFCulling)
{
    renderParams.mpCamera = mpCamera;
    mpDBR->IsVisible(mpCamera);
    mpAABB->IsInsideViewFrustum(mpCamera);
}

// if software occlusion culling is enabled
if(mEnableCulling)
{
    mpCamera->SetNearPlaneDistance(gFarClipDistance);
    mpCamera->SetFarPlaneDistance(1.0f);
    mpCamera->Update();

    // Set the camera transforms so that the occluders can
    // be transformed 
    mpDBR->SetViewProj(mpCamera->GetViewMatrix(),
        (float4x4*)mpCamera->GetProjectionMatrix());

    // (clear, render depth and perform occlusion test here)

    mpCamera->SetNearPlaneDistance(1.0f);
    mpCamera->SetFarPlaneDistance(gFarClipDistance);
    mpCamera->Update();
}

Note how the call that actually updates the view-projection matrix (the SetViewProj call) runs after the frustum-culling pass. That’s the bug I was running into. Fixing this bug is almost as simple as moving that call up (to before the frustum culling pass), but another wrinkle is that the depth-buffer pass uses an inverted Z-buffer with Z=0 at the far plane and Z=1 at the near plane – note the calls that swap the positions of the camera “near” and “far” planes before depth buffer rendering, and the ones that swap them back after. There are good reasons for doing this, particularly if the depth buffer uses floats (as it does in our implementation). But to simplify matters here, I changed the code to do the swapping as part of the viewport transform instead, which means there’s no need to be modifying the camera/projection setup during the frame at all. This keeps the code simpler and also makes it easy to move the SetViewProj call to before the frustum culling pass, where it should be now that we’re using these matrices earlier.
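For the viewport-transform part, the idea is just to bake the depth reversal into the (per-frame constant) viewport matrix rather than swapping the camera planes: with a D3D-style [0,1] depth range, a z scale of -1 and a z offset of 1 map near (z=0) to 1 and far (z=1) to 0. A sketch of that mapping (the convention is assumed here; the demo’s actual matrix layout may differ):

```cpp
// Reversed-depth viewport mapping: z' = 1 - z, expressed in the usual
// scale-and-offset form a viewport matrix applies to z.
static float ViewportDepthReversed(float zNdc)
{
    const float zScale = -1.0f, zOffset = 1.0f;
    return zNdc * zScale + zOffset;
}
```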

Some extra instrumentation

In some of the previous posts, we already looked at the frustum culling logic; this time, I also added another timer that measures our total culling time, including frustum culling and everything related to rendering the depth buffer and performing the bounding box occlusion tests. The code itself is straightforward; I just wanted another explicit counter so we can see the summary statistics as we make changes. I’ll use separate tables for the individual measurements:

Total cull time     min    25th   med    75th   max    mean   sdev
Initial             3.767  3.882  3.959  4.304  5.075  4.074  0.235

Render depth        min    25th   med    75th   max    mean   sdev
Initial             2.098  2.119  2.132  2.146  2.212  2.136  0.022

Depth test          min    25th   med    75th   max    mean   sdev
Initial             1.249  1.366  1.422  1.475  1.656  1.425  0.081

Load balancing depth testing

Last time, we saw two fundamentally different ways to balance our multi-threaded workloads. The first was to simply split the work into N contiguous chunks. As we saw for the “transform vertices” and “bin meshes” passes, this works great provided that the individual work items generate a roughly uniform amount of work. Since vertex transform and binning work were roughly proportional to the number of vertices and triangles respectively, this kind of split worked well once we made sure to split after early-out processing.

In the second case, triangle rasterization, we couldn’t change the work partition after the fact: each task corresponded to one tile, and if we started touching two tiles in one task, it just wouldn’t work; there’d be race conditions. But at least we had a rough metric of how expensive each tile was going to be – the number of triangles in the respective bins – and we could use that to make sure that the “bulky” tiles would get processed first, to reduce the risk of picking up such a tile late and then having all other threads wait for its processing to finish.

Now, the depth tests are somewhat tricky, because neither of these strategies really apply. The cost of depth-testing a bounding box has two components: first, there is a fixed overhead of just processing a box (transforming its vertices and setting up the triangles), and second, there’s the actual rasterization with a cost that’s roughly proportional to the size of the bounding box in pixels when projected to the screen. For small boxes, the constant overhead is the bigger issue; for larger boxes, the per-pixel cost dominates. And at the point when we’re partitioning the work items across threads, we don’t know how big an area a box is going to cover on the screen, because we haven’t transformed the vertices yet! But still, our depth test pass is in desperate need of some balancing – here’s a typical example:

Imbalanced depth tests

There’s nothing that’s stopping us from treating the depth test pass the way we treat the regular triangle pass: chop it up into separate phases with explicit hand-overs and balance them separately. But that’s a really big and disruptive change, and it turns out we don’t have to go that far to get a decent improvement.

The key realization is that the array of model bounding boxes we’re traversing is not in a random order. Models that are near each other in the world also tend to be near each other in the array. Thus, when we just partition the list of world models into N separate contiguous chunks, they’re not gonna have a similar amount of work for most viewpoints: some chunks are closer to the viewer than others, and those will contain bounding boxes that take up more area on the screen and hence be more expensive to process.

Well, that’s easy enough to fix: don’t do that! Suppose we had two worker threads. Our current approach would then correspond to splitting the world database in the middle, giving the first half to the first worker, and the second half to the second worker. This is bad whenever there’s much more work in one of the halves, say because the camera happens to be in it and the models are just bigger on screen and take longer to depth-test. But there’s no need to split the world database like that! We can just as well split it non-contiguously, say into one half with even indices and another half with odd indices. We can still get a lopsided distribution, but only if we happen to be a lot closer to all the even-numbered models than we are to the odd-numbered ones, and that’s a lot less likely to happen by accident. Unless the meshes happen to form a grid or other regular structure that is, in which case you might still get screwed. :)

Anyway, the same idea generalizes to N threads: instead of partitioning the models into odd and even halves, group all models which have the same index mod N. And in practice we don’t want to interleave at the level of individual models, since them being close together also has an advantage: they tend to hit similar regions of the depth buffer, which have a good chance of being in the cache. So instead of interleaving at the level of individual models, we interleave groups of 64 (arbitrary choice!) models at a time; an idea similar to the disk striping used for RAIDs. It turns out to be a really easy change to make: just replace the original loop

for(UINT i = start; i < end; i++)
{
    // process model i
}

with the only marginally more complicated

static const UINT kChunkSize = 64;
for(UINT base = taskId*kChunkSize; base < mNumModels;
        base += mNumDepthTestTasks * kChunkSize)
{
    UINT end = min(base + kChunkSize, mNumModels);
    for(UINT i = base; i < end; i++)
    {
        // process model i
    }
}

and we’re done. Let’s see the change:

Change: “Striping” to load-balance depth test threads.

Depth test min 25th med 75th max mean sdev
Initial 1.249 1.366 1.422 1.475 1.656 1.425 0.081
Striped 1.109 1.152 1.166 1.182 1.240 1.167 0.022
Total cull time min 25th med 75th max mean sdev
Initial 3.767 3.882 3.959 4.304 5.075 4.074 0.235
Striped depth test 3.646 3.769 3.847 3.926 4.818 3.877 0.160

That’s pretty good for just changing a few lines. Here’s the corresponding Telemetry screenshot:

Depth tests after striping

Not as neatly balanced as some of the other ones we’ve seen, but we successfully managed to break up some of the huge packets, so it’s good enough for now.

One bottleneck remaining

At this point, we’re in pretty good shape as far as worker thread utilization is concerned, but there’s one big serial chunk still remaining, right between frustum culling and vertex transformation:

Depth buffer clears

Clearing the depth buffer. This is about 0.4ms, about a third of the time we spend depth testing, all tracing back to a single line in the code:

    // Clear the depth buffer
    mpCPURenderTargetPixels = (UINT*)mpCPUDepthBuf;
    memset(mpCPURenderTargetPixels, 0, SCREENW * SCREENH * 4);

Luckily, this one’s really easy to fix. We could try and turn this into another separate group of tasks, but there’s no need: we already have a pass that chops up the screen into several smaller pieces, namely the actual rasterization which works one tile at a time. And neither the vertex transform nor the binner that run before it actually care about the contents of the depth buffer. So we just clear one tile at a time, from the rasterizer code. As a bonus, this means that the active tile gets “pre-loaded” into the current core’s L2 cache before we start rendering. I’m not going to bother walking through the code here – it’s simple enough – but as usual, I’ll give you the results:

Change: Clear depth buffer in rasterizer workers

Total cull time min 25th med 75th max mean sdev
Initial 3.767 3.882 3.959 4.304 5.075 4.074 0.235
Striped depth test 3.646 3.769 3.847 3.926 4.818 3.877 0.160
Clear in rasterizer 3.428 3.579 3.626 3.677 4.734 3.658 0.155
Render depth min 25th med 75th max mean sdev
Initial 2.098 2.119 2.132 2.146 2.212 2.136 0.022
Clear in rasterizer 2.191 2.224 2.248 2.281 2.439 2.258 0.043

So even though we take a bit of a hit in rasterization latency, we still get a very solid 0.2ms win in the total cull time. Again, a very good pay-off considering the amount of work involved.
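If you’re curious, the per-tile clear is essentially this (a sketch with hypothetical names and signature, not the actual code from the demo):

```cpp
#include <cstring>

// Hypothetical sketch: each rasterizer worker clears just its own tile
// of the row-major depth buffer before processing that tile's bins.
// As a bonus, this pulls the tile into the core's cache right before
// it gets rendered to.
void ClearDepthTile(float* depthBuf, int pitch,
                    int x0, int y0, int tileW, int tileH)
{
    // A tile's rows aren't contiguous in the full-screen buffer,
    // so clear one scan line at a time.
    for(int y = y0; y < y0 + tileH; y++)
        memset(&depthBuf[y * pitch + x0], 0, tileW * sizeof(float));
}
```

The real code does this at the start of the per-tile rasterization task, using the tile bounds it already has from binning.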

Summary

A lot of the posts in this series so far either needed conceptual/algorithmic leaps or at least some detailed micro-architectural profiling. But this post and the previous one did not. In fact, finding these problems took nothing but a timeline profiler, and none of the fixes were particularly complicated either. I used Telemetry because that’s what I’m familiar with, but I didn’t use any but its most basic features, and I’m sure you would’ve found the same problems with any other program of this type; I’m told Intel’s GPA can do the same thing, but I haven’t used it so far.

Just to drive this one home – this is what we started with:

Initial work distribution

(total cull time 7.36ms, for what it’s worth) and this is where we are now:

Finished worker balance

Note that the bottom one is zoomed in by 2x so you can read the labels! Compare the zone lengths where they’re printed. Now, this is not a representative sample; I just grabbed an arbitrary frame from both sessions, so don’t draw any conclusions from these two images alone, but it’s still fairly impressive. I’m still not sure why TBB only seems to use some subset of its worker threads – maybe there’s some threshold before they wake up and our parallel code just doesn’t run for long enough? – but it should be fairly obvious that the overall packing is a lot better now.

Remember, people. This is the same code. I didn’t change any of the algorithms nor their implementations in any substantial way. All I did was spend some time on their callers, improving the work granularity and scheduling. If you’re using worker threads, this is absolutely something you need to have on your radar.

As usual, the code for this part is up on Github, this time with a few bonus commits I’m going to discuss next time (spoiler alert!), when I take a closer look at the depth testing code and the binner. See you then!

The care and feeding of worker threads, part 1

This post is part of a series – go here for the index.

It’s time for another post! After all the time I’ve spent on squeezing about 20% out of the depth rasterizer, I figured it was time to change gears and look at something different again. But before we get started on that new topic, there’s one more set of changes that I want to talk about.

The occlusion test rasterizer

So far, we’ve mostly been looking at one rasterizer only – the one that actually renders the depth buffer we cull against, and even more precisely, only the multi-threaded SSE version of it. But the occlusion culling demo has two sets of rasterizers: the other set is used for the occlusion tests. It renders bounding boxes for the various models to be tested and checks whether they are fully occluded. Check out the code if you’re interested in the details.

This is basically the same rasterizer that we already talked about. In the previous two posts, I talked about optimizing the depth buffer rasterizer, but most of the same changes apply to the test rasterizer too. It didn’t make sense to talk through the same thing again, so I took the liberty of just making the same changes (with some minor tweaks) to the test rasterizer “off-screen”. So, just a heads-up: the test rasterizer has changed while you weren’t looking – unless you closely watch the Github repository, that is.

And now that we’ve established that there’s another inner loop we ought to be aware of, let’s zoom out a bit and look at the bigger picture.

Some open questions

There’s two questions you might have if you’ve been following this series closely so far. The first concerns a very visible difference between the depth and test rasterizers that you might have noticed if you ran the code. It’s also visible in the data in “Depth buffers done quick, part 1”, though I didn’t talk about it at the time. I’m talking, of course, about the large standard deviation we get for the execution time of the occlusion tests. Here’s a set of measurements for the code right after bringing the test rasterizer up to date:

Pass min 25th med 75th max mean sdev
Render depth 2.666 2.716 2.732 2.745 2.811 2.731 0.022
Occlusion test 1.335 1.545 1.587 1.631 1.761 1.585 0.066

Now, the standard deviation actually got a fair bit lower with the rasterizer changes (originally, we were well above 0.1ms), but it’s still surprisingly large, especially considering that the occlusion tests run roughly half as long (in terms of wall-clock time) as the depth rendering. And there’s also a second elephant in the room that’s been staring us in the face for quite a while. Let me recycle one of the VTune screenshots from last time:

Rasterizer hotspots without early-out

Right there at #4 is some code from TBB, namely, what turns out to be the “thread is idle” spin loop.

Well, so far, we’ve been profiling, measuring and optimizing this as if it was a single-threaded application, but it’s not. The code uses TBB to dispatch tasks to worker threads, and clearly, a lot of these worker threads seem to be idle a lot of the time. But why? To answer that question, we need somewhat different information than what either a normal VTune analysis run or our summary timers give us. We want a detailed breakdown of what happens during a frame. Now, VTune has some support for that (as part of their threading/concurrency profiling), but the UI doesn’t work well for me, and neither does the visualization; it seems to be geared towards HPC/throughput computing more than latency-sensitive applications like real-time graphics, and it’s also still based on sampling profiling, which means it’s low-overhead but fairly limited in the kind of data it can collect.

Instead, I’m going to go for the shameless plug and use Telemetry instead (full disclosure: I work at RAD). It works like this: I manually instrument the source code to tell Telemetry when certain events are happening, and Telemetry collects that data, sends the whole log to a server and can later visualize it. Most games I’ve worked on have some kind of “bar graph profiler” that can visualize within-frame events, but because Telemetry keeps the whole data stream, it can also be used to answer the favorite question (not!) of engine programmers everywhere: “Wait, what the hell just happened there?”. Instead of trying to explain it in words, I’m just gonna show you the screenshot of my initial profiling run after I hooked up Telemetry and added some basic markup: (Click on the image to get the full-sized version)

Initial Telemetry run

The time axis goes from left to right, and all of the blocks correspond to regions of code that I’ve marked up. Regions can nest, and when they do, the blocks stack. I’m only using really basic markup right now, because that turns out to be all we need for the time being. The different tracks correspond to different threads.

As you can see, despite the code using TBB and worker threads, it’s fairly rare for more than 2 threads to be actually running anything interesting at a time. Also, if you look at the “Rasterize” and “DepthTest” tasks, you’ll notice that we’re spending a fair amount of time just waiting for the last 2 threads to finish their respective jobs, while the other worker threads are idle. That’s where our variance in latency ultimately comes from – it all depends on how lucky (or unlucky) we get with scheduling, and the exact scheduling of tasks changes every frame. And now that we’ve seen how much time the worker threads spend being idle, it also shouldn’t surprise us that TBB’s idle spin loop ranked as high as it did in the profile.

What do we do about it, though?

Let’s start with something simple

As usual, we go for the low-hanging fruit first: if you look at the left side of the screenshot I just posted, you’ll see a lot of blocks (“zones”). In fact, the count is much higher than you probably think – these are LOD zones, which means that Telemetry has grouped a bunch of very short zones into larger groups for the purposes of visualization. As you can see from the mouse-over text, the single block I’m pointing at with the mouse cursor corresponds to 583 zones – and each of those zones corresponds to an individual TBB task! That’s because this culling code uses one TBB task per model to be culled. Ouch. Let’s zoom in a bit:

Telemetry: occluder visibility, zoomed

Note that even at this zoom level (the whole screen covers about 1.3ms), most zones are still LOD’d out. I’ve mouse-over’ed on a single task that happens to hit one or two L3 cache misses and so is long enough (at about 1500 cycles) to show up individually, but most of these tasks are closer to 600 cycles. In total, frustum culling the approximately 1600 occluder models takes up just above 1ms, as the captions helpfully say. For reference, the much smaller block that says “OccludeesVisible” and takes about 0.1ms? That one actually processes about 27000 models (it’s the code we optimized in “Frustum culling: turning the crank”). Again, ouch.

Fortunately, there’s a simple solution: don’t use one task per model. Instead, use a smaller number of tasks (I just used 32) that each cover multiple models. The code is fairly obvious, so I won’t bother repeating it here, but I am going to show you the results:

Telemetry: Occluder culling fixed

Down from 1ms to 0.08ms in two minutes of work. Now we could apply the same level of optimization as we did to the occludee culling, but I’m not going to bother, because at least for the time being it’s fast enough. And with that out of the way, let’s look at the rasterization and depth testing part.
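Before we do, here’s roughly what the chunked culling looks like (a sketch with hypothetical names, not the demo’s actual code; CullModel stands in for the per-model frustum check):

```cpp
typedef unsigned int UINT;

// Hypothetical sketch: instead of one TBB task per model, launch a
// fixed number of tasks (say 32) and have each one cull a contiguous
// range of models. taskId/taskCount mirror the task manager's
// callback parameters.
void CullModelRange(UINT taskId, UINT taskCount, UINT numModels,
                    void (*CullModel)(UINT))
{
    // Split the models into taskCount roughly equal contiguous chunks.
    UINT chunk = (numModels + taskCount - 1) / taskCount;
    UINT start = taskId * chunk;
    UINT end = (start + chunk < numModels) ? start + chunk : numModels;
    for(UINT i = start; i < end; i++)
        CullModel(i);
}
```

In the task-manager API the demo uses, this amounts to passing 32 as the task count when creating the task set instead of the number of models.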

A closer look

Let’s look a bit more closely at what’s going on during rasterization:

Rasterization close-up

There are at least two noteworthy things clearly visible in this screenshot:

  1. There’s three separate passes – transform, bin, then rasterize.
  2. For some reason, we seem to have an odd mixture of really long tasks and very short ones.

The former shouldn’t come as a surprise, since it’s explicit in the code:

gTaskMgr.CreateTaskSet(&DepthBufferRasterizerSSEMT::TransformMeshes, this,
    NUM_XFORMVERTS_TASKS, NULL, 0, "Xform Vertices", &mXformMesh);
gTaskMgr.CreateTaskSet(&DepthBufferRasterizerSSEMT::BinTransformedMeshes, this,
    NUM_XFORMVERTS_TASKS, &mXformMesh, 1, "Bin Meshes", &mBinMesh);
gTaskMgr.CreateTaskSet(&DepthBufferRasterizerSSEMT::RasterizeBinnedTrianglesToDepthBuffer, this,
    NUM_TILES, &mBinMesh, 1, "Raster Tris to DB", &mRasterize);	

// Wait for the task set
gTaskMgr.WaitForSet(mRasterize);

What the screenshot does show us, however, is the cost of those synchronization points. There sure is a lot of “air” in that diagram, and we could get some significant gains from squeezing it out. The second point is more of a surprise though, because the code does in fact try pretty hard to make sure the tasks are evenly sized. There’s a problem, though:

void TransformedModelSSE::TransformMeshes(...)
{
    if(mVisible)
    {
        // compute mTooSmall

        if(!mTooSmall)
        {
            // transform verts
        }
    }
}

void TransformedModelSSE::BinTransformedTrianglesMT(...)
{
    if(mVisible && !mTooSmall)
    {
        // bin triangles
    }
}

Just because we make sure each task handles an equal number of vertices (as happens for the “TransformMeshes” tasks) or an equal number of triangles (“BinTransformedTriangles”) doesn’t mean they are similarly-sized, because the work subdivision ignores culling. Evidently, the tasks end up not being uniformly sized – not even close. Looks like we need to do some load balancing.

Balancing act

To simplify things, I moved the computation of mTooSmall from TransformMeshes into IsVisible – right after the frustum culling itself. That required shuffling some arguments around, but it’s exactly the kind of thing we already saw in “Frustum culling: turning the crank”, so there’s little point in going over it in detail again.

Once TransformMeshes and BinTransformedTrianglesMT use the exact same condition – mVisible && !mTooSmall – we can determine the list of models that are visible and not too small once, compute how many triangles and vertices these models have in total, and then use these corrected numbers which account for the culling when we’re setting up the individual transform and binning tasks.

This is easy to do: DepthBufferRasterizerSSE gets a few more member variables

UINT *mpModelIndexA; // 'active' models = visible and not too small
UINT mNumModelsA;
UINT mNumVerticesA;
UINT mNumTrianglesA;

and two new member functions

inline void ResetActive()
{
    mNumModelsA = mNumVerticesA = mNumTrianglesA = 0;
}

inline void Activate(UINT modelId)
{
    UINT activeId = mNumModelsA++;
    assert(activeId < mNumModels1);

    mpModelIndexA[activeId] = modelId;
    mNumVerticesA += mpStartV1[modelId + 1] - mpStartV1[modelId];
    mNumTrianglesA += mpStartT1[modelId + 1] - mpStartT1[modelId];
}

that handle the accounting. The depth buffer rasterizer already kept cumulative vertex and triangle counts for all models; I added one more element at the end so I could use the simplified vertex/triangle-counting logic.
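For context, the cumulative-count layout works like this (a sketch with hypothetical data, not the demo’s actual arrays):

```cpp
typedef unsigned int UINT;

// Sketch of the cumulative-count trick: start[i] holds the total
// vertex count of models 0..i-1, with one sentinel entry at the end.
// Any single model's count is then one subtraction, and so is the
// total for any contiguous range of models.
UINT ModelVertexCount(const UINT* start, UINT modelId)
{
    return start[modelId + 1] - start[modelId];
}
```

This is why Activate above can account for each model’s vertices and triangles with a single subtraction each.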

Then, at the end of the IsVisible pass (after the worker threads are done), I run

// Determine which models are active
ResetActive();
for (UINT i=0; i < mNumModels1; i++)
    if(mpTransformedModels1[i].IsRasterized2DB())
        Activate(i);

where IsRasterized2DB() is just a predicate that returns mIsVisible && !mTooSmall (it was already there, so I used it).

After that, all that remains is distributing work over the active models only, using mNumVerticesA and mNumTrianglesA. This is as simple as turning the original loop in TransformMeshes

for(UINT ss = 0; ss < mNumModels1; ss++)

into

for(UINT active = 0; active < mNumModelsA; active++)
{
    UINT ss = mpModelIndexA[active];
    // ...
}

and the same for BinTransformedMeshes. All in all, this took me about 10 minutes to write, debug and test. And with that, we should have proper load balancing for the first two passes of rendering: transform and binning. The question, as always, is: does it help?

Change: Better rendering “front end” load balancing

Version min 25th med 75th max mean sdev
Initial depth render 2.666 2.716 2.732 2.745 2.811 2.731 0.022
Balance front end 2.282 2.323 2.339 2.362 2.476 2.347 0.034

Oh boy, does it ever. That’s a 14.4% reduction on top of what we already got last time. And Telemetry tells us we’re now doing a much better job at submitting uniform-sized tasks:

Balanced rasterization front end

In this frame, there’s still one transform batch that takes longer than the others; this happens sometimes, because of context switches for example. But note that the other threads nicely pick up the slack, and we’re still fine: a ~2x variation on the occasional item isn’t a big deal, provided most items are still roughly the same size. Also note that, even though there’s 8 worker threads, we never seem to be running more than 4 tasks at a time, and the hand-offs between threads (look at what happens in the BinMeshes phase) seem too perfectly synchronized to just happen accidentally. I’m assuming that TBB intentionally never uses more than 4 threads because the machine I’m running this on has a quad-core CPU (albeit with HyperThreading), but I haven’t checked whether this is just a configuration option or not; it probably is.

Balancing the rasterizer back end

Now we can’t do the same trick for the actual triangle rasterization, because it works in tiles, and they just end up with uneven amounts of work depending on what’s on the screen – there’s nothing we can do about that. That said, we’re definitely hurt by the uneven task sizes here too – for example, on my original Telemetry screenshot, you can clearly see how the non-uniform job sizes hurt us:

Initial bad rasterizer balance

The green thread picks up a tile with lots of triangles to render pretty late, and as a result everyone else ends up waiting for him to finish. This is not good.

However, lucky for us, there’s a solution: the TBB task manager will parcel out tasks roughly in the order they were submitted. So all we have to do is to make sure the “big” tiles come first. Well, after binning is done, we know exactly how many triangles end up in each tile. So what we do is insert a single task between binning and rasterization that determines the right order to process the tiles in, then make the actual rasterization depend on it:

gTaskMgr.CreateTaskSet(&DepthBufferRasterizerSSEMT::BinSort, this,
    1, &mBinMesh, 1, "BinSort", &sortBins);
gTaskMgr.CreateTaskSet(&DepthBufferRasterizerSSEMT::RasterizeBinnedTrianglesToDepthBuffer,
    this, NUM_TILES, &sortBins, 1, "Raster Tris to DB", &mRasterize);	

So how does that function look? Well, all we have to do is count how many triangles ended up in each tile, and then sort the tiles by that. The function is so short I’m just gonna show you the whole thing:

void DepthBufferRasterizerSSEMT::BinSort(VOID* taskData,
    INT context, UINT taskId, UINT taskCount)
{
    DepthBufferRasterizerSSEMT* me =
        (DepthBufferRasterizerSSEMT*)taskData;

    // Initialize sequence in identity order and compute total
    // number of triangles in the bins for each tile
    UINT tileTotalTris[NUM_TILES];
    for(UINT tile = 0; tile < NUM_TILES; tile++)
    {
        me->mTileSequence[tile] = tile;

        UINT base = tile * NUM_XFORMVERTS_TASKS;
        UINT numTris = 0;
        for (UINT bin = 0; bin < NUM_XFORMVERTS_TASKS; bin++)
            numTris += me->mpNumTrisInBin[base + bin];

        tileTotalTris[tile] = numTris;
    }

    // Sort tiles by number of triangles, decreasing.
    std::sort(me->mTileSequence, me->mTileSequence + NUM_TILES,
        [&](const UINT a, const UINT b)
        {
            return tileTotalTris[a] > tileTotalTris[b]; 
        });
}

where mTileSequence is just an array of UINTs with NUM_TILES elements. Then we just rename the taskId parameter of RasterizeBinnedTrianglesToDepthBuffer to rawTaskId and start the function like this:

    UINT taskId = mTileSequence[rawTaskId];

and presto, we have bin sorting. Here’s the results:

Change: Sort back-end tiles by amount of work

Version min 25th med 75th max mean sdev
Initial depth render 2.666 2.716 2.732 2.745 2.811 2.731 0.022
Balance front end 2.282 2.323 2.339 2.362 2.476 2.347 0.034
Balance back end 2.128 2.162 2.178 2.201 2.284 2.183 0.029

Once again, we’re 20% down from where we started! Now let’s check in Telemetry to make sure it worked correctly and we weren’t just lucky:

Rasterizer fully balanced

Now that’s just beautiful. See how the whole thing is now densely packed into the live threads, with almost no wasted space? This is how you want your profiles to look. Aside from the fact that our rasterization only seems to be running on 3 threads, that is – there’s always more digging to do. One fun thing I noticed is that TBB actually doesn’t process the tasks fully in-order; the two top threads indeed start from the biggest tiles and work their way forwards, but the bottom-most thread actually starts from the end of the queue, working its way towards the beginning. The tiny LOD zone I’m hovering over covers both the bin sorting task and the seven smallest tiles; the packets get bigger from there.

And with that, I think we have enough changes (and images!) for one post. We’ll continue ironing out scheduling kinks next time, but I think the lesson is already clear: you can’t just toss tasks to worker threads and expect things to go smoothly. If you want to get good thread utilization, better profile to make sure your threads actually do what you think they’re doing! And as usual, you can find the code for this post on Github, albeit without the Telemetry instrumentation for now – Telemetry is a commercial product, and I don’t want to introduce any dependencies that make it harder for people to compile the code. Take care, and until next time.

Optimizing Software Occlusion Culling – index

In January of 2013, some nice folks at Intel released a Software Occlusion Culling demo with full source code. I spent about two weekends playing around with the code, and after realizing that it made a great example for various things I’d been meaning to write about for a long time, started churning out blog posts about it for the next few weeks. This is the resulting series.

Here’s the list of posts (the series is now finished):

  1. “Write combining is not your friend”, on typical write combining issues when writing graphics code.
  2. “A string processing rant”, a slightly over-the-top post that starts with some bad string processing habits and ends in a rant about what a complete minefield the standard C/C++ string processing functions and classes are whenever non-ASCII character sets are involved.
  3. “Cores don’t like to share”, on some very common pitfalls when running multiple threads that share memory.
  4. “Fixing cache issues, the lazy way”. You could redesign your system to be more cache-friendly – but when you don’t have the time or the energy, you could also just do this.
  5. “Frustum culling: turning the crank” – on the other hand, if you do have the time and energy, might as well do it properly.
  6. “The barycentric conspiracy” is a lead-in to some in-depth posts on the triangle rasterizer that’s at the heart of Intel’s demo. It’s also a gripping tale of triangles, Möbius, and a plot centuries in the making.
  7. “Triangle rasterization in practice” – how to build your own precise triangle rasterizer and not die trying.
  8. “Optimizing the basic rasterizer”, because this is real time, not amateur hour.
  9. “Depth buffers done quick, part 1” – at last, looking at (and optimizing) the depth buffer rasterizer in Intel’s example.
  10. “Depth buffers done quick, part 2” – optimizing some more!
  11. “The care and feeding of worker threads, part 1” – this project uses multi-threading; time to look into what these threads are actually doing.
  12. “The care and feeding of worker threads, part 2” – more on scheduling.
  13. “Reshaping dataflows” – using global knowledge to perform local code improvements.
  14. “Speculatively speaking” – on store forwarding and speculative execution, using the triangle binner as an example.
  15. “Mopping up” – a bunch of things that didn’t fit anywhere else.
  16. “The Reckoning” – in which a lesson is learned, but the damage is irreversible.

All the code is available on Github; there’s various branches corresponding to various (simultaneous) tracks of development, including a lot of experiments that didn’t pan out. The articles all reference the blog branch which contains only the changes I talk about in the posts – i.e. the stuff I judged to be actually useful.

Special thanks to Doug McNabb and Charu Chandrasekaran at Intel for publishing the example with full source code and a permissive license, and for saying “yes” when I asked them whether they were okay with me writing about my findings in this way!


CC0


To the extent possible under law,

Fabian Giesen

has waived all copyright and related or neighboring rights to
Optimizing Software Occlusion Culling.

Depth buffers done quick, part 2

This post is part of a series – go here for the index.

Welcome back! At the end of the last post, we had just finished doing a first pass over the depth buffer rendering loops. Unfortunately, the first version of that post listed a final rendering time that was an outlier; more details in the post (which also has been updated to display the timing results in tables).

Notation matters

However, while writing that post, it became clear to me that I needed to do something about those damn over-long Intel SSE intrinsic names. Having them in regular source code is one thing, but it really sucks for presentation when performing two bitwise operations barely fits inside a single line of source code. So I whipped up two helper classes VecS32 (32-bit signed integer) and VecF32 (32-bit float) that are actual C++ implementations of the pseudo-code Vec4i I used in “Optimizing the basic rasterizer”. I then converted a lot of the SIMD code in the project to use those classes instead of dealing with __m128 and __m128i directly.
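If you haven’t seen this pattern before, such a wrapper might look roughly like this (a minimal sketch, not the actual class from the repository – the real one wraps many more operations):

```cpp
#include <emmintrin.h> // SSE2

// Minimal sketch of a VecS32-style wrapper (not the actual class from
// the repository): a thin struct around __m128i with operator
// overloads that map 1:1 to intrinsics, so the generated code is the
// same as with raw intrinsics. The union provides the .lane[i] access
// you'll see in later snippets.
struct VecS32
{
    union
    {
        __m128i simd;
        int lane[4];
    };

    VecS32() {}
    explicit VecS32(int broadcast) { simd = _mm_set1_epi32(broadcast); }
    explicit VecS32(__m128i x)     { simd = x; }
};

inline VecS32 operator+(const VecS32& a, const VecS32& b)
{
    return VecS32(_mm_add_epi32(a.simd, b.simd));
}

inline VecS32 operator-(const VecS32& a, const VecS32& b)
{
    return VecS32(_mm_sub_epi32(a.simd, b.simd));
}
```

The real classes also wrap multiplies, min/max, loads and stores, conversions and so on, but the pattern is the same throughout.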

I’ve used this kind of approach in the past to provide a useful common subset of SIMD operations for cross-platform code; in this case, the main point was to get some basic operator overloads and more convenient notation, but as a happy side effect it’s now much easier to make the code use SSE2 instructions only. The original code uses SSE4.1, but with everything nicely in one place, it’s easy to use MOVMSKPS / CMP instead of PTEST for the mask tests and PSRAD / ANDPS / ANDNOTPS / ORPS instead of BLENDVPS; you just have to do the substitution in one place. I haven’t done that in the code on Github, but I wanted to point out that it’s an option.
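For the curious, the BLENDVPS replacement is the classic select idiom; here’s a sketch (not code from the project), assuming the mask has all bits set in the lanes where you want a:

```cpp
#include <emmintrin.h> // SSE2 only

// Sketch of the SSE2 blend idiom (not code from the project): pick a
// where the mask lane is all-ones, b where it's all-zeros. BLENDVPS
// does this in one instruction on SSE4.1; on SSE2 it's three ops.
inline __m128 Select(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}
```

Likewise, `_mm_movemask_ps` followed by an integer compare stands in for PTEST when all you need to know is which lanes passed a comparison.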

Anyway, I won’t go over the details of either the helper classes (it’s fairly basic stuff) or the modifications to the code (just glorified search and replace), but I will show you one before-after example to illustrate why I did it:

col = _mm_add_epi32(colOffset, _mm_set1_epi32(startXx));
__m128i aa0Col = _mm_mullo_epi32(aa0, col);
__m128i aa1Col = _mm_mullo_epi32(aa1, col);
__m128i aa2Col = _mm_mullo_epi32(aa2, col);

row = _mm_add_epi32(rowOffset, _mm_set1_epi32(startYy));
__m128i bb0Row = _mm_add_epi32(_mm_mullo_epi32(bb0, row), cc0);
__m128i bb1Row = _mm_add_epi32(_mm_mullo_epi32(bb1, row), cc1);
__m128i bb2Row = _mm_add_epi32(_mm_mullo_epi32(bb2, row), cc2);

__m128i sum0Row = _mm_add_epi32(aa0Col, bb0Row);
__m128i sum1Row = _mm_add_epi32(aa1Col, bb1Row);
__m128i sum2Row = _mm_add_epi32(aa2Col, bb2Row);

turns into:

VecS32 col = colOffset + VecS32(startXx);
VecS32 aa0Col = aa0 * col;
VecS32 aa1Col = aa1 * col;
VecS32 aa2Col = aa2 * col;

VecS32 row = rowOffset + VecS32(startYy);
VecS32 bb0Row = bb0 * row + cc0;
VecS32 bb1Row = bb1 * row + cc1;
VecS32 bb2Row = bb2 * row + cc2;

VecS32 sum0Row = aa0Col + bb0Row;
VecS32 sum1Row = aa1Col + bb1Row;
VecS32 sum2Row = aa2Col + bb2Row;

I don’t know about you, but I already find this much easier to parse visually, and the generated code is the same. And as soon as I had this, I just got rid of most of the explicit temporaries since they’re never referenced again anyway:

VecS32 col = VecS32(startXx) + colOffset;
VecS32 row = VecS32(startYy) + rowOffset;

VecS32 sum0Row = aa0 * col + bb0 * row + cc0;
VecS32 sum1Row = aa1 * col + bb1 * row + cc1;
VecS32 sum2Row = aa2 * col + bb2 * row + cc2;

And suddenly, with the ratio of syntactic noise to actual content back to a reasonable range, it’s actually possible to see what’s really going on here in one glance. Even if this was slower – and as I just told you, it’s not – it would still be totally worthwhile for development. You can’t always do it this easily; in particular, with integer SIMD instructions (particularly when dealing with pixels), I often find myself switching between the interpretation of values (“typecasting”), and adding explicit types adds more syntactic noise than it eliminates. But in this case, we actually have several relatively long functions that only deal with either 32-bit ints or 32-bit floats, so it works beautifully.

And just to prove that it really didn’t change the performance:

Change: VecS32/VecF32

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
End of part 1 3.020 3.081 3.095 3.106 3.149 3.093 0.020
Vec[SF]32 3.022 3.056 3.067 3.081 3.153 3.069 0.018

A bit more work on setup

With that out of the way, let’s spiral further outwards and have a look at our triangle setup code. Most of it sets up edge equations etc. for 4 triangles at a time; we only drop down to individual triangles once we’re about to actually rasterize them. Most of this code works exactly as we saw in “Optimizing the basic rasterizer”, but there’s one bit that performs a bit more work than necessary:

// Compute triangle area
VecS32 triArea = A0 * xFormedFxPtPos[0].X;
triArea += B0 * xFormedFxPtPos[0].Y;
triArea += C0;

VecF32 oneOverTriArea = VecF32(1.0f) / itof(triArea);

Contrary to what the comment says :), this actually computes twice the (signed) triangle area and is used to normalize the barycentric coordinates. That’s also why there’s a divide to compute its reciprocal. However, the computation of the area itself is more complicated than necessary and depends on C0. A better way is to just use the direct determinant expression. Since the area is computed in integers, this gives exactly the same results with one operation less, and without the dependency on C0:

VecS32 triArea = B2 * A1 - B1 * A2;
VecF32 oneOverTriArea = VecF32(1.0f) / itof(triArea);
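To convince yourself the two forms agree, just expand both (assuming the demo’s usual edge-equation setup, A0 = y1 − y2, B0 = x2 − x1, C0 = x1 y2 − x2 y1, and cyclically for edges 1 and 2):

```latex
\begin{aligned}
A_0 x_0 + B_0 y_0 + C_0
  &= x_0 (y_1 - y_2) + y_0 (x_2 - x_1) + (x_1 y_2 - x_2 y_1) \\
  &= x_0 y_1 - x_0 y_2 + x_2 y_0 - x_1 y_0 + x_1 y_2 - x_2 y_1 \\
B_2 A_1 - B_1 A_2
  &= (x_1 - x_0)(y_2 - y_0) - (x_0 - x_2)(y_0 - y_1) \\
  &= x_1 y_2 - x_1 y_0 - x_0 y_2 + x_0 y_1 + x_2 y_0 - x_2 y_1
\end{aligned}
```

Both expand to the same six terms (twice the signed area); the determinant form just skips the dependency on C0.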

And talking about the barycentric coordinates, there’s also this part of the setup that is performed per triangle, not across 4 triangles:

VecF32 zz[3], oneOverW[3];
for(int vv = 0; vv < 3; vv++)
{
    zz[vv] = VecF32(xformedvPos[vv].Z.lane[lane]);
    oneOverW[vv] = VecF32(xformedvPos[vv].W.lane[lane]);
}

VecF32 oneOverTotalArea(oneOverTriArea.lane[lane]);
zz[1] = (zz[1] - zz[0]) * oneOverTotalArea;
zz[2] = (zz[2] - zz[0]) * oneOverTotalArea;

The latter two lines perform the half-barycentric interpolation setup; the original code multiplied the zz[i] by oneOverTotalArea here (this is the normalization for the barycentric terms). But note that all the quantities involved here are vectors of four broadcast values; these are really scalar computations, and we can perform them while we’re still dealing with 4 triangles at a time! So right after the triangle area computation, we now do this:

// Z setup
VecF32 Z[3];
Z[0] = xformedvPos[0].Z;
Z[1] = (xformedvPos[1].Z - Z[0]) * oneOverTriArea;
Z[2] = (xformedvPos[2].Z - Z[0]) * oneOverTriArea;

Which allows us to get rid of the second half of the earlier block – all we have to do is load zz from Z[vv] rather than xformedvPos[vv].Z. Finally, the original code sets up oneOverW but never uses it, and it turns out that in this case, VC++’s data flow analysis was not smart enough to figure out that the computation is unnecessary. No matter – just delete that code as well.

So this batch is just a bunch of small, simple, local improvements: getting rid of a little unnecessary work in several places, or just grouping computations more effectively. It’s small fry, but it’s also very low-effort, so why not.

Change: Various minor setup improvements

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
End of part 1 3.020 3.081 3.095 3.106 3.149 3.093 0.020
Vec[SF]32 3.022 3.056 3.067 3.081 3.153 3.069 0.018
Setup cleanups 2.977 3.032 3.046 3.058 3.101 3.045 0.020

As said, it’s minor, but a small win nonetheless.

Garbage in the bins

When I was originally performing the experiments that led to this series, I discovered something funny when I had the code at roughly this stage: occasionally, I would get triangles that had endXx < startXx (or endYy < startYy). I only noticed this because I changed the loop in a way that should have been equivalent, but turned out not to be: I was computing endXx - startXx as an unsigned integer, and it wrapped around, causing the code to start stomping over memory and eventually crash. At the time, I just made a note to investigate this later and added an if to detect the case early for the time being, but when I later came back to figure out what was going on, the explanation turned out to be quite interesting.
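The wraparound itself is easy to reproduce in isolation; here’s a minimal sketch (not the demo’s actual loop code):

```cpp
#include <cstdint>

// Minimal sketch of the bug class described above: subtracting two
// ints and keeping the result as unsigned. If endXx < startXx, the
// difference wraps around to a huge value, and any loop (or memset,
// or pointer arithmetic) bounded by it runs far past where it should.
inline uint32_t quadRowWidth(int startXx, int endXx)
{
    return (uint32_t)(endXx - startXx); // wraps when endXx < startXx
}
```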

So, where do these triangles with empty bounding boxes come from? The actual per-triangle assignments

int startXx = startX.lane[lane];
int endXx   = endX.lane[lane];

just get their values from these vectors:

// Use bounding box traversal strategy to determine which
// pixels to rasterize 
VecS32 startX = vmax(
    vmin(
        vmin(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X),
        xFormedFxPtPos[2].X), VecS32(tileStartX))
    & VecS32(~1);
VecS32 endX = vmin(
    vmax(
        vmax(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X),
        xFormedFxPtPos[2].X) + VecS32(1), VecS32(tileEndX));

Horrible line-breaking aside (I just need to switch to a wider layout), this is fairly straightforward: startX is determined as the minimum of all vertex X coordinates, then clipped against the left tile boundary and finally rounded down to be a multiple of 2 (to align with the 2×2 tiling grid). Similarly, endX is the maximum of vertex X coordinates, clipped against the right boundary of the tile. Since we use an inclusive fill convention but exclusive loop bounds on the right side (the test is for < endXx not <= endXx), there’s an extra +1 in there.

Other than the clip to the tile bounds, this really just computes an axis-aligned bounding rectangle for the triangle and then potentially makes it a little bigger. So really, the only way to get endXx < startXx from this is for the triangle to have an empty intersection with the active tile’s bounding box. But if that’s the case, why was the triangle added to the bin for this tile to begin with? Time to look at the binner code.

The relevant piece of code is here. The bounding box determination for the whole triangle looks as follows:

VecS32 vStartX = vmax(
    vmin(
        vmin(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X), 
        xFormedFxPtPos[2].X), VecS32(0));
VecS32 vEndX   = vmin(
    vmax(
        vmax(xFormedFxPtPos[0].X, xFormedFxPtPos[1].X),
        xFormedFxPtPos[2].X) + VecS32(1), VecS32(SCREENW));

Okay, that’s basically the same we saw before, only we’re clipping against the screen bounds not the tile bounds. And the same happens with Y. Nothing to see here so far, move along. But then, what does the code do with these bounds? Let’s have a look:

// Convert bounding box in terms of pixels to bounding box
// in terms of tiles
int startX = max(vStartX.lane[i]/TILE_WIDTH_IN_PIXELS, 0);
int endX   = min(vEndX.lane[i]/TILE_WIDTH_IN_PIXELS,
                 SCREENW_IN_TILES-1);

int startY = max(vStartY.lane[i]/TILE_HEIGHT_IN_PIXELS, 0);
int endY   = min(vEndY.lane[i]/TILE_HEIGHT_IN_PIXELS,
                 SCREENH_IN_TILES-1);

// Add triangle to the tiles or bins that the bounding box covers
int row, col;
for(row = startY; row <= endY; row++)
{
    int offset1 = YOFFSET1_MT * row;
    int offset2 = YOFFSET2_MT * row;
    for(col = startX; col <= endX; col++)
    {
        // ...
    }
}

And in this loop, the triangles get added to the corresponding bins. So the bug must be somewhere in here. Can you figure out what’s going on?

Okay, I’ll spill. The problem is triangles that are completely outside the top or left screen edges, but not too far outside, and it’s caused by the division at the top. Being regular C division, it’s truncating – that is, it always rounds towards zero (Note: In C99/C++11, it’s actually defined that way; C89 and C++98 leave it up to the compiler, but on x86 all compilers I’m aware of use truncation, since that’s what the hardware does). Say that our tiles measure 100×100 pixels (they don’t, but that doesn’t matter here). What happens if we get a triangle whose bounding box goes from, say, minX=-75 to maxX=-38? First, we compute vStartX to be 0 in that lane (vStartX is clipped against the left edge) and vEndX as -37 (it gets incremented by 1, but not clipped). This looks weird, but is completely fine – that’s an empty rectangle. However, in the computation of startX and endX, we divide both these values by 100, and get zero both times. And since the tile start and end coordinates are inclusive not exclusive (look at the loop conditions!), this is not fine – the leftmost column of tiles goes from x=0 to x=99 (inclusive), and our triangle doesn’t overlap that! Which is why we then get an empty bounding box in the actual rasterizer.
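To make the rounding behavior concrete, here’s a small sketch (the `floorDiv` helper is hypothetical, not part of the demo):

```cpp
// Truncating division (what C/C++ give us) rounds toward zero, so
// coordinates in the tile just left of / above the screen land in
// tile 0 instead of tile -1. Floor division always rounds down, which
// keeps an empty rectangle empty after the divide.
inline int floorDiv(int a, int b) // assumes b > 0
{
    int q = a / b;                 // truncates toward zero
    return (a % b != 0 && (a < 0) != (b < 0)) ? q - 1 : q;
}
```

With 100-pixel tiles, `-37 / 100` truncates to 0 while `floorDiv(-37, 100)` gives -1, so the floor-divided bounds (start 0, end -1) correctly describe an empty tile range.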

There’s two ways to fix this problem. The first is to use “floor division”, i.e. division that always rounds down, no matter the sign. This will again generate an empty rectangle in this case, and everything works fine. However, C/C++ don’t have a floor division operator, so this is somewhat awkward to express in code, and I went for the simpler option: just check whether the bounding rectangle is empty before we even do the divide.

if(vEndX.lane[i] < vStartX.lane[i] ||
   vEndY.lane[i] < vStartY.lane[i]) continue;

And there’s another problem with the code as-is: There’s an off-by-one error. Suppose we have a triangle with maxX=99. Then we’ll compute vEndX as 100 and end up inserting the triangle into the bin for x=100 to x=199, which again it doesn’t overlap. The solution is simple: stop adding 1 to vEndX and clamp it to SCREENW - 1 instead of SCREENW! And with these two issues fixed, we now have a binner that really only bins triangles into tiles intersected by their bounding boxes. Which, in a nice turn of events, also means that our depth rasterizer sees slightly fewer triangles! Does it help?

Change: Fix a few binning bugs

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
End of part 1 3.020 3.081 3.095 3.106 3.149 3.093 0.020
Vec[SF]32 3.022 3.056 3.067 3.081 3.153 3.069 0.018
Setup cleanups 2.977 3.032 3.046 3.058 3.101 3.045 0.020
Binning fixes 2.972 3.008 3.022 3.035 3.079 3.022 0.020

Not a big improvement, but then again, this wasn’t even for performance, it was just a regular bug fix! Always nice when they pay off this way.

One more setup tweak

With that out of the way, there’s one bit of unnecessary work left in our triangle setup: If you look at the current triangle setup code, you’ll notice that we convert all four of X, Y, Z and W to integer (fixed-point), but we only actually look at the integer versions for X and Y. So we can stop converting Z and W. I also renamed the variables to have shorter names, simply to make the code more readable. So this change ends up affecting lots of lines, but the details are trivial, so I’m just going to give you the results:

Change: Don’t convert Z/W to fixed point

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
End of part 1 3.020 3.081 3.095 3.106 3.149 3.093 0.020
Vec[SF]32 3.022 3.056 3.067 3.081 3.153 3.069 0.018
Setup cleanups 2.977 3.032 3.046 3.058 3.101 3.045 0.020
Binning fixes 2.972 3.008 3.022 3.035 3.079 3.022 0.020
No fixed-pt. Z/W 2.958 2.985 2.991 2.999 3.048 2.992 0.012

And with that, we are – finally! – down about 0.1ms from where we ended the previous post.

Time to profile

Evidently, progress is slowing down. This is entirely expected; we’re running out of easy targets. But while we’ve been staring intently at code, we haven’t really done any more in-depth profiling than just looking at overall timings in quite a while. Time to bring out VTune again and check if the situation’s changed since our last detailed profiling run, way back at the start of “Frustum culling: turning the crank”.

Here’s the results:

Rasterization hot spots

Unlike our previous profiling runs, there’s really no smoking guns here. At a CPI rate of 0.459 (so we’re averaging about 2.18 instructions executed per cycle over the whole function!) we’re doing pretty well: in “Frustum culling: turning the crank”, we were still at 0.588 clocks per instruction. There’s a lot of L1 and L2 cache line replacements (i.e. cache lines getting cycled in and out), but that is to be expected – at 320×90 pixels times one float each, our tiles come out at about 112kb, which is larger than our L1 data cache and takes up a significant amount of the L2 cache for each core. But for all that, we don’t seem to be terribly bottlenecked by it; if we were seriously harmed by cache effects, we wouldn’t be running nearly as fast as we do.

Pretty much the only thing we do see is that we seem to be getting a lot of branch mispredictions. Now, if you were to drill into them, you would notice that most of these relate to the row/column loops, so they’re purely a function of the triangle size. However, we do still perform the early-out check for each quad. With the initial version of the code, that’s a slight win (I checked, even though I didn’t bother telling you about it), but that was with a version of the code that had more work in the inner loop, and of course the test itself has some execution cost too. Is it still worthwhile? Let’s try removing it.

Change: Remove “quad not covered” early-out

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
End of part 1 3.020 3.081 3.095 3.106 3.149 3.093 0.020
Vec[SF]32 3.022 3.056 3.067 3.081 3.153 3.069 0.018
Setup cleanups 2.977 3.032 3.046 3.058 3.101 3.045 0.020
Binning fixes 2.972 3.008 3.022 3.035 3.079 3.022 0.020
No fixed-pt. Z/W 2.958 2.985 2.991 2.999 3.048 2.992 0.012
No quad early-out 2.778 2.809 2.826 2.842 2.908 2.827 0.025

And just like that, another 0.17ms evaporate. I could do this all day. Let’s run the profiler again just to see what changed:

Rasterizer hotspots without early-out

Yes, branch mispredicts are down by about half, and cycles spent by about 10%. And we weren’t even that badly bottlenecked on branches to begin with, at least according to VTune! Just goes to show – CPUs really do like their code straight-line.

Bonus: per-pixel increments

There’s a few more minor modifications in the most recent set of changes that I won’t bother talking about, but there’s one more that I want to mention, and that several comments brought up last time: stepping the interpolated depth from pixel to pixel rather than recomputing it from the barycentric coordinates every time. I wanted to do this one last, because unlike our other changes, this one does change the resulting depth buffer noticeably. It’s not a huge difference, but changing the results is something I’ve intentionally avoided doing so far, so I wanted to do this change towards the end of the depth rasterizer modifications so it’s easier to “opt out” from.

That said, the change itself is really easy to make now: only do our current computation

VecF32 depth = zz[0] + itof(beta) * zz[1] + itof(gama) * zz[2];

once per line, and update depth incrementally per pixel (note that doing this properly requires changing the code a little bit, because the original code overwrites depth with the value we store to the depth buffer, but that’s easily changed):

depth += zx;

just like the edge equations themselves, where zx can be computed at setup time as

VecF32 zx = itof(aa1Inc) * zz[1] + itof(aa2Inc) * zz[2];
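A scalar sketch comparing the two evaluation strategies (hypothetical helper, not the demo code):

```cpp
// Evaluate interpolated depth across a row of pixels two ways: once
// directly from the (unnormalized) barycentric terms, and once by
// stepping with the precomputed per-pixel delta zx. Returns the worst
// absolute difference observed; it is 0 in exact arithmetic and stays
// tiny in floats, growing only with accumulated round-off.
inline float maxAbsDiffAcrossRow(float z0, float z1p, float z2p,
                                 int beta0, int gama0,
                                 int betaInc, int gamaInc, int n)
{
    // Setup-time constant, analogous to zx above
    float zx = (float)betaInc * z1p + (float)gamaInc * z2p;
    float stepped = z0 + (float)beta0 * z1p + (float)gama0 * z2p;
    float worst = 0.0f;
    for (int i = 0; i < n; i++)
    {
        float direct = z0 + (float)(beta0 + i * betaInc) * z1p
                          + (float)(gama0 + i * gamaInc) * z2p;
        float d = stepped - direct;
        if (d < 0) d = -d;
        if (d > worst) worst = d;
        stepped += zx; // per-pixel increment, as in the change above
    }
    return worst;
}
```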

It should be easy to see why this produces the same results in exact arithmetic; but of course, in reality, there’s floating-point round-off error introduced in the computation of zx and by the repeated additions, so it’s not quite exact. That said, for our purposes (computing a depth buffer for occlusion culling), it’s probably fine. This gets rid of a lot of instructions in the loop, so it should come as no surprise that it’s faster, but let’s see by how much:

Change: Per-pixel depth increments

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
End of part 1 3.020 3.081 3.095 3.106 3.149 3.093 0.020
Vec[SF]32 3.022 3.056 3.067 3.081 3.153 3.069 0.018
Setup cleanups 2.977 3.032 3.046 3.058 3.101 3.045 0.020
Binning fixes 2.972 3.008 3.022 3.035 3.079 3.022 0.020
No fixed-pt. Z/W 2.958 2.985 2.991 2.999 3.048 2.992 0.012
No quad early-out 2.778 2.809 2.826 2.842 2.908 2.827 0.025
Incremental depth 2.676 2.699 2.709 2.721 2.760 2.711 0.016

Down by about another 0.1ms per frame – which might be less than you expected considering how many instructions we just got rid of. What can I say – we’re starting to bump into other issues again.

Now, there’s more things we could try (isn’t there always?), but I think with five in-depth posts on rasterization and a 21% reduction in median run-time on what already started out as fairly optimized code, it’s time to close this chapter and start looking at other things. Which I will do in the next post. Until then, code for the new batch of changes is, as always, on Github.

Depth buffers done quick, part 1

This post is part of a series – go here for the index.

Welcome back to yet another post on my series about Intel’s Software Occlusion Culling demo. The past few posts were about triangle rasterization in general; at the end of the previous post, we saw how the techniques we’ve been discussing are actually implemented in the code. This time, we’re going to make it run faster – no further delays.

Step 0: Repeatable test setup

But before we change anything, let’s first set up repeatable testing conditions. What I’ve been doing for the previous profiles is start the program from VTune with sample collection paused, manually resume collection once loading is done, then manually exit the demo after about 20 seconds without moving the camera from the starting position.

That was good enough while we were basically just looking for unexpected hot spots that we could speed up massively with relatively little effort. For this round of changes, we expect less drastic differences between variants, so I added code that performs a repeatable testing protocol:

  1. Load the scene as before.
  2. Render 60 frames without measuring performance to allow everything to settle a bit. Graphics drivers tend to perform some initialization work (such as driver-side shader compilation) lazily, so the first few frames with any given data set tend to be spiky.
  3. Tell the profiler to start collecting samples.
  4. Render 600 frames.
  5. Tell the profiler to stop collecting samples.
  6. Exit the program.

The sample already times how much time is spent in rendering the depth buffer and in the occlusion culling (which is another rasterizer that Z-tests a bounding box against the depth buffer prepared in the first step). I also log these measurements and print out some summary statistics at the end of the run. For both the rendering time and the occlusion test time, I print out the minimum, 25th percentile, median, 75th percentile and maximum of all observed values, together with the mean and standard deviation. This should give us a good idea of how these values are distributed. Here’s a first run:


Render time:
  min=3.400ms  25th=3.442ms  med=3.459ms  75th=3.473ms  max=3.545ms
  mean=3.459ms sdev=0.024ms
Test time:
  min=1.653ms  25th=1.875ms  med=1.964ms  75th=2.036ms  max=2.220ms
  mean=1.957ms sdev=0.108ms

and here’s a second run on the same code (and needless to say, the same machine) to test how repeatable these results are:

Render time:
  min=3.367ms  25th=3.420ms  med=3.432ms  75th=3.445ms  max=3.512ms
  mean=3.433ms sdev=0.021ms
Test time:
  min=1.586ms  25th=1.870ms  med=1.958ms  75th=2.025ms  max=2.211ms
  mean=1.941ms sdev=0.119ms

As you can see, the two runs are within about 1% of each other for all the measurements – good enough for our purposes, at least right now. Also, the distribution appears to be reasonably smooth, with the caveat that the depth testing times tend to be fairly noisy. I’ll give you the updated timings after every significant change so we can see how the speed evolves over time. And by the way, just to make that clear, this business of taking a few hundred samples and eyeballing the order statistics is most definitely not a statistically sound methodology. It happens to work out in our case because we have a nice repeatable test and will only be interested in fairly strong effects. But you need to be careful about how you measure and compare performance results in more general settings. That’s a topic for another time, though.
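For illustration, the summary statistics above can be computed along these lines (a hypothetical helper, not the demo’s actual logging code; it uses the simple “sorted sample at index n·p” percentile rule):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Order-statistics summary of per-frame timings: sort, then read off
// min / 25th / median / 75th / max, plus mean and (population)
// standard deviation. Crude but sufficient for eyeballing runs.
struct Summary { double min, p25, med, p75, max, mean, sdev; };

inline Summary summarize(std::vector<double> t)
{
    std::sort(t.begin(), t.end());
    size_t n = t.size();
    double sum = 0.0, sumSq = 0.0;
    for (double x : t) { sum += x; sumSq += x * x; }
    double mean = sum / n;
    Summary s;
    s.min  = t.front();
    s.p25  = t[n / 4];
    s.med  = t[n / 2];
    s.p75  = t[(3 * n) / 4];
    s.max  = t.back();
    s.mean = mean;
    s.sdev = std::sqrt(sumSq / n - mean * mean);
    return s;
}
```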

Now, to make it a bit more readable as I add more observations, I’ll present the results in a table as follows: (this is the render time)

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021

I won’t bother with the test time here (even though the initial version of this post did) because the code doesn’t get changed; it’s all noise.

Step 1: Get rid of special cases

Now, if you followed the links to the code I posted last time, you might’ve noticed that the code checks the variable gVisualizeDepthBuffer multiple times, even in the inner loop. An example is this passage that loads the current depth buffer values at the target location:

__m128 previousDepthValue;
if(gVisualizeDepthBuffer)
{
    previousDepthValue = _mm_set_ps(pDepthBuffer[idx],
        pDepthBuffer[idx + 1],
        pDepthBuffer[idx + SCREENW],
        pDepthBuffer[idx + SCREENW + 1]);
}
else
{
    previousDepthValue = *(__m128*)&pDepthBuffer[idx];
}

I briefly mentioned this last time: this rasterizer processes blocks of 2×2 pixels at a time. If depth buffer visualization is on, the depth buffer is stored in the usual row-major layout normally used for 2D arrays in C/C++: In memory, we first have all pixels for the (topmost) row 0 (left to right), then all pixels for row 1, and so forth for the whole size of the image. If you draw a diagram of how the pixels are laid out in memory, it looks like this:

8x8 pixels in raster-scan order

8×8 pixels in raster-scan order

This is also the format that graphics APIs typically expect you to pass textures in. But if you’re writing pixels blocks of 2×2 at a time, that means you always need to split your reads (and writes) into two accesses to the two affected rows – annoying. By contrast, if depth buffer visualization is off, the code uses a tiled layout that looks more like this:

8x8 pixels in a 2x2 tiled layout

8×8 pixels in a 2×2 tiled layout

This layout doesn’t break up the 2×2 groups of pixels; in effect, instead of a 2D array of pixels, we now have a 2D array of 2×2 pixel blocks. This is a so-called “tiled” layout; I’ve written about this before if you’re not familiar with the concept. Tiled layouts make access much easier and faster provided that our 2×2 blocks are always at properly aligned positions – we would still need to access multiple locations if we wanted to read our 2×2 pixels from, say, an odd instead of an even row. The rasterizer code always keeps the 2×2 blocks aligned to even x and y coordinates to make sure depth buffer accesses can be done quickly.
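To make the addressing concrete, here’s a sketch of both index computations (plain C++; the quad base matches how the rasterizer steps its indices, but the ordering of the four pixels *within* a quad is my assumption here and the demo may differ):

```cpp
// Linear (raster-scan) addressing: row-major 2D array.
inline int linearIndex(int x, int y, int w) // w = screen width
{
    return y * w + x;
}

// 2x2 tiled addressing: each aligned 2x2 quad occupies 4 consecutive
// elements. The quad base advances by 2*w per two rows and by 4 per
// two columns; within the quad, pixels are assumed row-major.
inline int tiledIndex(int x, int y, int w) // w assumed even
{
    int quadBase = (y & ~1) * w + (x & ~1) * 2;
    return quadBase + (y & 1) * 2 + (x & 1);
}
```

The payoff is that an aligned 2×2 quad at even (x, y) lives at `tiledIndex(x, y, w)` through `tiledIndex(x, y, w) + 3`, so it can be read or written with a single 128-bit access.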

The tiled layout provides better performance, so it’s the one we want to use in general. So instead of switching to linear layout when the user wants to see the depth buffer, I changed the code to always store the depth buffer tiled, and then perform the depth buffer visualization using a custom pixel shader that knows how to read the pixels in tiled format. It took me a bit of time to figure out how to do this within the app framework, but it really wasn’t hard. Once that’s done, there’s no need to keep the linear storage code around, and a bunch of special cases just disappear. Caveat: The updated code assumes that the depth buffer is always stored in tiled format; this is true for the SSE versions of the rasterizers, but not the scalar versions that the demo also showcases. It shouldn’t be hard to use a different shader when running the scalar variants, but I didn’t bother maintaining them in my branches because they’re only there for illustration anyway.

So, we always use the tiled layout (but we did that throughout the test run before too, since I don’t enable depth buffer visualization in it!) and we get rid of the alternative paths completely. Does it help?

Change: Remove support for linear depth buffer layout.

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
Always tiled depth 3.357 3.416 3.428 3.443 3.486 3.429 0.021

We get a slightly lower value for the render time, but that doesn’t necessarily mean much, because it’s still within a little more than a standard deviation of the previous measurements. And the difference in depth test performance is easily within a standard deviation too. So there’s no appreciable difference from this change by itself; turns out that modern x86s are pretty good at dealing with branches that always go the same way. It did simplify the code, though, which will make further optimizations easier. Progress.

Step 2: Try to do a little less work

Let me show you the whole inner loop (with some cosmetic changes so it fits in the layout, damn those overlong Intel SSE intrinsics) so you can see what I’m talking about:

for(int c = startXx; c < endXx;
        c += 2,
        idx += 4,
        alpha = _mm_add_epi32(alpha, aa0Inc),
        beta  = _mm_add_epi32(beta, aa1Inc),
        gama  = _mm_add_epi32(gama, aa2Inc))
{
    // Test Pixel inside triangle
    __m128i mask = _mm_cmplt_epi32(fxptZero, 
        _mm_or_si128(_mm_or_si128(alpha, beta), gama));
					
    // Early out if all of this quad's pixels are
    // outside the triangle.
    if(_mm_test_all_zeros(mask, mask))
        continue;
					
    // Compute barycentric-interpolated depth
    __m128 betaf = _mm_cvtepi32_ps(beta);
    __m128 gamaf = _mm_cvtepi32_ps(gama);
    __m128 depth = _mm_mul_ps(_mm_cvtepi32_ps(alpha), zz[0]);
    depth = _mm_add_ps(depth, _mm_mul_ps(betaf, zz[1]));
    depth = _mm_add_ps(depth, _mm_mul_ps(gamaf, zz[2]));

    __m128 previousDepthValue = *(__m128*)&pDepthBuffer[idx];

    __m128 depthMask = _mm_cmpge_ps(depth, previousDepthValue);
    __m128i finalMask = _mm_and_si128(mask,
        _mm_castps_si128(depthMask));

    depth = _mm_blendv_ps(previousDepthValue, depth,
        _mm_castsi128_ps(finalMask));
    _mm_store_ps(&pDepthBuffer[idx], depth);
}

As I said last time, we expect at least 50% of the pixels inside an average triangle’s bounding box to be outside the triangle. This loop neatly splits into two halves: The first half is until the early-out tests, and simply steps the edge equations and tests whether any pixels within the current 2×2 pixel block (quad) are inside the triangle. The second half then performs barycentric interpolation and the depth buffer update.

Let’s start with the top half. At first glance, there doesn’t appear to be much we can do about the amount of work we do, at least with regards to the SSE operations: we need to step the edge equations (inside the for statement). The code already does the OR trick to only do one comparison. And we use a single test (which compiles into the PTEST instruction) to check whether we can skip the quad. Not much we can do here, or is there?

Well, turns out there’s one thing: we can get rid of the compare. Remember that for two’s complement integers, compares of the type x < 0 or x >= 0 can be performed by just looking at the sign bit. Unfortunately, the test here is of the form x > 0, which isn’t as easy – couldn’t it be >= 0 instead?

Turns out: it could. Because our x is only ever 0 when all three edge functions are 0 – that is, the current pixel lies right on all three edges at the same time. And the only way that can ever happen is for the triangle to be degenerate (zero-area). But we never rasterize zero-area triangles – they get culled before we ever reach this loop! So the case x == 0 can never actually happen, which means it makes no difference whether we write x >= 0 or x > 0. And the condition x >= 0, we can implement by simply checking whether the sign bit is zero. Whew! Okay, so we get:

__m128i mask = _mm_or_si128(_mm_or_si128(alpha, beta), gama);

Now, how do we test the sign bit without using an extra instruction? Well, it turns out that the instruction we use to determine whether we should early-out is PTEST, which already performs a binary AND. And it also turns out that the check we need (“are the sign bits set for all four lanes?”) can be implemented using the very same instruction:

if(_mm_testc_si128(mask, _mm_set1_epi32(0x80000000)))
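In scalar terms, the check computes the following (a sketch over a plain 4-lane array; `allLanesNegative` is a hypothetical stand-in for the intrinsic):

```cpp
#include <cstdint>

// Scalar model of the PTEST trick: _mm_testc_si128(a, b) sets CF iff
// ((NOT a) AND b) == 0. With a = mask and b = broadcast 0x80000000,
// CF is set exactly when the sign bit is set in all four lanes --
// i.e. when every pixel of the quad is outside the triangle.
inline bool allLanesNegative(const int32_t lane[4])
{
    uint32_t notAnd = 0;
    for (int i = 0; i < 4; i++)
        notAnd |= ~(uint32_t)lane[i] & 0x80000000u; // (~mask) & signbit
    return notAnd == 0; // CF semantics of PTEST
}
```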

Note that the semantics of mask have changed, though: before, each SIMD lane held either the value 0 (“point outside triangle”) or -1 (“point inside triangle”). Now, it either holds a nonnegative value (sign bit 0, “point inside triangle”) or a negative one (sign bit 1, “point outside triangle”). The instructions that end up using this value only care about the sign bit, but still, we ended up exactly flipping which one indicates “inside” and which one means “outside”. Lucky for us, that’s easily remedied in the computation of finalMask, still only by changing ops without adding any:

__m128i finalMask = _mm_andnot_si128(mask,
    _mm_castps_si128(depthMask));

We simply use andnot instead of and. Okay, I admit that was a bit of trouble to get rid of a single instruction, but this is a tight inner loop that’s not being slowed down by memory effects or other micro-architectural issues. In short, this is one of the (nowadays rare) places where that kind of stuff actually matters. So, did it help?

Change: Get rid of compare.

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
Always tiled depth 3.357 3.416 3.428 3.443 3.486 3.429 0.021
One compare less 3.250 3.296 3.307 3.324 3.434 3.313 0.025

Yes indeed: render time is down by 0.1ms – about 4 standard deviations, a significant win (and yes, this is repeatable). To be fair, as we’ve already seen in previous posts: this is unlikely to be solely attributable to removing a single instruction. Even if we remove (or change) just one intrinsic in the source code, this can have ripple effects on register allocation and scheduling that together make a larger difference. And just as importantly, sometimes changing the code in any way at all will cause the compiler to accidentally generate a code placement that performs better at run time. So it would be foolish to take all the credit – but still, it sure is nice when this kind of thing happens.

Step 2b: Squeeze it some more

Next, we look at the second half of the loop, after the early-out. This half is easier to find worthwhile targets in. Currently, we perform full barycentric interpolation to get the per-pixel depth value:

z = \alpha z_0 + \beta z_1 + \gamma z_2

Now, as I mentioned at the end of “The barycentric conspiracy”, we can use the alternative form

z = z_0 + \beta (z_1 - z_0) + \gamma (z_2 - z_0)

when the barycentric coordinates are normalized, or more generally

\displaystyle z = z_0 + \beta \left(\frac{z_1 - z_0}{\alpha + \beta + \gamma}\right) + \gamma \left(\frac{z_2 - z_0}{\alpha + \beta + \gamma}\right)

when they’re not. And since the terms in parentheses are constants, we can compute them once, and get rid of an int-to-float conversion and a multiply in the inner loop – two fewer instructions for a bit of extra setup work once per triangle. Namely, our per-triangle setup computation goes from

__m128 oneOverArea = _mm_set1_ps(oneOverTriArea.m128_f32[lane]);
zz[0] *= oneOverArea;
zz[1] *= oneOverArea;
zz[2] *= oneOverArea;

to

__m128 oneOverArea = _mm_set1_ps(oneOverTriArea.m128_f32[lane]);
zz[1] = (zz[1] - zz[0]) * oneOverArea;
zz[2] = (zz[2] - zz[0]) * oneOverArea;

and our per-pixel interpolation goes from

__m128 depth = _mm_mul_ps(_mm_cvtepi32_ps(alpha), zz[0]);
depth = _mm_add_ps(depth, _mm_mul_ps(betaf, zz[1]));
depth = _mm_add_ps(depth, _mm_mul_ps(gamaf, zz[2]));

to

__m128 depth = zz[0];
depth = _mm_add_ps(depth, _mm_mul_ps(betaf, zz[1]));
depth = _mm_add_ps(depth, _mm_mul_ps(gamaf, zz[2]));
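As a scalar sanity check (hypothetical helpers in plain C++), the two interpolation forms agree exactly in real arithmetic, since α = 2·area − β − γ:

```cpp
// Full barycentric interpolation with unnormalized coordinates:
// z = (alpha*z0 + beta*z1 + gama*z2) / area2, where area2 = 2*area.
inline double interpDirect(double z0, double z1, double z2,
                           double alpha, double beta, double gama,
                           double area2)
{
    return (alpha * z0 + beta * z1 + gama * z2) / area2;
}

// Rebased form: z = z0 + beta*(z1-z0)/area2 + gama*(z2-z0)/area2.
// The two parenthesized terms are per-triangle constants, so the
// per-pixel work drops by one multiply (and, in the SSE version,
// one int-to-float conversion).
inline double interpRebased(double z0, double z1, double z2,
                            double beta, double gama, double area2)
{
    return z0 + beta * ((z1 - z0) / area2) + gama * ((z2 - z0) / area2);
}
```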

And what do our timings say?

Change: Alternative interpolation formula

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
Always tiled depth 3.357 3.416 3.428 3.443 3.486 3.429 0.021
One compare less 3.250 3.296 3.307 3.324 3.434 3.313 0.025
Simplify interp. 3.195 3.251 3.265 3.276 3.332 3.264 0.024

Render time is down by about another 0.05ms, and the whole distribution has shifted down by roughly that amount (without increasing variance), so this seems likely to be an actual win.

Finally, there’s another place where we can make a difference by better instruction selection. Our current depth buffer update code looks as follows:

    __m128 previousDepthValue = *(__m128*)&pDepthBuffer[idx];

    __m128 depthMask = _mm_cmpge_ps(depth, previousDepthValue);
    __m128i finalMask = _mm_andnot_si128(mask,
        _mm_castps_si128(depthMask));

    depth = _mm_blendv_ps(previousDepthValue, depth,
        _mm_castsi128_ps(finalMask));

finalMask here is a mask that encodes “pixel lies inside the triangle AND has a larger depth value than the previous pixel at that location”. The blend instruction then selects the new interpolated depth value for the lanes where finalMask has the sign bit (MSB) set, and the previous depth value elsewhere. But we can do slightly better, because SSE provides MAXPS, which directly computes the maximum of two floating-point numbers. Using max, we can rewrite this expression to read:

    __m128 previousDepthValue = *(__m128*)&pDepthBuffer[idx];
    __m128 mergedDepth = _mm_max_ps(depth, previousDepthValue);
    depth = _mm_blendv_ps(mergedDepth, previousDepthValue,
        _mm_castsi128_ps(mask));

This is a slightly different way to phrase the solution – “pick whichever is largest of the previous and the interpolated depth value, and use that as new depth if this pixel is inside the triangle, or stick with the old depth otherwise” – but it’s equivalent, and we lose yet another instruction. And just as important on the notoriously register-starved 32-bit x86, it also needs one less temporary register.
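A scalar model of the two update rules (hypothetical helpers) makes it easy to convince yourself they match for every combination of coverage and depth ordering:

```cpp
// Old form: keep the new depth only where "inside AND depth >= prev".
inline float depthUpdateOld(float depth, float prev, bool inside)
{
    bool finalMask = inside && (depth >= prev);
    return finalMask ? depth : prev;
}

// New form: take max(depth, prev) unconditionally (MAXPS), then fall
// back to prev where the rewritten mask says "outside" (sign bit set).
inline float depthUpdateNew(float depth, float prev, bool outside)
{
    float merged = (depth > prev) ? depth : prev; // models MAXPS
    return outside ? prev : merged;
}
```

Note the flipped sense of the mask argument: `inside` in the old form corresponds to `!outside` in the new one, matching the sign-bit trick from earlier.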

Let’s check whether it helps!

Change: Alternative depth update formula

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
Always tiled depth 3.357 3.416 3.428 3.443 3.486 3.429 0.021
One compare less 3.250 3.296 3.307 3.324 3.434 3.313 0.025
Simplify interp. 3.195 3.251 3.265 3.276 3.332 3.264 0.024
Revise depth update 3.152 3.182 3.196 3.208 3.316 3.198 0.025

It does appear to shave off another 0.05ms, bringing the total savings due to our instruction-shaving up to about 0.2ms – about a 6% reduction in running time so far. Considering that we started out with code that was already SIMDified and fairly optimized to start with, that’s not a bad haul at all. But we seem to have exhausted the obvious targets. Does that mean that this is as fast as it’s going to go?

Step 3: Show the outer loops some love

Of course not. This is actually a common mistake people make during optimization sessions: focusing on the innermost loops to the exclusion of everything else. Just because a loop is at the innermost nesting level doesn’t necessarily mean it’s more important than everything else. A profiler can help you figure out how often code actually runs, but in our case, I’ve already mentioned several times that we’re dealing with lots of small triangles. This means that we may well run through our innermost loop only once or twice per row of 2×2 blocks! And for a lot of triangles, we’ll only do one or two such rows, too. Which means we should definitely also pay attention to the work we do per block row and per triangle.

So let’s look at our row loop:

for(int r = startYy; r < endYy;
        r += 2,
        row  = _mm_add_epi32(row, _mm_set1_epi32(2)),
        rowIdx = rowIdx + 2 * SCREENW,
        bb0Row = _mm_add_epi32(bb0Row, bb0Inc),
        bb1Row = _mm_add_epi32(bb1Row, bb1Inc),
        bb2Row = _mm_add_epi32(bb2Row, bb2Inc))
{
    // Compute barycentric coordinates 
    int idx = rowIdx;
    __m128i alpha = _mm_add_epi32(aa0Col, bb0Row);
    __m128i beta = _mm_add_epi32(aa1Col, bb1Row);
    __m128i gama = _mm_add_epi32(aa2Col, bb2Row);

    // <Column loop here>
}

Okay, we don’t even need to get fancy here – there’s two things that immediately come to mind. First, we seem to be updating row even though nobody in this loop (or the inner loop) uses it. That’s not a performance problem – standard dataflow analysis techniques in compilers are smart enough to figure this kind of stuff out and just eliminate the computation – but it’s still unnecessary code that we can just remove, so we should. Second, we add the initial column terms of the edge equations (aa0Col, aa1Col, aa2Col) to the row terms (bb0Row etc.) every line. There’s no need to do that – the initial column terms don’t change during the row loop, so we can just do these additions once per triangle!

So before the loop, we add:

    __m128i sum0Row = _mm_add_epi32(aa0Col, bb0Row);
    __m128i sum1Row = _mm_add_epi32(aa1Col, bb1Row);
    __m128i sum2Row = _mm_add_epi32(aa2Col, bb2Row);

and then we change the row loop itself to read:

for(int r = startYy; r < endYy;
        r += 2,
        rowIdx = rowIdx + 2 * SCREENW,
        sum0Row = _mm_add_epi32(sum0Row, bb0Inc),
        sum1Row = _mm_add_epi32(sum1Row, bb1Inc),
        sum2Row = _mm_add_epi32(sum2Row, bb2Inc))
{
    // Compute barycentric coordinates 
    int idx = rowIdx;
    __m128i alpha = sum0Row;
    __m128i beta = sum1Row;
    __m128i gama = sum2Row;

    // <Column loop here>
}

That’s probably the most straightforward of all the changes we’ve seen so far. But still, it’s in an outer loop, so we wouldn’t expect to get as much out of this as if we had saved the equivalent amount of work in the inner loop. Any guesses for how much it actually helps?

Change: Straightforward tweaks to the outer loop

Version min 25th med 75th max mean sdev
Initial 3.367 3.420 3.432 3.445 3.512 3.433 0.021
Always tiled depth 3.357 3.416 3.428 3.443 3.486 3.429 0.021
One compare less 3.250 3.296 3.307 3.324 3.434 3.313 0.025
Simplify interp. 3.195 3.251 3.265 3.276 3.332 3.264 0.024
Revise depth update 3.152 3.182 3.196 3.208 3.316 3.198 0.025
Tweak row loop 3.020 3.081 3.095 3.106 3.149 3.093 0.020

I bet you didn’t expect that one. I think I’ve made my point.

UPDATE: An earlier version had what turned out to be an outlier measurement here (mean of exactly 3ms). Every 10 runs or so, I get a run that is a bit faster than usual; I haven’t found out why yet, but I’ve updated the list above to show a more typical measurement. It’s still a solid win, just not as big as initially posted.

And with the mean run time of our depth buffer rasterizer down by about 10% from the start, I think this should be enough for one post. As usual, I’ve updated the head of the blog branch on Github to include today’s changes, if you’re interested. Next time, we’ll look a bit more at the outer loops and whip out VTune again for a surprise discovery! (Well, surprising for you anyway.)

By the way, this is one of these code-heavy play-by-play posts. With my regular articles, I’m fairly confident that the format works as a vehicle for communicating ideas, but this here is more like an elaborate case study. I know that I have fun writing in this format, but I’m not so sure if it actually succeeds at delivering valuable information, or if it just turns into a parade of super-specialized tricks that don’t seem to generalize in any useful way. I’d appreciate some input before I start knocking out more posts like this :). Anyway, thanks for reading, and until next time!

Optimizing the basic rasterizer

This post is part of a series – go here for the index.

Last time, we saw how to write a simple triangle rasterizer, analyzed its behavior with regard to integer overflows, and discussed how to modify it to incorporate sub-pixel precision and fill rules. This time, we’re going to make it run fast. But before we get started, I want to get one thing out of the way:

Why this kind of algorithm?

The algorithm we’re using basically loops over a bunch of candidate pixels and checks whether they’re inside the triangle. This is not the only way to render triangles, and if you’ve written any software rendering code in the past, chances are good that you used a scanline rasterization approach instead: you scan the triangle from top to bottom and determine, for each scan line, where the triangle starts and ends along the x axis. Then we can just fill in all the pixels in between. This can be done by keeping track of so-called active edges (triangle edges that intersect the current scan line) and tracking their intersection point from line to line using what is essentially a modified line-drawing algorithm. While the high-level overview is easy enough, the details get fairly subtle, as for example the first two articles from Chris Hecker’s 1995-96 series on perspective texture mapping explain (links to the whole series here).

More importantly though, this kind of algorithm is forced to work line by line. This has a number of annoying implications for both modern software and hardware implementations: the algorithm is asymmetrical in x and y, which means that a very skinny triangle that’s mostly horizontal has a very different performance profile from one that’s mostly vertical. The outer scanline loop is serial, which is a serious problem for hardware implementations. The inner loop isn’t very SIMD-friendly – you want to be processing aligned groups of several pixels (usually at least 4) at once, which means you need special cases for the start of a scan line (to get up to alignment), the end of a scan line (to finish the last partial group of pixels), and short lines (scan line is over before we ever get to an aligned position). Which makes the whole thing even more orientation-dependent. If you’re trying to do mip mapping at the same time, you typically work on “quads”, groups of 2×2 pixels (explanation for why is here). Now you need to trace out two scan lines at the same time, which boils down to keeping track of the current scan conversion state for both even and odd edges separately. With two lines instead of one, the processing for the starts and end of a scan line gets even worse than it already is. And let’s not even talk about supporting pixel sample positions that aren’t strictly on a grid, as for example used in multisample antialiasing. It all goes downhill fast.

I think I’ve made my point: while scan-line rasterization works great when you’re working one scan line at a time anyway, it gets hairy quickly once you throw additional requirements such as “aligned access”, “multiple rows at a time” or “variable sample position” into the mix. And it’s not very parallel, which hamstrings our ability to harness wide SIMD or build efficient hardware for it. In contrast, the algorithm we’ve been discussing is embarrassingly parallel – you can test as many pixels as you want at the same time, you can use arbitrary sample locations, and if you have specific alignment requirements, you can test pixels in groups that satisfy those requirements easily. There’s a lot to be said for those properties, and indeed they’ve proven convincing enough that by now, the edge function approach is the method of choice in high-performance software rasterizers – in graphics hardware, it’s been in use for a good while longer, starting in the late 80s (yes, 80s – not a typo). I’ll talk a bit more about the history later.

Right, however, now we still perform two multiplies and five subtractions per edge, per pixel. SIMD and dedicated silicon are one thing, but that’s still a lot of work for a single pixel, and it most definitely was not a practical way to perform hardware rasterization in 1988. What we need to do now is drastically simplify our inner loop. Luckily, we’ve seen everything we need to do that already.

Simplifying the rasterizer

If you go back to “The barycentric conspiracy”, you’ll notice that we already derived an alternative formulation of the edge functions by rearranging and simplifying the determinant expression:

F_{01}(p) = (v_{0y} - v_{1y}) p_x + (v_{1x} - v_{0x}) p_y + (v_{0x} v_{1y} - v_{0y} v_{1x})

Now, to reduce the amount of noise, let’s give those terms in parentheses names:

A_{01} := v_{0y} - v_{1y}
B_{01} := v_{1x} - v_{0x}
C_{01} := v_{0x} v_{1y} - v_{0y} v_{1x}

And if we split p into its x and y components, we get:

F_{01}(p_x, p_y) = A_{01} p_x + B_{01} p_y + C_{01}

Now, in every iteration of our inner loop, we move one pixel to the right, and for every scan line, we move one pixel up or down (depending on which way your y axis points – note I haven’t bothered to specify that yet!) from the start of the previous scan line. Both of these updates are really easy to perform since F01 is an affine function and we’re stepping along the coordinate axes:

F_{01}(p_x + 1, p_y) - F_{01}(p_x, p_y) = A_{01}
F_{01}(p_x, p_y + 1) - F_{01}(p_x, p_y) = B_{01}

In words, if you go one step to the right, add A01 to the edge equation. If you step down/up (whichever direction +y is in your coordinate system), add B01. That’s it. That’s all there is to it.

In our basic triangle rasterization loop, this turns into something like this: (I’ll keep using the original orient2d for the initial setup so we can see the similarity):

    // Bounding box and clipping as before
    // ...

    // Triangle setup
    int A01 = v0.y - v1.y, B01 = v1.x - v0.x;
    int A12 = v1.y - v2.y, B12 = v2.x - v1.x;
    int A20 = v2.y - v0.y, B20 = v0.x - v2.x;

    // Barycentric coordinates at minX/minY corner
    Point2D p = { minX, minY };
    int w0_row = orient2d(v1, v2, p);
    int w1_row = orient2d(v2, v0, p);
    int w2_row = orient2d(v0, v1, p);

    // Rasterize
    for (p.y = minY; p.y <= maxY; p.y++) {
        // Barycentric coordinates at start of row
        int w0 = w0_row;
        int w1 = w1_row;
        int w2 = w2_row;

        for (p.x = minX; p.x <= maxX; p.x++) {
            // If p is on or inside all edges, render pixel.
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                renderPixel(p, w0, w1, w2);     

            // One step to the right
            w0 += A12;
            w1 += A20;
            w2 += A01;
        }

        // One row step
        w0_row += B12;
        w1_row += B20;
        w2_row += B01;
    }

And just like that, we’re down to three additions per pixel. Want proper fill rules? As we saw last time, we can do that using a single bias that we add to the edge functions, and we only have to add it once, at the start. Sub-pixel precision? Again, a bit more work during triangle setup, but the inner loop stays the same. Different pixel center? Turns out that’s just a bias applied once too. Want to sample at several locations within a pixel? That also turns into just another add and a sign test.

In fact, after triangle setup, it’s really mostly adds and sign tests no matter what we do. That’s why this is a popular algorithm for hardware implementation – you don’t even need to do the compare explicitly, you just use a bunch of adders and route the MSB (most significant bit) of the sum, which contains the sign bit, to whoever needs to know whether the pixel is in or not.

And on the subject of signs, there’s a small trick in software implementations to simplify the sign-testing part: as I just said, all we really need is the sign bit. If it’s clear, we know the value is positive or zero, and if it’s set, we know the value is negative. In fact, this is why I made the initial rasterizer test for >= 0 in the first place – you really want to use a test that only depends on the sign bit, and not something slightly more complicated like > 0. Why do we care? Because it allows us to rewrite the three sign tests like this:

    // If p is on or inside all edges, render pixel.
    if ((w0 | w1 | w2) >= 0)
        renderPixel(p, w0, w1, w2);     

To understand why this works, you only need to look at the sign bits. Remember, if the sign bit is set in a value, that means it’s negative. If, after ORing the three values together, they still register as non-negative, that means none of them had the sign bit set – which is exactly what we wanted to test for. Rewriting the expression like this turns three conditional branches into one – always a good idea to keep the flow control in inner loops simple if you want the optimizer to be happy, and it usually also turns out to be beneficial in terms of branch prediction, although I won’t bother to profile it here.

Processing multiple pixels at once

However, as fun as squeezing individual integer instructions is, the main reason I cited for using this algorithm is that it’s embarrassingly parallel, so it’s easy to process multiple pixels at the same time using either dedicated silicon (in hardware) or SIMD instructions (in software). In fact, all we really have to do is keep track of the current value of the edge equations for each pixel, and then update them all per pixel. For concreteness, let’s stick with 4-wide SIMD (e.g. SSE2). I’m going to assume that there’s a data type Vec4i for 4 signed integers in a SIMD registers that overloads the usual arithmetic operations to be element-wise, because I don’t want to use the official Intel intrinsics here (way too much clutter to see what’s going on).

For starters, let’s assume we want to process 4×1 pixels at a time – that is, in groups 4 pixels wide, but only one pixel high. But before we do anything else, let me just pull all the per-edge setup into a single function:

struct Edge {
    // Dimensions of our pixel group
    static const int stepXSize = 4;
    static const int stepYSize = 1;

    Vec4i oneStepX;
    Vec4i oneStepY;

    Vec4i init(const Point2D& v0, const Point2D& v1,
               const Point2D& origin);
};

Vec4i Edge::init(const Point2D& v0, const Point2D& v1,
                 const Point2D& origin)
{
    // Edge setup
    int A = v0.y - v1.y, B = v1.x - v0.x;
    int C = v0.x*v1.y - v0.y*v1.x;

    // Step deltas
    oneStepX = Vec4i(A * stepXSize);
    oneStepY = Vec4i(B * stepYSize);

    // x/y values for initial pixel block
    Vec4i x = Vec4i(origin.x) + Vec4i(0,1,2,3);
    Vec4i y = Vec4i(origin.y);

    // Edge function values at origin
    return Vec4i(A)*x + Vec4i(B)*y + Vec4i(C);
}

As said, this is the setup for one edge, but it already includes all the “magic” necessary to set it up for SIMD traversal. Which is really not much – we now step in units larger than one pixel, hence the oneStep values instead of using A and B directly. Also, we now return the edge function value at the specified “origin” directly; this is the value we previously computed with orient2d. Now that we’re processing 4 pixels at a time, we also have 4 different initial values. Note that I write Vec4i(value) for a single scalar broadcast into all 4 SIMD lanes, and Vec4i(a, b, c, d) for a 4-int vector that initializes the lanes to different values. I hope this is readable enough.

With this factored out, the SIMD version for the rest of the rasterizer is easy enough:

    // Bounding box and clipping again as before

    // Triangle setup
    Point2D p = { minX, minY };
    Edge e01, e12, e20;

    Vec4i w0_row = e12.init(v1, v2, p);
    Vec4i w1_row = e20.init(v2, v0, p);
    Vec4i w2_row = e01.init(v0, v1, p);

    // Rasterize
    for (p.y = minY; p.y <= maxY; p.y += Edge::stepYSize) {
        // Barycentric coordinates at start of row
        Vec4i w0 = w0_row;
        Vec4i w1 = w1_row;
        Vec4i w2 = w2_row;

        for (p.x = minX; p.x <= maxX; p.x += Edge::stepXSize) {
            // If p is on or inside all edges for any pixels,
            // render those pixels.
            Vec4i mask = w0 | w1 | w2;
            if (any(mask >= 0))
                renderPixels(p, w0, w1, w2, mask);

            // One step to the right
            w0 += e12.oneStepX;
            w1 += e20.oneStepX;
            w2 += e01.oneStepX;
        }

        // One row step
        w0_row += e12.oneStepY;
        w1_row += e20.oneStepY;
        w2_row += e01.oneStepY;
    }

There’s a bunch of surface changes – our edge function values are now Vec4is instead of ints, and we now process multiple pixels at a time – but the only thing that really changes in any way that matters is the switch from renderPixel to renderPixels: we now process multiple pixels at a time, and some of them could be in while others are out, so we can’t do a single if anymore. Instead, we pass our mask to renderPixels – which can then use the corresponding sign bit for each pixel to decide whether to update the frame buffer for that pixel. We only early-out if all of the pixels are outside the triangle.

But really, the most important thing to note is that this wasn’t hard at all! (At least I hope it wasn’t. Apologies if I’m going too fast.)

Next steps and a bit of perspective

At this point, I could spend an arbitrary amount of time tweaking our toy rasterizer, adding features, optimizing it and so forth, but I’ll leave it be; it’s served its purpose, which was to illustrate the underlying algorithm. We’re gonna switch back to the actual rasterizer from Intel’s Software Occlusion Culling demo next. But before we go there, I want to give you some more context about this kind of algorithm, where it’s coming from, and how you would modify it for practical applications.

First, as I mentioned before, the nice thing about this type of rasterizer is that it’s easy to incorporate external constraints. For example, try modifying the above code so it always does “aligned” accesses, i.e. the x-coordinate passed to renderPixels is always a multiple of 4. This enables the use of aligned loads and stores, which are faster. Similarly, try modifying the rasterizer to traverse groups of 2×2 pixels instead of 4×1 pixels; the code is set up in a way that should make this an easy change. Then combine the two things – traverse groups of aligned quads, i.e. x and y coordinates passed to renderPixels are always even. The point is that all these changes are actually easy to make, whereas they would be relatively hard to incorporate in a scanline rasterizer. It’s also easy to make use of wider instruction sets: you could do groups of 4×2 pixels, or 2×4, or even 4×4 and more if you wanted.

That said, the current outer loop we use – always checking the whole bounding box of the triangle – is hardly optimal. In fact, for any triangle that’s not so large it gets clipped to the screen edges, at least half of the bounding box is going to be empty. There are much better ways to do this traversal, but we’re not going to use any of the fancier strategies in this series (at least, I don’t plan to at this moment) since the majority of triangles we’re going to encounter in the demo are actually quite small. The better strategies are much more efficient at rasterizing large triangles, but if a triangle touches less than 10 pixels to begin with, it’s just not worth the effort to spend extra time on trying to only cover the areas of the triangle that matter. So there’s a fairly delicate balancing act involved. The code on Github does contain a branch that implements a hierarchical rasterizer, and while as of this writing it is somewhat faster, it’s not really enough of a win to justify the effort that went into it. But it might still be interesting if you want to see how a (quickly hacked!) version of that approach looks.

Which brings me to the history section: As I mentioned in the introduction, this approach is anything but new. The first full description of it in the literature that I’m aware of is Pineda’s “A Parallel Algorithm for Polygon Rasterization”. It was presented at Siggraph 1988 and already describes most of the ideas: It uses integer edge functions, has the incremental evaluation, sub-pixel precision (but no proper fill rule), and it produces blocks of 4×4 pixels at a time. It also shows several smarter traversal algorithms than the basic bounding box strategy we’re using. McCormack and McNamara describe more efficient traversal schemes based on tiles, Greene’s “Hierarchical Polygon Tiling with Coverage Masks” describes a hierarchical approach, Michael Abrash’s “Rasterization on Larrabee” describes the same approach as independently discovered while working on Larrabee (I later joined that team, which is a good part of the reason for me being able to quote this list of references by heart), and McCool et al. describe a combination of hierarchical rasterization and Hilbert curve scan order that should be sufficient to nerd snipe you for at least half an hour if you’re still clicking on those links. Olano and Greer even describe an algorithm that rasterizes straight from homogeneous coordinates, without dividing the vertex coordinates through by w first, which everyone interested in either rasterization or projective geometry should check out.

Did I mention that this approach isn’t exactly new? Anyway, this tangent has gone on for long enough; let’s go back to the Software Occlusion Culling demo.

A match made in Github

I’m not going to start describing any new techniques here, but I do want to use the rest of this article to link up my description of the algorithm with the code in the Software Occlusion Culling demo, so you know what goes where. I purposefully picked our notation and terminology to be similar to the rasterizer code, to minimize friction. I’ll write down differences as we encounter them. One thing I’ll point out right now is that this code has y pointing down, whereas all my diagrams so far had y=up (note that I was fairly dodgy in the last 2 posts about which way y actually points – this is why). This is a fairly superficial change, but it does mean that the triangles with positive area are now the clockwise ones. Keep that in mind. Also, apologies in advance for the messed-up spacing in the code I’m linking to – it was written for 4-column tabs and mixes tabs and spaces, so there are the usual display problems. (This is why I prefer using spaces in my code, at least in code I intend to put on the net.)

The demo uses a “binning” architecture, which means the screen is chopped up into a number of rectangles (“tiles”), each 320×90 pixels. Triangles first get “binned”, which means that for each tile, we build a list of triangles that (potentially) overlap it. This is done by the binner.

Once the triangles are binned, this data gets handed off to the actual rasterizer. Each instance of the rasterizer processes exactly one tile. The idea is that tiles are small enough so that their depth buffer (which is what we’re rasterizing, since we want it for occlusion culling) fits comfortably within the L2 cache of a core. By rendering one tile at a time, we should thus keep the number of cache misses for the depth buffer to a minimum. And it works fairly well – if you look at some of the profiles in earlier articles, you’ll notice that the depth buffer rasterizer doesn’t have a high number of last-level cache misses, even though it’s one of the main workhorse functions in the program.

Anyway, the rasterizer first tries to grab a group of 4 triangles from its active bin (a “bin” is a container for a list of triangles). These triangles will be rendered sequentially, but they’re all set up as a group using SIMD instructions. The first step is to compute the A’s, B’s and C’s and determine the bounding box, complete with clipping to the tile bounds and snapping to 2×2-aligned pixel positions. This is now written using SSE2 intrinsics, but the math should all look very familiar at this point.

It also computes the triangle area (actually, twice its area) which the barycentric coordinates later get divided by to normalize them.

Then, we enter the per-triangle loop. Mostly, variables get broadcast into SIMD registers first, followed by a bit more setup for the increments and of course the initial evaluation of the edge functions (this all looks scarier than it is, but it is fairly repetitive, which is why I introduced the Edge struct in my version of the same code). Once we enter the y-loop, things should be familiar again: we have our three edge function values at the start of the row (incremented whenever we go down one step), and the per-pixel processing should look familiar too.

After the early-out, we have the actual depth-buffer rendering code – the part I always referred to as renderPixels. The interpolated depth value is computed from the edge functions using the barycentric coordinates as weights, and then there’s a bit of logic to read the current value from the depth buffer and update it given the interpolated depth value. The ifs are there because this loop supports two different depth storage formats: a linear one that is used in “visualize depth buffer” mode and a (very simply) swizzled format that’s used when “visualize depth buffer” is disabled.

So everything does, in fact, closely follow the basic code flow I showed you earlier. There’s a few simple details that I haven’t explained yet (such as the way the depth buffer is stored), but don’t worry, we’ll get there – next time. No more delays – actual changes to the rasterizer and our first hard-won performance improvements are upcoming!

Triangle rasterization in practice

This post is part of a series – go here for the index.

Welcome back! The previous post gave us a lot of theoretical groundwork on triangles. This time, let’s turn it into a working triangle rasterizer. Again, no profiling or optimization this time, but there will be code, and it should get us set up to talk actual rasterizer optimizations in the next post. But before we start optimizing, let’s first try to write the simplest rasterizer that we possibly can, using the primitives we saw in the last part.

The basic rasterizer

As we saw last time, we can calculate edge functions (which produce barycentric coordinates) as a 2×2 determinant. And we also saw last time that we can check if a point is inside, on the edge or outside a triangle simply by looking at the signs of the three edge functions at that point. Our rasterizer is going to work in integer coordinates, so let’s assume for now that our triangle vertex positions and point coordinates are given as integers too. The orientation test that computes the 2×2 determinant looks like this in code:

struct Point2D {
    int x, y;
};

int orient2d(const Point2D& a, const Point2D& b, const Point2D& c)
{
    return (b.x-a.x)*(c.y-a.y) - (b.y-a.y)*(c.x-a.x);
}

Now, all we have to do to rasterize our triangle is to loop over candidate pixels and check whether they’re inside or not. We could do it brute-force and loop over all screen pixels, but let’s try to not be completely brain-dead about this: we do know that all pixels inside the triangle are also going to be inside an axis-aligned bounding box around the triangle. And axis-aligned bounding boxes are both easy to compute and trivial to traverse. This gives:

void drawTri(const Point2D& v0, const Point2D& v1, const Point2D& v2)
{
    // Compute triangle bounding box
    int minX = min3(v0.x, v1.x, v2.x);
    int minY = min3(v0.y, v1.y, v2.y);
    int maxX = max3(v0.x, v1.x, v2.x);
    int maxY = max3(v0.y, v1.y, v2.y);

    // Clip against screen bounds
    minX = max(minX, 0);
    minY = max(minY, 0);
    maxX = min(maxX, screenWidth - 1);
    maxY = min(maxY, screenHeight - 1);

    // Rasterize
    Point2D p;
    for (p.y = minY; p.y <= maxY; p.y++) {
        for (p.x = minX; p.x <= maxX; p.x++) {
            // Determine barycentric coordinates
            int w0 = orient2d(v1, v2, p);
            int w1 = orient2d(v2, v0, p);
            int w2 = orient2d(v0, v1, p);

            // If p is on or inside all edges, render pixel.
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                renderPixel(p, w0, w1, w2);           
        }
    }
}

And that’s it. That’s a fully functional triangle rasterizer. In theory anyway – you need to write the min / max and renderPixel functions yourself, and I didn’t actually test this code, but you get the idea. It even does 2D clipping. Now, don’t get me wrong. I don’t recommend that you use this code as-is anywhere, for reasons I will explain below. But I wanted you to see this, because this is the actual heart of the algorithm. In any implementation of it that you’re ever going to see in practice, the wonderful underlying simplicity of it is going to be obscured by the various wrinkles introduced by various features and optimizations. That’s fine – all these additions are worth their price. But they are, in a sense, implementation details. Hell, even limiting the traversal to a bounding box is just an optimization, if a simple and important one. The point I’m trying to make here: This is not, at heart, a hard problem that requires a complex solution. It’s a fundamentally simple problem that can be solved much more efficiently if we apply some smarts – an important distinction.

Issues with this approach

All that said, let’s list some problems with this initial implementation:

  • Integer overflows. What if some of the computations overflow? This might be an actual problem or it might not, but at the very least we need to look into it.
  • Sub-pixel precision. This code doesn’t have any.
  • Fill rules. Graphics APIs specify a set of tie-breaking rules to make sure that when two non-overlapping triangles share an edge, every pixel (or sample) covered by these two triangles is lit once and only once. GPUs and software rasterizers need to strictly abide by these rules to avoid visual artifacts.
  • Speed. While the code as given above sure is nice and short, it really isn’t particularly efficient. There’s a lot we can do to make it faster, and we’ll get there in a bit, but of course this will make things more complicated.

I’m going to address each of these in turn.

Integer overflows

Since all the computations happen in orient2d, that’s the only expression we actually have to look at:

(b.x-a.x)*(c.y-a.y) - (b.y-a.y)*(c.x-a.x)

Luckily, it’s pretty symmetric, so there’s not many different sub-expressions we have to look at: Say we start with p-bit signed integer coordinates. That means the individual coordinates are in [-2^{p-1}, 2^{p-1}-1]. By subtracting the upper bound from the lower bound (and vice versa), we can determine the bounds for the difference of the two coordinates:

-(2^p - 1) \le b_x - a_x \le 2^p - 1 \quad \Leftrightarrow \quad |b_x - a_x| \le 2^p - 1

And the same applies for the other three coordinate differences we compute. Next, we compute a product of two such values. Easy enough:

|(b_x - a_x) (c_y - a_y)| \le |b_x - a_x| |c_y - a_y| = (2^p - 1)^2

Again, the same applies to the other product. Finally, we compute the difference between the two products, which doubles our bound on the absolute value:

|\mathrm{Orient2D}(a,b,c)| \le 2 (2^p - 1)^2 = 2^{2p + 1} - 2^{p+2} + 2 \le 2^{2p + 1} - 2

since p is always nonnegative. Accounting for the sign bit, that means the result of Orient2D fits inside a (2p+2)-bit signed integer. Since we want the results to fit inside a 32-bit integer, that means we need p \le (32 - 2) / 2 = 15 to make sure there are no overflows. In other words, we’re good as long as the input coordinates are all inside [-16384,16383]. Anything poking outside that area needs to be analytically clipped beforehand to make sure there’s no overflows during rasterization.
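That bound is easy to sanity-check numerically; here’s a quick 64-bit check of my own (not from the demo code):

```cpp
#include <cstdint>

// Worst-case |Orient2D| magnitude for p-bit signed input coordinates:
// each coordinate difference is at most 2^p - 1 in absolute value, so
// the result is at most 2 * (2^p - 1)^2. Evaluated in 64-bit arithmetic
// so the check itself can't overflow.
static int64_t worstCaseOrient2D(int p)
{
    int64_t d = (int64_t(1) << p) - 1; // max |coordinate difference|
    return 2 * d * d;
}
```

For p = 15 this stays just under INT32_MAX; for p = 16 it doesn’t, matching the derivation above.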

Incidentally, this shows how a typical implementation of guard-band clipping works: the rasterizer performs computations using some fixed bit width, which determines the range of coordinates it accepts. X/Y-clipping only needs to be done when a triangle doesn’t fall entirely within that region, which is very rare with common viewport sizes. Note that there is no need for rasterizer coordinates to agree with render-target coordinates, and if you want to maximize the utility of your guard-band region, your best bet is to translate the rasterizer coordinate system such that the center (instead of the top-left or bottom-right corner) of your viewport is near (0,0). Otherwise large viewports might have a much bigger guard band on the left side than they do on the right side (and similarly in the vertical direction), which is undesirable.
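To put concrete numbers on the centering argument, here’s a sketch of my own (names are made up), using the 15-bit coordinate range derived above:

```cpp
// Map a render-target x coordinate to a rasterizer x coordinate so that the
// viewport center lands at 0. With 15-bit rasterizer coordinates (range
// [-16384, 16383]), a centered 1920-pixel-wide viewport spans [-960, 960),
// leaving roughly 15400 units of guard band on each side; anchored at the
// top-left corner it would span [0, 1920), with 16384 units of guard band
// on the left but only ~14460 on the right.
struct Viewport { int width, height; };

static int toRasterX(int renderTargetX, const Viewport& vp)
{
    return renderTargetX - vp.width / 2;
}
```

The same translation applies to y, of course.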

Anyway. Integer overflows: Not a big deal, at least in our current setup with all-integer coordinates. We do need to check for (and possibly clip) huge triangles, but they’re rare in practice, so we still get away with no clipping most of the time.

Sub-pixel precision

For this point and the next, I’m only going to give a high-level overview, since we’re not actually going to use it for our target application.

Snapping vertex coordinates to pixels is actually quite crappy in terms of quality. It’s okay for a static view of a static scene, but if either the camera or one of the visible objects moves very slowly, it’s quite noticeable that the triangles only move in discrete steps once one of the vertices has moved from one pixel to the next after rounding the coordinates to integer. It looks as if the triangle is “wobbly”, especially so if there’s a texture on it.

Now, for the application we’re concerned with in this series, we’re only going to render a depth buffer, and the user is never gonna see it directly. So we can live with artifacts that are merely visually distracting, and needn’t bother with sub-pixel correction. This still means that the triangles we software-rasterize aren’t going to match up exactly with what the hardware rasterizer does, but in practice, if we mistakenly occlusion-cull an object even though some of its pixels are just about visible due to sub-pixel coordinate differences, it’s not a big deal. And neither is not culling an object because of a few pixels that are actually invisible. As one of my CS professors once pointed out, there are reasonable error bounds for everything, and for occlusion culling, “a handful of pixels give or take” is a reasonable error bound, at least if they’re not clustered together!

But suppose that you want to actually render something user-visible, in which case you absolutely do need sub-pixel precision. You want at least 4 extra bits in each coordinate (i.e. coordinates are specified in 1/16ths of a pixel), and at this point the standard in DX11-compliant GPUs is 8 bits of sub-pixel precision (coordinates in 1/256ths of a pixel). Let’s assume 8 bits of sub-pixel precision for now. The trivial way to get this is to multiply everything by 256: our (still integer) coordinates are now in 1/256ths of a pixel, but we still only take one sample per pixel. Easy enough (just sketching the updated main loop here):

    static const int subStep = 256;
    static const int subMask = subStep - 1;

    // Round start position up to next integer multiple
    // (we sample at integer pixel positions, so if our
    // min is not an integer coordinate, that pixel won't
    // be hit)
    minX = (minX + subMask) & ~subMask;
    minY = (minY + subMask) & ~subMask;

    for (p.y = minY; p.y <= maxY; p.y += subStep) {
        for (p.x = minX; p.x <= maxX; p.x += subStep) {
            // Determine barycentric coordinates
            int w0 = orient2d(v1, v2, p);
            int w1 = orient2d(v2, v0, p);
            int w2 = orient2d(v0, v1, p);

            // If p is on or inside all edges, render pixel.
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                renderPixel(p, w0, w1, w2);           
        }
    }

Simple enough, and it works just fine. Well, in theory it does, anyway – this code fragment is just as untested as the previous one, so be careful :). By the way, this seems like a good place to note that if you’re writing a software rasterizer, this is likely not what you want: This code samples triangle coverage at integer coordinates. This is simpler if you’re writing a rasterizer without sub-pixel correction (as we will do, which is why I set up coordinates this way), and it also happens to match with D3D9 rasterization conventions, but it disagrees with OpenGL and D3D10+ rasterization rules, which turn out to be saner in several important ways for a full-blown renderer. So consider yourselves warned.

Anyway, as said, this works, but it has a problem: doing the computation like this costs us a lot of bits. Our accepted coordinate range when working with 32-bit integers is still [-16384,16383], but now that’s in sub-pixel steps and boils down to approximately [-64,63.996] pixels. That’s tiny – even if we center the viewport perfectly, we can’t squeeze more than 128 pixels along each axis out of it this way. One way out is to decrease the level of sub-pixel precision: at 4 bits, we can just about fit a 2048×2048 pixel render target inside our coordinate space, which isn’t exactly comfortable but workable.
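The trade-off between sub-pixel bits and viewport size is easy to write down explicitly (a small helper of my own, not from the post):

```cpp
// With a total coordinate budget of p bits (p = 15 for 32-bit edge
// functions, as derived earlier) and subBits bits of sub-pixel precision,
// the naive multiply-everything approach covers 2^(p - subBits) whole
// pixels per axis (assuming a centered viewport).
static int maxViewportExtent(int p, int subBits)
{
    return 1 << (p - subBits);
}
```

This reproduces the numbers above: 128 pixels at 8 sub-pixel bits, 2048 at 4.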

But there’s a better way. I’m not gonna go into details here because we’re already on a tangent and the details, though not hard, are fairly subtle. I might turn it into a separate post at some point. But the key realization is that we’re still taking steps of one pixel at a time: all the p’s we pass into orient2d are an integral number of pixel samples apart. This, together with the incremental evaluation we’re gonna see soon, means that we only have to do a full-precision calculation once per triangle. All the pixel-stepping code always advances in units of integral pixels, which means the sub-pixel size enters the computation only once, not squared. Which in turn means we can actually cover the 2048×2048 render target with 8 bits of subpixel accuracy, or 8192×8192 pixels with 4 bits of subpixel resolution. You can squeeze that some more if you traverse the triangle in 2×2 pixel blocks and not actual pixels, as our triangle rasterizer and any OpenGL/D3D-style rasterizer will do, but again, I digress.

Fill rules

The goal of fill rules, as briefly explained earlier, is to make sure that when two non-overlapping triangles share an edge and you render both of them, each pixel gets processed only once. Now, if you look at an actual description (this one is for D3D10 and up), it might seem like they’re really tricky to implement and require comparing edges to other edges, but luckily it all turns out to be fairly simple to do, although I’ll need a bit of space to explain it.

Remember that our core rasterizer only deals with triangles in one winding order – let’s say counter-clockwise, as we’ve been using last time. Now let’s look at the rules from the article I just pointed you to:

A top edge, is an edge that is exactly horizontal and is above the other edges.
A left edge, is an edge that is not exactly horizontal and is on the left side of the triangle.

(Figure: an example counter-clockwise triangle with vertices v0, v1, v2.)

The “exactly horizontal” part is easy enough to find out (just check whether the y-coordinates of the two endpoints are equal), but the second half of these definitions looks troublesome. Luckily, it turns out to be fairly easy. Let’s do top first: What does “above the other edges” mean, really? An edge connects two points. The edge that’s “above the other edges” connects the two highest vertices; the third vertex is below them. In our example triangle, that edge is v1v2 (ignore that it’s not horizontal for now, it’s still the edge that’s above the others). Now I claim that edge must be one that is going towards the left. Suppose it was going to the right instead – then v0 would be in its right (negative) half-space, meaning the triangle is wound clockwise, contradicting our initial assertion that it’s counter-clockwise! And by the same argument, any horizontal edge that goes to the right must be a bottom edge, or again we’d have a clockwise triangle. Which gives us our first updated rule:

In a counter-clockwise triangle, a top edge is an edge that is exactly horizontal and goes towards the left, i.e. its end point is left of its start point.

That’s really easy to figure out – just a sign test on the edge vectors. And again using the same kind of argument as before (consider the edge v2v0), we can see that any “left” edge must be one that’s going down, and that any edge that is going up is in fact a right edge. Which gives us the second updated rule:

In a counter-clockwise triangle, a left edge is an edge that goes down, i.e. its end point is strictly below its start point.

Note we can drop the “not horizontal” part entirely: any edge that goes down by our definition can’t be horizontal to begin with. So this is just one sign test, even easier than testing for a top edge!
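Both rules together boil down to two sign tests on the edge vector. Here’s my sketch of what the isTopLeft helper used further down can look like (assuming screen-style integer coordinates with y increasing downwards; flip the y comparison for a y-up convention):

```cpp
struct Point2D { int x, y; };

// Fill-rule helper: returns true if the directed edge from 'start' to 'end'
// is a top edge (exactly horizontal, going left) or a left edge (going down).
// Assumes y increases downwards, so "going down" means end.y > start.y.
static bool isTopLeft(const Point2D& start, const Point2D& end)
{
    bool top  = (start.y == end.y) && (end.x < start.x); // horizontal, going left
    bool left = end.y > start.y;                         // going down
    return top || left;
}
```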

And now that we know how to identify which edge is which, what do we do with that information? Again, quoting from the D3D10 rules:

Any pixel center which falls inside a triangle is drawn; a pixel is assumed to be inside if it passes the top-left rule. The top-left rule is that a pixel center is defined to lie inside of a triangle if it lies on the top edge or the left edge of a triangle.

To paraphrase: if our sample point actually falls inside the triangle (not on an edge), we draw it no matter what. If it happens to fall on an edge, we draw it if and only if that edge happens to be a top or a left edge.

Now, our current rasterizer code:

    int w0 = orient2d(v1, v2, p);
    int w1 = orient2d(v2, v0, p);
    int w2 = orient2d(v0, v1, p);

    // If p is on or inside all edges, render pixel.
    if (w0 >= 0 && w1 >= 0 && w2 >= 0)
        renderPixel(p, w0, w1, w2);           

draws all points that fall on edges, no matter which kind – all the tests are for greater-or-equal to zero. That’s okay for edge functions corresponding to top or left edges, but for the other edges we really want to be testing for a proper “greater than zero” instead. We could have multiple versions of the rasterizer, one for each possible combination of “edge 0/1/2 is (not) top-left”, but that’s too horrible to contemplate. Instead, we’re going to use the fact that for integers, x > 0 and x >= 1 mean the same thing. Which means we can leave the tests as they are by first computing a per-edge offset once:

  int bias0 = isTopLeft(v1, v2) ? 0 : -1;
  int bias1 = isTopLeft(v2, v0) ? 0 : -1;
  int bias2 = isTopLeft(v0, v1) ? 0 : -1;

and then changing our edge function computation slightly:

    int w0 = orient2d(v1, v2, p) + bias0;
    int w1 = orient2d(v2, v0, p) + bias1;
    int w2 = orient2d(v0, v1, p) + bias2;

    // If p is on or inside all edges, render pixel.
    if (w0 >= 0 && w1 >= 0 && w2 >= 0)
        renderPixel(p, w0, w1, w2);           

Full disclosure: this changes the barycentric coordinates we pass to renderPixel slightly (as does the subpixel-precision squeezing we did earlier!). If you’re not using sub-pixel correction, this can be quite a big error, and you want to correct for it. With sub-pixel correction, you might decide that being off-by-1 on interpolated quantities is no big deal (remember that the edge functions are in area units, so “1” is a 1-subpixel-by-1-subpixel square, which is fairly small). Either way, the bias values are computed once per triangle, and you can usually do the correction once per triangle too, so it’s no extra per-pixel overhead. Right now, we pay some per-pixel cost to apply the biases too, but it turns out that will go away once we start optimizing it. And by the way, if you go back to the “integer overflow” section, you’ll notice we had a bit of slack on the precision requirements; the “bias” terms will not cause us to need any extra bits. So it really does all work out, and we can get proper fill rule handling in our rasterizer.

Which reminds me: This is the part where I tell you that the depth buffer rasterizer we’re going to look at doesn’t bother with implementing a consistent fill rule. It has the same “fill everything inside or on the edge” behavior as our initial code does. That might be an oversight, or it might be an intentional decision to make the rasterizer slightly conservative, which would make sense given the application. I’m not sure, and I decided not to mess with it. But I figured that since I was writing a post on rasterization, it would be a sin not to describe how to do this properly, especially since a coherent explanation of how exactly it’s done is quite hard to find on the net.

All that’s fine and good, but now how do we make it fast?

Well, that’s a big question, and – much as I hate to tell you – one that I will try to answer in the next post. We’ll also end this brief detour into software rasterization generalities and get back to the Software Occlusion Culling demo that started this series.

So what’s the point of this and the previous post? Well, first off, this is still my blog, and I just felt like writing about it. :) And just as importantly, I’m going to spend at least two posts poking around in the guts of a rasterizer, and none of the changes I’m going to describe will make any sense to you without this background information. Low-hanging fruit are all nice and good, but sometimes you actually have to work for it, and this is one of those times. Besides, while optimizing code is fun, correctness isn’t optional. Fast code that doesn’t do what it’s supposed to is no good to anyone. So I’m trying to get it right before we make it fast. I can promise you it will be worth your while, though, and I’ll try to finish and upload the next post quickly. Until then, take care!