Optimizing Software Occlusion Culling – index

February 17, 2013

In January of 2013, some nice folks at Intel released a Software Occlusion Culling demo with full source code. I spent about two weekends playing around with the code, and after realizing that it made a great example for various things I’d been meaning to write about for a long time, started churning out blog posts about it for the next few weeks. This is the resulting series.

Here’s the list of posts (the series is now finished):

“Write combining is not your friend”, on typical write combining issues when writing graphics code.
“A string processing rant”, a slightly over-the-top post that starts with some bad string processing habits and ends in a rant about what a complete minefield the standard C/C++ string processing functions and classes are whenever non-ASCII character sets are involved.
“Cores don’t like to share”, on some very common pitfalls when running multiple threads that share memory.
“Fixing cache issues, the lazy way”. You could redesign your system to be more cache-friendly – but when you don’t have the time or the energy, you could also just do this.
“Frustum culling: turning the crank” – on the other hand, if you do have the time and energy, might as well do it properly.
“The barycentric conspiracy” is a lead-in to some in-depth posts on the triangle rasterizer that’s at the heart of Intel’s demo. It’s also a gripping tale of triangles, Möbius, and a plot centuries in the making.
“Triangle rasterization in practice” – how to build your own precise triangle rasterizer and not die trying.
“Optimizing the basic rasterizer”, because this is real time, not amateur hour.
“Depth buffers done quick, part 1” – at last, looking at (and optimizing) the depth buffer rasterizer in Intel’s example.
“Depth buffers done quick, part 2” – optimizing some more!
“The care and feeding of worker threads, part 1” – this project uses multi-threading; time to look into what these threads are actually doing.
“The care and feeding of worker threads, part 2” – more on scheduling.
“Reshaping dataflows” – using global knowledge to perform local code improvements.
“Speculatively speaking” – on store forwarding and speculative execution, using the triangle binner as an example.
“Mopping up” – a bunch of things that didn’t fit anywhere else.
“The Reckoning” – in which a lesson is learned, but the damage is irreversible.

All the code is available on GitHub; there are several branches corresponding to various (simultaneous) tracks of development, including a lot of experiments that didn’t pan out. The articles all reference the blog branch, which contains only the changes I talk about in the posts, i.e. the stuff I judged to be actually useful.

Special thanks to Doug McNabb and Charu Chandrasekaran at Intel for publishing the example with full source code and a permissive license, and for saying “yes” when I asked them whether they were okay with me writing about my findings in this way!

To the extent possible under law, Fabian Giesen has waived all copyright and related or neighboring rights to Optimizing Software Occlusion Culling.


17 Comments

rashmatash

This series is great for learning various optimization and software rendering techniques, which gives me a much better understanding of what actually happens in the hardware. I wonder, however, why the Intel guys chose to use a software rasterizer instead of doing the whole thing on the GPU. Do you know if there is a reason for this? (Especially with DX11’s compute shaders and UAVs, it would be way faster, right?) Thanks.

- fgiesen
  
  The whole point is that doing an early conservative occlusion culling pass on the CPU *is* faster than submitting everything to the GPU – as the example shows.
  
  In practice, I wouldn’t use a full-resolution depth buffer (like the Intel sample does) for this, though. A smaller depth buffer should work just fine and is cheaper to render. Once you do that, though, you need to be careful with small occluders and make sure your rasterizer is conservative.
  
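
To make the "smaller depth buffer, but conservative" point from the reply above concrete, here is a minimal sketch. It is not code from the Intel sample or from the series. It assumes depth values in [0, 1] where larger means farther away, and it builds the small buffer by max-reducing a full-resolution one, which is only a stand-in for rendering directly at low resolution with a conservative rasterizer. The names `DepthBuffer`, `downsampleFarthest`, and `isOccluded` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed convention for this sketch: depth in [0, 1], larger = farther,
// and 1.0f means "no occluder rendered here".
struct DepthBuffer {
    int width = 0;
    int height = 0;
    std::vector<float> depth; // row-major, width * height entries
};

// Shrink a depth buffer by an integer factor, keeping the FARTHEST depth in
// each block. A small occluder that covers only part of a block therefore
// cannot make the whole coarse texel look occluded.
// Assumes src.width and src.height are multiples of factor.
DepthBuffer downsampleFarthest(const DepthBuffer& src, int factor) {
    DepthBuffer dst;
    dst.width = src.width / factor;
    dst.height = src.height / factor;
    dst.depth.assign(static_cast<std::size_t>(dst.width) * dst.height, 0.0f);
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float& out = dst.depth[(y / factor) * dst.width + (x / factor)];
            out = std::max(out, src.depth[y * src.width + x]);
        }
    }
    return dst;
}

// Returns true only if every coarse texel touched by the inclusive rectangle
// [x0, x1] x [y0, y1] (in coarse texel coordinates) holds an occluder depth
// strictly nearer than the object's nearest depth.
bool isOccluded(const DepthBuffer& coarse, int x0, int y0, int x1, int y1,
                float nearestDepth) {
    x0 = std::max(x0, 0);
    y0 = std::max(y0, 0);
    x1 = std::min(x1, coarse.width - 1);
    y1 = std::min(y1, coarse.height - 1);
    if (x0 > x1 || y0 > y1) {
        return false; // off-screen: never claim occlusion here
    }
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            if (coarse.depth[y * coarse.width + x] >= nearestDepth) {
                return false; // something in this texel may be visible
            }
        }
    }
    return true;
}
```

The design choice is that every uncertain case resolves to "not occluded": the test may fail to cull an object that is in fact hidden, but it should never cull one that is visible.
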
sgm

Thanks a lot for this article series. I learned a lot of details about writing a rasterizer. I have one question, though: how do you get around race conditions when writing to the depth buffer using multiple threads? If I’m not mistaken, each thread rasterizes a ‘bin’ and there can be multiple bins per screen-space tile. Is that not the case?

- fgiesen
  
  No, each tile has one bin. (The tile is the area on the screen, the bin is a data structure containing triangles overlapping the tile.)
  
  - sgm
    
    I see. Somehow the comments in the code confused me. Thanks for the prompt reply.
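
The tile/bin relationship described in the reply above can be sketched roughly as follows. This is an illustration under stated assumptions, not the demo's actual binner: it bins by screen-space bounding box in a single thread, and `TriBounds`, `TileGrid`, `makeGrid`, and `binTriangles` are hypothetical names. The point is that each tile owns exactly one bin, so a worker that rasterizes one tile reads only that bin and writes only depth pixels inside that tile.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical input: inclusive pixel bounding box of a projected triangle.
struct TriBounds { int minX, minY, maxX, maxY; };

struct TileGrid {
    int screenW, screenH; // screen size in pixels
    int tileW, tileH;     // tile size in pixels
    int tilesX, tilesY;   // number of tiles in each direction
    // Exactly one bin per tile: bins[ty * tilesX + tx] lists the indices of
    // all triangles whose bounding box overlaps tile (tx, ty).
    std::vector<std::vector<int>> bins;
};

TileGrid makeGrid(int screenW, int screenH, int tileW, int tileH) {
    TileGrid g;
    g.screenW = screenW;
    g.screenH = screenH;
    g.tileW = tileW;
    g.tileH = tileH;
    g.tilesX = (screenW + tileW - 1) / tileW; // round up
    g.tilesY = (screenH + tileH - 1) / tileH;
    g.bins.resize(static_cast<std::size_t>(g.tilesX) * g.tilesY);
    return g;
}

// Append each triangle index to the bin of every tile its box overlaps.
void binTriangles(TileGrid& g, const std::vector<TriBounds>& tris) {
    for (std::size_t i = 0; i < tris.size(); ++i) {
        // Clamp to the screen first so the tile indices stay in range.
        int x0 = std::max(tris[i].minX, 0);
        int y0 = std::max(tris[i].minY, 0);
        int x1 = std::min(tris[i].maxX, g.screenW - 1);
        int y1 = std::min(tris[i].maxY, g.screenH - 1);
        if (x0 > x1 || y0 > y1) {
            continue; // entirely off-screen
        }
        for (int ty = y0 / g.tileH; ty <= y1 / g.tileH; ++ty) {
            for (int tx = x0 / g.tileW; tx <= x1 / g.tileW; ++tx) {
                g.bins[ty * g.tilesX + tx].push_back(static_cast<int>(i));
            }
        }
    }
}
```

A triangle that straddles a tile boundary simply appears in several bins. Because tiles cover disjoint screen areas, handing each tile (with its single bin) to one worker means no two workers write the same depth-buffer pixel.
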
