<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>The ryg blog</title>
	<atom:link href="http://fgiesen.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://fgiesen.wordpress.com</link>
	<description>When I grow up I&#039;ll be an inventor.</description>
	<lastBuildDate>Thu, 23 May 2013 02:20:44 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='fgiesen.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>The ryg blog</title>
		<link>http://fgiesen.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://fgiesen.wordpress.com/osd.xml" title="The ryg blog" />
	<atom:link rel='hub' href='http://fgiesen.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Trig identities from complex exponentials</title>
		<link>http://fgiesen.wordpress.com/2013/05/13/trig-identities-from-complex/</link>
		<comments>http://fgiesen.wordpress.com/2013/05/13/trig-identities-from-complex/#comments</comments>
		<pubDate>Mon, 13 May 2013 07:30:39 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Maths]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=2030</guid>
		<description><![CDATA[There&#8217;s tons of useful trig identities. You could spend the time to learn them by heart, or just look them up on Wikipedia when necessary. But I&#8217;ve always had problems remembering where the signs and such go when trying to memorize this directly. At least for me, what worked way better is this: spend a [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=2030&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>There&#8217;s tons of useful trig identities. You could spend the time to learn them by heart, or just look them up on Wikipedia when necessary. But I&#8217;ve always had problems remembering where the signs and such go when trying to memorize this directly. At least for me, what worked way better is this: spend a few hours familiarizing yourself with complex numbers if you haven&#8217;t done so already; after that, most identities that you need in practice are easy to derive from Euler&#8217;s formula:</p>
<p><img src='http://s0.wp.com/latex.php?latex=e%5E%7Bix%7D+%3D+%5Cexp%28ix%29+%3D+%5Ccos%28x%29+%2B+i+%5Csin%28x%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='e^{ix} = &#92;exp(ix) = &#92;cos(x) + i &#92;sin(x)' title='e^{ix} = &#92;exp(ix) = &#92;cos(x) + i &#92;sin(x)' class='latex' /></p>
<p>Let&#8217;s do the basic addition formulas first. Euler&#8217;s formula gives:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%2By%29+%2B+i+%5Csin%28x%2By%29+%3D+%5Cexp%28i%28x%2By%29%29+%3D+%5Cexp%28ix%29+%5Cexp%28iy%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x+y) + i &#92;sin(x+y) = &#92;exp(i(x+y)) = &#92;exp(ix) &#92;exp(iy)' title='&#92;cos(x+y) + i &#92;sin(x+y) = &#92;exp(i(x+y)) = &#92;exp(ix) &#92;exp(iy)' class='latex' /></p>
<p>and once we apply the identity again we get:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%28%5Ccos%28x%29+%2B+i+%5Csin%28x%29%29+%28%5Ccos%28y%29+%2B+i+%5Csin%28y%29%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='(&#92;cos(x) + i &#92;sin(x)) (&#92;cos(y) + i &#92;sin(y))' title='(&#92;cos(x) + i &#92;sin(x)) (&#92;cos(y) + i &#92;sin(y))' class='latex' /></p>
<p>multiplying out:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%28%5Ccos%28x%29+%5Ccos%28y%29+-+%5Csin%28x%29+%5Csin%28y%29%29+%2B+i+%28%5Csin%28x%29+%5Ccos%28y%29+%2B+%5Ccos%28x%29+%5Csin%28y%29%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='(&#92;cos(x) &#92;cos(y) - &#92;sin(x) &#92;sin(y)) + i (&#92;sin(x) &#92;cos(y) + &#92;cos(x) &#92;sin(y))' title='(&#92;cos(x) &#92;cos(y) - &#92;sin(x) &#92;sin(y)) + i (&#92;sin(x) &#92;cos(y) + &#92;cos(x) &#92;sin(y))' class='latex' /></p>
<p>The terms in parentheses are all real numbers; equating them with our original expression yields the result</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%2By%29+%3D+%5Ccos%28x%29+%5Ccos%28y%29+-+%5Csin%28x%29+%5Csin%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x+y) = &#92;cos(x) &#92;cos(y) - &#92;sin(x) &#92;sin(y)' title='&#92;cos(x+y) = &#92;cos(x) &#92;cos(y) - &#92;sin(x) &#92;sin(y)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28x%2By%29+%3D+%5Csin%28x%29+%5Ccos%28y%29+%2B+%5Ccos%28x%29+%5Csin%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x+y) = &#92;sin(x) &#92;cos(y) + &#92;cos(x) &#92;sin(y)' title='&#92;sin(x+y) = &#92;sin(x) &#92;cos(y) + &#92;cos(x) &#92;sin(y)' class='latex' /></p>
<p>Both addition formulas for the price of one. (In fact, this exploits that the addition formulas for trigonometric functions and the addition formula for exponents are really the same thing). The main point being that if you know complex multiplication, you never have to remember what the grouping of factors and the signs are, something I used to have trouble remembering.</p>
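<p>As a quick numerical sanity check, here is a minimal Python sketch of the same derivation. The function name <code>addition_formulas</code> is made up for illustration; it builds the two complex numbers from Euler&#8217;s formula, multiplies them, and reads the two addition formulas off the real and imaginary parts of the product:</p>

```python
import math

def addition_formulas(x, y):
    """Return (cos(x+y), sin(x+y)) using only one complex multiplication."""
    # Euler's formula: exp(ix) = cos(x) + i sin(x)
    zx = complex(math.cos(x), math.sin(x))
    zy = complex(math.cos(y), math.sin(y))
    # exp(i(x+y)) = exp(ix) * exp(iy). The real part of the product is
    # cos(x)cos(y) - sin(x)sin(y), the imaginary part sin(x)cos(y) + cos(x)sin(y).
    z = zx * zy
    return z.real, z.imag

c, s = addition_formulas(0.7, 1.9)
print(math.isclose(c, math.cos(0.7 + 1.9)), math.isclose(s, math.sin(0.7 + 1.9)))
```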
<p>Plugging in x=y into the above also immediately gives the double-angle formulas:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%282x%29+%3D+%5Ccos%28x%29%5E2+-+%5Csin%28x%29%5E2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(2x) = &#92;cos(x)^2 - &#92;sin(x)^2' title='&#92;cos(2x) = &#92;cos(x)^2 - &#92;sin(x)^2' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%282x%29+%3D+2+%5Csin%28x%29+%5Ccos%28x%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(2x) = 2 &#92;sin(x) &#92;cos(x)' title='&#92;sin(2x) = 2 &#92;sin(x) &#92;cos(x)' class='latex' /></p>
<p>so if you know the addition formulas there&#8217;s really no reason to learn these separately.</p>
<p>Then there&#8217;s the well-known</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%29%5E2+%2B+%5Csin%28x%29%5E2+%3D+1&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x)^2 + &#92;sin(x)^2 = 1' title='&#92;cos(x)^2 + &#92;sin(x)^2 = 1' class='latex' /></p>
<p>but it&#8217;s really just the Pythagorean theorem in disguise (since cos(x) and sin(x) are the lengths of the two legs of a right-angled triangle with hypotenuse 1). So not really a new formula either!</p>
<p>Moving either the cosine or sine terms to the right-hand side gives the two <em>immensely</em> useful equations:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%29%5E2+%3D+1+-+%5Csin%28x%29%5E2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x)^2 = 1 - &#92;sin(x)^2' title='&#92;cos(x)^2 = 1 - &#92;sin(x)^2' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28x%29%5E2+%3D+1+-+%5Ccos%28x%29%5E2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x)^2 = 1 - &#92;cos(x)^2' title='&#92;sin(x)^2 = 1 - &#92;cos(x)^2' class='latex' /></p>
<p>In particular, that second one is perfect if you need the sine squared of an angle that you only have the cosine of (usually because you&#8217;ve determined it using a dot product). Judicious application of these two tends to be a great way to simplify superfluous math in shaders (and elsewhere), one of my <a href="http://fgiesen.wordpress.com/2010/10/21/finish-your-derivations-please/">pet peeves</a>.</p>
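<p>To make the dot product case concrete, here is a small Python sketch (shader code would look much the same). The helper name <code>sin_squared_from_dot</code> is hypothetical, and the inputs are assumed to be unit-length vectors, so that their dot product is the cosine of the angle between them:</p>

```python
import math

def sin_squared_from_dot(a, b):
    """sin^2 of the angle between unit vectors a and b, without acos or sin."""
    # For unit vectors, the dot product is cos(theta).
    cos_theta = sum(p * q for p, q in zip(a, b))
    # sin(theta)^2 = 1 - cos(theta)^2
    return 1.0 - cos_theta * cos_theta

a = (1.0, 0.0, 0.0)
b = (math.cos(0.3), math.sin(0.3), 0.0)

# The roundabout version that the identity lets us avoid:
roundabout = math.sin(math.acos(sum(p * q for p, q in zip(a, b)))) ** 2
print(math.isclose(sin_squared_from_dot(a, b), roundabout))
```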
<p>For practice, let&#8217;s apply these two identities to the cosine double-angle formula:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%282x%29+%3D+%5Ccos%28x%29%5E2+-+%5Csin%28x%29%5E2+%3D+2+%5Ccos%28x%29%5E2+-+1+%5CLeftrightarrow+%5Ccos%28x%29%5E2+%3D+%28%5Ccos%282x%29+%2B+1%29+%2F+2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(2x) = &#92;cos(x)^2 - &#92;sin(x)^2 = 2 &#92;cos(x)^2 - 1 &#92;Leftrightarrow &#92;cos(x)^2 = (&#92;cos(2x) + 1) / 2' title='&#92;cos(2x) = &#92;cos(x)^2 - &#92;sin(x)^2 = 2 &#92;cos(x)^2 - 1 &#92;Leftrightarrow &#92;cos(x)^2 = (&#92;cos(2x) + 1) / 2' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Ccos%282x%29+%3D+%5Ccos%28x%29%5E2+-+%5Csin%28x%29%5E2+%3D+1+-+2+%5Csin%28x%29%5E2+%5CLeftrightarrow+%5Csin%28x%29%5E2+%3D+%281+-+%5Ccos%282x%29%29+%2F+2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(2x) = &#92;cos(x)^2 - &#92;sin(x)^2 = 1 - 2 &#92;sin(x)^2 &#92;Leftrightarrow &#92;sin(x)^2 = (1 - &#92;cos(2x)) / 2' title='&#92;cos(2x) = &#92;cos(x)^2 - &#92;sin(x)^2 = 1 - 2 &#92;sin(x)^2 &#92;Leftrightarrow &#92;sin(x)^2 = (1 - &#92;cos(2x)) / 2' class='latex' /></p>
<p>why, it&#8217;s the half-angle formulas! Fancy meeting you here!</p>
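<p>One way these might be used in practice, sketched in Python: replacing x by x/2 in the two rearranged identities gives cos(x/2) and sin(x/2) from cos(x) alone. The function name <code>half_angle</code> is illustrative, and the sketch assumes x lies in [0, pi], so that both half-angle values are non-negative and the positive square root is the right one:</p>

```python
import math

def half_angle(cos_x):
    """Return (cos(x/2), sin(x/2)) for x in [0, pi], given only cos(x)."""
    cos_half = math.sqrt((1.0 + cos_x) / 2.0)  # from cos(x/2)^2 = (cos(x) + 1) / 2
    sin_half = math.sqrt((1.0 - cos_x) / 2.0)  # from sin(x/2)^2 = (1 - cos(x)) / 2
    return cos_half, sin_half

ch, sh = half_angle(math.cos(1.2))
print(math.isclose(ch, math.cos(0.6)), math.isclose(sh, math.sin(0.6)))
```

<p>Outside that range the signs have to be fixed up separately, which this sketch does not attempt.</p>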
<p>Can we do something with the sine double-angle formula too? Well, it&#8217;s not too fancy, but we can get this:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Csin%282x%29+%3D+2+%5Csin%28x%29+%5Ccos%28x%29+%5CLeftrightarrow+%5Csin%28x%29+%5Ccos%28x%29+%3D+%5Csin%282x%29+%2F+2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(2x) = 2 &#92;sin(x) &#92;cos(x) &#92;Leftrightarrow &#92;sin(x) &#92;cos(x) = &#92;sin(2x) / 2' title='&#92;sin(2x) = 2 &#92;sin(x) &#92;cos(x) &#92;Leftrightarrow &#92;sin(x) &#92;cos(x) = &#92;sin(2x) / 2' class='latex' /></p>
<p>Now, let&#8217;s go back to the original addition formulas and let&#8217;s see what happens when we plug in negative values for y. Using <img src='http://s0.wp.com/latex.php?latex=%5Csin%28-x%29+%3D+-%5Csin%28x%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(-x) = -&#92;sin(x)' title='&#92;sin(-x) = -&#92;sin(x)' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Ccos%28-x%29+%3D+%5Ccos%28x%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(-x) = &#92;cos(x)' title='&#92;cos(-x) = &#92;cos(x)' class='latex' />, we get:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x-y%29+%3D+%5Ccos%28x%29+%5Ccos%28y%29+%2B+%5Csin%28x%29+%5Csin%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x-y) = &#92;cos(x) &#92;cos(y) + &#92;sin(x) &#92;sin(y)' title='&#92;cos(x-y) = &#92;cos(x) &#92;cos(y) + &#92;sin(x) &#92;sin(y)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28x-y%29+%3D+%5Csin%28x%29+%5Ccos%28y%29+-+%5Ccos%28x%29+%5Csin%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x-y) = &#92;sin(x) &#92;cos(y) - &#92;cos(x) &#92;sin(y)' title='&#92;sin(x-y) = &#92;sin(x) &#92;cos(y) - &#92;cos(x) &#92;sin(y)' class='latex' /></p>
<p>Hey look, flipped signs! This means that we can now add these to (or subtract them from) the original formulas to get <em>even more</em> identities!</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%2By%29+%2B+%5Ccos%28x-y%29+%3D+2+%5Ccos%28x%29+%5Ccos%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x+y) + &#92;cos(x-y) = 2 &#92;cos(x) &#92;cos(y)' title='&#92;cos(x+y) + &#92;cos(x-y) = 2 &#92;cos(x) &#92;cos(y)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x-y%29+-+%5Ccos%28x%2By%29+%3D+2+%5Csin%28x%29+%5Csin%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x-y) - &#92;cos(x+y) = 2 &#92;sin(x) &#92;sin(y)' title='&#92;cos(x-y) - &#92;cos(x+y) = 2 &#92;sin(x) &#92;sin(y)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28x%2By%29+%2B+%5Csin%28x-y%29+%3D+2+%5Csin%28x%29+%5Ccos%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x+y) + &#92;sin(x-y) = 2 &#92;sin(x) &#92;cos(y)' title='&#92;sin(x+y) + &#92;sin(x-y) = 2 &#92;sin(x) &#92;cos(y)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28x%2By%29+-+%5Csin%28x-y%29+%3D+2+%5Ccos%28x%29+%5Csin%28y%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x+y) - &#92;sin(x-y) = 2 &#92;cos(x) &#92;sin(y)' title='&#92;sin(x+y) - &#92;sin(x-y) = 2 &#92;cos(x) &#92;sin(y)' class='latex' /></p>
<p>It&#8217;s the product-to-sum identities this time. I got one more! We&#8217;ve deliberately flipped signs and then added/subtracted the addition formulas to get the above set. What if we do the same trick in reverse to get rid of those x+y and x-y terms? Let&#8217;s set <img src='http://s0.wp.com/latex.php?latex=x+%3D+%28a+%2B+b%29%2F2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='x = (a + b)/2' title='x = (a + b)/2' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=y+%3D+%28b+-+a%29%2F2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='y = (b - a)/2' title='y = (b - a)/2' class='latex' /> and plug that into the identities above and we get:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28b%29+%2B+%5Ccos%28a%29+%3D+2+%5Ccos%28%28a%2Bb%29%2F2%29+%5Ccos%28%28b-a%29%2F2%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(b) + &#92;cos(a) = 2 &#92;cos((a+b)/2) &#92;cos((b-a)/2)' title='&#92;cos(b) + &#92;cos(a) = 2 &#92;cos((a+b)/2) &#92;cos((b-a)/2)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Ccos%28a%29+-+%5Ccos%28b%29+%3D+2+%5Csin%28%28a+%2B+b%29%2F2%29+%5Csin%28%28b+-+a%29%2F2%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(a) - &#92;cos(b) = 2 &#92;sin((a + b)/2) &#92;sin((b - a)/2)' title='&#92;cos(a) - &#92;cos(b) = 2 &#92;sin((a + b)/2) &#92;sin((b - a)/2)' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28b%29+%2B+%5Csin%28a%29+%3D+2+%5Csin%28%28a+%2B+b%29%2F2%29+%5Ccos%28%28b+-+a%29%2F2%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(b) + &#92;sin(a) = 2 &#92;sin((a + b)/2) &#92;cos((b - a)/2)' title='&#92;sin(b) + &#92;sin(a) = 2 &#92;sin((a + b)/2) &#92;cos((b - a)/2)' class='latex' /></p>
<p>Ta-dah, it&#8217;s the sum-to-product identities. Now, admittedly, we&#8217;ve taken quite a few steps to get here, and looking these up when you need them is going to be faster than walking through the derivation (if you ever need them in the first place &#8211; I don&#8217;t think I&#8217;ve ever used the product/sum identities in practice). But still, working these out is a good exercise, and a lot less likely to go wrong (at least for me) than memorizing lots of similar formulas. (I never can get the signs right that way)</p>
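<p>If you do work these out by hand, a numerical spot check is a cheap way to catch sign errors. The Python sketch below (the function name is made up) tests the three identities above on random inputs, plus a fourth one obtained by applying the same substitution to the remaining product-to-sum identity:</p>

```python
import math
import random

def sum_to_product_holds(a, b):
    """Check the sum-to-product identities for one pair (a, b)."""
    x = (a + b) / 2
    y = (b - a) / 2
    checks = [
        (math.cos(b) + math.cos(a), 2 * math.cos(x) * math.cos(y)),
        (math.cos(a) - math.cos(b), 2 * math.sin(x) * math.sin(y)),
        (math.sin(b) + math.sin(a), 2 * math.sin(x) * math.cos(y)),
        # From sin(x+y) - sin(x-y) = 2 cos(x) sin(y):
        (math.sin(b) - math.sin(a), 2 * math.cos(x) * math.sin(y)),
    ]
    # abs_tol covers cases where both sides are close to zero.
    return all(math.isclose(lhs, rhs, abs_tol=1e-12) for lhs, rhs in checks)

rng = random.Random(1234)  # fixed seed so the check is repeatable
print(all(sum_to_product_holds(rng.uniform(-10, 10), rng.uniform(-10, 10))
          for _ in range(1000)))
```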
<p>Bonus exercise: work out general expressions for <img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%29%5En&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x)^n' title='&#92;cos(x)^n' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Csin%28x%29%5En&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x)^n' title='&#92;sin(x)^n' class='latex' />. Hint:</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ccos%28x%29+%3D+%28%5Cexp%28ix%29+%2B+%5Cexp%28-ix%29%29%2F2&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;cos(x) = (&#92;exp(ix) + &#92;exp(-ix))/2' title='&#92;cos(x) = (&#92;exp(ix) + &#92;exp(-ix))/2' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=%5Csin%28x%29+%3D+%28%5Cexp%28ix%29+-+%5Cexp%28-ix%29%29%2F%282i%29&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='&#92;sin(x) = (&#92;exp(ix) - &#92;exp(-ix))/(2i)' title='&#92;sin(x) = (&#92;exp(ix) - &#92;exp(-ix))/(2i)' class='latex' />.</p>
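<p>Spoiler for the cosine half, as a Python sketch: expanding the n-th power of the hint with the binomial theorem gives terms of the form exp(i(n-2k)x), whose imaginary parts cancel in pairs. The function name <code>cos_power</code> is made up, and <code>math.comb</code> needs Python 3.8 or later; the sine case works the same way but has extra factors of i to keep track of, so it is left alone here.</p>

```python
import math

def cos_power(x, n):
    """cos(x)**n via the binomial expansion of ((exp(ix) + exp(-ix)) / 2)**n."""
    # exp(ix)**(n-k) * exp(-ix)**k = exp(i(n-2k)x). Terms k and n-k are complex
    # conjugates with equal binomial weights, so only the cosines survive.
    total = sum(math.comb(n, k) * math.cos((n - 2 * k) * x) for k in range(n + 1))
    return total / 2 ** n

print(math.isclose(cos_power(0.9, 5), math.cos(0.9) ** 5))
```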
<p>And I think that&#8217;s enough for now. (At some later point, I might do an extra post about one of the sneakier trig techniques: the <a href="http://en.wikipedia.org/wiki/Weierstrass_substitution">Weierstrass substitution</a>).</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/05/13/trig-identities-from-complex/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>
	</item>
		<item>
		<title>64-bit mode and 3-operand instructions</title>
		<link>http://fgiesen.wordpress.com/2013/03/13/extra-registers-and-3-operand-instructions/</link>
		<comments>http://fgiesen.wordpress.com/2013/03/13/extra-registers-and-3-operand-instructions/#comments</comments>
		<pubDate>Wed, 13 Mar 2013 08:25:50 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=2012</guid>
		<description><![CDATA[One interesting thing about x86 is that it&#8217;s changed two major architectural &#8220;magic values&#8221; in the past 10 years. The first is the addition of 64-bit mode, which not only widens all general-purpose registers and gives a much larger virtual address space, it also increases the number of general-purpose and XMM registers from 8 to [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=2012&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>One interesting thing about x86 is that it&#8217;s changed two major architectural &#8220;magic values&#8221; in the past 10 years. The first is the addition of 64-bit mode, which not only widens all general-purpose registers and gives a much larger virtual address space, it also increases the number of general-purpose and XMM registers from 8 to 16. The second is AVX, which allows all SSE (and other SIMD) instructions to be encoded using non-destructive 3-operand forms instead of the original 2-operand forms.</p>
<p>Since modern x86 processors are trying really hard to run both 32- and 64-bit code well (and the same goes for SSE vs. AVX), this gives us an opportunity to compare the relative performance of these choices on a reasonably level playing field, when running the same (C++) code. Of course, this is nowhere near a perfect comparison, especially since switching from 32 to 64 bits also changes the sizes of pointers and (at the very least) the code generator used by the compiler, but it&#8217;s still interesting to be able to do the experiment on a single machine with no fuss. So, without further ado, here&#8217;s a quick comparison using the <a href="http://fgiesen.wordpress.com/2013/03/10/optimizing-software-occlusion-culling-the-reckoning/">Software Occlusion Culling demo</a> I&#8217;ve been writing about for the past month &#8211; a fairly SIMD-heavy workload.</p>
<table>
<tr>
<th>Version</th>
<th>Occlusion cull</th>
<th>Render scene</th>
</tr>
<tr>
<td>x86 (baseline)</td>
<td>2.88ms</td>
<td>1.39ms</td>
</tr>
<tr>
<td>x86, <code>/arch:SSE2</code></td>
<td>2.88ms (+0.2%)</td>
<td>1.48ms (+5.8%)</td>
</tr>
<tr>
<td>x86, <code>/arch:AVX</code></td>
<td>2.77ms (-3.8%)</td>
<td>1.43ms (+2.7%)</td>
</tr>
<tr>
<td>x64</td>
<td>2.71ms (-5.7%)</td>
<td>1.29ms (-7.2%)</td>
</tr>
<tr>
<td>x64, <code>/arch:AVX</code></td>
<td>2.63ms (-8.7%)</td>
<td>1.28ms (-8.5%)</td>
</tr>
</table>
<p>Note that <code>/arch:AVX</code> makes VC++ use AVX forms of SSE vector instructions (i.e. 3-operand), but it&#8217;s all still 4-wide SIMD, not the new 8-wide SIMD floating point. Getting that would require changes to the code. And of course the code uses SSE2 (and, in fact, even SSE4.1) instructions whether we turn on <code>/arch:SSE2</code> on x86 or not &#8211; this only affects how &#8220;regular&#8221; floating-point code is generated. Also, the speedup percentages are computed from the full-precision values, not the truncated values I put in the table. (Which doesn&#8217;t mean much, since I truncated the values to about their level of accuracy)</p>
<p>So what does this tell us? Hard to be sure. It&#8217;s very few data points and I haven&#8217;t done any work to eliminate the effect of e.g. memory layout / code placement, which can be quite significant. And of course I&#8217;ve also changed the compiler. That said, a few observations:</p>
<ul>
<li>Not much of a win turning on <code>/arch:SSE2</code> on the regular x86 code. If anything, the rendering part of the code gets worse from the &#8220;enhanced instruction set&#8221; usage. I did not investigate further.</li>
<li>The 3-operand AVX instructions provide a solid win of a few percentage points in both 32-bit and 64-bit mode. Considering I&#8217;m not using any 8-wide instructions, this is almost exclusively the impact of having fewer register-register move instructions.</li>
<li>Yes, going to 64 bits does make a noticeable difference. Note in particular the dip in rendering time. Whether it&#8217;s due to the overhead of 32-bit thunks on a 64-bit system, better code generation on the app side, better code on the D3D runtime/driver side, or most likely a combination of all these factors, the D3D rendering code sure gets a lot faster. And similarly, the SIMD-heavy occlusion cull code sees a good speed-up too. I have not investigated whether this is primarily due to the extra registers, or due to code generation improvements.</li>
</ul>
<p>I don&#8217;t think there&#8217;s any particular lesson here, but it&#8217;s definitely interesting.</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/03/13/extra-registers-and-3-operand-instructions/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>
	</item>
		<item>
		<title>On ePub/PDF versions of my posts and licensing</title>
		<link>http://fgiesen.wordpress.com/2013/03/10/on-epubpdf-versions-of-my-posts-and-licensing/</link>
		<comments>http://fgiesen.wordpress.com/2013/03/10/on-epubpdf-versions-of-my-posts-and-licensing/#comments</comments>
		<pubDate>Sun, 10 Mar 2013 22:54:33 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=2002</guid>
		<description><![CDATA[I&#8217;ve been asked several times about this, so I wanted to make an &#8220;official&#8221; statement: No, I will not prepare ePub/PDF (&#8220;book&#8221;) versions of posts, particularly the &#8220;A trip through the Graphics Pipeline 2011&#8221; and &#8220;Optimizing Software Occlusion Culling&#8221; series. However, should someone be willing to prepare such a thing, I&#8217;d be very happy to [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=2002&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been asked several times about this, so I wanted to make an &#8220;official&#8221; statement:</p>
<p>No, I will not prepare ePub/PDF (&#8220;book&#8221;) versions of posts, particularly the &#8220;<a href="http://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/">A trip through the Graphics Pipeline 2011</a>&#8221; and &#8220;<a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">Optimizing Software Occlusion Culling</a>&#8221; series. However, should someone be willing to prepare such a thing, I&#8217;d be very happy to provide them with a WordPress extended RSS dump of the site contents (with your comments and all other emails / personal data removed, don&#8217;t worry) and host the results. If you&#8217;re interested in helping, please write a comment with a valid email address and I&#8217;ll get in touch with you.</p>
<p>To clarify the legal situation, I have put both these series into the public domain (using the CC-0 &#8220;license&#8221; waiver). <em>This means you may do with these posts whatever you want</em>. You may edit them, update them or add additional information; you may turn them into an eBook, PDF, or hardcopy book; you may use them as a starting point for a graphics pipeline Wiki, if you are so inclined &#8211; I don&#8217;t have the energy or web development chops to set that kind of thing up, but I&#8217;d be happy to contribute to it if it existed! You may also claim that you wrote them yourself, sell them to a publisher for a million bucks, and invest the proceeds in land mines you bury in a public park. I would rather that you not do these things, but it boils down to this: if you were to do it, would I want to make the whole affair even more unpleasant for myself than it would already be by engaging in complicated and expensive legal proceedings? And my answer to that question is a clear &#8220;no&#8221;.</p>
<p>In fact, my reasons for not preparing eBook versions and for releasing the texts in the public domain are basically the same: I enjoy writing these posts, and I enjoy seeing people read them. I do not enjoy wrestling with publication formats or blogging frameworks, and I certainly don&#8217;t enjoy dealing with legal issues. The reason I can manage to write a few thousand words of technical content a week despite having a full-time job is that I&#8217;ve structured the experience to be as enjoyable and low-friction for me as possible. Last year, I tried editing the &#8220;A trip through the Graphics Pipeline 2011&#8221; series into a book format, and progress was excruciatingly slow, because ultimately it was not a fun task for me; it felt like an unpaid part-time job, so at some point I just stopped.</p>
<p>So this is the deal: I&#8217;m a professional software developer who happens to like writing. But the writing is a pure &#8220;bonus&#8221;; I do it because I enjoy it, but only as long as it&#8217;s on my terms &#8211; I write what I feel like writing, on whatever schedule pleases me, and without any additional process beyond hitting &#8220;Publish&#8221; once I&#8217;m done. I&#8217;ll be happy to help anyone who wants to do more than that, but I&#8217;m not going to do it myself.</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/03/10/on-epubpdf-versions-of-my-posts-and-licensing/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>
	</item>
		<item>
		<title>Optimizing Software Occlusion Culling: The Reckoning</title>
		<link>http://fgiesen.wordpress.com/2013/03/10/optimizing-software-occlusion-culling-the-reckoning/</link>
		<comments>http://fgiesen.wordpress.com/2013/03/10/optimizing-software-occlusion-culling-the-reckoning/#comments</comments>
		<pubDate>Sun, 10 Mar 2013 09:59:27 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1939</guid>
		<description><![CDATA[This post is part of a series &#8211; go here for the index. Welcome back! Last time, I promised to end the series with a bit of reflection on the results. So, time to find out how far we&#8217;ve come! The results Without further ado, here&#8217;s the breakdown of per-frame work at the end of [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=1939&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><em>This post is part of a series &#8211; go <a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Welcome back! Last time, I promised to end the series with a bit of reflection on the results. So, time to find out how far we&#8217;ve come!</p>
<h3>The results</h3>
<p>Without further ado, here&#8217;s the breakdown of per-frame work at the end of the respective posts (names abbreviated), in order:</p>
<table>
<tr>
<th>Post</th>
<th>Cull / setup</th>
<th>Render depth</th>
<th>Depth test</th>
<th>Render scene</th>
<th>Total</th>
</tr>
<tr>
<td>Initial</td>
<td>1.988</td>
<td>3.410</td>
<td>2.091</td>
<td>5.567</td>
<td>13.056</td>
</tr>
<tr>
<td>Write&nbsp;Combining</td>
<td>1.946</td>
<td>3.407</td>
<td>2.058</td>
<td>3.497</td>
<td>10.908</td>
</tr>
<tr>
<td>Sharing</td>
<td>1.420</td>
<td>3.432</td>
<td>1.829</td>
<td>3.490</td>
<td>10.171</td>
</tr>
<tr>
<td>Cache&nbsp;issues</td>
<td>1.045</td>
<td>3.485</td>
<td>1.980</td>
<td>3.420</td>
<td>9.930</td>
</tr>
<tr>
<td>Frustum&nbsp;culling</td>
<td>0.735</td>
<td>3.424</td>
<td>1.812</td>
<td>3.495</td>
<td>9.466</td>
</tr>
<tr>
<td>Depth buffers&nbsp;1</td>
<td>0.740</td>
<td>3.061</td>
<td>1.791</td>
<td>3.434</td>
<td>9.026</td>
</tr>
<tr>
<td>Depth buffers&nbsp;2</td>
<td>0.739</td>
<td>2.755</td>
<td>1.484</td>
<td>3.578</td>
<td>8.556</td>
</tr>
<tr>
<td>Workers&nbsp;1</td>
<td>0.418</td>
<td>2.134</td>
<td>1.354</td>
<td>3.553</td>
<td>7.459</td>
</tr>
<tr>
<td>Workers&nbsp;2</td>
<td>0.197</td>
<td>2.217</td>
<td>1.191</td>
<td>3.463</td>
<td>7.068</td>
</tr>
<tr>
<td>Dataflows</td>
<td>0.180</td>
<td>2.224</td>
<td>0.831</td>
<td>3.589</td>
<td>6.824</td>
</tr>
<tr>
<td>Speculation</td>
<td>0.169</td>
<td>1.972</td>
<td>0.766</td>
<td>3.655</td>
<td>6.562</td>
</tr>
<tr>
<td>Mopping up</td>
<td>0.183</td>
<td>1.940</td>
<td>0.797</td>
<td>1.389</td>
<td>4.309</td>
</tr>
<tr>
<td><b>Total&nbsp;diff.</b></td>
<td>-90.8%</td>
<td>-43.1%</td>
<td>-61.9%</td>
<td>-75.0%</td>
<td>-67.0%</td>
</tr>
<tr>
<td><b>Speedup</b></td>
<td>10.86x</td>
<td>1.76x</td>
<td>2.62x</td>
<td>4.01x</td>
<td>3.03x</td>
</tr>
</table>
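<p>For reference, the last two rows can be recomputed from the first and last data rows. This is a small Python sketch of that arithmetic, not the script used to produce the table; the dictionary and function names are made up, and the values are the rounded ones printed above:</p>

```python
# Per-frame times in ms, copied from the "Initial" and "Mopping up" rows.
initial = {"Cull / setup": 1.988, "Render depth": 3.410,
           "Depth test": 2.091, "Render scene": 5.567}
final = {"Cull / setup": 0.183, "Render depth": 1.940,
         "Depth test": 0.797, "Render scene": 1.389}

def summarize(before, after):
    """Return (relative change in percent, speedup factor)."""
    return (after / before - 1.0) * 100.0, before / after

# One line per column, plus the total over all four phases.
rows = list(initial.items()) + [("Total", sum(initial.values()))]
for name, before in rows:
    after = final.get(name, sum(final.values()))
    change, speedup = summarize(before, after)
    print(f"{name:14s} {change:+6.1f}%  {speedup:5.2f}x")
```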
<p>What, you think that doesn&#8217;t tell you much? Okay, so did I. Have a graph instead:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/post_breakdown1.png"><img src="http://fgiesen.files.wordpress.com/2013/03/post_breakdown1.png?w=497&#038;h=173" alt="Time breakdown over posts" width="497" height="173" class="aligncenter size-large wp-image-1963" /></a></p>
<p>The image is a link to the full-size version that you probably want to look at. Note that in both the table and the image, updating the depth test pass to use the rasterizer improvements is chalked up to &#8220;Depth buffers done quick, part 2&#8221;, not &#8220;The care and feeding of worker threads, part 1&#8221; where I mentioned it in the text.</p>
<p>From the graph, you should clearly see one very interesting fact: the two biggest individual improvements &#8211; the write combining fix at 2.1ms and &#8220;Mopping up&#8221; at 2.2ms &#8211; both affect the <em>D3D rendering code</em>, and don&#8217;t have anything to do with the software occlusion culling code. In fact, it wasn&#8217;t until &#8220;Depth buffers done quick&#8221; that we actually started working on that part of the code. Which makes you wonder&#8230;</p>
<h3>What-if machine</h3>
<p>Is the software occlusion culling actually worth it? That is, how much do we actually get for the CPU time we invest in occlusion culling? To help answer this, I ran a few more tests:</p>
<table>
<tr>
<th>Test</th>
<th>Cull / setup</th>
<th>Render depth</th>
<th>Depth test</th>
<th>Render scene</th>
<th>Total</th>
</tr>
<tr>
<td>Initial</td>
<td>1.988</td>
<td>3.410</td>
<td>2.091</td>
<td>5.567</td>
<td>13.056</td>
</tr>
<tr>
<td>Initial,&nbsp;no&nbsp;occ.</td>
<td>1.433</td>
<td>0.000</td>
<td>0.000</td>
<td>25.184</td>
<td>26.617</td>
</tr>
<tr>
<td>Cherry-pick</td>
<td>1.548</td>
<td>3.462</td>
<td>1.977</td>
<td>2.084</td>
<td>9.071</td>
</tr>
<tr>
<td>Cherry-pick, no&nbsp;occ.</td>
<td>1.360</td>
<td>0.000</td>
<td>0.000</td>
<td>9.883</td>
<td>11.243</td>
</tr>
<tr>
<td>Final</td>
<td>0.183</td>
<td>1.940</td>
<td>0.797</td>
<td>1.389</td>
<td>4.309</td>
</tr>
<tr>
<td>Final,&nbsp;no&nbsp;occ.</td>
<td>0.138</td>
<td>0.000</td>
<td>0.000</td>
<td>6.866</td>
<td>7.004</td>
</tr>
</table>
<p>Yes, the occlusion culling was a solid win both before and after. But the interesting value is the &#8220;cherry-pick&#8221; one. This is the original code with only the following changes applied (okay, and also with the timekeeping code added, in case you feel like nitpicking):</p>
<ul>
<li><a href="https://github.com/rygorous/intel_occlusion_cull/commit/e1839f69cf0680ad3339a5aa0f0b633bf71bcb68">Don&#8217;t read back from the constant buffers we&#8217;re writing</a>. Total diff: 3 lines.</li>
<li><a href="https://github.com/rygorous/intel_occlusion_cull/commit/1e1b5cca743c5ce26d2d5e8570f1ac689b5ce7fb">Don&#8217;t update debug counters in CPUTFrustum</a>. Total diff: 2 lines.</li>
<li><a href="https://github.com/rygorous/intel_occlusion_cull/commit/2504647a050e8c56ef2c4b4e03cce2ca7608343e">Use only one dynamic constant buffer</a>. Total diff: 10 lines changed, 8 added.</li>
<li><a href="https://github.com/rygorous/intel_occlusion_cull/commit/b4e29b2dfb43a040a9eb5ed5c074092766fe4ba7">Load materials only once</a>. Total diff: 7 lines changed, 1 added.</li>
<li><a href="https://github.com/rygorous/intel_occlusion_cull/commit/464503ca5bd657d7d6c6dc9e8a9144e1f223a278">Share materials instead of cloning them</a>. Total diff: 3 lines changed.</li>
<li><a href="https://github.com/rygorous/intel_occlusion_cull/commit/aa09c99a361988c1e7dd8765c0cbb9bd3bb5d527">AABBoxRasterizer traversal fix</a> &#8211; keep list of models instead of going over whole database every time. Total diff: 15 lines added, 18 deleted.</li>
</ul>
<p>In other words, &#8220;Cherry-pick&#8221; is within a few dozen lines of the original code, all of the changes are to &#8220;framework&#8221; code not the actual sample, and none of them do anything fancy. Yet it makes the difference between occlusion culling enabled and disabled shrink to about a 1.24x speedup, down from the 2x it was before!</p>
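<p>As a sanity check on those ratios, here is a minimal C++ sketch that recomputes them from the &#8220;Total&#8221; column of the table above. The <code>Config</code> struct and the <code>OcclusionSpeedup</code> helper are names made up for this illustration; they are not part of the sample code.</p>
<pre>
#include &lt;cstdio&gt;

// Frame-time totals in milliseconds, copied from the "Total" column above.
struct Config {
    const char *name;
    double withOcclusion;
    double withoutOcclusion;
};

// Speedup from enabling occlusion culling: time without / time with.
double OcclusionSpeedup(const Config &amp;c)
{
    return c.withoutOcclusion / c.withOcclusion;
}

int main()
{
    const Config configs[] = {
        { "Initial",     13.056, 26.617 },
        { "Cherry-pick",  9.071, 11.243 },
        { "Final",        4.309,  7.004 },
    };

    for (const Config &amp;c : configs)
        std::printf("%-12s %.2fx\n", c.name, OcclusionSpeedup(c));
    return 0;
}
</pre>
<p>That works out to about 2.04x for the initial code, 1.24x for the cherry-picked version and 1.63x for the final one.</p>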
<h3>A brief digression</h3>
<p>This kind of thing is, in a nutshell, the reason why graphics papers really need to come with source code. Anything GPU-related in particular is <em>full</em> of performance cliffs like this. In this case, I had the source code, so I could investigate what was going on, fix a few problems, and get a much more realistic assessment of the gain to expect from this kind of technique. Had it just been a paper claiming a &#8220;2x improvement&#8221;, I would certainly not have been able to reproduce that result &#8211; note that in the &#8220;final&#8221; version, the speedup goes back to about 1.63x, but that&#8217;s with a considerable amount of extra work.</p>
<p>I mention this because it&#8217;s a very common problem: whatever technique the author of a paper is proposing is well-optimized and tweaked to look good, whereas the things that it&#8217;s being compared with are often a very sloppy implementation. The end result is lots of papers that claim &#8220;substantial gains&#8221; over the prior state of the art that somehow never materialize for anyone else. At one extreme, I&#8217;ve had one of my professors state outright at one point that he just stopped giving out source code to his algorithms because the effort invested in getting other people to successfully replicate his old results &#8220;distracted&#8221; him from producing new ones. (I&#8217;m not going to name names here, but he later stated several other things along the same lines, and he&#8217;s probably the number one reason for me deciding against pursuing a career in academia.)</p>
<p>To that kind of attitude, I have only one thing to say: If you care only about producing results and not independent verification, then you may be brilliant, but you are not a scientist, and there&#8217;s a very good chance that your life&#8217;s work is useless to anyone but yourself.</p>
<p>Conversely, exposing your code to outside eyes might not be the optimal way to stroke your ego in case somebody finds an obvious mistake :), but it sure makes your approach a lot more likely to actually become relevant in practice. Anyway, let&#8217;s get back to the subject at hand.</p>
<h3>Observations</h3>
<p>The number one lesson from all of this probably is that there&#8217;s lots of ways to shoot yourself in the foot in graphics, and that it&#8217;s really easy to do so without even noticing it. So don&#8217;t assume, <em>profile</em>. I&#8217;ve used a fancy profiler with event-based sampling (VTune), but even a simple tool like Sleepy will tell you when a small piece of code takes a disproportionate amount of time. You just have to be on the lookout for these things.</p>
<p>Which brings me to the next point: you should always have an expectation of how long things should take. A common misconception is that profilers are primarily useful to identify the hot spots in an application, so you can focus your efforts there. Let&#8217;s have another look at the very first profiler screenshot I posted in this series:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/01/wc_slow1.png"><img src="http://fgiesen.files.wordpress.com/2013/01/wc_slow1.png?w=497&#038;h=300" alt="Reading from write-combined memory" width="497" height="300" class="aligncenter size-full wp-image-1304" /></a></p>
<p>If I had gone purely by what takes the largest amount of time, I&#8217;d have started with the depth buffer rasterization pass; as you should well recall, it took me several posts to explain what&#8217;s even going on in that code, and as you can see from the chart above, while we got a good win out of improving it (about 1.1ms total), doing so took lots of individual changes. Compare with what I <em>actually</em> worked on first &#8211; namely, the Write Combining issue, which gave us a 2.1ms win for a three-line change.</p>
<p>So what&#8217;s the secret? Don&#8217;t use a profile exclusively to look for hot spots. In particular, if your profile has the hot spots you expected (like the depth buffer rasterizer in this example), they&#8217;re not worth more than a quick check to see if there&#8217;s any obvious waste going on. What you really want to look for are <em>anomalies</em>: code that seems to be running into execution issues (like <code>SetRenderStates</code> with the read-back from write-combined memory running at over 9 cycles per instruction), or things that just shouldn&#8217;t take as much time as they seem to (like the frustum culling code we looked at for the next few posts). If used correctly, a profiler is a powerful tool not just for performance tuning, but also to find deeper underlying architectural issues.</p>
<h3>While you&#8217;re at it&#8230;</h3>
<p>Anyway, once you&#8217;ve picked a suitable target, I recommend doing more than just the minimum work necessary to knock it out of the top 10 (or some other arbitrary cut-off). After &#8220;<a href="http://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">Frustum culling: turning the crank</a>&#8221;, a commenter asked why I would spend the extra time optimizing a function that was, at the time, only at the #10 spot in the profile. A perfectly valid question, but one I have three separate answers to:</p>
<p>First, the answer I gave in the comments at the time: code is not isolated from everything else; it exists in a context. A lot of the time in optimizing code (or even just reading it, for that matter) is spent building up a mental model of what&#8217;s going on and how it relates to the rest of the system. The best time to make changes to code is while that mental model is still current; if you drop the topic and work somewhere else for a bit, you&#8217;ll have to redo at least part of that work. So if you have ideas for further improvements while you&#8217;re working on code, that&#8217;s a good time to try them out (once you&#8217;ve finished your current task, anyway). If you run out of ideas, or if you notice you&#8217;re starting to micro-optimize where you really shouldn&#8217;t, then stop. But by all means continue while the going is good; even if you don&#8217;t need that code to be faster now, you might want it later.</p>
<p>Second, never mind the relative position. As you can see in the table above, the &#8220;advanced&#8221; frustum culling changes reduced the total frame time by about 0.3ms. That&#8217;s about as much as we got out of our first set of depth buffer rendering changes, even though it was much simpler work. Particularly for games, where you usually have a set frame rate target, you don&#8217;t particularly care where exactly you get the gains from; 0.3ms less is 0.3ms less, no matter whether it&#8217;s done by speeding up one of the Top 10 functions slightly or something else substantially!</p>
<p>Third, relating to my comment about looking for anomalies above: unless there&#8217;s a really stupid mistake somewhere, it&#8217;s fairly likely that the top 10, or top 20, or top whatever hot spots are actually code that does substantial work &#8211; certainly so for code that other people have already optimized. However, most people do tend to work on the hot spots first when looking to improve performance. My favorite sport when optimizing code is starting in the middle ranks: while everyone else is off banging their head against the hard problems, I will casually snipe at functions in the 0.05%-1.0% total run time range. This has two advantages: first, you can often get rid of a lot of these functions entirely. Even if it&#8217;s only 0.2% of your total time, if you manage to get rid of it, that&#8217;s 0.2% that are gone. It&#8217;s usually a lot easier to get rid of a 0.2% function than it is to squeeze an extra 2% out of a 10%-run time function that 10 people have already looked at. And second, the top hot spots are usually in leafy code. But down in the middle ranks is &#8220;middle management&#8221; &#8211; code that&#8217;s just passing data around, maybe with some minor reformatting. That&#8217;s your entry point to re-designing data flows: this is the code where subsystems meet &#8211; the place where restructuring will make a difference. When optimizing interfaces, it&#8217;s crucial to be working on the interfaces that actually have problems, and this is how you find them.</p>
<h3>Ground we&#8217;ve covered</h3>
<p>Throughout this series, my emphasis has been on changes that are fairly high-yield but have low impact in terms of how much disruption they cause. I also made no substantial algorithmic changes. That was fully intentional, but it might be surprising; after all, as any (good) text covering optimization will tell you, it&#8217;s much more important to get your algorithms right than it is to fine-tune your code. So why this bias?</p>
<p>Again, I did this for a reason: while algorithmic changes are indeed the ticket when you need large speed-ups, they&#8217;re also very context-sensitive. For example, instead of optimizing the frustum culling code the way I did &#8211; by making the code more SIMD- and cache-friendly &#8211; I could have just switched to a bounding volume hierarchy instead. And normally, I probably would have. But there&#8217;s plenty of material on bounding volume hierarchies out there, and I trust you to be able to find it yourself; by now, there&#8217;s also a good amount of Google-able material on &#8220;Data-oriented Design&#8221; (I dislike the term; much like &#8220;Object-oriented Design&#8221;, it means everything and nothing) and designing algorithms and data structures from scratch for good SIMD and cache efficiency.</p>
<p>But I found that there&#8217;s a distinct lack of material for the problem most of us actually face when optimizing: how do I make existing code faster without breaking it or rewriting it from scratch? So my point with this series is that there&#8217;s a lot you can accomplish purely using fairly local and incremental changes. And while the actual changes are specific to the code, the underlying ideas are very much universal, or at least I hope so. And I couldn&#8217;t resist throwing in some low-level architectural material too, which I hope will come in handy. :)</p>
<h3>Changes I intentionally did not make</h3>
<p>So finally, here&#8217;s a list of things I did <em>not</em> discuss in this series, because they were either too invasive, too tricky or changed the algorithms substantially:</p>
<ul>
<li><em>Changing the way the binner works</em>. We don&#8217;t need that much information per triangle, and currently we gather vertices both in the binner and the rasterizer, which is a fairly expensive step. I did implement a variant that writes out signed 16-bit coordinates and the set-up Z plane equation; it saves roughly another 0.1ms in the final rasterizer, but it&#8217;s a fairly invasive change. Code is <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog_past_the_end">here</a> for those who are interested. (I may end up posting other stuff to that branch later, hence the name).</li>
<li><em>A hierarchical rasterizer for the larger triangles</em>. Another thing I <a href="https://github.com/rygorous/intel_occlusion_cull/tree/hier_rast">implemented</a> (note this branch is based off a pre-blog version of the codebase) but did not feel like writing about because it took a lot of effort to deliver, ultimately, fairly little gain.</li>
<li><em>Other rasterizer techniques or tweaks</em>. I could have implemented a scanline rasterizer, or a different traversal strategy, or a dozen other things. I chose not to; I wanted to write an introduction to edge-function rasterizers, since they&#8217;re cool, simple to understand and less well-known than they should be, and this series gave me a good excuse. I did not, however, want to spend more time on actual rasterizer optimization than the two posts I wrote; it&#8217;s easy to spend years of your life on that kind of stuff (I&#8217;ve seen it happen!), but there&#8217;s a point to be made that this series was already too long, and I did not want to stretch it even further.</li>
<li><em>Directly rasterizing quads in the depth test rasterizer</em>. The depth test rasterizer only handles boxes, which are built from 6 quads. It&#8217;s possible to build an edge function rasterizer that directly traverses quads instead of triangles. Again, I wrote the code (not on Github this time) but decided against writing about it; while the basic idea is fairly simple, it turned out to be really ugly to make it work in a &#8220;drop-in&#8221; fashion with the rest of the code. See <a href="http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/#comment-2466">this comment</a> and my reply for a few extra details.</li>
<li><em>Ray-trace the boxes in the test pass instead of rasterizing them</em>. Another suggestion by <a href="http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/#comment-2466">Doug</a>. It&#8217;s a cool idea and I think it has potential, but I didn&#8217;t try it.</li>
<li><em>Render a lower-res depth buffer using very low-poly, conservative models</em>. This is how I&#8217;d actually use this technique for a game; I think bothering with a full-size depth buffer is just a waste of memory bandwidth and processing time, and we do spend a fair amount of our total time just transforming vertices too. Nor is there a big advantage to using the more detailed models for culling. That said, changing this would have required dedicated art for the low-poly occluders (which I didn&#8217;t want to do); it also would&#8217;ve violated my &#8220;no-big-changes&#8221; rule for this series. Both these changes are definitely worth looking into if you want to ship this in a game.</li>
<li><em>Try other occlusion culling techniques</em>. Out of the (already considerably bloated) scope of this series.</li>
</ul>
<p>And that&#8217;s it! I hope you had as much fun reading these posts as I did writing them. But it&#8217;s back to your regularly scheduled, piecemeal blog fare, at least for the time being! Should I feel the urge to write another novella-sized series of posts again in the near future, I&#8217;ll be sure to let you all know by the point I&#8217;m, oh, nine posts in or so.</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/03/10/optimizing-software-occlusion-culling-the-reckoning/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/post_breakdown1.png?w=497" medium="image">
			<media:title type="html">Time breakdown over posts</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/01/wc_slow1.png" medium="image">
			<media:title type="html">Reading from write-combined memory</media:title>
		</media:content>
	</item>
		<item>
		<title>Mopping up</title>
		<link>http://fgiesen.wordpress.com/2013/03/05/mopping-up/</link>
		<comments>http://fgiesen.wordpress.com/2013/03/05/mopping-up/#comments</comments>
		<pubDate>Tue, 05 Mar 2013 10:10:59 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1908</guid>
		<description><![CDATA[This post is part of a series &#8211; go here for the index. Welcome back! This post is going to be slightly different from the others. So far, I&#8217;ve attempted to group the material thematically, so that each post has a coherent theme (to a first-order approximation, anyway). Well, this one doesn&#8217;t &#8211; this is [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><em>This post is part of a series &#8211; go <a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Welcome back! This post is going to be slightly different from the others. So far, I&#8217;ve attempted to group the material thematically, so that each post has a coherent theme (to a first-order approximation, anyway). Well, this one doesn&#8217;t &#8211; this is a collection of everything that didn&#8217;t fit anywhere else. But don&#8217;t worry, there&#8217;s still some good stuff in here! That said, one warning: there&#8217;s a bunch of poking around in the framework code this time, and it didn&#8217;t come with docs, so I&#8217;m honestly not quite sure how some of the internals are supposed to work. So the code changes referenced this time are definitely on the hacky side of things.</p>
<h3>The elephant in the room</h3>
<p>Featured quite near the top of all the profiles we&#8217;ve seen so far are two functions I haven&#8217;t talked about before:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_render.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_render.png?w=497&#038;h=264" alt="Rendering hot spots" width="497" height="264" class="aligncenter size-full wp-image-1909" /></a></p>
<p>In case you&#8217;re wondering, the <code>VIDMM_Global::ReferenceDmaBuffer</code> is what used to be just &#8220;<code>[dxgmms1.sys]</code>&#8221; in the previous posts; I&#8217;ve set up VTune to use the symbol server to get debug symbols for this driver. Now, I haven&#8217;t talked about this code before because it&#8217;s part of the GPU rendering, not the software rasterizer, but let&#8217;s broaden our scope one final time.</p>
<p>What you can see here is the video memory manager going over the list of resources (vertex/index buffers, constant buffers, textures, and so forth) referenced by a DMA buffer (which is what WDDM calls GPU command buffers in the native format) and <em>completely</em> blowing out the cache; each resource has some amount of associated metadata that the memory manager needs to look at (and possibly update), and it turns out there&#8217;s <em>many</em> of them. The cache is not amused.</p>
<p>So, what can we do to use fewer resources? There&#8217;s lots of options, but one thing I had noticed while measuring loading time is that there&#8217;s one dynamic constant buffer per model:</p>
<pre>
// Create the model constant buffer.
HRESULT hr;
D3D11_BUFFER_DESC bd = {0};
bd.ByteWidth = sizeof(CPUTModelConstantBuffer);
bd.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
bd.Usage = D3D11_USAGE_DYNAMIC;
bd.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
hr = (CPUT_DX11::GetDevice())-&gt;CreateBuffer( &amp;bd, NULL,
    &amp;mpModelConstantBuffer );
ASSERT( !FAILED( hr ), _L("Error creating constant buffer.") );
</pre>
<p>Note that they&#8217;re all the same size, and it turns out that all of them get updated (using a <code>Map</code> with <code>DISCARD</code>) immediately before they get used for rendering. And because there&#8217;s about 27000 models in this example, we&#8217;re talking about a lot of constant buffers here.</p>
<p>What if we instead just created one dynamic model constant buffer, and shared it between all the models? It&#8217;s a fairly simple change to make, if you&#8217;re willing to do it in a hacky fashion (as said, not very clean code this time). For this test, I took the liberty of adding some timing around the actual D3D rendering code as well, so we can compare. It&#8217;s probably gonna make a difference, but how much can it be, really?</p>
<p><b>Change:</b> Single shared dynamic model constant buffer</p>
<table>
<tr>
<th>Render scene</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Original</td>
<td>3.392</td>
<td>3.501</td>
<td>3.551</td>
<td>3.618</td>
<td>4.155</td>
<td>3.586</td>
<td>0.137</td>
</tr>
<tr>
<td>One dynamic CB</td>
<td>2.474</td>
<td>2.562</td>
<td>2.600</td>
<td>2.644</td>
<td>3.043</td>
<td>2.609</td>
<td>0.068</td>
</tr>
</table>
<p>It turns out that reducing the number of distinct constant buffers referenced per frame by several thousand is a pretty big deal. Drivers work hard to make constant buffer <code>DISCARD</code> really, really fast, and they make sure that the underlying allocations get handled quickly. And discarding a single constant buffer a thousand times in a frame works out to be a lot faster than discarding a thousand constant buffers once each.</p>
<p>Lesson learned: for &#8220;throwaway&#8221; constant buffers, it&#8217;s a good idea to design your renderer so it only allocates one underlying D3D constant buffer per size class. More are not necessary and can (evidently) induce a substantial amount of overhead. D3D11.1 adds a few features that allow you to further reduce that count down to a single constant buffer that&#8217;s used the same way that dynamic vertex/index buffers are; as you can see, there&#8217;s a reason. Here&#8217;s the profile after this single fix:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_render_dyncb.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_render_dyncb.png?w=497&#038;h=266" alt="Render after dynamic CB fix" width="497" height="266" class="aligncenter size-full wp-image-1918" /></a></p>
<p>Still a lot of time spent in the driver and the video memory manager, but if you compare the raw cycle counts with the previous image, you can see that this change really made quite a dent.</p>
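<p>To make the &#8220;one constant buffer per size class&#8221; idea a bit more concrete, here is a rough, API-agnostic sketch. Everything in it is hypothetical: <code>DynamicConstantBuffer</code> is a stand-in for a real dynamic D3D11 buffer, and <code>ConstantBufferPool</code> is a name I made up for this illustration, not something from the sample. The only D3D-specific fact it relies on is that constant buffer sizes have to be multiples of 16 bytes.</p>
<pre>
#include &lt;cstddef&gt;
#include &lt;map&gt;
#include &lt;memory&gt;

// Hypothetical stand-in for a dynamic D3D constant buffer; real code would
// wrap an ID3D11Buffer created with D3D11_USAGE_DYNAMIC here.
struct DynamicConstantBuffer {
    explicit DynamicConstantBuffer(size_t bytes) : byteWidth(bytes) {}
    size_t byteWidth;
};

// Hands out one shared dynamic buffer per size class instead of one per model.
class ConstantBufferPool {
public:
    DynamicConstantBuffer *Get(size_t byteWidth)
    {
        // Round up to the next multiple of 16 bytes to get the size class.
        size_t sizeClass = (byteWidth + 15) &amp; ~size_t(15);
        std::unique_ptr&lt;DynamicConstantBuffer&gt; &amp;slot = mBuffers[sizeClass];
        if (!slot)
            slot.reset(new DynamicConstantBuffer(sizeClass)); // first request
        return slot.get();
    }

    size_t NumBuffers() const { return mBuffers.size(); }

private:
    std::map&lt;size_t, std::unique_ptr&lt;DynamicConstantBuffer&gt;&gt; mBuffers;
};
</pre>
<p>With something like this, every model asking for a buffer of the same size gets the same underlying object back, and would update it with a <code>DISCARD</code> map right before drawing, which is the usage pattern measured above.</p>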
<h3>Loading time</h3>
<p>This was (for the most part) something I worked on just to make my life easier &#8211; as you can imagine, while writing this series, I&#8217;ve recorded lots of profiling and test runs, and the loading time is a fixed cost I pay every time. I won&#8217;t go in depth here, but I still want to give a brief summary of the changes I made and why. If you want to follow along, the changes in the source code start at the &#8220;<a href="https://github.com/rygorous/intel_occlusion_cull/commit/5d4f83887034761c47bdd03ff4c834d7f24adc59">Track loading time</a>&#8221; commit.</p>
<h4>Initial: 9.29s</h4>
<p>First, I simply added a timer and code to print the loading time to the debug output window.</p>
<h4>Load materials once, not once per model: 4.54s</h4>
<p>One thing I noticed way back in January when I did my initial testing was that most materials seem to get loaded multiple times; there seems to be logic in the asset library code to avoid loading materials multiple times, but it didn&#8217;t appear to work for me. So I modified the code to actually load each material only once and then create copies when requested. As you can see, <a href="https://github.com/rygorous/intel_occlusion_cull/commit/b4e29b2dfb43a040a9eb5ed5c074092766fe4ba7">this change</a> by itself roughly cut loading times in half.</p>
<h4>FindAsset optimizations: 4.32s</h4>
<p><code>FindAsset</code> is the function used in the asset manager to actually look up resources by name. With two simple changes to avoid unnecessary <a href="https://github.com/rygorous/intel_occlusion_cull/commit/0b25f7de67f2631ac09456679f4857e86fdd5566">path name resolution</a> and <a href="https://github.com/rygorous/intel_occlusion_cull/commit/40bde879d627ff4e129624a7230255656087f21a">string compares</a>, the loading time loses another 200ms.</p>
<h4>Better config file loading: 2.54s</h4>
<p>I mentioned this in &#8220;<a href="http://fgiesen.wordpress.com/2013/01/30/a-string-processing-rant/">A string processing rant</a>&#8221;, but didn&#8217;t actually merge the changes into the blog branch so far. Well, here you go: with <a href="https://github.com/rygorous/intel_occlusion_cull/commit/9b7648b1a1ba5b7c8e419645a2878491f36faa4e">these</a> <a href="https://github.com/rygorous/intel_occlusion_cull/commit/b5a62433664f5480ede40ab8f1945f3bb999e919">three</a> <a href="https://github.com/rygorous/intel_occlusion_cull/commit/574e48e49ba09399420f43244576d8dbf50d4391">commits</a> that together rewrite a substantial portion of the config file reading, we lose almost another 2 seconds. Yes, that was <em>2 whole seconds</em> worth of unnecessary allocations and horribly inefficient string handling. I wrote that rant for a reason.</p>
<h4>Improve shader input layout cache: 2.03s</h4>
<p>D3D11 wants shader input layouts to be created with a pointer to the bytecode of the shader it&#8217;s going to be used with, to handle vertex format to shader binding. The &#8220;shader input layout cache&#8221; is just an internal cache to produce such input layouts for all unique combinations of vertex formats and shaders we use. The original implementation of this cache was fairly inefficient, but the code already contained a &#8220;TODO&#8221; comment with instructions on how to fix it. In <a href="https://github.com/rygorous/intel_occlusion_cull/commit/b10993347b5ff983306f644dafd636961f266e47">this commit</a>, I implemented that fix.</p>
<h4>Reduce temporary strings: 1.88s</h4>
<p>There were still a bunch of unnecessary string temporaries being created, which I found simply by looking at the call stack profiles of <code>free</code> calls during the loading phase (yet another useful application for profilers)! <a href="https://github.com/rygorous/intel_occlusion_cull/commit/bbbfb89a304c14617e58cb2cf1e0fa16bfe322a8">Two</a> <a href="https://github.com/rygorous/intel_occlusion_cull/commit/beb92aaefdfe1a06f2c0daa87627fcf550078488">commits</a> later, this problem was resolved too.</p>
<h4>Actually share materials: 1.46s</h4>
<p>Finally, <a href="https://github.com/rygorous/intel_occlusion_cull/commit/464503ca5bd657d7d6c6dc9e8a9144e1f223a278">this commit</a> goes one step further than just loading the materials once: it also actually shares the same material instance between all its users (the previous version created copies). <em>This is not necessarily a safe change to make</em>. I have no idea what invariants the asset manager tries to enforce, if any. Certainly, this would cause problems if someone were to start modifying materials after loading &#8211; you&#8217;d need to introduce copy-on-write or something similar. But in our case (i.e. the Software Occlusion Culling demo), the materials do not get modified after loading, and sharing them is completely safe.</p>
<p>Not only does this reduce loading time by another 400ms, it also makes rendering a lot faster, because suddenly there&#8217;s a lot fewer cache misses when setting up shaders and render states for the individual models:</p>
<p><b>Change:</b> Share materials.</p>
<table>
<tr>
<th>Render scene</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Original</td>
<td>3.392</td>
<td>3.501</td>
<td>3.551</td>
<td>3.618</td>
<td>4.155</td>
<td>3.586</td>
<td>0.137</td>
</tr>
<tr>
<td>One dynamic CB</td>
<td>2.474</td>
<td>2.562</td>
<td>2.600</td>
<td>2.644</td>
<td>3.043</td>
<td>2.609</td>
<td>0.068</td>
</tr>
<tr>
<td>Share materials</td>
<td>1.870</td>
<td>1.922</td>
<td>1.938</td>
<td>1.964</td>
<td>2.331</td>
<td>1.954</td>
<td>0.057</td>
</tr>
</table>
<p>Again, this is somewhat extreme because there&#8217;s so many different models around, but it illustrates the point: you really want to make sure there&#8217;s no unnecessary duplication of data used during rendering; you&#8217;re going to be missing the cache enough during regular rendering as it is.</p>
<p>And at that point, I decided that I could live with 1.5 seconds of loading time, so I didn&#8217;t pursue the matter any further. :)</p>
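<p>For reference, the &#8220;load once, then share&#8221; approach from this section boils down to something like the following sketch. This is not the CPUT asset library code; <code>Material</code> and <code>MaterialCache</code> are hypothetical names, and the actual loading work is reduced to a placeholder. Handing out pointers to <code>const</code> is one way to spell out the assumption that nobody modifies a material after loading.</p>
<pre>
#include &lt;map&gt;
#include &lt;memory&gt;
#include &lt;string&gt;

// Hypothetical material type; a real one would hold shaders, textures and
// render states.
struct Material {
    std::string name;
};

// Loads each material at most once and shares the instance between all users.
// Only safe as long as nobody modifies a material after loading.
class MaterialCache {
public:
    std::shared_ptr&lt;const Material&gt; Get(const std::string &amp;name)
    {
        auto it = mMaterials.find(name);
        if (it != mMaterials.end())
            return it-&gt;second; // already loaded: share it

        // Placeholder for the expensive part (parsing files, creating shaders).
        auto material = std::make_shared&lt;Material&gt;();
        material-&gt;name = name;
        ++mNumLoads;
        mMaterials[name] = material;
        return material;
    }

    int NumLoads() const { return mNumLoads; }

private:
    std::map&lt;std::string, std::shared_ptr&lt;const Material&gt;&gt; mMaterials;
    int mNumLoads = 0;
};
</pre>
<p>Requesting the same name twice returns the same instance and only triggers one load, so all models using a material end up touching the same data during rendering.</p>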
<h3>The final rendering tweak</h3>
<p>There&#8217;s one more function with a high number of cache misses in the profiles I&#8217;ve been running, even though it&#8217;s never been at the top. That function is <code>AABBoxRasterizerSSE::RenderVisible</code>, which uses the (post-occlusion-test) visibility information to render all visible models. Here&#8217;s the code:</p>
<pre>
void AABBoxRasterizerSSE::RenderVisible(CPUTAssetSet **pAssetSet,
    CPUTRenderParametersDX &amp;renderParams,
    UINT numAssetSets)
{
    int count = 0;

    for(UINT assetId = 0, modelId = 0; assetId &lt; numAssetSets; assetId++)
    {
        for(UINT nodeId = 0; nodeId &lt; GetAssetCount(); nodeId++)
        {
            CPUTRenderNode* pRenderNode = NULL;
            CPUTResult result = pAssetSet[assetId]-&gt;GetAssetByIndex(nodeId, &amp;pRenderNode);
            ASSERT((CPUT_SUCCESS == result), _L ("Failed getting asset by index")); 
            if(pRenderNode-&gt;IsModel())
            {
                if(mpVisible[modelId])
                {
                    CPUTModelDX11* model = (CPUTModelDX11*)pRenderNode;
                    model = (CPUTModelDX11*)pRenderNode;
                    model-&gt;Render(renderParams);
                    count++;
                }
                modelId++;			
            }
            pRenderNode-&gt;Release();
        }
    }
    mNumCulled =  mNumModels - count;
}
</pre>
<p>This code first enumerates all <code>RenderNodes</code> (a base class) in the active asset libraries, asks each of them &#8220;are you a model?&#8221;, and if so renders it. This is a construct that I&#8217;ve seen several times before &#8211; but from a performance standpoint, this is a <em>terrible</em> idea. We walk over the whole scene database, do a virtual function call (which means we have to, at the very least, load the cache line containing the vtable pointer) to check if the current item is a model, and only then check if it is culled &#8211; in which case we just ignore it.</p>
<p>That is a stupid game and we should stop playing it.</p>
<p>Luckily, it&#8217;s easy to fix: at load time, we traverse the scene database <em>once</em>, to make a list of all the models. Note the code already does such a pass to initialize the bounding boxes etc. for the occlusion culling pass; all we have to do is set up an extra array that maps <code>modelId</code>s to the corresponding models. Then the actual rendering code turns into:</p>
<pre>
void AABBoxRasterizerSSE::RenderVisible(CPUTAssetSet **pAssetSet,
    CPUTRenderParametersDX &amp;renderParams,
    UINT numAssetSets)
{
    int count = 0;

    for(UINT modelId = 0; modelId &lt; mNumModels; modelId++)
    {
        if(mpVisible[modelId])
        {
            mpModels[modelId]-&gt;Render(renderParams);
            count++;
        }
    }

    mNumCulled =  mNumModels - count;
}
</pre>
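<p>For reference, here&#8217;s a minimal sketch of what that load-time pass could look like. This is not the actual CPUT code: <code>Node</code>, <code>Model</code>, <code>BuildModelList</code> and <code>RenderVisibleModels</code> are hypothetical stand-ins, and the real classes are reference-counted and carry a lot more state. The point is just that the <code>IsModel()</code> virtual call happens once per node at load time instead of once per node per frame:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the scene database types.
struct Node
{
    virtual ~Node() {}
    virtual bool IsModel() const { return false; }
};

struct Model : Node
{
    int timesRendered = 0;
    bool IsModel() const override { return true; }
    void Render() { timesRendered++; }
};

// Load time: walk the scene database once and remember the models,
// in the same order that model IDs get assigned.
std::vector<Model*> BuildModelList(const std::vector<Node*> &nodes)
{
    std::vector<Model*> models;
    for(Node *node : nodes)
    {
        if(node->IsModel())
            models.push_back(static_cast<Model*>(node));
    }
    return models;
}

// Per frame: only touch the models that survived culling.
// Returns the number of culled models.
std::size_t RenderVisibleModels(const std::vector<Model*> &models,
    const std::vector<char> &visible)
{
    std::size_t count = 0;
    for(std::size_t modelId = 0; modelId < models.size(); modelId++)
    {
        if(visible[modelId])
        {
            models[modelId]->Render();
            count++;
        }
    }
    return models.size() - count;
}
```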
<p>That already looks much better. But how much does it help?</p>
<p><b>Change:</b> Cull before accessing models</p>
<table>
<tr>
<th>Render scene</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Original</td>
<td>3.392</td>
<td>3.501</td>
<td>3.551</td>
<td>3.618</td>
<td>4.155</td>
<td>3.586</td>
<td>0.137</td>
</tr>
<tr>
<td>One dynamic CB</td>
<td>2.474</td>
<td>2.562</td>
<td>2.600</td>
<td>2.644</td>
<td>3.043</td>
<td>2.609</td>
<td>0.068</td>
</tr>
<tr>
<td>Share materials</td>
<td>1.870</td>
<td>1.922</td>
<td>1.938</td>
<td>1.964</td>
<td>2.331</td>
<td>1.954</td>
<td>0.057</td>
</tr>
<tr>
<td>Fix RenderVisible</td>
<td>1.321</td>
<td>1.358</td>
<td>1.371</td>
<td>1.406</td>
<td>1.731</td>
<td>1.388</td>
<td>0.047</td>
</tr>
</table>
<p>I rest my case.</p>
<p>And I figure that this nice 2.59x cumulative speedup on the rendering code is a good stopping point for the coding part of this series &#8211; quit while you&#8217;re ahead and all that. There&#8217;s a few more minor fixes (both for actual bugs and speed problems) on <a href="https://github.com/rygorous/intel_occlusion_cull/commits/blog">Github</a>, but it&#8217;s all fairly small change, so I won&#8217;t go into the details.</p>
<p>This series is not yet over, though; we&#8217;ve covered a lot of ground, and every case study should spend some time reflecting on the lessons learned. I also want to explain why I covered what I did, what I left out, and a few notes on the way I tend to approach performance problems. So all that will be in the next and final post of this series. Until then!</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/03/05/mopping-up/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_render.png" medium="image">
			<media:title type="html">Rendering hot spots</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_render_dyncb.png" medium="image">
			<media:title type="html">Render after dynamic CB fix</media:title>
		</media:content>
	</item>
		<item>
		<title>Speculatively speaking</title>
		<link>http://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/</link>
		<comments>http://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/#comments</comments>
		<pubDate>Mon, 04 Mar 2013 11:16:58 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1853</guid>
		<description><![CDATA[This post is part of a series &#8211; go here for the index. Welcome back! Today, it&#8217;s time to take a closer look at the triangle binning code, which we&#8217;ve only seen mentioned briefly so far, and we&#8217;re going to see a few more pitfalls that all relate to speculative execution. Loads blocked by what? [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><em>This post is part of a series &#8211; go <a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Welcome back! Today, it&#8217;s time to take a closer look at the triangle binning code, which we&#8217;ve only seen mentioned briefly so far, and we&#8217;re going to see a few more pitfalls that all relate to <a href="http://en.wikipedia.org/wiki/Speculative_execution">speculative execution</a>.</p>
<h3>Loads blocked by what?</h3>
<p>There&#8217;s one more micro-architectural issue this program runs into that I haven&#8217;t talked about before. Here&#8217;s the obligatory profiler screenshot:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png?w=497&#038;h=279" alt="Store-to-load forwarding issues" width="497" height="279" class="aligncenter size-full wp-image-1856" /></a></p>
<p>The full column name reads &#8220;Loads Blocked by Store Forwarding&#8221;. So, what&#8217;s going on there? For this one, I&#8217;m gonna have to explain a bit first.</p>
<p>So let&#8217;s talk about stores in an out-of-order processor. In this series, we already saw how conditional branches and memory sharing between cores get handled on modern x86 cores: namely, with <em>speculative execution</em>. For branches, the core tries to predict which direction they will go, and automatically starts fetching and executing the corresponding instructions. Similarly, memory accesses are assumed to not conflict with what other cores are doing at the same time, and just march on ahead. But if it later turns out that the branch actually went in the other direction, that there was a memory conflict, or that some exception / hardware interrupt occurred, all the instructions that were executed in the meantime are invalid and their results must be discarded &#8211; the speculation didn&#8217;t pan out. The implicit assumption is that our speculation (branches behave as predicted, memory accesses generally don&#8217;t conflict and CPU exceptions/interrupts are rare) is right most of the time, so it generally pays off to forge ahead, and the savings are worth the occasional extra work of undoing a bunch of instructions when we turned out to be wrong.</p>
<p>But wait, how does the CPU &#8220;undo&#8221; instructions? Well, conceptually it takes a &#8220;snapshot&#8221; of the current machine state every time it&#8217;s about to start an operation that it might later have to undo. If that instruction makes it all the way through the pipeline without incident, it just gets retired normally, the snapshot gets thrown away and we know that our speculation was successful. But if there is a problem somewhere, the machine can just throw away all the work it did in the meantime, rewind back to the snapshot and retry.</p>
<p>Of course, CPUs don&#8217;t actually take full snapshots. Instead, they make use of the out-of-order machinery to do things much more efficiently: out-of-order CPUs have more registers internally than are exposed in the ISA (Instruction Set Architecture), and use a technique called &#8220;register renaming&#8221; to map the small set of architectural registers onto the larger set of physical registers. The &#8220;snapshotting&#8221; then doesn&#8217;t actually need to save register contents; it just needs to keep track of what the current register mapping at the snapshot point was, and make sure that the associated physical registers from the &#8220;before&#8221; snapshot don&#8217;t get reused until the instruction is safely retired.</p>
<p>This takes care of register modifications. We already know what happens with loads from memory &#8211; we just run them, and if it later turns out that the memory contents changed between the load instruction&#8217;s execution and its retirement, we need to re-run that block of code. Stores are the tricky part: we can&#8217;t easily do &#8220;memory renaming&#8221; since memory (unlike registers) is a shared resource, and also unlike registers rarely gets written in whole &#8220;accounting units&#8221; (cache lines) at a time.</p>
<p>The solution is <em>store buffers</em>: when a store instruction is executed, we do all the necessary groundwork &#8211; address translation, access right checking and so forth &#8211; but don&#8217;t actually write to memory just yet; rather, the target address and the associated data bits are written into a store buffer, where they just sit around for a while; the store buffers form a log of all pending writes to memory. Only after the core is sure that the store instruction will actually be executed (branch results etc. are known and no exceptions were triggered) will these values <em>actually</em> be written back to the cache.</p>
<p>Buffering stores this way has numerous advantages (beyond just making speculation easier), and is a technique not just used in out-of-order architectures; there&#8217;s just one problem though: what happens if I run code like this?</p>
<pre>
  mov  [x], eax
  mov  ebx, [x]
</pre>
<p>Assuming no other threads writing to the same memory at the same time, you would certainly hope that at the end of this instruction sequence, <code>eax</code> and <code>ebx</code> contain the same value. But remember that the first instruction (the store) just writes to a store buffer, whereas the second instruction (the load) normally just references the cache. At the very least, we have to detect that this is happening &#8211; i.e., that we are trying to load from an address that currently has a write logged in a store buffer &#8211; but there&#8217;s numerous things we could do with that information.</p>
<p>One option is to simply stall the core and wait until the store is done before the load can start. This is fairly cheap to implement in hardware, but it does slow down the software running on it. This option was chosen by the in-order cores used in the current generation of game consoles, and the result is the dreaded &#8220;Load Hit Store&#8221; stall. It&#8217;s a way to solve the problem, but let&#8217;s just say it won&#8217;t win you many friends.</p>
<p>So x86 cores normally use a technique called &#8220;store to load forwarding&#8221; or just &#8220;store forwarding&#8221;, where loads can actually read data directly from the store buffers, at least under certain conditions. This is much more expensive in hardware &#8211; it adds a <em>lot</em> of wires between the load unit and the store buffers &#8211; but it is far less finicky to use on the software side.</p>
<p>So what are the conditions? The details depend on the core in question. Generally, if you store a value to a naturally aligned location in memory, and do a load with the same size as the store, you can expect store forwarding to work. If you do trickier stuff &#8211; span multiple cache lines, or use mismatched sizes between the loads and stores, for example &#8211; it really does depend. Some of the more recent Intel cores can also forward larger stores into smaller loads (e.g. a DWord read from a location written with <code>MOVDQA</code>) under certain circumstances, for example. The dual case (large load overlapping with smaller stores) is substantially harder though, because it can involve multiple store buffers at the same time, and I currently know of no processor that implements this. And whenever you hit a case where the processor can&#8217;t perform store forwarding, you get the &#8220;Loads Blocked by Store Forwarding&#8221; stall above (effectively, x86&#8217;s version of a Load-Hit-Store).</p>
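<p>To make the forwarding rules a bit more concrete, here&#8217;s a toy software model of the decision a core has to make for every load. To be clear, this is purely an illustration: <code>ToyStoreBuffer</code> and everything in it is made up for this sketch, real hardware does this with parallel address comparators rather than a loop, and the exact rules differ between cores. The model assumes the more permissive behavior described above, where a load can be forwarded whenever the youngest overlapping store covers it completely:</p>

```cpp
#include <cstddef>
#include <vector>

enum LoadSource
{
    LOAD_FROM_CACHE,  // no pending store overlaps the load
    LOAD_FORWARDED,   // data comes straight out of a store buffer
    LOAD_BLOCKED      // overlap that can't be forwarded: stall
};

struct PendingStore
{
    std::size_t addr;
    std::size_t size;
};

struct ToyStoreBuffer
{
    std::vector<PendingStore> pending; // oldest first

    void Store(std::size_t addr, std::size_t size)
    {
        PendingStore s = { addr, size };
        pending.push_back(s);
    }

    LoadSource Load(std::size_t addr, std::size_t size) const
    {
        // Look at the youngest overlapping store first
        for(std::size_t i = pending.size(); i-- > 0; )
        {
            const PendingStore &s = pending[i];
            bool overlaps = addr < s.addr + s.size && s.addr < addr + size;
            if(!overlaps)
                continue;

            // Forward only if this one store supplies every byte of the load
            bool covered = addr >= s.addr && addr + size <= s.addr + s.size;
            return covered ? LOAD_FORWARDED : LOAD_BLOCKED;
        }
        return LOAD_FROM_CACHE;
    }
};
```

<p>In this model, four 4-byte stores followed by one 16-byte load of the same memory come back as <code>LOAD_BLOCKED</code>, while a single 16-byte store followed by a 16-byte load of the same address is forwarded.</p>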
<h3>Revenge of the cycle-eaters</h3>
<p>Which brings us back to the example at hand: what&#8217;s going on in those functions, <code>BinTransformedTrianglesMT</code> in particular? Some investigation of the compiled code shows that the first sign of blocked loads is near these reads:</p>
<pre>
Gather(xformedPos, index, numLanes);
		
vFxPt4 xFormedFxPtPos[3];
for(int i = 0; i &lt; 3; i++)
{
    xFormedFxPtPos[i].X = ftoi_round(xformedPos[i].X);
    xFormedFxPtPos[i].Y = ftoi_round(xformedPos[i].Y);
    xFormedFxPtPos[i].Z = ftoi_round(xformedPos[i].Z);
    xFormedFxPtPos[i].W = ftoi_round(xformedPos[i].W);
}
</pre>
<p>and looking at the code for <code>Gather</code> shows us exactly what&#8217;s going on:</p>
<pre>
void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
    UINT numLanes)
{
    for(UINT l = 0; l &lt; numLanes; l++)
    {
        for(UINT i = 0; i &lt; 3; i++)
        {
            UINT index = mpIndices[(triId * 3) + (l * 3) + i];
            pOut[i].X.lane[l] = mpXformedPos[index].m128_f32[0];
            pOut[i].Y.lane[l] = mpXformedPos[index].m128_f32[1];
            pOut[i].Z.lane[l] = mpXformedPos[index].m128_f32[2];
            pOut[i].W.lane[l] = mpXformedPos[index].m128_f32[3];
        }
    }
}
</pre>
<p>Aha! This is the code that transforms our vertices from the AoS (array of structures) form that&#8217;s used in memory into the SoA (structure of arrays) form we use during binning (and also the two rasterizers). Note that the output vectors are written element by element; then, as soon as we try to read the whole vector into a register, we hit a forwarding stall, because the core can&#8217;t forward the results from the 4 different stores per vector to a single load. It turns out that the other two instances of forwarding stalls run into this problem for the same reason &#8211; during the gather of bounding box vertices and triangle vertices in the rasterizer, respectively.</p>
<p>So how do we fix it? Well, we&#8217;d really like those vectors to be written using full-width SIMD stores instead. Luckily, that&#8217;s not too hard: converting data from AoS to SoA is essentially a matrix transpose, and our typical use case happens to be 4 separate 4-vectors, i.e. a 4&#215;4 matrix; a 4&#215;4 matrix transpose is fairly easy to do in SSE, and Intel&#8217;s intrinsics header file even comes with a macro that implements it. So here&#8217;s the updated <code>Gather</code> that uses an SSE transpose:</p>
<pre>
void TransformedMeshSSE::Gather(vFloat4 pOut[3], UINT triId,
    UINT numLanes)
{
    const UINT *pInd0 = &amp;mpIndices[triId * 3];
    const UINT *pInd1 = pInd0 + (numLanes &gt; 1 ? 3 : 0);
    const UINT *pInd2 = pInd0 + (numLanes &gt; 2 ? 6 : 0);
    const UINT *pInd3 = pInd0 + (numLanes &gt; 3 ? 9 : 0);

    for(UINT i = 0; i &lt; 3; i++)
    {
        __m128 v0 = mpXformedPos[pInd0[i]]; // x0 y0 z0 w0
        __m128 v1 = mpXformedPos[pInd1[i]]; // x1 y1 z1 w1
        __m128 v2 = mpXformedPos[pInd2[i]]; // x2 y2 z2 w2
        __m128 v3 = mpXformedPos[pInd3[i]]; // x3 y3 z3 w3
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        // After transpose:
        pOut[i].X = VecF32(v0); // v0 = x0 x1 x2 x3
        pOut[i].Y = VecF32(v1); // v1 = y0 y1 y2 y3
        pOut[i].Z = VecF32(v2); // v2 = z0 z1 z2 z3
        pOut[i].W = VecF32(v3); // v3 = w0 w1 w2 w3
    }
}
</pre>
<p>Not much to talk about here. The other two instances of this get modified in the exact same way. So how much does it help?</p>
<p><b>Change:</b> Gather using SSE instructions and transpose</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.148</td>
<td>3.208</td>
<td>3.243</td>
<td>3.305</td>
<td>4.321</td>
<td>3.271</td>
<td>0.100</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.934</td>
<td>3.078</td>
<td>3.110</td>
<td>3.156</td>
<td>3.992</td>
<td>3.133</td>
<td>0.103</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.206</td>
<td>2.220</td>
<td>2.228</td>
<td>2.242</td>
<td>2.364</td>
<td>2.234</td>
<td>0.022</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.099</td>
<td>2.119</td>
<td>2.137</td>
<td>2.156</td>
<td>2.242</td>
<td>2.141</td>
<td>0.028</td>
</tr>
</table>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>0.813</td>
<td>0.830</td>
<td>0.839</td>
<td>0.847</td>
<td>0.886</td>
<td>0.839</td>
<td>0.013</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>0.773</td>
<td>0.793</td>
<td>0.802</td>
<td>0.809</td>
<td>0.843</td>
<td>0.801</td>
<td>0.012</td>
</tr>
</table>
<p>So we&#8217;re another 0.13ms down, about 0.04ms of which we gain in the depth testing pass and the remaining 0.09ms in the rendering pass. And a re-run with VTune confirms that the blocked loads are indeed gone:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png?w=497&#038;h=282" alt="Store forwarding fixed" width="497" height="282" class="aligncenter size-full wp-image-1876" /></a></p>
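<p>If the &#8220;AoS to SoA is a transpose&#8221; part isn&#8217;t obvious, here&#8217;s the same data movement written as a plain scalar loop. This is only a reference sketch to show which element ends up where (the names <code>Vec4</code>, <code>SoA4</code> and <code>TransposeAoSToSoA</code> are made up for this example); written like this, it has exactly the element-wise stores we just got rid of, so it&#8217;s not something you&#8217;d want in the actual binner:</p>

```cpp
// One vertex position in AoS form: x y z w next to each other.
struct Vec4
{
    float v[4];
};

// Four vertices in SoA form: all x's together, all y's together, ...
struct SoA4
{
    float X[4], Y[4], Z[4], W[4];
};

// Treat the four input vertices as the rows of a 4x4 matrix;
// the SoA form is then just the columns of that matrix.
SoA4 TransposeAoSToSoA(const Vec4 in[4])
{
    SoA4 out;
    for(int lane = 0; lane < 4; lane++)
    {
        out.X[lane] = in[lane].v[0];
        out.Y[lane] = in[lane].v[1];
        out.Z[lane] = in[lane].v[2];
        out.W[lane] = in[lane].v[3];
    }
    return out;
}
```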
<h3>Vertex transformation</h3>
<p><a href="http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/">Last time</a>, we modified the vertex transform code in the depth test rasterizer to get rid of the z-clamping and simplify the clipping logic. We also changed the logic to make better use of the regular structure of our input vertices. We don&#8217;t have any special structure we can use to make vertex transforms on regular meshes faster, but we definitely can (and should) improve the projection and near-clip logic, turning this:</p>
<pre>
mpXformedPos[i] = TransformCoords(&amp;mpVertices[i].position,
    cumulativeMatrix);
float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3], 0.0000001f);
mpXformedPos[i] = _mm_mul_ps(mpXformedPos[i],
    _mm_set1_ps(oneOverW));
mpXformedPos[i].m128_f32[3] = oneOverW;
</pre>
<p>into this:</p>
<pre>
__m128 xform = TransformCoords(&amp;mpVertices[i].position,
    cumulativeMatrix);
__m128 vertZ = _mm_shuffle_ps(xform, xform, 0xaa);
__m128 vertW = _mm_shuffle_ps(xform, xform, 0xff);
__m128 projected = _mm_div_ps(xform, vertW);

// set to all-0 if near-clipped
__m128 mNoNearClip = _mm_cmple_ps(vertZ, vertW);
mpXformedPos[i] = _mm_and_ps(projected, mNoNearClip);
</pre>
<p>Here, near-clipped vertices are set to the (invalid) x=y=z=w=0, and the binner code can just check for <code>w==0</code> to test whether a vertex is near-clipped instead of having to use the original w tests (which again had a hardcoded near plane value).</p>
<p>This change doesn&#8217;t have any significant impact on the running time, but it does get rid of the hardcoded near plane location for good, so I thought it was worth mentioning.</p>
<h3>Again with the memory ordering</h3>
<p>And if we profile again, we notice there&#8217;s at least one more surprise waiting for us in the binning code:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png?w=497&#038;h=246" alt="Binning Machine Clears" width="497" height="246" class="aligncenter size-full wp-image-1883" /></a></p>
<p>Machine clears? We&#8217;ve seen them before, way back in &#8220;<a href="http://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/">Cores don&#8217;t like to share</a>&#8221;. And yes, they&#8217;re again for memory ordering reasons. What did we do wrong this time? It turns out that the problematic code has been in there since the beginning, and ran just fine for quite a while, but ever since the scheduling optimizations we did in &#8220;<a href="http://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/">The care and feeding of worker threads</a>&#8221;, we now have binning jobs running tightly packed enough to run into memory ordering issues. So what&#8217;s the problem? Here&#8217;s the code:</p>
<pre>
// Add triangle to the tiles or bins that the bounding box covers
int row, col;
for(row = startY; row &lt;= endY; row++)
{
    int offset1 = YOFFSET1_MT * row;
    int offset2 = YOFFSET2_MT * row;
    for(col = startX; col &lt;= endX; col++)
    {
        int idx1 = offset1 + (XOFFSET1_MT * col) + taskId;
        int idx2 = offset2 + (XOFFSET2_MT * col) +
            (taskId * MAX_TRIS_IN_BIN_MT) + pNumTrisInBin[idx1];
        pBin[idx2] = index + i;
        pBinModel[idx2] = modelId;
        pBinMesh[idx2] = meshId;
        pNumTrisInBin[idx1] += 1;
    }
}
</pre>
<p>The problem turns out to be the array <code>pNumTrisInBin</code>. Even though it&#8217;s accessed as 1D, it is effectively a 3D array like this:</p>
<p><code>uint16 pNumTrisInBin[TILE_ROWS][TILE_COLS][BINNER_TASKS]</code></p>
<p>The <code>TILE_ROWS</code> and <code>TILE_COLS</code> parts should be obvious. The <code>BINNER_TASKS</code> needs some explanation though: as you hopefully remember, we try to divide the work between binning tasks so that each of them gets roughly the same number of triangles. Now, before we start binning triangles, we don&#8217;t know which tiles they will go into &#8211; after all, that&#8217;s what the binner is there to find out.</p>
<p>We could have just one output buffer (bin) per tile; but then, whenever two binner tasks simultaneously end up trying to add a triangle to the same tile, they will end up getting serialized because they try to increment the same counter. And even worse, it would mean that the actual order of triangles in the bins would be different between every run, depending on when exactly each thread was running; while not fatal for depth buffers (we just end up storing the max of all triangles rendered to a pixel anyway, which is ordering-invariant) it&#8217;s still a complete pain to debug.</p>
<p>Hence there is one bin per tile per binning worker. We already know that the binning workers get assigned the triangles in the order they occur in the models &#8211; with the 32 binning workers we use, the first binning task gets the first 1/32 of the triangles, the second binning task gets the second 1/32, and so forth. And each binner processes triangles in order. This means that the rasterizer tasks can still process triangles in the original order they occur in the mesh &#8211; first process all triangles inserted by binner 0, then all triangles inserted by binner 1, and so forth. Since they&#8217;re in distinct memory ranges, that&#8217;s easily done. And each bin has a separate triangle counter, so they don&#8217;t interfere, right? Nothing to see here, move along.</p>
<p>Well, except for the bit where coherency is managed on a cache line granularity. Now, as you can see from the above declaration, the triangle counts for all the binner tasks are stored in adjacent 16-bit words; 32 of them, to be precise, one per binner task. So what was the size of a cache line again? 64 bytes, you say?</p>
<p>Oops.</p>
<p>Yep, even though it&#8217;s 32 separate counters, for the purposes of the memory subsystem it&#8217;s just the same as if it was all a single counter per tile (well, it might be slightly better than that if the initial pointer isn&#8217;t 64-byte aligned, but you get the idea).</p>
<p>Luckily for us, the fix is dead easy: all we have to do is shuffle the order of the array indices around.</p>
<p><code>uint16 pNumTrisInBin[BINNER_TASKS][TILE_ROWS][TILE_COLS]</code></p>
<p>We also happen to have 32 tiles total &#8211; which means that now, each binner task gets its own cache line by itself (again, provided we align things correctly). So again, it&#8217;s a really easy fix. The question being &#8211; how much does it help?</p>
<p><b>Change:</b> Change pNumTrisInBin array indexing</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.148</td>
<td>3.208</td>
<td>3.243</td>
<td>3.305</td>
<td>4.321</td>
<td>3.271</td>
<td>0.100</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.934</td>
<td>3.078</td>
<td>3.110</td>
<td>3.156</td>
<td>3.992</td>
<td>3.133</td>
<td>0.103</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>2.842</td>
<td>2.933</td>
<td>2.980</td>
<td>3.042</td>
<td>3.914</td>
<td>3.007</td>
<td>0.125</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.206</td>
<td>2.220</td>
<td>2.228</td>
<td>2.242</td>
<td>2.364</td>
<td>2.234</td>
<td>0.022</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.099</td>
<td>2.119</td>
<td>2.137</td>
<td>2.156</td>
<td>2.242</td>
<td>2.141</td>
<td>0.028</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>1.980</td>
<td>2.008</td>
<td>2.026</td>
<td>2.046</td>
<td>2.172</td>
<td>2.032</td>
<td>0.035</td>
</tr>
</table>
<p>That&#8217;s right, a 0.1ms difference from <em>changing the memory layout of a 1024-entry, 2048-byte array</em>. You really need to be extremely careful with the layout of shared data when dealing with multiple cores at the same time.</p>
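<p>To spell out the arithmetic, here&#8217;s a small sketch that computes which cache line a given counter lands on with both layouts. The constants are assumptions for illustration: the post only gives the totals (32 binner tasks, 32 tiles, 16-bit counters, 64-byte lines), so the 8&#215;4 split into tile rows and columns is a guess, and the array is assumed to start on a cache line boundary. The helper names are made up as well:</p>

```cpp
#include <cstddef>

// Assumed sizes, see the note above.
static const std::size_t TILE_ROWS = 8;
static const std::size_t TILE_COLS = 4;
static const std::size_t BINNER_TASKS = 32;
static const std::size_t CACHE_LINE_BYTES = 64;
static const std::size_t COUNTER_BYTES = 2; // uint16

// Old layout: pNumTrisInBin[TILE_ROWS][TILE_COLS][BINNER_TASKS]
std::size_t OldIndex(std::size_t row, std::size_t col, std::size_t task)
{
    return (row * TILE_COLS + col) * BINNER_TASKS + task;
}

// New layout: pNumTrisInBin[BINNER_TASKS][TILE_ROWS][TILE_COLS]
std::size_t NewIndex(std::size_t row, std::size_t col, std::size_t task)
{
    return (task * TILE_ROWS + row) * TILE_COLS + col;
}

// Cache line that a counter falls into, assuming an aligned array.
std::size_t CacheLineOf(std::size_t index)
{
    return index * COUNTER_BYTES / CACHE_LINE_BYTES;
}
```

<p>With the old layout, <code>CacheLineOf(OldIndex(row, col, task))</code> is the same for all 32 tasks working on a given tile, so they all fight over one line. With the new layout, <code>CacheLineOf(NewIndex(row, col, task))</code> is simply <code>task</code>, no matter which tile gets touched.</p>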
<h3>Once more, with branching</h3>
<p>At this point, the binner is starting to look fairly good, but there&#8217;s one more thing that catches the eye:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png?w=497&#038;h=229" alt="Binning branch mispredicts" width="497" height="229" class="aligncenter size-full wp-image-1894" /></a></p>
<p>Branch mispredictions. Now, the two rasterizers have a legitimate reason to be mispredicting branches some of the time &#8211; they&#8217;re processing triangles with fairly unpredictable sizes, and the depth test rasterizer also has an early-out that&#8217;s hard to predict. But the binner has less of an excuse &#8211; sure, the triangles have very different dimensions measured <em>in 2&#215;2 pixel blocks</em>, but the vast majority of our triangles fit inside one of our (generously sized!) 320&#215;90 pixel tiles. So where are all these branches?</p>
<pre>
for(int i = 0; i &lt; numLanes; i++)
{
    // Skip triangle if area is zero 
    if(triArea.lane[i] &lt;= 0) continue;
    if(vEndX.lane[i] &lt; vStartX.lane[i] ||
       vEndY.lane[i] &lt; vStartY.lane[i]) continue;
			
    float oneOverW[3];
    for(int j = 0; j &lt; 3; j++)
        oneOverW[j] = xformedPos[j].W.lane[i];
			
    // Reject the triangle if any of its verts are outside the
    // near clip plane
    if(oneOverW[0] == 0.0f || oneOverW[1] == 0.0f ||
        oneOverW[2] == 0.0f) continue;

    // ...
}
</pre>
<p>Oh yeah, that. In particular, the first test (which checks for degenerate and back-facing triangles) will reject roughly half of all triangles and can be fairly random (as far as the CPU is concerned). Now, <a href="http://fgiesen.wordpress.com/2013/02/16/depth-buffers-done-quick-part-2/">last time we had an issue with branch mispredicts</a>, we simply removed the offending early-out. That&#8217;s a really bad idea in this case &#8211; any triangles we don&#8217;t reject here, we&#8217;re gonna waste even more work on later. No, these tests really should all be done here.</p>
<p>However, there&#8217;s no need for them to be done like this; right now, we have a whole slew of branches that are all over the map. Can&#8217;t we consolidate the branches somehow?</p>
<p>Of course we can. The basic idea is to do all the tests on 4 triangles at a time, while we&#8217;re still in SIMD form:</p>
<pre>
// Figure out which lanes are active
VecS32 mFront = cmpgt(triArea, VecS32::zero());
VecS32 mNonemptyX = cmpgt(vEndX, vStartX);
VecS32 mNonemptyY = cmpgt(vEndY, vStartY);
VecF32 mAccept1 = bits2float(mFront &amp; mNonemptyX &amp; mNonemptyY);

// All verts must be inside the near clip volume
VecF32 mW0 = cmpgt(xformedPos[0].W, VecF32::zero());
VecF32 mW1 = cmpgt(xformedPos[1].W, VecF32::zero());
VecF32 mW2 = cmpgt(xformedPos[2].W, VecF32::zero());

VecF32 mAccept = and(and(mAccept1, mW0), and(mW1, mW2));
// laneMask == (1 &lt;&lt; numLanes) - 1; - initialized earlier
unsigned int triMask = _mm_movemask_ps(mAccept.simd) &amp; laneMask;
</pre>
<p>Note I changed the &#8220;is not near-clipped&#8221; test from <code>!(w == 0.0f)</code> to <code>w &gt; 0.0f</code>, on account of me knowing that all legal w&#8217;s happen to not just be non-zero, they&#8217;re positive (okay, what really happened is that I forgot to add a &#8220;cmpne&#8221; when I wrote <code>VecF32</code> and didn&#8217;t feel like adding it here). Other than that, it&#8217;s fairly straightforward. We build a mask in vector registers, then turn it into an integer mask of active lanes using <code>MOVMSKPS</code>.</p>
<p>With this, we could turn all the original branches into a single test in the <code>i</code> loop:</p>
<pre>
if((triMask &amp; (1 &lt;&lt; i)) == 0)
    continue;
</pre>
<p>However, we can do slightly better than that: it turns out we can iterate pretty much directly over the set bits in <code>triMask</code>, which means we&#8217;re now down to one single branch in the outer loop &#8211; the loop counter itself. The modified loop looks like this:</p>
<pre>
while(triMask)
{
    int i = FindClearLSB(&amp;triMask);
    // ...
}
</pre>
<p>So what does the magic <code>FindClearLSB</code> function do? It better not contain any branches! But lucky for us, it&#8217;s quite straightforward:</p>
<pre>
// Find index of least-significant set bit in mask
// and clear it (mask must be nonzero)
static int FindClearLSB(unsigned int *mask)
{
    unsigned long idx;
    _BitScanForward(&amp;idx, *mask);
    *mask &amp;= *mask - 1;
    return idx;
}
</pre>
<p>all it takes is <code>_BitScanForward</code> (the VC++ intrinsic for the x86 <code>BSF</code> instruction) and a really old trick for clearing the least-significant set bit in a value. In other words, this compiles into about 3 integer instructions and is completely branch-free. Good enough. So does it help?</p>
<p><b>Change:</b> Less branches in binner</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.148</td>
<td>3.208</td>
<td>3.243</td>
<td>3.305</td>
<td>4.321</td>
<td>3.271</td>
<td>0.100</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.934</td>
<td>3.078</td>
<td>3.110</td>
<td>3.156</td>
<td>3.992</td>
<td>3.133</td>
<td>0.103</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>2.842</td>
<td>2.933</td>
<td>2.980</td>
<td>3.042</td>
<td>3.914</td>
<td>3.007</td>
<td>0.125</td>
</tr>
<tr>
<td>Less branches</td>
<td>2.786</td>
<td>2.879</td>
<td>2.915</td>
<td>2.969</td>
<td>3.706</td>
<td>2.936</td>
<td>0.092</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.206</td>
<td>2.220</td>
<td>2.228</td>
<td>2.242</td>
<td>2.364</td>
<td>2.234</td>
<td>0.022</td>
</tr>
<tr>
<td>SSE Gather</td>
<td>2.099</td>
<td>2.119</td>
<td>2.137</td>
<td>2.156</td>
<td>2.242</td>
<td>2.141</td>
<td>0.028</td>
</tr>
<tr>
<td>Change bin inds</td>
<td>1.980</td>
<td>2.008</td>
<td>2.026</td>
<td>2.046</td>
<td>2.172</td>
<td>2.032</td>
<td>0.035</td>
</tr>
<tr>
<td>Less branches</td>
<td>1.905</td>
<td>1.934</td>
<td>1.946</td>
<td>1.959</td>
<td>2.012</td>
<td>1.947</td>
<td>0.019</td>
</tr>
</table>
<p>That&#8217;s another 0.07ms off the total, for about a 10% reduction in median total cull time for this post, and a 12.7% reduction in median rasterizer time. And for our customary victory lap, here&#8217;s the VTune results after this change:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png"><img src="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png?w=497&#038;h=268" alt="Binning with branching improved" width="497" height="268" class="aligncenter size-full wp-image-1903" /></a></p>
<p>The branch mispredictions aren&#8217;t gone, but we did make a notable dent. It&#8217;s more obvious if you compare the number of clock cycles with the previous image.</p>
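<p>If you&#8217;re not on VC++, the same &#8220;iterate over set bits&#8221; idea can be sketched portably. This is a minimal sketch, not the demo&#8217;s code: <code>LowestSetBitIndex</code> and <code>CollectSetBits</code> are names I made up for illustration, and the non-MSVC path assumes the GCC/Clang builtin <code>__builtin_ctz</code>.</p>

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the least-significant set bit (mask must be nonzero).
// Hypothetical helper for illustration only.
static inline int LowestSetBitIndex(uint32_t mask)
{
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, mask);
    return static_cast<int>(idx);
#else
    return __builtin_ctz(mask); // GCC/Clang builtin (assumption)
#endif
}

// Writes the indices of all set bits in 'mask' to outIndices, lowest
// first, and returns how many there were. The only branch is the
// loop condition itself.
static int CollectSetBits(uint32_t mask, int outIndices[32])
{
    int count = 0;
    while (mask)
    {
        outIndices[count++] = LowestSetBitIndex(mask);
        mask &= mask - 1; // clear the least-significant set bit
    }
    return count;
}
```

<p>In the binner, the loop body would process triangle <code>i</code> directly instead of storing the index; the array here just makes the behavior easy to check.</p>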
<p>And with that, I&#8217;ll conclude this journey into both the triangle binner and the dark side of speculative execution. We&#8217;re also getting close to the end of this series &#8211; the next post will look again at the loading and rendering code we&#8217;ve been intentionally ignoring for most of this series :), and after that I&#8217;ll finish with a summary and wrap-up &#8211; including a list of things I didn&#8217;t cover, and why not.</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_stlf.png" medium="image">
			<media:title type="html">Store-to-load forwarding issues</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_stlf_fixed.png" medium="image">
			<media:title type="html">Store forwarding fixed</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mc.png" medium="image">
			<media:title type="html">Binning Machine Clears</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_mispred.png" medium="image">
			<media:title type="html">Binning branch mispredicts</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/03/hotspots_binning_done.png" medium="image">
			<media:title type="html">Binning with branching improved</media:title>
		</media:content>
	</item>
		<item>
		<title>Reshaping dataflows</title>
		<link>http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/</link>
		<comments>http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/#comments</comments>
		<pubDate>Thu, 28 Feb 2013 11:12:40 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1813</guid>
		<description><![CDATA[This post is part of a series &#8211; go here for the index. Welcome back! So far, we&#8217;ve spent quite some time &#8220;zoomed in&#8221; on various components of the Software Occlusion Culling demo, looking at various micro-architectural pitfalls and individual loops. In the last two posts, we &#8220;zoomed out&#8221; and focused on the big picture: [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=1813&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><em>This post is part of a series &#8211; go <a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>Welcome back! So far, we&#8217;ve spent quite some time &#8220;zoomed in&#8221; on various components of the Software Occlusion Culling demo, looking at various micro-architectural pitfalls and individual loops. In the last two posts, we &#8220;zoomed out&#8221; and focused on the big picture: what work runs when, and how to keep all cores busy. Now, it&#8217;s time to look at what lies in between: the plumbing, if you will. We&#8217;ll be looking at the dataflows between subsystems and modules and how to improve them.</p>
<p>This is one of my favorite topics in optimization, and it&#8217;s somewhat under-appreciated. There&#8217;s plenty of material on how to make loops run fast (although a lot of it is outdated or just wrong, so beware), and at this point there&#8217;s plenty of ways of getting concurrency up and running: there&#8217;s OpenMP, Intel&#8217;s TBB, Apple&#8217;s GCD, Windows Thread Pools and ConcRT for CPU, there&#8217;s OpenCL, CUDA and DirectCompute for jobs that are GPU-suitable, and so forth; you get the idea. The point being that it&#8217;s not hard to find a shrink-wrap solution that gets you up and running, and a bit of profiling (like we just did) is usually enough to tell you what needs to be done to make it all go smoothly.</p>
<p>But back to the topic at hand: improving dataflow. The problem is that, unlike the other two aspects I mentioned, there&#8217;s really no recipe to follow; it&#8217;s very much context-dependent. It basically boils down to looking at both sides of the interface between systems and functions and figuring out if there&#8217;s a better way to handle that interaction. We&#8217;ve seen a bit of that earlier when talking about frustum culling; rather than trying to define it in words, I&#8217;ll just do it by example, so let&#8217;s dive right in!</p>
<h3>A simple example</h3>
<p>A good example is the member variable <code>TransformedAABBoxSSE::mVisible</code>, declared like this:</p>
<pre>
bool *mVisible;
</pre>
<p>A pointer to a bool. So where does that pointer come from?</p>
<pre>
inline void SetVisible(bool *visible){mVisible = visible;}
</pre>
<p>It turns out that the constructor initializes this pointer to <code>NULL</code>, and the only method that ever does anything with <code>mVisible</code> is <code>RasterizeAndDepthTestAABBox</code>, which executes <code>*mVisible = true;</code> if the bounding box is found to be visible. So how does this all get used?</p>
<pre>
mpVisible[i] = false;
mpTransformedAABBox[i].SetVisible(&amp;mpVisible[i]);
if(...)
{
    mpTransformedAABBox[i].TransformAABBox();
    mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
}
</pre>
<p>That&#8217;s it. Those are the only call sites. There&#8217;s really no reason for <code>mVisible</code> to be state &#8211; semantically, it&#8217;s just a return value for <code>RasterizeAndDepthTestAABBox</code>, so that&#8217;s what it should be &#8211; <em>always</em> try to get rid of superfluous state. This doesn&#8217;t even have anything to do with optimization per se; explicit dataflow is easy for programmers to see and reason about, while implicit dataflow (through pointers, members and state) is hard to follow (both for humans and compilers!) and error-prone.</p>
<p>Anyway, making this return value explicit is really basic, so I&#8217;m not gonna walk through the details; you can always look at the <a href="https://github.com/rygorous/intel_occlusion_cull/commit/36fed2dd3d098e4cace8adec67a415139a0049dd">corresponding commit</a>. I won&#8217;t bother benchmarking this change either.</p>
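<p>For readers who don&#8217;t want to dig through the commit, the shape of the change is roughly the following. This is a stripped-down sketch with made-up types (<code>BoxWithState</code>, <code>BoxStateless</code>) and a <code>passes</code> parameter standing in for the actual rasterize-and-test work; it is not the demo&#8217;s code.</p>

```cpp
// Before: the result leaves the function through a pointer that was
// stashed in the object earlier. The caller has to know about that.
struct BoxWithState
{
    bool *mVisible = nullptr;

    void SetVisible(bool *visible) { mVisible = visible; }

    // 'passes' is a stand-in for the real depth test.
    void Test(bool passes)
    {
        if (passes)
            *mVisible = true;
    }
};

// After: the result is the return value, so the dataflow is visible
// right at the call site and the object carries no extra state.
struct BoxStateless
{
    bool Test(bool passes) const { return passes; }
};
```

<p>The second version also can&#8217;t be called with a dangling or <code>NULL</code> pointer, simply because there is no pointer left.</p>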
<h3>A more interesting case</h3>
<p>In the depth test rasterizer, right after determining the bounding box, there&#8217;s this piece of code:</p>
<pre>
for(int vv = 0; vv &lt; 3; vv++) 
{
    // If W (holding 1/w in our case) is not between 0 and 1,
    // then vertex is behind near clip plane (1.0 in our case).
    // If W &lt; 1 (for W&gt;0), and 1/W &lt; 0 (for W &lt; 0).
    VecF32 nearClipMask0 = cmple(xformedPos[vv].W, VecF32(0.0f));
    VecF32 nearClipMask1 = cmpge(xformedPos[vv].W, VecF32(1.0f));
    VecS32 nearClipMask = float2bits(or(nearClipMask0,
        nearClipMask1));

    if(!is_all_zeros(nearClipMask))
    {
        // All four vertices are behind the near plane (we&#039;re
        // processing four triangles at a time w/ SSE)
        return true;
    }
}
</pre>
<p>Okay. The transform code sets things up so that the &#8220;w&#8221; component of the screen-space positions actually contains 1/w; the first part of this code then tries to figure out whether the source vertex was in front of the near plane (i.e. outside the view frustum or not). An ugly wrinkle here is that the near plane is hard-coded to be at 1. Doing this after dividing by w adds extra complications since the code needs to be careful about the signs. And the second comment is outright wrong &#8211; it in fact early-outs when <em>any</em> of the four active triangles has vertex number <code>vv</code> outside the near-clip plane, not when all of them do. In other words, if any of the 4 active triangles get near-clipped, the test rasterizer will just punt and return <code>true</code> (&#8220;visible&#8221;).</p>
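<p>To make the &#8220;any, not all&#8221; point concrete, here&#8217;s a small sketch of the same test written with raw SSE intrinsics. <code>AnyLaneNearClipped</code> is a name I made up; it assumes (as the original does) that each lane holds 1/w for one of the four triangles, and it models <code>!is_all_zeros(mask)</code> as &#8220;movemask is nonzero&#8221;, which is equivalent for compare results since those are all-ones or all-zeros per lane.</p>

```cpp
#include <xmmintrin.h>

// Each lane of oneOverW holds 1/w for the same vertex index of one of
// four triangles. Returns true as soon as ANY lane fails the
// "0 < 1/w < 1" test, which is what the original early-out does.
static bool AnyLaneNearClipped(__m128 oneOverW)
{
    __m128 le0 = _mm_cmple_ps(oneOverW, _mm_set1_ps(0.0f));
    __m128 ge1 = _mm_cmpge_ps(oneOverW, _mm_set1_ps(1.0f));
    __m128 clipped = _mm_or_ps(le0, ge1);

    // One sign bit per lane; nonzero means at least one lane is set.
    return _mm_movemask_ps(clipped) != 0;
}
```

<p>An &#8220;all lanes&#8221; test, which is what the comment claims, would compare the movemask result against <code>0xf</code> instead.</p>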
<p>So here&#8217;s the thing: there&#8217;s really no reason to do this check <em>after</em> we&#8217;re done with triangle setup. Nor do we even have to gather the 3 triangle vertices to discover that one of them is in front of the near plane. A box has 8 vertices, and we&#8217;ll know whether any of them are in front of the near plane as soon as we&#8217;re done transforming them, before we even think about triangle setup! So let&#8217;s look at the function that transforms the vertices:</p>
<pre>
void TransformedAABBoxSSE::TransformAABBox()
{
    for(UINT i = 0; i &lt; AABB_VERTICES; i++)
    {
        mpXformedPos[i] = TransformCoords(&amp;mpBBVertexList[i],
            mCumulativeMatrix);
        float oneOverW = 1.0f/max(mpXformedPos[i].m128_f32[3],
            0.0000001f);
        mpXformedPos[i] = mpXformedPos[i] * oneOverW;
        mpXformedPos[i].m128_f32[3] = oneOverW;
    }
}
</pre>
<p>As we can see, returning 1/w does in fact take a bit of extra work, so we&#8217;d like to avoid it, especially since that 1/w is really only referenced by the near-clip checking code. Also, the code seems to clamp w at some arbitrary small positive value &#8211; which means that the part of the near clip computation in the depth test rasterizer that worries about w&lt;0 is actually unnecessary. This is the kind of thing I&#8217;m talking about &#8211; each piece of code in isolation seems reasonable, but once you look at both sides it becomes clear that the pieces don&#8217;t fit together all that well.</p>
<p>It turns out that after <code>TransformCoords</code>, we&#8217;re in &#8220;homogeneous viewport space&#8221;, i.e. we&#8217;re still in a homogeneous space, but unlike the homogeneous clip space you might be used to from vertex shaders, this one also has the viewport transform baked in. But our viewport transform leaves z alone (we fixed that in the previous post!), so we still have a D3D-style clip volume for z:</p>
<p><img src='http://s0.wp.com/latex.php?latex=0+%5Cle+z+%5Cle+w&amp;bg=f9f7f5&amp;fg=444444&amp;s=0' alt='0 &#92;le z &#92;le w' title='0 &#92;le z &#92;le w' class='latex' /></p>
<p>Since we&#8217;re using a reversed clip volume, the z&le;w constraint is the near-plane one. Note that <em>this</em> test doesn&#8217;t need any special cases for negative signs and also doesn&#8217;t have a hardcoded near-plane location any more: it just automatically uses <a href="http://fgiesen.wordpress.com/2012/08/31/frustum-planes-from-the-projection-matrix/">whatever the projection matrix says</a>, which is the right thing to do!</p>
<p>Even better, if we test for near-clip anyway, there&#8217;s no need to clamp w at all. We know that anything with w&le;0 is outside the near plane, and if a vertex is outside the near plane we&#8217;re not gonna rasterize the box anyway. Now we might still end up dividing by 0, but since we&#8217;re dealing with floats, this is a well-defined operation (it might return infinities or NaNs, but that&#8217;s fine).</p>
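<p>If the &#8220;dividing by zero is fine&#8221; part makes you nervous, here is a tiny sketch of what actually happens. It assumes IEEE 754 floats with the default floating-point environment (no trapping exceptions enabled, no fast-math style compiler flags); <code>ProjectComponent</code> is a hypothetical helper, not something from the demo.</p>

```cpp
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

// Perspective divide for one component, without any clamp on w.
static float ProjectComponent(float value, float w)
{
    return value / w;
}
```

<p>With <code>w == 0</code>, a nonzero <code>value</code> produces an infinity with the appropriate sign and <code>0 / 0</code> produces a NaN. Neither crashes, and since a box with such a vertex never gets rasterized, nobody ever looks at those values.</p>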
<p>And on the subject of not rasterizing the box: as I said earlier, as soon as one vertex is outside the near-plane, we know we&#8217;re going to return <code>true</code> from the depth test rasterizer, so there&#8217;s no point even starting the operation. To facilitate this, we just make <code>TransformAABBox</code> return whether the box should be rasterized or not. Putting it all together:</p>
<pre>
bool TransformedAABBoxSSE::TransformAABBox()
{
    __m128 zAllIn = _mm_castsi128_ps(_mm_set1_epi32(~0));

    for(UINT i = 0; i &lt; AABB_VERTICES; i++)
    {
        __m128 vert = TransformCoords(&amp;mpBBVertexList[i],
            mCumulativeMatrix);

        // We have inverted z; z is inside of near plane iff z &lt;= w.
        __m128 vertZ = _mm_shuffle_ps(vert, vert, 0xaa); //vert.zzzz
        __m128 vertW = _mm_shuffle_ps(vert, vert, 0xff); //vert.wwww
        __m128 zIn = _mm_cmple_ps(vertZ, vertW);
        zAllIn = _mm_and_ps(zAllIn, zIn);

        // project
        mpXformedPos[i] = _mm_div_ps(vert, vertW);
    }

    // return true if and only if all verts inside near plane
    return _mm_movemask_ps(zAllIn) == 0xf;
}
</pre>
<p>In case you&#8217;re wondering why this code uses raw SSE intrinsics and not <code>VecF32</code>, it&#8217;s because I&#8217;m purposefully trying to keep anything depending on the SIMD width out of <code>VecF32</code>, which makes it a lot easier to go to 8-wide AVX should we want to at some point. But this code really uses 4-vectors of (x,y,z,w) and needs to do shuffles, so it doesn&#8217;t fit in that model and I want to keep it separate.  But the actual logic is just what I described.</p>
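<p>The shuffle immediates are the only slightly cryptic part. A short sketch, using the standard <code>_MM_SHUFFLE</code> macro from <code>xmmintrin.h</code> to spell them out; <code>InsideNear</code> is a made-up name that isolates the per-vertex test from the loop above.</p>

```cpp
#include <xmmintrin.h>

// The raw constants in the code above are just broadcasts:
static_assert(_MM_SHUFFLE(2, 2, 2, 2) == 0xaa, "0xaa broadcasts lane 2 (z)");
static_assert(_MM_SHUFFLE(3, 3, 3, 3) == 0xff, "0xff broadcasts lane 3 (w)");

// vert holds (x, y, z, w) in lanes 0..3. Returns true iff z <= w,
// i.e. the vertex is inside the near plane for our reversed-z setup.
static bool InsideNear(__m128 vert)
{
    __m128 z = _mm_shuffle_ps(vert, vert, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 w = _mm_shuffle_ps(vert, vert, _MM_SHUFFLE(3, 3, 3, 3));

    // All four lanes hold the same comparison, so the mask is
    // either 0x0 or 0xf.
    return _mm_movemask_ps(_mm_cmple_ps(z, w)) == 0xf;
}
```

<p>Note that <code>_mm_set_ps</code> takes its arguments in (w, z, y, x) order, so <code>_mm_set_ps(1.0f, 0.5f, 0.0f, 0.0f)</code> is a vertex with z = 0.5 and w = 1.</p>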
<p>And once we have this return value from <code>TransformAABBox</code>, we get to remove the near-clip test from the depth test rasterizer, <em>and</em> we get to move our early-out for near-clipped boxes all the way to the call site:</p>
<pre>
if(mpTransformedAABBox[i].TransformAABBox())
    mpVisible[i] = mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(...);
else
    mpVisible[i] = true;
</pre>
<p>So, the <code>oneOverW</code> hack, the clamping hack and the hard-coded near plane are gone. That&#8217;s already a victory in terms of code quality, but did it improve the run time?</p>
<p><b>Change:</b> Transform/early-out fixes</p>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Start</td>
<td>1.109</td>
<td>1.152</td>
<td>1.166</td>
<td>1.182</td>
<td>1.240</td>
<td>1.167</td>
<td>0.022</td>
</tr>
<tr>
<td>Transform fixes</td>
<td>1.054</td>
<td>1.092</td>
<td>1.102</td>
<td>1.112</td>
<td>1.146</td>
<td>1.102</td>
<td>0.016</td>
</tr>
</table>
<p>Another 0.06ms off our median depth test time, which may not sound big but is over 5% of what&#8217;s left of it at this point.</p>
<h3>Getting warmer</h3>
<p>The bounding box rasterizer has one more method that&#8217;s called per-box though, and this is one that really deserves some special attention. Meet <code>IsTooSmall</code>:</p>
<pre>
bool TransformedAABBoxSSE::IsTooSmall(__m128 *pViewMatrix,
    __m128 *pProjMatrix, CPUTCamera *pCamera)
{
    float radius = mBBHalf.lengthSq(); // Use length-squared to
    // avoid sqrt().  Relative comparisons hold.

    float fov = pCamera-&gt;GetFov();
    float tanOfHalfFov = tanf(fov * 0.5f);

    MatrixMultiply(mWorldMatrix, pViewMatrix, mCumulativeMatrix);
    MatrixMultiply(mCumulativeMatrix, pProjMatrix,
        mCumulativeMatrix);
    MatrixMultiply(mCumulativeMatrix, mViewPortMatrix,
        mCumulativeMatrix);

    __m128 center = _mm_set_ps(1.0f, mBBCenter.z, mBBCenter.y,
        mBBCenter.x);
    __m128 mBBCenterOSxForm = TransformCoords(&amp;center,
        mCumulativeMatrix);
    float w = mBBCenterOSxForm.m128_f32[3];
    if( w &gt; 1.0f )
    {
        float radiusDivW = radius / w;
        float r2DivW2DivTanFov = radiusDivW / tanOfHalfFov;

        return r2DivW2DivTanFov &lt;
            (mOccludeeSizeThreshold * mOccludeeSizeThreshold);
    }

    return false;
}
</pre>
<p>Note that <code>MatrixMultiply(A, B, C)</code> performs <code>C = A * B</code>; the rest should be easy enough to figure out from the code. Now there&#8217;s really several problems with this function, so let&#8217;s go straight to a list:</p>
<ul>
<li><code>radius</code> (which is really radius squared) only depends on <code>mBBHalf</code>, which is fixed at initialization time. There&#8217;s no need to recompute it every time.</li>
<li>Similarly, <code>fov</code> and <code>tanOfHalfFov</code> only depend on the camera, and absolutely do not need to be recomputed once for every box. This is what gave us the <code>_tan_pentium4</code> cameo all the way back in <a href="http://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a>, by the way.</li>
<li>The view matrix, projection matrix and viewport matrix are also all camera or global constants. Again, no need to multiply these together for every box &#8211; the only matrix that is different between boxes is the very first one, the world matrix, and since matrix multiplication is associative, we can just concatenate the other three once.</li>
<li>There&#8217;s also no need for <code>mOccludeeSizeThreshold</code> to be squared every time &#8211; we can do that once.</li>
<li>Nor is there a need for it to be stored per box, since it&#8217;s a global constant owned by the depth test rasterizer.</li>
<li><code>(radius / w) / tanOfHalfFov</code> would be better computed as <code>radius / (w * tanOfHalfFov)</code>.</li>
<li>But more importantly, since all we&#8217;re doing is a compare and both <code>w</code> and <code>tanOfHalfFov</code> are positive, we can just multiply through by them and get rid of the divide altogether.</li>
</ul>
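<p>The last two points are just algebra, but it&#8217;s worth seeing side by side. The sketch below uses hypothetical free functions (<code>TooSmallWithDivides</code>, <code>TooSmallNoDivides</code>) and assumes <code>w</code> and <code>tanHalfFov</code> are both positive, which is what makes multiplying through legal without flipping the comparison.</p>

```cpp
// Original form: two divides per box, threshold squared every time.
static bool TooSmallWithDivides(float radiusSq, float w,
                                float tanHalfFov, float threshold)
{
    return (radiusSq / w) / tanHalfFov < threshold * threshold;
}

// Rearranged form: multiply both sides by w * tanHalfFov (both > 0).
// radiusThreshold = threshold * threshold * tanHalfFov depends only on
// the camera and a global setting, so it can be computed once per frame.
static bool TooSmallNoDivides(float radiusSq, float w,
                              float radiusThreshold)
{
    return radiusSq < w * radiusThreshold;
}
```

<p>The two forms can disagree by a rounding error for boxes sitting exactly on the threshold, which doesn&#8217;t matter for a heuristic like this one.</p>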
<p>All these things are common problems that I must have fixed a hundred times, but I have to admit that it&#8217;s pretty rare to see so many of them in a single page of code. Anyway, rather than fixing these one by one, let&#8217;s just cut to the chase: instead of all the redundant computations, we just move everything that only depends on the camera (or is global) into a single struct that holds our setup, which I dubbed <code>BoxTestSetup</code>. Here&#8217;s the code:</p>
<pre>
struct BoxTestSetup
{
    __m128 mViewProjViewport[4];
    float radiusThreshold;

    void Init(const __m128 viewMatrix[4],
        const __m128 projMatrix[4], CPUTCamera *pCamera,
        float occludeeSizeThreshold);
};

void BoxTestSetup::Init(const __m128 viewMatrix[4],
    const __m128 projMatrix[4], CPUTCamera *pCamera,
    float occludeeSizeThreshold)
{
    // viewportMatrix is a global float4x4; we need a __m128[4]
    __m128 viewPortMatrix[4];
    viewPortMatrix[0] = _mm_loadu_ps((float*)&amp;viewportMatrix.r0);
    viewPortMatrix[1] = _mm_loadu_ps((float*)&amp;viewportMatrix.r1);
    viewPortMatrix[2] = _mm_loadu_ps((float*)&amp;viewportMatrix.r2);
    viewPortMatrix[3] = _mm_loadu_ps((float*)&amp;viewportMatrix.r3);

    MatrixMultiply(viewMatrix, projMatrix, mViewProjViewport);
    MatrixMultiply(mViewProjViewport, viewPortMatrix,
        mViewProjViewport);

    float fov = pCamera-&gt;GetFov();
    float tanOfHalfFov = tanf(fov * 0.5f);
    radiusThreshold = occludeeSizeThreshold * occludeeSizeThreshold
        * tanOfHalfFov;
}
</pre>
<p>This is initialized once we start culling and simply kept on the stack. Then we just pass it to <code>IsTooSmall</code>, which after our <a href="https://github.com/rygorous/intel_occlusion_cull/commit/2411249a28f9918fc574648d5c79af2fe702c1f8">surgery</a> looks like this:</p>
<pre>
bool TransformedAABBoxSSE::IsTooSmall(const BoxTestSetup &amp;setup)
{
    MatrixMultiply(mWorldMatrix, setup.mViewProjViewport,
        mCumulativeMatrix);

    __m128 center = _mm_set_ps(1.0f, mBBCenter.z, mBBCenter.y,
        mBBCenter.x);
    __m128 mBBCenterOSxForm = TransformCoords(&amp;center,
        mCumulativeMatrix);
    float w = mBBCenterOSxForm.m128_f32[3];
    if( w &gt; 1.0f )
    {
        return mRadiusSq &lt; w * setup.radiusThreshold;
    }

    return false;
}
</pre>
<p>Wow, that method sure seems to have lost a few pounds. Let&#8217;s run the numbers:</p>
<p><b>Change:</b> IsTooSmall cleanup</p>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Start</td>
<td>1.109</td>
<td>1.152</td>
<td>1.166</td>
<td>1.182</td>
<td>1.240</td>
<td>1.167</td>
<td>0.022</td>
</tr>
<tr>
<td>Transform fixes</td>
<td>1.054</td>
<td>1.092</td>
<td>1.102</td>
<td>1.112</td>
<td>1.146</td>
<td>1.102</td>
<td>0.016</td>
</tr>
<tr>
<td>IsTooSmall cleanup</td>
<td>0.860</td>
<td>0.893</td>
<td>0.908</td>
<td>0.917</td>
<td>0.954</td>
<td>0.905</td>
<td>0.018</td>
</tr>
</table>
<p>Another 0.2ms off the median run time, bringing our total reduction for this post to about 22%. So are we done? Not yet!</p>
<h3>The state police</h3>
<p>Currently, each <code>TransformedAABBoxSSE</code> still keeps its own copy of the cumulative transform matrix and a copy of its transformed vertices. But it&#8217;s not necessary for these to be persistent &#8211; we compute them once, use them to rasterize the box, then don&#8217;t look at them again until the next frame. So, like <code>mVisible</code> earlier, there&#8217;s really no need to keep them around as state; instead, it&#8217;s better to just store them on the stack. Fewer pointers per <code>TransformedAABBoxSSE</code>, fewer cache misses, and &#8211; perhaps most important of all &#8211; it makes the bounding box objects themselves stateless. Granted, that&#8217;s the case only because our world is perfectly static and nothing is animated at runtime, but still, stateless is good! Stateless is easier to read, easier to debug, and easier to test.</p>
<p>Again, this is another change that is purely mechanical &#8211; just pass in a pointer to <code>cumulativeMatrix</code> and <code>xformedPos</code> to the functions that want them. So this time, I&#8217;m just going to refer you directly to the <a href="https://github.com/rygorous/intel_occlusion_cull/commit/0fad7d4fb406eb57a45d59ed2187fbddffe08bc7">two</a> <a href="https://github.com/rygorous/intel_occlusion_cull/commit/028a108d36b8bdb0d883d5baf82d1e922dd00fd1">commits</a> that implement this idea, and skip straight to the results:</p>
<p><b>Change:</b> Reduce amount of state</p>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Start</td>
<td>1.109</td>
<td>1.152</td>
<td>1.166</td>
<td>1.182</td>
<td>1.240</td>
<td>1.167</td>
<td>0.022</td>
</tr>
<tr>
<td>Transform fixes</td>
<td>1.054</td>
<td>1.092</td>
<td>1.102</td>
<td>1.112</td>
<td>1.146</td>
<td>1.102</td>
<td>0.016</td>
</tr>
<tr>
<td>IsTooSmall cleanup</td>
<td>0.860</td>
<td>0.893</td>
<td>0.908</td>
<td>0.917</td>
<td>0.954</td>
<td>0.905</td>
<td>0.018</td>
</tr>
<tr>
<td>Reduce state</td>
<td>0.834</td>
<td>0.862</td>
<td>0.873</td>
<td>0.886</td>
<td>0.938</td>
<td>0.875</td>
<td>0.017</td>
</tr>
</table>
<p>Only about 0.03ms this time, but we also save 192 bytes (plus allocator overhead) worth of memory per box, which is a nice bonus. And anyway, we&#8217;re not done yet, because I have one more!</p>
<h3>It&#8217;s more fun to compute</h3>
<p>There&#8217;s one more piece of unnecessary data we currently store per bounding box: the vertex list, initialized in <code>CreateAABBVertexIndexList</code>:</p>
<pre>
float3 min = mBBCenter - bbHalf;
float3 max = mBBCenter + bbHalf;
	
//Top 4 vertices in BB
mpBBVertexList[0] = _mm_set_ps(1.0f, max.z, max.y, max.x);
mpBBVertexList[1] = _mm_set_ps(1.0f, max.z, max.y, min.x); 
mpBBVertexList[2] = _mm_set_ps(1.0f, min.z, max.y, min.x);
mpBBVertexList[3] = _mm_set_ps(1.0f, min.z, max.y, max.x);
// Bottom 4 vertices in BB
mpBBVertexList[4] = _mm_set_ps(1.0f, min.z, min.y, max.x);
mpBBVertexList[5] = _mm_set_ps(1.0f, max.z, min.y, max.x);
mpBBVertexList[6] = _mm_set_ps(1.0f, max.z, min.y, min.x);
mpBBVertexList[7] = _mm_set_ps(1.0f, min.z, min.y, min.x);
</pre>
<p>This is, in effect, just treating the bounding box as a general mesh. But that&#8217;s extremely wasteful &#8211; we already store center and half-extent, the min/max corner positions are trivial to reconstruct from that information, and all the other vertices can be constructed by splicing min/max together componentwise using a set of masks that is the same for all bounding boxes. So these 8*16 = 128 bytes of vertex data really don&#8217;t pay their way.</p>
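<p>As a scalar sketch of that &#8220;splicing&#8221; idea: the three small index tables below encode, per vertex, whether each component comes from the min or the max corner, in the same vertex order as the <code>_mm_set_ps</code> list above. <code>Float3</code>, <code>BoxCorners</code> and the <code>kCorner*</code> tables are names I made up for this illustration.</p>

```cpp
struct Float3 { float x, y, z; };

// 0 = take the component from the min corner, 1 = from the max corner.
// Same for every box, so this is shared constant data.
static const int kCornerX[8] = { 1, 0, 0, 1, 1, 1, 0, 0 };
static const int kCornerY[8] = { 1, 1, 1, 1, 0, 0, 0, 0 };
static const int kCornerZ[8] = { 1, 1, 0, 0, 0, 1, 1, 0 };

// Rebuilds all 8 box vertices from center and half-extent alone.
static void BoxCorners(const Float3 &center, const Float3 &half,
                       Float3 out[8])
{
    const Float3 corner[2] = {
        { center.x - half.x, center.y - half.y, center.z - half.z }, // min
        { center.x + half.x, center.y + half.y, center.z + half.z }, // max
    };

    for (int i = 0; i < 8; i++)
    {
        const Float3 &cx = corner[kCornerX[i]];
        const Float3 &cy = corner[kCornerY[i]];
        const Float3 &cz = corner[kCornerZ[i]];
        out[i].x = cx.x;
        out[i].y = cy.y;
        out[i].z = cz.z;
    }
}
```

<p>Six floats of per-box data plus 24 shared table entries replace 128 bytes of stored vertices per box.</p>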
<p>But more importantly, note that we only ever use two distinct values for x, y and z each. Now <code>TransformAABBox</code>, which we already saw above, uses <code>TransformCoords</code> to compute the matrix-vector product <code>v*M</code> with the cumulative transform matrix, using the expression</p>
<p><code>v.x * M.row[0] + v.y * M.row[1] + v.z * M.row[2] + M.row[3]</code> (v.w is assumed to be 1)</p>
<p>and because we know that <code>v.x</code> is either <code>min.x</code> or <code>max.x</code>, we can multiply both by <code>M.row[0]</code> once and store the result. Then the 8 individual vertices can skip the multiplies altogether. Putting it all together leads to the following new code for <code>TransformAABBox</code>:</p>
<pre>
// 0 = use min corner, 1 = use max corner
static const int sBBxInd[AABB_VERTICES] = { 1, 0, 0, 1, 1, 1, 0, 0 };
static const int sBByInd[AABB_VERTICES] = { 1, 1, 1, 1, 0, 0, 0, 0 };
static const int sBBzInd[AABB_VERTICES] = { 1, 1, 0, 0, 0, 1, 1, 0 };

bool TransformedAABBoxSSE::TransformAABBox(__m128 xformedPos[],
    const __m128 cumulativeMatrix[4])
{
    // w ends up being garbage, but it doesn't matter - we ignore
    // it anyway.
    __m128 vCenter = _mm_loadu_ps(&amp;mBBCenter.x);
    __m128 vHalf   = _mm_loadu_ps(&amp;mBBHalf.x);

    __m128 vMin    = _mm_sub_ps(vCenter, vHalf);
    __m128 vMax    = _mm_add_ps(vCenter, vHalf);

    // transforms
    __m128 xRow[2], yRow[2], zRow[2];
    xRow[0] = _mm_shuffle_ps(vMin, vMin, 0x00) * cumulativeMatrix[0];
    xRow[1] = _mm_shuffle_ps(vMax, vMax, 0x00) * cumulativeMatrix[0];
    yRow[0] = _mm_shuffle_ps(vMin, vMin, 0x55) * cumulativeMatrix[1];
    yRow[1] = _mm_shuffle_ps(vMax, vMax, 0x55) * cumulativeMatrix[1];
    zRow[0] = _mm_shuffle_ps(vMin, vMin, 0xaa) * cumulativeMatrix[2];
    zRow[1] = _mm_shuffle_ps(vMax, vMax, 0xaa) * cumulativeMatrix[2];

    __m128 zAllIn = _mm_castsi128_ps(_mm_set1_epi32(~0));

    for(UINT i = 0; i &lt; AABB_VERTICES; i++)
    {
        // Transform the vertex
        __m128 vert = cumulativeMatrix[3];
        vert += xRow[sBBxInd[i]];
        vert += yRow[sBByInd[i]];
        vert += zRow[sBBzInd[i]];

        // We have inverted z; z is inside of near plane iff z &lt;= w.
        __m128 vertZ = _mm_shuffle_ps(vert, vert, 0xaa); //vert.zzzz
        __m128 vertW = _mm_shuffle_ps(vert, vert, 0xff); //vert.wwww
        __m128 zIn = _mm_cmple_ps(vertZ, vertW);
        zAllIn = _mm_and_ps(zAllIn, zIn);

        // project
        xformedPos[i] = _mm_div_ps(vert, vertW);
    }

    // return true if and only if none of the verts are z-clipped
    return _mm_movemask_ps(zAllIn) == 0xf;
}
</pre>
<p>Admittedly, quite a bit longer than the original one, but that&#8217;s because we front-load a lot of the computation; most of the per-vertex work done in <code>TransformCoords</code> is gone. And here&#8217;s our reward:</p>
<p><b>Change:</b> Get rid of per-box vertex list</p>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Start</td>
<td>1.109</td>
<td>1.152</td>
<td>1.166</td>
<td>1.182</td>
<td>1.240</td>
<td>1.167</td>
<td>0.022</td>
</tr>
<tr>
<td>Transform fixes</td>
<td>1.054</td>
<td>1.092</td>
<td>1.102</td>
<td>1.112</td>
<td>1.146</td>
<td>1.102</td>
<td>0.016</td>
</tr>
<tr>
<td>IsTooSmall cleanup</td>
<td>0.860</td>
<td>0.893</td>
<td>0.908</td>
<td>0.917</td>
<td>0.954</td>
<td>0.905</td>
<td>0.018</td>
</tr>
<tr>
<td>Reduce state</td>
<td>0.834</td>
<td>0.862</td>
<td>0.873</td>
<td>0.886</td>
<td>0.938</td>
<td>0.875</td>
<td>0.017</td>
</tr>
<tr>
<td>Remove vert list</td>
<td>0.801</td>
<td>0.823</td>
<td>0.830</td>
<td>0.839</td>
<td>0.867</td>
<td>0.831</td>
<td>0.012</td>
</tr>
</table>
<p>This brings our total for this post to a nearly 29% reduction in median depth test time (1.166ms down to 0.830ms), plus about 320 bytes memory reduction per <code>TransformedAABBoxSSE</code> &#8211; which, since we have about 27000 of them, works out to well over 8 megabytes. Such are the rewards for widening the scope beyond optimizing functions by themselves.</p>
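<p>In case you want to check the memory arithmetic, here it is spelled out. The breakdown is my reading of the numbers above (a 4-row matrix and two 8-entry vertex arrays of 16-byte vectors each); the constant names are made up, and allocator overhead is not included.</p>

```cpp
#include <cstddef>

constexpr std::size_t kBytesPerVec4 = 16;                        // one __m128
constexpr std::size_t kCumulativeMatrixBytes = 4 * kBytesPerVec4; // 64
constexpr std::size_t kXformedPosBytes = 8 * kBytesPerVec4;       // 128
constexpr std::size_t kVertexListBytes = 8 * kBytesPerVec4;       // 128

// 192 bytes from "The state police", 128 more from the vertex list.
constexpr std::size_t kSavedPerBox =
    kCumulativeMatrixBytes + kXformedPosBytes + kVertexListBytes;

constexpr std::size_t kApproxBoxCount = 27000; // "about 27000" per the text
constexpr std::size_t kSavedTotal = kSavedPerBox * kApproxBoxCount;

static_assert(kCumulativeMatrixBytes + kXformedPosBytes == 192,
              "state removed in the previous section");
static_assert(kSavedPerBox == 320, "total per-box saving");
static_assert(kSavedTotal == 8640000, "roughly 8.6 million bytes");
```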
<p>And as usual, the code for this time (plus some changes I haven&#8217;t discussed yet) is up on <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog">Github</a>. Until next time!</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>
	</item>
		<item>
		<title>The care and feeding of worker threads, part 2</title>
		<link>http://fgiesen.wordpress.com/2013/02/25/the-care-and-feeding-of-worker-threads-part-2/</link>
		<comments>http://fgiesen.wordpress.com/2013/02/25/the-care-and-feeding-of-worker-threads-part-2/#comments</comments>
		<pubDate>Mon, 25 Feb 2013 10:39:45 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1778</guid>
		<description><![CDATA[This post is part of a series &#8211; go here for the index. In the previous post, we took a closer look at what our worker threads were doing and spent some time load-balancing the depth buffer rasterizer to reduce our overall latency. This time, we&#8217;ll have a closer look at the rest of the [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=1778&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><em>This post is part of a series &#8211; go <a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>In the <a href="http://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/">previous post</a>, we took a closer look at what our worker threads were doing and spent some time load-balancing the depth buffer rasterizer to reduce our overall latency. This time, we&#8217;ll have a closer look at the rest of the system.</p>
<h3>A bug</h3>
<p>But first, it&#8217;s time to look at a bug that I inadvertently introduced last time: If you tried running the code from last time, you might have noticed that toggling the &#8220;Multi Tasking&#8221; checkbox off and back on causes a one-frame glitch. I introduced this bug in the changes corresponding to the section &#8220;Balancing act&#8221;. Since I didn&#8217;t get any comments or mails about it, it seems like I got away with it :), but I wanted to rectify it here anyway.</p>
<p>The issue turned out to be that the <code>IsTooSmall</code> computation for occluders, which we moved from the &#8220;vertex transform&#8221; to the &#8220;frustum cull&#8221; pass last time, used stale information. The relevant piece of the main loop is this:</p>
<pre>
mpCamera-&gt;SetNearPlaneDistance(1.0f);
mpCamera-&gt;SetFarPlaneDistance(gFarClipDistance);
mpCamera-&gt;Update();

// If view frustum culling is enabled then determine which occluders
// and occludees are inside the view frustum and run the software
// occlusion culling on only those models
if(mEnableFCulling)
{
    renderParams.mpCamera = mpCamera;
    mpDBR-&gt;IsVisible(mpCamera);
    mpAABB-&gt;IsInsideViewFrustum(mpCamera);
}

// if software occlusion culling is enabled
if(mEnableCulling)
{
    mpCamera-&gt;SetNearPlaneDistance(gFarClipDistance);
    mpCamera-&gt;SetFarPlaneDistance(1.0f);
    mpCamera-&gt;Update();

    // Set the camera transforms so that the occluders can
    // be transformed 
<span style="color:#c11;">    mpDBR-&gt;SetViewProj(mpCamera-&gt;GetViewMatrix(),
        (float4x4*)mpCamera-&gt;GetProjectionMatrix());</span>

    // (clear, render depth and perform occlusion test here)

    mpCamera-&gt;SetNearPlaneDistance(1.0f);
    mpCamera-&gt;SetFarPlaneDistance(gFarClipDistance);
    mpCamera-&gt;Update();
}
</pre>
<p>Note how the call that actually updates the view-projection matrix (highlighted in red) runs <em>after</em> the frustum-culling pass. That&#8217;s the bug I was running into. Fixing this bug is almost as simple as moving that call up (to before the frustum culling pass), but another wrinkle is that the depth-buffer pass uses an inverted Z-buffer with Z=0 at the <em>far</em> plane and Z=1 at the near plane &#8211; note the calls that swap the positions of the camera &#8220;near&#8221; and &#8220;far&#8221; planes before depth buffer rendering, and the ones that swap them back afterwards. There are <a href="http://www.humus.name/index.php?ID=255">good reasons</a> for doing this, particularly if the depth buffer uses floats (as it does in our implementation). But to simplify matters here, I changed the code to do the swapping as part of the viewport transform instead, which means there&#8217;s no need to modify the camera/projection setup during the frame at all. This keeps the code simpler and also makes it easy to move the <code>SetViewProj</code> call to before the frustum culling pass, where it should be now that we&#8217;re using these matrices earlier.</p>
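<p>To make the idea concrete, here&#8217;s a minimal sketch of a viewport transform that performs the depth inversion itself. This is not the demo&#8217;s actual code: the <code>Viewport</code> and <code>Float3</code> types and both function names are made up for illustration, and it assumes post-projection coordinates with x and y in [-1,1] and z in [0,1], where z=0 is the near plane.</p>
<pre>
#include &lt;cassert&gt;

// Hypothetical types for illustration; not taken from the demo.
struct Float3 { float x, y, z; };

struct Viewport
{
    float width, height; // size in pixels
    float minZ, maxZ;    // depth values that projected z=0 and z=1 map to
};

// Maps a projected point (x, y in [-1,1], z in [0,1] with 0 = near)
// to pixel coordinates plus a depth buffer value.
inline Float3 ApplyViewport(const Viewport&amp; vp, const Float3&amp; p)
{
    assert(vp.width &gt; 0.0f &amp;&amp; vp.height &gt; 0.0f);
    Float3 out;
    out.x = (p.x * 0.5f + 0.5f) * vp.width;
    out.y = (0.5f - p.y * 0.5f) * vp.height; // flip y: +1 is the top row
    out.z = vp.minZ + p.z * (vp.maxZ - vp.minZ);
    return out;
}

// Inverted depth range: near plane maps to 1, far plane maps to 0.
inline Viewport MakeInvertedZViewport(float width, float height)
{
    Viewport vp = { width, height, 1.0f, 0.0f };
    return vp;
}
</pre>
<p>With the inversion folded into <code>minZ</code> and <code>maxZ</code> like this, the camera&#8217;s near and far planes can stay fixed for the whole frame.</p>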
<h3>Some extra instrumentation</h3>
<p>In some of the previous posts, we already looked at the frustum culling logic; this time, I also added another timer that measures our total culling time, including frustum culling and everything related to rendering the depth buffer and performing the bounding box occlusion tests. The code itself is straightforward; I just wanted a dedicated counter so we can see its summary statistics as we make changes. I&#8217;ll use separate tables for the individual measurements (all times are in milliseconds):</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.767</td>
<td>3.882</td>
<td>3.959</td>
<td>4.304</td>
<td>5.075</td>
<td>4.074</td>
<td>0.235</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.098</td>
<td>2.119</td>
<td>2.132</td>
<td>2.146</td>
<td>2.212</td>
<td>2.136</td>
<td>0.022</td>
</tr>
</table>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>1.249</td>
<td>1.366</td>
<td>1.422</td>
<td>1.475</td>
<td>1.656</td>
<td>1.425</td>
<td>0.081</td>
</tr>
</table>
<h3>Load balancing depth testing</h3>
<p>Last time, we saw two fundamentally different ways to balance our multi-threaded workloads. The first was to simply split the work into N contiguous chunks. As we saw for the &#8220;transform vertices&#8221; and &#8220;bin meshes&#8221; passes, this works great provided that the individual work items generate a roughly uniform amount of work. Since vertex transform and binning work were roughly proportional to the number of vertices and triangles respectively, this kind of split worked well once we made sure to split after early-out processing.</p>
<p>In the second case, triangle rasterization, we couldn&#8217;t change the work partition after the fact: each task corresponded to one tile, and if we started touching two tiles in one task, it just wouldn&#8217;t work; there&#8217;d be race conditions. But at least we had a rough metric of how expensive each tile was going to be &#8211; the number of triangles in the respective bins &#8211; and we could use that to make sure that the &#8220;bulky&#8221; tiles would get processed first, to reduce the risk of picking up such a tile late and then having all other threads wait for its processing to finish.</p>
<p>Now, the depth tests are somewhat tricky, because neither of these strategies really applies. The cost of depth-testing a bounding box has two components: first, there is a fixed overhead of just processing a box (transforming its vertices and setting up the triangles), and second, there&#8217;s the actual rasterization with a cost that&#8217;s roughly proportional to the size of the bounding box in pixels when projected to the screen. For small boxes, the constant overhead is the bigger issue; for larger boxes, the per-pixel cost dominates. And at the point when we&#8217;re partitioning the work items across threads, we don&#8217;t know how big an area a box is going to cover on the screen, because we haven&#8217;t transformed the vertices yet! But still, our depth test pass is in desperate need of some balancing &#8211; here&#8217;s a typical example:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_depth_tests.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_depth_tests.png?w=497&#038;h=349" alt="Imbalanced depth tests" width="497" height="349" class="aligncenter size-full wp-image-1794" /></a></p>
<p>There&#8217;s nothing that&#8217;s stopping us from treating the depth test pass the way we treat the regular triangle pass: chop it up into separate phases with explicit hand-overs and balance them separately. But that&#8217;s a really big and disruptive change, and it turns out we don&#8217;t have to go that far to get a decent improvement.</p>
<p>The key realization is that the array of model bounding boxes we&#8217;re traversing is not in a random order. Models that are near each other in the world also tend to be near each other in the array. Thus, when we just partition the list of world models into N separate contiguous chunks, they&#8217;re not gonna have a similar amount of work for most viewpoints: some chunks are closer to the viewer than others, and those will contain bounding boxes that take up more area on the screen and hence be more expensive to process.</p>
<p>Well, that&#8217;s easy enough to fix: <em>don&#8217;t do that!</em> Suppose we had two worker threads. Our current approach would then correspond to splitting the world database in the middle, giving the first half to the first worker, and the second half to the second worker. This is bad whenever there&#8217;s much more work in one of the halves, say because the camera happens to be in it and the models are just bigger on screen and take longer to depth-test. But there&#8217;s no need to split the world database like that! We can just as well split it non-contiguously, say into one half with even indices and another half with odd indices. We can still get a lopsided distribution, but only if we happen to be a lot closer to all the even-numbered models than we are to the odd-numbered ones, and that&#8217;s a lot less likely to happen by accident. Unless the meshes happen to form a grid or other regular structure, that is, in which case you might still get screwed. :)</p>
<p>Anyway, the same idea generalizes to N threads: instead of partitioning the models into odd and even halves, group all models which have the same index mod N. In practice, though, we don&#8217;t want to interleave at the level of individual models, because keeping nearby models together also has an advantage: they tend to hit similar regions of the depth buffer, which have a good chance of being in the cache. So instead, we interleave groups of 64 (arbitrary choice!) models at a time; an idea similar to the disk striping used for RAIDs. It turns out to be a really easy change to make: just replace the original loop</p>
<pre>
for(UINT i = start; i &lt; end; i++)
{
    // process model i
}
</pre>
<p>with the only marginally more complicated</p>
<pre>
static const UINT kChunkSize = 64;
for(UINT base = taskId*kChunkSize; base &lt; mNumModels;
        base += mNumDepthTestTasks * kChunkSize)
{
    UINT end = min(base + kChunkSize, mNumModels);
    for(UINT i = base; i &lt; end; i++)
    {
        // process model i
    }
}
</pre>
<p>and we&#8217;re done. Let&#8217;s see the change:</p>
<p><b>Change:</b> &#8220;Striping&#8221; to load-balance depth test threads.</p>
<table>
<tr>
<th>Depth test</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>1.249</td>
<td>1.366</td>
<td>1.422</td>
<td>1.475</td>
<td>1.656</td>
<td>1.425</td>
<td>0.081</td>
</tr>
<tr>
<td>Striped</td>
<td>1.109</td>
<td>1.152</td>
<td>1.166</td>
<td>1.182</td>
<td>1.240</td>
<td>1.167</td>
<td>0.022</td>
</tr>
</table>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.767</td>
<td>3.882</td>
<td>3.959</td>
<td>4.304</td>
<td>5.075</td>
<td>4.074</td>
<td>0.235</td>
</tr>
<tr>
<td>Striped depth test</td>
<td>3.646</td>
<td>3.769</td>
<td>3.847</td>
<td>3.926</td>
<td>4.818</td>
<td>3.877</td>
<td>0.160</td>
</tr>
</table>
<p>That&#8217;s pretty good for just changing a few lines. Here&#8217;s the corresponding Telemetry screenshot:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_depth_tests_striped.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_depth_tests_striped.png?w=497" alt="Depth tests after striping"   class="aligncenter size-full wp-image-1799" /></a></p>
<p>Not as neatly balanced as some of the other ones we&#8217;ve seen, but we successfully managed to break up some of the huge packets, so it&#8217;s good enough for now.</p>
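<p>If you want to convince yourself that the striped loop still visits every model exactly once, a small stand-alone check along these lines can help. It is only a sketch: <code>StripedIndices</code> and <code>StripingCoversAllModels</code> are hypothetical helpers, not functions from the demo, and the task count, model count and chunk size are passed in as plain parameters instead of being class members.</p>
<pre>
#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

typedef unsigned int UINT;

// Collects the model indices that task 'taskId' (0-based, out of 'numTasks')
// processes when the models are striped in chunks of 'chunkSize'.
std::vector&lt;UINT&gt; StripedIndices(UINT taskId, UINT numTasks, UINT numModels, UINT chunkSize)
{
    assert(numTasks &gt; 0 &amp;&amp; chunkSize &gt; 0 &amp;&amp; taskId &lt; numTasks);
    std::vector&lt;UINT&gt; indices;
    for(UINT base = taskId * chunkSize; base &lt; numModels; base += numTasks * chunkSize)
    {
        UINT end = std::min(base + chunkSize, numModels);
        for(UINT i = base; i &lt; end; i++)
            indices.push_back(i);
    }
    return indices;
}

// Returns true if all tasks together touch every model exactly once.
bool StripingCoversAllModels(UINT numTasks, UINT numModels, UINT chunkSize)
{
    std::vector&lt;UINT&gt; hits(numModels, 0);
    for(UINT task = 0; task &lt; numTasks; task++)
    {
        std::vector&lt;UINT&gt; indices = StripedIndices(task, numTasks, numModels, chunkSize);
        for(std::size_t k = 0; k &lt; indices.size(); k++)
            hits[indices[k]]++;
    }
    return std::count(hits.begin(), hits.end(), 1u) == static_cast&lt;std::ptrdiff_t&gt;(numModels);
}
</pre>
<p>For example, with 2 tasks, 200 models and chunks of 64, task 0 gets models 0 to 63 and 128 to 191, while task 1 gets 64 to 127 and the short tail from 192 to 199.</p>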
<h3>One bottleneck remaining</h3>
<p>At this point, we&#8217;re in pretty good shape as far as worker thread utilization is concerned, but there&#8217;s one big serial chunk still remaining, right between frustum culling and vertex transformation:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_clear_depth.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_clear_depth.png?w=497" alt="Depth buffer clears"   class="aligncenter size-full wp-image-1802" /></a></p>
<p>Clearing the depth buffer. This is about 0.4ms, about a third of the time we spend depth testing, all tracing back to a single line in the code:</p>
<pre>
    // Clear the depth buffer
    mpCPURenderTargetPixels = (UINT*)mpCPUDepthBuf;
    <span style="color:#c11;">memset(mpCPURenderTargetPixels, 0, SCREENW * SCREENH * 4);</span>
</pre>
<p>Luckily, this one&#8217;s really easy to fix. We could try to turn this into another separate group of tasks, but there&#8217;s no need: we already have a pass that chops up the screen into several smaller pieces, namely the actual rasterization, which works one tile at a time. And neither the vertex transform nor the binner that runs before it actually cares about the contents of the depth buffer. So we just clear one tile at a time, from the rasterizer code. As a bonus, this means that the active tile gets &#8220;pre-loaded&#8221; into the current core&#8217;s L2 cache before we start rendering. I&#8217;m not going to bother walking through the code here &#8211; it&#8217;s simple enough &#8211; but as usual, I&#8217;ll give you the results:</p>
<p><b>Change:</b> Clear depth buffer in rasterizer workers</p>
<table>
<tr>
<th>Total cull time</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>3.767</td>
<td>3.882</td>
<td>3.959</td>
<td>4.304</td>
<td>5.075</td>
<td>4.074</td>
<td>0.235</td>
</tr>
<tr>
<td>Striped depth test</td>
<td>3.646</td>
<td>3.769</td>
<td>3.847</td>
<td>3.926</td>
<td>4.818</td>
<td>3.877</td>
<td>0.160</td>
</tr>
<tr>
<td>Clear in rasterizer</td>
<td>3.428</td>
<td>3.579</td>
<td>3.626</td>
<td>3.677</td>
<td>4.734</td>
<td>3.658</td>
<td>0.155</td>
</tr>
</table>
<table>
<tr>
<th>Render depth</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial</td>
<td>2.098</td>
<td>2.119</td>
<td>2.132</td>
<td>2.146</td>
<td>2.212</td>
<td>2.136</td>
<td>0.022</td>
</tr>
<tr>
<td>Clear in rasterizer</td>
<td>2.191</td>
<td>2.224</td>
<td>2.248</td>
<td>2.281</td>
<td>2.439</td>
<td>2.258</td>
<td>0.043</td>
</tr>
</table>
<p>So even though we take a bit of a hit in rasterization latency, we still get a very solid 0.2ms win in the total cull time. Again, a very good pay-off considering the amount of work involved.</p>
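<p>For readers who would like to see the shape of the per-tile clear anyway, here is a rough sketch. It assumes a row-major depth buffer of 32-bit floats and a rectangular tile given by its origin and size; <code>ClearDepthTile</code> and its parameters are hypothetical names, not the demo&#8217;s actual code.</p>
<pre>
#include &lt;cassert&gt;
#include &lt;cstring&gt;

// Clears one rectangular tile of a row-major float depth buffer to 0.0f
// (the far plane when using the inverted Z convention).
// 'pitch' is the width of the whole buffer in pixels.
void ClearDepthTile(float* depth, int pitch, int tileX0, int tileY0, int tileW, int tileH)
{
    assert(depth != NULL &amp;&amp; tileW &gt; 0 &amp;&amp; tileH &gt; 0);
    assert(tileX0 &gt;= 0 &amp;&amp; tileY0 &gt;= 0 &amp;&amp; tileX0 + tileW &lt;= pitch);
    for(int y = tileY0; y &lt; tileY0 + tileH; y++)
    {
        // All-zero bytes represent 0.0f for IEEE-754 floats, so memset works.
        std::memset(depth + y * pitch + tileX0, 0, tileW * sizeof(float));
    }
}
</pre>
<p>Each rasterizer task would call something like this for its own tile right before rendering the triangles binned to it, so no two tasks ever clear the same pixels.</p>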
<h3>Summary</h3>
<p>A lot of the posts in this series so far either needed conceptual/algorithmic leaps or at least some detailed micro-architectural profiling. But this post and the previous one did not. In fact, finding these problems took nothing but a timeline profiler, and none of the fixes were particularly complicated either. I used Telemetry because that&#8217;s what I&#8217;m familiar with, but I didn&#8217;t use any but its most basic features, and I&#8217;m sure you would&#8217;ve found the same problems with any other program of this type; I&#8217;m told Intel&#8217;s GPA can do the same thing, but I haven&#8217;t used it so far.</p>
<p>Just to drive this one home &#8211; this is what we started with:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial_cropped.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial_cropped.png?w=497&#038;h=375" alt="Initial work distribution" width="497" height="375" class="aligncenter size-full wp-image-1806" /></a></p>
<p>(total cull time 7.36ms, for what it&#8217;s worth) and this is where we are now:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_alldone.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_alldone.png?w=497&#038;h=406" alt="Finished worker balance" width="497" height="406" class="aligncenter size-full wp-image-1807" /></a></p>
<p>Note that the bottom one is <em>zoomed in by 2x</em> so you can read the labels! Compare the zone lengths where printed. Now, this is not a representative sample; I just grabbed an arbitrary frame from both sessions, so don&#8217;t draw any conclusions from these two images alone, but it&#8217;s still fairly impressive. I&#8217;m still not sure why TBB only seems to use some subset of its worker threads &#8211; maybe there&#8217;s some threshold before they wake up and our parallel code just doesn&#8217;t run for long enough? &#8211; but it should be fairly obvious that the overall packing is a lot better now.</p>
<p>Remember, people. This is <em>the same code</em>. I didn&#8217;t change any of the algorithms nor their implementations in any substantial way. All I did was spend some time on their callers, improving the work granularity and scheduling. If you&#8217;re using worker threads, this is absolutely something you need to have on your radar.</p>
<p>As usual, the code for this part is up on <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog">Github</a>, this time with a few bonus commits I&#8217;m going to discuss next time (spoiler alert!), when I take a closer look at the depth testing code and the binner. See you then!</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/02/25/the-care-and-feeding-of-worker-threads-part-2/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_depth_tests.png" medium="image">
			<media:title type="html">Imbalanced depth tests</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_depth_tests_striped.png" medium="image">
			<media:title type="html">Depth tests after striping</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_clear_depth.png" medium="image">
			<media:title type="html">Depth buffer clears</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial_cropped.png" medium="image">
			<media:title type="html">Initial work distribution</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_alldone.png" medium="image">
			<media:title type="html">Finished worker balance</media:title>
		</media:content>
	</item>
		<item>
		<title>The care and feeding of worker threads, part 1</title>
		<link>http://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/</link>
		<comments>http://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/#comments</comments>
		<pubDate>Mon, 18 Feb 2013 07:31:20 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1717</guid>
		<description><![CDATA[This post is part of a series &#8211; go here for the index. It&#8217;s time for another post! After all the time I&#8217;ve spent on squeezing about 20% out of the depth rasterizer, I figured it was time to change gears and look at something different again. But before we get started on that new [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><em>This post is part of a series &#8211; go <a href="http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/">here</a> for the index.</em></p>
<p>It&#8217;s time for another post! After all the time I&#8217;ve spent on squeezing about 20% out of the depth rasterizer, I figured it was time to change gears and look at something different again. But before we get started on that new topic, there&#8217;s one more set of changes that I want to talk about.</p>
<h3>The occlusion test rasterizer</h3>
<p>So far, we&#8217;ve mostly been looking at one rasterizer only &#8211; the one that actually renders the depth buffer we cull against, and even more precisely, only the multi-threaded SSE version of it. But the occlusion culling demo has two sets of rasterizers: the other set is used for the occlusion tests. It renders bounding boxes for the various models to be tested and checks whether they are fully occluded. Check out the <a href="https://github.com/rygorous/intel_occlusion_cull/blob/4c64fd75/SoftwareOcclusionCulling/TransformedAABBoxSSE.cpp#L165">code</a> if you&#8217;re interested in the details.</p>
<p>This is basically the same rasterizer that we already talked about. In the previous two posts, I talked about optimizing the depth buffer rasterizer, but most of the same changes apply to the test rasterizer too. It didn&#8217;t make sense to talk through the same thing again, so I took the liberty of just making the same changes (with some minor tweaks) to the test rasterizer &#8220;off-screen&#8221;. So, just a heads-up: the test rasterizer has changed while you weren&#8217;t looking &#8211; unless you closely watch the Github repository, that is.</p>
<p>And now that we&#8217;ve established that there&#8217;s another inner loop we ought to be aware of, let&#8217;s zoom out a bit and look at the bigger picture.</p>
<h3>Some open questions</h3>
<p>There&#8217;s two questions you might have if you&#8217;ve been following this series closely so far. The first concerns a very visible difference between the depth and test rasterizers that you might have noticed if you ran the code. It&#8217;s also visible in the data in <a href="http://fgiesen.wordpress.com/2013/02/11/depth-buffers-done-quick-part/">&#8220;Depth buffers done quick, part 1&#8221;</a>, though I didn&#8217;t talk about it at the time. I&#8217;m talking, of course, about the large standard deviation we get for the execution time of the occlusion tests. Here&#8217;s a set of measurements for the code right after bringing the test rasterizer up to date:</p>
<table>
<tr>
<th>Pass</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Render depth</td>
<td>2.666</td>
<td>2.716</td>
<td>2.732</td>
<td>2.745</td>
<td>2.811</td>
<td>2.731</td>
<td>0.022</td>
</tr>
<tr>
<td>Occlusion test</td>
<td>1.335</td>
<td>1.545</td>
<td>1.587</td>
<td>1.631</td>
<td>1.761</td>
<td>1.585</td>
<td>0.066</td>
</tr>
</table>
<p>Now, the standard deviation actually got a fair bit lower with the rasterizer changes (originally, we were well above 0.1ms), but it&#8217;s still surprisingly large, especially considering that the occlusion tests run roughly half as long (in terms of wall-clock time) as the depth rendering. And there&#8217;s also a second elephant in the room that&#8217;s been staring us in the face for quite a while. Let me recycle one of the VTune screenshots from last time:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png"><img src="http://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png?w=497&#038;h=205" alt="Rasterizer hotspots without early-out" width="497" height="205" class="aligncenter size-full wp-image-1689" /></a></p>
<p>Right there at #4 is some code from <a href="http://threadingbuildingblocks.org/">TBB</a>, namely, what turns out to be the &#8220;thread is idle&#8221; spin loop.</p>
<p>Well, so far, we&#8217;ve been profiling, measuring and optimizing this as if it was a single-threaded application, but it&#8217;s not. The code uses TBB to dispatch tasks to worker threads, and clearly, a lot of these worker threads seem to be idle a lot of the time. But why? To answer that question, we need somewhat different information from what either a normal VTune analysis run or our summary timers give us. We want a detailed breakdown of what happens during a frame. Now, VTune has <em>some</em> support for that (as part of their threading/concurrency profiling), but the UI doesn&#8217;t work well for me, and neither does the visualization; it seems to be geared towards HPC/throughput computing more than latency-sensitive applications like real-time graphics, and it&#8217;s also still based on sampling profiling, which means it&#8217;s low-overhead but fairly limited in the kind of data it can collect.</p>
<p>Instead, I&#8217;m going to go for the shameless plug and use <a href="http://www.radgametools.com/telemetry.htm">Telemetry</a> (full disclosure: I work at RAD). It works like this: I manually instrument the source code to tell Telemetry when certain events are happening, and Telemetry collects that data, sends the whole log to a server and can later visualize it. Most games I&#8217;ve worked on have some kind of &#8220;bar graph profiler&#8221; that can visualize within-frame events, but because Telemetry keeps the whole data stream, it can also be used to answer the favorite question (not!) of engine programmers everywhere: &#8220;Wait, what the hell just happened there?&#8221;. Instead of trying to explain it in words, I&#8217;m just gonna show you the screenshot of my initial profiling run after I hooked up Telemetry and added some basic markup: (Click on the image to get the full-sized version)</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=497&#038;h=269" alt="Initial Telemetry run" width="497" height="269" class="aligncenter size-large wp-image-1725" /></a></p>
<p>The time axis goes from left to right, and all of the blocks correspond to regions of code that I&#8217;ve marked up. Regions can nest, and when they do, the blocks stack. I&#8217;m only using really basic markup right now, because that turns out to be all we need for the time being. The different tracks correspond to different threads.</p>
<p>As you can see, despite the code using TBB and worker threads, it&#8217;s fairly rare for more than 2 threads to be actually running anything interesting at a time. Also, if you look at the &#8220;Rasterize&#8221; and &#8220;DepthTest&#8221; tasks, you&#8217;ll notice that we&#8217;re spending a fair amount of time just waiting for the last 2 threads to finish their respective jobs, while the other worker threads are idle. That&#8217;s where our variance in latency ultimately comes from &#8211; it all depends on how lucky (or unlucky) we get with scheduling, and the exact scheduling of tasks changes every frame. And now that we&#8217;ve seen how much time the worker threads spend being idle, it also shouldn&#8217;t surprise us that TBB&#8217;s idle spin loop ranked as high as it did in the profile.</p>
<p>What do we do about it, though?</p>
<h3>Let&#8217;s start with something simple</h3>
<p>As usual, we go for the low-hanging fruit first, and if you look at the screenshot I posted, you&#8217;ll see <em>a lot</em> of blocks (&#8220;zones&#8221;) on the left side of the screen. In fact, the count is much higher than you probably think &#8211; these are LOD zones, which means that Telemetry has grouped a bunch of very short zones into larger groups for the purposes of visualization. As you can see from the mouse-over text, the single block I&#8217;m pointing at with the mouse cursor corresponds to 583 zones &#8211; and each of those zones corresponds to an individual TBB task! That&#8217;s because this culling code uses one TBB task per model to be culled. <em>Ouch.</em> Let&#8217;s zoom in a bit:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=497&#038;h=269" alt="Telemetry: occluder visibility, zoomed" width="497" height="269" class="aligncenter size-large wp-image-1729" /></a></p>
<p>Note that even at this zoom level (the whole screen covers about 1.3ms), most zones are <em>still</em> LOD&#8217;d out. I&#8217;ve mouse-over&#8217;ed on a single task that happens to hit one or two L3 cache misses and so is long enough (at about 1500 cycles) to show up individually, but most of these tasks are closer to 600 cycles. In total, frustum culling the approximately 1600 occluder models takes up just above 1ms, as the captions helpfully say. For reference, the much smaller block that says &#8220;OccludeesVisible&#8221; and takes about 0.1ms? That one actually processes about 27000 models (it&#8217;s the code we optimized in <a href="http://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a>). Again, <em>ouch</em>.</p>
<p>Fortunately, there&#8217;s a simple solution: don&#8217;t use one task per model. Instead, use a smaller number of tasks (I just used 32) that each cover multiple models. The code is fairly obvious, so I won&#8217;t bother repeating it here, but I am going to show you the results:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=497&#038;h=269" alt="Telemetry: Occluder culling fixed" width="497" height="269" class="aligncenter size-large wp-image-1734" /></a></p>
<p>Down from 1ms to 0.08ms in two minutes of work. Now we could apply the same level of optimization as we did to the occludee culling, but I&#8217;m not going to bother, because, at least for the time being, it&#8217;s fast enough. And with that out of the way, let&#8217;s look at the rasterization and depth testing part.</p>
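<p>For reference, the coarser task split used for the occluder culling above could look roughly like the following. This is a sketch, not the code from the repository: <code>ModelRange</code> and <code>GetTaskRange</code> are hypothetical names, and the only assumption is that each of a fixed number of tasks handles one contiguous range of models.</p>
<pre>
#include &lt;cassert&gt;

typedef unsigned int UINT;

struct ModelRange { UINT start, end; }; // half-open range [start, end)

// Returns the contiguous range of models handled by task 'taskId' when
// 'numModels' models are divided as evenly as possible among 'numTasks' tasks.
inline ModelRange GetTaskRange(UINT taskId, UINT numTasks, UINT numModels)
{
    assert(numTasks &gt; 0 &amp;&amp; taskId &lt; numTasks);
    UINT perTask = numModels / numTasks;
    UINT remainder = numModels % numTasks;
    // The first 'remainder' tasks get one extra model each.
    ModelRange r;
    r.start = taskId * perTask + (taskId &lt; remainder ? taskId : remainder);
    r.end = r.start + perTask + (taskId &lt; remainder ? 1 : 0);
    return r;
}
</pre>
<p>With about 1600 occluders and 32 tasks, each task then loops over roughly 50 models, so the per-task scheduling overhead is paid 32 times per frame instead of once per model.</p>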
<h3>A closer look</h3>
<p>Let&#8217;s look a bit more closely at what&#8217;s going on during rasterization:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png?w=497&#038;h=283" alt="Rasterization close-up" width="497" height="283" class="aligncenter size-full wp-image-1737" /></a></p>
<p>There are at least two noteworthy things clearly visible in this screenshot:</p>
<ol>
<li>There&#8217;s three separate passes &#8211; transform, bin, then rasterize.</li>
<li>For some reason, we seem to have an odd mixture of really long tasks and very short ones.</li>
</ol>
<p>The former shouldn&#8217;t come as a surprise, since it&#8217;s explicit in the code:</p>
<pre>
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::TransformMeshes, this,
    NUM_XFORMVERTS_TASKS, NULL, 0, "Xform Vertices", &amp;mXformMesh);
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::BinTransformedMeshes, this,
    NUM_XFORMVERTS_TASKS, &amp;mXformMesh, 1, "Bin Meshes", &amp;mBinMesh);
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::RasterizeBinnedTrianglesToDepthBuffer, this,
    NUM_TILES, &amp;mBinMesh, 1, "Raster Tris to DB", &amp;mRasterize);	

// Wait for the task set
gTaskMgr.WaitForSet(mRasterize);
</pre>
<p>What the screenshot does show us, however, is the cost of those synchronization points. There sure is a lot of &#8220;air&#8221; in that diagram, and we could get some significant gains from squeezing it out. The second point is more of a surprise though, because the code does in fact try pretty hard to make sure the tasks are evenly sized. There&#8217;s a problem, though:</p>
<pre>
void TransformedModelSSE::TransformMeshes(...)
{
    if(mVisible)
    {
        // compute mTooSmall

        if(!mTooSmall)
        {
            // transform verts
        }
    }
}

void TransformedModelSSE::BinTransformedTrianglesMT(...)
{
    if(mVisible &amp;&amp; !mTooSmall)
    {
        // bin triangles
    }
}
</pre>
<p>Just because we make sure each task handles an equal number of vertices (as happens for the &#8220;TransformMeshes&#8221; tasks) or an equal number of triangles (&#8220;BinTransformedTriangles&#8221;) doesn&#8217;t mean they are similarly-sized, because the work subdivision ignores culling. Evidently, the tasks end up <em>not</em> being uniformly sized &#8211; not even close. Looks like we need to do some load balancing.</p>
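<p>To make the imbalance concrete, here is a minimal sketch with made-up numbers. The <code>Model</code> struct and <code>actualWorkPerTask</code> are hypothetical stand-ins, not code from the Intel sample. Models are assigned to tasks by their position in the full vertex stream (culled or not), and the function reports how many vertices each task really ends up transforming:</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a model: vertex count plus the culling result.
struct Model {
    unsigned numVerts;
    bool active; // visible and not too small
};

// Assign models to taskCount tasks by their position in the *full* vertex
// stream (culling ignored), then report how many vertices each task actually
// has to transform once culled models are skipped.
std::vector<unsigned> actualWorkPerTask(const std::vector<Model>& models,
                                        unsigned taskCount)
{
    unsigned total = 0;
    for (const Model& m : models)
        total += m.numVerts;
    assert(total > 0 && taskCount > 0);

    std::vector<unsigned> work(taskCount, 0);
    unsigned seen = 0; // vertices walked so far, culled or not
    for (const Model& m : models) {
        // Simplification: a whole model goes to the task its first vertex
        // falls into.
        unsigned task = static_cast<unsigned>(
            static_cast<unsigned long long>(seen) * taskCount / total);
        if (m.active)
            work[task] += m.numVerts;
        seen += m.numVerts;
    }
    return work;
}
```

<p>With eight 100-vertex models split over four tasks, every task nominally owns 200 vertices. If models 0, 3, 4 and 5 are the only active ones, the real work per task comes out as 100, 100, 200 and 0 vertices: same nominal size, very different cost.</p>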
<h3>Balancing act</h3>
<p>To simplify things, I moved the computation of <code>mTooSmall</code> from <code>TransformMeshes</code> into <code>IsVisible</code> &#8211; right after the frustum culling itself. That required some shuffling arguments around, but it&#8217;s exactly the kind of thing we already saw in <a href="http://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a>, so there&#8217;s little point in going over it in detail again.</p>
<p>Once <code>TransformMeshes</code> and <code>BinTransformedTrianglesMT</code> use the exact same condition &#8211; <code>mVisible &amp;&amp; !mTooSmall</code> &#8211; we can determine the list of models that are visible and not too small once, compute how many triangles and vertices these models have in total, and then use these corrected numbers which account for the culling when we&#8217;re setting up the individual transform and binning tasks.</p>
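<p>The splitting itself can be sketched as follows. <code>taskRange</code> is a hypothetical helper, not a function from the Intel sample, and the sample may divide the work differently. The point is only that <code>totalItems</code> should be the count over active models, not over all models:</p>

```cpp
#include <cassert>
#include <utility>

// Half-open range [first, second) of work items owned by one task.
// Hypothetical helper; totalItems is meant to be the number of vertices
// (or triangles) in *active* models only.
std::pair<unsigned, unsigned> taskRange(unsigned totalItems,
                                        unsigned taskId, unsigned taskCount)
{
    assert(taskCount > 0 && taskId < taskCount);
    unsigned perTask = totalItems / taskCount;
    unsigned remainder = totalItems % taskCount;
    // The first `remainder` tasks each take one extra item, so task sizes
    // differ by at most one.
    unsigned begin = taskId * perTask + (taskId < remainder ? taskId : remainder);
    unsigned end = begin + perTask + (taskId < remainder ? 1u : 0u);
    return std::make_pair(begin, end);
}
```

<p>For 10 active items and 4 tasks this yields the ranges [0,3), [3,6), [6,8) and [8,10).</p>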
<p>This is easy to do: <code>DepthBufferRasterizerSSE</code> gets a few more member variables</p>
<pre>
UINT *mpModelIndexA; // 'active' models = visible and not too small
UINT mNumModelsA;
UINT mNumVerticesA;
UINT mNumTrianglesA;
</pre>
<p>and two new member functions</p>
<pre>
inline void ResetActive()
{
    mNumModelsA = mNumVerticesA = mNumTrianglesA = 0;
}

inline void Activate(UINT modelId)
{
    UINT activeId = mNumModelsA++;
    assert(activeId &lt; mNumModels1);

    mpModelIndexA[activeId] = modelId;
    mNumVerticesA += mpStartV1[modelId + 1] - mpStartV1[modelId];
    mNumTrianglesA += mpStartT1[modelId + 1] - mpStartT1[modelId];
}
</pre>
<p>that handle the accounting. The depth buffer rasterizer already kept cumulative vertex and triangle counts for all models; I added one more element at the end so I could use the simplified vertex/triangle-counting logic.</p>
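<p>The &#8220;one more element at the end&#8221; trick is worth spelling out. This is a sketch with hypothetical names (<code>buildStartOffsets</code>, <code>countOf</code>), not the code from the sample: with a cumulative array that has an extra final entry holding the grand total, the count for any model is a plain difference of neighbors, and the last model needs no special case.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build cumulative start offsets from per-model counts. The result has one
// more element than the input: start[i] is the offset of model i, and the
// extra last element holds the grand total.
std::vector<unsigned> buildStartOffsets(const std::vector<unsigned>& counts)
{
    std::vector<unsigned> start(counts.size() + 1);
    start[0] = 0;
    for (std::size_t i = 0; i < counts.size(); i++)
        start[i + 1] = start[i] + counts[i];
    return start;
}

// Count for model i, with no special case for the last model.
unsigned countOf(const std::vector<unsigned>& start, std::size_t i)
{
    assert(i + 1 < start.size());
    return start[i + 1] - start[i];
}
```

<p>For counts of 3, 5 and 2 the offsets are 0, 3, 8 and 10, so the last model&#8217;s count is 10 &#8211; 8 = 2, computed exactly like all the others.</p>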
<p>Then, at the end of the <code>IsVisible</code> pass (after the worker threads are done), I run</p>
<pre>
// Determine which models are active
ResetActive();
for (UINT i=0; i &lt; mNumModels1; i++)
    if(mpTransformedModels1[i].IsRasterized2DB())
        Activate(i);
</pre>
<p>where <code>IsRasterized2DB()</code> is just a predicate that returns <code>mVisible &amp;&amp; !mTooSmall</code> (it was already there, so I used it).</p>
<p>After that, all that remains is distributing work over the active models only, using <code>mNumVerticesA</code> and <code>mNumTrianglesA</code>. This is as simple as turning the original loop in <code>TransformMeshes</code></p>
<pre>
for(UINT ss = 0; ss &lt; mNumModels1; ss++)
</pre>
<p>into</p>
<pre>
for(UINT active = 0; active &lt; mNumModelsA; active++)
{
    UINT ss = mpModelIndexA[active];
    // ...
}
</pre>
<p>and the same for <code>BinTransformedMeshes</code>. All in all, this took me about 10 minutes to write, debug and test. And with that, we should have proper load balancing for the first two passes of rendering: transform and binning. The question, as always, is: does it help?</p>
<p><b>Change</b>: Better rendering &#8220;front end&#8221; load balancing</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial depth render</td>
<td>2.666</td>
<td>2.716</td>
<td>2.732</td>
<td>2.745</td>
<td>2.811</td>
<td>2.731</td>
<td>0.022</td>
</tr>
<tr>
<td>Balance front end</td>
<td>2.282</td>
<td>2.323</td>
<td>2.339</td>
<td>2.362</td>
<td>2.476</td>
<td>2.347</td>
<td>0.034</td>
</tr>
</table>
<p>Oh boy, does it ever. That&#8217;s a 14.4% reduction in median time <em>on top of what we already got last time</em>. And Telemetry tells us we&#8217;re now doing a much better job at submitting uniform-sized tasks:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png?w=497&#038;h=331" alt="Balanced rasterization front end" width="497" height="331" class="aligncenter size-full wp-image-1751" /></a></p>
<p>In this frame, there&#8217;s still one transform batch that takes longer than the others; this happens sometimes, because of context switches for example. But note that the other threads nicely pick up the slack, and we&#8217;re still fine: a ~2x variation on the occasional item isn&#8217;t a big deal, provided most items are still roughly the same size. Also note that, even though there&#8217;s 8 worker threads, we never seem to be running more than 4 tasks at a time, and the hand-offs between threads (look at what happens in the BinMeshes phase) seem too perfectly synchronized to just happen accidentally. I&#8217;m assuming that TBB intentionally never uses more than 4 threads because the machine I&#8217;m running this on has a quad-core CPU (albeit with HyperThreading), but I haven&#8217;t checked whether this is just a configuration option or not; it probably is.</p>
<h3>Balancing the rasterizer back end</h3>
<p>Now we can&#8217;t do the same trick for the actual triangle rasterization, because it works in tiles, and they just end up with uneven amounts of work depending on what&#8217;s on the screen &#8211; there&#8217;s nothing we can do about that. That said, we&#8217;re definitely hurt by the uneven task sizes here too &#8211; for example, on my original Telemetry screenshot, you can clearly see how the non-uniform job sizes hurt us:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png?w=497&#038;h=360" alt="Initial bad rasterizer balance" width="497" height="360" class="aligncenter size-full wp-image-1758" /></a></p>
<p>The green thread picks up a tile with lots of triangles to render pretty late, and as a result everyone else ends up waiting for it to finish. This is not good.</p>
<p>However, lucky for us, there&#8217;s a solution: the TBB task manager will parcel out tasks roughly in the order they were submitted. So all we have to do is to make sure the &#8220;big&#8221; tiles come first. Well, after binning is done, we know exactly how many triangles end up in each tile. So what we do is insert a single task between binning and rasterization that determines the right order to process the tiles in, then make the actual rasterization depend on it:</p>
<pre>
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::BinSort, this,
    1, &amp;mBinMesh, 1, "BinSort", &amp;sortBins);
gTaskMgr.CreateTaskSet(&amp;DepthBufferRasterizerSSEMT::RasterizeBinnedTrianglesToDepthBuffer,
    this, NUM_TILES, &amp;sortBins, 1, "Raster Tris to DB", &amp;mRasterize);	
</pre>
<p>So how does that function look? Well, all we have to do is count how many triangles ended up in each tile, and then sort the tiles by that. The function is so short I&#8217;m just gonna show you the whole thing:</p>
<pre>
void DepthBufferRasterizerSSEMT::BinSort(VOID* taskData,
    INT context, UINT taskId, UINT taskCount)
{
    DepthBufferRasterizerSSEMT* me =
        (DepthBufferRasterizerSSEMT*)taskData;

    // Initialize sequence in identity order and compute total
    // number of triangles in the bins for each tile
    UINT tileTotalTris[NUM_TILES];
    for(UINT tile = 0; tile &lt; NUM_TILES; tile++)
    {
        me-&gt;mTileSequence[tile] = tile;

        UINT base = tile * NUM_XFORMVERTS_TASKS;
        UINT numTris = 0;
        for (UINT bin = 0; bin &lt; NUM_XFORMVERTS_TASKS; bin++)
            numTris += me-&gt;mpNumTrisInBin[base + bin];

        tileTotalTris[tile] = numTris;
    }

    // Sort tiles by number of triangles, decreasing.
    std::sort(me-&gt;mTileSequence, me-&gt;mTileSequence + NUM_TILES,
        [&amp;](const UINT a, const UINT b)
        {
            return tileTotalTris[a] &gt; tileTotalTris[b]; 
        });
}
</pre>
<p>where <code>mTileSequence</code> is just an array of <code>UINT</code>s with <code>NUM_TILES</code> elements. Then we just rename the <code>taskId</code> parameter of <code>RasterizeBinnedTrianglesToDepthBuffer</code> to <code>rawTaskId</code> and start the function like this:</p>
<pre>
    UINT taskId = mTileSequence[rawTaskId];
</pre>
<p>and presto, we have bin sorting. Here&#8217;s the results:</p>
<p><b>Change</b>: Sort back-end tiles by amount of work</p>
<table>
<tr>
<th>Version</th>
<th>min</th>
<th>25th</th>
<th>med</th>
<th>75th</th>
<th>max</th>
<th>mean</th>
<th>sdev</th>
</tr>
<tr>
<td>Initial depth render</td>
<td>2.666</td>
<td>2.716</td>
<td>2.732</td>
<td>2.745</td>
<td>2.811</td>
<td>2.731</td>
<td>0.022</td>
</tr>
<tr>
<td>Balance front end</td>
<td>2.282</td>
<td>2.323</td>
<td>2.339</td>
<td>2.362</td>
<td>2.476</td>
<td>2.347</td>
<td>0.034</td>
</tr>
<tr>
<td>Balance back end</td>
<td>2.128</td>
<td>2.162</td>
<td>2.178</td>
<td>2.201</td>
<td>2.284</td>
<td>2.183</td>
<td>0.029</td>
</tr>
</table>
<p>Once again, we&#8217;re 20% down from where we started! Now let&#8217;s check in Telemetry to make sure it worked correctly and we weren&#8217;t just lucky:</p>
<p><a href="http://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png"><img src="http://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png?w=497&#038;h=387" alt="Rasterizer fully balanced" width="497" height="387" class="aligncenter size-full wp-image-1767" /></a></p>
<p>Now that&#8217;s just <em>beautiful</em>. See how the whole thing is now densely packed into the live threads, with almost no wasted space? This is how you want your profiles to look. Aside from the fact that our rasterization only seems to be running on 3 threads, that is &#8211; there&#8217;s always more digging to do. One fun thing I noticed is that TBB actually doesn&#8217;t process the tasks fully in-order; the two top threads indeed start from the biggest tiles and work their way forwards, but the  bottom-most thread actually starts from the end of the queue, working its way towards the beginning. The tiny LOD zone I&#8217;m hovering over covers both the bin sorting task and the seven smallest tiles; the packets get bigger from there.</p>
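<p>If you want to convince yourself that handing out the big tiles first helps, a toy simulation is enough. This is a sketch under a strong assumption: an idealized queue where each worker takes the next task strictly in list order as soon as it is free. As noted above, TBB doesn&#8217;t actually behave quite like that. <code>makespan</code> and <code>makespanBigFirst</code> are hypothetical names, and the costs are in arbitrary units:</p>

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Simulate workers that each grab the next task in list order as soon as
// they become free. Returns the time at which the last worker finishes.
unsigned makespan(const std::vector<unsigned>& taskCosts, unsigned numWorkers)
{
    assert(numWorkers > 0);
    std::vector<unsigned> busyUntil(numWorkers, 0);
    for (unsigned cost : taskCosts) {
        // The worker that frees up first takes the next task.
        auto next = std::min_element(busyUntil.begin(), busyUntil.end());
        *next += cost;
    }
    return *std::max_element(busyUntil.begin(), busyUntil.end());
}

// Same tasks, but with the most expensive ones handed out first.
unsigned makespanBigFirst(std::vector<unsigned> taskCosts, unsigned numWorkers)
{
    std::sort(taskCosts.begin(), taskCosts.end(), std::greater<unsigned>());
    return makespan(taskCosts, numWorkers);
}
```

<p>With six tasks of cost 1 followed by a single task of cost 6 on three workers, the in-order schedule finishes at time 8, because the big task starts last while two workers sit idle. With the big task first, everything is done at time 6.</p>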
<p>And with that, I think we have enough changes (and images!) for one post. We&#8217;ll continue ironing out scheduling kinks next time, but I think the lesson is already clear: you can&#8217;t just toss tasks to worker threads and expect things to go smoothly. If you want to get good thread utilization, better profile to make sure your threads actually do what you think they&#8217;re doing! And as usual, you can find the code for this post on <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog">Github</a>, albeit without the Telemetry instrumentation for now &#8211; Telemetry is a commercial product, and I don&#8217;t want to introduce any dependencies that make it harder for people to compile the code. Take care, and until next time.</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/hotspots_rast2.png" medium="image">
			<media:title type="html">Rasterizer hotspots without early-out</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial.png?w=497" medium="image">
			<media:title type="html">Initial Telemetry run</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_zoomed.png?w=497" medium="image">
			<media:title type="html">Telemetry: occluder visibility, zoomed</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_occluders_fixed.png?w=497" medium="image">
			<media:title type="html">Telemetry: Occluder culling fixed</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_raster_closeup.png" medium="image">
			<media:title type="html">Rasterization close-up</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmvis_rasterbal1.png" medium="image">
			<media:title type="html">Balanced rasterization front end</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_initial_badbal.png" medium="image">
			<media:title type="html">Initial bad rasterizer balance</media:title>
		</media:content>

		<media:content url="http://fgiesen.files.wordpress.com/2013/02/tmviz_rasterbal2.png" medium="image">
			<media:title type="html">Rasterizer fully balanced</media:title>
		</media:content>
	</item>
		<item>
		<title>Optimizing Software Occlusion Culling &#8211; index</title>
		<link>http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/</link>
		<comments>http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/#comments</comments>
		<pubDate>Sun, 17 Feb 2013 23:33:57 +0000</pubDate>
		<dc:creator>fgiesen</dc:creator>
				<category><![CDATA[Coding]]></category>

		<guid isPermaLink="false">http://fgiesen.wordpress.com/?p=1703</guid>
		<description><![CDATA[In January of 2013, some nice folks at Intel released a Software Occlusion Culling demo with full source code. I spent about two weekends playing around with the code, and after realizing that it made a great example for various things I&#8217;d been meaning to write about for a long time, started churning out blog [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=fgiesen.wordpress.com&#038;blog=9777542&#038;post=1703&#038;subd=fgiesen&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>In January of 2013, some nice folks at Intel released a <a href="http://software.intel.com/en-us/vcsource/samples/software-occlusion-culling">Software Occlusion Culling demo</a> with full source code. I spent about two weekends playing around with the code, and after realizing that it made a great example for various things I&#8217;d been meaning to write about for a long time, spent the next few weeks churning out blog posts about it. This is the resulting series.</p>
<p>Here&#8217;s the list of posts (the series is now finished):</p>
<ol>
<li><a href="http://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/">&#8220;Write combining is not your friend&#8221;</a>, on typical write combining issues when writing graphics code.</li>
<li><a href="http://fgiesen.wordpress.com/2013/01/30/a-string-processing-rant/">&#8220;A string processing rant&#8221;</a>, a slightly over-the-top post that starts with some bad string processing habits and ends in a rant about what a complete minefield the standard C/C++ string processing functions and classes are whenever non-ASCII character sets are involved.</li>
<li><a href="http://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/">&#8220;Cores don&#8217;t like to share&#8221;</a>, on some very common pitfalls when running multiple threads that share memory.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/01/fixing-cache-issues-the-lazy-way/">&#8220;Fixing cache issues, the lazy way&#8221;</a>. You could redesign your system to be more cache-friendly &#8211; but when you don&#8217;t have the time or the energy, you could also just do this.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/02/frustum-culling-turning-the-crank/">&#8220;Frustum culling: turning the crank&#8221;</a> &#8211; on the other hand, if you do have the time and energy, might as well do it properly.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/06/the-barycentric-conspirac/">&#8220;The barycentric conspiracy&#8221;</a> is a lead-in to some in-depth posts on the triangle rasterizer that&#8217;s at the heart of Intel&#8217;s demo. It&#8217;s also a gripping tale of triangles, Möbius, and a plot centuries in the making.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/08/triangle-rasterization-in-practice/">&#8220;Triangle rasterization in practice&#8221;</a> &#8211; how to build your own precise triangle rasterizer and <em>not</em> die trying.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/">&#8220;Optimizing the basic rasterizer&#8221;</a>, because this is real time, not amateur hour.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/11/depth-buffers-done-quick-part/">&#8220;Depth buffers done quick, part 1&#8221;</a> &#8211; at last, looking at (and optimizing) the depth buffer rasterizer in Intel&#8217;s example.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/16/depth-buffers-done-quick-part-2/">&#8220;Depth buffers done quick, part 2&#8221;</a> &#8211; optimizing some more!</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/17/care-and-feeding-of-worker-threads-part-1/">&#8220;The care and feeding of worker threads, part 1&#8221;</a> &#8211; this project uses multi-threading; time to look into what these threads are actually doing.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/25/the-care-and-feeding-of-worker-threads-part-2/">&#8220;The care and feeding of worker threads, part 2&#8221;</a> &#8211; more on scheduling.</li>
<li><a href="http://fgiesen.wordpress.com/2013/02/28/reshaping-dataflows/">&#8220;Reshaping dataflows&#8221;</a> &#8211; using global knowledge to perform local code improvements.</li>
<li><a href="http://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/">&#8220;Speculatively speaking&#8221;</a> &#8211; on store forwarding and speculative execution, using the triangle binner as an example.</li>
<li><a href="http://fgiesen.wordpress.com/2013/03/05/mopping-up/">&#8220;Mopping up&#8221;</a> &#8211; a bunch of things that didn&#8217;t fit anywhere else.</li>
<li><a href="http://fgiesen.wordpress.com/2013/03/10/optimizing-software-occlusion-culling-the-reckoning/">&#8220;The Reckoning&#8221;</a> &#8211; in which a lesson is learned, but <a href="http://www.alessonislearned.com/">the damage is irreversible</a>.</li>
</ol>
<p>All the code is available on <a href="https://github.com/rygorous/intel_occlusion_cull/">Github</a>; there&#8217;s various branches corresponding to various (simultaneous) tracks of development, including a lot of experiments that didn&#8217;t pan out. The articles all reference the <a href="https://github.com/rygorous/intel_occlusion_cull/tree/blog">blog branch</a> which contains only the changes I talk about in the posts &#8211; i.e. the stuff I judged to be actually useful.</p>
<p>Special thanks to Doug McNabb and Charu Chandrasekaran at Intel for publishing the example with full source code and a permissive license, and for saying &#8220;yes&#8221; when I asked them whether they were okay with me writing about my findings in this way!</p>
<p>
  <a rel="license" href="http://creativecommons.org/publicdomain/zero/1.0/"><br />
    <img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style:none;" alt="CC0" /><br />
  </a><br />
  <br />
  To the extent possible under law,<br />
  <a rel="dct:publisher" href="http://fgiesen.wordpress.com"><br />
    <span>Fabian Giesen</span></a><br />
  has waived all copyright and related or neighboring rights to<br />
  <span>Optimizing Software Occlusion Culling</span>.</p>
]]></content:encoded>
			<wfw:commentRss>http://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/32870837851c0e5eb620649cb8d3d608?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">fgiesen</media:title>
		</media:content>

		<media:content url="http://i.creativecommons.org/p/zero/1.0/88x31.png" medium="image">
			<media:title type="html">CC0</media:title>
		</media:content>
	</item>
	</channel>
</rss>
