This post is part of a series – go here for the index.
And welcome back to my impromptu optimization series. Today, we won’t see a single line of code, nor a profiler screenshot. That’s because our next subject is the triangle rasterizer, and we better brush up on our triangle facts before we dive into that. A lot of it is going to involve barycentric coordinates, hence the name. Be warned, this post is a lot drier than the previous ones, and it doesn’t even have a big pay-off at the end. It’s a pure, uncut info-dump. The purpose is to collect all this material in one place so I can refer back to it later as necessary.
Meet the triangle
Let’s just start by fast-forwarding through a whole bunch of things I assume you already know. If you don’t, Wikipedia can help you, as can virtually any maths textbook that covers planar geometry.
A triangle is a polygon with 3 vertices v0, v1, v2 and 3 edges v0v1, v1v2, v2v0. You can see a fine specimen on the left. A degenerate triangle is one where the three vertices are collinear, i.e. they all fall on the same line (or they might even be all the same point). Any triangle lies on a plane, and for all non-degenerate triangles, that plane is uniquely determined. This holds in any dimension, but is somewhat vacuous in 0D or 1D where any 3 vertices are going to be collinear.
Restricting ourselves to 2D for the moment, a triangle, like any non-self-intersecting planar polygon, divides the plane into two regions: the interior, which is finite, and the exterior, which is not. The two are separated by the boundary of the triangle, which consists of the three edges. To rasterize a triangle, we essentially just need to query a bunch of points, usually directly corresponding to the pixel grid or arranged in some other regular pattern, and figure out whether they are inside or not. Since this is a binary result and our query provides a ternary answer (outside, on the boundary, inside), we need to define how points on the boundary are to be handled. There’s multiple ways to do this; I’ll cover them in the article about actual triangle rasterization. Since a triangle is always planar, it’s easy to extend these definitions to higher dimensions by always working in the plane of the triangle (or in a plane of the triangle, should it be degenerate).
Finally, triangles are always convex: for any two points inside the triangle, the line connecting them is also fully inside the triangle. Convexity turns out to be important: in the plane, the inside of a convex polygon with n edges can always be written as the intersection of n half-spaces. That fact in itself is enough to write a triangle rasterizer, but if you haven’t done this before, you might be wondering what the hell I’m talking about at this point. So let’s take a step back and talk about the geometry of the situation for a second.
Oriented edges
Consider, for a moment, the edge v0v1 in the image above. The edge itself is a line segment. The corresponding line divides the plane into two halves (the half-spaces I mentioned): a “left” side and a “right” side. This is shown in the image on the left, with the “left” side being shaded. Now, speaking of it in those terms gets problematic if the edge we picked happens to be horizontal (which of the two halves is the “left” one if they’re stacked vertically?). So instead, we’re going to phrase everything relative to the edge rather than the picture we’re drawing: imagine you’re walking down the edge from v0 towards v1. Then, we’re gonna refer to everything on your left side (if you’re looking towards v1) as the “positive” half-space, and everything to the right as the “negative” half-space. Finally, points that are either right in front of you or right behind you (that is, points that fall on the line) belong to neither half-space.
Now, if we apply the same construction to the other two edges and overlay it all on top of each other, we get this image:
Walking through it in all its technicolored glory (apologies to the color-blind):
- The green region is in the positive half-space (I’m just gonna write “inside”) of v1v2 and v2v0, but outside of v0v1.
- The cyan region is inside v2v0, but outside the two other edges.
- Blue is inside v0v1 and v2v0.
- Magenta is inside v0v1 – all that’s left of the region we saw above.
- Red is inside v0v1 and v1v2.
- Yellow is inside v1v2.
- Finally, the gray region is inside all three edges – and that’s exactly our triangle.
And so we have the picture to go with my “intersection of 3 half-spaces” comment earlier. This means that all we need to figure out whether a point is in a triangle is to figure out whether it’s in all three positive half-spaces, which turns out to be fairly easy. That is, assuming that such a point even exists – what would’ve happened if v2 had been to the right of v0v1 instead of to the left, or if the same thing happened with any of the other edges? We’ll get there in a minute, but let’s first go into how we figure out whether a point is in the positive half-space or not.
Area, length, orientation
If we have the coordinates of all involved points, the answer turns out to be: determinants. And not just any old determinant will do; given the three points a, b and c, we want to compute the determinant

    Orient2D(a, b, c) = det | ax  bx  cx |
                            | ay  by  cy |
                            |  1   1   1 |
                          = (bx – ax)(cy – ay) – (by – ay)(cx – ax)
Clearly, if this expression is positive, c lies to the left of the directed edge ab (i.e. the triangle abc is wound counter-clockwise), and with that out of the way, we can start rasterizing triangles…
Wait, what?
Sorry, pet peeve. A lot of texts like to just spring these expressions on you without much explanation. That’s fine for papers, where you can expect your audience to know this already, but even a lot of introductory texts don’t bother with an actual explanation, which annoys me, because while this isn’t hard, it’s by no means obvious either.
So let’s look at this beast a bit more closely. First, notice how the first expression simply puts the three vertices into the columns with an appended 1 – why yes, those are in fact homogeneous coordinates, thank you for noticing. We’re not gonna make use of that here, but it’s worth knowing. Second, because we just use the vertex coordinates as the columns, this should make it immediately obvious that this expression is the same for all possible orderings of a, b, c, up to sign (this is just a determinant identity). In particular, if we plug in our vertex coordinates for a, b, c, we always get the same value (this time including sign) for all three cyclical permutations (v0v1v2, v1v2v0, and v2v0v1) of the vertices. Which in turn means that the “sidedness” we compute is going to be the same for all three edges, answering one of our questions above.
Next, note that we can transform the first form (the 3×3 determinant) into the second form by subtracting the first column from the other two and then developing the determinant with respect to the third row, which should hopefully make it a bit less mysterious. There’s also a very nice way to understand this geometrically, but I’m not going to explain that here – maybe another time. Anyway, now that we know how to derive the 2×2 form, let’s look at it in turn. With arbitrary 2D vectors p and q, the determinant

    det | px  qx | = px qy – py qx
        | py  qy |
gives the (signed) area of the parallelogram spanned by the edge vectors p and q (I’m assuming you know this one – it’s a standard linear algebra fact, and proving it is outside the scope of this article). Similarly, a 3×3 determinant of vectors p, q, r gives the signed volume of the parallelepiped spanned by those three vectors, and in higher dimensions, a n×n determinant of n vectors gives the signed n-volume of the corresponding n-parallelotope, but I digress.
So, with that in mind, let’s first look at our triangle and try to compute Orient2D(v0, v1, v2). That should help us find out whether it’s wound counter-clockwise (i.e. whether v2 is to the left of the oriented edge v0v1) or not. The expression above tells us to compute the determinant

    Orient2D(v0, v1, v2) = (v1x – v0x)(v2y – v0y) – (v1y – v0y)(v2x – v0x)
which should give us the signed area of the parallelogram with edges v0v1 and v0v2. Let’s draw that on top of our triangle so we can see what’s going on:
Now, there’s two things about this worth mentioning: First, if we were to swap v1 and v2, we would get the same edge vectors, just in the opposite order – we swap two columns of the determinant, which flips the sign but leaves the absolute value untouched. Now, our original triangle is wound counterclockwise: the third vertex v2 is to the left of the first edge v0v1. If we swap v1 and v2, we get the same triangle, only this time the third vertex (now v1) is to the right of the first edge (now v0v2). More precisely, the sign of the determinant turns out to be positive if our first turn is counter-clockwise, and negative if our first turn is clockwise. If it’s zero, all three vertices are collinear, so the triangle is degenerate – also useful to know.
The second thing is that the parallelogram we’re looking at clearly has twice the area of the triangle we started with. This is no accident – constructing the fourth vertex of the parallelogram produces another triangle that is congruent to the first one, so the two triangles have the same area, hence the parallelogram has twice the area of the triangle we started out with. This gives us the standard determinant formula for the area of the triangle:

    Area(v0v1v2) = 1/2 Orient2D(v0, v1, v2)
The other standard formula for triangle area is A = bh/2, where b is the length of the base of the triangle (=length of one of its edges) and h is the corresponding height (=length of the perpendicular of b through the vertex opposite b). In fact, the proof for this formula uses the same parallelogram we just saw. Compare the two expressions and we note that our signed area computation can be written

    Orient2D(v0, v1, v2) = |v1 – v0| h(v2, v0v1)
where h(v2, v0v1) denotes the signed height of v2 over v0v1 – this isn’t standard notation, but bear with me for a minute. The point here is that the value of this signed area computation is proportional to the signed distance of v2 from the edge. That this works on triangles should not be surprising – the same is true for rectangles, for example – but it’s worth spelling out explicitly here because we’ll be doing a lot of signed area computations to determine what is in effect signed distances. So it’s important to know that they’re equivalent.
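To make all that algebra concrete, here’s a minimal scalar sketch – the function and type names are mine, not from any particular codebase:

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Orient2D(a, b, c): twice the signed area of triangle abc.
// > 0 if abc winds counter-clockwise (c left of directed edge ab),
// < 0 if clockwise, == 0 if the three points are collinear.
float Orient2D(Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Signed triangle area is half the parallelogram area.
float SignedTriArea(Vec2 a, Vec2 b, Vec2 c)
{
    return 0.5f * Orient2D(a, b, c);
}
```

For the triangle (0,0), (4,0), (0,3), Orient2D gives 12 (positive, so counter-clockwise), and swapping any two vertices flips the sign without changing the magnitude.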
Edge functions
Now, let’s get back to our original use for these determinant expressions: figuring out on which side of an edge a point lies. So let’s pick an arbitrary point p and see how it relates to the edge v0v1. Throwing it into our determinant expression:

    Orient2D(v0, v1, p) = (v1x – v0x)(py – v0y) – (v1y – v0y)(px – v0x)
and if we rearrange terms a bit, regroup and simplify we get

    F01(p) = (v0y – v1y) px + (v1x – v0x) py + (v0x v1y – v0y v1x)
This is what I’ll call the edge function for edge v0v1. As you can see, if we hold the vertex positions constant, this is just an affine function on p. Doing the same with the other two edges gives us two more edge functions:

    F12(p) = (v1y – v2y) px + (v2x – v1x) py + (v1x v2y – v1y v2x)
    F20(p) = (v2y – v0y) px + (v0x – v2x) py + (v2x v0y – v2y v0x)
If all three of these are positive, p is inside the triangle, assuming the triangle is wound counter-clockwise, which I will assume for the rest of this article. If it’s clockwise, just swap two of the vertices before you start hit-testing. Now, these are normal linear functions, but from their derivation and the determinant properties we saw earlier, we know that they in fact also measure the signed area of the corresponding parallelogram – which in turn is twice the signed area of the corresponding triangle. Let’s pick a point inside the triangle and draw the corresponding diagram:
Our original triangle is partitioned into three smaller triangles that together exactly cover the area of the original triangle. And since p is inside, these triangles are all wound counter-clockwise themselves: they must be, because these triangles have signed areas corresponding to the edge functions, and we know all three of them are positive with p inside. So that’s pretty neat all by itself.
But wait, there’s more! Since the three triangles add up to the area of the original triangle, the three corresponding edge functions should add up to twice the signed area of the full triangle v0v1v2 (twice because triangle area has the 1/2 factor whereas our edge functions don’t). Or, as a formula:

    F01(p) + F12(p) + F20(p) = 2 Area(v0v1v2) = Orient2D(v0, v1, v2)
If you look at the terms in the edge functions containing px and py that shouldn’t be surprising: Summing the three terms for px gives (v0y – v1y + v1y – v2y + v2y – v0y) = 0, and similar for py. So yes, the sum of these three is constant alright. Now, looking at this in linear algebra terms, this shouldn’t come as a surprise: we have 3 affine functions on only 2 variables – they’re not going to be independent. But it still helps to see the underlying geometry.
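In code, the whole half-space construction boils down to a few lines – this is the naive scalar version with my own names, not an optimized rasterizer loop:

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Edge function for the directed edge ab, evaluated at p:
// positive if p is in the positive (left) half-space of ab,
// zero on the line through a and b, negative on the right.
// This is just Orient2D(a, b, p) with the terms multiplied out.
float EdgeFunc(Vec2 a, Vec2 b, Vec2 p)
{
    return (a.y - b.y) * p.x + (b.x - a.x) * p.y + (a.x * b.y - a.y * b.x);
}

// Point-in-triangle test for a counter-clockwise triangle v0 v1 v2:
// p is strictly inside iff all three edge functions are positive.
bool Inside(Vec2 v0, Vec2 v1, Vec2 v2, Vec2 p)
{
    return EdgeFunc(v0, v1, p) > 0.0f &&
           EdgeFunc(v1, v2, p) > 0.0f &&
           EdgeFunc(v2, v0, p) > 0.0f;
}
```

Note that for any p, the three edge function values sum to the same constant: twice the signed area of the triangle.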
Why signed areas are a good idea
Note that the statement about the edge functions summing up to twice the area of the triangle holds for any point, not just points inside the triangle. It’s not clear how that’s going to work when p is outside the triangle, so let’s have a look:
This time, the triangles actually overlap each other: The two triangles v0v1p and v1v2p are wound counter-clockwise and have positive area, same as before – also, they extend outside the area of the original triangle. But the third (red) triangle, v2v0p, is wound clockwise and has negative area, and happens to exactly cancel out the parts of the two other triangles that extend outside the original triangle v0v1v2. So it still all works out. If you haven’t seen this before, this kind of cancelling is an important trick, and can be used to simplify a lot of things that would otherwise be pretty hairy. For example, it can be used to calculate the area of any polygon, no matter how complicated, by just summing the areas of a bunch of triangles, one triangle for each edge. Doing the same using only positive-area triangles requires triangulating the polygon first, which is a much hairier problem, but again, I digress.
So where’s the barycentric coordinates already?
Now, this blog post is called “the barycentric conspiracy”, but strangely, this far in, we don’t seem to have seen a single barycentric coordinate yet. What’s up with that? Well, let’s first look at what barycentric coordinates are: in the context of a triangle, the barycentric coordinates of a point are a triple (w0, w1, w2) of numbers that act as “weights” for the corresponding vertices. So the three coordinate triples (1,0,0), (0,1,0) and (0,0,1) correspond to v0, v1 and v2, respectively. More generally, we allow the weights to be anything (except all zeros) and just divide through by their sum in the end. Then the barycentric coordinates for p are a triple (w0, w1, w2) such that:

    (w0 + w1 + w2) p = w0 v0 + w1 v1 + w2 v2
Since we divide through by their sum, they’re only unique up to scale – much like the homogeneous coordinates you’re hopefully familiar with as a graphics programmer. This is the second time we’ve bumped into them in this post. That is not an accident. Barycentric coordinates are a type of homogeneous coordinates, and in fact both were introduced in the same paper by Möbius in 1827. I’m trying to stick with plain planar geometry in this post since it’s easier to draw (and also easier to follow if you’re not used to thinking in projective geometry). That means the whole homogeneous coordinate angle is fairly subdued in this post, but trust me when I say that everything we’ve been doing in here works just as well in projective spaces. And you’ve already seen the geometric derivations for everything, so we can even do it completely coordinate-free if we wanted to (always good to know how to avoid the algebra if you’re not feeling like it).
But back to barycentric coordinates: We already know that our edge functions measure (signed) areas, and that they’re zero on their respective edges. Well, both v0 and v1 are on the edge v0v1 (obviously), and hence

    F01(v0) = F01(v1) = 0.
And we also already know that if we plug the third vertex into the edge function, we get twice the signed area of the whole triangle:

    F01(v2) = 2 Area(v0v1v2).
The same trick works with the other two edge functions: whenever all three vertices are involved, we get twice the signed area of the whole triangle, otherwise the result is zero. And we already know they’re affine functions. At this point, things should already look fairly suspicious, so I’m just gonna cut to the chase: Let’s set

    w0 = F12(p),   w1 = F20(p),   w2 = F01(p)
That’s right, the three edge functions, evaluated at p, give us p’s barycentric coordinates, normalized so their sum is twice the area of the triangle. Note that the barycentric weight is always for the vertex opposite the edge we’re talking about. Now that you’ve seen the area diagram, it should be clear why: what the edge function F12(p) gives us is the scaled area of the triangle v1v2p, and the further p is from edge v1v2, the larger that triangle is. At the extreme, when p is at v0, it covers the entirety of the original triangle we started out with. So that all makes sense. While we’re at it, let’s also define a normalized version of the barycentric coordinates with their sum always being 1:

    λ0 = F12(p) / (2 Area(v0v1v2))
    λ1 = F20(p) / (2 Area(v0v1v2))
    λ2 = F01(p) / (2 Area(v0v1v2))
So the secret is out – the determinants we’ve been looking at, the signed areas and distances, even the edge functions – it was barycentric coordinates all along. It’s all connected, and everybody’s in on it! Cue scare chord.
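Pulling the last few formulas together, a sketch (all names mine) of computing normalized barycentric coordinates straight from the three edge functions:

```cpp
#include <cassert>

struct Vec2 { float x, y; };

float EdgeFunc(Vec2 a, Vec2 b, Vec2 p)
{
    return (a.y - b.y) * p.x + (b.x - a.x) * p.y + (a.x * b.y - a.y * b.x);
}

struct Bary { float l0, l1, l2; };

// Normalized barycentric coordinates of p with respect to the CCW
// triangle v0 v1 v2: each edge function, divided by their sum, which
// is twice the signed triangle area. Each weight belongs to the
// vertex *opposite* its edge.
Bary Barycentric(Vec2 v0, Vec2 v1, Vec2 v2, Vec2 p)
{
    float w0 = EdgeFunc(v1, v2, p); // opposite v0
    float w1 = EdgeFunc(v2, v0, p); // opposite v1
    float w2 = EdgeFunc(v0, v1, p); // opposite v2
    float invTwoArea = 1.0f / (w0 + w1 + w2);
    return Bary{ w0 * invTwoArea, w1 * invTwoArea, w2 * invTwoArea };
}
```

Evaluated at a vertex, this gives exactly (1,0,0), (0,1,0) or (0,0,1), as the definition demands.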
Barycentric interpolation
And with that, we have all the math we need, but there’s one more application that I want to bring up: As I’ve said before, the barycentric coordinates are effectively weights for the various vertices. The definition uses this for the positions, but we can use those same weights to interpolate other stuff that’s supposed to vary linearly across a triangle, such as vertex attributes.
Now, for the depth buffer rasterizer that we’re going to look at, we only need to interpolate one thing, and that’s depth. If we have z values z0, z1, z2 at the vertices, we can determine the interpolated depth by computing

    z = λ0 z0 + λ1 z1 + λ2 z2
and if we have the edge function values for p already, that’s fairly straightforward and works just fine, at the cost of three multiplies and two adds. But remember that we have the whole thing normalized so the lambdas sum to 1. This means we can express any lambda in terms of the two others:

    λ0 = 1 – λ1 – λ2
Plugging this into the above expression and simplifying, we get:

    z = z0 + λ1 (z1 – z0) + λ2 (z2 – z0)
The differences between the zi‘s are constant across the triangle, so we can compute them once. This gives us an alternative barycentric interpolation expression that uses two multiplies and two adds, in a form that allows them to be executed as two fused multiply-adds. Now if there’s one thing we’ve seen in the previous posts in this series, it’s that counting operations is often the wrong way to approach performance problems, but this one simplification we will end up using in an inner loop that’s actually bottlenecked by the number of instructions executed. And, just as importantly, this is also the expression that GPUs normally use for vertex attribute interpolation. I might talk more about that at some point, but there’s already more than enough material for one sitting in this post. So see you next time, when we learn how to turn all this into a rasterizer.
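Side by side, the two interpolation forms look like this – a scalar sketch with my own names; a real inner loop would hoist the constant z deltas out per triangle:

```cpp
#include <cassert>

// Direct form: three multiplies, two adds.
// Assumes l0 + l1 + l2 == 1 (normalized barycentrics).
float InterpDepth3(float l0, float l1, float l2,
                   float z0, float z1, float z2)
{
    return l0 * z0 + l1 * z1 + l2 * z2;
}

// Rewritten using l0 = 1 - l1 - l2: the z deltas (z1 - z0) and
// (z2 - z0) are constant across the triangle, so per pixel this is
// two multiplies and two adds - two fused multiply-adds.
float InterpDepth2(float l1, float l2,
                   float z0, float z1, float z2)
{
    return z0 + l1 * (z1 - z0) + l2 * (z2 - z0);
}
```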
This post is part of a series – go here for the index.
In the past few posts, we’ve been looking at Intel’s Software Occlusion Culling sample. This post is going to be a bit shorter than the others so far. There are two reasons for this: first, next up is the rasterizer. It turns out there’s another common performance problem we’re gonna see in this series, but right now, fixing it is not going to make much of a difference: as it is, the rasterizer in the sample is fairly well balanced, and it’s not making any serious mistakes. In other words, this time round, we don’t get any easy pickings. Unlike the somewhat peripheral framework code we’ve been looking at so far, this is the actual heart of this sample, and it was written by people who know what they’re doing. Speeding it up is going to take some actual algorithmic improvements, which means a lot of prep work for me, since I’ll need to teach you the necessary groundwork first. :) This is gonna take several posts, but I promise that we’ll get a properly satisfying finale.
Well, that’s the first reason. The second reason is that I’ve actually had people complain about the rate at which I’m posting these, because they’re falling behind on reading! Sheesh. You slackers – try writing the damn things! :) Anyway, I’m going to give you this one for the road, and then I’m gonna stop for the weekend so you can catch up and I can start mentally preparing for the figures on rasterization I’m gonna have to produce. A picture may be worth a thousand words, but making a decent illustration takes me a lot longer than writing a thousand words does, so it’s not a happy trade-off.
Anyway, enough introduction. One final batch of frustum cull optimization coming right up. Let’s get cracking.
What we need here is some work ethic
So the whole point of my last post was that you can often make a big difference with fairly small changes, provided you know what you’re doing – minimally invasive surgery, so to speak. This has lots of advantages: it’s easy to write and review, very testable, and less likely to cause major disruptions or cause friction with other existing code. When we last saw our frustum culling code, we managed to speed it up by a factor of over 4.5x with a handful of fixes, none of which touched more than 3 lines of code. That’s both satisfying and impressive to family and friends (if you have the kind of family which tends to be impressed by these things), and it got us from the #4 location in our hot spot list all the way down to #10:
But as you can see, while that blue bar denoting Level-3 cache misses has gotten considerably shorter, it’s still there, and a lot of the things I mentioned in the previous part are still true: in particular, one instance of our TransformedAABBoxSSE is still well above the size of a cache line, and while we’ve managed to beat it all into a shape where the prefetcher works for us, we’re still only accessing 25 bytes per cache line, best case. That’s over 60% of our memory bandwidth wasted. Surely, we can do better?
Well, of course we can, but this time we’re gonna have to roll up our sleeves and do some more invasive changes to our code. Let’s first recap the struct layout:
class TransformedAABBoxSSE
{
    // Methods elided
    CPUTModelDX11 *mpCPUTModel;
    __m128 *mWorldMatrix;
    __m128 *mpBBVertexList;
    __m128 *mpXformedPos;
    __m128 *mCumulativeMatrix;
    bool *mVisible;
    float mOccludeeSizeThreshold;
    __m128 *mViewPortMatrix;
    float3 mBBCenter;
    float3 mBBHalf;
    bool mInsideViewFrustum;
    bool mTooSmall;
    float3 mBBCenterWS;
    float3 mBBHalfWS;
};
The part we care about right now is at the bottom: The two bools and the world-space bounding box. Now, it turns out that while one of the bools (mTooSmall) gets written by the function TransformedAABBoxSSE::IsTooSmall, nobody ever reads it – everyone just uses the return value of IsTooSmall. So we can just make it a local variable in that function and stop spending per-instance memory on it. That one’s fairly easy.
Ta data rhei
For mInsideViewFrustum though, we’re going to have to work a bit more. In particular, we’re gonna have to understand the actual dataflow patterns to figure out where the right place to put it is.
We already know that it gets set in IsInsideViewFrustum, because we’ve spent some time looking at that function already, although it’s gotten shorter since we last saw it:
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
    mInsideViewFrustum = pCamera->mFrustum.IsVisible(mBBCenterWS,
        mBBHalfWS);
}
Unfortunately, unlike the previous case, IsInsideViewFrustum doesn’t have a return value, so our boolean flag is actual state, and there’s two more methods that access it, one of which is also called IsInsideViewFrustum. I’m really not a fan of overloading when the two methods do completely different things – it’s confusing and error-prone – but I digress. Both of the other methods are inline:
inline void SetInsideViewFrustum(bool insideVF)
{
    mInsideViewFrustum = insideVF;
}

inline bool IsInsideViewFrustum()
{
    return mInsideViewFrustum;
}
And both of these get called from the outside, so we can’t simply nuke them. However, lucky for us, these dependencies don’t go very far upstream in the call graph at all. So let’s have a look where our three frustum cull-related functions get called. First, the function that updates our visibility state. Turns out there’s only two callers. Let’s look at the first one:
void AABBoxRasterizerSSEST::IsInsideViewFrustum(CPUTCamera *pCamera)
{
    mpCamera = pCamera;
    for(UINT i = 0; i < mNumModels; i++)
    {
        mpTransformedAABBox[i].IsInsideViewFrustum(mpCamera);
    }
}
Straightforward enough. The second one is in the class AABBoxRasterizerSSEMT, which does the exact same thing with some additional setup to figure out which part of the model list each task needs to process (it’s multi-threaded, as the name suggests). Both classes derive from the base class AABBoxRasterizer, which holds a bunch of things common to both the single- and multi-threaded implementations, including the array of TransformedAABBoxSSEs.
Because there’s first a global frustum culling pass on multiple threads, which is only then followed by a second pass that looks at the results, we can’t simply get rid of the per-model bookkeeping: it’s actual state. Let’s look at the callers of the no-parameters version of IsInsideViewFrustum to figure out where that state is read:
void AABBoxRasterizerSSEST::TransformAABBoxAndDepthTest()
{
    mDepthTestTimer.StartTimer();
    for(UINT i = 0; i < mNumModels; i++)
    {
        mpVisible[i] = false;
        mpTransformedAABBox[i].SetVisible(&mpVisible[i]);
        if(mpTransformedAABBox[i].IsInsideViewFrustum() &&
           !mpTransformedAABBox[i].IsTooSmall(
               mViewMatrix, mProjMatrix, mpCamera))
        {
            mpTransformedAABBox[i].TransformAABBox();
            mpTransformedAABBox[i].RasterizeAndDepthTestAABBox(
                mpRenderTargetPixels);
        }
    }
    mDepthTestTime[mTimeCounter++] = mDepthTestTimer.StopTimer();
    mTimeCounter = mTimeCounter >= AVG_COUNTER ? 0 : mTimeCounter;
}
And again, there’s a multi-threaded version that does pretty much the same, and no other callers.
Finally, searching for callers to SetInsideViewFrustum turns up exactly one hit, an inline function in AABBoxRasterizerSSE:
inline void ResetInsideFrustum()
{
    for(UINT i = 0; i < mNumModels; i++)
    {
        mpTransformedAABBox[i].SetInsideViewFrustum(true);
    }
}
As far as dataflow expeditions go, this one was pretty much as tame as it gets: it’s all concentrated in a few source files, among functions that are directly related and are at similar levels of the call tree. Refactoring this won’t be hard at all. Mind you, this is pretty much the best-case result – we got off lightly. In a lot of codebases, doing this kind of thing will quickly lead you to realize that the banana you’re interested in and the gorilla holding it are very tightly coupled. But now that we’ve determined that’s not the case, how do we rearrange things for better cache efficiency?
Shuffling data around
As we just saw, AABBoxRasterizerSSE and its subclasses are clearly in charge of running the whole frustum culling operation. Not only do they trigger the frustum culling computation, they also hold the array of bounding boxes, and they’re the only ones who actually look at the frustum culling results. That suggests that AABBoxRasterizerSSE is the natural place to put our frustum culling state. So let’s add an array of bools for the visibility state of the boxes, and make it parallel to the array we already have:
class AABBoxRasterizerSSE : public AABBoxRasterizer
{
    // ...
    TransformedAABBoxSSE *mpTransformedAABBox;
    bool *mpBBoxVisible; // <--- this is new
    // ...
};
This needs to be allocated and freed, but all of that is perfectly routine, so I won’t go into it. And once we’ve added it, we have a fairly simple plan of attack:
- Replace all calls to mpTransformedAABBox[i].IsInsideViewFrustum() (the version without arguments) by mpBBoxVisible[i].
- Similarly, replace calls to SetInsideViewFrustum by the corresponding assignment.
- Instead of writing the culling state to a member variable, have IsInsideViewFrustum(camera) (the update version) return the frustum culling state, and write it to the corresponding slot in mpBBoxVisible at the call site.
- Get rid of TransformedAABBoxSSE::mInsideViewFrustum now that it’s unreferenced.
Each of these items results in a handful of changes; the complete diff is here, for the curious.
And presto, we have a densely packed visibility state array (well, not that densely packed, since we still use a whole byte to store what’s effectively a 1-bit flag, but you get the idea). By itself, that won’t buy us much in the frustum culling pass, although it’s likely to make the later pass that checks for visible boxes faster, since we now never need to fetch the whole TransformedAABBoxSSE from memory if it was frustum culled.
But we can now turn the crank one more time and do the same with the world-space bounding boxes, creating yet another array held by AABBoxRasterizerSSE. We also move the actual visibility test to AABBoxRasterizerSSE (since the test function is a one-liner, that’s a simple change to make), wrap it inside a loop (since we’re always going to be culling a group of models), and call it from the two original frustum-culling loops in the single-threaded and multi-threaded rasterizer variants with the correct loop bounds. All of this is in this commit – as you can see, again it turns out to be mostly small changes.
Finally, for bonus points, we do some cleanup and remove the now-unnecessary fields and methods from TransformedAABBoxSSE. That’s in this commit.
And just like that, we have our bounding boxes densely packed in a nice linear array, and the output visibility flags densely packed in another array. No more reading a whole cache line to only use 25 bytes – this time, we look at everything in the cache lines we access, and we access it all sequentially. That should result in better cache hit rates, lower memory bandwidth usage, and generally better performance. But how much does it actually buy us? Let’s find out!
Whoa – almost down to a third of what we had before we started (for the record, the last few times, I’ve tried to keep run lengths roughly consistent so we can actually compare the cycles directly). Our CPI rate is down below 0.5 – meaning we run at over two instructions executed per clock cycle, sustained, through the whole loop. Those pesky L3 cache misses? Gone completely. And we seem to be surrounded by a lot of functions we haven’t seen before in this series, because by now we’re at rank 20 in the hot spots list – down by another 10 positions! (But wait, is that tan() right below us? What the hell is that doing there… ah well, never mind).
When people tell you that you should optimize for cache usage patterns above all else, this is what they mean.
Well, even before we started, the frustum culling performance was good enough that there was no pressing need to deal with it immediately. At this point, it’s fast enough that we should really focus our attention elsewhere; there are bigger fish to fry. But then again… we seem to be on a winning streak, so why stop now? Let’s aim for some extra credits and see if we can push it a bit further.
Up To Eleven
Now, since I’m cropping the screenshots heavily to make them fit in the blog layout, you can’t see what I see. For all the screen shots we’ve seen so far, I’ve always made the columns narrow and sorted them so that whatever I want to show you happens to be next to the labels. But what you actually get out of the “General Exploration” analysis I’ve had VTune run is more than 20 columns worth of various counters. So for most of the functions on the screen, there’s a bunch of other blue bars and counters that I haven’t shown you, representing various kinds of bottlenecks.
So you can’t see what I see, namely: absolutely nothing next to CalcInsideFrustum. In short, there’s nothing significant left to be gained by modifying data layout or implementation details. This code runs as smoothly as code can be expected to run. If we want to make things go faster still, we actually have to do less work.
Luckily, there’s still one source of inefficiency in the current algorithm: we pass in one box at a time, and test it against all 6 frustum planes. Now, this code uses SSE to test against 4 planes simultaneously, so it’s a fairly decent implementation. But the second half of the test only gives us 2 more planes; the other 2 SIMD lanes are wasted.
This can be fixed by turning the packing around: instead of testing one box against groups of four planes at a time, we test groups of four boxes against one plane at a time. Because we have far more boxes than planes, that means far less wasted work overall, at least potentially: the old test always checks one box against 8 plane slots, of which we only care about 6, so 6/8 = 75% of the computation done is useful. If we instead test groups of four boxes at a time, we run at perfect utilization except for the very last group, which might have fewer than 4 boxes in it if our total number of boxes isn’t divisible by four.
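In scalar form, the reshuffled test looks roughly like the sketch below. This is my own stand-in, not the sample’s code – in the real version the inner per-lane loop is a handful of packed SSE instructions operating on all four lanes at once, and the names are made up:

```cpp
#include <cstddef>
#include <cmath>

// Structure-of-arrays storage: four boxes per group, one box per "lane",
// mirroring the four lanes of an SSE register.
struct BoxGroup4 {
    float cx[4], cy[4], cz[4];   // box centers
    float hx[4], hy[4], hz[4];   // box half-extents
};

struct Plane { float nx, ny, nz, d; }; // plane equation: n.p + d >= 0 inside

// Test four boxes against numPlanes planes; results[i] = true if box i
// may be visible. A box is culled once it lies fully outside any plane.
void cullGroup4(const BoxGroup4& g, const Plane* planes, std::size_t numPlanes,
                bool results[4])
{
    for (int lane = 0; lane < 4; lane++)
        results[lane] = true;
    for (std::size_t p = 0; p < numPlanes; p++) {
        const Plane& pl = planes[p];
        // In the SSE version, this whole inner loop is a few packed
        // instructions processing all four lanes simultaneously.
        for (int lane = 0; lane < 4; lane++) {
            float dist = pl.nx * g.cx[lane] + pl.ny * g.cy[lane]
                       + pl.nz * g.cz[lane] + pl.d;
            float radius = std::fabs(pl.nx) * g.hx[lane]
                         + std::fabs(pl.ny) * g.hy[lane]
                         + std::fabs(pl.nz) * g.hz[lane];
            if (dist + radius < 0.0f)    // whole box behind this plane
                results[lane] = false;
        }
    }
}

// Self-check: plane x >= 0 against four unit-half-extent boxes on the x axis.
int demoMask() {
    BoxGroup4 g = {{5, -5, 0, -2}, {0, 0, 0, 0}, {0, 0, 0, 0},
                   {1, 1, 1, 1}, {1, 1, 1, 1}, {1, 1, 1, 1}};
    Plane planes[1] = {{1, 0, 0, 0}};
    bool r[4];
    cullGroup4(g, planes, 1, r);
    return (r[0] << 0) | (r[1] << 1) | (r[2] << 2) | (r[3] << 3);
}
```

A real implementation also has to deal with the final partial group when the box count isn’t a multiple of four, e.g. by padding the arrays with dummy boxes that always pass.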
Of course, to do this, we need to reorder our box structures so we can grab those four boxes efficiently. Given that the original goal of this post was to be shorter than the other ones and I’m already above 2300 words, I’m not going to delve into the details here, but again, you can just look at the code. So, does it help?
You bet. In fact, if you compare the numbers, we come pretty close to the 1.33x speedup you would expect from increasing utilization from 75% to near 100%. However, as you can see, our clocks-per-instruction rate went up again, and so did our L3 misses. That’s because we’re now starting to outrun the cache prefetchers again.
Now, I have a processor with AVX support, and if we were compute limited, we could try to use 8-wide SIMD instead of 4-wide SIMD. But considering that we already seem to be processing data faster than we can fetch it, there’s not much point. I tried it anyway to be certain, and sure enough, it’s really mostly a way of turning code with slightly too little computation per data item into code with far too little computation per data item. Given what I saw in that code, I believe things might look slightly different in x64 mode, where we get 8 more YMM registers that this code could really make great use of, but I didn’t look into it; this post has gone on for long enough already.
Conclusions
I still stand by what I said in my previous post, namely that you don’t need to go full-on Data-Oriented Design to get good performance on modern CPUs. But all that said, if you’re willing to put in the effort, it definitely does pay off: we got a 3.33x speedup on code that was already using SSE to begin with. Stop counting ALU cycles, people. As this series should have shown you by now, it’s really not so much about what happens when your code runs – it’s about getting rid of the things that make it grind to a halt. As you just saw, data density makes a huge difference in cache efficiency (and hence execution times), and the widest ALUs in the world won’t do you any good if you can’t keep them fed.
And on that note, I’m gonna let this particular pipeline drain over the weekend so you have some time to let it all settle :). See you next time!
This post is part of a series – go here for the index.
Last time, we ended on a bit of a cliffhanger. We’ll continue right where we left off, but first, I want to get a few things out of the way.
First, a lot of people have been asking me what profiler I’ve been using for the various screenshots. For the record, all these measurements were done using the current version of Intel VTune. VTune has a bit of a learning curve, but if you’re trying to figure out what’s really going on while your code is running, you want something that supports the CPU’s internal event-based sampling counters and can do the proper analysis, and VTune does. If, on the other hand, you just want to take a quick peek to figure out why something is slow (which is most of the time when you’re not fine-tuning), I suggest you start with Very Sleepy – it’s free, small and easy to use.
Next, some background on why I’m writing this: I do not intend to badmouth the sample project I’ve been using for this series, nor do I want to suggest it’s doing anything particularly stupid. As you might have noticed, both the write-combining issues and the involuntary sharing we saw last post boiled down to two-line fixes in the source. These kinds of things just happen as code gets written and modified, particularly if there’s deadlines involved. In fact, I’m using this example precisely because the problems I’ve found in it are so very typical: I have run into all of these problems before on other projects, and I assume so have most engine programmers who’ve shipped a game. That’s exactly why I think this is worth writing down: so that people who don’t have much optimization experience and are running into performance problems know what to look for.
Third, lest you get a false impression: I’m in a comfortable position here – I spent two weekends (and the equivalent of maybe 3 days worth of full-time work) looking at the code, profiling and tweaking it. And of course, I’m only writing up the changes that worked. You don’t get to see the false starts and all the ideas that didn’t pan out. Nor am I presenting my changes in chronological order: as you can see in the Github repository, in fact I did the SSE version of CPUTFrustum::IsVisible a whole day before I found the sharing issues that were the actual bottleneck. With 20-20 hindsight, I get to present changes in order of impact, but that’s not how it plays out in practice. The whole process is a lot messier (and less deterministic) than it may seem in these posts.
And with all that out of the way, let’s look at cache effects.
Previously…
…we looked at the frustum culling code in Intel’s Software Occlusion Culling sample. The last blog post ended with me showing this profile:
and explaining that the actual issue is triggered by this (inlined) function:
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
float3 mBBCenterWS;
float3 mBBHalfWS;
mpCPUTModel->GetBoundsWorldSpace(&mBBCenterWS, &mBBHalfWS);
mInsideViewFrustum = pCamera->mFrustum.IsVisible(mBBCenterWS,
mBBHalfWS);
}
which spends a considerable amount of time missing the cache while trying to read the world-space bounding box from mpCPUTModel. Well, I didn’t actually back that claim up with any data yet. As you can see in the above profile, we spend about 13.8 billion cycles total in AABBoxRasterizerSSEMT::IsInsideViewFrustum. Now, if you look at the actual assembly code, you’ll notice that the majority of them are spent in a handful of instructions:
As you can see, about 11.7 of our 13.8 billion cycles get counted on a mere two instructions. The column right next to the cycle counts is the number of “last-level cache” (L3 cache) misses. It doesn’t take a genius to figure out that we might be running into cache issues here.
The code you’re seeing is inlined from CPUTModel::GetBoundsWorldSpace, and simply copies the 6 floats describing the bounding box center and half-extents from the model into the two provided locations. That’s all this fragment of code does. But the member variables of CPUTModel are one indirection through mpCPUTModel away, and clearly, following that pointer results in a lot of cache misses. It turns out that this is the only time anyone ever looks at data from CPUTModel during the frustum-culling pass. Now, what we really want is the 24 bytes worth of bounding box that we’re going to read. But CPU cores fetch data in units of cache lines, which are 64 bytes on current x86 processors. So in the best case, we get 24 bytes of data we care about, and 40 bytes of data we don’t. If we’re unlucky and the box crosses a cache line boundary, we might even end up fetching 128 bytes. And because it’s behind some arbitrary pointer, the processor can’t easily do tricks like automatic memory prefetching to reduce the cost of the access: prefetching requires predictable access patterns, and following pointers isn’t very predictable – not to the CPU core, anyway.
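To make the cache line accounting concrete, here’s a tiny helper (hypothetical, purely for illustration): a 24-byte box fits in one 64-byte line only if it doesn’t happen to straddle a line boundary.

```cpp
#include <cstddef>

// How many 64-byte cache lines does a read of `size` bytes starting at
// byte address `addr` touch? (Assumes the 64-byte lines of current x86.)
int cacheLinesTouched(std::size_t addr, std::size_t size) {
    std::size_t first = addr / 64;              // line holding the first byte
    std::size_t last  = (addr + size - 1) / 64; // line holding the last byte
    return static_cast<int>(last - first + 1);
}
```

For a 24-byte read, any start offset of 41..63 within a line crosses into the next one, so unless the box is suitably aligned, a fetch can cost two lines (128 bytes) instead of one.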
At this point, you might decide to rewrite the whole thing to have more coherent access patterns. Now, the frustum culling loop actually isn’t that complicated, and rewriting it (and changing the data structures to be more cache-friendly) wouldn’t take very long, but for now let’s suppose we don’t know that. Is there an incremental, less error-prone way to get a quick speed boost, and maybe end up in a better position should we choose to change the frustum culling code later?
Making those prefetchers work
Of course there is, or I wouldn’t be asking. The key realization is that the outer loop in AABBoxRasterizerSSEMT::IsInsideViewFrustum actually traverses an array of bounding boxes (type TransformedAABBoxSSE) in order:
for(UINT i = start; i < end; i++)
{
mpTransformedAABBox[i].IsInsideViewFrustum(mpCamera);
}
One linear traversal is all we need. We know that the hardware prefetchers are going to load that data ahead for us – and by now, they’re smart enough to do that properly even if our accesses are strided, that is, we don’t read all the data between the start and the end of the array, only some of it at a regular spacing. This means that if we can get those world-space bounding boxes into TransformedAABBoxSSE itself, they’ll automatically get prefetched for us. And it turns out that in this example, all models are at fixed positions – so we can determine the world-space bounding boxes once, at load time. Let’s look at our function again:
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
float3 mBBCenterWS;
float3 mBBHalfWS;
mpCPUTModel->GetBoundsWorldSpace(&mBBCenterWS, &mBBHalfWS);
mInsideViewFrustum = pCamera->mFrustum.IsVisible(mBBCenterWS,
mBBHalfWS);
}
Here’s the punch line: all we really have to do is promote these two variables from locals to member variables, and move the GetBoundsWorldSpace call to init time. Sure, it’s a bit crude, and it leads to data duplication, but on the plus side, this is a really easy thing to try – just move a few lines of code around. If it pans out, we can always do it cleaner later. Which leaves the question – does it pan out?
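In sketch form, with simplified stand-in types rather than the sample’s actual classes, the change amounts to this:

```cpp
// Simplified stand-ins for the sample's classes, just to show the
// shape of the change.
struct float3 { float x, y, z; };

struct Model {  // stand-in for CPUTModel, normally behind a pointer
    float3 centerWS, halfWS;
    void GetBoundsWorldSpace(float3* c, float3* h) const {
        *c = centerWS;
        *h = halfWS;
    }
};

class BoxSSE {  // stand-in for TransformedAABBoxSSE
public:
    void Init(const Model* model) {
        mpModel = model;
        // Moved here from the per-frame culling path: fetch the bounds
        // once at load time instead of chasing mpModel every frame.
        mpModel->GetBoundsWorldSpace(&mBBCenterWS, &mBBHalfWS);
    }
    const float3& CenterWS() const { return mBBCenterWS; }
    const float3& HalfWS() const { return mBBHalfWS; }
private:
    const Model* mpModel = nullptr;
    float3 mBBCenterWS{};  // promoted from locals to members
    float3 mBBHalfWS{};
};

float boxSSEDemo() {
    Model m{{1, 2, 3}, {4, 5, 6}};
    BoxSSE b;
    b.Init(&m);
    return b.CenterWS().y + b.HalfWS().z; // 2 + 6
}
```

The per-frame culling path now reads the cached members directly, which live inside the linearly traversed box array – exactly the access pattern the prefetchers like.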
Of course it does – I get to cheat and only write about the changes that work, remember? As you see, now the clock cycles are back in CPUTFrustum::IsVisible. This is not because it’s gotten mysteriously slower, it’s because IsInsideViewFrustum doesn’t copy any data anymore, so IsVisible is the first function to look at the bounding box cache lines now. Which means that it gets billed for those cache misses now.
It’s still not great (I’ve included the Clocks Per Instruction Rate again so we can see where we stand), but we’re clearly making progress: compared to the first profile at the top of this post, which has a similar total cycle count, we’re very roughly twice as fast – and that’s for IsVisible, which includes not just the cache misses but also the actual frustum culling work. Meanwhile, AABBoxRasterizerSSEMT::IsInsideViewFrustum, now really just a loop, has dropped well out of the top 20 hot spots, as it should. Pretty good for just moving a couple of lines of code around.
Order in the cache!
Okay, our quick fix got the HW prefetchers to work for us, and clearly that gave us a considerable improvement. But we still only need 24 bytes out of every TransformedAABBoxSSE. How big is one? Let’s have a look at the data members (methods elided):
class TransformedAABBoxSSE
{
// Methods elided
CPUTModelDX11 *mpCPUTModel;
__m128 *mWorldMatrix;
__m128 *mpBBVertexList;
__m128 *mpXformedPos;
__m128 *mCumulativeMatrix;
UINT mBBIndexList[AABB_INDICES]; /* 36 */
bool *mVisible;
bool mInsideViewFrustum;
float mOccludeeSizeThreshold;
bool mTooSmall;
__m128 *mViewPortMatrix;
float3 mBBCenter;
float3 mBBHalf;
float3 mBBCenterWS;
float3 mBBHalfWS;
};
In a 32-bit environment, that gives us 226 bytes of payload per BBox (the actual size is a bit more, due to alignment padding). Of these 226 bytes, for the frustum culling, we actually read 24 bytes (mBBCenterWS and mBBHalfWS) and write one (mInsideViewFrustum). That’s a pretty bad ratio, and there’s a lot of wasted memory, but for the purposes of caching, we only pay for what we actually read, and that’s not much. That said, even though we don’t access it here, the biggest chunk of data in the whole thing is mBBIndexList at 144 bytes, which is just a list of triangle indices for this BBox. That’s completely unnecessary, since that list is the same for every single BBox in the system. So let’s fix that, and reorder some of the other fields so that the members we access during frustum culling are close to each other (and hence more likely to hit the same cache line):
class TransformedAABBoxSSE
{
// Methods elided
CPUTModelDX11 *mpCPUTModel;
__m128 *mWorldMatrix;
__m128 *mpBBVertexList;
__m128 *mpXformedPos;
__m128 *mCumulativeMatrix;
bool *mVisible;
float mOccludeeSizeThreshold;
__m128 *mViewPortMatrix;
float3 mBBCenter;
float3 mBBHalf;
bool mInsideViewFrustum;
bool mTooSmall;
float3 mBBCenterWS;
float3 mBBHalfWS;
};
Note that we write mInsideViewFrustum right after reading the world-space bounding box, so it makes sense to make them adjacent. I put the flag fields between the object-space and the world-space bounding box simply because the object-space bounding box is reasonably large (24 bytes, about a third of a cache line), and having it sit between the flags and the world-space box would greatly increase our chances of having to fetch two cache lines instead of one per box.
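If you want to be sure a layout like this actually keeps the hot fields within one cache line, you can have the compiler verify it. A sketch with a simplified stand-in struct (not the sample’s real class; field names and ordering are illustrative):

```cpp
#include <cstddef>

struct float3 { float x, y, z; };

// Simplified stand-in for the reordered class: the fields touched by the
// frustum-culling pass grouped together at a known offset.
struct CulledBox {
    bool   insideViewFrustum; // written during culling
    bool   tooSmall;
    float3 centerWS;          // read during culling
    float3 halfWS;            // read during culling
    // ... cold members (pointers, object-space box, etc.) would follow
};

// Compile-time check: all hot fields must fit within the first 64 bytes.
static_assert(offsetof(CulledBox, halfWS) + sizeof(float3) <= 64,
              "hot culling data spans more than one cache line");
```

This doesn’t guarantee the whole group lands in a single line at runtime (that also depends on the allocation’s alignment), but it catches accidental growth of the hot region the next time someone adds a field.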
So, did it help?
Sure did. IsVisible is down to the number 10 spot, and the CPI Rate is down to an acceptable 1.042 clocks/instruction. Now that’s by no means the end of the line, but I want to make this clear: all I did here was factor out one common array to be a shared static const variable, and reorder some class members. That’s it. If you don’t count the initializers for the 36-element index list (which I’ve copied with comments and generous spacing, so it’s a few lines long), we’re talking less than 10 lines of code changed for all the improvements in this post. Total.
In the last few years, there’s been a push by several prominent game developers to “Data-Oriented Design”, which emphasizes structuring code around desired data-flow patterns, rather than the other way round. That’s a sound design strategy particularly for subsystems like the one we’re looking at. It’s also a good guideline for what you want to work towards when refactoring existing code. But the point I want to make here is that even when trying to optimize existing code within its existing environment, you can achieve substantial gains by a sequence of simple, localized improvements. That will only get you so far, but there’s a lot to be said for incremental techniques, especially if you’re just trying to hit a given performance goal in a limited time budget.
And that’s it for today. I might do another post on the frustum culling (I want it gone from the top 10 completely!), or I might turn to the actual rasterizer code next for a change of pace – haven’t decided yet. Until next time!
This post is part of a series – go here for the index.
Two posts ago, I explained write combining and used a real-world example to show how badly it can go wrong if you’re not careful. The last part was an out-of-turn rant about some string and memory management insanity that was severely hurting the loading times of that same program. That program was Intel’s Software Occlusion Culling sample, which I’ve been playing around with for the last two weekends.
Well, it turns out that there’s even more common performance problems where those two came from. Now, please don’t walk away with the impression that I’m doing this to pick on either Intel or the authors of that sample. What I’m really trying to do here is talk about common performance problems you might find in a typical game code base. Now, I’ve worked on (and in) several such projects before, and every one of them had its share of skeletons in the closet. But this time, the problems happen to be in an open-source example with a permissive license, written by a third party. Which means I can post the code and my modifications freely, a big plus if I’m going to blog about it – real code is a lot more interesting than straw man examples. And to be honest, I’m a lot more comfortable with publicly talking about performance problems in “abstract code I found on the internet” than I would be doing the same with code that I know a friend hammered out quickly two days before trying to get a milestone out.
What I’m trying to say here is, don’t let this discourage you from looking at the actual occlusion culling code that is, after all, the point of the whole example. And no hard feelings to the guys at Intel who went through the trouble of writing and releasing it in the first place!
Our problem of the day
That said, we’re still not going to see any actual occlusion culling performance problems or optimizations today. Because before we get there, it turns out we have some more low-hanging fruit to pick. As usual, here’s a profile – of the rendering this time.
All the functions with SSE in their names relate to the actual depth buffer rasterizer that’s at the core of the demo (as said, we’re gonna see it eventually). XdxInitXopAdapterServices is actually the user-mode graphics driver, the tbb_graphics_samples thing is the TBB scheduler waiting for worker threads to finish (this sample uses Intel’s TBB to dispatch jobs to multiple worker threads), dxgmms1.sys is the video memory manager / GPU scheduler, and atikmdag.sys is the kernel-mode graphics driver. In short, the top 10 list is full of the kinds of things you would expect in an example that renders lots of small models with software occlusion culling.
Except for that spot up at #2, that is. This function – CPUTFrustum::IsVisible – simply checks whether an axis-aligned bounding box intersects the view frustum, and is used for coarse frustum culling before occlusion is even considered. And it’s a major time sink.
Now, instead of the hierarchical callstack profiling I used last time to look at the loading, this profile was made using hardware event counters, same as in the write combining article. I’ve taken the liberty of spoiling the initial investigation and going straight to the counters that matter: see that blue bar in the “Machine Clears” column? That bar is telling us that the IsVisible function apparently spends 23.6% of its total running time performing machine clears! Yikes, but what does that mean?
Understanding machine clears
What Intel calls a “machine clear” on its current architectures is basically a panic mode: the CPU core takes all currently pending operations (i.e. anything that hasn’t completed yet), cancels them, and then starts over. It needs to do this whenever some implicit assumption that those pending instructions were making turns out to be wrong.
On the Sandy Bridge i7 I’m running this example on, there’s event counters for three kinds of machine clears. Two of them we can safely ignore in this article – one of them deals with self-modifying code (which we don’t have) and one can occur during execution of AVX masked load operations (which we don’t use). The third, however, bears closer scrutiny: its official name is MACHINE_CLEAR.MEMORY_ORDERING, and it’s the event that ends up consuming 23.6% of all CPU cycles during IsVisible.
A memory ordering machine clear gets triggered whenever the CPU core detects a “memory ordering conflict”. Basically, this means that some of the currently pending instructions tried to access memory that we just found out some other CPU core wrote to in the meantime. Since these instructions are still flagged as pending while the “this memory just got written” event means some other core successfully finished a write, the pending instructions – and everything that depends on their results – are, retroactively, incorrect: when we started executing these instructions, we were using a version of the memory contents that is now out of date. So we need to throw all that work out and do it over. That’s the machine clear.
Now, I’m not going to go into the details of how exactly a core knows when other cores are writing to memory, or how the cores make sure that whenever multiple cores try to write to the same memory location, there’s always one (and only one) winner. Nor will I explain how the cores make sure they learn these things in time to cancel all operations that might depend on them. These are deep and fascinating questions, but the details are unbelievably gnarly (once you get down to the bottom of how it all works within a core), and they’re well outside the scope of this post. What I will say here is that cores track memory “ownership” at cache line granularity. So when a memory ordering conflict happens, that means something in a cache line we just accessed changed in the meantime. Might be data we actually looked at, might be something else in the same line – the core doesn’t know. Ownership is tracked at the cache line level, not the byte level.
So the core issues a machine clear whenever something in a cache line we just looked at changed. It might be due to actual shared data, or it might be two unrelated data items that just happen to land in the same cache line in memory – this latter case is normally referred to as “false sharing”. And to clear up something that a lot of people get wrong, let me point out that “false sharing” is purely a software concept. CPUs really only track ownership on a cache line level, and a cache line is either shared or it’s not, it’s never “falsely shared”. So “false sharing” is purely a property of your data’s layout in memory; it’s not something the CPU knows (or can do anything) about.
Anyway, I digress. Evidently, we’re sharing something, intentionally or not, and that something is causing a lot of instructions to get cancelled and re-executed. The question is: what is it?
Finding the culprit
And this is where it gets icky. With a lot of things like cache misses or slow instructions, a profiler can tell us exactly which instruction is causing the problem. Memory ordering problems are much harder to trace, for two reasons: First, they necessarily involve multiple cores (which tends to make it much harder to find the corresponding causal chain of events), and second, because of the cache line granularity, even when they show up as events in one thread, they do so on an arbitrary instruction that happens to access memory near the actual shared data. Might be the data that is actually being modified elsewhere, or it might be something else. There’s no easy way to find out. Looking at these events in a source-level profile is almost completely useless – in optimized code, a completely unrelated instruction that logically belongs to another source line might cause a spike. In an assembly-level profile, you at least get to see the actual instruction that triggers the event, but for the reasons stated above that’s not necessarily very helpful either.
So it boils down to this: a profiler will tell you where to look, and it will usually point you to some code near the code that’s actually causing the problem, and some data near the data that is being shared. That’s a good starting point, but from there on it’s manual detective work – staring at the code, staring at the data structures, and trying to figure out what case is causing the problem. It’s annoying work, but you get better at it over time, and there’s some common mistakes – one of which we’re going to see in a minute.
But first, some context. IsVisible is called in parallel on multiple threads (via TBB) in a global, initial frustum-cull pass. This is where we’re seeing the slowdown. Evidently, those threads are writing to shared data somewhere: it must be writes – as long as the memory doesn’t change, you can’t get any memory ordering conflicts.
Here’s the declaration of the CPUTFrustum class (several methods omitted for brevity):
class CPUTFrustum
{
public:
float3 mpPosition[8];
float3 mpNormal[6];
UINT mNumFrustumVisibleModels;
UINT mNumFrustumCulledModels;
void InitializeFrustum( CPUTCamera *pCamera );
bool IsVisible(
const float3 &center,
const float3 &half
);
};
And here’s the full code for IsVisible, with some minor formatting changes to make it fit inside the layout (excerpting it would spoil the reveal):
bool CPUTFrustum::IsVisible(
const float3 &center,
const float3 &half
){
// TODO: There are MUCH more efficient ways to do this.
float3 pBBoxPosition[8];
pBBoxPosition[0] = center + float3( half.x, half.y, half.z );
pBBoxPosition[1] = center + float3( half.x, half.y, -half.z );
pBBoxPosition[2] = center + float3( half.x, -half.y, half.z );
pBBoxPosition[3] = center + float3( half.x, -half.y, -half.z );
pBBoxPosition[4] = center + float3( -half.x, half.y, half.z );
pBBoxPosition[5] = center + float3( -half.x, half.y, -half.z );
pBBoxPosition[6] = center + float3( -half.x, -half.y, half.z );
pBBoxPosition[7] = center + float3( -half.x, -half.y, -half.z );
// Test each bounding box point against each of the six frustum
// planes.
// Note: we need a point on the plane to compute the distance
// to the plane. We only need two of our frustum's points to do
// this. A corner vertex is on three of the six planes. We
// need two of these corners to have a point on all six planes.
UINT pPointIndex[6] = {0,0,0,6,6,6};
UINT ii;
for( ii=0; ii<6; ii++ )
{
bool allEightPointsOutsidePlane = true;
float3 *pNormal = &mpNormal[ii];
float3 *pPlanePoint = &mpPosition[pPointIndex[ii]];
float3 planeToPoint;
float distanceToPlane;
UINT jj;
for( jj=0; jj<8; jj++ )
{
planeToPoint = pBBoxPosition[jj] - *pPlanePoint;
distanceToPlane = dot3( *pNormal, planeToPoint );
if( distanceToPlane < 0.0f )
{
allEightPointsOutsidePlane = false;
break; // from for. No point testing any
// more points against this plane.
}
}
if( allEightPointsOutsidePlane )
{
mNumFrustumCulledModels++;
return false;
}
}
// Tested all eight points against all six planes and
// none of the planes had all eight points outside.
mNumFrustumVisibleModels++;
return true;
}
Can you see what’s going wrong? Try to figure it out yourself. It’s a far more powerful lesson if you discover it yourself. Scroll down if you think you have the answer (or if you give up).
The reveal
As I mentioned, what it takes for memory ordering conflicts to occur is writes. The function arguments are const, and mpPosition and mpNormal aren’t modified either. Local variables live in registers or on the stack; either way, different threads’ copies are far enough apart not to conflict. Which leaves only two variables: mNumFrustumCulledModels and mNumFrustumVisibleModels. And indeed, both of these (debugging) counters are stored per instance, and all threads happen to use the same instance of CPUTFrustum, so the write locations are shared – there’s our culprit. Now, in a multithreaded scenario, these counters aren’t going to produce the right values anyway, because a normal C++ increment isn’t an atomic operation. And as mentioned before, the counters are only there for debugging (or at least nothing else in the code looks at them), so we might as well just remove the two increments altogether.
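For the record, if you actually needed correct counts from a multithreaded pass, atomic increments would be the minimal fix. This is a hedged sketch in modern C++, not something the sample does – and note it fixes correctness only, not the sharing cost, since the cache line holding the counters still bounces between cores on every increment:

```cpp
#include <atomic>

// Shared counters made safe for concurrent increment.
std::atomic<unsigned> gVisibleModels{0};
std::atomic<unsigned> gCulledModels{0};

void countResult(bool visible) {
    // Relaxed ordering suffices: only the final totals matter, there is
    // no other data whose visibility depends on these counters.
    if (visible)
        gVisibleModels.fetch_add(1, std::memory_order_relaxed);
    else
        gCulledModels.fetch_add(1, std::memory_order_relaxed);
}

unsigned countDemo() {
    gVisibleModels = 0;
    gCulledModels = 0;
    for (int i = 0; i < 10; i++)
        countResult(i < 7); // 7 visible, 3 culled
    return gVisibleModels.load() * 100 + gCulledModels.load();
}
```

If the counters were hot, you’d want per-thread copies instead, which sidesteps the line bouncing entirely.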
So how much does it help to get rid of two meager increments?
Again, the two runs have somewhat different lengths (because I manually start/stop them after loading is over), so we can’t compare the cycle counts directly, but we can compare the ratios. CPUTFrustum::IsVisible used to take about 60% as much time as our #1 function, and was in the #2 spot. Now it’s at position 5 in the top ten and takes about 32% as much time as our main workhorse function. In other words, removing these two increments just about doubled our performance – and that’s in a function that does a fair amount of other work. It can be even more drastic in shorter functions.
Just like we saw with write combining, this kind of mistake is easy to make, hard to track down and can cause serious performance and scalability issues. Everyone I know that has seriously used threads has fallen into this trap at least once – take it as a rite of passage.
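When such counters do need to stay, the usual remedy is to give each thread its own copy, padded out to cache-line size so two copies can never land in the same line. A sketch of the idea (my code, not the sample’s; assumes 64-byte lines and a fixed maximum thread count):

```cpp
// One counter per worker thread, each padded to a full 64-byte cache
// line so that increments on different threads never share a line.
struct alignas(64) PaddedCounter {
    unsigned value = 0;
    // alignas(64) pads sizeof(PaddedCounter) out to 64 bytes
};
static_assert(sizeof(PaddedCounter) == 64, "padding failed");

const int kMaxThreads = 8;
PaddedCounter gCulledCount[kMaxThreads]; // each thread writes only its slot

// After the parallel pass is done, one thread sums up the slots.
unsigned totalCulled(int numThreads) {
    unsigned sum = 0;
    for (int i = 0; i < numThreads; i++)
        sum += gCulledCount[i].value;
    return sum;
}

unsigned paddingDemo() {
    gCulledCount[0].value = 3; // as if thread 0 culled 3 boxes
    gCulledCount[1].value = 4; // and thread 1 culled 4
    return totalCulled(2);
}
```

Plain (non-atomic) increments are fine here because each slot has exactly one writer; the summation happens after all workers have joined.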
Anyway, the function is now running smoothly, not hitting any major stalls and in fact completely bound by backend execution time – that is, the expensive part of that function is now the actual computational work. As the TODO comment mentions, there’s better ways to solve this problem. I’m not gonna go into it here, because as it turns out, I already wrote a post about efficient ways to solve this problem using SIMD instructions a bit more than two years ago – using Cell SPE intrinsics, not SSE intrinsics, but the idea remains the same.
I won’t bother walking through the code here – it’s all on GitHub if you want to check it out. Suffice it to say that, with the sharing bottleneck gone, IsVisible can be made much faster indeed. In the final profile I took (using the SSE version), it shows up at spot number 19 in the top twenty.
Two steps forward, one step back
All is not well however, because the method AABBoxRasterizerSSEMT::IsInsideViewFrustum, which you can (barely) see in some of the earlier profiles, suddenly got a lot slower in relation:
Again, I’m not going to dig into it deeply here, but it turns out that this is the function that calls IsVisible. No, it’s not what you might be thinking – IsVisible didn’t get inlined or anything like that. In fact, its code looks exactly like it did before. And more to the point, the problem actually isn’t in AABBoxRasterizerSSEMT::IsInsideViewFrustum itself; it’s inside the function TransformedAABBoxSSE::IsInsideViewFrustum, which it calls, and which does get inlined into AABBoxRasterizerSSEMT::IsInsideViewFrustum:
void TransformedAABBoxSSE::IsInsideViewFrustum(CPUTCamera *pCamera)
{
float3 mBBCenterWS;
float3 mBBHalfWS;
mpCPUTModel->GetBoundsWorldSpace(&mBBCenterWS, &mBBHalfWS);
mInsideViewFrustum = pCamera->mFrustum.IsVisible(mBBCenterWS,
mBBHalfWS);
}
No smoking guns here either – a getter call to retrieve the bounding box center and half-extents, followed by the call to IsVisible. And no, none of the involved code changed substantially, and there’s nothing weird going on in GetBoundsWorldSpace. It’s not a virtual call, and it gets properly inlined. All it does is copy the 6 floats from mpCPUTModel to the stack.
What we do have in this method, however, is lots of L3 cache misses (or last-level cache / LLC misses, as Intel likes to call them) during this copying. Now, the code doesn’t have any more cache misses than it did before I added the SSE code to IsVisible. But it generates them a lot faster than it used to. Before, some of the long-latency memory fetches overlapped with the slower execution of the visibility test for an earlier box. Now, we’re going through the instructions fast enough for the code to starve waiting for the bounding boxes to arrive.
That’s how it is with out-of-order cores: they’re really quite good at making the best of a bad situation. Which also means that, often, fixing a performance problem just moves the bottleneck somewhere else without any substantial speed-up. It can take several attempts to tackle the various bottlenecks one by one until, finally, you get to cut the Gordian knot. And to get this code faster, we’ll have to improve our cache usage. Which is a topic for another post. Until next time!
This post is part of a series – go here for the index.
Rants are not usually the style of this blog, but this one I just don’t want to keep in. So if you’re curious, the actual information content of this post will be as follows: C string handling functions kinda suck. So does C++’s std::string. Dealing with wide character strings using only standard C/C++ functionality is absolutely horrible. And VC++’s implementation of said functionality is a damn minefield. That’s it. You will not actually learn anything more from reading this post. Continue at your own risk.
UPDATE: As “cmf” points out in the comments, there are actually C++ functions that seem to do what I wanted in the first place with a minimum amount of fuss. Goes to show, the one time I post a rant on this blog and of course I’m wrong! :) That said, I do stand by my criticism of the numerous API flaws I point out in this post, and as I point out in my reply, the discoverability of this solution is astonishingly low; when I ran into this problem originally, not being familiar with std::wstring I googled a bit and checked out Stack Overflow and several other coding sites and mailing list archives, and what appears to be the right solution showed up on none of the pages I ran into. So at the very least I’m not alone. Ah well.
The backstory
I spent most of last weekend playing around with my fork of Intel’s Software Occlusion Culling sample. I was trying to optimize some of the hot spots, a process that involves a lot of small (or sometimes not-so-small) modifications to the code followed by a profiling run to see if they help. Now unfortunately this program, at least in its original version, had loading times of around 24 seconds on my machine, and having to wait for those 24 seconds every time before you can even start the profiling run (which takes another 10-20 seconds if you want useful results) gets old fast, so I decided to check whether there was a simple way to shorten the load times.
Since I already had the profiler set up, I decided to take a peek into the loading phase. The first big time-waste I found was a texture loader that was unnecessarily decompressing a larger DXT skybox texture, and then recompressing it again. I won’t go into details here; suffice it to say that once I had identified the problem, it was straightforward enough to fix, and it cut down loading time to about 12 seconds.
My next profiling run showed me this:
I’ve put a red dot next to functions that are called either directly or indirectly by the configuration-file class CPUTConfigFile. Makes you wonder, doesn’t it? Lest you think I’m exaggerating, here’s some of the call stacks for our #2 function, malloc:
Here’s the callers to #5, free:
And here’s memcpy, further down:
I have to say, I’ve done optimization work on lots of projects over the years, and it’s rare that you’ll see a single piece of functionality leave a path of destruction this massive in its wake. The usual patterns you’ll see are either “localized performance hog” (a few functions completely dominating the profile, like I saw in the first round with the texture loading) or the “death by a thousand paper cuts”, where the profile is dominated by lots of “middle-man” functions that let someone else do the actual work but add a little overhead each time. As you can see, that’s not what’s going on here. What we have here is the rare “death in all directions” variant. Why settle for paper cuts, just go straight for the damn cluster bomb!
At this point it was clear that the whole config file thing needed some serious work. But first, I was curious. Config file loading and config block handling, sure. But what was that RemoveWhitespace function doing there? So I took a look.
How not to remove whitespace
Let’s cut straight to the chase: Here’s the code.
void RemoveWhitespace(cString &szString)
{
    // Remove leading whitespace
    size_t nFirstIndex = szString.find_first_not_of(_L(' '));
    if(nFirstIndex != cString::npos)
    {
        szString = szString.substr(nFirstIndex);
    }

    // Remove trailing newlines
    size_t nLastIndex = szString.find_last_not_of(_L('\n'));
    while(nLastIndex != szString.length()-1)
    {
        szString.erase(nLastIndex+1,1);
        nLastIndex = szString.find_last_not_of(_L('\n'));
    };

    // Tabs
    nLastIndex = szString.find_last_not_of(_L('\t'));
    while(nLastIndex != szString.length()-1)
    {
        szString.erase(nLastIndex+1,1);
        nLastIndex = szString.find_last_not_of(_L('\t'));
    };

    // Spaces
    nLastIndex = szString.find_last_not_of(_L(' '));
    while(nLastIndex != szString.length()-1)
    {
        szString.erase(nLastIndex+1,1);
        nLastIndex = szString.find_last_not_of(_L(' '));
    };
}
As my current and former co-workers will confirm, I’m generally a fairly calm, relaxed person. However, in moments of extreme frustration, I will (on occasion) perform a “*headdesk*”, and do so properly.
This code did not drive me quite that far, but it was a close call.
Among the many things this function does wrong are:
- While it’s supposed to strip all leading and trailing white space (not obvious from the function itself, but clear in context), it will only trim leading spaces. So for example leading tabs won’t get stripped, nor will any spaces that follow after those tabs.
- The function will remove trailing spaces, tabs, and newlines – provided they occur in exactly that order: first all spaces, then all tabs, then all newlines. But the string “test\t \n” will get trimmed to “test\t” with the tab still intact, because the tab-stripping loop will only strip tabs that occur at the end of the string after the newlines have been removed.
- It removes white space characters it finds front to back rather than back to front. Because of the way C/C++ strings work, every erase has to move the rest of the string down, so this is an O(N²) operation. For example, take a string consisting only of tabs.
- The substring operation creates an extra temporary string; while not horrible by the standards of what else happens in this function, it’s now becoming clear why RemoveWhitespace manages to feature prominently in the call stacks for malloc, free and memcpy at the same time.
- And let’s not even talk about how many times the string is scanned from front to back.
That by itself would be bad enough. But it turns out that in context, not only is this function badly implemented, most of the work it does is completely unnecessary. Here’s one of its main callers, ReadLine:
CPUTResult ReadLine(cString &szString, FILE *pFile)
{
    // TODO: 128 chars is a narrow line. Why the limit?
    // Is this not really reading a line, but instead just reading the next 128 chars to parse?
    TCHAR szCurrLine[128] = {0};
    TCHAR *ret = fgetws(szCurrLine, 128, pFile);
    if(ret != szCurrLine)
    {
        if(!feof(pFile))
        {
            return CPUT_ERROR_FILE_ERROR;
        }
    }

    szString = szCurrLine;
    RemoveWhitespace(szString);

    // TODO: why are we checking feof twice in this loop?
    // And, why are we using an error code to signify done?
    // eof check should be performed outside ReadLine()
    if(feof(pFile))
    {
        return CPUT_ERROR_FILE_ERROR;
    }

    return CPUT_SUCCESS;
}
I’ll let the awesome comments speak for themselves – and for the record, no, this thing really is supposed to read a line, and the ad-hoc parser that comes after this will get out of sync if it’s ever fed a line with more than 128 characters in it.
But the main thing of note here is that szString is assigned from a C-style (wide) string. So the sequence of operations here is that we’ll first allocate a cString (which is a typedef for a std::wstring, by the way), copy the line we read into it, then call RemoveWhitespace which might create another temporary string in the substr call, to follow it up with several full-string scans and possibly memory moves.
Except all of this is completely unnecessary. Even if we need the output to be a cString, we can just start out with a subset of the C string to begin with, rather than taking the whole thing. All RemoveWhitespace really needs to do is tell us where the non-whitespace part of the string begins and ends. You can either do this using C-style string handling or, if you want it to “feel more C++”, you can express it by iterator manipulation:
static bool iswhite(int ch)
{
    return ch == _L(' ') || ch == _L('\t') || ch == _L('\n');
}

template <class Iter>
static void RemoveWhitespace(Iter& start, Iter& end)
{
    while (start < end && iswhite(*start))
        ++start;

    while (end > start && iswhite(*(end - 1)))
        --end;
}
Note that this is not only much shorter, it also correctly deals with all types of white space both at the beginning and the end of the line. Instead of the original string assignment we then do:
// TCHAR* obeys the iterator interface, so...
TCHAR* start = szCurrLine;
TCHAR* end = szCurrLine + tcslen(szCurrLine);
RemoveWhitespace(start, end);
szString.assign(start, end);
Note how I use the iterator range form of assign to set up the string with a single copy. No more substring operations, no more temporaries or O(N2) loops, and after reading we scan over the entire string no more than two times, one of those being in tcslen. (tcslen is a MS extension that is the equivalent of strlen for TCHAR – which might be either plain char or wchar_t, depending on whether UNICODE is defined – this code happens to be using “Unicode”, that is, UTF-16).
There’s only two other calls to RemoveWhitespace, and both of these are along the same vein as the call we just saw, so they’re just as easy to fix up.
Problem solved?
Not quite. Even with the RemoveWhitespace insanity under control, we’re still reading several megabytes worth of text files with short lines, and there’s still between 1 and 3 temporary string allocations per line in the code, plus whatever allocations are needed to actually store the data in its final location in the CPUTConfigBlock.
Long story short, this code still badly needed to be rewritten to do less string handling, so I did. My new code just reads the file into a memory buffer in one go (the app in question takes 1.5GB of memory in its original form, we can afford to allocate 650K for a text file in one block) and then implements a more reasonable scanner that processes the data in place and doesn’t do any string operations until we need to store values in their final location. Now, because the new scanner assumes that ASCII characters end up as ASCII, this will actually not work correctly with some character encodings such as Shift-JIS, where ASCII-looking characters can appear in the middle of encodings for multibyte characters (the config file format mirrors INI files, so ‘[‘, ‘]’ and ‘=’ are special characters, and the square brackets can appear as second characters in a Shift-JIS sequence). It does however still work with US-ASCII text, the ISO Latin family and UTF-8, which I decided was acceptable for a config file reader. I did still want to support Unicode characters as identifiers though, which meant I was faced with a problem: once I’ve identified all the tokens and their extents in the file, surely it shouldn’t be hard to turn the corresponding byte sequences into the std::wstring objects the rest of the code wants using standard C++ facilities? Really, all I need is a function with this signature:
void AssignStr(cString& str, const char* begin, const char* end);
Converting strings, how hard can it be?
Turns out: quite hard. I could try using assign on my cString again. That “works”, if the input happens to be ASCII only. But it just turns each byte value into the corresponding Unicode code point, which is blatantly wrong if our input text file actually has any non-ASCII characters in it.
Okay, so we could turn our character sequence into a std::string, and then convert that into a std::wstring, never mind the temporaries for now, we can figure that out later… wait, WHAT? There’s actually no official way to turn a string containing multi-byte characters into a wstring? How moronic is that?
Okay, whatever. Screw C++. Just stick with C. Now there actually is a standard function to convert multi-byte encodings to wchar_t strings, and it’s called, in the usual “omit needless vowels” C style, mbstowcs. Only that function can’t be used on an input string that’s delimited by two pointers! Because while it accepts a size for the output buffer, it assumes the input is a 0-terminated C string. Which may be a reasonable protocol for most C string-handling functions, but is definitely problematic for something that’s typically used for input parsing, where you generally aren’t guaranteed to have NUL characters in the right places.
But let’s assume for a second that we’re willing to modify the input data (const be damned) and temporarily overwrite whatever is at end with a NUL character so we can use mbstowcs – and let me just remark at this point that awesomely, the Microsoft-extended safe version of mbstowcs, mbstowcs_s, accepts two arguments for the size of the output buffer, but still doesn’t have a way to control how many input characters to read – if you decide to extend a standard API anyway, why can’t you fix it at the same time? Anyway, if we just patch around in the source string to make mbstowcs happy, does that help us?
Well, it depends on how loose you’re willing to play with the C++ standard. The goal of the whole operation was to reduce the number of temporary allocations. Well, mbstowcs wants a wchar_t output buffer, and writes it like it’s a C string, including terminating NUL. std::wstring also has memory allocated, and normal implementations will store a terminating 0 wchar_t, but as far as I can tell, this is not actually guaranteed. In any case, there’s a problem, because we need to reserve the right number of wchar’s in the output string, but it’s not guaranteed to be safe to do this:
void AssignStr(cString& str, const char* begin, const char* end)
{
    // patch a terminating NUL into *end
    char* endPatch = (char*) end;
    char oldEnd = *end;
    *endPatch = 0;

    // mbstowcs with NULL arg counts how many wchar_t's would be
    // generated
    size_t numOut = mbstowcs(NULL, begin, 0);

    // make sure str has the right size
    str.resize(numOut, ' ');

    // convert characters including terminating NUL and hope it's
    // going to be OK?
    mbstowcs(&str[0], begin, numOut + 1);

    // restore the original end
    *endPatch = oldEnd;
}
This might work, or it might not. As far as I know, it would be legal for a std::wstring implementation to only append a trailing NUL character lazily whenever c_str() is first called on a particular string. Either way, it’s fairly gross. I suppose I could resize to numOut + 1 elements, and then later do another resize after the mbstowcs is done; that way should definitely be safe.
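For what it’s worth, the resize-then-shrink variant is easy enough to write down. Here’s a sketch (cString is a typedef for std::wstring, so I’ll use that directly; minimal error handling, and the NUL-patching hack from above is still required, so the source buffer must be writable):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Sketch: resize to numOut + 1 so mbstowcs may legally write its
// terminating NUL into the string's storage, then shrink again.
// Still patches a temporary NUL into the (writable!) source buffer.
void AssignStr(std::wstring& str, const char* begin, const char* end)
{
    char* endPatch = const_cast<char*>(end);
    char oldEnd = *endPatch;
    *endPatch = 0;

    // mbstowcs with NULL dest counts how many wchar_t's would be generated
    size_t numOut = std::mbstowcs(NULL, begin, 0);
    if (numOut == (size_t)-1) { // invalid multi-byte sequence
        *endPatch = oldEnd;
        str.clear();
        return;
    }

    str.resize(numOut + 1);                    // room for the terminating NUL...
    std::mbstowcs(&str[0], begin, numOut + 1);
    str.resize(numOut);                        // ...which we then chop off again

    *endPatch = oldEnd;
}
```

This stays within what the standard guarantees, at the cost of briefly growing the string by one element.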
Either way is completely beside the point though. This is an actual, nontrivial operation on strings that is a totally reasonable thing to do, and that the C IO system will in fact do for me implicitly if I use fgetws. Why are all the functions dealing with this so horribly broken for this use case that’s not at all fancy? Did anyone ever look at this and decide that it was reasonable to expect people to write code like this? WHAT THE HELL?
It gets better
That’s not it quite yet, though. Because when I actually wrote the code (as opposed to summarizing it for this blog post), I didn’t think to patch in the NUL byte on the source string. So I went for the alternative API that works character by character: the C function mbtowc. Now, awesomely, because it works character by character, and is not guaranteed to see all characters in a multi-byte sequence in the same call, it has to keep state around of which partial multi-byte sequences it has seen to be able to decode characters. So it’s not thread-safe, and POSIX defines an extended version mbrtowc that makes you pass in a pointer to that state which does make it thread-safe. At this point though, I don’t care about thread-safety (this code is single-threaded anyway), and besides, in our case I actually know that the characters between begin and end are supposed to parse correctly. So I just don’t worry about it. Also, instead of actually counting the right number of wchar_t‘s ahead of time in a second pass, I just assume that the string is generally likely to have less wide characters than the source multi-byte string has bytes. Even if that turns out wrong (which won’t happen for conventional encodings), the std::wstring we write to can dynamically resize, so there’s not much that can go wrong. So I ended up with this implementation:
void AssignStr(cString& dest, const char* begin, const char* end)
{
    dest.clear();
    if (end <= begin)
        return;

    size_t len = end - begin;
    size_t initial = len + 1; // assume most characters are 1 byte
    dest.reserve(initial);

    const char* p = begin;
    while (p < end)
    {
        wchar_t wc;
        int charLen = mbtowc(&wc, p, end - p);
        if (charLen < 1) // NUL byte or error
            break;
        p += charLen;
        dest.push_back(wc);
    }
}
Looks fairly reasonable, right?
Well, one profiling session later, I noticed that performance had improved, but it turned out that I was apparently wrong to assume that, like its std::vector counterpart, std::wstring::push_back would basically compile into the moral equivalent of dest.data[dest.len++] = wc. Instead, what I saw in VTune (with a kind of morbid fascination) was about two dozen instructions worth of inlined insanity surrounding a call to std::wstring::insert. For every character. In a release build.
It’s probably the VC++ STL doing something stupid. At this point, I don’t feel like investigating why this is happening. Whatever, I’m just gonna add some more to this layer cake of insanity. Just stop thinking and start coding. So I figure that hey, if adding stuff to strings is apparently an expensive operation, well, let’s amortize it, eh? So I go for this:
void AssignStr(cString& dest, const char* begin, const char* end)
{
    dest.clear();
    if (end <= begin)
        return;

    static const int NBUF = 64;
    wchar_t buf[NBUF];
    int nb = 0;

    size_t len = end - begin;
    size_t initial = len + 1; // assume most characters are 1 byte
    dest.reserve(initial);

    const char* p = begin;
    while (p < end)
    {
        int charLen = mbtowc(&buf[nb++], p, end - p);
        if (charLen < 1) // NUL byte or error
            break;
        p += charLen;
        if (p >= end || nb >= NBUF)
        {
            dest.append(buf, buf + nb);
            nb = 0;
        }
    }
}
And it’s still slow, and I still get a metric ton of bullshit inlined for that call. Turns out this happens because I call the general “input iterator” variant of append which, go figure, adds character by character. Silly me! What I really should’ve called is dest.append(buf, nb). Of course! Once I figure that one out, I profile again, and sure enough, this time there’s no magic std::string functions cluttering up the profile anymore. Finally. Mission accomplished, right?
Not so fast, bucko.
Ohhh no. No, there’s one final “surprise” waiting for me. I put surprise in quotes because we already saw it in my first profile screenshot.
Yeah right. Those C functions we’ve been calling? In the VC++ C runtime library, all of them end up calling a constructor for a C++ object for some reason.
No, I’m not gonna comment on that one. I stopped caring a few paragraphs ago. Go ahead, put C++ code in your C runtime library. Whatever makes you happy.
So it turns out that VC++ has two versions of all the multibyte conversion functions: one that uses the current locale (which you can query using _get_current_locale()) and one that takes an explicit locale_t parameter. And if you don’t pass in a locale yourself, mbtowc and so forth will call _get_current_locale() themselves, and that ends up calling a C++ constructor for some reason. (I don’t care, I’m in my happy place right now. La la la).
And I finally decide to screw portability – hey, it’s a VC++-only project anyway – and call _get_current_locale() once, pass it to all my calls, and the magic constructor disappears, and with it the last sign of dubious things happening in the string handling.
Hooray.
Conclusions
So, what do we have here: we have a C++ string class that evidently makes it easy to write horrendously broken code without noticing it, and simultaneously doesn’t provide some core functionality that apps which use both std::wstring and interface with non-UTF16 character sets (which is almost nobody, I’m sure!) will need. We have C functions that go out of their way to make it hard to use them correctly. We have the Microsoft camp that decides that the right way to fix these functions is to fix buffer overflows, and we have the POSIX camp that decides that the right way to fix them is to fix the race condition inherent in their global state. Both of these claim that their modifications are more important than the other’s, and then there’s the faction that holds the original C standard library to be the only true way, ignoring the fact that this API is clearly horribly broken no matter how you slice it. Meanwhile, std::wstring gets another attention fix by making it unnecessarily hard to actually get data from C APIs into it without extra copying (and may I remind you that I’m only using C APIs here because there doesn’t seem to be an official C++ API!), while the VC++ standard library proves its attention deficit by somehow making a push_back to a properly pre-allocated string an expensive operation. And for the final act of our little performance, watch as a constructor gets called from C code, a veritable Deus Ex Machina that I honestly didn’t see coming.
As my friend Casey Muratori would put it: Everyone is fired.
And now excuse me while I apply some bandages and clean the blood off my desk.
This post is part of a series – go here for the index.
Most memory you deal with on a daily basis is cached; on CPUs, it’s usually write-back cached. While dealing with processor caches can be counter-intuitive, caching works well most of the time, and it’s mostly transparent to the programmer (and certainly the user). However, if we are to use the cache to service memory reads, we need to make sure to invalidate our cache entries if someone else writes to the corresponding memory locations. This is implemented using one of several mechanisms referred to as “coherency protocols”, which CPU cores use to synchronize their caches with each other.
That is not the subject of this post, though. Because while such mechanisms are in place for CPUs talking to each other, there is nothing equivalent for the CPU talking to other non-CPU devices, such as GPUs, storage or network devices. Generally, communication with such devices still happens via system memory (or by memory-mapping registers or device memory so they appear to be system memory, which doesn’t make much difference from the CPU core’s point of view), but the CPU is not going to be notified of changes in a timely fashion, so normal caching is out.
Originally, device memory used to be accessed completely without caching. That’s safe (or at least as safe as it’s going to get) but also slow, because each memory access gets turned into an individual bus transaction, which has considerable overhead. Now anything related to graphics tends to move a lot of data around. Before widespread hardware acceleration, it was mostly the CPU writing pixels to the frame buffer, but now there’s other graphics-related writes too. So finally we get write combining, where the CPU treats reads as uncached but will buffer writes for a while in the hope of being able to combine multiple adjacent writes into a larger bus transaction. This is much faster. Common implementations have much weaker memory ordering guarantees than most memory accesses, but that’s fine too; this kind of thing tends to be used mainly for bulk transfers, where you really don’t care in which order the bytes trickle into memory. All you really want is some mechanism to make sure that all the writes are done before you pull the trigger and launch a command buffer, display a frame, trigger a texture upload, whatever.
All this is fairly straightforward and reasonable. However, the devil’s in the details, and in practice write combining is finicky. It’s really easy to make a small mistake that results in a big performance penalty. I’ve seen this twice in the last two days, on two different projects, so I’ve decided to write up some guidelines.
Where is write combining used?
I’m only going to talk about graphics here. For all I know, write-combining might be used for lots of other things, but I would assume that even if that is true, graphics is the only mainstream application where WC memory is exposed to user-mode applications.
So the main way to get a pointer to write-combining memory is by asking a 3D or GPGPU API to map a buffer or texture into memory: that is, using GL glMapBuffer, D3D9 Lock, CL clEnqueueMap* or D3D1x Map. Not all such buffers are write-combined, but those used for rapid uploads usually are – doubly so if you’re requesting a “write-only” mapping, which all mentioned APIs support.
What happens if you read from write-combined memory?
Sadly, the answer is not “reading from write-combined memory isn’t allowed”. This would be much simpler and less error-prone, but at least on x86, the processor doesn’t even have the notion of memory that can be written but not read.
Instead, what actually happens is that the read is treated as uncached. This means all pending write-combining buffers get flushed, and then the read is performed without any cache. Flushing write-combining buffers costs time and results in stores of partial cache lines, which is also inefficient. And of course uncached reads are really slow too.
Don’t read from write-combining memory, unless you have a very good reason to (you probably don’t). In particular, never read values back from constant buffers, vertex buffers or index buffers you’re currently writing to. Ever.
How bad can it possibly be? Let me show you an example. Here’s an excerpt of a VTune profile for an application I recently looked at:
As you can see, a lot of time is being spent in CPUTModelDX11::SetRenderStates. Worse, as VTune helpfully highlights for us, this function runs at an absolutely appalling 9.721 clock cycles per instruction (CPI Rate)! Now it turns out that a large fraction is due to these innocent-looking lines that write to a constant buffer:
pCb = (CPUTModelConstantBuffer*)mapInfo.pData;
pCb->World = world;
pCb->ViewProjection = view * projection;
pCb->WorldViewProjection = world * pCb->ViewProjection;
Note how pCb->ViewProjection is used as an argument for a matrix multiply in the last line. Now, here’s the simple fix:
XMMATRIX viewProj = view * projection;
pCb = (CPUTModelConstantBuffer*)mapInfo.pData;
pCb->World = world;
pCb->ViewProjection = viewProj;
pCb->WorldViewProjection = world * viewProj;
And here’s the corresponding VTune profile:
Now, this profile was somewhat longer so the actual cycle counts are different, but the point stands: This simple change made the function drop from the #5 to the #12 spot, and based on the CPI rate, it now runs more than twice as fast per invocation – mind you, 4.4 cycles/instruction is still pretty bad, but it’s certainly an improvement over the 9.7 we saw earlier.
Other things to be careful about
Okay, so not reading is an important point. What else? Well, it depends on the processor. Early x86s had fairly restrictive rules about write combining: writes had to be of certain sizes, they needed to be properly aligned, and accesses needed to be purely sequential. The first two can be dealt with, but the last is tricky when dealing with C/C++ compilers that try to reorder and schedule writes for optimum efficiency. For several years, it used to be that you basically had to mark all pointers to vertex buffers etc. as volatile to make sure the compiler didn’t try to reorder writes and inadvertently break write-combining in the process. While not as bad as reads, this still results in a very noticeable drop in performance.
Luckily, x86 processors from about 2002 on are far more tolerant about writes arriving out of order and will generally be able to combine writes even if they’re not perfectly sequential. However, other processors (such as those found in some game consoles) aren’t as tolerant; better safe than sorry. And even if you don’t strictly need to enforce sequential accesses, it’s still a good idea to write the code that way, because of the next rule:
Avoid holes. If you’re writing to a memory range, write the whole range. If you’re writing a dynamic vertex buffer, write every field, even if your shader ignores some of them. If you map a buffer, write the whole thing – even if you (think you) know some of the contents don’t need to change. Any hole will break the sequence and turn what would otherwise be one large write into at least two smaller ones. On some processors, it has other adverse effects too. That’s why you want to write struct fields sequentially, at least in your source code – that way, it’s easier to check against the struct definition to make sure you left nothing out.
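To make the “no holes” rule concrete, here’s one common pattern, sketched with a made-up vertex layout: build each vertex in a local (cached) struct, writing every field in declaration order, then store the whole struct to the mapped buffer in one go, so the stores to the (pretend write-combined) destination stay contiguous and sequential.

```cpp
#include <cassert>

// Hypothetical dynamic-vertex-buffer layout, purely for illustration.
struct Vertex {
    float x, y, z;
    float u, v;
    unsigned color;
};

// dst stands in for a pointer returned by a write-only buffer Map/Lock.
static void WriteVerts(Vertex* dst, int count)
{
    for (int i = 0; i < count; i++) {
        Vertex v;                 // assemble in regular cached memory
        v.x = (float)i; v.y = 0.0f; v.z = 0.0f;
        v.u = 0.0f; v.v = 1.0f;
        v.color = 0xffffffffu;    // write even the fields the shader ignores
        dst[i] = v;               // one contiguous, sequential store per vertex
    }
}
```

Never read back from dst; if you need a previously written value, keep it in a local variable as shown.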
Conclusion
Write combining is a powerful technique to accelerate writes to graphics memory, but it’s very easy to misuse in a way that causes severe performance degradation. Worse, because things only get slow but don’t crash, such problems can creep in and not be noticed for a long time. Short of profiling your code periodically, there’s little you can do to find them. Here’s the summary:
- If it’s a dynamic constant buffer, dynamic vertex buffer or dynamic texture and mapped “write-only”, it’s probably write-combined.
- Never read from write-combined memory.
- Try to keep writes sequential. This is good style even when it’s not strictly necessary. On processors with picky write-combining logic, you might also need to use
volatileor some other way to cause the compiler not to reorder instructions. - Don’t leave holes. Always write large, contiguous ranges.
- Check the rules for your target architecture. There might be additional alignment and access width limitations.
If you live by these rules, write-combining can be a powerful ally in writing high-performance graphics code. But never a friend – it will stab you in the back on the first opportunity. So be careful.
Sometimes it’s useful to convert numbers to a different representation in a way that’s order-preserving. This is useful because some applications have a clear preference for one type of number over another – for example, Radix sort and the closely related Radix trees are most naturally expressed in terms of unsigned integers, and it’s generally easier to transform the keys so they sort correctly as unsigned ints than it is to adapt the algorithm to deal with signed numbers (or floats).
Another case of interest is SIMD instructions on x86. For example, there’s signed integer greater-than comparison instructions PCMPGTB, PCMPGTW and PCMPGTD starting with the Pentium MMX (or SSE2 for 128-bit variants), but no unsigned equivalents. There’s also signed word integer min/max PMINSW and PMAXSW starting from SSE (SSE2 for 128-bit), but for some reason the corresponding unsigned equivalents PMINUW and PMAXUW weren’t added until SSE4.1.
So, without further ado, here’s my transforms of choice (assuming two’s complement arithmetic):
Signed int32 <-> unsigned int32
- x ^= 0x80000000;
- or: x += 0x80000000;
- or: x -= 0x80000000;
All three do the same thing and are involutions (i.e. they are their own inverse).
Mnemonic: To convert signed integers to unsigned integers, add 0x80000000 so the smallest signed integer (-0x80000000) turns into the smallest unsigned integer (0). The same works in reverse of course.
The MSB doesn’t carry into anything, so addition and exclusive or do the same thing. Note that 0x80000000 isn’t a representable integer in 32-bit two’s complement arithmetic, but -0x80000000 is, so technically the correct expressions are x -= -0x80000000 instead of x += 0x80000000 (and likewise x += -0x80000000 instead of x -= 0x80000000), but you get the idea. This generalizes to integers of other sizes in the obvious way.
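As a quick sanity check, here’s the xor variant spelled out as code (the function names are mine):

```cpp
#include <cassert>
#include <cstdint>

// xor variant of the transform; it's an involution, so both
// directions are the same operation modulo the types involved.
static uint32_t s2u(int32_t x)  { return (uint32_t)x ^ 0x80000000u; }
static int32_t  u2s(uint32_t x) { return (int32_t)(x ^ 0x80000000u); }
```

With this, s2u(INT32_MIN) == 0 and s2u(INT32_MAX) == 0xffffffff, and a < b as signed ints exactly when s2u(a) < s2u(b) as unsigned ints.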
IEEE float <-> signed int32
Float to int32:
int32 temp = float2bits(f32_val);
int32 i32val = temp ^ ((temp >> 31) & 0x7fffffff);
int32 to float:
int32 temp = i32val ^ ((i32val >> 31) & 0x7fffffff);
float f32val = bits2float(temp);
Aside from the “bit casting” (turning an IEEE 754 float number into its integer bits and back), this transform is an involution too. It uses right shifts on signed two’s complement numbers, which is technically implementation-defined in C, but in practice all compilers I’ve ever used just turn them into the arithmetic right shifts you’d expect.
Positive floats have the MSB (integer sign bit) clear, and larger floating-point values have larger integer representations. This includes the IEEE infinity value, which is larger than all finite floats. Negative floats have the MSB (integer sign bit) set, but since they are represented as sign + magnitude, 0x80000000 actually represents the largest “negative” float (it’s -0), whereas smaller (more negative) floats have larger representations when compared as integers. The expression above leaves positive floats alone but flips all bits except the sign bit for negative floats, so they order correctly.
Caveat 1: Under this transform, -0 < 0, whereas "true" IEEE comparisons treat them as the same.
Caveat 2: There’s NaNs (not a number) both with and without the sign bit set. Under this transform, the “smallest” and “largest” values can both represent NaN bit patterns, and NaNs are ordered relative to each other. This is well-defined and reasonable, but doesn’t match the behavior of regular floating-point compares.
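Here’s a small C sketch of the forward transform (the `float2bits_s` helper is a stand-in for the post’s bit cast), including a check of Caveat 1:

```c
#include <stdint.h>
#include <string.h>

// Stand-in for the post's float2bits: reinterpret a float's bits as int32.
static int32_t float2bits_s(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof(i));
    return i;
}

// Float -> int32 key: leaves non-negative floats alone, flips the
// magnitude bits of negative floats so signed integer compares agree
// with float compares. The right shift of a negative value relies on
// arithmetic shift behavior, as discussed in the text.
static int32_t f32_to_i32key(float f)
{
    int32_t temp = float2bits_s(f);
    return temp ^ ((temp >> 31) & 0x7fffffff);
}
```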
IEEE float <-> unsigned int32
Float to uint32:
uint32 temp = float2bits(f32_val);
uint32 u32val = temp ^ (((int32)temp >> 31) | 0x80000000);
uint32 to float:
int32 temp1 = u32val ^ 0x80000000;
int32 temp2 = temp1 ^ (temp1 >> 31);
float f32val = bits2float(temp2);
This is really just a combination of the previous two: instead of making sure we don’t change the sign bit, we make sure to always flip it in the forward transform. Because we flip the sign, the reverse transform needs to flip it back before it does the arithmetic shift, so this one, unlike the other two, is not an involution. The same caveats apply as for the int32 version.
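And a sketch of the unsigned variant, again with a hypothetical `float2bits` helper; the resulting keys compare correctly as plain unsigned integers:

```c
#include <stdint.h>
#include <string.h>

// Stand-in for the post's float2bits: reinterpret a float's bits as uint32.
static uint32_t float2bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}

// Float -> uint32 key: flip all bits of negative floats, and just the
// sign bit of non-negative ones, so unsigned compares match float order.
static uint32_t f32_to_u32key(float f)
{
    uint32_t temp = float2bits(f);
    return temp ^ (((uint32_t)((int32_t)temp >> 31)) | 0x80000000u);
}
```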
The two obvious identities:
min(a,b) = -max(-a, -b)
max(a,b) = -min(-a, -b)
can be used to rewrite algorithms using mixed min/max expressions in terms of just min (or just max). This can sometimes be useful when working with data that is intended to be processed with SIMD instructions, because it can be used to make the dataflow more regular. Let me give you a simple example to show what I mean: computing the axis-aligned bounding box (or AABB for short) of the union of several 2D AABBs.
AABB of the union of N 2D AABBs
A common representation for a 2D AABB just stores the extrema in both X and Y:
union AlignedBox2 {
struct {
float min_x, min_y;
float max_x, max_y;
};
Vec4 simd;
};
The AABB for the union of N such AABBs can then be computed by computing the min/max over all bounds in the array, as follows:
AlignedBox2 union_bounds(const AlignedBox2 *boxes, int N) // N >= 1
{
AlignedBox2 r = boxes[0];
for (int i=1; i < N; i++) {
r.min_x = min(r.min_x, boxes[i].min_x);
r.min_y = min(r.min_y, boxes[i].min_y);
r.max_x = max(r.max_x, boxes[i].max_x);
r.max_y = max(r.max_y, boxes[i].max_y);
}
return r;
}
A typical 4-wide SIMD implementation can apply the operations to multiple fields at the same time, but ends up wasting half the SIMD lanes on fields it doesn’t care about, and does some extra work at the end to merge the results back together:
AlignedBox2 union_bounds_simd(const AlignedBox2 *boxes, int N)
{
Vec4 mins = boxes[0].simd;
Vec4 maxs = boxes[0].simd;
for (int i=1; i < N; i++) {
mins = min(mins, boxes[i].simd);
maxs = max(maxs, boxes[i].simd);
}
AlignedBox2 r;
r.min_x = mins[0]; // or equivalent shuffle...
r.min_y = mins[1];
r.max_x = maxs[2];
r.max_y = maxs[3];
return r;
}
But the identities above suggest that it might help to use a different (and admittedly somewhat weird) representation for 2D boxes instead, where we store the negative of max_x and max_y:
union AlignedBox2b {
struct {
float min_x, min_y;
float neg_max_x, neg_max_y;
};
Vec4 simd;
};
If we write the computation of the union bounding box of two AABBs A and B in this form, we get (the interesting part only):
r.min_x = min(a.min_x, b.min_x);
r.min_y = min(a.min_y, b.min_y);
r.neg_max_x = min(a.neg_max_x, b.neg_max_x);
r.neg_max_y = min(a.neg_max_y, b.neg_max_y);
where the last two lines are just the result of applying the identity above to the original computation of max_x / max_y (with all the sign flips thrown in). Which means the SIMD version in turn becomes much easier (and doesn’t waste any work anymore):
AlignedBox2b union_bounds_simd(const AlignedBox2b *boxes, int N)
{
AlignedBox2b r = boxes[0];
for (int i=1; i < N; i++)
r.simd = min(r.simd, boxes[i].simd);
return r;
}
And the same approach works for intersection too – in fact, all you need to do to get a box intersection function is to turn the min into a max.
Now, this is just a toy example, but it shows the point nicely – sometimes a little sign flip can go a long way. In particular, this trick can come in handy when dealing with 3D AABBs and the like, because groups of 3 don’t fit nicely in typical SIMD vector sizes, and you don’t always have another float-sized value to sandwich in between; even if you don’t store the negative of the max, it’s usually much easier to sign-flip individual lanes than it is to rearrange them.
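For illustration, here’s a scalar C sketch of the negated-max representation (no actual SIMD, just the same lane layout as AlignedBox2b; the type name is mine), where the whole union reduces to one component-wise min:

```c
// Mirrors AlignedBox2b's lane layout: min_x, min_y, -max_x, -max_y.
typedef struct {
    float v[4];
} NegBox2;

// Union of n boxes: one min per lane, no per-field special cases.
static NegBox2 union_bounds_neg(const NegBox2 *boxes, int n) // n >= 1
{
    NegBox2 r = boxes[0];
    for (int i = 1; i < n; i++)
        for (int j = 0; j < 4; j++)
            r.v[j] = boxes[i].v[j] < r.v[j] ? boxes[i].v[j] : r.v[j];
    return r;
}
```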
There’s multiple ways to measure distances between unit quaternions (a popular rotation representation in 3D). What’s interesting is that the popular choices are essentially all equivalent.
Polar form
A standard way to build quaternions is using the polar (axis-angle) form

q = cos(θ/2) + sin(θ/2) (n_x i + n_y j + n_z k)

where n is the (unit length) axis of rotation, θ is the angle, and i, j and k are the imaginary basis vectors.
For a rotation in this form, we know how “far” it goes: it’s just the angle θ. Since the real component of q is just cos(θ/2), we can read off the angle as

θ(q) = 2 arccos(q·1)

where the dot denotes the quaternion dot product.
This measures, in a sense, how far away the quaternion is from the identity element 1. To get a distance between two unit quaternions q and r, we rotate both of them such that one of them becomes the identity element. To do this for our pair q, r, we simply multiply both by r’s inverse from the left, and since r is a unit quaternion its inverse and conjugate are the same:

d(q, r) = θ(r*q) = 2 arccos((r*q)·1)
Note that cosine is a monotonic function over the interval we care about, so in any numerical work, there’s basically never the need to actually calculate that arc cosine: instead of checking, say, whether the angle is less than some maximum error threshold T, we can simply check that the dot product is larger than cos(T/2). If you’re actually taking the arc cosine for anything other than display purposes, you’re likely doing something wrong.
Dot product
Another way is to use the dot product directly as a distance measure between two quaternions. How does this relate to the angle from the polar form? It’s the same, as we quickly find out when we use the fact that the dot product is invariant under rotations:

(r*q)·1 = (r*q)·(r*r) = q·r

and hence also

θ(r*q) = 2 arccos(q·r)
So again, whether we minimize the angle between q and r (as measured in the polar form) or maximize the dot product between q and r boils down to the same thing. But there’s one final choice left.
L2 distance
The third convenient metric is just using the norm of the difference between the two quaternions: d(q, r) = ||q − r||. The question is, can we relate this somehow to the other two? We can, and as is often the case, it’s easier to work with the square of the norm:

||q − r||² = (q − r)·(q − r) = q·q − 2 q·r + r·r = 2 − 2 q·r.
In other words, the distance between two unit quaternions again just boils down to the dot product between them – albeit with a scale and bias this time.
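A quick numeric sanity check of that identity in C (quaternion layout is my own choice, real part first):

```c
// Squared L2 distance between two quaternions stored as float[4].
static float sq_dist(const float q[4], const float r[4])
{
    float s = 0.0f;
    for (int i = 0; i < 4; i++) {
        float d = q[i] - r[i];
        s += d * d;
    }
    return s;
}

// Quaternion dot product.
static float qdot(const float q[4], const float r[4])
{
    return q[0]*r[0] + q[1]*r[1] + q[2]*r[2] + q[3]*r[3];
}
```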
Conclusion
The popular choices of distance metrics between quaternions all boil down to the same thing. The relationships between them are simple enough that it’s easy to convert, say, an exact error bound on the norm between two quaternions into an exact error bound on the angle of the corresponding rotation. Each of these three representations is the most convenient to use in some context; feel free to convert back and forth between them for different solvers; they’re all compatible in the sense that their minima will always agree.
UPDATE: As Sam points out in the comments, you need to be careful about the distinction between quaternions and rotations here (I cleared up the language in the article slightly). Each rotation in 3-dimensional real Euclidean space has two representations as a quaternion: the quaternion group double-covers the rotation group. If you want to measure the distances between rotations not quaternions, you need to use slightly modified metrics (see his comment for details).
Another quick one. Now this is another old trick, but it’s easy to derive and still not as well-known as it deserves to be, so here goes.
All modern graphics APIs ultimately expect vertex coordinates to end up in one common coordinate system where clipping is done – clip space. That’s the space that vertices passed to the rasterizer are expected in – and hence, the space that Vertex Shaders (or Geometry Shaders, or Domain/Tessellation Evaluation Shaders) transform to. These shaders can do what they want, but the usual setup matches the original fixed-function pipeline and splits vertex transformations into at least two steps: The projection transform and the model-view transform, both of which can be represented as homogeneous 4×4 matrices.
The projection transform is the part that transforms vertices from camera view space to clip space. A view-space input vertex position v is transformed with the projection matrix P and gives us the position of the vertex in clip space:

(x_c, y_c, z_c, w_c) = Pv = (p1T·v, p2T·v, p3T·v, p4T·v)
Here, I’ve split P up into its four row vectors p1T, …, p4T. Now, in clip space, the view frustum has a really simple form, but there’s two slightly different formulations in use. GL uses the symmetric form:

-w_c ≤ x_c ≤ w_c
-w_c ≤ y_c ≤ w_c
-w_c ≤ z_c ≤ w_c

whereas D3D replaces the last row with 0 ≤ z_c ≤ w_c. Either way, we get 6 distinct inequalities, each of which corresponds to exactly one clip plane:
-w_c ≤ x_c is the left clip plane,
x_c ≤ w_c is the right clip plane, and so forth. Now from the equation above we know that

x_c = p1T·v

and

w_c = p4T·v

and hence

-p4T·v ≤ p1T·v ⟺ 0 ≤ (p4T + p1T)·v

Or in words, v lies in the non-negative half-space defined by the plane p4T+p1T – we have a view-space plane equation for the left frustum plane! For the right plane, we similarly get

x_c ≤ w_c ⟺ 0 ≤ (p4T − p1T)·v

and in general, for the GL-style frustum we find that the six frustum planes in view space are exactly the six planes p4T±piT for i=1, 2, 3 – all you have to do to get the plane equations is to add (or subtract) the right rows of the projection matrix! For a D3D-style frustum, the near plane is different, but it takes the even simpler form

0 ≤ z_c ⟺ 0 ≤ p3T·v

so it’s simply defined by the third row of the projection matrix.
Deriving frustum planes from your projection matrix in this way has the advantage that it’s nice and general – it works with any projection matrix, and is guaranteed to agree with the clipping / culling done by the GPU, as long as the planes are in fact derived from the projection matrix used for rendering.
And if you need the frustum planes in some other space, say in model space: not to worry! We didn’t use any special properties of P – the derivation works for any 4×4 matrix. The planes obtained this way are in whatever space the input matrix transforms to clip space – in the case of P, view space, but it can be anything. To give an example, if you have a model-view matrix M, then PM is the combined matrix that takes us from model-space to clip-space, and extracting the planes from PM instead of P will result in model-space clip planes.
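A sketch of the extraction for a GL-style frustum, assuming a row-major float[4][4] matrix (names and storage convention are my own; adjust the indexing for column-major storage):

```c
// Extract the six view-frustum planes from a GL-style projection matrix
// stored row-major as m[row][col]. Each plane is (a, b, c, d) with the
// inside given by a*x + b*y + c*z + d*w >= 0. Planes come out in pairs:
// row4 +/- row1 (left/right), row4 +/- row2 (bottom/top),
// row4 +/- row3 (near/far).
static void extract_frustum_planes(const float m[4][4], float planes[6][4])
{
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 4; j++) {
            planes[2*i + 0][j] = m[3][j] + m[i][j];
            planes[2*i + 1][j] = m[3][j] - m[i][j];
        }
    }
}
```

Note the planes come out unnormalized; if you need true distances rather than just sign tests, normalize by the length of each plane’s (a, b, c) part.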