The ryg blog

August 23, 2010

Carry-save adders and averaging bit-packed values

Long time no update. Got a few things to write about, but I thought I’d start with a quick post on a simple trick that I’ve never seen documented in full: An interesting identity for addition of integer values is that a + b = (a^b) + ((a&b)<<1) = (a^b) + 2*(a&b). You can find this on a lot of pages that collect bit-twiddling tricks. This is basically the reduction formula for a 2-input Carry-Save Adder (Google it if you don’t know what that is). The more interesting 3-input carry save adder has the reduction formula

a + b + c = S + (C << 1) where
S = a ^ b ^ c
C = (a & b) | (a & c) | (b & c)

In Hardware, they are very useful as building blocks for integer multipliers: note that there’s no + in the computation of S and C, and that the three input numbers are “compressed” into two outputs. This makes them very useful to compute all the intermediate sums in a binary multiplier without heaving to deal with carries – using CSAs, you can reduce any multiple-term addition into a lot of very simple constant-delay logic elements with only one fast adder (the “completion adder”) placed at the very end. In Software, there is nothing to be gained by this – basically all current architectures have all basic integer ALU operations running at the same (fast) speed. But the reduction formula is still interesting, for another reason: Say you want to compute (a+b)/2 without overflow. Then you can apply the above identity to get (a+b)/2 = (a^b)/2 + (a&b) which is guaranteed to be overflow-free. This trick is also well-known; it does tend to come in handy every once in a while for SIMD code, since a lot of SIMD instruction sets have a limited set of operand sizes and no output carry bit functionality. Unpacking a 4×32 bit SIMD register to 2 2×64 bit regs just to do a sum is a pain and a big waste of time.

But often you want not (a+b)/2, but (a+b+1)/2, i.e. with a rounding bias added into the mix. This is one of the reasons for this blog-post, since I haven’t run across a description of this yet. You can use the three-input CSA reduction for this case. It simplifies down to: (a+b+1)/2 = (a^b)/2 + ((a&b) | ((a|b)&1)) Not as neat as the variant without rounding bias, but still better than temporary widening if you don’t have a SIMD add-with-carry instruction.

I’ve got one more: Let’s go back to the first variant again. A nice property of this is that it works well with bit-packed values, like A8R8G8B8 pixels in a 32-bit word. All we need to do is to make sure that our division (shift) doesn’t erroneously spill into adjacent channels. But that’s easily remedied with another bit masking operation:

avg_pixel_a8r8g8b8(a,b) = (((a^b) >> 1) & 0x7f7f7f7f) + (a&b)

In fact, why limit ourselves to 32-bit values? It works just as well for two 32-bit pixels inside a 64-bit register, just use 0x7f7f7f7f7f7f7f7f as mask. Nor does it say anywhere that the fields all have to have to same size. The trick works just as well for uneven partitions like the 11:11:10 bit format often used to store vertex normals or other unit vectors, or R5G6B5 pixels. All that changes is the bit mask.

Finally, the bitpacked stuff and the rounding bias fix are orthogonal – you can do both at the same time.

Sure, this isn’t terribly useful in practice, but it’s come in handy for me a couple times over the past few years, and I think it’s just fundamentally too cool not to have it properly documented.

UPDATE: As Charles pointed out in the comments, a better way to compute (a+b+1)/2 is to use (a | b) - ((a ^ b) >> 1) – just as cheap as the version without rounding bias, and you can do the same masking trick to get rid of carries where you don’t want them.

March 29, 2010

How to generate cellular textures 2

After writing the previous article, I experimented with the problem a bit more and found out two things. First, there’s a bug in the distance lower bound computation I used for the “tiles” variant that caused it to perform significantly worse than it should. I actually copied it over from the Werkkzeug3 implementation, which has the same bug – shouldn’t have assumed it’s correct just because it’s worked for a couple of years. (The bug causes the lower bound to be much larger than necessary in some cases; i.e. it never causes any visual changes, it just makes the algorithm run slower). The fixed version of distanceBound is a lot simpler than the original implementation. The second thing I found out is that there’s another big speedup to be had without using point location structures, higher-order Voronoi diagrams or an “all nearest neighbor” algorithm. All it takes is a bit of thinking about the problem.

How to generate cellular textures

Cellular textures are a really useful primitive to have in a texture generator. Jim Scott alias blackpawn wrote a great introduction to the subject several years ago – it’s available here. If you don’t know what cellular textures are, start there; I’m gonna get right into the juicy bits. Jims article does a great job of putting you onto the right track, but I have two complaints about it. The first one is minor: there’s another method to combine the two distances that’s quite useful to have, namely (dist2 – dist1) / (dist2 + dist1) or equivalently 1 – (2 dist1) / (dist1 + dist2), where dist1 and dist2 are the distances to the closest and second-closest points, respectively. This is like “dist2 – dist1”, but the distances are “normalized”, i.e. there are no huge brightness variations between the cells. Very handy (the result is shown below). The second complaint is more serious: I strongly disagree with his recommendation of using trees to accelerate the distance calculations, because there’s a much simpler method that improves performance far more. But let’s start at the beginning.

Cellular texture generated with the (dist2 - dist1) / (dist2 + dist1) formula

DXT5 alpha block index determination

And another quick coding post; this one came up in an email conversation with Charles Bloom. The question was how to optimally select per-pixel color indices for DXT5 alpha blocks. DXT5 stores two 8-bit reference values per block. Depending on whether the first one is larger than the second or not, one out of two possible “color maps” is selected: There’s either the two extremal points and 6 interpolated colors between them (spaced uniformly), or 0, 255, the two extremal points, and 4 interpolated colors. The problem is how to assign the 3-bit color map indices for each of the 16 pixels in a DXT block in a way that minimizes error.

Decoding Morton Codes

There’s lots of material on the web about computing Morton codes (also called Morton keys or Morton numbers) efficiently – bitwise interleaving of two or more numbers. This may sound esoteric, but it’s surprisingly useful in some applications. If you haven’t heard of Morton codes yet, step by Wikipedia or look into a book like Real-Time Collision Detection to learn more about them. Anyway, the subject of this post is not to introduce you to Morton codes, but rather to fill a rather curious gap: As I discovered a few months ago, there’s lots of material on the web and in books about how to generate morton codes from 2 or 3 numbers, but I didn’t find a single site explaining how to de-interleave the bits again to get the original numbers back from a morton code! I figured it’s time to change that.
Read more…

October 4, 2009

Launch

There goes, yet another blog. After realizing that I missed technical writing after finishing my diploma thesis (and with lots of interesting tidbits but nothing to make a paper out of), it seemed like a good idea. No particular agenda, I’ll just keep writing about whatever’s on my mind at the time, and revisit some of my older stuff if there’s nothing in particular.