Two discoveries from recent months. The first one was on x86-64 and occurs in this code fragment:
```
U32 offset = ...;
U8 *ptr = base + (offset & 0xffff)*4;
```
I expected to get one “and”, with the rest folded into an address computation (after all, x86 has a base+index*4 addressing mode). What I actually got was this:
```
; eax = offset, rcx = base
movzx eax, ax   ; equivalent to "and eax, 0ffffh" (but smaller)
shl   eax, 2
add   rax, rcx
```
The operand size mismatch between the instructions is the clue – the compiler computes the “*4” in 32 bits, whereas the “*4” in an address computation would be computed on 64-bit temporaries. This is a result of the C/C++ typing rules: all of (offset & 0xffff) * 4 is computed as “unsigned int”, since C does most computations at “int” size, which is 32 bits even on most 64-bit platforms. The compiler can’t use the built-in “*4” addressing mode since (U32) x*4 isn’t equivalent to (U64) x*4 in the presence of overflow (of course, in the example given, we can’t ever get any overflows, but that’s something the optimizer apparently wasn’t looking for).
Easy solution: use U8 *ptr = base + U64(offset & 0xffff)*4 (alternatively, multiply by “4ull” or “(size_t) 4”) and everything comes out as expected. But it illustrates an interesting dilemma: in the simpler expression U8 *ptr = base + offset*4, would you have expected the “offset*4” part to be computed in 32 bits? Probably not. “int” was left at 32 bits since a lot of code relies on it, and most ints we use really don’t need more than 32 bits. But the type conversion rules in C/C++ implicitly assume that an “int” is the native register type, leading to this kind of problem when that’s not the case. Right now, virtually nobody uses single arrays larger than 4GB, but it’s only a matter of time until that changes…
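To make the fix concrete, here’s a minimal compilable sketch. The U8/U32/U64 typedefs and the function name index_scaled are mine, chosen to match the conventions used above:

```c
#include <stdint.h>

typedef uint8_t  U8;
typedef uint32_t U32;
typedef uint64_t U64;

// Widen to 64 bits *before* scaling so the compiler is free to fold
// the *4 into a base+index*4 addressing mode on x86-64.
U8 *index_scaled(U8 *base, U32 offset)
{
    return base + (U64)(offset & 0xffff) * 4;
}
```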
Anyway, on to tidbit number two, which concerns GCC on PS3 (PPU code). The PPU is a 64-bit PowerPC processor, but the PS3 has only 256MB of RAM (plus 256MB of graphics memory), so there’s no reason to use 64-bit pointers – it’s just a waste of memory. And again, overflow makes everything complicated. GCC assumes that (almost) all address computations could overflow at any time. It thus tends to perform address calculations manually and clear the top 32 bits afterwards, instead of using the built-in addressing modes of the CPU. This produces code that’s significantly larger (and slower) than necessary. To give an example, take this piece of code:
```
U32 *p = ...;
*++p = x;
*++p = y;
```
A good translation of this code into PPC assembly looks like this (with explanations in case you don’t know PPC assembly):

```
; r8 = p, r9 = x, r10 = y
stw  r9, 4(r8)    ; *(p + 1) = x;
stwu r10, 8(r8)   ; *(p + 2) = y; p += 2;
```
but what GCC actually produces for this code sequence looks more like this:
```
addi   r11, r8, 4    ; t = p + 1;
addi   r8, r8, 8     ; p += 2;
clrldi r11, r11, 32  ; clear top 32 bits of t
clrldi r8, r8, 32    ; same for p
stw    r9, 0(r11)    ; *t = x;
stw    r10, 0(r8)    ; *p = y;
```
Note that it’s not just 3x the number of instructions; it also needs extra registers (thus increasing register pressure). It’s the same underlying problem as with the x86 example – a lot of extra work to enforce artificial wraparound of 32-bit quantities. A pretty dubious goal, since it’s very unlikely that code actually wants that 32-bit wraparound (and if it does, it should probably be fixed), but I digress… Anyway, it turns out that there’s one case where GCC assumes that pointers won’t wrap around: structure accesses. So in this case:
```
struct Blah {
  U32 x, y;
};

Blah *p = ...;
p->x = x;
p->y = y;
```
You’ll actually get reasonable code without an addi/clrldi combo for every access. So if you’re writing out a stream of words (something that’s fairly common when building a rendering command buffer), you can save a lot of instructions by declaring a simple struct:
```
struct CommandData {
  U32 w0, w1, w2, w3, w4, w5, w6, w7;
};
```
and then replacing this kind of code:
```
U32 *cmd = ...;
cmd[0] = foo;
cmd[1] = bar;
cmd[2] = blah;
cmd[3] = blub;
cmd += 4;
```
with the (admittedly somewhat obscure-looking)
```
CommandData *cmd_out = (CommandData *) cmd;
cmd_out->w0 = foo;
cmd_out->w1 = bar;
cmd_out->w2 = blah;
cmd_out->w3 = blub;
cmd = &cmd_out->w4;
```
This is only useful to know if you’re programming the PS3 with GCC, but if you do, this kind of stuff can make a big difference. I’ve recently been working a lot on PS3 rendering code and got speedups in excess of 10% from this and other low-level optimizations in the same vein. Actually reading the assembly generated by your compiler is important, especially when targeting in-order CPUs. It’s nearly impossible to see this kind of problem when you’re thinking at the level of C/C++ source code.
What the title says :). Just thought it was worth mentioning since I talked a bit about singly-linked lists in the “Data structures and invariants” post, and cycles are the primary way a singly-linked list can “go wrong” on you without containing invalid pointers. Notably, you can get one if you try to insert an item x into a list that already contains x: if you insert the “new” x before or at the position of the x that’s already in the list, you get a cycle; if you insert it later, you inadvertently drop all items between the old and new positions.
When debugging this, it’s useful to have some cycle-detection code. You can add some fields to the list for debugging (not quite as trivial as it sounds, since clearing a “visited” flag involves traversing the list that might have a cycle – you need to make sure all fields are in a known “visited” state before you start looking for cycles if you do it this way), but a nicer solution that doesn’t require extra memory is to use a cycle detection algorithm. There are two main choices – Floyd’s “tortoise and hare” algorithm and Brent’s algorithm – and both are worth knowing about. They’re also explained well on Wikipedia, so read up if you’ve never encountered them before.
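As an illustration, here’s a minimal sketch of Floyd’s algorithm on a bare singly-linked list (the Node type and function name are stand-ins of mine, not from any particular list implementation):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Node { struct Node *next; } Node;

// Floyd's "tortoise and hare": detects a cycle in O(n) time and O(1)
// extra space, with no per-node "visited" flags to clear beforehand.
bool has_cycle(Node *head)
{
    Node *slow = head, *fast = head;
    while (fast && fast->next) {
        slow = slow->next;        // advance one step
        fast = fast->next->next;  // advance two steps
        if (slow == fast)         // the hare lapped the tortoise
            return true;
    }
    return false;                 // fast ran off the end: no cycle
}
```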
Following Google’s announcement of WebP, there’s been a lot of (justified) criticism of it based on its merits as an image coder. As a compression geek, I’m interested, but as both a creator and consumer of digital media formats, my main criticism is something else entirely: The last thing we need is yet another image format. This was already true when JPEG 2000 and JPEG XR were released. Never mind that their purported big compression advantages were greatly exaggerated (and based on dubious metrics); even if the standards had delivered clear-cut improvements on that front, they would’ve been irrelevant. Improved compression rate is not a feature. Not by itself, anyway. As someone who deals with content, I care about compression rate only as long as size is a limiting factor. As someone taking photos, I care whether all the pictures I shoot in one day fit onto my camera. Back when digital cameras were crappy, had low resolution and a tiny amount of storage, the JPEG compression made a big difference. Nowadays, we have 8 Megapixel cameras and multiple Gigabytes of storage in our cellphones. For normal usage, I can shoot RAW photos all day long and still not exhaust the space available to me. It’s done, problem solved. I might care about the size reduction once I start archiving stuff, but again, the 10:1 I get with plain JPEG at decent quality is plenty – a guaranteed 30% improvement on top of that (which JPEG2k/JPEG-XR/WebP don’t deliver!) is firmly into “don’t care” territory. It doesn’t solve any real problem.
The story is the same for audio. MP3 offered a reduction of about 10:1 for CD-quality audio at the right time, and made digitally archiving music practical. And we still gladly take the 10:1 on everything, since fitting 10x more on our MP3 players is convenient. But again, that’s purely a function of storage limitations that are rapidly disappearing. MP3 players started getting popular because they were smaller than mobile CD players and could (gosh!) store multiple albums’ worth of music (of course, back then “multiple albums” was a number in the single digits, and only if you compressed them heavily). Nowadays, most people can easily fit their complete music collection onto an iPod (or, again, their cellphone). Give it another 5 years and you’ll have enough space to fit it in as uncompressed WAV files if you choose to (not going to happen since most people now get music directly in MP3/AAC format, but it would be possible). Again, problem solved. Better audio compression remains an interesting problem with lots of fascinating ties to perception and psychology, but there’s just no real practical need for it these days. Audio just isn’t that much data.
Video is about the only mainstream type of content where the compression ratio still matters: 1080i60 video (as used in sports broadcasts, for example) is about 90 megabytes per second in the subsampled YUV 4:2:0 color space it typically comes in, and about 350MB/s in the de-interlaced RGB color space we use for display. That’s unwieldy enough to need some serious compression (and partial or complete hardware support for decompression). And even the compressed representations are large enough to be unwieldy (we stick them onto Blu-ray discs and worry about video streaming costs). So there’s still some future for innovation there, but even that window is rapidly closing. Blu-ray is probably the last disc format that was motivated partly by the need to store the amount of content needed for high-fidelity video. HDTV resolutions are gonna stay with us for a good while (HD is close enough to the internal resolutions used in movie production to make any further improvements subject to rapidly diminishing returns), and Blu-rays are already large enough to store HD content with good quality using current codec technology. BDXL is on the way; again, that’s just large enough for what we want to do with it. The next generation of video codecs after H.264 is probably still going to matter, since video is still a shitload of data right now. But we’ve been getting immensely better at large-scale image and signal processing (mainly with sheer brute force), and Moore’s law works in our favor. Five years ago, digital video was something done mainly by professionals and enthusiasts willing to invest in expensive hardware. Nowadays, you can get cheap HD camcorders and do video postproduction on a normal laptop, if you’re willing to stomach the somewhat awkward workflow and long processing times.
Ten years from now, a single 720p video will be something you can deal with as a matter of course on any device, just as you do with a single MP3 nowadays (…remember when encoding them took 2x as long as their runtime? Yeah, that was just 10 years ago).
And in the meantime, the last thing we need is yet more mutually incompatible formats with different feature sets and representations to make the lives of everyone dealing with this crap a living hell. If you don’t have any actual, useful features to offer (a standard lossy format with alpha channel would be nice, as would HDR support in a mainstream format) just shut up already.
Enough bit-twiddling for now; time to talk about something a bit more substantial: invariants. If you don’t know the term, an invariant is just a set of conditions that holds before and after every “step” of your program/algorithm. This is usually taught in formal CS education, but textbooks love to focus on invariants in algorithm design. That’s fine for the types of algorithms typically found (and analyzed) en masse in CS textbooks: sorting, searching, and basic graph algorithms. But it has little relevance to the type of problems you usually encounter in applications, where you don’t get nice abstract problems with well-defined input and output specifications, but a tangled mess of several (often partially conflicting) requirements that you have to make sense of. So most programmers forget about invariants pretty quickly and never think about them again.
That’s a mistake. Invariants are a very useful tool in practice, if you have a clear specification of what’s supposed to happen, and how. That may not be the case for the whole of the program, but it’s usually the case for the part of your code that manipulates data structures. Let me give you some examples:
Long time no update. Got a few things to write about, but I thought I’d start with a quick post on a simple trick that I’ve never seen documented in full: An interesting identity for addition of integer values is that a + b = (a^b) + ((a&b)<<1) = (a^b) + 2*(a&b). You can find this on a lot of pages that collect bit-twiddling tricks. This is basically the reduction formula for a 2-input Carry-Save Adder (Google it if you don’t know what that is). The more interesting 3-input carry save adder has the reduction formula
```
a + b + c = S + (C << 1)

where S = a ^ b ^ c
      C = (a & b) | (a & c) | (b & c)
```
In hardware, they are very useful as building blocks for integer multipliers: note that there’s no + in the computation of S and C, and that the three input numbers are “compressed” into two outputs. This makes them very useful to compute all the intermediate sums in a binary multiplier without having to deal with carries – using CSAs, you can reduce any multiple-term addition into a lot of very simple constant-delay logic elements with only one fast adder (the “completion adder”) placed at the very end. In software, there is nothing to be gained by this – basically all current architectures have all basic integer ALU operations running at the same (fast) speed. But the reduction formula is still interesting, for another reason: say you want to compute (a+b)/2 without overflow. Then you can apply the above identity to get (a+b)/2 = (a^b)/2 + (a&b), which is guaranteed to be overflow-free. This trick is also well-known; it does tend to come in handy every once in a while for SIMD code, since a lot of SIMD instruction sets have a limited set of operand sizes and no output carry bit functionality. Unpacking a 4×32-bit SIMD register into two 2×64-bit regs just to do a sum is a pain and a big waste of time.
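As a quick compilable sketch (the function name avg_floor is mine), the overflow-free rounding-down average looks like this:

```c
#include <stdint.h>

// (a + b) / 2 without risk of overflow in the intermediate sum:
// (a ^ b) >> 1 is the half-sum of the non-carrying bits, and (a & b)
// contributes the carries, already at their post-halving positions.
uint32_t avg_floor(uint32_t a, uint32_t b)
{
    return ((a ^ b) >> 1) + (a & b);
}
```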
But often you want not (a+b)/2 but (a+b+1)/2, i.e. with a rounding bias added into the mix. This is one of the reasons for this blog post, since I haven’t run across a description of this yet. You can use the three-input CSA reduction for this case (with the constant 1 as the third input); it simplifies down to:

```
(a+b+1)/2 = (a^b)/2 + ((a&b) | ((a|b)&1))
```

Not as neat as the variant without rounding bias, but still better than temporary widening if you don’t have a SIMD add-with-carry instruction.
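In C, a sketch of the rounded-up variant (function name mine again):

```c
#include <stdint.h>

// (a + b + 1) / 2 without overflow, derived from the 3-input CSA
// reduction with the constant 1 as the third input.
uint32_t avg_ceil(uint32_t a, uint32_t b)
{
    return ((a ^ b) >> 1) + ((a & b) | ((a | b) & 1));
}
```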
I’ve got one more: Let’s go back to the first variant again. A nice property of this is that it works well with bit-packed values, like A8R8G8B8 pixels in a 32-bit word. All we need to do is to make sure that our division (shift) doesn’t erroneously spill into adjacent channels. But that’s easily remedied with another bit masking operation:
```
avg_pixel_a8r8g8b8(a, b) = (((a ^ b) >> 1) & 0x7f7f7f7f) + (a & b)
```
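As a compilable sketch of that formula:

```c
#include <stdint.h>

// Per-channel rounding-down average of two packed A8R8G8B8 pixels.
// Masking with 0x7f7f7f7f after the shift stops the bottom bit of
// each channel from leaking into the channel below it. Per channel,
// the result is at most 255, so the final add never carries across
// channel boundaries either.
uint32_t avg_pixel_a8r8g8b8(uint32_t a, uint32_t b)
{
    return (((a ^ b) >> 1) & 0x7f7f7f7fu) + (a & b);
}
```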
In fact, why limit ourselves to 32-bit values? It works just as well for two 32-bit pixels inside a 64-bit register; just use 0x7f7f7f7f7f7f7f7f as the mask. Nor does it say anywhere that the fields all have to have the same size. The trick works just as well for uneven partitions like the 11:11:10-bit format often used to store vertex normals or other unit vectors, or for R5G6B5 pixels. All that changes is the bit mask.
Finally, the bitpacked stuff and the rounding bias fix are orthogonal – you can do both at the same time.
Sure, this isn’t terribly useful in practice, but it’s come in handy for me a couple times over the past few years, and I think it’s just fundamentally too cool not to have it properly documented.
UPDATE: As Charles pointed out in the comments, a better way to compute (a+b+1)/2 is to use (a | b) - ((a ^ b) >> 1) – just as cheap as the version without rounding bias, and you can do the same masking trick to get rid of carries where you don’t want them.
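Sketching that variant with the masking trick applied (the function name is mine):

```c
#include <stdint.h>

// Per-channel rounding-up average of two packed A8R8G8B8 pixels, via
// Charles' (a | b) - ((a ^ b) >> 1) form. Within each channel,
// (a | b) is always >= the masked half of (a ^ b), so the subtraction
// never borrows across channel boundaries.
uint32_t avg_pixel_a8r8g8b8_ceil(uint32_t a, uint32_t b)
{
    return (a | b) - (((a ^ b) >> 1) & 0x7f7f7f7fu);
}
```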
After writing the previous article, I experimented with the problem a bit more and found out two things. First, there’s a bug in the distance lower bound computation I used for the “tiles” variant that caused it to perform significantly worse than it should. I actually copied it over from the Werkkzeug3 implementation, which has the same bug – shouldn’t have assumed it’s correct just because it’s worked for a couple of years. (The bug causes the lower bound to be much larger than necessary in some cases; i.e. it never causes any visual changes, it just makes the algorithm run slower). The fixed version of distanceBound is a lot simpler than the original implementation. The second thing I found out is that there’s another big speedup to be had without using point location structures, higher-order Voronoi diagrams or an “all nearest neighbor” algorithm. All it takes is a bit of thinking about the problem.
Cellular textures are a really useful primitive to have in a texture generator. Jim Scott, aka blackpawn, wrote a great introduction to the subject several years ago – it’s available here. If you don’t know what cellular textures are, start there; I’m gonna get right into the juicy bits. Jim’s article does a great job of putting you onto the right track, but I have two complaints about it. The first one is minor: there’s another method to combine the two distances that’s quite useful to have, namely (dist2 - dist1) / (dist2 + dist1), or equivalently 1 - (2 dist1) / (dist1 + dist2), where dist1 and dist2 are the distances to the closest and second-closest points, respectively. This is like “dist2 - dist1”, but the distances are “normalized”, i.e. there are no huge brightness variations between the cells. Very handy (the result is shown below). The second complaint is more serious: I strongly disagree with his recommendation of using trees to accelerate the distance calculations, because there’s a much simpler method that improves performance far more. But let’s start at the beginning.
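A minimal sketch of that combining function (the name cell_value is mine, not from Jim’s article):

```c
// Normalized two-distance combine for cellular textures:
// (dist2 - dist1) / (dist2 + dist1). Since 0 <= dist1 <= dist2, the
// result lands in [0,1] regardless of the absolute cell size, so
// cells come out with uniform brightness.
float cell_value(float dist1, float dist2)
{
    float denom = dist1 + dist2;
    return denom > 0.0f ? (dist2 - dist1) / denom : 0.0f;
}
```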

And another quick coding post; this one came up in an email conversation with Charles Bloom. The question was how to optimally select per-pixel color indices for DXT5 alpha blocks. DXT5 stores two 8-bit reference values per block. Depending on whether the first one is larger than the second or not, one out of two possible “color maps” is selected: There’s either the two extremal points and 6 interpolated colors between them (spaced uniformly), or 0, 255, the two extremal points, and 4 interpolated colors. The problem is how to assign the 3-bit color map indices for each of the 16 pixels in a DXT block in a way that minimizes error.
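As a baseline for comparison, here’s a brute-force sketch (mine, not necessarily what either of us ended up with): given the 8 decoded values for whichever mode the block uses, just pick the nearest entry for each pixel.

```c
#include <stdlib.h>

// For each of the 16 pixels, choose the 3-bit index whose decoded
// value is closest. "values" holds the block's 8 decoded alpha values
// (reference points plus interpolated/fixed entries, depending on mode).
void pick_indices(const unsigned char pixels[16],
                  const int values[8],
                  unsigned char indices[16])
{
    for (int i = 0; i < 16; i++) {
        int best = 0;
        int best_err = abs((int)pixels[i] - values[0]);
        for (int j = 1; j < 8; j++) {
            int err = abs((int)pixels[i] - values[j]);
            if (err < best_err) { best_err = err; best = j; }
        }
        indices[i] = (unsigned char)best;
    }
}
```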
There’s lots of material on the web about computing Morton codes (also called Morton keys or Morton numbers) efficiently – bitwise interleaving of two or more numbers. This may sound esoteric, but it’s surprisingly useful in some applications. If you haven’t heard of Morton codes yet, stop by Wikipedia or look into a book like Real-Time Collision Detection to learn more about them. Anyway, the subject of this post is not to introduce you to Morton codes, but rather to fill a rather curious gap: as I discovered a few months ago, there’s lots of material on the web and in books about how to generate Morton codes from 2 or 3 numbers, but I didn’t find a single site explaining how to de-interleave the bits again to get the original numbers back from a Morton code! I figured it’s time to change that.
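To sketch the idea for the 2D case (this is the standard mask-and-shift bit trick; the function names are mine):

```c
#include <stdint.h>

// Inverse of the classic "part 1 by 1" bit spread: gather the even
// bits of a 32-bit Morton code into the low 16 bits, squeezing out
// the zero bits in between, one power-of-two stride at a time.
static uint32_t compact_1by1(uint32_t x)
{
    x &= 0x55555555;                   // keep even-position bits only
    x = (x | (x >> 1)) & 0x33333333;
    x = (x | (x >> 2)) & 0x0f0f0f0f;
    x = (x | (x >> 4)) & 0x00ff00ff;
    x = (x | (x >> 8)) & 0x0000ffff;
    return x;
}

// Recover the two 16-bit inputs from a 2D Morton code
// (x interleaved into the even bits, y into the odd bits).
void morton2_decode(uint32_t code, uint32_t *x, uint32_t *y)
{
    *x = compact_1by1(code);
    *y = compact_1by1(code >> 1);
}
```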
Here goes: yet another blog. After realizing that I missed technical writing once I’d finished my diploma thesis (and was left with lots of interesting tidbits but nothing to make a paper out of), starting one seemed like a good idea. No particular agenda – I’ll just keep writing about whatever’s on my mind at the time, and revisit some of my older stuff when there’s nothing new.