
A small note on the Elias gamma code

This is a small one, for all compression geeks out there.

The Elias gamma code (closely related to Exponential-Golomb codes, or Exp-Golomb codes for short) is very useful for practical compression applications; it certainly crops up in a lot of modern video coding standards, for example. The mapping goes like this:

  1 -> 1
  2 -> 010
  3 -> 011
  4 -> 00100
  5 -> 00101
  6 -> 00110
  7 -> 00111
  8 -> 0001000

and so on (Exp-Golomb with k=0 is the same, except the first index encoded is 0 not 1, i.e. there’s an implicit added 1 during encoding and a subtracted 1 during decoding).

Anyway, if you want to decode this quickly, and if your bit buffer reads bits from the MSB downwards, shifting after every read so the first unconsumed bit is indeed the MSB (there’s very good reasons for doing it this way; I might blog about this at some point), then there’s a very nifty way to decode this quickly:

  // at least one of the top 12 bits set?
  // (this means fewer than 12 leading zeros)
  if (bitbuffer & (~0u << (32 - 12))) {
    // this is the FAST path
    U32 num_zeros = CountLeadingZeros(bitbuffer);
    return GetBits(num_zeros * 2 + 1);
  } else {
    // coded value is 25 bits or longer, need to
    // use slow path (read however you would read
    // gamma codes regularly)
  }

Note that this is one CountLeadingZeros and one GetBits operation. The GetBits reads both the zeros and the actual value in one go; it just so happens that the binary value of that is exactly the value we need to decode. Not earth-shaking, but nice to know. In practice, it means you can read such codes very quickly indeed (it’s just a handful of opcodes on both x86 with bsr and PowerPC with cntlzw or cntlzd if you’re using a 64-bit buffer).

This is assuming a 32-bit shift register that contains your unread bits, refilled by reading another byte whenever there are 24 or fewer unconsumed bits available (a fairly typical choice). If you have 64-bit registers, it’s usually better to keep a 64-bit buffer and refill whenever the available number of bits is 32 or less. In that case you can read gamma codes of up to 33 bits in length (i.e. 16 leading 0 bits) directly.

Texture tiling and swizzling

If you’re working with images in your program, you’re most likely using a regular 2D array layout to store image data. Which is to say, it’s basically a 1D array of width * height pixels, and the index of every pixel in the array is normally computed as y * width + x (or, more generally, y * stride + x when you don’t require the lines to be densely packed). Simple and easy, but it has a problem: if you have a cache, this is very efficient when processing pixels in one direction (left to right) and very inefficient if you move at a 90 degree angle to that direction (top to bottom): the first pixel of every row is in a different cache line (unless your image is tiny), so you get tons of cache misses.

That’s one of the reasons why GPUs tend to prefer tiled or swizzled texture formats, which depart from the purely linear memory layout to something a tad more involved. The specifics heavily depend on the internals of the memory subsystem, but while the respective magic values may be different between different pieces of hardware, the general principles are always the same.


Ring buffers and queues

The data structure is extremely simple: a bounded FIFO. One step up from plain arrays, but still, it’s very basic stuff. And if you’re doing systems programming, particularly anything involving IO or directly talking to hardware (boils down to the same thing really), it’s absolutely everywhere. It’s also very useful for communicating between different threads. I have some notes on the topic that aren’t immediately obvious, so it’s time to write them up. I’m only going to talk about the single producer, single consumer (SPSC) case since that’s what you usually have when you’re dealing with hardware.

The producer produces commands/objects/whatever in some way and appends them to the queue. The consumer pops an item from the start and does its work. If the queue is full, the producer has to wait; if it’s empty, the consumer has to wait. As programmer feeding hardware (or being fed from hardware), you’re generally trying to reach a steady state that does neither. The actual data structure always looks something like this:

struct FIFO {
  ElemType Elem[SIZE];
  uint ReadPos;
  uint WritePos;
};

In hardware, the elements are stored in some block of memory somewhere, and ReadPos/WritePos usually come in the form of memory-mapped registers. In software, you normally use a slightly different layout (put one pointer before the array and the other after it and make sure it’s all in different cache lines to avoid false sharing). You can find details on this elsewhere; I’m gonna be focusing on a different, more conceptual issue.

What Elem means is not really up to interpretation; it’s an array, just a block of memory where you drop your data/commands/whatever at the right position. ReadPos/WritePos have a bit more room for interpretation; there are two common models with slightly different tradeoffs.

Model 1: Just array indices (or pointers)

This is what you normally have when talking to hardware. In this model, the two positions are just array indices. When adding an element, you first write the new element to memory via Elem[WritePos] = x; and then compute the next write position as WritePos = (WritePos + 1) % SIZE;; reading is analogous. If ReadPos == WritePos, the queue is empty. Otherwise, the queue currently has WritePos - ReadPos elements in it if WritePos > ReadPos, and WritePos + SIZE - ReadPos elements if WritePos < ReadPos.

There’s an ambiguous case though: if we fill up the queue completely, we end up with ReadPos == WritePos, which is then interpreted as an empty queue. (Storing WritePos - 1 doesn’t solve this; now the “queue empty” case becomes tricky). There’s a simple solution though: Don’t do that. Seriously. When adding elements to the queue, block when it contains SIZE - 1 elements. What you definitely shouldn’t do is get fancy and use special encodings for an empty (or full) queue and riddle the code with ifs. I’ve seen this a couple times, and it’s bad. It makes “lock-free” implementations hard, and when dealing with hardware, you usually have no locks. If you use this method, just live with the very slight memory waste.

Model 2: Virtual stream

The intuition here is that you’re not giving the actual position in the ring buffer, but the “distance travelled” from the start. So if you’ve wrapped around the ring buffer twice, your current WritePos would be 2*SIZE, not 0.

This is just a slight change, but with important consequences: writing elements is Elem[WritePos % SIZE] = x; and updating the index is WritePos++; (and analogous for ReadPos). In other words, you delay the reduction modulo SIZE. For this to be efficient, you normally want to pick a power of 2 for SIZE; this makes the wrapping computation cheap and will automatically do the right thing if one of the positions ever overflows. This leads to very straightforward, efficient code. The number of items in the queue is WritePos - ReadPos; no case distinction, unsigned arithmetic does the right thing. No trouble with the last element either (if the queue is full, then WritePos == ReadPos + SIZE – no problem!).

With non-pow2 SIZE, you still need to do some amount of modulo reduction on increment – always modulo N*SIZE, where N is some constant >1 (if you use 1, you end up with Model 1). This is more work than for Model 1, so it seems like a waste. But it’s not quite that simple.

Virtual streams are a useful model!

One advantage of virtual streams is it’s usually easier to state (and check) invariants using this model; for example, if you’re streaming data from a file (and I mean streaming in the original sense of the word, i.e. reading some amount of data sequentially and piece by piece without skipping around), it’s very convenient to use file offsets for the two pointers. This leads to very readable, straightforward logic: the two invariants for your streaming buffer are WritePos >= ReadPos and WritePos - ReadPos <= SIZE, and one of them (WritePos – you’d pick a different name in this case) is just the current file pointer which you need to dispatch the next async read. No redundant variables, no risk of them getting out of sync. As a bonus, if you align your destination buffer address to whatever alignment requirement async reads have, it also means you can DMA data directly from the drive to your streaming buffer without any copying (the lowest bits of the file pointer and the target address need to match for this to work, and you get that almost for free out of this scheme).

This scheme is particularly useful for sound playback, where the “consumer” (the audio HW) keeps reading data whether you’re ready or not. Of course you try to produce data fast enough, but sometimes you may be too late and the audio HW forges ahead. In that case, you want to know how far ahead it got (at least if you’re trying to keep audio and video in sync). With a “virtual stream” type API, you have a counter for the total number of samples (or blocks, or whatever) played and can immediately answer this question. Annoyingly, almost all sound APIs only give you the current read position mod the ring buffer size, so you don’t have this information. This usually leads to a little song and dance routine in low-level sound code where you query a timer every time you ask for the current read position. Next time, you look at the time difference; if it’s longer than the total length of the ring buffer in ms minus some fudge factor, you use the secondary timer to estimate how many ms you skipped, otherwise you can use the read pointer to determine how big the skip was.

It’s not a big deal, but it is annoying, especially since it’s purely an API issue – the sound driver actually knows how many samples were played, even though the HW usually uses Model 1, since the driver gets an interrupt whenever the audio HW is done playing a block. This is enough to disambiguate the ring buffer position. But for some reason most audio APIs don’t give you this information, so you have to guess – argh!

This is a general pattern: If you have some type of separate feedback channel, the regular ring buffer semantics are fine. But when the FIFO is really the only means of communication, the virtual stream model is more expressive and hence preferable. Particularly with pow2 sizes, where everything just works out automagically without any extra work. Finally, a nice bonus on PowerPC-based platforms is that address generation for the array access can be done with a single rlwinm instruction if SIZE and sizeof(ElemType) are both powers of 2. This is even less work than the regular mod-after-increment variant!

Negative space in programming

There’s a lot of material out there about code; how to design it, write it, document it, build it and test it. It is, after all, the primary artifact we programmers produce (okay, that and long rants about why we’re right and almost everyone else is wrong). Considerably less attention is paid to what’s not there – the design alternatives that were rejected, the features that were omitted, the interfaces that got changed because they were too complicated/subtle to explain properly in the docs, the lack of excessive dependencies that helps keep build times down, the tests that aren’t necessary because certain types of errors are made impossible by the design of a system.

I think that’s a mistake. In my experience, the main way we actually give our programs shape is not by putting things in, but by leaving them out. An elegant program is not one that checks off all the bullet points from some arbitrary feature list; it’s one that solves the problem it’s meant to solve and does so concisely. Its quality is not that it does what it’s supposed to; it’s that it does almost nothing else. And make no mistake, picking the right problem to solve in the first place is hard, and more of an art than a science. If you’ve ever worked on a big program, you’ve probably experienced this first-hand: You want to “just quickly add that one feature”, only to discover hours (or days, or even weeks) later that it turns out to be the straw that breaks the camel’s back. At that point, there’s usually little to do except take one last glance at the wreckage, sigh, and revert your changes.

So here’s my first point: Next time you get into that situation, write down what you tried and why it didn’t work. No need to turn it into an essay; a few short paragraphs are usually plenty. If it’s something directly related to design choices in the code, put it into comments in that piece of code; if the issues stem from the architecture of your system, put it into your Docs/Wiki or at least write a mail. But make sure it’s documented somewhere; when working on any piece of code, knowing what doesn’t work is at least as important as knowing what does. The latter is usually well-documented (or at least known), but the former often isn’t – no one even knows the brick wall is there until the first person runs into it and gets a bloody nose.

Second point: If you’re thinking about rewriting a piece of code, be very aware that the problem is not one of deleting X lines of code and writing Y lines to replace it. Nor is it one of understanding the original data structures and implementation strategies and improving of them; even if the approach is fine and you’re just re-implementing the same idea with better code, it’s never that simple. What I’ve consistently found to be the biggest problem in replacing code is all the unspoken assumptions surrounding it: the API calls it doesn’t use because they don’t quite do the right thing, the unspecified behavior or side-effects that other pieces of code rely on, the problems it doesn’t need to deal with because they’re avoided by design.

When reading code, looking at what a program does (and how it does it) is instructive. But figuring out what it doesn’t do (and why) can be positively enlightening!

More PPC compiler babysitting

Another recent discovery from looking at generated code. On inorder PPC processors, you can’t move data directly between the integer and floating point units – it has to go through memory first. This usually involves storing some value to memory and reading it immediately afterwards, a guaranteed LHS (Load-Hit-Store) stall. A full integer to floating point conversion on PPC involves multiple steps:

  1. Sign-extend the integer to 64 bits (extsw)
  2. Store 64-bit value into memory (std)
  3. Load the 64-bit value into a floating-point register (lfd)
  4. Convert to double (fcfid)
  5. Round to single precision (frsp)

The sign-extend and round to single steps may be omitted depending on context, but the rest is pretty much fixed, and the dreaded LHS is triggered by step 3. There’s ways to work around this problem – if you have a small set of integers, it can make sense to use a small table for int->float conversion. You can also use SIMD instructions to do the conversion, provided you do the rest of your computation in SIMD registers too (again, no direct movement between the integer, vector and floating point units, you have to go through memory).

That’s not what this post is about, though. Let’s just accept that LHS as a fact of life for now. Does that mean we have to eat it on every int to float (or float to int) conversion? Not really. Have a look at this code:

void some_function(float a, float b, float c, float d);

void problem(int a, int b, int c, int d, float scale)
{
  some_function(a*scale, b*scale, c*scale, d*scale);
}

We need to perform four int-to-float conversions for this function call. They’re completely independent, so the compiler could just do steps 1 and 2 for all four values, then steps 3-5. Unless we’re unlucky, we expect all four temporaries to be in the same cache line on the stack, so we expect to get only one LHS stall on the first load. So much for the theory, anyway – I recently noticed that one of the PPC compilers didn’t do this, so I whipped up the small example above and checked the other compilers we use, and it turns out that all three of them happily produced code with 4 LHS stalls.

When the swelling from the subsequent Mother Of All Facepalms(tm) abated, I went on to check if there was some way to coax the compilers into generating better code. And yes, on all 3 compilers there’s a way to get the desired behavior, though the details differ a bit:

 // Names changed to protect the guilty
#ifndef COMPILER_C
typedef volatile S64 S32itof;
#else 
typedef S64 S32itof;
#endif

static inline F32 fast_itof(S32itof x)
{
#ifdef COMPILER_A
  return x;
#else
  return (F32) __fcfid(x);
#endif
}

void better(int a, int b, int c, int d, float scale)
{
  S32itof fa = a, fb = b, fc = c, fd = d;
  some_function(fast_itof(fa)*scale, fast_itof(fb)*scale,
    fast_itof(fc)*scale, fast_itof(fd)*scale);
}

My original implementation uses a macro for fast_itof since it needs to work in plain C89 code, and the temporary values of type S32itof aren’t optional in that case. With the inline function, you might be able to get rid of them, but I haven’t checked this.

Planar rotations and the DCT

I’ve been looking a bit at DCT algorithms recently. Most fast DCTs are designed under two assumptions that are worth re-examining: 1. Multiplies are a lot more expensive than additions and should be avoided and 2. The cost of a DCT is mainly a function of the number of multiplies and adds performed. (1) isn’t really true anymore, particularly in current floating point/SIMD processors where multiplies and adds often take the same time and fused multiply-adds are available, and (2) also is overly naive for current architectures – again, particularly with SIMD implementations, the cost of loading, storing and shuffling values around can easily be as high as the actual computations, if not higher. Tricks that save a few arithmetic ops will backfire if they make the dataflow more complicated. Anyway, I don’t have any conclusions on that yet (it’s still a work in progress), but I do have some notes on one important ingredient: planar rotations. And now excuse my hypocrisy as I’ll be (shamelessly) only talking about the number of arithmetic ops for the rest of this post. :)

To briefly set the stage, basically all DCT algorithms decompose into a sequence of two types of operations:

  • “Butterflies”. The basic transform is (a,b) \mapsto (a+b,a-b) (this is also a 2-point FFT, a 2-point DCT, and a 2-point Hadamard transform – they’re all identical in this case, up to normalization). The name “butterfly” comes from the way these operations are typically drawn in dataflow graphs.
  • Planar rotations. A bog-standard 2D rotation of the form (a,b) \mapsto \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}.

There’s not much to do about the butterflies, but the planar rotations are interesting, and different DCT algorithms use different ways of performing them. Let’s look at some:

Method 1: The obvious way

You can just do the matrix-vector multiply directly. Cost: 4 multiplies and 2 adds, or 2 multiplies and 2 FMAs.

Method 2: Saving a multiply

Rearranging the equation a bit allows us to extract a common factor:

\begin{matrix} t & = & (a+b) \cos(\theta) \\ a' & = & t - b (\cos(\theta) + \sin(\theta)) \\ b' & = & t + a (\sin(\theta) - \cos(\theta)) \end{matrix}

For constant \theta, the sums/differences of cos/sin terms can be precomputed and we end up with 3 multiplies and 3 adds, or 1 multiply, 1 add and 2 FMAs. If only the computation of t was a multiply-add and not an add followed by a multiply… anyway, this reduces the number of arithmetic ops in the “classic” model, but not in a FMA-based model, and it has a chain of 3 dependent operations whereas Method 1 only has 2. This technique is used in most DCT factorizations, including the LL&M DCT used by the IJG JPEG lib for the “slow” integer transform (the default).

Note that there’s no special reason to base t off the cosine term, we could do the same with sin at the same cost:

\begin{matrix} t & = & (a-b) \sin(\theta) \\ a' & = & t + a (\cos(\theta) - \sin(\theta)) \\ b' & = & t + b (\sin(\theta) + \cos(\theta)) \end{matrix}

Method 3: A scaled rotation

A different trick is used in the AA&N DCT (used by the IJG JPEG lib for the fast integer transform and the floating-point version):

\begin{matrix} a' & = & (a - \tan(\theta) b) \cos(\theta) \\ b' & = & (b + \tan(\theta) a) \cos(\theta) \end{matrix}

Note that this is just Method 1 with a common factor of \cos(\theta) pulled out: the parenthesized expressions are the rotated values scaled by \frac{1}{\cos(\theta)}. Cost: Same as Method 1 – 4 multiplies and 2 adds, or 2 multiplies and 2 FMAs. But we gained something: We can try to propagate the constant scale factor along all the rest of the computation (note we need to be careful when combining differently scaled values) and simply return scaled results. This is what the original AA&N DCT does. We can also delay the multiply by just one step: If the rotation is used as input to a butterfly step, we can absorb the scale factor into the butterfly and turn two adds/subs into two FMAs, which means we get a planar rotation for the cost of 2 independent FMAs. Nice!

(Again, there’s a dual version which multiplies by the cotangent of \theta and returns results scaled by \sin(\theta)).

Method 4: Decompose into shears

There’s more: We can also write a rotation as three shears:

\begin{matrix} \tilde{a} & = & a - \tan(\theta/2) b \\ b' & = & b + \sin(\theta) \tilde{a} \\ a' & = & \tilde{a} - \tan(\theta/2) b' \end{matrix}

Note that this can be done without using any extra registers. Cost: 3 multiplies and 3 adds or 3 FMAs, but every operation is dependent on the previous one, so we get a chain of 3 dependent ops, same as for Method 2.

One interesting side effect of doing the computation this way is that the integer version of this is perfectly reversible, even if there are truncation errors in the multiplies – you just do the steps in reverse. Integer lifting formulations of wavelet transforms use the same primitive. One interesting consequence of this rotation formula is that it can be used to instantly obtain “integer lifting” versions of any orthogonal transform: You can compute the QR decomposition of any matrix using Givens rotations, which are just ordinary planar rotations and can be performed using this technique. If the matrix under consideration is orthogonal, then by uniqueness of the QR decomposition up to sign, we know that R is just a diagonal matrix with all entries being 1 or -1, so multiplication by R is perfectly reversible in integer values as well.

This nice property aside, the long dependency chains are somewhat bothersome. But it’s an interesting formula, and one that’s been used to design exactly reversible integer-to-integer DCTs (search for “BinDCT”).

Honorary mention: Complex multiply

There’s also a related formula to perform complex multiplication with just 3 multiplies instead of the usual 4 in the obvious formula

(a+ib)(c+id) = (ac-bd) + (ad+bc)i

Compute it as

(a+ib)(c+id) = (ac-bd) + ((a+b)(c+d) - ac - bd)i

instead, using 3 multiplies and 5 adds. For a known fixed z=c+id that we’re multiplying with (as in the DCT case), one of the adds can be precomputed, but it’s still 3 muls 4 adds and has a 4-operation dependency chain, so it’s not really useful in this application. But it’s nice to know anyway.

So which one do you use?

Well, I haven’t decided yet. For floating-point implementations on platforms with FMA support, method 3 is definitely the most attractive one, since absorbing scale factors into later butterflies is simple and doesn’t cause any trouble. Interestingly, both the very early Chen DCT factorization and the LL&M DCT factorization turn into algorithms with 22 FMAs and 4 adds when you apply this, and other FMA-optimized DCT algorithms use 26 ops as well. Interesting. Anyway, fixed-point is more bothersome: Every multiplication causes round-off error so you don’t want to have too many of them, or paths with a lot of multiplies in them. Also, multiplying by values with an absolute value larger than 1 (or 0.5, depending on the instruction set) is bothersome for fixed-point SIMD implementations. I haven’t decided yet what I’ll use.

Closing note: Don’t forget that arithmetic is only half of it – in this case, almost literally: I have an (untested, so beware) prototype implementation of a floating-point, FMA-based algorithm in a JPEG-like setting on Xbox 360. Half of the time in an 8×8 IDCT is actually spent doing the IDCT, the rest is loading, unpacking, transposing, packing and storing. Still, the overall cost seems to be about 4.5 cycles per pixel (as said, not final code, so this might either increase or decrease by the time I’m done), so it’s not too bad. It also illustrates another point about image/video compression: The goddamn entropy coder ruins everything. 4.5 cycles per pixel for the IDCT is one thing, but you still have to decode the coefficients first. Chances are you need at least one variable-amount shift per coefficient during entropy decoding, and on the Cell/Xbox 360, these don’t come cheap – they’re microcoded, slow, and they block the second hardware thread while they run. And to add insult to injury, bitstream decoding is an inherently serial process while the rest is usually amenable to SIMD and parallelization. And yet none of our existing multimedia formats are designed to be efficiently decodable using multiple cores; the current H.265 proposals started taking this into account, but virtually everything else is still completely serial. This particular bottleneck is going to stay with us for a long time.

sinc and Polynomial interpolation

Another short one. This one bugged me for quite a while until I realized what the answer was a few years ago. The Sampling Theorem states that (under the right conditions)

\displaystyle x(t) = \sum_{n=-\infty}^{\infty} x(nT) \;\textrm{sinc}\left(\frac{t - nT}{T}\right)

where

\displaystyle \textrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x}

(the normalized sinc function). The problem is this: Where does the sinc function come from? (In a philosophical sense. It plops out of the proof sure enough, but that’s not what I mean). Fourier theory is full of (trigonometric) polynomials (i.e. sine/cosine waves when you’re dealing with real-valued signals), so where does the factor of x in the denominator suddenly come from?

The answer is a nice identity discovered by Euler:

\displaystyle \textrm{sinc}(x) = \prod_{n=1}^{\infty} \left(1 - \frac{x^2}{n^2}\right)

With some straightforward algebraic manipulations (ignoring convergence issues for now) you get:
\displaystyle \textrm{sinc}(x) = \prod_{n=1}^{\infty} \left(1 - \frac{x}{n}\right) \left(1 + \frac{x}{n}\right)

\displaystyle = \prod_{\substack{n \in \mathbb{Z} \\ n \ne 0}} \left(1 - \frac{x}{n}\right) = \prod_{\substack{n \in \mathbb{Z} \\ n \ne 0 }} \frac{n - x}{n}

Compare this with the formula for Lagrange basis polynomials:

\displaystyle L_{i;n}(x) = \prod_{\substack{0 \le j \le n \\ j \ne i}} \frac{x_j - x}{x_j - x_i}

in other words, the sinc function is the limiting case of Lagrange polynomials for an infinite number of equidistant control points. Which is pretty neat :)

Finish your derivations, please

Every time you ship a product with half-assed math in it, God kills a fluffy kitten by feeding it to an ill-tempered panda bear (don’t ask me why – I think it’s bizarre and oddly specific, but I have no say in the matter). There’s tons of ways to make your math way more complicated (and expensive) than it needs to be, but most of the time it’s the same few common mistakes repeated ad nauseam. Here’s a small checklist of things to look out for that will make your life easier and your code better:

  • Symmetries. If your problem has some obvious symmetry, you can usually exploit it in the solution. If it has radial symmetry around some point, move that point to the origin. If there’s some coordinate system where the constraints (or the problem statement) get a lot simpler, try solving the problem in that coordinate system. This isn’t guaranteed to win you anything, but if you haven’t checked, you should – if symmetry leads to a solution, the solutions are usually very nice, clean and efficient.
  • Geometry. If your problem is geometrical, draw a picture first, even if you know how to solve it algebraically. Approaches that use the geometry of the problem can often make use of symmetries that aren’t obvious when you write it in equations. More importantly, in geometric derivations, most of the quantities you compute actually have geometrical meaning (points, distances, ratios of lengths, etc). Very useful when debugging to get a quick sanity check. In contrast, intermediate values in the middle of algebraic manipulations rarely have any meaning within the context of the problem – you have to treat the solver essentially as a black box.
  • Angles. Avoid them. They’re rarely what you actually want, they tend to introduce a lot of trigonometric functions into the code, you suddenly need to worry about parametrization artifacts (e.g. dealing with wraparound), and code using angles is generally harder to read/understand/debug than equivalent code using vectors (and slower, too).
  • Absolute Angles. Particularly, never never ever use angles relative to some arbitrary absolute coordinate system. They will wrap around to negative at some point, and suddenly something breaks somewhere and nobody knows why. And if you’re about to introduce some arbitrary coordinate system just to determine some angles, stop and think very hard if that’s really a good idea. (If, upon reflection, you’re still undecided, this website has your answer).
  • Did I mention angles? There’s one particular case of angle-mania that really pisses me off: Using inverse trigonometric functions immediately followed by sin / cos. atan2 / sin / cos: World’s most expensive 2D vector normalize. Using acos on the result of a dot product just to get the corresponding sin/tan? Time to brush up your trigonometric identities. A particularly bad offender can be found here – the relevant section from the simplified shader is this:
    float alpha = max( acos( dot( v, n ) ), acos( dot( l, n ) ) );
    float beta  = min( acos( dot( v, n ) ), acos( dot( l, n ) ) );
    C = sin(alpha) * tan(beta);

    Ouch! If you use some trig identities and the fact that acos is monotonically decreasing over its domain, this reduces to:

    float vdotn = dot(v, n);
    float ldotn = dot(l, n);
    C = sqrt((1.0 - vdotn*vdotn) * (1.0 - ldotn*ldotn))
      / max(vdotn, ldotn);

    ..and suddenly there’s no need to use a lookup texture anymore (and by the way, this has way higher accuracy too). Come on, people! You don’t need to derive it by hand (although that’s not hard either), you don’t need to buy some formula collection, it’s all on Wikipedia – spend the two minutes and look it up!

  • Elementary linear algebra. If you build some matrix by concatenating several transforms and it’s a performance bottleneck, don’t get all SIMD on it, do the obvious thing first: do the matrix multiply symbolically and generate the result directly instead of doing the matrix multiplies every time. (But state clearly in comments which transforms you multiplied together in what order to get your result or suffer the righteous wrath of the next person to touch that code). Don’t invert matrices that you know are orthogonal, just use the transpose! Don’t use 4×4 matrices everywhere when all your transforms (except for the projection matrix) are affine. It’s not rocket science.
  • Unnecessary numerical differentiation. Numerical differentiation is numerically unstable, and notoriously difficult to get robust. It’s also often completely unnecessary. If you’re dealing with analytically defined functions, compute the derivative directly – no robustness issues, and it’s usually faster too (…but remember the chain rule if you warp your parameter on the way in).

Short version: Don’t just stick with the first implementation that works (usually barely). Once you have a solution, at least spend 5 minutes looking over it and check if you missed any obvious simplifications. If you won’t do it for yourself, think of the kittens!

Some more frustum culling notes

More link-chasing! Charles has some more notes on the general “get center along axis” / “get radius along axis” primitives that are used in these tests. This is also at the heart of separating axis tests and the GJK algorithm for collision detection, so it’s definitely worth getting comfortable with. He also notes that:

Frustum culling is actually a bit subtle, in that how much work you should do on it depends on how expensive the object you are culling is to render. If the object will take 0.0001 ms to just render, you shouldn’t spend much time culling it. In that case you should use a simpler approximate test – for example a Cone vs. Sphere test. Or maybe no test at all – combine it into a group of objects and test the whole group for culling. If an object is very expensive to render (like maybe it’s a skinned human with very complex shaders) – say it takes 1 ms to render – then you want to cull it very accurately indeed – in fact you may want to test against an OBB or a convex hull of the object instead of just an AABB.

I agree with the second part, but I’m not sure about the cone vs. sphere test. Three reasons: First, the cone around the view frustum is really a pretty coarse approximation. For a 4:3 viewport, the area of the cone cross-section at the near plane is about 64% larger than the area of the bounding rectangle. For the 16:9 viewports we now commonly have it’s about 84% larger than the rect. That’s a very conservative test, even if you have tight-fitting bounding spheres. Second, even when an object is cheap to render, it still goes through a lot of pipeline stages before it ever gets submitted to the GPU. You don’t draw things immediately; you normally build a job list first that is then sorted. You may also do some sort of occlusion culling. For all survivors, you then need to set the state for that batch (shaders, constants, textures, various rendering states), do some state filtering while you’re at it (to avoid wasting GPU cycles with redundant state changes), and then finally issue the actual draw call. Even with optimized code, that’s usually a few thousand clock cycles start to finish. For the sake of argument, let’s assume that the box-frustum test costs 50 cycles more than a sphere-cone test (which is generous), and each non-culled batch costs an average of 1000 cycles start to finish. If your coarse test has a false positive rate of only 5%, it’s no win; at 10%, it’s a 50 cycles/batch net cost. Third, even if it is a win on the CPU side, you end up generating GPU work for some false positives. Wasted GPU cycles are more expensive than wasted CPU cycles, because there are fewer of them (lower clock rate, and usually just one GPU vs. more than one CPU core nowadays).

Grouping is a good point though. If there’s a natural grouping for your dynamic objects, exploit it. For your scenery, you should build a static hierarchy. Large cache lines, high branch misprediction penalties and SIMD-optimized processors mean that any kind of binary tree is a bad choice (doubly so when you don’t have fast random access to memory, as on SPUs). Use a larger fan-out, aim for a fairly flat hierarchy (again, doubly so for SPUs). Don’t build just any tree; use a simple cost function and choose your subdivisions to minimize it (no need for exhaustive search, it’s enough to test 5-8 different options at each level and pick the best one). If in doubt, build a binary hierarchy first (it’s easier to code), then merge nodes later to get the larger fan-out.
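The “test a few options, pick the best” step might look roughly like this. The SAH-style surface-area-times-count cost model and all the names here are my illustration; the post only asks for *some* simple cost function:

```c
/* One candidate way of partitioning a node's objects into two children,
   summarized by the children's bounding surface areas and object counts.
   (Illustrative sketch - not from the original post.) */
typedef struct {
    float surface_area_left, surface_area_right;
    int   count_left, count_right;
} SplitCandidate;

/* SAH-style cost estimate: expected work is proportional to how big each
   child is (probability of being hit/visited) times how much it contains. */
static float split_cost(const SplitCandidate *s)
{
    return s->surface_area_left  * (float)s->count_left
         + s->surface_area_right * (float)s->count_right;
}

/* Evaluate the handful of candidates (5-8, per the text) and keep the
   cheapest - no exhaustive search needed. */
static int pick_best_split(const SplitCandidate *cands, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (split_cost(&cands[i]) < split_cost(&cands[best]))
            best = i;
    return best;
}
```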

Culling tests inside a hierarchy have different trade-offs than the per-batch culling tests you do at the leaf nodes, so consider using different code (or a different type of bounding volume) for it.

If your tests work with clipspace geometry, it’s fairly easy to build a 2D screen-space bounding box around your BV while you’re at it. The cheap version just returns the whole screen when the BV crosses the near plane; Blinn describes a more accurate (but slower) variant in “Calculating Screen Coverage” (Chapter 6 of “Jim Blinn’s Corner: Notation, Notation, Notation”, another book you should consider buying). Either way, this bbox is useful: you can use it to estimate the screen size of objects (useful for LOD selection and to reject objects entirely once they’re smaller than a few pixels) and as input to a coarse occlusion test.
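A minimal sketch of the cheap version, assuming you already have the 8 clip-space corners of your bounding volume (types and names are mine):

```c
#include <float.h>

typedef struct { float x, y, z, w; } Vec4;   /* clip-space position */
typedef struct { float min_x, min_y, max_x, max_y; } Rect2D;  /* in NDC */

/* Project the 8 corners and take the 2D min/max. If any corner has w <= 0,
   it's at or behind the near plane and the perspective divide is invalid,
   so the cheap version conservatively returns the whole screen. */
static Rect2D screen_bbox(const Vec4 corners[8])
{
    Rect2D full = { -1.0f, -1.0f, 1.0f, 1.0f };
    Rect2D r = { FLT_MAX, FLT_MAX, -FLT_MAX, -FLT_MAX };
    for (int i = 0; i < 8; i++) {
        if (corners[i].w <= 0.0f)
            return full;  /* crosses the near plane: punt */
        float x = corners[i].x / corners[i].w;
        float y = corners[i].y / corners[i].w;
        if (x < r.min_x) r.min_x = x;
        if (x > r.max_x) r.max_x = x;
        if (y < r.min_y) r.min_y = y;
        if (y > r.max_y) r.max_y = y;
    }
    return r;
}
```

Blinn’s variant replaces the punt with proper clipping against the near plane, which is where the extra cost (and accuracy) comes from.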

The bbox screen size test is more generally useful than a far-plane test; if you use it, consider skipping the far-plane test entirely during your per-batch culling. That brings your plane count down to 5; depending on your geometry, there might be another plane that’s not really helpful for culling (usually the top or bottom plane). Throwing another plane out gets you down to 4, which may or may not make your culling code nicer (4-way SIMD and all). Debatable whether it helps nowadays, but it used to be nifty on PS2 :)

View frustum culling

Recently stumbled across this series by Zeux. It’s well written, but I have one tiny objection about the content. The money quote is in part 5, where he says that “there is p/n-vertex approach which I did not implement (but I’m certain that it won’t be a win over my current methods on SPUs)”. Hoo boy. Being certain about something you don’t know is never a good strategy in programming (or life in general for that matter). View-frustum culling for AABBs is something I’ve come across multiple times in the 12 years that I’ve been seriously doing 3D graphics. There’s tons of nice tricks of the trade, and even though all of them are published somewhere, most people don’t bother looking them up (after all, the obvious algorithm is simple enough and not very slow) – incidentally, there’s two books that are most likely already in your office somewhere (if you’re a graphics or game programmer) that you should consult on the topic: “Real-Time Rendering” (16.10, “Plane/Box Intersection Detection”) and “Real-Time Collision Detection” (5.2.3, “Testing Box Against Plane”). Both describe a very slick implementation of the “p/n-vertex” approach. But I’m gonna take a trip down memory lane and explain the various approaches I’ve been using over the years (don’t expect optimized implementations of all of them; I’m just going to explain the basic idea of each test and then move on, for the most part).
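For reference, the core of the p-vertex trick those two books describe goes like this: for each plane, the AABB corner farthest along the plane normal (the “positive” or p-vertex) can be selected component-wise from the signs of the normal; if even that corner is outside, the whole box is. A minimal sketch, with my own struct names and an inward-facing plane convention:

```c
/* Plane with inside half-space n.x*x + n.y*y + n.z*z + d >= 0. */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float min[3], max[3]; } AABB;

/* Returns nonzero if the box is fully outside the plane. Only the p-vertex
   (corner farthest along the normal) needs testing: pick max or min per
   component based on the sign of the corresponding normal component. */
static int box_outside_plane(const AABB *b, const Plane *p)
{
    float px = (p->nx >= 0.0f) ? b->max[0] : b->min[0];
    float py = (p->ny >= 0.0f) ? b->max[1] : b->min[1];
    float pz = (p->nz >= 0.0f) ? b->max[2] : b->min[2];
    return p->nx * px + p->ny * py + p->nz * pz + p->d < 0.0f;
}
```

One dot product and three selects per plane; with SIMD the selects collapse into sign-mask blends, which is what makes this formulation so slick.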