We’ve been spending some time at RAD looking at Jarek Duda’s ANS/ABS coders (paper). This is essentially a new way of doing arithmetic coding with some different trade-offs from the standard methods. In particular, they have a few sweet spots for situations that were (previously) hard to handle efficiently with regular arithmetic coding.

The paper covers numerous algorithms. Of those given, I think what Jarek calls “rANS” and “tANS” are the most interesting ones; there’s plenty of binary arithmetic coders already and “uABS”, “rABS” or “tABS” do not, at first sight, offer any compelling reasons to switch (that said, there’s some non-obvious advantages that I’ll get to in a later post).

Charles has already posted a good introduction; I recommend you start there. Charles has also spent a lot of time dealing with the table-based coders, and the table-based construction allows some extra degrees of freedom that make it hard to see what’s actually going on. In this post, I’m going to mostly talk about rANS, the range coder equivalent. Charles already describes it up to the point where you try to make it “streaming” (i.e. finite precision); I’m going to continue from there.

### Streaming rANS

In Charles’ notation (which I’m going to use because the paper uses both indexed upper-case I’s and indexed lower-case l’s, which would make reading this with the default font a disaster), we have two functions now: The coding function C and the decoding function D defined as

where .

D determines the encoded symbol s by looking at (x mod M) and looking up the corresponding value in our cumulative frequency table (s is the unique s such that ).

Okay. If you're working with infinite-precision integers, that's all you need, but that's not exactly suitable for fast (de)compression. We want a way to perform this using finite-precision integers, which means we want to limit the size of x somehow. So what we do is define a "normalized" interval

The [L:bL) thing on the right is the notation I'll be using for a half-open interval of integers). b is the radix of our encoder; b=2 means we're emitting bits at a time, b=256 means we're emitting bytes, and so forth. Okay. We now want to keep x in this range. Too large x don't fit in finite-precision integers, but why not allow small x? The (hand-wavey) intuition is that too small x don't contain enough information; as Charles mentioned, x essentially contains about bits of information. If x < 4, it can only be one of four distinct values, and as you saw above the value of x (mod M) directly determines which symbol we're coding. So we want x just right: not too large, not too small. Okay. Now let's look at how a corresponding decoder would look:

while (!done) { // Loop invariant: x is normalized. assert(L <= x && x < b*L); // Decode a symbol s, x = D(x); // Renormalization: While x is too small, // keep reading more bits (nibbles, bytes, ...) while (x < L) x = x*b + readFromStream(); }

Turns out that’s actually our decoding algorithm, period. What we need to do now is figure out how the corresponding encoder looks. As long as we’re only using C and D with big integers, it’s simple; the two are just inverses of each other. Likewise, we want our encoder to be the inverse of our decoder – exactly. That means we have to do the inverse of the operations the decoder does, in the opposite order. Which means our encoder has to look something like this:

while (!done) { // Inverse renormalization: emit bits/bytes etc. while (???) { writeToStream(x % b); x /= b; } // Encode a symbol x = C(s, x); // Loop invariant: x is normalized assert(L <= x && x < b*L); }

So far, this is purely mechanical. The only question is what happens in the “???” – when exactly *do* we emit bits? Well, for the encoder and decoder to be inverses of each other, the answer has to be “exactly when the decoder would read them”. Well, the decoder reads bits whenever the normalization variant is violated after decoding a symbol, to make sure it holds for the next iteration of the loop. The encoder, again, needs to do the opposite – we need to proactively emit bits *before* coding s to make sure that, after we’ve applied C, x will be normalized.

In fact, that’s all we need for a first sketch of renormalization:

while (!done) { // Keep trying until we succeed for (;;) { x_try = C(s, x); if (L <= x_try && x_try < b*L) { // ok? x = x_try; break; } else { // Shrink x a bit and try again writeToStream(x % b); x /= b; } } x = x_try; }

Does this work? Well, it might. It depends. We certainly can’t guarantee it from where we’re standing, though. And even if it does, it’s kind of ugly. Can’t we do better? What about the normalization – I’ve just written down the normalization loops, but just because both decoder and encoder maintain the same invariants doesn’t necessarily mean they are in sync. What if at some point of the loop, there are more than two possible normalized configurations – can this happen? Plus there’s some hidden assumptions in here: the encoder, by only ever shrinking x before C, assumes that C always causes x to increase (or at least never causes x to decrease); similarly, the decoder assumes that applying D won’t increase x.

And I’m afraid this is where the proofs begin.

### Streaming rANS: the proofs (part 1)

Let’s start with the last question first: does C always increase x? It certainly looks like it might, but there’s floors involved – what if there’s some corner case lurking? Time to check:

since , , and are all non-negative (the latter because – each symbol's frequency is less than the sum of all symbol frequencies). So C indeed always increases x, or more precisely, never decreases it.

Next up: normalization. Let's tackle the decoder first. First off, is normalization in the decoder unique? That is, if , is that the only normalized x it could be at that stage in the decoding process? Yes, it is: , so , so

where arbitrary

That is, no matter what bit / byte / whatever the decoder would read next (that's 'd'), running through the normalization loop once more would cause x to become too large; there's only one way for x to be normalized. But do we always end up with a normalized x? Again, yes. Suppose that , then (because we're working in integers) , and hence

(again arbitrary)

The same kind of argument works for the encoder, which floor-divides by b instead of multiplying b it. The key here is that our normalization interval I has a property the ANS paper calls “b-uniqueness”: it’s of the form for some positive integer k (its upper bound is b times its lower bound). Any process that grows (or shrinks) x in multiples of b can’t “skip over” I (as in, transition from being larger than the largest value in I to smaller than the smallest value in I or vice versa) in a single step. Simultaneously, there’s also never two valid states the encoder/decoder could be in (which could make them go out of sync: both encoder and decoder think they’re normalized, but they’re at different state values).

To elaborate, suppose we have b=2 and some interval where the ratio between lower and upper bound is a tiny bit less than 2: say There’s just one value missing. Now suppose the decoder is in state and reads a bit, which turns out to be 1. Then the new state is – we overshot! We were below the lower bound of I’, and yet with a single bit read, we’re now past its upper bound. I’ is “too small”.

Now let’s try the other direction; again b=2, and this time we make the ratio between upper and lower bound a bit too high: set There’s no risk of not reaching a state in that interval now, but now there is ambiguity. Where’s the problem? Suppose the encoder is in state . Coding any symbol will require renormalization to “shift out” one bit. The encoder writes that bit (a zero), goes to state , and moves on. But there’s a problem: the decoder, after decoding that symbol, will be in state too. And it doesn’t know that the encoder got there from state by shifting a bit; all the decoder knows is that it’s in state , which is normalized and doesn’t require reading any more bits. So the encoder has written a bit that the decoder doesn’t read, and now the two go out of sync.

Long story short: if the state intervals involved aren’t b-unique, bad things happen. And on the subject of bad things happening, our encoder tries to find an x such that C(s,x) is inside I by shrinking x – but what if that process doesn't find a suitable x? We need to know a bit more about which values of x lead to C(s,x) being inside I, which leads us to the sets

i.e. the set of all x such that encoding 's' puts us into I again. If all these sets turn out to be b-unique intervals, we're good. If not, we're in trouble. Time for a brief example.

### Intermission: why b-uniqueness is necessary

Let’s pick L=5, b=2 and M=3. We have just two symbols: ‘a’ has probability , and ‘b’ has probability , which we turn into , , , . Our normalization interval is . By computing C(s,x) for the relevant values of s and x, we find out that

Uh-oh. Neither of these two intervals is b-unique. is too small, and is too big. So what goes wrong?

Well, suppose that we’re in state x=7 and want to encode an ‘a’. 7 is not in (too large). So the encoder emits the LSB of x, divides by 2 and… now x=3. Well, that’s not in either (too small), and shrinking it even further won’t help that. So at this point, the encoder is stuck; there’s no x it can reach that works.

### Proofs (part 2): a sufficient condition for b-uniqueness

So we just saw that in certain scenarios, rANS can just get stuck. Is there anything we can do to avoid it? Yes: the paper points out that the embarrassing situation we just ran into can’t happen when M (the sum of all symbol frequencies, the denominator in our probability distribution) divides L, our normalization interval lower bound. That is, for some positive integer k. It doesn’t give details, though; so, knowing this, can we prove anything about the that would help us? Well, let’s just look at the elements of and see what we can do:

let’s work on that condition:

at this point, we can use that and divide by M:

Now, for arbitrary real numbers r and natural numbers n we have that

Using this, we get:

note the term in the outer floor bracket is the sum of an integer and a real value inside , since , so we can simplify drastically

where we applied the floor identities above again and then just multiplied by . Note that the result is an interval of integers with its (exclusive) upper bound being equal to b times its (inclusive) lower bound, just like we need – in other words, assuming that , all the are b-unique and we're golden (this is mentioned in the paper in section 3.3, but not proven, at least not in the Jan 6 2014 version).

Note that this also gives us a much nicer expression to check for our encoder. In fact, we only need the upper bound (due to b-uniqueness, we know there's no risk of us falling through the lower bound), and we end up with the encoding function

while (!done) { // Loop invariant: x is normalized assert(L <= x && x < b*L); // Renormalize x_max = (b * (L / M)) * freq[s]; // all but freq[s] constant while (x >= x_max) { writeToStream(x % b); x /= b; } // Encode a symbol // x = C(s, x); x = freq[s] * (x / M) + (x % M) + base[s]; }

No “???”s left – we have a “streaming” (finite-precision) version of rANS, which is almost like the arithmetic coders you know and love (and in fact quite closely related) except for the bit where you need to encode your data in reverse (and reverse the resulting byte stream).

I put an actual implementation on Github for the curious.

### Some conclusions

This is an arithmetic coder, just a weird one. The reverse encoding seems like a total pain at first, and it kind of is, but it comes with a bunch of really non-obvious but amazing advantages that I’ll cover in a later post (or just read the comments in the code). The fact that M (the sum of all frequencies) has to be a multiple of L is a serious limitation, but I don’t (yet?) see any way to work around that while preserving b-uniqueness. So the compromise is to pick M and L to be both powers of 2. This makes the decoder’s division/mod with M fast. The power-of-2 limitation makes rANS really bad for adaptive coding (where you’re constantly updating your stats, and resampling to a power-of-2-sized distribution is expensive), but hey, so is Huffman. As a Huffman replacement, it’s really quite good.

In particular, it supports a divide-free decoder (and actually no per-symbol division in the encoder either, if you have a static table; see my code on Github, `RansEncPutSymbol`

in particular). This is something you can’t (easily) do with existing multi-symbol arithmetic coders, and is a really cool property to have, because it really puts it a lot closer to being a viable Huffman replacement in a lot of places that do care about the speed.

If you look at the decoder, you’ll notice that its entire behavior for a decoding step only depends on the value of `x`

at the beginning: figure out the symbol from the low-order bits of x, go to a new state, read some bits until we’re normalized again. This is where the table-based versions (tANS etc.) come into play: you can just tabulate their behavior! To make this work, you want to keep b and L relatively small. Then you just make a table of what happens in every possible state.

Interestingly, because these tables really do tabulate the behavior of a “proper” arithmetic coder, they’re *compatible*: if you have two table-baked distributions with use that same values of b and L (i.e. the same interval I), you can switch between them freely; the states mean the same in both of them. It’s not at all obvious that it’s even possible for a table-based encoder to have this property, so it’s even cooler that it comes with no onerous requirements on the distribution!

That said, as interesting as the table-based schemes are, I think the non-table-based variant (rANS) is actually more widely useful. Having small tables severely limits your probability resolution (and range precision), and big tables are somewhat dubious: adds, integer multiplies and bit operations are fine. We can do these quickly. More compute power is a safe thing to bet on right now (memory access is not). (I do have some data points on what you can do on current HW, but I’ll get there in a later post.)

As said, rANS has a bunch of really cool, unusual properties, some of which I’d never have expected to see in any practical entropy coder, with cool consequences. I’ll put that in a separate post, though – this one is long (and technical) enough already. Until then!

I got a few mails about the previous post, including some pretty cool suggestions that I figured were worth posting.

Won Chun (who wrote a book chapter for OpenGL insights on the topic, “WebGL Models: End-to-End”) writes: (this is pasted together from multiple mails; my apologies)

A vertex-cache optimized list is still way more compressible than random: clearly there is lots of coherence to exploit. In fact, that’s pretty much with the edge-sharing-quad business is (OK, some tools might do this because they are working naturally in quads, but you get lots of these cases with any vertex cache optimizer).

So you can also get a pretty big win by exploiting pre-transform vertex cache optimization. I call it “high water mark encoding.” Basically: for such a properly optimized index list, the next index you see is either (a) one you’ve seen before or (b) one higher than the current highest seen index. So, instead of encoding actual indices, you can instead encode them relative to this high water mark (the largest index yet to be seen, initialized to 0). You see “n” and that corresponds to an index of (high water mark – n). When you see 0, you also increment high watermark.

The benefit here is that the encoded indices are very small, and then you can do some kind of varint coding, then your encoded indices are a bit more than a byte on average. If you plan on zipping later, then make sure the varints are byte-aligned and LSB-first.

There are lots of variants on this, like:

- keep a small LRU (~32 elements, or whatever transform cache size you’ve optimized for) of indices, and preferentially reference indices based on recent use (e.g. values 1-32), rather than actual value (deltas are offset by 32),
- make the LRU reference edges rather than verts, so a LRU “hit” gives you two indices instead of one,
- do two-level high water mark encoding, which makes it efficient to store in multi-indexed form (like .OBJ, which has separate indices per attribute) and decode into normal single-indexed form
And also, these approaches let you be smart about attribute compression as well, since it gives you useful hints; e.g. any edge match lets you do parallelogram prediction, multi-indexing can implicitly tell you how to predict normals from positions.

### “High watermark encoding”

I really like this idea. Previously I’d been using simple delta encoding on the resulting index lists; that works, but the problem with delta coding is that a single outlier will produce two large steps – one to go from the current region to the outlier, then another one to get back. The high watermark scheme is almost as straightforward as straight delta coding and avoids this case completely.

Now, if you have an index list straight out of vertex cache optimization and vertex renumbering, the idea works as described. However, with the hybrid tri/paired-tri encoding I described last time, we have to be a bit more careful. While the original index list will indeed have each index be at most 1 larger than the highest index we’ve seen so far, our use of “A ≥ B” to encode whether the next set of indices describes a single triangle or a pair means that we might end up having to start from the second or third vertex of a triangle, and consequently see a larger jump than just 1. Luckily, the fix for this is simple – rather than keeping the high watermark always 1 higher than the largest vertex index we’ve seen so far, we keep it N higher where N is the largest possible “step” we can have in the index list. With that, the transform is really easy, so I’m just going to post my code in full:

static void watermark_transform(std::vector<int>& out_inds, const std::vector<int>& in_inds, int max_step) { int hi = max_step - 1; // high watermark out_inds.clear(); out_inds.reserve(in_inds.size()); for (int v : in_inds) { assert(v <= hi); out_inds.push_back(hi - v); hi = std::max(hi, v + max_step); } }

and the inverse is exactly the same, with the `push_back`

in the middle replaced by the two lines

v = hi - v; out_inds.push_back(v);

So what’s the value of N (aka `max_step`

in the code), the largest step that a new index can be from the highest index we’ve seen so far? Well, for the encoding described last time, it turns out to be 3:

- When encoding a single triangle, the worst case is a triangle with all-new verts. Suppose the highest index we’ve seen so far is k, and the next triangle has indices (k+1,k+2,k+3). Our encoding for single triangles requires that the first index be larger than the second one, so we would send this triangle as (k+3,k+1,k+2). That’s a step of 3.
- For a pair of triangles, we get 4 new indices. So it might seem like we might get a worst-case step of 4. However, we know that the two triangles share an edge; and for that to be the case, the shared edge must have been present in the first triangle. Furthermore, we require that the smaller of the two indices be sent first (that’s what flags this as a paired tri). So the worst cases we can have for the first two indices are (k+2,k+3) and (k+1,k+3), both of which have a largest step size of 2. After the first two indices, we only have another two indices to send; worst-case, they are both new, and the third index is larger than the fourth. This corresponds to a step size of 2. All other configurations have step sizes ≤1.

And again, that’s it. Still very simple. So does it help?

### Results

Let’s dig out the table again (new entries bold, all percentages relative to the “Vertex cache optimized” row):

Stage | Size | .zip size | .7z size |
---|---|---|---|

Original mesh | 6082k | 3312k | 2682k |

Vertex cache optimized | 6082k | 2084k | 1504k |

Vcache opt, watermark |
6082k | 1808k (-13.2%) | 1388k (-7.7%) |

Postprocessed | 4939k (-18.8%) | 1830k (-12.2%) | 1340k (-10.9%) |

Postproc, watermark |
4939k (-18.8%) | 1563k (-25.0%) | 1198k (-20.3%) |

In short: the two techniques work together perfectly and very nicely complement each other. This is without any varint encoding by the way, still sending raw 32-bit indices. Variable-sized integer encoding would probably help, but I haven’t checked how much. The code on Github has been updated in case you want to play around with it.

### Summary

I think this is at a fairly sweet spot in terms of compression ratio vs. decoder complexity. Won originally used this for WebGL, using UTF-8 as his variable-integer encoding: turn the relative indices into Unicode code points, encode the result as UTF-8. This is a bit hacky and limits you to indices that use slightly more than 20 bits (still more than a million, so probably fine for most meshes you’d use in WebGL), but it does mean that the Browser can handle the variable-sized decoding for you (instead of having to deal with byte packing in JS code).

Overall, this approach (postprocess + high watermark) gives a decoder that is maybe 30 lines or so of C++ code, and which does one linear pass over the data (the example program does two passes to keep things clearer, but they can be combined without problems) with no complicated logic whatsoever. It’s simple to get right, easy to drop in, and the results are quite good.

It is not, however, state of the art for mesh compression by any stretch of the imagination. This is a small domain-specific transform that can be applied to fully baked and optimized game assets to make them a bit smaller on disk. I also did not cover vertex data; this is not because vertex data is unimportant, but simply because, so far at least, I’ve not seen *any* mesh compressors that do something besides the obvious (that is, quantization and parallelogram prediction) for vertex data.

Finally, if what you actually want is state-of-the art triangle mesh compression, you should look elsewhere. This is outside the scope of this post, but good papers to start with are “Triangle Mesh Compression” by Touma and Gotsman (Proceedings of Graphics Interface, Vancouver, June 1998) and “TFAN: A low complexity 3D mesh compression algorithm” by Khaled, Zaharia, and Prêteux (Computer Animation and Virtual Worlds 20.2‐3 (2009)). I’ve played around with the former (the character mesh in Candytron was encoded using an algorithm descended from it) but not the latter; however, Touma-Gotsman and descendants are limited to 2-manifold meshes, whereas TFAN promises better compression and handles arbitrarily topologies, so it looks good on paper at least.

Anyway; that’s it for today. Thanks to Won for his mail! And if you should use this somewhere or figure out a way to get more mileage out of it without making it much more complicated, I’d absolute love to know!

(Almost) everyone uses indexed primitives. At this point, the primary choice is between indexed triangle lists, which are flexible but always take 3 indices per triangle, and indexed triangle strips; since your meshes are unlikely to be one big tri strip, that’s gonna involve primitive restarts of some kind. So for a strip with N triangles, you’re generally gonna spend 3+N indices – 2 indices to “prime the pump”, after which every new index will emit a new triangle, and finally a single primitive restart index at the end (supposing your HW target does have primitive restarts, that is).

Indexed triangle strips are nice because they’re smaller, but finding triangle strips is a bit of a pain, and more importantly, long triangle strips actually aren’t ideal because they tend to “wander away” from the origin triangle, which means they’re not getting very good mileage out of the vertex cache (see, for example, Tom Forsyth’s old article on vertex cache optimization).

So here’s the thing: we’d like our index buffers for indexed tri lists to be smaller, but we’d also like to do our other processing (like vertex cache optimization) and then not mess with the results – not too much anyway. Can we do that?

### The plan

Yes we can (I had this idea a while back, but never bothered to work out the details). The key insight is that, in a normal mesh, almost all triangles share an edge with another triangle. And if you’ve vertex-cache optimized your index buffer, two triangles that are adjacent in the index buffer are also quite likely to be adjacent in the geometric sense, i.e. share an edge.

So, here’s the idea: loop over the index buffer. For each triangle, check if it shares an edge with its immediate successor. If so, the two triangles can (in theory anyway) be described with four indices, rather than the usual six (since two vertices appear in both triangles).

But how do we encode this? We could try to steal a bit somewhere, but in fact there’s no need – we can do better than that.

Suppose you have a triangle with vertex indices (A, B, C). The choice of which vertex is first is somewhat arbitrary: the other two even cycles (B, C, A) and (C, A, B) describe the same triangle with the same winding order (and the odd cycles describe the same triangle with opposite winding order, but let’s leave that alone). We can use this choice to encode a bit of information: say if A≥B, we are indeed coding a triangle – otherwise (A<B), what follows will be two triangles sharing a common edge (namely, the edge AB). We can always pick an even permutation of triangle indices such that A≥B, since for *any* integer A, B, C we have

Because the sum is 0, not all three terms can be negative, which in turn means that at least one of A≥B, B≥C or C≥A must be true. Furthermore, if A, B, and C are all distinct (i.e. the triangle is non-degenerate), all three terms are nonzero, and hence we must have both negative and positive terms for the sum to come out as 0.

### Paired triangles

Okay, so if the triangle *wasn’t* paired up, we can always cyclically permute the vertices such that A≥B. What do we do when we have two triangles sharing an edge, say AB?

For this configuration, we need to send the 4 indices A, B, C, D, which encode the two triangles (A, B, C) and (A, D, B).

If A<B, we can just send the 4 indices directly, leading to this very simple decoding algorithm that unpacks our mixed triangle/double-triangle indexed buffer back to a regular triangle list:

- Read 3 indices A, B, C.
- Output triangle (A, B, C).
- If A<B, read another index D and output triangle (A, D, B).

Okay, so this works out really nicely if A<B. But what if it’s not? Well, there’s just two cases left. If A=B, the shared edge is a degenerate edge and both triangles are degenerate triangles; not exactly common, so the pragmatic solution is to say “if either triangle is degenerate, you have to send them un-paired”. That leaves the case A>B; but that means B<A, and BA is also a shared edge! In fact, we can simply rotate the diagram by 180 degrees; this swaps the position of (B,A) and (C,D) but corresponds to the same triangles. With the algorithm above, (B, A, D, C) will decode as the two triangles (B, A, D), (B, C, A) – same two triangles as before, just in a different order. So we’re good.

### Why this is cool

What this means is that, under fairly mild assumptions (but see “limitations” section below), we can have a representation of index buffers that mixes triangles and pairs of adjacent triangles, with no need for any special flag bits (as recommended in Christer’s article) or other hackery to distinguish the two.

In most closed meshes, every triangle has at least one adjacent neighbor (usually several); isolated triangles are very rare. We can store such meshes using 4 indices for every pair of triangles, instead of 6, for about a 33% reduction. Furthermore, most meshes in fact contain a significant number of quadriliterals (quads), and this representation supports quads directly (stored with 4 indices). 33% reduction for index buffers isn’t a huge deal if you have “fat” vertex formats, but for relatively small vertices (as you have in collision detection, among other things), indices can actually end up being a significant part of your overall mesh data.

Finally, this is simple enough to decode that it would probably be viable in GPU hardware. I wouldn’t hold my breath for that one, just thought I might point it out. :)

### Implementation

I wrote up a quick test for this and put it on Github, as usual. This code loads a mesh, vertex-cache optimizes the index buffer (using Tom’s algorithm), then checks for each triangle whether it shares an edge with its immediate successor and if so, sends them as a pair – otherwise, send the triangle alone. That’s it. No attempt is made to be more thorough than that; I just wanted to be as faithful to the original index buffer as possible.

On the “Armadillo” mesh from the Stanford 3D scanning repository, the program outputs this: (**UPDATE**: I added some more features to the sample program and updated the results here accordingly)

172974 verts, 1037832 inds. before: ACMR: 2.617 (16-entry FIFO) 62558 paired tris, 283386 single IB inds: list=1037832, fancy=975274 (-6.03%) after: ACMR: 0.814 (16-entry FIFO) 292458 paired tris, 53486 single IB inds: list=1037832, fancy=745374 (-28.18%) 745374 inds packed 1037832 inds unpacked index buffers match. ACMR: 0.815 (16-entry FIFO)

“Before” is the average cache miss rate (vertex cache misses/triangle) assuming a 16-entry FIFO cache for the original Armadillo mesh (not optimized). As you can see, it’s pretty bad.

I then run the simple pairing algorithm (“fancy”) on that, which (surprisingly enough) manages to reduce the index list size by about 6%.

“After” is after vertex cache optimization. Note that Tom’s algorithm is cache size agnostic; it does not assume any particular vertex cache size, and the only reason I’m dumping stats for a 16-entry FIFO is because I had to pick a number and wanted to pick a relatively conservative estimate. As expected, ACMR is much better; and the index buffer packing algorithm reduces the IB size by about 28%. Considering that the best possible case is a reduction of 33%, this is quite good. Finally, I verify that packing and unpacking the index buffer gives back the expected results (it does), and then re-compute the ACMR on the unpacked index buffer (which has vertices and triangles in a slightly different order, after all).

Long story short: it works, even the basic “only look 1 triangle ahead” algorithm gives good results on vertex cache optimized meshes, and the slight reordering performed by the algorithm does not seem to harm vertex cache hit rate much (on this test mesh anyway). Apologies for only testing on one 3D-scanned mesh, but I don’t actually have any realistic art assets lying around at home, and even if I did, loading them would’ve probably taken me more time than writing the entire rest of this program did.

### UPDATE: Some more results

The original program was missing one more step that is normally done after vertex cache optimization: reordering the vertex buffer so that vertices appear in the order they’re referenced from the index buffer. This usually improves the efficiency of the *pre*-transform cache (as opposed to the *post*-transform cache that the vertex cache optimization algorithm takes care of) because it gives better locality of reference, and has the added side effect of also making the index data more compressible for general purpose lossless compression algorithms like Deflate or LZMA.

Anyway, here’s the results for taking the entire Armadillo mesh – vertices, which just store X/Y/Z position as floats, and 32-bit indices both – and processing it with some standard general-purpose compressors at various stages of the optimization process: (all sizes in binary kilobytes, KiB)

Stage | Size | .zip size | .7z size |
---|---|---|---|

Original mesh | 6082k | 3312k | 2682k |

Vertex cache optimized | 6082k | 2084k | 1504k |

Postprocessed | 4939k (-18.8%) | 1830k (-12.2%) | 1340k (-10.9%) |

So the post-process yields a good 10% reduction in the compressed size for what would be the final packaged assets here. This value is to be taken with a grain of salt: “real” art assets have other per-vertex data besides just 12 bytes for the 3D position, and nothing I described here does anything about vertex data. In other words, this comparison is on data that favors the algorithm in the sense that roughly half of the mesh file is indices, so keep that in mind. Still, 10% reduction post-LZMA is quite good for such a simple algorithm, especially compared to the effort it takes to get the same level of reduction on, say, x86 code.

Also note that the vertex cache optimization by itself *massively* helps the compressors here; the index list for this mesh comes from a 3D reconstruction of range-scanned data and is pathologically bad (the vertex order is really quite random), but the data you get out of a regular 3D mesh export is quite crappy too. So if you’re not doing any optimization on your mesh data yet, you should really consider doing so – it will reduce both your frame timings and your asset sizes.

### Limitations

This is where the (*) from the title comes in. While I think this is fairly nice, there’s two cases where you can’t use this scheme, at least not always:

- When the order of vertices within a triangle matters. An example would be meshes using flat attribute interpolation, where the value chosen for a primitive depends on the “provoking vertex”. And I remember some fairly old graphics hardware where the Z interpolation depended on vertex specification order, so you could get Z-fighting between the passes in multi-pass rendering if they used different subsets of triangles.
- When the order of triangles within a mesh matters (remember that we in the two-tris case, we might end up swapping them to make the encoding work). Having the triangles in a particular order in the index buffer can be very useful with alpha blending, for example. That said, the typical case for this application is that the index buffer partitions into several chunks that should be drawn in order, but with no particular ordering requirements within that chunk, which is easy to do – just prohibit merging tris across chunk boundaries.

That said, it seems to me that it really should be useful in every case where you’d use a vertex cache optimizer (which messes with the order anyway). So it’s probably fine.

Anyway, that’s it. No idea whether it’s useful to anyone, but I think it’s fairly cute, and it definitely seemed worth writing up.

As a general rule, when trying to understand or prove something, symmetry is a big help. If something has obvious symmetries, or can be brought into a symmetric form, doing so is usually worth trying. One example of this is my old post on index determination for DXT5/BC3 alpha blocks. But a while ago I ran into something that makes for a really nice example: consider the expression

If these were regular parentheses, this would simply evaluate to , but we’re taking absolute values here, so it’s a bit more complicated than that. However, the expression does look pretty symmetric, so let’s see if we can do something with that. Name and conquer, so let’s define:

This expression looks fairly symmetric in a and b, so what happens if we switch the two?

So it’s indeed symmetric in the two arguments. Another candidate to check is sign flips – we’re taking absolute values here, so we might find something interesting there as well:

And because we already know that f is symmetric in its arguments, this gives us for free – bonus! Putting all these together, we now know that

which isn’t earth-shattering but not obvious either. You could prove this directly from the original expression, but I like doing it this way (defining a function f and poking at it) because it’s much easier to see what’s going on.

But we’re not quite done yet: One final thing you can do with absolute values is figure out what needs to happen for the expression to be non-negative and see if you can simplify it further. Now, both and are always non-negative, so in fact we have

.

Now suppose that . In that case, we know the sign of the expression on the right and can simplify further:

and similarly, when , we get . Putting the two together, we arrive at the conclusion

which I think is fairly cute. (This expression comes up in SATD computations and can be used to save some work in the final step.)

Last time, I talked about DCTs in general and how the core problem in designing integer DCTs are the rotations. However, I promised an 8×8 IDCT that is suitable for SIMD implementation using 16-bit integer arithmetic without multiplies, and we’re not there yet. I’m going to be focusing on the “B2″ variant of the transform in this post, for concreteness. That means the starting point will be this implementation, which is just a scaled version of Plonka’s DCT factorization with the rotation approximations discussed last time.

### Normalization

The DCT described last time (and implemented with the Matlab/Octave code above) is a scaled DCT, and I didn’t mention the scale factors last time. However, they’re easy enough to solve for. We have a forward transform matrix, . Since the rows are orthogonal (by construction), the inverse of (and hence our IDCT) is its transpose, up to scaling. We can write this as matrix equation where S ought to be a diagonal matrix. Solving for S yields . This gives the total scaling when multiplying by both and ; what we really want is a normalizing diagonal matrix such that is orthogonal, which means

since N is diagonal, so , and again because the two are diagonal this just means that the diagonal entries of N are the square roots of the diagonal entries of S. Then we multiply once by N just before quantization on the encoder side, and another time by N just after dequantization on the decoder side. In practice, this scaling can be folded directly into the quantization/dequantization process, but conceptually they’re distinct steps.

This normalization is the right one to use to compare against the “proper” orthogonal DCT-II. That said, with a non-lifting all-integer transform, normalizing to unit gain is a bad idea; we actually *want* some gain to reduce round-off error. A good compromise is to normalize everything to the gain of the DC coefficient, which is for the 1D transform. Using this normalization ensures that DC, which usually contains a significant amount of the overall energy, is not subject to scaling-induced round-off error and thus reproduced exactly.

In the 2D transform, each coefficient takes part in first a column DCT and then a row DCT (or vice versa). That means the gain of a separable 2D 8×8 DCT is the square of the gain for the 1D DCT, which in our case means 8. And since the IDCT is just the transpose of the DCT, it has the same gain, so following up the 2D DCT with the corresponding 2D IDCT introduces another 8x gain increase, bringing the total gain for the transform → normalize → inverse transform chain up to 64.

Well, the input to the DCT in a video codec is the difference between the actual pixel values and the prediction (usually motion compensation or some geometric/gradient predictor), and with 8-bit input signals that means the input is the difference between two values in [0, 255]. The worst cases are when the predicted value is 0 and the actual value is 255 or vice versa, leading to a worst-case input value range of [-255, 255] for our integer DCT. Multiply this by a gain factor of 64 and you get [-16320, 16320]. Considering that the plan is to have a 16-bit integer implementation, this should make it obvious what the problem is: it all fits, but only just, and we have to be *really* careful about overflow and value ranges.

### Dynamic range

This all falls under the umbrella term of the dynamic range of the transform. The smallest representable non-zero sample values we support are ±1. With 16-bit integer, the largest values we can support are ±32767; -32768 is out since there’s no corresponding positive value. The ratio between the two is the total dynamic range we have to work with. For transforms, it’s customary to define the dynamic range relative to the input value range, not the smallest representable non-zero signal; in our case, the maximum total scaling we can tolerate along any signal path is 32767/255, or about 128.498; anything larger and we might get overflows.

Now, this problem might seem like it’s self-made: as described above, both the forward and inverse transforms have a total gain of 8x, which seems wasteful. Why don’t we, say, leave the 8x gain on the forward transform but try to normalize the inverse transform such that it doesn’t have any extra gain?

Well, yes and no. We can, but it’s addressing the wrong part of the problem. The figures I’ve been quoting so far are for the gain as measured in the 2-norm (Euclidean norm). But for dynamic range analysis, what we need to know is the infinity norm (aka “uniform norm” or “maximum norm”): what is the largest absolute value we’re going to encounter in our vectors? And it turns out that with respect to the infinity norm, the picture looks a bit different.

To see why, let’s start with a toy example. Say we just have two samples, and our transform is a single butterfly:

As covered last time, B is a scaled reflection, its own inverse up to said scaling, and its matrix 2-norm is . However, B’s infinity-norm is 2. For example, the input vector (maximum value of 1) maps to (maximum value of 2) under B. In other words, if we add (or subtract) two numbers, we might get a number back that is up to twice as big as either of the inputs – this shouldn’t come as a surprise. Now let’s apply the transpose of B, which happens to be B itself:

This is just a scaled identity transform, and its 2-norm and infinity norm are 2 exactly. In other words, the 2-norm grew in the second application of B, but with respect to the infinity norm, things were already as bad as they were going to get after the forward transform. And it’s this infinity norm that we need to worry about if we want to avoid overflows.

### Scaling along the signal path

So what’s our actual matrix norms along the signal path, with the “B2″ DCT variant from last time? This is why I published the Octave code on Github. Let’s work through the 1D DCT first:

2> M=bink_dct_B2(eye(8)); 3> S=diag(8 ./ diag(M*M')); 4> norm(M,2) ans = 3.4324 5> norm(M,'inf') ans = 8.7500 6> norm(S*M,2) ans = 3.2962 7> norm(S*M,'inf') ans = 8.4881 8> norm(M'*S*M,2) ans = 8.0000 9> norm(M'*S*M,'inf') ans = 8.0000

This gives both the 2-norms and infinity norms for first the forward DCT only (`M`

), then forward DCT followed by scaling (`S*M`

), and finally with the inverse transform as well (`M'*S*M`

), although at that point it’s really just a scaled identity matrix so the result isn’t particularly interesting. We can do the same thing with 2D transforms using `kron`

which computes the Kronecker product of two matrices (corresponding to the tensor product of linear maps, which is the mathematical background underlying separable transforms):

10> norm(kron(M,M),'inf') ans = 76.563 11> norm(kron(S,S) * kron(M,M),2) ans = 10.865 12> norm(kron(S,S) * kron(M,M),'inf') ans = 72.047 13> norm(kron(M',M') * kron(S,S) * kron(M,M), 2) ans = 64.000 14> norm(kron(M',M') * kron(S,S) * kron(M,M), 'inf') ans = 64.000 15> norm(kron(M',M') * kron(S,S) * kron(M,M) - 64*eye(64), 'inf') ans = 8.5487e-014 16> 64*eps ans = 1.4211e-014

As you can see, the situation is very similar to what we saw in the butterfly example: while the 2-norm keeps growing throughout the entire process, the infinity norm actually maxes out right after the forward transform and stays roughly at the same level afterwards. The last two lines are just there to show that the entire 2D transform chain really does come out to be a scaled identity transform, to within a bit more than 6 ulps.

More importantly however, we do see a gain factor of around 76.6 right after the forward transform, and the inverse transform starts out with a total gain around 72; remember that this value has to be less than 128 for us to be able to do the transform using 16-bit integer arithmetic. As said before, it can fit, but it’s a close shave; we have less than 1 bit of headroom, so we need to be really careful about overflow – there’s no margin for sloppiness. And furthermore, while this means that the *entire transform* fits inside 16 bits, what we really need to make sure is that the way we’re computing it is free of overflows; we need to drill down a bit further.

For this, it helps to have a direct implementation of the DCT and IDCT; a direct implementation (still in Octave) for the DCT is here, and I’ve also checked in a direct IDCT taking an extra parameter that determines how many stages of the IDCT to compute. A “stage” is a set of mutually independent butterflies or rotations that may depend on previous stages (corresponding to “clean breaks” in a butterfly diagram or the various matrix factors in a matrix decomposition).

I’m not going to quote the entire code (it’s not very long, but there’s no point), but here’s a small snippet to show what’s going on:

% odd stage 4 c4 = d4; c5 = d5 + d6; c7 = d5 - d6; c6 = d7; if stages > 1 % odd stage 3 b4 = c4 + c5; b5 = c4 - c5; b6 = c6 + c7; b7 = c6 - c7; % even stage 3 b0 = c0 + c1; b1 = c0 - c1; b2 = c2 + c2/4 + c3/2; % 5/4*c2 + 1/2*c3 b3 = c2/2 - c3 - c3/4; % 1/2*c3 - 5/4*c3 % ... end

The trick is that within each stage, there are only integer additions and right-shifts (here written as divisions because this is Octave code, but in integer code it should really be shifts and not divisions) of values computed in a previous stage. The latter can never overflow. The former may have intermediate overflows in 16-bit integer, but provided that computations are done in two’s complement arithmetic, the results will be equal to the results of the corresponding exact integer computation modulo 2^{16}. So as long as said exact integer computation has a result that fits into 16 bits, the computed results are going to be correct, even if there are intermediate overflows. Note that this is true when computed using SIMD intrinsics that have explicit two’s complement semantics, but is *not* guaranteed to be true when implemented using 16-bit integers in C/C++ code, since signed overflow in C triggers undefined behavior!

Also note that we can’t compute values such as `5/4*c2`

as `(5*c2)/4`

, since the intermediate value `5*c2`

can overflow – and knowing as we do already that we have less than 1 bit of headroom left, likely will for some inputs. One option is to do the shift first: `5*(c2/4)`

. This works but throws away some of the least-significant bits. The form `c2 + (c2/4)`

chosen here is a compromise; it has larger round-off error than the preferred `(5*c2)/4`

but will only overflow if the final result does. The findorth tool described last time automatically determines expressions in this form.

### Verification

Okay, that’s all the ingredients we need. We have the transform decomposed into separate stages; we know the value range at the input to each stage, and we know that if both the inputs and the outputs fit inside 16 bits, the results computed in each stage will be correct. So now all we need to do is verify that last bit – *are* the outputs of the intermediate stages all 16 bits or less?

2> I=eye(8); 3> F=bink_dct_B2(I); 4> S=diag(8 ./ diag(F*F')); 5> Inv1=bink_idct_B2_partial(I,1); 6> Inv2=bink_idct_B2_partial(I,2); 7> Inv3=bink_idct_B2_partial(I,3); 8> Inv4=bink_idct_B2_partial(I,4); 9> Scaled=kron(S,S) * kron(F,F); 10> norm(kron(I,I) * Scaled, 'inf') ans = 72.047 11> norm(kron(I,Inv1) * Scaled, 'inf') ans = 72.047 12> norm(kron(I,Inv2) * Scaled, 'inf') ans = 77.811 13> norm(kron(I,Inv3) * Scaled, 'inf') ans = 77.811 14> norm(kron(I,Inv4) * Scaled, 'inf') ans = 67.905 15> norm(kron(Inv1,Inv4) * Scaled, 'inf') ans = 67.905 16> norm(kron(Inv2,Inv4) * Scaled, 'inf') ans = 73.337 17> norm(kron(Inv3,Inv4) * Scaled, 'inf') ans = 73.337 18> norm(kron(Inv4,Inv4) * Scaled, 'inf') ans = 64.000

Lines 2-4 compute the forward DCT and scaling matrices. 5 through 9 compute various partial IDCTs (`Inv4`

is the full IDCT) and the matrix representing the results after the 2D forward DCT and scaling. After that, we successively calculate the norms of more and more of the 2D IDCT: no IDCT, first stage of row IDCT, first and second stage of row IDCT, …, full row IDCT, full row IDCT and first stage of column IDCT, …, full 2D IDCT.

Long story short, the worst matrix infinity norm we have along the whole transform chain is 77.811, which is well below 128; hence the Bink B2 IDCT is overflow-free for 8-bit signals when implemented using 16-bit integers.

A similar process can be used for the forward DCT, but in our case, we just implement it in 32-bit arithmetic (and actually floating-point arithmetic at that, purely for convenience). The forward DCT is only used in the encoder, which spends *a lot* more time per block than the decoder ever does, and almost none of it in the DCT. Finally, there’s the scaling step itself; there’s ways to do this using only 16 bits on the decoder side as well (see for example “Low-complexity transform and quantization in H. 264/AVC”), but since the scaling only needs to be applied to non-zero coefficients and the post-quantization DCT coefficients in a video codec are *very* sparse, it’s not a big cost to do the scaling during entropy decoding using regular, scalar 32-bit fixed point arithmetic. And if the scaling is designed such that no intermediate values ever take more than 24 bits (which is much easier to satisfy than 16), this integer computation can be performed exactly using floating-point SIMD too, on platforms where 32-bit integer multiplies are slow.

And that’s it! Congratulations, you now know everything worth knowing about Bink 2.2’s integer DCTs. :)

This subject is fairly technical, and I’m going to proceed at a somewhat higher speed than usual; part of the reason for this post is just as future reference material for myself.

### DCTs

There’s several different types of discrete cosine transforms, or DCTs for short. All of them are linear transforms on N-dimensional vectors (i.e. they can be written as a N×N matrix), and they all relate to larger real DFTs (discrete Fourier transforms) in some way; the different types correspond to different boundary conditions and different types of symmetries. In signal processing and data compression, the most important variants are what’s called DCT-II through DCT-IV. What’s commonly called “the” DCT is the DCT-II; this is the transform used in, among other things, JPEG, MPEG-1 and MPEG-2. The DCT-III is the inverse of the DCT-II (up to scaling, depending on how the two are normalized; there’s different conventions in use). Thus it’s often called “the” IDCT (inverse DCT). The DCT-IV is its own inverse, and forms the basis for the MDCT (modified DCT), a lapped transform that’s at the heart of most popular perceptual audio codecs (MP3, AAC, Vorbis, and Opus, among others). DCTs I and V-VIII also exist but off the top of my head I can’t name any signal processing or data compression applications that use them (which doesn’t mean there aren’t any).

Various theoretical justifications exist for the DCT’s use in data compression; they are quite close to the optimal choice of orthogonal basis for certain classes of signals, the relationship to the DFT means there are fast (FFT-related) algorithms available to compute them, and they are a useful building block for other types of transforms. Empirically, after 25 years of DCT-based codecs, there’s little argument that they work fairly well in practice too.

Image (and video) codecs operate on 2D data; like the DFT, there are 2D versions of all DCTs that are separable: a 2D DCT of a M×N block decomposes into N 1D M-element DCTs on the columns followed by M N-element DCTs on the rows (or vice versa). JPEG and the MPEGs up to MPEG-4 ASP use 2D DCTs on blocks of 8×8 pixels. H.264 (aka MPEG-4 AVC) initially switched to 4×4 blocks, then added the 8×8 blocks back in as part of the “High” profile later. H.265 / HEVC has both sizes and also added support for 16×16 and 32×32 transform blocks. Bink 2 sticks with 8×8 DCTs on the block transform level.

### Algorithms

There are lots of different DCT algorithms, and I’m only going to mention a few of them. Like the DFT, DCTs have a recursive structure that allows them to be implemented in O(N log N) operations, instead of the O(N^{2}) operations a full matrix-vector multiply would take. Unlike the FFT, which decomposes into nothing but smaller FFTs except for some multiplications by scalars (“twiddle factors”) plus a permutation, the recursive factorizations of one DCT type will usually contain other trigonometric transforms – both smaller DCTs and smaller DST (discrete sine transforms) of varying types.

The first important dedicated DCT algorithm is presented in [Chen77] (see references below), which provides a DCT factorization for any power-of-2 N which is substantially faster than computing the DCT using a larger FFT. Over a decade later, [LLM89] describes a family of minimum-complexity (in terms of number of arithmetic operations) solutions for N=8 derived using graph transforms, a separate algorithm that has no more than one multiply along any path (this is the version used in the IJG JPEG library), and a then-new fast algorithm for N=16, all in a 4-page paper. [AAN88] introduces a scaled algorithm for N=8 which greatly reduces the number of multiplications. A “scaled” DCT is one where the output coefficients have non-uniform scaling factors; compression applications normally follow up the DCT with a quantization stage and precede the IDCT with a dequantization stage. The scale factors can be folded into the quantization / dequantization steps, which makes them effectively “free”. From today’s perspective, the quest for a minimal number of multiplies in DCT algorithms seems quaint; fast, pipelined multipliers are available almost everywhere these days, and today’s *mobile phone CPUs* achieve floating point throughputs higher than those of 1989’s fastest supercomputers (2012 Nexus 4: 2.8 GFLOPS measured with RgbenchMM; Cray-2: 1.9 GFLOPS peak – let that sink in for a minute). That said, the “scaled DCT” idea is important for other reasons.

There’s also direct 2D methods that can reduce number of arithmetic operations further relative to a separable 1D implementation; however, the overall reduction isn’t very large, the 2D algorithms require considerably more code (instead of 2 loops processing N elements each, there is a single “unrolled” computation processing N^{2} elements), and they have data flow patterns that are unfriendly to SIMD or hardware implementations (separable filters, on the other hand, are quite simple to implement in a SIMD fashion).

For 1D DCTs and N=8, the situation hasn’t substantially changed since the 1980s. Larger DCTs (16 and up) have seen some improvement on their arithmetic operation costs in recent years [Plonka02] [Johnson07], with algorithms derived symbolically from split-radix FFTs (though again, how much of a win the better algorithms are heavily depends on the environment).

### Building blocks

Independent of the choice of DCT algorithm, they all break down into the following 3 basic components:

**Butterflies**. A butterfly is the transform (they’re called that way because of how they’re drawn in diagram form). A butterfly is also its own inverse, up to a scale factor of two, since and likewise .**Planar rotations**. Take a pair of values, interpret them as coordinates in the plane, and rotate them about the origin. I’ve written about them (even in the context of DCTs) before. The inverse of a rotation by θ radians is a rotation by -θ radians. There’s also planar reflections which are closely related to rotations and work out pretty much the same on the implementation side. Reflections are self-inverse, and in fact a butterfly is just a reflection scaled by .**Scalar multiplies**. Map for some nonzero constant c. The inverse is scaling by 1/c.

There are also a few properties of the DCT-II that simplify the situation further. Namely, the DCT-II (when properly normalized) is an *orthogonal* transform, which means it can be decomposed purely into planar rotations and potentially a few sign flips at the end (which can be represented as scalar multiplications). Butterflies, being as they are a scaled reflection, are used because they are cheaper than a “proper” rotation/reflection, but they introduce a scale factor of . So scalar multiplications are used to normalize the scaling across paths that go through a different number of butterflies; but the DCT algorithms we’re considering here only have scaling steps at the end, which can conveniently be dropped if a non-normalized (scaled) DCT is acceptable.

What this all boils down to is that a DCT implementation basically boils down to which planar rotations are performed when, and which of the various ways to perform rotations are employed. There are tons of tricky trade-offs here, and which variant is optimal really depends on the target machine. Older standards (JPEG, the earlier MPEGs) didn’t specify exactly which IDCT was to be used by decoders, leaving this up to the implementation. Every codec had its own DCT/IDCT implementations, which caused problems when a video was encoded using codec A and decoded using codec B – basically, over time, the decoded image could drift from what the encoder intended. Some codecs even had multiple DCT implementations with different algorithms (say one for MMX, and another for SSE2-capable CPUs) so this problem occurred even between different machines using the same codec. And of course there’s the general problem that the “real” DCT involves irrational numbers as coefficients, so there’s really no efficient way to compute it exactly for arbitrary inputs.

### Integer transforms

The solution to all these problems is to pick a specific IDCT approximation and specify it exactly – preferably using (narrow) integer operations only, since floating-point is expensive in hardware implementations and getting 100% bit-identical results for floating-point calculations across multiple platforms and architectures is surprisingly hard in software.

So what denotes a good integer DCT? It depends on the application. As the title says, this post is about the 8×8 integer IDCT (and matching DCT) design used for Bink 2.2. We had the following requirements:

- Bit-exact implementation strictly required (Bink 2 will often go hundreds or even thousands of frames without a key frame). Whatever algorithm we use, it must be
*exactly*the same on all targets. - Must be separable, i.e. the 2D DCT factors into 1D DCT steps.
- Must be a close approximation to the DCT. The goal was to be within 2% in terms of the matrix 2-norm, and same for the coding gain compared to a “proper” DCT. Bink 2 was initially designed with a full-precision floating-point (or high-precision fixed point) DCT in mind and we wanted a drop-in solution that wouldn’t affect the rest of the codec too much.
- Basis vectors must be fully orthogonal (or very close) to allow for trellis quantization.
- Must have an efficient SIMD implementation on all of the following:
- x86 with SSE2 or higher in both 32-bit and 64-bit modes.
- ARM with NEON support.
- PowerPC with VMX/AltiVec (e.g. PS3 Cell PPU).
- Cell SPUs (PS3 again).
- PowerPC with VMX128 (Xbox 360 CPU).

Yep, lots of game consoles – RAD’s a game middleware company.

- 16-bit integer preferred, to enable 8-wide operation on 128-bit SIMD instruction sets. To be more precise, for input values in the range [-255,255], we want intermediate values to fit inside 16 bits (signed) through all stages of the transform.

This turns out to limit our design space greatly. In particular, VMX128 removes *all* integer multiplication operations from AltiVec – integer multiplication by scalars, where desired, has to be synthesized from adds, subtractions and shifts. This constraint ended up driving the whole design.

There’s other integer DCTs designed along similar lines, including several used in other video codecs. However, while there’s several that only use 16-bit integer adds and shifts, they are generally much coarser approximations to the “true” DCT than what we were targeting. Note that this doesn’t mean they’re necessarily worse for compression purposes, or that the DCT described here is “better”; it just means that we wanted a drop-in replacement for the full-precision DCT, and the transform we ended up with is a closer to that goal than published variants with similar complexity.

### Building an integer DCT

Most of the integer DCT designs I’m aware of start with a DCT factorization; the Bink 2 DCT is no exception, and starts from the DCT-II factorizations given in [Plonka02]. For N=8, this is equivalent to the factorization described in [LLM89]: by reordering the rows of the butterfly matrices in Plonka’s factorization and flipping signs to turn rotation-reflections into pure rotations, the DCT-II factorization from example 2.8 can be expressed as the butterfly diagram (click to zoom)

which corresponds to the DCT variation on the left of the 2nd row in figure 4 of [LLM89], with the standard even part. However, [LLM89] covers only N=8, N=16 and (implicitly) N=4, whereas [Plonka02] provides factorizations for arbitrary power-of-2 N, making it easy to experiment with larger transform sizes with a single basic algorithm. **UPDATE**: The previous version of the diagram had a sign flip in the rightmost butterfly. Thanks to “terop” for pointing this out in the comments!

Now, we’re allowing a scaled DCT, so the final scale factors of can be omitted. As explained above, this leaves butterflies (which have simple and obvious integer implementations) and exactly three rotations, one in the even part (the top half, called “even” since it produces the even-numbered DCT coefficients) and two in the odd part:

As customary for DCT factorizations, and . Note that due to the corresponding trig identities, we have , , and , which means that the three rotations we see here (and indeed all rotations referenced in [LLM89]) can be expressed in terms of just , , , , and .

Now, all we really need to do to get an integer DCT is to pick integer approximations for these rotations (preferably using only adds and shifts, see the constraints above!). One way to do so is the BinDCT [Liang01], which uses the decomposition of rotation into shears I explained previously and then approximates the shears using simple dyadic fractions. This is a cool technique, but yields transforms with basis vectors that aren’t fully orthogonal; how big the non-orthogonality is depends on the quality of approximation, with cheaper approximations being less orthogonal. While not a show-stopper, imperfect orthogonality violates the assumptions inherent in trellis quantization and thus reduces the quality of the rate-distortion optimization performed on the encoder side.

### Approximating rotations

So how do we approximate irrational rotations as integers (preferably small integers) with low error? The critical insight is that *any* matrix of the form

is a planar rotation about some angle θ such that . And more generally, any matrix of the form

is a *scaled* planar rotation. Thus, if we’re okay with a scaled DCT (and we are), we can just pick two arbitrary integers c, s that aren’t both zero and we get a scaled integer rotation (about some arbitrary angle). To approximate a *particular* rotation angle θ, we want to ensure that . Since we’d prefer small integer values s and c, we can find suitable approximations simply by enumerating all possibilities with a suitable ratio, checking how closely they approximate the target angle, and using the best one. This is easily done with a small program, and by itself sufficient to find a good solution for the rotation in the even part of the transform.

The odd part is a bit trickier, because it contains two rotations and butterflies that combine the results from different rotations. Therefore, if the two rotations were scaled differently, such a butterfly will result not in a scaled DCT, but a different transformation altogether. So simply approximating the rotations individually won’t work; however, we can approximate the two rotations simultaneously, and add the additional constraint that (somewhat reminiscent of Pythagorean triples), thus guaranteeing that the norm of the two rotations is the same. Initially I was a bit worried that this might over-constrain the problem, but it turns out that even with this extra complication, there are plenty of solutions to choose from. As before, this problem is amenable to an exhaustive search over all small integers that have the right ratios between each other.

Finally, we need implementations of the resulting rotations using only integer additions/subtractions and shifts. Since we’re already fully integer, all this means is that we need to express the integer multiplications using only adds/subtracts and shifts. This is an interesting and very tricky problem by itself, and I won’t cover it here; see books like Hacker’s Delight for a discussion of the problem. I implemented a simple algorithm that just identifies runs of 0- and 1-bits in the binary representation of values. This is optimal for small integers (below 45) but not necessarily for larger ones, but it’s usually fairly good. And since the program already tries out a bunch of variants, I made it compute costs for the different ways to factor planar rotations as well.

The result is findorth.cpp, a small C++ program that finds candidate small integer approximations to the DCT rotations and determines their approximate cost (in number of adds/subtractions and shifts) as well. Armed with this and the rest of the factorization above, it’s easy to come up with several families of integer DCT-like transforms at various quality and cost levels and compare them to existing published ones:

Variant | (c_{2},s_{2}) |
(c_{1},s_{1}) |
(c_{3},s_{3}) |
Cost | L2 err | Gain (dB) |
---|---|---|---|---|---|---|

A1 | (17,-7)/16 | (8,-1)/8 | (7,4)/8 | 32A 10S | 0.072 | 8.7971 |

B1 | (5,-2)/4 | (8,-1)/8 | (7,4)/8 | 30A 10S | 0.072 | 8.7968 |

A2 | (17,-7)/16 | (19,-4)/16 | (16,11)/16 | 38A 12S | 0.013 | 8.8253 |

B2 | (5,-2)/4 | (19,-4)/16 | (16,11)/16 | 36A 12S | 0.013 | 8.8250 |

A3 | (17,-7)/16 | (65,-13)/64 | (55,37)/64 | 42A 15S | 0.003 | 8.8258 |

B3 | (5,-2)/4 | (65,-13)/64 | (55,37)/64 | 40A 15S | 0.012 | 8.8255 |

Real DCT | - | - | - | - | 0.000 | 8.8259 |

BinDCT-L1 | - | - | - | 40A 22S | 0.013 | 8.8257 |

H.264 | - | - | - | 32A 10S | 0.078 | 8.7833 |

VC-1 | - | - | - | ? | 0.078 | 8.7978 |

The cost is measured in adds/subtracts (“A”) and shifts (“S”); “L2 err” is the matrix 2-norm of the difference between the approximated DCT matrix and the true DCT, and “gain” denotes the coding gain assuming a first-order Gauss-Markov process with autocorrelation coefficient ρ=0.95 (a standard measure for the quality of transforms in image coding).

These are by far not all solutions that `findorth`

finds, but they’re what I consider the most interesting ones, i.e. they’re at good points on the cost/approximation quality curve. The resulting transforms compare favorably to published transforms at similar cost (and approximation quality) levels; Bink 2.2 uses variant “B2″, which is the cheapest one that met our quality targets. The Octave code for everything in this post is available on Github.

The H.264 transform is patented, and I believe the VC-1 transform is as well. The transforms described in this post are not (to the best of my knowledge), and RAD has no intention of ever patenting them or keeping them secret. Part of the reason for this post is that we wanted to make an integer DCT that is of similar (or higher) quality and comparable cost than those in existing video standards freely available to everyone. And should one of the particular transforms derived in this post turn out to be patent-encumbered, `findorth`

provides everything necessary to derive a (possibly slightly worse, or slightly more expensive) alternative. (Big thanks to my boss at RAD, Jeff Roberts, for allowing and even encouraging me to write all this up in such detail!)

This post describes the basic structure of the transform; I’ll later post a follow-up with the details of our 16-bit implementation, and the derivation for why 16 bits are sufficient provided input data is in the range of [-255,255].

### References

- [Chen77]
- Chen, Wen-Hsiung, C. H. Smith, and Sam Fralick. “A fast computational algorithm for the discrete cosine transform.” Communications, IEEE Transactions on 25.9 (1977): 1004-1009. PDF
- [AAN88]
- Yukihiro, Arai, Agui Takeshi, and Masayuki Nakajima. “A fast DCT-SQ scheme for images.” IEICE TRANSACTIONS (1976-1990) 71.11 (1988): 1095-1097.
- [LLM89]
- Loeffler, Christoph, Adriaan Ligtenberg, and George S. Moschytz. “Practical fast 1-D DCT algorithms with 11 multiplications.” Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on. IEEE, 1989. PDF
- [Liang01]
- Liang, Jie, and Trac D. Tran. “Fast multiplierless approximations of the DCT with the lifting scheme.” Signal Processing, IEEE Transactions on 49.12 (2001): 3032-3044. PDF
- [Plonka02]
- Plonka, Gerhard, and Manfred Tasche. “Split-radix algorithms for discrete trigonometric transforms.” (2002). PDF
- [Johnson07]
- Johnson, Steven G., and Matteo Frigo. “A modified split-radix FFT with fewer arithmetic operations.” Signal Processing, IEEE Transactions on 55.1 (2007): 111-119. PDF

A lot of bit manipulation operations exist basically everywhere: AND, OR, XOR/EOR, shifts, and so forth. Some exist only on some architectures but have obvious implementations everywhere else – NAND, NOR, equivalence/XNOR and so forth.

Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they’re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).

And then there’s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it’s not always obvious how.

### x86 bit scanning operations

On x86, there’s `BSF`

(bit scan forward), `BSR`

(bit scan reverse), `TZCNT`

(trailing zero count) and `LZCNT`

(leading zero count). The first two are “old” (having existed since 386), the latter are very new. `BSF`

and `TZCNT`

do basically the same thing, with one difference: `BSF`

has “undefined” results on a zero input (what is the index of the first set bit in the integer 0?), whereas the “trailing zeros” definition for `TZCNT`

provides an obvious answer: the width of the register. `BSR`

returns the index of the “last” set bit (again undefined on 0 input), `LZCNT`

returns the number of leading zero bits, so these two actually produce different results.

`x86_bsf(x) = x86_tzcnt(x)`

unless x=0 (in which case`bsf`

is undefined).`x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x)`

unless x=0 (in which case`bsr`

is undefined).

In the following, I’ll only use `LZCNT`

and `TZCNT`

(use above table to convert where necessary) simply to avoid that nasty edge case at 0.

### PowerPC bit scanning operations

PPC has `cntlzw`

(Count Leading Zeros Word) and `cntlzd`

(Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there’s the old trick `x & -x`

which clears all but the lowest set bit in `x`

. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.

`x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x & -x)`

unless x=0 (in which case the PPC expression returns -1).`x86_lzcnt(x) = ppc_cntlz(x)`

.

### ARM bit scanning operations

ARM also has `CLZ`

(Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned `RBIT`

which reverses the bits in a word, which makes a trailing zero count easy to accomplish:

`x86_tzcnt(x) = arm_clz(arm_rbit(x))`

.`x86_lzcnt(x) = arm_clz(x)`

.

### Bonus

Finally, ARMs NEON also offers `VCLS`

(Vector Count Leading Sign Bits), which (quoting from the documentation) “counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit”. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have: `arm_cls(x) = x86_lzcnt(x ^ (x >> 1)) - 1`

(the shift here is an arithmetic shift). The expression `y = x ^ (x >> 1)`

gives a value that has bit `n`

set if and only if bits `n`

and `n + 1`

of x are the same. By induction, the number of leading zeros in `y`

is thus exactly the number of leading bits in `x`

that match the sign bit. This count includes the topmost (sign) bit, so it’s always at least 1, and the instruction definition I just quoted requires us to return the number of bits *following* the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we’re good.

Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.