I got a few mails about the previous post, including some pretty cool suggestions that I figured were worth posting.

Won Chun (who wrote a book chapter for OpenGL insights on the topic, “WebGL Models: End-to-End”) writes: (this is pasted together from multiple mails; my apologies)

A vertex-cache optimized list is still way more compressible than random: clearly there is lots of coherence to exploit. In fact, that’s pretty much with the edge-sharing-quad business is (OK, some tools might do this because they are working naturally in quads, but you get lots of these cases with any vertex cache optimizer).

So you can also get a pretty big win by exploiting pre-transform vertex cache optimization. I call it “high water mark encoding.” Basically: for such a properly optimized index list, the next index you see is either (a) one you’ve seen before or (b) one higher than the current highest seen index. So, instead of encoding actual indices, you can instead encode them relative to this high water mark (the largest index yet to be seen, initialized to 0). You see “n” and that corresponds to an index of (high water mark – n). When you see 0, you also increment high watermark.

The benefit here is that the encoded indices are very small, and then you can do some kind of varint coding, then your encoded indices are a bit more than a byte on average. If you plan on zipping later, then make sure the varints are byte-aligned and LSB-first.

There are lots of variants on this, like:

• keep a small LRU (~32 elements, or whatever transform cache size you’ve optimized for) of indices, and preferentially reference indices based on recent use (e.g. values 1-32), rather than actual value (deltas are offset by 32),
• make the LRU reference edges rather than verts, so a LRU “hit” gives you two indices instead of one,
• do two-level high water mark encoding, which makes it efficient to store in multi-indexed form (like .OBJ, which has separate indices per attribute) and decode into normal single-indexed form

And also, these approaches let you be smart about attribute compression as well, since it gives you useful hints; e.g. any edge match lets you do parallelogram prediction, multi-indexing can implicitly tell you how to predict normals from positions.

### “High watermark encoding”

I really like this idea. Previously I’d been using simple delta encoding on the resulting index lists; that works, but the problem with delta coding is that a single outlier will produce two large steps – one to go from the current region to the outlier, then another one to get back. The high watermark scheme is almost as straightforward as straight delta coding and avoids this case completely.

Now, if you have an index list straight out of vertex cache optimization and vertex renumbering, the idea works as described. However, with the hybrid tri/paired-tri encoding I described last time, we have to be a bit more careful. While the original index list will indeed have each index be at most 1 larger than the highest index we’ve seen so far, our use of “A ≥ B” to encode whether the next set of indices describes a single triangle or a pair means that we might end up having to start from the second or third vertex of a triangle, and consequently see a larger jump than just 1. Luckily, the fix for this is simple – rather than keeping the high watermark always 1 higher than the largest vertex index we’ve seen so far, we keep it N higher where N is the largest possible “step” we can have in the index list. With that, the transform is really easy, so I’m just going to post my code in full:

static void watermark_transform(std::vector<int>& out_inds,
const std::vector<int>& in_inds, int max_step)
{
int hi = max_step - 1; // high watermark
out_inds.clear();
out_inds.reserve(in_inds.size());
for (int v : in_inds)
{
assert(v <= hi);
out_inds.push_back(hi - v);
hi = std::max(hi, v + max_step);
}
}


and the inverse is exactly the same, with the push_back in the middle replaced by the two lines

v = hi - v;
out_inds.push_back(v);


So what’s the value of N (aka max_step in the code), the largest step that a new index can be from the highest index we’ve seen so far? Well, for the encoding described last time, it turns out to be 3:

• When encoding a single triangle, the worst case is a triangle with all-new verts. Suppose the highest index we’ve seen so far is k, and the next triangle has indices (k+1,k+2,k+3). Our encoding for single triangles requires that the first index be larger than the second one, so we would send this triangle as (k+3,k+1,k+2). That’s a step of 3.
• For a pair of triangles, we get 4 new indices. So it might seem like we might get a worst-case step of 4. However, we know that the two triangles share an edge; and for that to be the case, the shared edge must have been present in the first triangle. Furthermore, we require that the smaller of the two indices be sent first (that’s what flags this as a paired tri). So the worst cases we can have for the first two indices are (k+2,k+3) and (k+1,k+3), both of which have a largest step size of 2. After the first two indices, we only have another two indices to send; worst-case, they are both new, and the third index is larger than the fourth. This corresponds to a step size of 2. All other configurations have step sizes ≤1.

And again, that’s it. Still very simple. So does it help?

### Results

Let’s dig out the table again (new entries bold, all percentages relative to the “Vertex cache optimized” row):

Stage Size .zip size .7z size
Original mesh 6082k 3312k 2682k
Vertex cache optimized 6082k 2084k 1504k
Vcache opt, watermark 6082k 1808k (-13.2%) 1388k (-7.7%)
Postprocessed 4939k (-18.8%) 1830k (-12.2%) 1340k (-10.9%)
Postproc, watermark 4939k (-18.8%) 1563k (-25.0%) 1198k (-20.3%)

In short: the two techniques work together perfectly and very nicely complement each other. This is without any varint encoding by the way, still sending raw 32-bit indices. Variable-sized integer encoding would probably help, but I haven’t checked how much. The code on Github has been updated in case you want to play around with it.

### Summary

I think this is at a fairly sweet spot in terms of compression ratio vs. decoder complexity. Won originally used this for WebGL, using UTF-8 as his variable-integer encoding: turn the relative indices into Unicode code points, encode the result as UTF-8. This is a bit hacky and limits you to indices that use slightly more than 20 bits (still more than a million, so probably fine for most meshes you’d use in WebGL), but it does mean that the Browser can handle the variable-sized decoding for you (instead of having to deal with byte packing in JS code).

Overall, this approach (postprocess + high watermark) gives a decoder that is maybe 30 lines or so of C++ code, and which does one linear pass over the data (the example program does two passes to keep things clearer, but they can be combined without problems) with no complicated logic whatsoever. It’s simple to get right, easy to drop in, and the results are quite good.

It is not, however, state of the art for mesh compression by any stretch of the imagination. This is a small domain-specific transform that can be applied to fully baked and optimized game assets to make them a bit smaller on disk. I also did not cover vertex data; this is not because vertex data is unimportant, but simply because, so far at least, I’ve not seen any mesh compressors that do something besides the obvious (that is, quantization and parallelogram prediction) for vertex data.

Finally, if what you actually want is state-of-the art triangle mesh compression, you should look elsewhere. This is outside the scope of this post, but good papers to start with are “Triangle Mesh Compression” by Touma and Gotsman (Proceedings of Graphics Interface, Vancouver, June 1998) and “TFAN: A low complexity 3D mesh compression algorithm” by Khaled, Zaharia, and Prêteux (Computer Animation and Virtual Worlds 20.2‐3 (2009)). I’ve played around with the former (the character mesh in Candytron was encoded using an algorithm descended from it) but not the latter; however, Touma-Gotsman and descendants are limited to 2-manifold meshes, whereas TFAN promises better compression and handles arbitrarily topologies, so it looks good on paper at least.

Anyway; that’s it for today. Thanks to Won for his mail! And if you should use this somewhere or figure out a way to get more mileage out of it without making it much more complicated, I’d absolute love to know!

(Almost) everyone uses indexed primitives. At this point, the primary choice is between indexed triangle lists, which are flexible but always take 3 indices per triangle, and indexed triangle strips; since your meshes are unlikely to be one big tri strip, that’s gonna involve primitive restarts of some kind. So for a strip with N triangles, you’re generally gonna spend 3+N indices – 2 indices to “prime the pump”, after which every new index will emit a new triangle, and finally a single primitive restart index at the end (supposing your HW target does have primitive restarts, that is).

Indexed triangle strips are nice because they’re smaller, but finding triangle strips is a bit of a pain, and more importantly, long triangle strips actually aren’t ideal because they tend to “wander away” from the origin triangle, which means they’re not getting very good mileage out of the vertex cache (see, for example, Tom Forsyth’s old article on vertex cache optimization).

So here’s the thing: we’d like our index buffers for indexed tri lists to be smaller, but we’d also like to do our other processing (like vertex cache optimization) and then not mess with the results – not too much anyway. Can we do that?

### The plan

Yes we can (I had this idea a while back, but never bothered to work out the details). The key insight is that, in a normal mesh, almost all triangles share an edge with another triangle. And if you’ve vertex-cache optimized your index buffer, two triangles that are adjacent in the index buffer are also quite likely to be adjacent in the geometric sense, i.e. share an edge.

So, here’s the idea: loop over the index buffer. For each triangle, check if it shares an edge with its immediate successor. If so, the two triangles can (in theory anyway) be described with four indices, rather than the usual six (since two vertices appear in both triangles).

But how do we encode this? We could try to steal a bit somewhere, but in fact there’s no need – we can do better than that.

Suppose you have a triangle with vertex indices (A, B, C). The choice of which vertex is first is somewhat arbitrary: the other two even cycles (B, C, A) and (C, A, B) describe the same triangle with the same winding order (and the odd cycles describe the same triangle with opposite winding order, but let’s leave that alone). We can use this choice to encode a bit of information: say if A≥B, we are indeed coding a triangle – otherwise (A<B), what follows will be two triangles sharing a common edge (namely, the edge AB). We can always pick an even permutation of triangle indices such that A≥B, since for any integer A, B, C we have

$0 = (A - A) + (B - B) + (C - C) = (A - B) + (B - C) + (C - A)$

Because the sum is 0, not all three terms can be negative, which in turn means that at least one of A≥B, B≥C or C≥A must be true. Furthermore, if A, B, and C are all distinct (i.e. the triangle is non-degenerate), all three terms are nonzero, and hence we must have both negative and positive terms for the sum to come out as 0.

### Paired triangles

Okay, so if the triangle wasn’t paired up, we can always cyclically permute the vertices such that A≥B. What do we do when we have two triangles sharing an edge, say AB?

Two triangles sharing edge AB.

For this configuration, we need to send the 4 indices A, B, C, D, which encode the two triangles (A, B, C) and (A, D, B).

If A<B, we can just send the 4 indices directly, leading to this very simple decoding algorithm that unpacks our mixed triangle/double-triangle indexed buffer back to a regular triangle list:

1. Read 3 indices A, B, C.
2. Output triangle (A, B, C).
3. If A<B, read another index D and output triangle (A, D, B).

Okay, so this works out really nicely if A<B. But what if it’s not? Well, there’s just two cases left. If A=B, the shared edge is a degenerate edge and both triangles are degenerate triangles; not exactly common, so the pragmatic solution is to say “if either triangle is degenerate, you have to send them un-paired”. That leaves the case A>B; but that means B<A, and BA is also a shared edge! In fact, we can simply rotate the diagram by 180 degrees; this swaps the position of (B,A) and (C,D) but corresponds to the same triangles. With the algorithm above, (B, A, D, C) will decode as the two triangles (B, A, D), (B, C, A) – same two triangles as before, just in a different order. So we’re good.

### Why this is cool

What this means is that, under fairly mild assumptions (but see “limitations” section below), we can have a representation of index buffers that mixes triangles and pairs of adjacent triangles, with no need for any special flag bits (as recommended in Christer’s article) or other hackery to distinguish the two.

In most closed meshes, every triangle has at least one adjacent neighbor (usually several); isolated triangles are very rare. We can store such meshes using 4 indices for every pair of triangles, instead of 6, for about a 33% reduction. Furthermore, most meshes in fact contain a significant number of quadriliterals (quads), and this representation supports quads directly (stored with 4 indices). 33% reduction for index buffers isn’t a huge deal if you have “fat” vertex formats, but for relatively small vertices (as you have in collision detection, among other things), indices can actually end up being a significant part of your overall mesh data.

Finally, this is simple enough to decode that it would probably be viable in GPU hardware. I wouldn’t hold my breath for that one, just thought I might point it out. :)

### Implementation

I wrote up a quick test for this and put it on Github, as usual. This code loads a mesh, vertex-cache optimizes the index buffer (using Tom’s algorithm), then checks for each triangle whether it shares an edge with its immediate successor and if so, sends them as a pair – otherwise, send the triangle alone. That’s it. No attempt is made to be more thorough than that; I just wanted to be as faithful to the original index buffer as possible.

On the “Armadillo” mesh from the Stanford 3D scanning repository, the program outputs this: (UPDATE: I added some more features to the sample program and updated the results here accordingly)

172974 verts, 1037832 inds.
before:
ACMR: 2.617 (16-entry FIFO)
62558 paired tris, 283386 single
IB inds: list=1037832, fancy=975274 (-6.03%)
after:
ACMR: 0.814 (16-entry FIFO)
292458 paired tris, 53486 single
IB inds: list=1037832, fancy=745374 (-28.18%)
745374 inds packed
1037832 inds unpacked
index buffers match.
ACMR: 0.815 (16-entry FIFO)


“Before” is the average cache miss rate (vertex cache misses/triangle) assuming a 16-entry FIFO cache for the original Armadillo mesh (not optimized). As you can see, it’s pretty bad.

I then run the simple pairing algorithm (“fancy”) on that, which (surprisingly enough) manages to reduce the index list size by about 6%.

“After” is after vertex cache optimization. Note that Tom’s algorithm is cache size agnostic; it does not assume any particular vertex cache size, and the only reason I’m dumping stats for a 16-entry FIFO is because I had to pick a number and wanted to pick a relatively conservative estimate. As expected, ACMR is much better; and the index buffer packing algorithm reduces the IB size by about 28%. Considering that the best possible case is a reduction of 33%, this is quite good. Finally, I verify that packing and unpacking the index buffer gives back the expected results (it does), and then re-compute the ACMR on the unpacked index buffer (which has vertices and triangles in a slightly different order, after all).

Long story short: it works, even the basic “only look 1 triangle ahead” algorithm gives good results on vertex cache optimized meshes, and the slight reordering performed by the algorithm does not seem to harm vertex cache hit rate much (on this test mesh anyway). Apologies for only testing on one 3D-scanned mesh, but I don’t actually have any realistic art assets lying around at home, and even if I did, loading them would’ve probably taken me more time than writing the entire rest of this program did.

### UPDATE: Some more results

The original program was missing one more step that is normally done after vertex cache optimization: reordering the vertex buffer so that vertices appear in the order they’re referenced from the index buffer. This usually improves the efficiency of the pre-transform cache (as opposed to the post-transform cache that the vertex cache optimization algorithm takes care of) because it gives better locality of reference, and has the added side effect of also making the index data more compressible for general purpose lossless compression algorithms like Deflate or LZMA.

Anyway, here’s the results for taking the entire Armadillo mesh – vertices, which just store X/Y/Z position as floats, and 32-bit indices both – and processing it with some standard general-purpose compressors at various stages of the optimization process: (all sizes in binary kilobytes, KiB)

Stage Size .zip size .7z size
Original mesh 6082k 3312k 2682k
Vertex cache optimized 6082k 2084k 1504k
Postprocessed 4939k (-18.8%) 1830k (-12.2%) 1340k (-10.9%)

So the post-process yields a good 10% reduction in the compressed size for what would be the final packaged assets here. This value is to be taken with a grain of salt: “real” art assets have other per-vertex data besides just 12 bytes for the 3D position, and nothing I described here does anything about vertex data. In other words, this comparison is on data that favors the algorithm in the sense that roughly half of the mesh file is indices, so keep that in mind. Still, 10% reduction post-LZMA is quite good for such a simple algorithm, especially compared to the effort it takes to get the same level of reduction on, say, x86 code.

Also note that the vertex cache optimization by itself massively helps the compressors here; the index list for this mesh comes from a 3D reconstruction of range-scanned data and is pathologically bad (the vertex order is really quite random), but the data you get out of a regular 3D mesh export is quite crappy too. So if you’re not doing any optimization on your mesh data yet, you should really consider doing so – it will reduce both your frame timings and your asset sizes.

### Limitations

This is where the (*) from the title comes in. While I think this is fairly nice, there’s two cases where you can’t use this scheme, at least not always:

1. When the order of vertices within a triangle matters. An example would be meshes using flat attribute interpolation, where the value chosen for a primitive depends on the “provoking vertex”. And I remember some fairly old graphics hardware where the Z interpolation depended on vertex specification order, so you could get Z-fighting between the passes in multi-pass rendering if they used different subsets of triangles.
2. When the order of triangles within a mesh matters (remember that we in the two-tris case, we might end up swapping them to make the encoding work). Having the triangles in a particular order in the index buffer can be very useful with alpha blending, for example. That said, the typical case for this application is that the index buffer partitions into several chunks that should be drawn in order, but with no particular ordering requirements within that chunk, which is easy to do – just prohibit merging tris across chunk boundaries.

That said, it seems to me that it really should be useful in every case where you’d use a vertex cache optimizer (which messes with the order anyway). So it’s probably fine.

Anyway, that’s it. No idea whether it’s useful to anyone, but I think it’s fairly cute, and it definitely seemed worth writing up.

As a general rule, when trying to understand or prove something, symmetry is a big help. If something has obvious symmetries, or can be brought into a symmetric form, doing so is usually worth trying. One example of this is my old post on index determination for DXT5/BC3 alpha blocks. But a while ago I ran into something that makes for a really nice example: consider the expression

$|a + b| + |a - b|$

If these were regular parentheses, this would simply evaluate to $2a$, but we’re taking absolute values here, so it’s a bit more complicated than that. However, the expression does look pretty symmetric, so let’s see if we can do something with that. Name and conquer, so let’s define:

$f(a,b) := |a + b| + |a - b|$

This expression looks fairly symmetric in a and b, so what happens if we switch the two?

$f(b,a) = |b + a| + |b - a| = |a + b| + |-(a - b)| = \\ |a + b| + |a - b| = f(a,b)$

So it’s indeed symmetric in the two arguments. Another candidate to check is sign flips – we’re taking absolute values here, so we might find something interesting there as well:

$f(a,-b) = |a + (-b)| + |a - (-b)| = |a - b| + |a + b| = f(a,b)$

And because we already know that f is symmetric in its arguments, this gives us $f(-a,b) = f(a,b)$ for free – bonus! Putting all these together, we now know that

$f(a,b) = f(b,a) = f(|a|,|b|) = | |a| + |b| | + | |a| - |b| |$

which isn’t earth-shattering but not obvious either. You could prove this directly from the original expression, but I like doing it this way (defining a function f and poking at it) because it’s much easier to see what’s going on.

But we’re not quite done yet: One final thing you can do with absolute values is figure out what needs to happen for the expression to be non-negative and see if you can simplify it further. Now, both $|a|$ and $|b|$ are always non-negative, so in fact we have

$f(a,b) = |a| + |b| + | |a| - |b| |$.

Now suppose that $|a| \ge |b|$. In that case, we know the sign of the expression on the right and can simplify further:

$f(a,b) = |a| + |b| + (|a| - |b|) = 2 |a|$

and similarly, when $|b| \ge |a|$, we get $f(a,b) = 2 |b|$. Putting the two together, we arrive at the conclusion

$|a + b| + |a - b| = f(a,b) = 2 \max(|a|,|b|)$

which I think is fairly cute. (This expression comes up in SATD computations and can be used to save some work in the final step.)

Last time, I talked about DCTs in general and how the core problem in designing integer DCTs are the rotations. However, I promised an 8×8 IDCT that is suitable for SIMD implementation using 16-bit integer arithmetic without multiplies, and we’re not there yet. I’m going to be focusing on the “B2″ variant of the transform in this post, for concreteness. That means the starting point will be this implementation, which is just a scaled version of Plonka’s DCT factorization with the rotation approximations discussed last time.

### Normalization

The DCT described last time (and implemented with the Matlab/Octave code above) is a scaled DCT, and I didn’t mention the scale factors last time. However, they’re easy enough to solve for. We have a forward transform matrix, $M$. Since the rows are orthogonal (by construction), the inverse of $M$ (and hence our IDCT) is its transpose, up to scaling. We can write this as matrix equation $I = M^T S M$ where S ought to be a diagonal matrix. Solving for S yields $S = (M M^T)^{-1}$. This gives the total scaling when multiplying by both $M$ and $M^T$; what we really want is a normalizing diagonal matrix $N$ such that $\tilde{M} = NM$ is orthogonal, which means

$I = \tilde{M}^T \tilde{M} = M^T N^T N M = M^T (N^2) M$

since N is diagonal, so $N^2 = S$, and again because the two are diagonal this just means that the diagonal entries of N are the square roots of the diagonal entries of S. Then we multiply once by N just before quantization on the encoder side, and another time by N just after dequantization on the decoder side. In practice, this scaling can be folded directly into the quantization/dequantization process, but conceptually they’re distinct steps.

This normalization is the right one to use to compare against the “proper” orthogonal DCT-II. That said, with a non-lifting all-integer transform, normalizing to unit gain is a bad idea; we actually want some gain to reduce round-off error. A good compromise is to normalize everything to the gain of the DC coefficient, which is $\sqrt{8}$ for the 1D transform. Using this normalization ensures that DC, which usually contains a significant amount of the overall energy, is not subject to scaling-induced round-off error and thus reproduced exactly.

In the 2D transform, each coefficient takes part in first a column DCT and then a row DCT (or vice versa). That means the gain of a separable 2D 8×8 DCT is the square of the gain for the 1D DCT, which in our case means 8. And since the IDCT is just the transpose of the DCT, it has the same gain, so following up the 2D DCT with the corresponding 2D IDCT introduces another 8x gain increase, bringing the total gain for the transform → normalize → inverse transform chain up to 64.

Well, the input to the DCT in a video codec is the difference between the actual pixel values and the prediction (usually motion compensation or some geometric/gradient predictor), and with 8-bit input signals that means the input is the difference between two values in [0, 255]. The worst cases are when the predicted value is 0 and the actual value is 255 or vice versa, leading to a worst-case input value range of [-255, 255] for our integer DCT. Multiply this by a gain factor of 64 and you get [-16320, 16320]. Considering that the plan is to have a 16-bit integer implementation, this should make it obvious what the problem is: it all fits, but only just, and we have to be really careful about overflow and value ranges.

### Dynamic range

This all falls under the umbrella term of the dynamic range of the transform. The smallest representable non-zero sample values we support are ±1. With 16-bit integer, the largest values we can support are ±32767; -32768 is out since there’s no corresponding positive value. The ratio between the two is the total dynamic range we have to work with. For transforms, it’s customary to define the dynamic range relative to the input value range, not the smallest representable non-zero signal; in our case, the maximum total scaling we can tolerate along any signal path is 32767/255, or about 128.498; anything larger and we might get overflows.

Now, this problem might seem like it’s self-made: as described above, both the forward and inverse transforms have a total gain of 8x, which seems wasteful. Why don’t we, say, leave the 8x gain on the forward transform but try to normalize the inverse transform such that it doesn’t have any extra gain?

Well, yes and no. We can, but it’s addressing the wrong part of the problem. The figures I’ve been quoting so far are for the gain as measured in the 2-norm (Euclidean norm). But for dynamic range analysis, what we need to know is the infinity norm (aka “uniform norm” or “maximum norm”): what is the largest absolute value we’re going to encounter in our vectors? And it turns out that with respect to the infinity norm, the picture looks a bit different.

To see why, let’s start with a toy example. Say we just have two samples, and our transform is a single butterfly:

$B = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$

As covered last time, B is a scaled reflection, its own inverse up to said scaling, and its matrix 2-norm is $\sqrt{2} \approx 1.4142$. However, B’s infinity-norm is 2. For example, the input vector $(1, 1)^T$ (maximum value of 1) maps to $(2, 0)^T$ (maximum value of 2) under B. In other words, if we add (or subtract) two numbers, we might get a number back that is up to twice as big as either of the inputs – this shouldn’t come as a surprise. Now let’s apply the transpose of B, which happens to be B itself:

$B^T B = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}^2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$

This is just a scaled identity transform, and its 2-norm and infinity norm are 2 exactly. In other words, the 2-norm grew in the second application of B, but with respect to the infinity norm, things were already as bad as they were going to get after the forward transform. And it’s this infinity norm that we need to worry about if we want to avoid overflows.

### Scaling along the signal path

So what’s our actual matrix norms along the signal path, with the “B2″ DCT variant from last time? This is why I published the Octave code on Github. Let’s work through the 1D DCT first:

2> M=bink_dct_B2(eye(8));
3> S=diag(8 ./ diag(M*M'));
4> norm(M,2)
ans =  3.4324
5> norm(M,'inf')
ans =  8.7500
6> norm(S*M,2)
ans =  3.2962
7> norm(S*M,'inf')
ans =  8.4881
8> norm(M'*S*M,2)
ans =  8.0000
9> norm(M'*S*M,'inf')
ans =  8.0000


This gives both the 2-norms and infinity norms for first the forward DCT only (M), then forward DCT followed by scaling (S*M), and finally with the inverse transform as well (M'*S*M), although at that point it’s really just a scaled identity matrix so the result isn’t particularly interesting. We can do the same thing with 2D transforms using kron which computes the Kronecker product of two matrices (corresponding to the tensor product of linear maps, which is the mathematical background underlying separable transforms):

10> norm(kron(M,M),'inf')
ans =  76.563
11> norm(kron(S,S) * kron(M,M),2)
ans =  10.865
12> norm(kron(S,S) * kron(M,M),'inf')
ans =  72.047
13> norm(kron(M',M') * kron(S,S) * kron(M,M), 2)
ans =  64.000
14> norm(kron(M',M') * kron(S,S) * kron(M,M), 'inf')
ans =  64.000
15> norm(kron(M',M') * kron(S,S) * kron(M,M) - 64*eye(64), 'inf')
ans = 8.5487e-014
16> 64*eps
ans = 1.4211e-014


As you can see, the situation is very similar to what we saw in the butterfly example: while the 2-norm keeps growing throughout the entire process, the infinity norm actually maxes out right after the forward transform and stays roughly at the same level afterwards. The last two lines are just there to show that the entire 2D transform chain really does come out to be a scaled identity transform, to within a bit more than 6 ulps.

More importantly however, we do see a gain factor of around 76.6 right after the forward transform, and the inverse transform starts out with a total gain around 72; remember that this value has to be less than 128 for us to be able to do the transform using 16-bit integer arithmetic. As said before, it can fit, but it’s a close shave; we have less than 1 bit of headroom, so we need to be really careful about overflow – there’s no margin for sloppiness. And furthermore, while this means that the entire transform fits inside 16 bits, what we really need to make sure is that the way we’re computing it is free of overflows; we need to drill down a bit further.

For this, it helps to have a direct implementation of the DCT and IDCT; a direct implementation (still in Octave) for the DCT is here, and I’ve also checked in a direct IDCT taking an extra parameter that determines how many stages of the IDCT to compute. A “stage” is a set of mutually independent butterflies or rotations that may depend on previous stages (corresponding to “clean breaks” in a butterfly diagram or the various matrix factors in a matrix decomposition).

I’m not going to quote the entire code (it’s not very long, but there’s no point), but here’s a small snippet to show what’s going on:

  % odd stage 4
c4 = d4;
c5 = d5 + d6;
c7 = d5 - d6;
c6 = d7;

if stages > 1
% odd stage 3
b4 = c4 + c5;
b5 = c4 - c5;
b6 = c6 + c7;
b7 = c6 - c7;

% even stage 3
b0 = c0 + c1;
b1 = c0 - c1;
b2 = c2 + c2/4 + c3/2; % 5/4*c2 + 1/2*c3
b3 = c2/2 - c3 - c3/4; % 1/2*c3 - 5/4*c3

% ...
end


The trick is that within each stage, there are only integer additions and right-shifts (here written as divisions because this is Octave code, but in integer code it should really be shifts and not divisions) of values computed in a previous stage. The latter can never overflow. The former may have intermediate overflows in 16-bit integer, but provided that computations are done in two’s complement arithmetic, the results will be equal to the results of the corresponding exact integer computation modulo 216. So as long as said exact integer computation has a result that fits into 16 bits, the computed results are going to be correct, even if there are intermediate overflows. Note that this is true when computed using SIMD intrinsics that have explicit two’s complement semantics, but is not guaranteed to be true when implemented using 16-bit integers in C/C++ code, since signed overflow in C triggers undefined behavior!

Also note that we can’t compute values such as 5/4*c2 as (5*c2)/4, since the intermediate value 5*c2 can overflow – and knowing as we do already that we have less than 1 bit of headroom left, likely will for some inputs. One option is to do the shift first: 5*(c2/4). This works but throws away some of the least-significant bits. The form c2 + (c2/4) chosen here is a compromise; it has larger round-off error than the preferred (5*c2)/4 but will only overflow if the final result does. The findorth tool described last time automatically determines expressions in this form.

### Verification

Okay, that’s all the ingredients we need. We have the transform decomposed into separate stages; we know the value range at the input to each stage, and we know that if both the inputs and the outputs fit inside 16 bits, the results computed in each stage will be correct. So now all we need to do is verify that last bit – are the outputs of the intermediate stages all 16 bits or less?

2> I=eye(8);
3> F=bink_dct_B2(I);
4> S=diag(8 ./ diag(F*F'));
5> Inv1=bink_idct_B2_partial(I,1);
6> Inv2=bink_idct_B2_partial(I,2);
7> Inv3=bink_idct_B2_partial(I,3);
8> Inv4=bink_idct_B2_partial(I,4);
9> Scaled=kron(S,S) * kron(F,F);
10> norm(kron(I,I) * Scaled, 'inf')
ans =  72.047
11> norm(kron(I,Inv1) * Scaled, 'inf')
ans =  72.047
12> norm(kron(I,Inv2) * Scaled, 'inf')
ans =  77.811
13> norm(kron(I,Inv3) * Scaled, 'inf')
ans =  77.811
14> norm(kron(I,Inv4) * Scaled, 'inf')
ans =  67.905
15> norm(kron(Inv1,Inv4) * Scaled, 'inf')
ans =  67.905
16> norm(kron(Inv2,Inv4) * Scaled, 'inf')
ans =  73.337
17> norm(kron(Inv3,Inv4) * Scaled, 'inf')
ans =  73.337
18> norm(kron(Inv4,Inv4) * Scaled, 'inf')
ans =  64.000


Lines 2-4 compute the forward DCT and scaling matrices. 5 through 9 compute various partial IDCTs (Inv4 is the full IDCT) and the matrix representing the results after the 2D forward DCT and scaling. After that, we successively calculate the norms of more and more of the 2D IDCT: no IDCT, first stage of row IDCT, first and second stage of row IDCT, …, full row IDCT, full row IDCT and first stage of column IDCT, …, full 2D IDCT.

Long story short, the worst matrix infinity norm we have along the whole transform chain is 77.811, which is well below 128; hence the Bink B2 IDCT is overflow-free for 8-bit signals when implemented using 16-bit integers.

A similar process can be used for the forward DCT, but in our case, we just implement it in 32-bit arithmetic (and actually floating-point arithmetic at that, purely for convenience). The forward DCT is only used in the encoder, which spends a lot more time per block than the decoder ever does, and almost none of it in the DCT. Finally, there’s the scaling step itself; there’s ways to do this using only 16 bits on the decoder side as well (see for example “Low-complexity transform and quantization in H. 264/AVC”), but since the scaling only needs to be applied to non-zero coefficients and the post-quantization DCT coefficients in a video codec are very sparse, it’s not a big cost to do the scaling during entropy decoding using regular, scalar 32-bit fixed point arithmetic. And if the scaling is designed such that no intermediate values ever take more than 24 bits (which is much easier to satisfy than 16), this integer computation can be performed exactly using floating-point SIMD too, on platforms where 32-bit integer multiplies are slow.

And that’s it! Congratulations, you now know everything worth knowing about Bink 2.2’s integer DCTs. :)

This subject is fairly technical, and I’m going to proceed at a somewhat higher speed than usual; part of the reason for this post is just as future reference material for myself.

### DCTs

There’s several different types of discrete cosine transforms, or DCTs for short. All of them are linear transforms on N-dimensional vectors (i.e. they can be written as a N×N matrix), and they all relate to larger real DFTs (discrete Fourier transforms) in some way; the different types correspond to different boundary conditions and different types of symmetries. In signal processing and data compression, the most important variants are what’s called DCT-II through DCT-IV. What’s commonly called “the” DCT is the DCT-II; this is the transform used in, among other things, JPEG, MPEG-1 and MPEG-2. The DCT-III is the inverse of the DCT-II (up to scaling, depending on how the two are normalized; there’s different conventions in use). Thus it’s often called “the” IDCT (inverse DCT). The DCT-IV is its own inverse, and forms the basis for the MDCT (modified DCT), a lapped transform that’s at the heart of most popular perceptual audio codecs (MP3, AAC, Vorbis, and Opus, among others). DCTs I and V-VIII also exist but off the top of my head I can’t name any signal processing or data compression applications that use them (which doesn’t mean there aren’t any).

Various theoretical justifications exist for the DCT’s use in data compression; they are quite close to the optimal choice of orthogonal basis for certain classes of signals, the relationship to the DFT means there are fast (FFT-related) algorithms available to compute them, and they are a useful building block for other types of transforms. Empirically, after 25 years of DCT-based codecs, there’s little argument that they work fairly well in practice too.

Image (and video) codecs operate on 2D data; like the DFT, there are 2D versions of all DCTs that are separable: a 2D DCT of a M×N block decomposes into N 1D M-element DCTs on the columns followed by M N-element DCTs on the rows (or vice versa). JPEG and the MPEGs up to MPEG-4 ASP use 2D DCTs on blocks of 8×8 pixels. H.264 (aka MPEG-4 AVC) initially switched to 4×4 blocks, then added the 8×8 blocks back in as part of the “High” profile later. H.265 / HEVC has both sizes and also added support for 16×16 and 32×32 transform blocks. Bink 2 sticks with 8×8 DCTs on the block transform level.

### Algorithms

There are lots of different DCT algorithms, and I’m only going to mention a few of them. Like the DFT, DCTs have a recursive structure that allows them to be implemented in O(N log N) operations, instead of the O(N2) operations a full matrix-vector multiply would take. Unlike the FFT, which decomposes into nothing but smaller FFTs except for some multiplications by scalars (“twiddle factors”) plus a permutation, the recursive factorizations of one DCT type will usually contain other trigonometric transforms – both smaller DCTs and smaller DST (discrete sine transforms) of varying types.

The first important dedicated DCT algorithm is presented in [Chen77] (see references below), which provides a DCT factorization for any power-of-2 N which is substantially faster than computing the DCT using a larger FFT. Over a decade later, [LLM89] describes a family of minimum-complexity (in terms of number of arithmetic operations) solutions for N=8 derived using graph transforms, a separate algorithm that has no more than one multiply along any path (this is the version used in the IJG JPEG library), and a then-new fast algorithm for N=16, all in a 4-page paper. [AAN88] introduces a scaled algorithm for N=8 which greatly reduces the number of multiplications. A “scaled” DCT is one where the output coefficients have non-uniform scaling factors; compression applications normally follow up the DCT with a quantization stage and precede the IDCT with a dequantization stage. The scale factors can be folded into the quantization / dequantization steps, which makes them effectively “free”. From today’s perspective, the quest for a minimal number of multiplies in DCT algorithms seems quaint; fast, pipelined multipliers are available almost everywhere these days, and today’s mobile phone CPUs achieve floating point throughputs higher than those of 1989’s fastest supercomputers (2012 Nexus 4: 2.8 GFLOPS measured with RgbenchMM; Cray-2: 1.9 GFLOPS peak – let that sink in for a minute). That said, the “scaled DCT” idea is important for other reasons.

There’s also direct 2D methods that can reduce number of arithmetic operations further relative to a separable 1D implementation; however, the overall reduction isn’t very large, the 2D algorithms require considerably more code (instead of 2 loops processing N elements each, there is a single “unrolled” computation processing N2 elements), and they have data flow patterns that are unfriendly to SIMD or hardware implementations (separable filters, on the other hand, are quite simple to implement in a SIMD fashion).

For 1D DCTs and N=8, the situation hasn’t substantially changed since the 1980s. Larger DCTs (16 and up) have seen some improvement on their arithmetic operation costs in recent years [Plonka02] [Johnson07], with algorithms derived symbolically from split-radix FFTs (though again, how much of a win the better algorithms are heavily depends on the environment).

### Building blocks

Independent of the choice of DCT algorithm, they all break down into the following 3 basic components:

• Butterflies. A butterfly is the transform $(a, b) \mapsto (a + b, a - b)$ (they’re called that way because of how they’re drawn in diagram form). A butterfly is also its own inverse, up to a scale factor of two, since $(a+b) + (a-b) = 2a$ and likewise $(a+b) - (a-b) = 2b$.
• Planar rotations. Take a pair of values, interpret them as coordinates in the plane, and rotate them about the origin. I’ve written about them (even in the context of DCTs) before. The inverse of a rotation by θ radians is a rotation by -θ radians. There’s also planar reflections which are closely related to rotations and work out pretty much the same on the implementation side. Reflections are self-inverse, and in fact a butterfly is just a reflection scaled by $\sqrt{2}$.
• Scalar multiplies. Map $a \mapsto ca$ for some nonzero constant c. The inverse is scaling by 1/c.

There are also a few properties of the DCT-II that simplify the situation further. Namely, the DCT-II (when properly normalized) is an orthogonal transform, which means it can be decomposed purely into planar rotations and potentially a few sign flips at the end (which can be represented as scalar multiplications). Butterflies, being as they are a scaled reflection, are used because they are cheaper than a “proper” rotation/reflection, but they introduce a scale factor of $\sqrt{2}$. So scalar multiplications are used to normalize the scaling across paths that go through a different number of butterflies; but the DCT algorithms we’re considering here only have scaling steps at the end, which can conveniently be dropped if a non-normalized (scaled) DCT is acceptable.

What this all boils down to is that a DCT implementation basically boils down to which planar rotations are performed when, and which of the various ways to perform rotations are employed. There are tons of tricky trade-offs here, and which variant is optimal really depends on the target machine. Older standards (JPEG, the earlier MPEGs) didn’t specify exactly which IDCT was to be used by decoders, leaving this up to the implementation. Every codec had its own DCT/IDCT implementations, which caused problems when a video was encoded using codec A and decoded using codec B – basically, over time, the decoded image could drift from what the encoder intended. Some codecs even had multiple DCT implementations with different algorithms (say one for MMX, and another for SSE2-capable CPUs) so this problem occurred even between different machines using the same codec. And of course there’s the general problem that the “real” DCT involves irrational numbers as coefficients, so there’s really no efficient way to compute it exactly for arbitrary inputs.

### Integer transforms

The solution to all these problems is to pick a specific IDCT approximation and specify it exactly – preferably using (narrow) integer operations only, since floating-point is expensive in hardware implementations and getting 100% bit-identical results for floating-point calculations across multiple platforms and architectures is surprisingly hard in software.

So what denotes a good integer DCT? It depends on the application. As the title says, this post is about the 8×8 integer IDCT (and matching DCT) design used for Bink 2.2. We had the following requirements:

• Bit-exact implementation strictly required (Bink 2 will often go hundreds or even thousands of frames without a key frame). Whatever algorithm we use, it must be exactly the same on all targets.
• Must be separable, i.e. the 2D DCT factors into 1D DCT steps.
• Must be a close approximation to the DCT. The goal was to be within 2% in terms of the matrix 2-norm, and same for the coding gain compared to a “proper” DCT. Bink 2 was initially designed with a full-precision floating-point (or high-precision fixed point) DCT in mind and we wanted a drop-in solution that wouldn’t affect the rest of the codec too much.
• Basis vectors must be fully orthogonal (or very close) to allow for trellis quantization.
• Must have an efficient SIMD implementation on all of the following:
• x86 with SSE2 or higher in both 32-bit and 64-bit modes.
• ARM with NEON support.
• PowerPC with VMX/AltiVec (e.g. PS3 Cell PPU).
• Cell SPUs (PS3 again).
• PowerPC with VMX128 (Xbox 360 CPU).

Yep, lots of game consoles – RAD’s a game middleware company.

• 16-bit integer preferred, to enable 8-wide operation on 128-bit SIMD instruction sets. To be more precise, for input values in the range [-255,255], we want intermediate values to fit inside 16 bits (signed) through all stages of the transform.

This turns out to limit our design space greatly. In particular, VMX128 removes all integer multiplication operations from AltiVec – integer multiplication by scalars, where desired, has to be synthesized from adds, subtractions and shifts. This constraint ended up driving the whole design.

There’s other integer DCTs designed along similar lines, including several used in other video codecs. However, while there’s several that only use 16-bit integer adds and shifts, they are generally much coarser approximations to the “true” DCT than what we were targeting. Note that this doesn’t mean they’re necessarily worse for compression purposes, or that the DCT described here is “better”; it just means that we wanted a drop-in replacement for the full-precision DCT, and the transform we ended up with is a closer to that goal than published variants with similar complexity.

### Building an integer DCT

Most of the integer DCT designs I’m aware of start with a DCT factorization; the Bink 2 DCT is no exception, and starts from the DCT-II factorizations given in [Plonka02]. For N=8, this is equivalent to the factorization described in [LLM89]: by reordering the rows of the butterfly matrices in Plonka’s factorization and flipping signs to turn rotation-reflections into pure rotations, the DCT-II factorization from example 2.8 can be expressed as the butterfly diagram (click to zoom)

which corresponds to the DCT variation on the left of the 2nd row in figure 4 of [LLM89], with the standard even part. However, [LLM89] covers only N=8, N=16 and (implicitly) N=4, whereas [Plonka02] provides factorizations for arbitrary power-of-2 N, making it easy to experiment with larger transform sizes with a single basic algorithm. UPDATE: The previous version of the diagram had a sign flip in the rightmost butterfly. Thanks to “terop” for pointing this out in the comments!

Now, we’re allowing a scaled DCT, so the final scale factors of $\pm \sqrt{2}$ can be omitted. As explained above, this leaves butterflies (which have simple and obvious integer implementations) and exactly three rotations, one in the even part (the top half, called “even” since it produces the even-numbered DCT coefficients) and two in the odd part:

As customary for DCT factorizations, $c_k = \cos(k \pi / 2N)$ and $s_k = \sin(k \pi / 2N)$. Note that due to the corresponding trig identities, we have $c_{-k} = c_k$, $s_{-k} = -s_k$, $c_{N-k} = s_k$ and $s_{N-k} = c_k$, which means that the three rotations we see here (and indeed all rotations referenced in [LLM89]) can be expressed in terms of just $c_1$, $s_1$, $c_2$, $s_2$, $c_3$ and $s_3$.

Now, all we really need to do to get an integer DCT is to pick integer approximations for these rotations (preferably using only adds and shifts, see the constraints above!). One way to do so is the BinDCT [Liang01], which uses the decomposition of rotation into shears I explained previously and then approximates the shears using simple dyadic fractions. This is a cool technique, but yields transforms with basis vectors that aren’t fully orthogonal; how big the non-orthogonality is depends on the quality of approximation, with cheaper approximations being less orthogonal. While not a show-stopper, imperfect orthogonality violates the assumptions inherent in trellis quantization and thus reduces the quality of the rate-distortion optimization performed on the encoder side.

### Approximating rotations

So how do we approximate irrational rotations as integers (preferably small integers) with low error? The critical insight is that any matrix of the form

$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \quad c^2 + s^2 = 1$

is a planar rotation about some angle θ such that $c=\cos \theta, s=\sin \theta$. And more generally, any matrix of the form

$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \quad c^2 + s^2 \ne 0$

is a scaled planar rotation. Thus, if we’re okay with a scaled DCT (and we are), we can just pick two arbitrary integers c, s that aren’t both zero and we get a scaled integer rotation (about some arbitrary angle). To approximate a particular rotation angle θ, we want to ensure that $s/c \approx (\sin\theta)/(\cos\theta) = \tan\theta$. Since we’d prefer small integer values s and c, we can find suitable approximations simply by enumerating all possibilities with a suitable ratio, checking how closely they approximate the target angle, and using the best one. This is easily done with a small program, and by itself sufficient to find a good solution for the rotation in the even part of the transform.

The odd part is a bit trickier, because it contains two rotations and butterflies that combine the results from different rotations. Therefore, if the two rotations were scaled differently, such a butterfly will result not in a scaled DCT, but a different transformation altogether. So simply approximating the rotations individually won’t work; however, we can approximate the two rotations $(c_1,s_1), (c_2,s_2)$ simultaneously, and add the additional constraint that $c_1^2 + s_1^2 = c_2^2 + s_2^2$ (somewhat reminiscent of Pythagorean triples), thus guaranteeing that the norm of the two rotations is the same. Initially I was a bit worried that this might over-constrain the problem, but it turns out that even with this extra complication, there are plenty of solutions to choose from. As before, this problem is amenable to an exhaustive search over all small integers that have the right ratios between each other.

Finally, we need implementations of the resulting rotations using only integer additions/subtractions and shifts. Since we’re already fully integer, all this means is that we need to express the integer multiplications using only adds/subtracts and shifts. This is an interesting and very tricky problem by itself, and I won’t cover it here; see books like Hacker’s Delight for a discussion of the problem. I implemented a simple algorithm that just identifies runs of 0- and 1-bits in the binary representation of values. This is optimal for small integers (below 45) but not necessarily for larger ones, but it’s usually fairly good. And since the program already tries out a bunch of variants, I made it compute costs for the different ways to factor planar rotations as well.

The result is findorth.cpp, a small C++ program that finds candidate small integer approximations to the DCT rotations and determines their approximate cost (in number of adds/subtractions and shifts) as well. Armed with this and the rest of the factorization above, it’s easy to come up with several families of integer DCT-like transforms at various quality and cost levels and compare them to existing published ones:

Variant (c2,s2) (c1,s1) (c3,s3) Cost L2 err Gain (dB)
A1 (17,-7)/16 (8,-1)/8 (7,4)/8 32A 10S 0.072 8.7971
B1 (5,-2)/4 (8,-1)/8 (7,4)/8 30A 10S 0.072 8.7968
A2 (17,-7)/16 (19,-4)/16 (16,11)/16 38A 12S 0.013 8.8253
B2 (5,-2)/4 (19,-4)/16 (16,11)/16 36A 12S 0.013 8.8250
A3 (17,-7)/16 (65,-13)/64 (55,37)/64 42A 15S 0.003 8.8258
B3 (5,-2)/4 (65,-13)/64 (55,37)/64 40A 15S 0.012 8.8255
Real DCT - - - - 0.000 8.8259
BinDCT-L1 - - - 40A 22S 0.013 8.8257
H.264 - - - 32A 10S 0.078 8.7833
VC-1 - - - ? 0.078 8.7978

The cost is measured in adds/subtracts (“A”) and shifts (“S”); “L2 err” is the matrix 2-norm of the difference between the approximated DCT matrix and the true DCT, and “gain” denotes the coding gain assuming a first-order Gauss-Markov process with autocorrelation coefficient ρ=0.95 (a standard measure for the quality of transforms in image coding).

These are by far not all solutions that findorth finds, but they’re what I consider the most interesting ones, i.e. they’re at good points on the cost/approximation quality curve. The resulting transforms compare favorably to published transforms at similar cost (and approximation quality) levels; Bink 2.2 uses variant “B2″, which is the cheapest one that met our quality targets. The Octave code for everything in this post is available on Github.

The H.264 transform is patented, and I believe the VC-1 transform is as well. The transforms described in this post are not (to the best of my knowledge), and RAD has no intention of ever patenting them or keeping them secret. Part of the reason for this post is that we wanted to make an integer DCT that is of similar (or higher) quality and comparable cost than those in existing video standards freely available to everyone. And should one of the particular transforms derived in this post turn out to be patent-encumbered, findorth provides everything necessary to derive a (possibly slightly worse, or slightly more expensive) alternative. (Big thanks to my boss at RAD, Jeff Roberts, for allowing and even encouraging me to write all this up in such detail!)

This post describes the basic structure of the transform; I’ll later post a follow-up with the details of our 16-bit implementation, and the derivation for why 16 bits are sufficient provided input data is in the range of [-255,255].

### References

[Chen77]
Chen, Wen-Hsiung, C. H. Smith, and Sam Fralick. “A fast computational algorithm for the discrete cosine transform.” Communications, IEEE Transactions on 25.9 (1977): 1004-1009. PDF
[AAN88]
Yukihiro, Arai, Agui Takeshi, and Masayuki Nakajima. “A fast DCT-SQ scheme for images.” IEICE TRANSACTIONS (1976-1990) 71.11 (1988): 1095-1097.
[LLM89]
Loeffler, Christoph, Adriaan Ligtenberg, and George S. Moschytz. “Practical fast 1-D DCT algorithms with 11 multiplications.” Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on. IEEE, 1989. PDF
[Liang01]
Liang, Jie, and Trac D. Tran. “Fast multiplierless approximations of the DCT with the lifting scheme.” Signal Processing, IEEE Transactions on 49.12 (2001): 3032-3044. PDF
[Plonka02]
Plonka, Gerhard, and Manfred Tasche. “Split-radix algorithms for discrete trigonometric transforms.” (2002). PDF
[Johnson07]
Johnson, Steven G., and Matteo Frigo. “A modified split-radix FFT with fewer arithmetic operations.” Signal Processing, IEEE Transactions on 55.1 (2007): 111-119. PDF

A lot of bit manipulation operations exist basically everywhere: AND, OR, XOR/EOR, shifts, and so forth. Some exist only on some architectures but have obvious implementations everywhere else – NAND, NOR, equivalence/XNOR and so forth.

Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they’re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).

And then there’s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it’s not always obvious how.

### x86 bit scanning operations

On x86, there’s BSF (bit scan forward), BSR (bit scan reverse), TZCNT (trailing zero count) and LZCNT (leading zero count). The first two are “old” (having existed since 386), the latter are very new. BSF and TZCNT do basically the same thing, with one difference: BSF has “undefined” results on a zero input (what is the index of the first set bit in the integer 0?), whereas the “trailing zeros” definition for TZCNT provides an obvious answer: the width of the register. BSR returns the index of the “last” set bit (again undefined on 0 input), LZCNT returns the number of leading zero bits, so these two actually produce different results.

• x86_bsf(x) = x86_tzcnt(x) unless x=0 (in which case bsf is undefined).
• x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x) unless x=0 (in which case bsr is undefined).

In the following, I’ll only use LZCNT and TZCNT (use above table to convert where necessary) simply to avoid that nasty edge case at 0.

### PowerPC bit scanning operations

PPC has cntlzw (Count Leading Zeros Word) and cntlzd (Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there’s the old trick x & -x which clears all but the lowest set bit in x. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.

• x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x & -x) unless x=0 (in which case the PPC expression returns -1).
• x86_lzcnt(x) = ppc_cntlz(x).

### ARM bit scanning operations

ARM also has CLZ (Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned RBIT which reverses the bits in a word, which makes a trailing zero count easy to accomplish:

• x86_tzcnt(x) = arm_clz(arm_rbit(x)).
• x86_lzcnt(x) = arm_clz(x).

### Bonus

Finally, ARMs NEON also offers VCLS (Vector Count Leading Sign Bits), which (quoting from the documentation) “counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit”. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have: arm_cls(x) = x86_lzcnt(x ^ (x >> 1)) - 1 (the shift here is an arithmetic shift). The expression y = x ^ (x >> 1) gives a value that has bit n set if and only if bits n and n + 1 of x are the same. By induction, the number of leading zeros in y is thus exactly the number of leading bits in x that match the sign bit. This count includes the topmost (sign) bit, so it’s always at least 1, and the instruction definition I just quoted requires us to return the number of bits following the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we’re good.

Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.

Several long posts in the works; in the meantime, have this:

Bob, just out of the shower and still drowsy, gets a call: his dad just died. Stunned, he calls a few of his other relatives. 30 minutes later, he shows up at work, to meet his boss: “You’re late again, Bob. You had two warnings. That’s it – you’re fired.”

And so forth. If this is a story about Bob, these are high emotional stakes: Bob’s probably having one of the worst days of his life right now. This is his personal apocalypse, and we’re immersed in it, feeling every blow.

Zoom out. A thousand people in the city got the same call that morning. It’s a virus outbreak! Bob might still be one of the protagonists – the character the audience is supposed to identify with. He’s having a bad day, maybe the worst day of his life, sure – but he’s expected to get over it and somehow help the trained virologists through a series of contrived plot twists and stop the epidemic from spreading further.

Zoom out. Maybe there’s no virus outbreak. But there’s a giant asteroid heading for earth! In this kind of story, Bob might still be the everyman character who’s having a really bad day. And he will still get drawn into the plot somehow. We may be supposed to feel sorry for him, or the story might play what’s happening to Bob for laughs, but either way, there are bigger things at stake here! Get a grip, Bob!

Drama is not scale-invariant. Depending on the scope, an individual goes from the center of the universe in the story, to a piece on a chess board, to a bacterium under the microscope, and ultimately becomes completely invisible.

This, in short, is the problem I have with most time-travel stories.

Once your story widens its scope so far that the very substance of space and time takes center stage and there’s a series of delicate (or not-so-delicate) manipulations to make the “right” universe, the one where everyone goes the way it “should” be, be the “right” one (whatever that is supposed to mean), the characters and the world they inhabit become set dressing. You have put the entire universe in a neat little box, labeled it “the universe where Tim dies”, and then you zoom out and look at the tree of all possible timelines and how they branch, and you start rearranging things until you can’t see “the universe where Tim dies” in the part of the timeline that you care anymore.

In short, you are now writing a story about rebalancing a tree data structure. There happen to be people and so on in it, and you may even care about them, but at this point the scale and the stakes are so all-encompassing that it’s hard to really care, and all that’s left is a bunch of neat abstractions.

I like abstractions. I like writing abstractions, and writing about abstractions. I have written a multi-part series on multiple variants of the half-edge data structure on this very blog! And these kinds of stories tickle the part of my brain that likes abstractions and manipulating them.

But they also leave the part of my brain that is social and thinks about humans and their relationships oddly cold and unsatisfied, and I think that’s an inherent problem of this type of story, and I don’t think it can be fixed.