In the previous post I’ve talked about things you might want to know as someone who uses FFTs, this part covers all kinds of FFT implementation details, including the underlying reasons for a lot of the API complexities that showed up last time. I’ll also give some recommendations on what I think are good ways write a FFT right now, and presumably also going forward.

### Algorithm zoo

There’s a ton of FFT algorithms to choose from, but practically speaking, the broad Cooley-Tukey family of algorithms (I include variants like split-radix as part of the general family) is by far the most important.

If you don’t already know how CT FFTs work, I would recommend to avoid any sources that derive the algorithm by manipulating sums of trigonometric functions (doubly so if they do it with real sin/cos instead of complex exponentials). Way too much noise and index manipulation, nearly impossible to see what’s going on. I would also not recommend the other extreme, sources that explain FFTs purely in terms of signal flow charts/butterfly diagrams, because while that form can be useful to look at *particular* transforms (until they get too large to comfortably fit in a sensibly-sized diagram anyway), the first few (or last few, depending) levels of the decomposition are in some sense special cases (because their twiddles are trivial), and with a flow-chart form that is most of what you see. Since you’re forced to look at a concrete realization that is generally not too big and thus mostly special cases, it can be quite hard to see what the actual general construction is supposed to look like.

The least “noisy” introductory explanation I’ve ever seen for a CT radix-2 FFT (provided you know some basic linear algebra) wrote down a 8×8 DFT matrix as powers ω^{0}, …, ω^{7} of the 8th root of unity and proceeded to find and use the symmetries on it, then explained the general rule once you’d seen the particular example (it’s not too hard to figure out yourself once you start that way). I like this form: having the powers of the root of unity exposes the critical structure that is being used, but obscures the fact that of course ω^{2} is i and ω^{4} is -1, which is precisely the property that makes small signal-flow-graph based descriptions confusing to me because it leads to simplifications that break up the regularity somewhat.

If you have the required background (meaning you know what polynomial rings are and how they work), I’ve found ring-theoretic approaches the most insightful in terms of showing the actual structure of the problem without getting bogged down in the details (and in index manipulation) as tends to happen to varying degrees when working with sums of trigonometric terms, signal flow graphs or matrices. Honestly I think the most I’ve ever gotten out of a single source on the subject would be D. J. Bernstein’s “Multidigit multiplication for mathematicians” (section 7), which covers the underlying algebra for many of the major FFT variants in about 2 pages.

### DIT vs. DIF

All Cooley-Tukey family algorithms come in two dual variants, Decimation-in-Time vs. Decimation-in-Frequency. With the recursive decomposition as introduced by Gentleman and Sande, a DIT FFT for even N is decomposed into first computing two size-N/2 DFTs on the even and odd elements respectively. These two DFTs on the even and odd halves of the data can then be combined into a DFT for all N samples by first multiplying the odd-half coefficients by so-called “twiddle factors” and then adding and subtracting the even and twiddled odd halves from each other (“butterflies”).

Splitting into even and odd halves is done at every level of this recursion, and always before any of the data is modified. We would like the actual computation parts of the kernel (at the various sizes) to work on contiguous elements; what this ultimately results in is a big data permutation at the end that does all of the even/odd splits at once (which turns out to be a bit-reversal permutation, where if indices i are in { 0, …, N-1 }, the item at position i is sent to the position `bit_reverse(i)`

that literally reverses the order of bits in the binary expansion of the number.

Therefore this (now standard) version of Cooley-Tukey consists of first a permutation on the input, and then all the follow-up work can be conveniently done in-place. (We’ll go into memory access patterns later.)

The dual version of this approach is Decimation-in-Frequency (DIF). In DIF, you start with butterflies, this time between the first and second half of the input. Then you twiddle the second half and finally recurse into size-N/2 DFTs on the first and second halves (which are already contiguous), and once those partial DFTs are done, you interleave them. Again this interleaving happens as the last step at every level of the recursion, so you can skip it in the middle of the recursion and end up with FFT coefficients out of order – bit-reversed in fact, so you have a bit-reversal permutation at the end. In short, it’s the exact opposite of the DIT decomposition (if you invert a DIT FFT by reversing the flow graph, you end up with the corresponding DIF FFT, and vice versa).

The bit-reversal permutation is possible to do in-place, but it’s annoying and relatively slow. DIT FFTs, which need to do the reordering first, often prefer to be out-of-place: in their first pass, they apply the input permutation while copying to the destination, and then can work in-place for the rest of the computation. This can be viewed as a feature by saying that they leave the original data intact. DIF FFTs also want to do their computation in-place and have the reordering last; so they either overwrite the input completely leaving a completed but permuted FFT in the original input, and then do a final reordering step that copies to the destination, or they can be – very conveniently – *fully* in-place and leave the coefficients in a permuted order. This latter variant is the type of FFT that can give you permuted-order coefficients more quickly, which you might not mind as long as you’re only doing convolutions with it. The inverse would then use the corresponding DIT factorization, skipping the initial permutation step, and in fact ending up with zero shuffling whatsoever.

This is nice in principle, *however* the bit-reversed order is annoying for library users that do want FFT coefficients in the conventional order, and it gets worse when other types of decomposition (e.g. radix-4 or split-radix variants) are used which have a different “natural” order for their coefficients to end up in. In short, using DIF with whatever radix you want and then returning the unpermuted coefficients does save some work, but it does so at the expense of exposing a lot of implementation details, so it’s a bit dicey to expose as part of a library API.

Finally, the elephant in the room: multiply-adds. A radix-2 DIT step ultimately boils down to a twiddle of the odd half then a butterfly, which works out to (the additions, subtractions and multiplications here are using complex numbers, and w is a twiddle factor)

out0 = in0 + w*in1 out1 = in0 - w*in1

This looks as though it might nicely map to multiply-add operations where available, and in fact it does, although not exactly in the obvious way (I’ll get to that later). For a DIF version, remember that the order of operations was reversed, and we first add/subtract and *then* twiddle, so we get

out0 = in0 + in1 out1 = (in0 - in1) * w

which is not in multiply-add form. We might think this is a temporary setback and we just have to factor this w into the next operation using out1 to get our multiply-adds, but that doesn’t work either: out1 ends up going into another DFT layer that looks just like ours, and both of its inputs need to be twiddled, so if we delay multiplying by w, we get something like

out0 = deferred_w0*in0 + deferred_w1*in1 out1 = deferred_w0*in0 - deferred_w1*in1

Note we have two twiddles now because in general the twiddle factors feeding into the two inputs at the next level will not be the same. We can do one multiply and then two multiply-adds, but this is not great at all, and we still are deferring one twiddle multiply for the next iteration. (Maybe there is some good trick here that I’m missing?)

In short, with DIT we get a form that is amenable to using FMAs (or other types of multiply-adds) throughout; with DIF, not so much. In my opinion that severely limits the usefulness DIF + out-of-order coefficients as an attractive strategy to avoid permutations for convolutions, because now your DIF FFT potentially performs quite differently from its DIT inverse that shows up in the IFFT.

Therefore, these days, I’d stay away from permuted coefficient orders at API boundaries and just eat the permutation (and/or copy) to produce conventional-order results.

### Different radix, arithmetic complexity

The “canonical” way to write a Cooley-Tukey FFT for power-of-2 sizes uses a radix-2 decomposition, meaning at every level we subdivide into 2 sub-FFTs of size N/2. Another popular variant uses radix-4 steps, which do a slightly different calculation to immediately subdivide into 4 sub-FFTs of size N/4. This has slightly more complicated logistics, a different output permutation (corresponding to reversing the digits in a base-4 expansion of the index this time), and can reduce the number of multiplies relative to radix-2, which is what originally sparked the interest in it – the number of multiplications used to be the primary cost in the FFT algorithm, especially in the absence of hardware multipliers, much less pipelined ones.

There’s further refinements such as split-radix, which is a hybrid between radix 2 and 4, using the fact that part of the computation in a recursive decomposition is more efficiently done as radix-2 and the remainder is more efficient as radix-4 (as you might expect, this introduces further complications to the coefficient permutation, which for split-radix is quite complex). There’s a whole sub-genre of hunting down individual stray multiplications: complex multiplications with twiddle factors that are 1, -1, i, or -i are obviously unnecessary and can just be turned into real-valued additions and subtractions (optional with swapped or sign-flipped real/imaginary parts) with no actual multiplications. Slightly less obvious, multiplications with twiddles that correspond to +-45 and +-135 degrees (+-1 +- i) / sqrt(2) multiply both the real and imaginary part by the same value and can use a slightly cheaper rotation. And then there’s some more tricks with leaving values scaled to reduce the number of multiplies further.

All these tricks can be used systematically in the first few (or last few, depending) steps where these are all the twiddle factors that appear. Scalar implementations can also make use of them at other levels of the decomposition to special-case individual values, but for SIMD/SIMT architectures trying to special-case these rotations in the middle of larger blocks hurts way more than it helps.

More to the point, *all* of these techniques are about minimizing arithmetic complexity, and more specifically, reducing the number of multiplications, since the number of additions is the same for all of them – and also larger than the number of multiplies. And that’s a crucial caveat since, generally speaking, we’re on a very different cost model from counting number of individual floating-point operations these days. Let’s give a few examples to see how this matters:

- Intel Ivy Bridge (2012). Older uArchs like Sandy Bridge, Nehalem etc. are similar, all the way back down to the original Pentium Pro (1995). FP/SIMD multiplier on port 0, FP/SIMD adder on part 1, both fully pipelined, can independently receive an instruction every cycle. Thus, in the limit, FP ALU throughput is limited by whichever type of operation we have more of, and that’s additions in this case.
- Intel Haswell (2013), Broadwell (2014). FP/SIMD fused multiply-add unit on ports 0 and 1; FP adder on port 1. That’s right, HSW and BDW can do two FMAs per cycle (256b vector ones at that, too!) but only one FP add. Once again, multiplies are less of a problem than adds, funnily enough.
- Intel Skylake (2015) through Rocket Lake (2021). FP FMA on ports 0 and 1, adds also accepted on ports 0 and 1. No more add imbalance, but if we’re chasing op counts for throughput, it’s far more helpful to use FMAs efficiently than it is to minimize multiplies. (As of 2022 Alder Lake, we do get a dedicated FP adder on port 5.)
- ARM Cortex-A55 (2017). Extremely common mobile device “little” core. Two SIMD/FP pipes, the FP portion is effectively 64b wide, can do two multiplies/adds/FMAs per cycle, so again the biggest bang for the buck is by using FMAs. The newer “medium” Cortex-A710 (2021) is still two FMA-based FP/SIMD pipes although they are fully 128-bit now.
- Pretty much any GPU or DSP built within the last 20 years (and that’s conservative): built around pipelined multiply-accumulate operations. Not necessarily fused floating-point multiply-adds (which is a specific thing meaning there is only one rounding stage), but some form of multiply-accumulator, maybe fused, maybe with intermediate rounding, maybe with weird truncations in the middle, maybe fixed point, but there is some kind of multiply-add in there. If we do a multiply, we get a corresponding add for ~free, as long as we have a suitable candidate.

You get the idea. For a multiply-add based architecture, the operations we’re counting – if we’re counting anything – should be multiply-adds and not individual multiplies or adds. For a machine without multiply-adds but with separate superscalar mul and add pipes, it matters what the ratio of mul to add resources is; for integer (fixed-point), you usually get more adders than multipliers so the number of multiplications is the thing to count, but for floating-point, it was a common design point for a long time to have one pipelined FP multiplier and one pipelined FP adder, and in that case the bottleneck is whichever pipeline gets more work, which for FFTs is generally the adder. Now if your target is single-issue, has a different pipeline, or doesn’t have pipelined multipliers (or adders for that matter), the classic cost model might be relevant to you, but for newer targets with beefy vector pipelines, nickel-and-diming individual multiplies is worse than useless (since it actively introduces irregularities that are more expensive than just doing the math) and utilizing FMAs well (where available).

For what it’s worth, the classical operation costs of a power-of-2 sized FFT for the basic algorithm variants – just the leading terms:

- Radix-2: ~5N log2(N) operations
- Radix-4: ~4.25N log2(N) operations
- Split-radix: ~4N log2(N) operations

In my experience anyway, radix-2 and radix-4 can make use of all-FMAs quite easily, split-radix gets a bit iffy, so these days, if you’re gonna pick one, I think radix-4 is a solid choice.

### How do I use FMAs in this anyway?

Right. So, let’s write this out. For a DIT radix-2 step (radix-4 is analogous, just more of everything, you can figure it out), the *complex-valued* version as given above is

out0 = in0 + w*in1 out1 = in0 - w*in1

but of course we actually need to implement this in terms of individual *real* operations and expand out that complex arithmetic, which turns this into

out0.re = in0.re + w.re*in1.re - w.im*in1.im out0.im = in0.im + w.re*in1.im + w.im*in1.re out1.re = in0.re - w.re*in1.re + w.im*in1.im out1.im = in0.im - w.re*in1.im - w.im*in1.re

which makes it look like it’s 8 multiply-adds (I’ll use multiply-adds and FMAs interchangeably in the following, even though it’s not technically correct), but that’s clearly not a good way to do it. We could compute the shared “w*in1” part once, which takes 2 multiplies and 2 FMAs, and then we still have 2 adds and 2 subtractions left, so that’s not better in the “multiply-adds cost the same as isolated multiplies or adds” model. Now, what goes wrong here? Clearly, through the first half (the computation of `out0`

), we can use 4 FMAs, and there’s no redundant computations so far, so that all seems fine. But in that second half we’re mostly recomputing stuff we already computed in the first half. Luckily, there is a trick, which is actually easier to see in the complex-valued version above: we can compute `out1`

via

out1 = 2*in0 - out0

instead. This is a multiplication by a pure real number so we can do this using 2 more FMAs, giving us this computation once we expand out into reals:

out0.re = (in0.re + w.re*in1.re) - w.im*in1.im out0.im = (in0.im + w.re*in1.im) + w.im*in1.re out1.re = 2*in0.re - out0.re out1.im = 2*in1.re - out1.im

6 FMAs total. For comparison, the “classic” version using just regular multiplies and adds ends up with 4 multiplies and 6 adds/subtracts total, so our op count goes down from 10 to 6, quite the a big change (NOTE: none of this is new and certainly not my invention, I’m just trying to spare you the time of digging it all up).

Oh, and regarding that complex multiplication in the middle: there are old tricks to replace the 4-mul-2-add complex multiplication with 3 muls and 5 adds, 2 of which can be precomputed if one of the factors is known, as is the case for twiddle factors, so you end up with 3 muls and 3 adds. However, as you might expect, this algorithm is more irregular, needs 3 floats for the twiddle factors instead of 2 (and also prevents you from using tricks that might shrink this further), and even though it’s 3 multiplies and 3 adds it’s not in the form of 3 multiply-adds, although you can reduce it to an add, a multiply, and then 2 FMAs, which sounds good in theory, except…

### Is minimizing arithmetic operations actually the key?

In a word? No, not usually.

Here’s the problem. Take the 6-FMA snippet above, and assume we’re working scalar for now. That’s all the math to process a pair of complex values in a radix-2 DIT FFT step. You have to load the real and imaginary parts of both inputs (4 loads), load the real and imaginary parts of the twiddle factor (2 loads), and finally store the real and imaginary parts of both results, normally to the same place you just loaded them from (4 stores). In total, 6 loads, 4 stores and 6 arithmetic operations per 4 floats processed in each iteration.

This is not really a compute kernel so much as a memory streaming exercise with a part-time gig in arithmetic.

I was looking at a scalar version for simplicity, but the picture is exactly the same when looking at a vectorized version that processes multiple samples at once (because FFTs are so regular, vectorizing them is, for the most part, straightforward). It also doesn’t actually matter here whether a vectorized version uses a split or interleaved layout internally; interleaved has the real and imaginary values next to each other, but having complex numbers packed that way in SIMD registers means that instead of N “real-valued” SIMD lanes, we effectively end up with N/2 “complex-valued” SIMD lanes. The amount useful work per individual float remains unchanged (at best; in practice, using an interleaved format inside the kernel often picks up overheads from elsewhere. I’ll get to it later).

What we’re looking at here is the real problem with radix-2 factorizations, which admittedly is especially amplified when using the FMA versions of the core kernel: the ratio of computation to memory operations is unfavorable. Now exactly how bad this is or isn’t depends on what machine you’re targeting, and what the proportion of compute to memory resources is, but more loads/stores than arithmetic ops (or even a 1:1 ratio as you would get without FMAs) is never good news. This is also the real reason why the 3-mul 3-add complex multiplication which needs 3 floats per twiddle is not that attractive: it decreases arithmetic and increases memory load even further.

Radix-4 happens to have fewer multiplies, but much more importantly, it does the work in roughly half the number of passes, and thus halves the number of coefficient loads/stores. You can reduce this even further by using yet larger radices (if you have enough registers available to keep all those values around), but the biggest single jump is from radix 2 to radix 4.

Another note on radices: there’s a distinction between radix-4 and what I’ve seen called radix-2^{2}, a name that feels clunky but that I’ll keep using for lack of a better suggestion. The former is an actual radix-4 step, which is to say, for DIT, we have 3 twiddles “in front” into a 4-point DFT kernel (all multiplications by 1, i, -1 and -i, which are all not actual complex multiplications, just swapping values around and some negating). The latter is essentially two unrolled radix-2 passes chained back to back. In terms of arithmetic complexity, the difference is that a “proper” radix-4 kernel has 3 twiddle multiplies, while the radix-2^{2} variant contains 4 instances of a radix-2 sub-kernel that each does one twiddle multiply. However, classic radix-4 needs to load those 3 twiddles from somewhere; radix-2^{2} only needs to load 2 different twiddle factors (which both get used twice). Considering our woes with the amount of computation vs. amount of memory operations required, and that FMAs change the trade-offs for the effective cost of multiplications in a radix-2 kernel, radix-2^{2} is a very interesting candidate!^{1}

Point being: we can save a fair amount of unnecessary memory operations by using larger radices. This can use actual larger-radix algorithms, or it can use a radix-2^{k}-style approach where we may just load a bunch of values into registers and complete 2, 3 or more levels of the factorization at once while they’re there. The difference in twiddle factor loads really starts to ramp up as we do more of these: a “real” radix-8 FFT needs to load 7 complex twiddles, a radix-2^{3} step only needs 3 (and classic radix-4 into radix-2 would need 4). Combining steps like this, provided we have sufficient registers, lets us increase the amount of computation in between loads and stores to a level where we’re actually keeping the FP/SIMD units respectably busy and not just spamming memory operations.

### Memory operations, the second

So far, I’ve been looking at memory operations purely in terms of the number of instructions that need to execute. But what about actual cache effects?

Hoo boy. A lot has been written about this subject, too much of it overgeneralizing, using really broad strokes or just working from unstated, questionable assumptions. Lest I go full rant mode, I’ll try to avoid editorializing for the most part. A lot of what you will read on this topic, be it in textbooks, on the web, in forums (remember forums?) or papers, is either questionable or just plain bad.

The most important news first: if you’re using either the standard DIT or DIF factorizations which are in-place except for that initial (or final, depending) permutation step, then as long as you’re working on a contiguous array and the transform size is small enough so the coefficients fit in about half your L1 data cache, you’re pretty much going to be fine no matter what. Your data comfortably fits in the L1D cache, will be repeatedly accessed and kept hot, and nothing bad is going to happen.

L1D caches on CPUs these days are usually 32KB or larger, and that’s been true for a good while (it was already true on desktops 15 years ago, for example; these days mobile devices have also caught up in that regard). That’s plenty for a size-2048 complex FFT with 32-bit floats. That covers a lot of the sizes you’re going to see in practice! Beyond that, here’s some of the statements I’ve seen, and what I think are the important caveats to all of them:

“DIF has better cache behavior than DIT”: this one’s a bit tricky. With power-of-2 sizes, all FFT algorithms are going to do a lot of memory accesses with exact power-of-2 strides between them, and that is a recipe for running into “fun” memory subsystem problems. As discussed before, both DIF and DIT run the same steps, just in reverse order (relative to each other). What is true is that a DIT FFT has a mandatory permutation step in front, whereas DIF can avoid it entirely if reordered coefficients are acceptable. Also, in-place bit-reverse permutations are a tricky mess so DIT FFTs are more likely to out-of-place and do the permutation as part of a copy – and it is easy to do that permutation in a way that has exceedingly poor memory access patterns. So, yes, a reordered-coefficient DIF FFT is easy to implement fully in-place and efficient, and it can avoid the reordering step with its memory access troubles entirely. A DIT FFT is more likely to be out-of-place and do extra copies, and if the reordering isn’t done carefully, it can have bad cache performance – this is a solvable problem, though.^{2} And DIF has its own problems which I already went into earlier.

“Iterative FFTs are bad for the cache, use a recursive decomposition”: iterative FFTs will generally have an outer loop over the pass index (which determines current transform size), and a nested loop inside it that does all sub-FFTs of that size (either as function calls or just have that one loop handle all of them). Recursive approaches use, well, recursion; a size-N transform will internally call into smaller transforms for part of the data, and then either pre- or post-process them. That means that for transforms too large to fit into the L1D (or L2, or L3, …) cache, a straight iterative approach will stream the entire dataset into and out of the cache once per pass, while a recursion subdivision will eventually narrow down to a transform size that fits in the cache, and then all the sub-transforms on that subset will have the data warm in the data cache. Don’t get me wrong, this is good, but the straight linear loop structure of iterative FFTs is also good, and one recursive call per length-8 or length-16 or whatever your “leaf” transform size FFT is isn’t awesome. Mostly, I disagree with the unspoken assumption that they’re mutually exclusive. Iterative FFTs are great at handling the tons of small sub-FFTs near the leaf levels of the decomposition. Recursive FFTs are cache-oblivious and automatically “right-size” themselves. So just use both! Recurse down to a medium transform size that you’re sure is going to fit in the cache (I’d say somewhere from 256 to 1024 complex elements, depending on just how low-spec your smallest target is), and you can let the iterative approach make short work of all the tiny sub-FFTs while still getting all the nice automatic adaptation that recursive decomposition provides for larger sizes.

Over-generalization from old HW: this is in some textbooks, papers and forum posts. Especially late-80s and early-90s RISC workstations tended to have small, direct-mapped data caches that had completely pathological behavior on memory accesses with the right (or rather wrong) power-of-2 strides, and FFTs tend to hit all of these potholes one by one. People learned some very painful lessons in those days when they got burned quite badly by this, set-associative caches with multiple ways are not a luxury feature these days, and they have substantially defused this particular problem. It’s still possible to hit pathological behavior, but you have to try a bit harder, and there’s decent ways to work around it.

“You should use exclusively cache-oblivious approaches/the four-step algorithm/insert other plug here” – use whatever algorithms you want, but I would recommend doing your own research and not just blindly following recommendations that some rando typed into a forum in 2002. There’s a lot of algorithms out there but many of them are quite old and written in a very different situation without floating-point hardware, with slow multipliers, little or no access to instruction-level parallelism, and very different relative costs of compute to memory. The good news is that large FFTs used to be a big-iron HPC problem and these days, even legitimately large FFTs over many million data points can be computed in a fraction of a second on a battery-powered cellphone. It’s pretty hard these days to get into a problem size for FFTs that actually runs into some serious hardware limits. The bad news is that a lot of what’s been written about FFTs dates back to times when this was a much bigger concern and hasn’t been updated since. This post is, in part, me writing up what I still remember from about 4 weeks worth of reading the literature back in 2015, trying to give you the bits that I think are still useful and applicable and intentionally avoiding all the material that is, I think, past its sell-by date.

Finally, a disclaimer: I have personally written and used FFT kernels for small to medium problem sizes (everything still fits comfortably in a 256k L2 cache). I have never actually had reason to use FFTs with transform sizes above 32768 in an application, nor have I optimized or debugged an implementation that provides them, so anything that concerns large transforms specifically (which shifts us even more into a memory-bound scenario) is not something I can comment on.

### Stockham’s auto-sorter

There’s a Cooley-Tukey variant usually credited to Stockham that avoids the initial (or final, for DIF) bit-reversal, digit-reversal or whatever permutation by instead performing the reordering incrementally, interleaved with the processing passes. This was popular on vector machines and is sometimes suggested for SIMD.

It works just fine, but I’m not a big fan, for two reasons. The first is that Stockham gives up on any in-place processing entirely and instead “ping-pongs” between two buffers on every pass, which means it has about twice the active working set in the cache as an in-place algorithm (and more traffic in upstream caches or memory when it doesn’t fit anymore). The second is that this picks up extra data-shuffling work in every single pass; a batched reorder can be combined with other input processing (for example to turn data into an internal format), maybe an initial radix-2 or radix-4 pass, and also be done in a way that avoids thrashing the cache unnecessarily, because it’s a data-movement pass at heart and can be designed with that in mind. Then the actual workhorse FFT passes can focus on the math and memory access and don’t have data shuffling in the mix as well.

### SIMD

So far, I’ve discussed everything pretending we’re working with scalars. But FFTs (power-of-2 sized ones, anyway) are generally pretty easy to vectorize.

If you have the data given as separate arrays for real and imaginary parts, it can be almost as simple as replacing every scalar load/store with a vector load/store and every scalar arithmetic operation with the corresponding vector arithmetic operation. You might need to handle the first few (or last few, depending) passes specially, where you’re working first with pairs of adjacent elements and then groups of 2, 4 etc., but once the distance between elements you’re processing becomes larger than your vector width (which is what most of your passes will be working with), it’s straightforward.

It does mean that your initial (or final) few passes end up special. However, these are the passes right next to the input/output permutation, and combining the early special-case passes with the data permutation tends to solve problems in both: the data permutation usually involves some kind of SIMD transpose, and moving some of the math across that transpose boundary separates what would otherwise be intra-vector “horizontal” dependencies out into multiple vectors. As for the data movement, that is ordinarily just loads, stores and shuffling; having some arithmetic in there helps the instruction mix and makes it less likely to be bottlenecked purely on memory operations or shuffling.

The pure deinterleaved format is probably the easiest to understand conceptually, but it’s not very commonly used. Fully interleaved (all real, imaginary value pairs) is the other extreme. That’s a very common format for complex-valued input data, and it usually works reasonably well to do the FFT in that form directly – for complex multiplications specifically, you do tend to need some amount of shuffling and special instructions for complex multiplication (“addsub” and “fused multiply with alternating subtract/add”, most notoriously), but these instructions are often available. That said certain things which can just be dealt with implicitly in a fully deinterleaved form (such as swapping real and imaginary parts, or computing a complex conjugate) turn into actual instructions when working interleaved, so there’s a lot of small random fix-ups involved; again, I’m not a big fan.

Yet another option is to keep the data interleaved, but at the granularity of full vectors (or some larger granularity that is a multiple of the vector size). For example, for 8-wide SIMD, you might store real parts of elements 0-7, then imaginary parts of elements 0-7, then real parts of elements 8-15, and so forth. This is still addressed by a single base pointer and has nice data access locality characteristics, but is just as (if not more) convenient as fully deinterleaved for SIMD code. The trade-off here is that your vector width (or other interleaving granule size) now becomes part of your interface; I like this format *inside* FFT kernels, less so as part of their public interface. (I’ve been experimenting a bit with this though. Maybe I’ll write a follow-up on this in particular, some day.)

Whenever you don’t work internally with the format you have in your API, there’s going to be some translation at the ends. One end already has the input permutation and can generally rewrite the complex number format into whatever you’d prefer with no trouble. Then for the output you need to translate it back; ideally this is fused into the last pass, but one problem that does crop up with these SIMD variations is that you already have 2 or 3 versions of everything (data permutation combined with initial passes, then a standalone radix-4 or whatever pass, and adding another “final pass” with a different permutation on output can get to be a bit of a chore. It’s up to you how many variations you want to generate (or hand-write if you’re into that sort of thing); personally I tend to avoid having too many pre-built combinations and would rather live with a bit of inefficiency at the interface boundaries than dealing with the headache that is a combinatorial explosion of algorithm variants.

### Non-pow4 sizes, inverses, real-input or real-output FFTs, trig transforms, etc.

Non-pow4 first: anything based on radix-4 (or radix-2^{2}) factorizations needs to deal with the fact that half of all powers of 2 aren’t powers of 4. For those, you’re going to need one radix-2 (or radix-8, or radix-2^{3}) pass in there. There’s not much more to it than that, but it seemed worth writing up just to be explicit.

IFFTs are usually the same transform, just with the twiddles conjugated (scaling with 1/N or similar is often left to the caller). There’s other ways to reduce IFFTs to regular FFTs by swapping around coeffs etc. but I’d advise against it; depending on how your twiddles are stored, you might be able to produce the conjugate twiddles easily, but it’s also not too bad to just have a second version of the FFT kernel around for inverse transforms.

Real-input/real-output, well, there’s multiple ways to handle that, but the easiest is to “bitcast” the real input array into a size-N/2 interleaved complex number array, compute a complex size-N/2 FFT (which due to linearity gives FFT[even_reals] + i*FFT[odd_reals]), and then do a final radix-2 pass to combine the even/odd halves and compute the actual outputs. The inverse of that does, well, the exact same steps in reverse. I’ll not go into the details here but this is a classic algorithm. It’s not the most efficient but it’s reasonably and easy.

Other trigonometric transforms can also be reduced to inner FFTs with pre- and post-passes like that. If you have a decent FFT around, I’ve found this generally preferable to using specialized O(N log2(N)) algorithms specific to these transforms; trigonometric transforms generally have such direct algorithms, but their structure is almost always less regular than a FFT, so reduction to whatever-size FFT doesn’t have a lot of wasted work and then pre-/post-passes is usually my go-to.

### Fixed point

I haven’t mentioned it at all. This omission is by design, not out of neglect. Floating-point and SIMD floating-point is widely available these days, and fixed-point signal processing is becoming increasingly niche. Some of the trade-offs are different with integers (adds are cheaper; DSPs usually have integer multiply-accumulate but many integer or SIMD integer instruction sets do not) and all I’ll say on the topic is that the last time I’ve had to deal with a fixed-point FFT is about 15 years ago, and I haven’t missed it. Good luck to everyone who still has to deal with it on a regular basis, but I’m just happy that’s not me anymore.

### Twiddle storage

Twiddle factors are, essentially, a sine/cosine table. There is definitely a temptation to try and get cute here – maybe try to save some memory by only storing part of the values and inferring the rest by symmetry, or by realizing that the twiddles for FFTs of size N/2 are just all the even-numbered elements of the twiddles for size N here, and so forth.

In short, resist the temptation to do anything fancy with them. You want them in a form that is easy to load and doesn’t require a lot of bookkeeping, and trying to exploit symmetries too much to save on your sine table storage will probably not save you all that much memory but very well might make your hair go prematurely gray.

There’s an exemption for *really* straightforward stuff. For example, if you’re working in deinterleaved form, you need imaginary parts of twiddles (=sines) and real parts of twiddles (=cosines) both. Since cos(x) = sin(pi/2 + x), you can just store a sine table and use it for both, starting the cosine lookups a quarter of the way in. That is simple enough to not cause trouble (in my experience anyway), and it will also helps your data cache footprint, so it’s actually worthwhile.

Twiddles are one aspect where my lack of experience with large transform sizes shows: I expect that for large enough sizes, once your working data set and your twiddles are both streaming through not staying in the relevant cache levels, the extra traffic is an issue and you’d much rather spend some more arithmetic in the FFT kernel to compute twiddles (incrementally or otherwise) than eat avoidable extra loads from memory. However, I’ve not been in a situation where I’ve had to care, so I don’t really know one way or the other.

### Takeaways

This is a long post and I’ve been at it all day, but I’ll try to give you the summary of my personal recommendations:

- Cooley-Tukey DIT is, on balance, probably your best bet.
- Nonstandard-coefficient-order FFTs are cute but probably more trouble than they’re worth these days.
- Do use FMAs where available (which is quite common these days), and if you design any FFT code now, plan with them in mind. Also, FMAs are the great equalizer as far as the cost of different factorizations is concerned.
- Don’t sweat old-school operation counts, they’re optimizing for the wrong things these days.
- Prefer radix-4 (or radix-2
^{2}), less so for lower number of multiplies, and more because having fewer passes and fewer loads/stores is good and memory operation count is not negligible for this. - You don’t have to pick strictly between iterative and recursive approaches, and you can get the best of both worlds by recursing for larger to medium sizes, and handling the small sub-FFTs iteratively.
- Do worry about number of memory accesses and your access patterns, especially for things like digit-reversal permutations.
- SIMD: yes please; Stockham: probably not.
- Data layout-wise, it’s really up to you.
- Twiddles: Keep it simple, stupid!

And I think that’s it for today!

### Footnotes

[1] You can also get a “proper” radix-4 kernel with 2 complex twiddle loads instead of 4. The high concept is that instead of twiddling with ω_{N}^{k}, ω_{N}^{2k} and ω_{N}^{3k}, you can substitute the 3k twiddle with -k – which is just the complex conjugate of the first twiddle, so that doesn’t need a separate load. This substitution works fine but does affect the coefficient permutation (are you starting to see a pattern emerge?); see e.g. Bouguezel, Ahmad, Swamy, “Improved radix-4 and radix-8 FFT algorithms”, 2004 IEEE International Symposium on Circuits and Systems. The same trick can be used on split-radix factorizations to yield what is now usually called the “conjugate-pair split-radix factorization”. That said, as I keep repeating, with a FMA-counting cost model, the arithmetic operation difference between a radix-4 and radix-2 decomposition doesn’t really materialize, and radix-2^{2} comes out looking pretty good.

[2] Not to leave you hanging with a mysterious statement, one possible approach is in Blake, Witten, Cree, “The Fastest Fourier Transform in the South”, IEEE Transactions on Signal Processing, Vol. 61 No. 19, Oct 2013 – in short, be careful about how you implement the permutation, be sure to grab some adjacent data while you’re there, and also, once you have that data in registers, might as well do an initial radix-2 or radix-4 pass along with the reordering.

I was just looking over SIMD FFT code I wrote in 2015 for Bink Audio and Miles Sound System to replace the old, all-scalar implementation we had been using since (presumably) the late 90s. That in turn reminded me of writing that code originally, reading a lot of papers on the subject, and how I wished at the time that there were a write-up going into what mattered (and why) and what didn’t. It’s 8 years too late to save myself some time, but it might still be handy for you!

I’ll try to cover everything interesting to *users* of FFTs first. I’ll do a follow-up post that is only really interesting for FFT implementers. All of this assumes some familiarity with DSP concepts.

### Big picture: what do you need that FFT for, and what kind?

FFTs get used for many things, but there’s 3 main classes of use cases that all have somewhat different requirements:

- Actual spectrum analysis. You have some signal that you want to see the discrete Fourier spectrum of. Generally uses windowed FFTs (aka Short-Time Fourier Transforms, STFTs). Typically wants FFT coefficients in the “proper” (natural) order. Large transform sizes (let’s draw an arbitrary line in the sand and call any transform size above 4096 “large”) are common. All-real inputs are more common by far, and usually just one direction of transform (let’s call it “forward”) is used.
- Convolution or polynomial multiplication. Here, you’re just using the FFT as an intermediate step in a convolution algorithm. For convolution, real inputs are more common by far. For polynomials, you see complex a bit more often, but still not that common. If you don’t want cyclical convolution (and you usually don’t), not only are your inputs and outputs often real, the second half of FFT inputs is usually all zeroes and the second half of IFFT outputs is usually discarded; this can be used to speed up the transform if you care to exploit it. When used for convolutions, you don’t typically care whether coefficients are in the right order or not (for convolution, all you’re doing is element-wise multiplications in frequency space; reordered coefficients cause no real problems.) Can use quite large transforms but often doesn’t, even for uses cases like fairly long convolution reverbs (see below). Uses both forward and inverse transforms, in about equal proportion if one side of the convolution (e.g. a filter kernel) is fixed and pre-processed in advance; twice as many forward as inverse transforms if not.
- As building block in other DSP algorithms. Many other trigonometric transforms used in practice, e.g. DCTs and MDCTs, reduce to FFTs with some pre- or post-processing, sometimes both. The inner FFT here is usually a complex-valued one. Usually wants coefficients in order. It’s rare to see large transform sizes for this use case.

FFTs “want” to work in the complex domain and this is the most natural form for them. Real-valued signals can be turned into complex-valued signals by just adding all-0 imaginary parts, and then you can use a complex FFT, however this wastes memory and work; a “proper” real-input FFT doesn’t require blowing up the input into complex numbers, can compute just half the spectrum (the other half is implied by symmetry), and as a result uses about half the memory and is nearly twice as fast. One caveat: for a FFT that works on complex numbers, the difference between a “forward” and “inverse” FFT is very small (literally a sign flip), and there is not any universal agreement on which sign the “forward” transform should be (sigh), but it also doesn’t actually matter for many applications. Once you do a real FFT and corresponding IFFT, well the real FFT takes N real-valued inputs and outputs N/2 complex numbers (really N/2-1 complex numbers and 2 real numbers, but the two reals are frequently packed together into one complex value) and the real IFFT takes those N/2 complex numbers and outputs N reals, so they are not interchangeable and you need to use the right one.

### Big picture: transform sizes

*In principle*, any size FFT can be computed efficiently, in the sense that you get an O(N log N) asymptotic complexity bound no matter what. In practice, the practical FFT algorithms that you actually want to use in most cases need the transform size to be of a certain form to be efficient, and the arbitrary-size FFT algorithms work by reducing a size-N FFT to a different FFT of a nearby “convenient” size, again with pre-/post-processing.

The most important class of algorithms in practice (Cooley-Tukey and descendants) wants highly composite numbers with few different prime factors; powers of two (just factors of 2) are best. Having a single factor of 3 or 5 in there is also often supported, which gives you more convenient spacing of transform sizes: for example, if you just need slightly more than 1024 elements, you don’t have to jump all the way to 2048 but might instead use 5*256 = 1280.

If possible, and there’s no strong reasons to do anything else, stick with just powers of 2. If that’s not an option, try to make do with exactly one factor of 3 or 5 in there. Any other “odd” sizes will likely end up getting reduced to one of those sizes in the end anyhow.

### Big picture: libraries, interfaces, what to call?

If you’re a user of FFTs, and you’re interested mostly in real-valued data (you often are), check if your FFT implementation of choice supports real-valued input or has inverse transforms that produce real outputs from half a spectrum, instead of manually expanding real values to complex values with zero imaginary part. That’s a fairly easy near-2x speed-up, usually.

FFTs usually need some description of the transform on the side, depending on the transform type and size. This is a data structure that usually contains twiddle factors (which effectively means a sine/cosine table, sometimes in funny order) and some auxiliary data (tables or otherwise) used during coefficient reordering.

Some FFT libraries only support a small, restricted set of transform sizes (usually just certain powers of 2), and these can usually avoid a “setup” structure of “FFT plan” or similar entirely. It’s a simpler API but more restricted.

There’s in-place vs. out-of-place. This is really more of an implementation detail, but sometimes it matters. Some approaches lend themselves to a fully in-place implementation (where the input and output are the same array), others don’t. “Real” in-place needs less memory but has more complicated logistics and sometimes inefficiencies. Asa a result, often in-place functions exist but are just an API entry point you can call that internally allocates workspace memory, works out-of-place, then copies the results back. Sometimes you also have API functions that take an optional pointer to workspace memory; if those exist, that usually means they need it and if you don’t give them workspace memory, they’ll allocate it themselves and free it once they’re done, so allocating the workspace once and passing it in can save a lot of allocation churn if you’re doing a lot of transforms. In short, there’s a lot of variations here, and this part is often poorly documented, but this is something you’ll want to look into because eating a lot of unnecessary allocations and copies in this type of DSP functionality can contribute a significant cost.

Then there’s complex number format. There’s two main contenders: interleaved and split. Interleaved has complex numbers stored as pairs of real part and imaginary part right next to each other. This is the format most commonly used for complex number data types (like C99 built-in complex type or C++ `std::complex`

). FFT algorithms can work in this form directly, and some do, but it can be advantageous to split the data into two arrays, one containing just the real parts and one just the imaginary parts of each value. In general, interleaved form is pretty much always supported, but if the library supports the split format, that is usually the faster one, and it might be more convenient for your code as well, so it could be worth looking into.

Finally, permuted coefficients vs. not. If you only want to use a FFT for convolutions, you might not care about the order coefficients are in. Some algorithms can produce a FFT with coefficients in a non-standard order at a lower cost. You don’t see them that much anymore for reasons I’ll get into later, but if you use a FFT library that supports this option, and you only care about convolutions, it might be interesting for you.

With that all said, I know the question of library recommendations is going to come up, so for what it’s worth: if you’re on an Apple platform, vDSP ships as part of the Accelerate framework, is always present, and looks fine to me (although I haven’t tried the DFT/FFT functionality in particular), so there’s at least a few platforms where there’s an obvious choice.

There’s always FFTW, which I have last used in the early 2010s and don’t have fond memories of. It’s a big library with external dependencies, not easy to build or use, a ton of code (around half a megabyte last I checked), and for a while anyway it seemed to lag behind quite a bit in SIMD instruction set support (that might have been fixed by now, or maybe that was just a configuration problem on my part). FFTW does have pretty much everything once you get over that integration hump, though.

Intel has FFT functionality as part of their MKL (Math Kernel Library). I haven’t used it for FFTs but MKL in general tends to offer excellent performance *on Intel CPUs*. They detect CPUs by model number and mysteriously fall back to their slowest unoptimized kernels on any x86 device that doesn’t report as Intel (unless you override it with an environment variable), which is a reason for me not to use it on general principle but YMMV. MKL is also fairly big (and of course x86-specific) so if distribution size is a problem for you and you just want a FFT this is probably not what you want.

The one that’s easiest for me to recommend is Mark Borgerding’s KISS FFT. It’s not the fastest FFT around but it’s small, simple, and easy to build and integrate. If you find yourself bottlenecked by FFT speed you’ll want to look into something more advanced but if you just need a FFT-type algorithm to replace a calculation that would otherwise be O(N^{2}) with something that’s O(N log N) and doesn’t have a huge amount of wasted work, this fits the bill. You’re losing a constant factor relative to what you could be getting with proper vectorization so be advised that there’s faster options when that becomes a problem.

### Big picture: convolutions

Convolutions are a fundamental signal processing operation for filtering (and also for effects such as convolution reverb in audio processing), and by what’s usually called the Convolution Theorem, time-domain convolution (the operation we’re interested in) corresponds to frequency-domain point-wise multiplication. (Vice versa, frequency-domain convolution also corresponds to time-domain point-wise multiplication, which is the math behind how windowing affects the FFT).

The theory is this: convolving a length-N signal X by a length-M signal Y directly takes N*M multiply-adds (the first one is just a multiply, but close enough). If we take the first signal to be our data and the second to be our filter, that means convolving by Y takes M operations per sample. (It stands to reason that we have to *some* work for every output sample on average.) This gets bad if Y is long. In the aforementioned convolution reverb, Y could be a 5-second sample at 48kHz sampling rate. That’s 240,000 multiplies per sample, which is what we in the business refer to as “not ideal”. The convolution theorem tells us we can do a Fourier transform, then one multiply per sample in the frequency domain, which is 4 real multiplies, and finally we do a inverse Fourier transform back. Fourier transforms are (mumble mumble) something something O(N log N), or I guess in our case O(M log M), so maybe this is O(log M) arithmetic operations per sample, and log2(M) is 18. Even assuming some fairly high constant factors here we’d expect this to work out to ballpark a few hundreds of operations per sample, roughly a 3 orders of magnitude speed-up.

Only problem being: I glossed over all the details of how exactly that actually works. This is not the subject of this post, so I’ll keep it brief, but there’s a bunch of important techniques here to actually get this to work, and this is another one of those things that are often glossed over, so I want to at least put all the search keywords in here so you know what to look for if you actually want (or need) to implement this.

First, we want “discrete convolution”, which is the convolution operation that belongs with what’s usually called the DTFT (Discrete-Time Fourier Transform). The FFT is a realization of the DFT (Discrete Fourier Transform) *not* DTFT, so it’s a different type of transform. DFT is like DTFT but for *periodic* signals, and it also has a convolution operation it’s associated with for its version of the Convolution Theorem, but the convolution we get with the DFT/FFT is what’s called *cyclical* convolution – it wraps around the signal (which is assumed to be periodic). If we want the no-wraparound (discrete) version, we can get it from the cyclical version by padding the signal with extra zeroes and using a long enough transform length that no actual wrap-around takes place. For our length-N signal, length-M filter setup, we can pad both out to be N+M-1 samples long, and then use cyclical convolution (which is just point-wise multiplication of DFT/FFT coefficients) to get our discrete convolution result.

That works, but if we did that directly, we’d need the whole input signal in advance, and we also get O((N+M) log2(N+M)) operations for the whole job, so the amount of operations per output sample has a dependency on the input length. That’s not only worse than advertised, it also fundamentally doesn’t work in a setting where you get the signal in real-time, because N (the eventual length of your input signal) is not something you know while it’s happening, and you don’t know all the future samples yet. So the next step is to realize that no, of course you don’t need full knowledge of all samples past and future to do a convolution with a finite-length filter. What you can do is cut your source signal into small pieces, say also of size M, filter those individually, and then combine the results. This works just fine and the search term for the main approaches are “overlap-save” and “overlap-add” method. This gets the dependency on N out of there entirely, and we actually do have O(log(M)) asymptotic operation count per output sample now. Nice.

This is now good enough for offline editing or batch processing, but still doesn’t work for something like real-time audio, because we work in units of big blocks, and for our M around 240,000, we need blocks of roughly that many input samples to do the job. That’s 5 seconds of audio. If we need to wait until we have more than 5 seconds of audio to do another batch, that means we can’t do it without stuttering unless we introduce 5 seconds of latency. Once again we’re off by several orders of magnitude – we might be willing to live with 20-30ms of latency, although for Pro Audio a real 5ms or thereabouts would actually be nice. Direct convolution doesn’t have this delay. As soon as we have a new input sample, we can compute the corresponding output sample – although it might take us a lot of operations to do so. FFT-based convolution amortizes the effort greatly, but introduces unacceptably large latency when used directly with large kernels. So what do we do?

Well, several tricks. First, we can use the same “cut into pieces” trick we used on the input signal on the filter as well; for example, we might cut it into pieces 1024 samples long (duration 21.3ms at 48kHz sampling rate), which is getting towards the upper end of what we’re willing to tolerate. Then we can use overlap-add/overlap-save style techniques on these pieces to build our convolution with a 240,000-sample signal out of 235 convolutions by a length-1024 signal (this is effectively just another direct convolution, just between corresponding coefficients of 1024-sample blocks). We don’t need to redo all these every time, every 1024 samples we transform one new block of 1024 and keep the older ones around – the search term here is “Frequency-Domain Delay Line”. That adds a few hundred multiply-adds per sample to our tally, but a few hundred extra we can live with.

Next, that 1024-sample latency is also not actually necessary. It’s true it takes 21.3ms to complete a 1024-sample block, but that doesn’t mean you have to twiddle your thumbs while that’s going. The problem is really only with the “beginning” parts of Y – which get multiplied by the most recent samples from X. Once you’re 1024 samples into Y, those values get multiplied by “old” samples in X (at least 1024 samples old), so there’s not nearly as much of a rush. So we can use direct convolution with just the initial part of Y to keep our latency down (it can now again go about as low as direct convolution can), and then use optimized FFT-based convolution for the contribution from more distant parts of the signal. That contribution from the direct convolution is now contributing something like a 1024 multiply-adds per sample, which is starting to hurt, but at least we can hit low latency now.

Finally, the actual solution in practice: this realization that the big blocks are only a problem in terms of latency for recent operation is true in general, and lets us use this whole process recursively. For example, in our initial 1024 samples, the first 64 samples (1.3ms worth at 48kHz) are latency-critical and probably need direct convolution. Past that point, we might switch to length-64 blocks with FFT-based convolution. At some point past 1024 samples into the “past”, we might switch over to length-1024 blocks. And for further back-in-time samples, we might switch to length-16384 blocks or so (our 5s impulse response fits into 15 16,384-sample blocks, so the 15 resulting overlap-adds/overlap-saves we don’t worry about much, but if it were longer we might do this a few more time with increasing sizes). There’s some subtlety about what block sizes to use and when to switch (you might think it would be good to use pow2 sizes and greedily double the block size as soon as we can every time, but turns out that’s not work-optimal), and there’s also other concerns such as how much processing time the various transform sizes take and when to schedule them etc., but this is the actual “good” convolution algorithm for real-time applications, and it’s called “Non-Uniform Partitioned Block Convolution”.

All right. This was just a very brief, very rough overview, and I would absolutely not recommend trying to fill in the details from my few paragraphs, but especially with convolutions, I’ve several times seen code that really does a 2D FFT on a whole image or similar to convolve with a 50×50 pixel kernel, and anecdotally it seems that outside the domain of real-time audio, the better techniques are not nearly as well-known as they ought to be, hence this extended detour into convolution algorithms.

This concludes my notes on the subjects that might be interesting for FFT users. Next part will be for FFT implementers only, pretty much.

Back in 2007 I wrote my DXT1/5 (aka BC1/3) encoder rygdxt, originally for “fr-041: debris” (so it was size-constrained). A bit later I put up the source and Sean Barrett adapted it into “stb_dxt”, which is probably the form that most know it in today.

It’s a simple BC1 encoder that gives decent quality, the underlying algorithm is reasonably fast (fast enough to say bake textures you produce once per session in a game from a character creator, say, which is one of the cases I’ve ended up using it in in a professional context), and is easy to integrate.

The basic algorithm uses the same primitives most BC1 encoders use (I’ll assume in the following you know how BC1 works): compute the average and covariance matrix of the block of pixels, compute the principal component of the covariance to get an initial guess for what direction the vector between the two endpoints should point in. Then we project all pixel values onto that vector to find the min/max support points in that direction as the initial seed endpoints (which determines the initial palette), assign each pixel the palette entry closest to it, and do some iterative refinement of the whole thing.

rygdxt/stb_dxt only uses the 4-color mode. It did not bother with the 3-color + 1-bit transparency mode. It’s much more niche, increases the search space appreciably and complicates the solver, and is rarely useful. The original code was written for 64k intros and the like and this was an easy small potential win to give up on to save on code size.

Nothing in there was invented by me, but at the time I wrote it anyway, DXT1/BC1 encoders (at least the ones I was looking at) were doing some of these steps in a way that was more complicated than necessary. For example, one thing I vividly recall is that one encoder I was looking at at the time (I believe it was Squish?) was determining the principal component of the covariance matrix by forming the characteristic polynomial (which for a 3×3 matrix is cubic), using Cardano’s formula to find the roots of the polynomial to determine eigenvalues, and finally using Gaussian Elimination (I think it was) to find vectors spanning the nullspace of the covariance matrix minus the eigenvalue times the identity. That’s a very “undergrad Linear Algebra homework” way of attacking the problem; it works, but it’s complicated with a fair amount of tricky code and numerical issues to wrestle with.

rygdxt, with its focus on size, goes for a much simpler approach: use power iteration. Which is to say, pick a start vector (the only tricky bit), and then iterate multiplying covariance matrix by that vector and normalizing the result a few times. This gives you a PCA vector directly. In practice, 3-6 iterations usually sufficient to get as good a direction vector as makes sense for BC1 encoding (stb_dxt uses 4), and the whole thing fits in a handful lines of code with absolutely nothing subtle or numerically tricky going on. I don’t claim to have invented this, but I will say that before rygdxt I saw a bunch of encoders futzing around with more complicated and messier approaches like Squish, or even bothering with computing a general 3×3 (or even NxN) eigendecomposition, and these days power iteration seems to be the go-to, so if nothing else I think rygdxt helped popularize a much simpler technique in that space.

Another small linear algebra trick in there is with the color matching. We have a 4-entry palette with colors that lie (approximately, because everything is quantized to an 8-bit integer grid) on a line segment through RGB space. The brute-force way to find the best match is to compute the 4 squared distances between the target pixel value and each of the 4 palette entries. This is not necessarily a bad way to do it (especially if you use narrow data types and SIMD instructions, because the dataflow is very simple and regular), but it is essentially computing four 4-element dot products per pixel. What rygdxt/stb_dxt uses instead is the fact that if it were a perfect line segment in a continuous space, we could just use an orthogonal projection to find the nearest point on the line, which with appropriate normalization boils down to a single dot product per pixel. In that continuous simplification, the two interpolated colors sit exactly 1/3rd and 2/3rds along the way between the two endpoints. However working on the aforementioned 8-bit integer grid means that the interpolated colors can sometimes be noticeably off from their ideal placement, especially when the two endpoints are close together. What rygdxt therefore does is compute where the actual interpolated 1/3rd-of-the-way and 2/3rds-of-the-way colors land on the line (two more dot products), and then we can do our single dot product with the line direction and use the values we computed earlier to figure out which of these 4 colors is closest in the 1D space along the line, which is just a few comparisons and can be done branchlessly if desired.

The result doesn’t always match with the distances code using the brute-force solution would get, but it’s very close in practice, and reducing the computations by a factor of nearly four sped up the BC1 encoding process nicely (old 2008 evaluation by my now-colleague Charles here).

That leaves us with the actual subject of this blog post, the iterative refinement logic! I just answered an email by someone asking for an explanation of what that code does and why, so here goes.

### The refinement function

The code in question is here.

Ultimately, what rygdxt/stb_dxt does to refine the results is linear least-squares minimization. (Another idea that’s not mine, this one I definitely got from Squish). We’re holding the DXT indices (interpolation weights) constant and solving for optimal endpoints in a least-squares sense. In each of the RGB color channels, the i’th target pixel is approximated by a linear interpolation

where x0, x1 are the endpoints we’re solving for and w_{i} is one of {0, 1/3, 2/3, 3/3} depending on which of the 4 indices is used for that pixel. Writing that out for the whole block in say the red channel turns into an overdetermined system of 16 linear equations

to be solved for x0r, x1r (the first/second endpoint’s R value).

Let’s call that tall and skinny matrix on the left A; is the 2D column vector we’re solving for, and the RHS vector of the pixel r values we can just call “r’.

That means our problem is to minimize (2-norm), or equivalently .

is quadratic; the way you find its minimum is by computing the derivative and equating it to 0, which leads us to what’s called the Normal Equations, which for this problem are

A is a 16×2 matrix, so A^{T}A is a tiny symmetric matrix (2×2) and contains the dot products of the columns of A with each other.

We have 3 color channels, not just r but g and b as well. That means we have 3 copies of the same linear system with 3 different right-hand sides, or equivalently we can view the whole thing as a matrix equation with a 3-wide right-hand side. Either way, all 3 systems have the same A matrix, the only thing that differs is the right-hand sides.

We accumulate (and g and b as well) in the first pass, and also compute the entries of . To solve the system above, because we just have a 2×2 matrix, we can use Cramer’s rule to solve it directly, no need to mess around with factorizations or Gaussian Elimination or similar.

That’s the basic idea. The RefineBlock function uses two more tricks:

- instead of the weights being {0, 1/3, 2/3, 3/3}, we multiply them by 3 and also scale the RHS by 3 (the latter, we never actually do explicitly). Getting the extra scaling is essentially free during the linear system solve, especially since we already need to do some scaling per-channel anyway, because the R/B endpoint values we solve for are in [0,31] (instead of [0,255] for the input pixel values), and the G values are in [0,63]. Scaling everything by 3 means there are no fractions involved in the computation, it’s all small integers, which will be useful for the second trick. It also means that when we compute the determinant of the 2×2 system for Cramer’s rule, it’s an integer computation, so we don’t have to worry about near-zero values and the like. (Although in this case, it’s not too hard to prove that A is singular, i.e. has determinant 0, if and only if all the w
_{i}are the same, which is easy enough to detect up front.) - now our weights w
_{i}(matrix entries in A) are all in {0,1,2,3}. The three entries in sum, respectively (note we scaled by 3, so turns into ): , , and . Note that all three values we’re summing only depend on w_{i}, and w_{i}is one of 4 possible values (depending on the index), so we can just precompute all of them. Also note they’re all quite small: and are at most 9, and is either 0 or 2, so they all comfortably fit in 4 bits. We’re summing 16 of these values (one per pixel in the block), and they’re all positive. That means the final sums fit into 8 bits no problem. Therefore the actual table we have packs the 3 values into one integer:

.

We sum that one integer per pixel. All of the individual sums are guaranteed to be <256 so we can extract the corresponding bits after the accumulation loop. That means the cost of computing the entries of A^T A becomes quite cheap (a single 4-entry table lookup and 32-bit integer add per pixel), and the bulk of the work in that first loop is just computing the right-hand sides.

And that’s about it. I know that approach was later also used by NVidia Texture Tools, beyond that I have no idea of its reach, but if it’s handy for someone, cool!

Two acquaintances independently asked about this today, so it seems worth a write-up: recently (as of this writing), DeepMind published a new paper about a new practical fast matrix multiplication algorithm, along with a press release that is a bit misleading and seems to have led to some confusion already.

In particular, while the paper does introduce many new “practical” (not scare quotes, I’m using the word in a specific sense here that I’ll make more precise in a minute) fast small-matrix multiplication algorithms, that does not mean that you should replace your small-matrix library routines for 3×3 or 4×4 matrix multiplication with them. That’s not actually what they’re meant for, and they wouldn’t be good at it.

If you just want the executive summary, here it is: these are definitely interesting algorithms from an arithmetic complexity theory standpoint – especially for the case of 4×4 matrices over finite fields, where (to the best of my knowledge) Strassen’s algorithm from 1969 was still the reigning champion. These algorithms are also *practically* relevant, meaning that not only do they have better asymptotic lower bounds than Strassen’s algorithm, they are still algorithms you might actually use in practice, unlike essentially everything else that has been written on the topic in the last 50 years: these algorithms are correct, and will in principle win over Strassen’s algorithm with large enough matrices, but that cut-off is well beyond the sizes that anyone is actually doing computations with.

And if you know what Strassen’s algorithm is, you might be in the market for the results from this paper, In fact, I’ll go one further and say that if you are currently using Strassen’s algorithm somewhere, you should definitely check the paper out. For the rest of you, I’ll try to give a very short summary of the topic and explain why actual small matrices are not really the intended use case.

## Strassen’s algorithm

Regular matrix multiplication of 2×2 matrices uses the standard “dot products of rows with columns” algorithm:

As written, this does 8 multiplications and 4 additions. In 1969, Volker Strassen discovered an algorithm for this that only uses 7 multiplications but 18 additions (follow the link for more details, I won’t go into it here); Winograd later presented a variant that only uses 15 additions. This is interesting in principle if multiplications are much more expensive than additions, something which is true in some settings but does not commonly apply to hardware floating-point math these days. Hardware floating-point now commonly implements *fused multiply-add* (FMA) instructions, and the 2×2 matrix multiplication above can be implemented using 4 regular multiplications, 4 fused multiply-adds, and no separate additions at all. Moreover, Strassen’s algorithm has some numerical stability issues when used with floating-point (these concerns do not exist when it’s used for finite field arithmetic, though!) that means it also must be used carefully. What this means is that you would never actually consider using Strassen’s algorithm on 2×2 matrices, and that is in fact not how it’s normally presented.

The form of Strassen’s algorithm that is of practical interest works not with 2×2 matrices, but with 2×2 *block matrices*, that is, 2×2 matrices whose elements are themselves matrices. Matrix multiplication has a very regular structure that looks exactly the same when multiplying block matrices as it does when multiplying matrices of scalars, you just need to be careful about operand order for multiplications because matrix multiplications, unlike multiplications in the scalar ring or field, are not commutative:

This looks identical to what we had before (except it’s now all upper case), but the operations mean something different: before we were multiplying scalars with each other, so we had something like individual real number multiplications (or floating-point multiplications in actual numerical code), now the products are matrix-matrix products (which are O(N³) operations when multiplying two square NxN matrices using the standard algorithm, a mix of multiplications and additions or possibly fused-multiply-adds) and the sums are matrix-matrix sums (which are O(N²) additions). And because what we’re describing here is a matrix multiplication algorithm to begin with, the smaller sub-matrix multiplications can also use Strassen’s algorithm if they want to.

In short, not only does each multiplication and addition in this block matrix represent many underlying scalar operations, but the relative cost of multiplications compared to additions keeps growing as N (the size of the blocks in our block matrix) increases. And that’s where Strassen’s algorithm is actually used in practice: eventually, once N becomes large enough, the many extra additions and irregular structure become worthwhile. It’s important to note that arithmetic operation count is not the only factor here: the extremely regular structure of conventional matrix multiplication, and the ease with which it can use FMAs, means that the arithmetic operations in a tuned matrix multiplication kernel are a lot cheaper than you might expect, because these kernels tend to do an extremely good job of keeping the machine busy. With small matrices (and actual 4×4 matrices definitely fit that description), other overheads dominate. Somewhat larger blocks are mostly limited by memory subsystem effects and carefully manage their footprint in the various cache levels and TLBs, which tends to include techniques such as extra memory copying and packing stages that might seem like a waste because they’re not spamming floating-point math, but are worth the cost because they make everything else go faster. With much larger blocks, eventually Strassen can become attractive, although the actual cut-off varies wildly between architectures. I’ve seen reports of Strassen multiplication becoming useful for blocks as small as 128×128, but more usually, it’s used for blocks that are at least 512×512 elements, if not much more. All this assuming that its less-than-ideal numerical properties are not a show-stopper for the given application.

## AlphaTensor

That brings us back to AlphaTensor. What DeepMind has implemented is, in essence, a variant of the neural-net-guided Monte Carlo Tree Search they’ve been using with great success to build Chess and Go AIs. This time, they use it to search for small-matrix multiplication algorithms. I’m not the right person to go into the details here, and it’s not important for the purposes of this post. We can just treat this procedure as a black-box optimizer that can do a guided search for matrix-multiplication algorithms meeting a user-specified set of criteria.

One obvious criterion would be optimizing for minimum multiplication count, and in fact one of the discoveries reported is a “Strassen-like” factorization that uses 47 multiplications to multiply two 4×4 matrices (compared to 7*7=49 multiplications for two nested applications of Strassen’s 2×2 algorithm, or 64 multiplications for the direct algorithm). And separate from the more theoretical concern of minimum operation count for a “practical” algorithm, the same optimizer can also incorporate measured runtimes on actual devices into its operation, and thus be used to perform a guided search for algorithms that are fast on certain devices.

That’s the process used to yield figure 5 in the paper, which reports speed-ups of optimized 4×4 matrix multiplication factorizations against the baseline (which is the regular algorithm). Note that what this does is one high-level 4×4 block matrix multiply using the algorithm in question at the top level; all the smaller lower-level matrix multiplies use regular matrix multiplication. Also note that the reported matrix sizes start at 8192×8192, i.e. in this case, the 2048×2048-element blocks are multiplied regularly.

Furthermore, note that the reported speed-ups that the press release refers to as “10-20%” are compared to the baseline of using regular matrix multiplication, not against Strassen (in this case, “Strassen-square”, the aforementioned 4×4 factorization obtained by applying Strassen’s algorithm twice to a 4×4 matrix). Strassen results are reported in the figure as well, they’re the red bars.

In short, the new algorithms are a sizeable improvement, especially on the TPU, but the perspective here is important: this type of algorithm is interesting when individual multiplications are quite expensive, in this case because they are actually multiplications of fairly large matrices themselves (2048×2048 blocks are nothing to sneeze at).

If you’re actually multiplying 4×4 matrices of scalars, the standard algorithm remains the way to go, and is likely to stay that way for the foreseeable future.

I wrote about Morton codes long ago, and I see that post and the accompanying code referenced frequently, but there’s a few points I want to make about it.

First, if you’re on a x86 CPU with BMI2 support, you have access to `PDEP`

and `PEXT`

, which make Morton encoding/decoding trivial. 2D Morton encoding one 16-bit coordinate to a 32-bit word is a single 32-bit `PDEP`

with `0x55555555`

, decoding it is a `PEXT`

with the same mask, and you can also use the same thing to encode 64-bit values with appropriate longer masks extended in the obvious way. The only fly in the ointment is that earlier AMD Zen CPUs technically support `PDEP`

and `PEXT`

but with an incredibly slow implementation, so this is not a good implementation if your target HW includes those CPUs. That said on Zen 3 and later the two instructions are fine, and ARM has added an equivalent to their vector pipes with SVE2, so while a bit iffy right now, I expect this might be the method of choice eventually, for CPUs anyway. I’ve been seeing this a fair amount in GPU shaders too, though, which brings me to the next subject for today.

The methods presented in the original post assume that the Morton-coded index value is 32 bits. I did not point this out explicitly in the text back in 2009, but you can either add another pass to produce a 64-bit result, or remove passes if you don’t need full pairs of 16-bit coordinates (or triples of 10-bit coordinates) interleaved into one 32-bit value.

For example, I gave this code for Morton-encoding a pair of unsigned 16-bit values to form a 32-bit index: (Dropping the comments that detail the bit layout after each step because they cause formatting issues)

// "Insert" a 0 bit after each of the 16 low bits of x uint32_t Part1By1(uint32_t x) { x &= 0x0000ffff; x = (x ^ (x << 8)) & 0x00ff00ff; x = (x ^ (x << 4)) & 0x0f0f0f0f; x = (x ^ (x << 2)) & 0x33333333; x = (x ^ (x << 1)) & 0x55555555; return x; } uint32 EncodeMorton2(uint32_t x, uint32_t y) { return (Part1By1(y) << 1) + Part1By1(x); }

If we only care about 8 bits worth of x and y, that first shift-and-xor pass (which tries to move the high 8 bits of `x`

into place) is useless and we can just skip it:

// "Insert" a 0 bit after each of the 16 low bits of x uint32_t Part1By1_8b(uint32_t x) { x &= 0xff; // 8 bits only this time! x = (x ^ (x << 4)) & 0x0f0f0f0f; // 0x0f0f would suffice x = (x ^ (x << 2)) & 0x33333333; // 0x3333 would suffice x = (x ^ (x << 1)) & 0x55555555; // 0x5555 would suffice return x; } // Only uses low 8 bits of x, y uint32 EncodeMorton2_8b(uint32_t x, uint32_t y) { return (Part1By1_8b(y) << 1) + Part1By1_8b(x); }

Lastly, if you’re encoding or decoding multiple values, which is extremely common because after all the whole point is to turn a pair (or triple) of values into a single integer, and the values are sufficiently narrow, we can handle x and y at once, by packing them into a suitable value before we do the rest of the computation.

For example, the case above where we only have 8-bit x and y coordinates (not that uncommon, say within a tile) can do a variant of the trick above to handle the full pair at only slightly more work than just handling a single coordinate in the original code would be:

uint32_t EncodeMorton2_8b_better(uint32_t x, uint32_t y) { // Pack x and y into t, starting at bits 0 and 16 respectively uint32_t t = (x & 0xff) | ((y & 0xff) << 16); // Do the full interleave as above (this time the full // 32-bit masks are necessary) t = (t ^ (t << 4)) & 0x0f0f0f0f; t = (t ^ (t << 2)) & 0x33333333; t = (t ^ (t << 1)) & 0x55555555; // Merge the halves. // All the odd bits are clear, so instead of shifting t right by // 16 to get y and then shifting left by 1, we can just shift right // by 15. return (t >> 15) | (t & 0xffff); }

All in all is this is roughly the same cost of a single invocation of the original `Part1By1`

. If you need a full 16-bit x and y but have 64-bit registers available, you can get a 32-bit result using a 64-bit extension of the original code (you just need to expand the masks appropriately).

This technique is independent of how exactly you do the bit-(de-)interleaving, so you can in principle combine this with `PDEP`

and `PEXT`

. There’s not much point though, because if you have a good implementation then doing two `PDEP`

with different masks and a final OR will be cheaper, and if you might encounter one of the bad implementations you just want to avoid these instructions entirely and use the above SIMD-within-a-register (SWAR) algorithms.

I’ve only showed 2D encoding here, the same trick works for decoding:

pair<uint32_t,uint32_t> DecodeMorton2_8b(uint32_t code) { // Separate even and odd bits to top and bottom half, respectively uint32_t t = (code & 0x5555) | ((code & 0xaaaa) << 15); // Decode passes t = (t ^ (t >> 1)) & 0x33333333; t = (t ^ (t >> 2)) & 0x0f0f0f0f; t ^= t >> 4; // No final mask, we mask next anyway: // Return x and y return make_pair(t & 0xff, (t >> 16) & 0xff); }

And of course the same idea works for the 3D variants as well.

In summary, To Whom It May Concern: 1. PDEP/PEXT and friends (when available and reliably good) make this trivial, 2. if you only need a few bits you can skip passes and make it cheaper, 3. the bit movement is completely regular so you can also pack multiple values ahead of time and get two (de)interleaves for the price of one.

The adventure continues! Last time I went over the core decoder loop we use on AMD Jaguar-based consoles, i.e. Xbox One and PS4. I felt that one was a good example to start with since the target hardware is fixed, behaves quite predictably, and the various features and limitations of the machine conspire to make the solution essentially canonical: the narrow CPU front-end makes it so that minimizing macro-ops issued is a significant concern, and the `BEXTR`

-based decode is a very natural fit.

This time, our hardware target is “any 64-bit x86”, used on less specialized targets like Linux, Windows and Mac 64-bit x86. We don’t have a single CPU type we’re optimizing for; it needs to run on everything, and we’d like to have good performance on anything released within the last 10 years (at the time this code was written in 2016) or so. We do get to have variants for newer ISA extensions when they help but we’d like a good baseline version for “any 64-bit x86 CPU”.

As with the previous two parts, you should read the introduction to the Huffman decoders first, continue with the “Reading bits in far too many ways” series on this blog (part 2, part 3) which is also a prerequisite, and if you can’t read and understand x86-64 assembly language this will be hard to follow.

### You know how this goes by now

I’m not going into as much detail in this post as I did in the Jaguar one, both because our target this time is somewhat fuzzier and because we’re exploring variations on a theme here. It’s interesting to cover the key ideas, not so interesting to re-do variants of the same analysis many times in a row.

Our setting, then, is the same as it was last time: we have a Huffman code stream where the code lengths are constrained to be relatively short so we can do 5 in a row between refills, we use a 4KiB table with 2048 entries covering all possible 11-bit codewords, and decoding proceeds without having to do any significant data-dependent branches.^{1} We know that decoding back-to-back values from the same bitstream is limited by critical path latency, so we use multiple bitstreams.

On the AMD Jaguar, 3 independent bitstreams was sufficient to get us out of the miserable latency-bound hole and into the more desirable throughput-bound regime where we’re doing a decent job keeping the execution units busy. On the wider out-of-order cores that can sustain 4 or more instructions per cycle, 3 streams is not going to be enough for that, which is why we also have 6-stream variants that will be covered in a later part.^{2}

Once again, I’ll go over this from the inside out: first figure out how to decode individual bytes, then sort out the refill strategy, then deal with the remaining plumbing.

### The basic decode step

Our game plan hasn’t changed: peek ahead at the next 11 bits, do our table lookup, consume the right number of bits, emit the resulting byte. The most basic version of this, assuming we keep an explicit “remaining bits” count per stream, works out to something like this:

; peek at low 11 bits mov rcx, rBits and rcx, 2047 ; table fetch movzx ecx, word [rTablePtr + rcx*2] ; shift out consumed bits, update bit count shr rBits, cl sub rBitCount, cl ; emit decoded byte shr ecx, 8 mov [rDest + <index>], cl

So far, there should be no surprises here.^{3} As given, this works out to 7 instructions per byte decoded, quite a lot more than the 3 we had in the Jaguar version, but this time we don’t have guaranteed access to BMI instructions – and even if we did, they would not be as good as they are on Jaguar.

### Two more small tweaks

Part of x86’s historical baggage is that bits 8 through 15 inclusive of the a, b, c and d registers can be accessed as their own register, and this can be used on the *emit* sequence to save an instruction, getting us down to 6:

mov [rDest + <index>], ch

Due to the way the x86-64 instruction encoding works out, `rDest`

in this code can’t be just anything. It has to be from a limited set of registers^{4} but other than that, we’re still not doing anything special here.

Note that we want to be using `ECX`

as the register we load the table entry into because the variable-distance shifts in basic x86-64 all require the shift amount to live in `CL`

.^{5} We don’t particularly care what register name we use for our scratch temporary; RCX/ECX is as good as any other. Nevertheless, using RCX (and only RCX) as the temporary to load table entries into is *clearly* the right choice here, but compilers would frequently insert extra moves (sometimes more than one extra on the critical path) here, which is how we ended up hand-writing these loops in assembly to begin with.^{6}

The second fun wrinkle is that even though we can easily arrange for our shift count to always be in `CL`

to begin with, on Intel CPUs with BMI2 available, we always want to use the new `SHLX`

instruction instead, even when we’re shifting by the low bits of `RCX`

, because `SHLX`

takes 1 uop instead of 2 or 3 on newer Intel uArchs.^{7}

The final non-obvious change we do for the “real” version of the decode is that instead of the peek logic presented above, we actually do this instead:

mov ecx, 2047 and rcx, rBits

Why load the immediate first using a separate instruction and then AND with `rBits`

? Well, the important realization here is that moving an immediate value into `ecx/rcx`

has no dependencies and can execute at any time, whereas `mov rcx, rBits`

does depend on `rBits`

and thus ends up becoming a part of the critical path. Now this register-register move is generally “free” on microarchitectures that can implement reg-reg moves via renaming at an effective 0-cycle latency, which on the Intel side is Ivy Bridge and later, but there are two important caveats: first, not every machine we’re targeting (or at least were targeting at the time this code was written, 6 years ago) is Ivy Bridge and later, and second, some later microarchitectures had subtle bugs in their implementation of `MOV`

elimination and ended up disabling it later. Hence, as a general rule, the load-immediate-first version is preferable since it takes 1 cycle off the critical path for every decode when `MOV`

elimination is not performed, for whatever reason.^{8}

### Refill

As mentioned above, for this version we prefer the methods with explicit “remaining bits” counts, specifically what I described as variant 4 in my write-up on bit reading.

The reason to prefer this approach here is that it keeps the load for the bit buffer refill independent of the final bit count, meaning that refill load’s latency does not necessarily end up on the critical path. Since the latency along the critical path is the primary concern for this loop, this is definitely what we want.

In the “production” version of the loop, we keep the remaining bit count for stream 0 in `al`

, the refill pointer in `r8`

, and the bit buffer for stream 0 in `r11`

. The version we use is this (with the instructions scheduled slightly differently, but that’s extremely minor):

; refill streams mov rbp, [r8] ; next0 = load(in0) movzx rcx, al ; bitc0 shl rbp, cl ; next0 << bitc0; use shlx with BMI2+! or r11, rbp ; bits0 |= next0 << bitc0 shr ecx, 3 ; 7 - bytes_consumed0 sub r8, rcx ; in0 += bytes_consumed0 - 7 or al, 56 ; bitc0 |= 56 add r8, 7 ; in0 += 7

That’s a fair number of instructions, but the dependency structure is quite favorable:

- The load of the next bits to refill only depends on
`r8`

, which is known as soon as the previous iterations’ refill is complete (all the instructions that update`r8`

are in the above code snippet!) and in particular is completely independent of the 5 table loads per stream. - Insertion of the new bits depends on the final bit count being known, but once the 3 instructions (
`MOVZX`

,`SHL`

,`OR`

, all pure int ALU) for that part of the dependency chain are done, we can start decoding the next byte. - Updating the bit count and the read pointers takes several more instructions, but neither of these are as critical because the bit count doesn’t determine any load offsets and the read pointer is only used for the next iterations’ refill.

Once again, using `SHLX`

for the shift when available is preferred. In this case, in addition to the 2 uops saved, it also lets us remove the 1 cycle for the `MOVZX`

from the critical path.

Beyond that, there’s a few more instructions in the loop to increment the destination pointer, maintain/test the “bytes to decode” count and check whether the input pointers have crossed, but all of this works the same as in the last part and is not on the critical part so I won’t spend time on it here.

### Analysis

With all that said, what’s the expected cost of this decoder? Unlike the Jaguar example which targets a single known microarchitecture, this version of the loop targets a general PC target and thus should be good for numerous uArch generations from multiple vendors over something like a 10-year window.

We have in fact done just that kind of testing, and there are some minor variants (mainly depending on whether BMI2 and thus `SHLX`

is available, as noted above), but the details here get very repetitive and tedious. I don’t particularly feel like writing all of this down, nor would it make for interesting reading.

Instead, I’m going to take a single uArch as our proxy. I’ll be using Intel’s Skylake (SKL), which Intel essentially (in progressively tweaked versions) kept re-releasing in between 2015 and 2021, meaning there’s a nearly 7-year window of Intel consumer CPUs that are all essentially SKL spin-offs. In the notebook space we saw Ice Lake in late 2019 but for desktop, 2021’s Rocket Lake (the first “real” SKL successor for desktop) was neither well-received nor sold well, and was soon replaced by Alder Lake in late 2021.

In practice, this means that at the time of this writing, the mean, median and mode Intel desktop x86 that is at least a year or so old is some SKL variant; during that time AMD made major inroads into the desktop space with their Zen cores (which, fortunately for our analysis, have quite similar performance characteristics). Notebooks are a bit more varied, but if I’m going to look at only one x86 uArch, SKL seems like a solid choice.

As mentioned multiple times throughout this post already, we’re mainly concerned about latency here. As such, I won’t go into many uArch details here and just point out that the relevant cores are a lot wider than the Jaguars we looked at last time (namely, they support 4+ instructions decoded/dispatched/retired per cycle), have 4 integer ALUs, can (under the right conditions) support 2 loads and 1 store per cycle, up to 2 branches per cycle, and in general have just more of essentially everything compared to the small Jaguar cores we looked at last time. (They also typically clock at twice the frequency, or more.)

Ironically, precisely *because* these cores are so much wider, the analysis gets a lot simpler. Here’s the gist of it: the all important critical path between iterations for a single stream starts with the final bit count for the last iteration and then goes through the refill and 5 subsequent symbols decodes until the bit count for the current iteration is known.

In the code as presented above, that means we start with the `MOVZX`

, `SHL`

and `OR`

from the refill (all basic ALU ops that are 1-cycle latency on pretty much every x86 core you can buy these days) until we know the new value of `rBits`

, for 3 cycles refill latency (can be 2 if we use SHLX).

Then we have 5 subsequent decodes where we care about the latency until we have the updated (shifted) rBits. In the version with the hoisted immediate that uses `and rcx, rBits`

, the 3 relevant instructions on that dependency chain are this AND, the table load, and then the shift by CL. The AND and shift are also 1-cycle latency. In the final iteration, we also need to know the final bit count; this is computed by another independent instruction (the `SUB`

) that depends on the table load and is likely to issue at the same time as our shift, so it doesn’t change the latency estimate. Finally, the load is the longest-latency instruction. Now SKL has 4-cycle-latency loads for certain very restricted cases, if the address itself originated on the load unit (i.e. for pointer chasing), but for our table lookup, we have a complicated addressing mode and the index register originates on the ALU not the load unit, so 5-cycle load latency is what we get.

Everything else is not critical, so for now, we’re going to ignore it. What does that leave us with? For one of our 3 streams, the latency along that critical path ends up being 3 + 5×(1 + 5 + 1) = 38 cycles.

Skylake can sustain 4+ fused-domain^{9} uOps per cycle during those 38 cycles, giving us 38 × 4 = 152 issue slots. One refill takes around 8 fused-domain uOps (with SHLX), each decode step takes 6. That means one stream’s worth of decodes takes around 8 + 5×6 = 38 uops, so for 3 streams that’s 114 uops out of our budget, leaving another 38 issue slots remaining.

That leaves the work we haven’t accounted for so far: one of our streams needs to do a BSWAP (another 3 uops), we have the two pointer-crossing checks (2-4 uops depending on what macro-fusion happens), the destination pointer increment (1 uop), and the final loop condition/test (1-2 uops, again depending on macro-fusion). And that’s it. At worst, that takes another 10 uops out of our budget.

We’re left with about 124 issue slots used over 38 cycles, about 81% of what the machine can sustain. That leaves enough air in the schedule that we don’t need to be too worried about exactly what the schedulers and port load-balancing logic do; we really do expect to be mostly latency-limited still, but at least latency-limited with a decent uArch utilization. Finally, our instruction mix is balanced enough, and the machine has enough resources, that we needn’t worry too much about any particular type of resource (like say load units or shifters) getting overcrowded.

In short, the best we can expect out of the loop described above is to decode 15 bytes (5 bytes each from 3 streams) every 38 cycles, which works out to 2.53 cycles per byte. In practice, on my Skylake test machine, benchmarks hover around 41.8 cycles per 15 bytes (i.e. around 2.79 cycles/byte), both in the idealized test rig and on real-world data.

That’s approximately 10% above our rough lower-bound estimate, purely from operation latencies, assuming scheduling is perfect, there are no cache misses, and so forth. At this point, we could dig deeper and try to figure out where we’re losing these 3.8 cycles relative to our critical path estimate, and if there’s something we can do about it, but the writing’s already on the wall: with one stream, we would be utterly latency-limited with a mostly-empty machine. The 3 streams we’re using now get us to a more respectable level of uArch utilization, but while they were enough to get us into our target throughput-bound regime on Jaguar, they’re not enough, not even in a lower-bound estimate where there are never any delays due to cache misses or suboptimal uop scheduling.

### Discussion

Really, this should not be a surprise. On a machine twice as wide, with many more execution resources, instructions per cycle, and 2 cycle longer load latency, *of course* we need more streams to hide those latencies.

However, we’re starting to bump into other problems. Our bit-IO method uses 3 registers worth of state per stream: the bits, the bit count, and the refill pointer. That’s 9 registers used. We also definitely need a stack pointer, a destination buffer pointer, at least one scratch register (`rcx`

), and also a base pointer for our Huffman table and preferably also an “end-of-buffer” pointer (or, equivalently, a counter) in a register for the final loop test. That’s 14 registers spoken for, and x86-64 only has 16 GPRs.

Now, some of those registers don’t *need* to be live for most of the loop. The “end-of-buffer” pointer never changes and is only used once at the end so it’s not a big problem to leave in memory, and our refill pointers are only used during refill so we can spill some of them without wreaking havoc if we don’t mind the extra instructions to spill and reload them, but we’re getting on thin ice here.

Using more streams would be interesting, but we need to be careful not to introduce a lot of extra instructions. Potentially troublesome spills/refills introducing latency in the wrong places would suck as well. In short, we’d like a few more streams (just one extra is cutting it too close), but that in turn means 3 GPRs per stream is not going to work. We need something more compact, which leads us to a quite different design for the 6-stream decoders we added in Oodle 2.3.0. But that will have to wait for another post. Until then!

### Footnotes

[1] The “how many bytes are left” loop condition is just a counted loop and *extremely* predictable. The “did the pointers cross” check is technically data-dependent, but it is also extremely predictable (in the fast decoder loops anyway), because the decoder loops only run until our already-advanced pointers first cross, at which point we exit the fast loops and switch to the more careful variants.

[2] The trickiest part to negotiate when doing this experimentally is the transition from latency- to throughput-bound. As long as you’re latency-bound, your best move is to use strategies that minimize latency, even when doing so increases dynamic instruction count or utilization of limited execution resources. Once you have enough streams to cover the critical path latency with some slack left over, you need to worry about those bottlenecks instead, even when doing so adds latency. Options that were clearly inferior in the latency-bound regime will end up preferable in the throughput-bound regime.

[3] When writing this code in C/C++, you would probably keep the remaining bit count for the bit buffer in a 32- or 64-bit integer, not a byte, but done naively that would result in an extra instruction to extract and zero-extend the “length” field from the low 8 bits of the table entry. Using a byte register avoids the extra instruction, or alternatively, you could subtract the full 16-bit table entry from the bit count and mask the final bit count down to 8 bits (effectively reducing modulo 256) before using it. The two approaches are equivalent; for this presentation I went with the byte register variant because it’s more straightforward.

[4] x86-64 was designed to share as much of the instruction encoding (and hence the decoder logic) between the 32-bit and 64-bit versions. 32-bit x86 has 8 general-purpose registers and 3-bit register numbers; x64-64 increases this to 16 GPRs (which take 4 bits), and the high bits of the up to 3 register numbers in an instruction are grouped together in what’s called a REX prefix byte. Only some instructions have a REX prefix: if all 3 register number high bits are zero and the instruction is not 64-bit either, it’s not required. However, the register numbering for 8-bit registers worked differently than it did for the other registers: for “classic” x86 code, the 8 8-bit registers are AL, CL, DL, BL, AH, CH, DH, and BH (in that order), where those last 4 are refer to [15:8] in the corresponding 32-bit register. With the REX prefix present, every 64-bit register has its low 8 bits addressable as a corresponding “L” register, but the “H” register names are inaccessible. Thus, if you want to store `CH`

to `[rDest + immed]`

, `rDest`

needs to be one of the low 8 registers.

[5] Other (non-x86) architectures mostly don’t care about which architectural register the shift amount lives in, but almost everyone looks only at the low 5-8 bits (depending on architecture and context) of the shift count. This makes it so that arranging our (len,sym) pairs so that len is in the low byte (which for LE byte order means first byte) is almost universally the best choice.

[6] Hand-writing in ASM also made it practical to use the “store from CH” trick, which is not a huge deal in this particular version since there are other problems that we’ll get to in a minute, but this particular quirk of x86 is one that compilers generally won’t even try to use, even though it’s advantageous in cases like this one.

[7] This is one of my “favorite” x86 warts. The reason `SHL`

by CL is awkward, and often more uOps than you would expect, is because the architectural definition of this (and other variable-shift instructions) is quite strange for backwards-compatibility reasons. Namely, all the shift-by-CL instructions set different flags (and set them in different ways) depending on whether the effective shift count ends up being 0, 1, or something else. This goes back to a problem with the original 8086. The 8086 didn’t have a barrel shifter or any other way to do fast distant shifts; “`SHL AX, CL`

” existed but effectively just repeated `SHL AX, 1`

`CL`

times. This definition was presumably convenient given the *extremely* limited microcode ROM space the 8086 worked with but has the two unfortunate side effects that a) you could make shifts take a very long time on the 8086 by setting `CL`

to 255 and b) when CL is 0, no shifting takes place at all, and in particular the flags stay unchanged from before. Moreover, there’s another wrinkle in that `SHL AX, 1`

sets the overflow flag “correctly” (probably mostly by accident, because for left-shifts specifically the natural implementation of it on a 8086-like ALU data path is `ADD AX, AX`

, which does, and right-shifts can’t overflow), but when doing an iterated sequence of such shifts, the final overflow flag will indicate whether the last shift-by-1 overflowed, which is useless when the count is not 1, because for a “shift by 3” operation what you really want to know is whether there was an overflow at any point during the process, not whether the last instruction in a lengthy expansion overflowed (which is neither here nor there). The tightened definition that appears in the original 80286 manuals specifies that only the last 5 bits of the shift count matter (no doubt anticipating the coming expansion to 32-bit with the 80386) and further says that 1-bit-shifts set the overflow flag correctly but multi-bit shifts leave the overflow flags in an undefined state (not mentioned in these early manuals is that an effective shift count of 0 leaves the flags unchanged, but all x86 CPUs I’m aware of nevertheless behave this way). Long story short, the actual data portion of x86 `SHL reg, CL`

is exactly what you would expect, but which flags get updated and how is a tangled mess that, on most Intel CPUs, takes another 1 or 2 uops to resolve. Interestingly, no AMD uArchs I’m aware of, present or historical, have a problem with it.

[8] Another option available on machines with BMI1 or later is to use `ANDN`

to perform our AND operation. The idea is to put the ones’ complement of our desired mask (so `~2047`

) into a scratch register and then use `andn rcx, rInvertedMask, rBits`

to initialize `rcx`

. In short, we don’t actually care about the “not” part of the `ANDN`

, but it conveniently comes as a non-destructive 3-operand instruction that we can use to avoid the move. This does the job in one instruction, which is great, but requires us to sacrifice another register, which is not so great. In this particular code it’s not really worthwhile since we’re entirely latency-limited on sufficiently wide cores, so having a `MOV`

/`AND`

pair is not a problem as long as we don’t extend the critical path unnecessarily.

[9] I won’t go into what exactly fused-domain uOps are; it’s there for precision, but if you don’t know what that means, just ignore it, it’s way beyond the scope of what we need for this post.

In the last part we went over the general ideas of Huffman coding as implemented in the newer Oodle Data coders, this time we’ll be looking at one particular implementation that is both interesting and “historically relevant”: Oodle was designed with games in mind, an important class of hardware to consider for game middleware is game consoles, and versions of the AMD Jaguar CPU were in both the PS4 and Xbox One (mostly unmodified except for a bump in clock rate in the “mid-lifecycle upgrade” models of both). We wanted Kraken to perform well on those machines, so we spent some time optimizing Oodle for it. Before I go into the details, let’s do a bit of background on the machine itself, but be advised that this will be in-depth and that you may need to re-read the previous part first. Furthermore, this post contains plenty of x86 assembly; if you’re uncomfortable or unfamiliar with that, you probably won’t get much out of it, sorry.

### Meet the AMD Jaguar

The Jaguar, or less prosaically “Family 16h”, is a small, low-power, out-of-order 64-bit x86 CPU core designed for small systems and embedded applications such as, well, game consoles. In the game console variants, the Jaguar CPUs appear on the main SoC along with the GPU (and most other components). It’s designed for multi-core operation and cores usually appear in clusters of four that share a common L2 cache, usually 512kb of L2 per core. In the Xbox/PS4 versions, there are two such clusters and thus two L2 cache slices. This post is only concerned with tasks that run on a single thread so I won’t be spending time on this part of the architecture.

Each core has 32KiB of L1 instruction and 32KiB of L1 data cache. The frontend decode/dispatch/retire logic is 2 instructions wide, and the relevant unit for most of it is what is variously (depending on the source) called “macro-ops” or “cops” (complex ops), depending on the source. I’ll stick with macro-ops. Macro-ops are typically a data-processing instruction along with a memory reference, so something like the x86 instruction `add rax, [rsi]`

would be a single macro-op.^{1} Macro-ops get broken into either one or two micro-ops (I’ll write uops in the following) for execution, but instruction decoding, tracking and retirement all works on macro-ops. The backend has six execution units, each of which can accept one micro-op per cycle: two integer (which I’ll refer to as I0 and I1), one load (L), one store (S), and two SIMD/floating point (which I’ll refer to as F0 and F1). The pipelines are very symmetric: almost all integer instructions can execute in either I0 and I1 (the biggest exceptions being multiplies and divides, which are I1 only), and most FP/SIMD instructions can execute in either F0 or F1 (FP addition and SIMD integer multiplication are only supported in F0, and FP multiplies and store/convert are F1 only). Consequently, most pairings of two independent instructions can execute in the same cycle, if they’re not both contending for the same resource. Of these backend limitations, in my experience the one you’re most likely to hit is the one load per cycle limit.

That said, whenever I’ve looked, the two instructions per cycle decode/dispatch limit is usually the more relevant one. On the Jaguars, using the load-operate and even read-modify-write instructions where possible is a good idea (because it gives you two uops per macro-op), and generally preferable to splitting loads out.

Speaking of loads, the L1 data cache is 8-way associative, write-back, and internally splits 64-byte cache lines into 16-byte sectors. Unaligned loads and stores that stay within a single sector are free, crossing a sector boundary occupies the load/store pipes for an extra cycle (potentially more if it also crosses a page boundary etc.). The load-to-use latency for the L1 data cache is 3 cycles to the integer pipes, 5 cycles to the FP/SIMD pipes, both of which are quite low numbers compared to most of its contemporaries.^{2} The theme of low latencies continues for other parts of the backend: FP32 multiplies and SIMD integer multiplies complete in 2 clock cycles, FP32 and FP64 adds in 3, and most SIMD ALU operations take a single cycle. L2 misses take relatively long though, at a minimum load-to-use latency of 25 cycles.

Lots of console developers found these cores underwhelming, mostly due to the narrow design and fairly low clock rates (around 1.6 and 1.7GHz in the original PS4 and Xbox One, respectively). On the other hand, these cores are quite small, power-efficient, and the PS4/Xb1 console generation came with 8 of them, at a time when more than 4 cores was a rarity in the consumer space. Personally, I quite like them: they’re not the fastest but what they are is extremely even-tempered and predictable. They have a relatively low ceiling on the instructions per cycle and peak performance they can achieve, but getting there is generally a fairly straightforward process, and there’s not much in the way of gotchas or nasty surprises. They’re a pleasant core to optimize for^{3}, and AMD helped by providing good documentation for it.

### The Plan

Because of the aforementioned decode/dispatch/retire limits and low instruction latencies, optimizing code with reasonably nice memory access patterns for the Jaguars is, more often than not, an exercise in minimizing number of instructions executed for a given task. (As I said, they’re fairly straightforward to optimize for!) Therefore, if we want a fast Huffman decoder on these machines, it’s a good idea to see if we can do it with as few instructions as possible.

While reviewing the above-quoted docs, one thing I noticed was that `BEXTR`

, an instruction from BMI1, turns into one uop, is supported on both integer pipes, and has 1-cycle latency. `BEXTR`

is an odd duck: it extracts a given number of bits from a given starting point in the first source operand, and as such is essentially a counterpart to PowerPCs `rlwinm`

or ARMs `UBFM`

, but while these latter two instructions have the bitfield position and width given as an immediate operand, `BEXTR`

takes a register operand for the bitfield specification.^{4} Code that wants to do repeated bitfield extraction with the same operands can burn a register on a constant (itself a fairly steep cost on the relatively register-starved x86) and then use `BEXTR`

, which replaces a move, shift, and an bitwise AND instruction.^{5} The second source register operand to `BEXTR`

contains, itself, bit-packed values: the lower 8 bits give the index of the LSB of the bitfield to extract, the next 8 bits give the width in bits.

This is usable for the bitstream decoding part of our Huffman decoder. Using a “bit extraction” style decoder (variant 3 in this post) means we repeatedly do operations of the form `(bit_buf >> bit_pos) & ((1 << 11) - 1))`

to peek at our next 11 bits, and that is just `BEXTR(bit_buf, bit_pos + (11 << 8))`

. It doesn’t cause any problems to have a constant bias that shows up only in the high bytes added to our bit position, so we can just declare our bit positions to have that offset added at all times while in registers, and that lets us do our bit buffer peek in a single 1-uop instruction on Jaguar cores. Because of another x86 quirk, namely that byte-sized instructions exist and preserve the remaining bits of the register, we can do updates of bit_pos using byte-sized additions or subtractions that leave the high bits alone, if we want to.^{6}

Finally, we don’t want to do a store for every byte we decode, because that’s an extra instruction and we’re easily limited by instructions (or rather, macro-ops) executed. Fortunately we can use the SSE4.1 instruction `PINSRB`

(packed insert byte), which inserts a byte value from an integer register or memory into a given lane of a vector register. Vector registers hold 128 bits (16 bytes), which means we can amortize the number of stores and do one every 16 or so bytes instead of for every codeword. Finally, because Jaguar cores treat memory references inside an instruction as separate uops but not separate macro-ops, and macro-ops are one of our main limiters, we want to use memory references liberally if doing so lets us reduce the number of macro-ops we need.

Putting this all together, note that the pseudocode for the per-symbol processing in a LSB-first Huffman decoder, as outlined in the previous part, looks something like this:

// peek uint32_t bits = (bit_buf >> bit_pos) & 2047; // consume bits bit_pos += table[bits].len; // decode symbol emit(table[bits].sym);

and using the various techniques outlined above, we can turn this into a mere 3 x86 instructions:

; peek. rBitPos[15:8] = 11 bextr rBits, rBitBuf, rBitPos ; advance bit offset (update low byte only) add rBitPosb, [rTableBase + rBits*2] ; put table[bits].sym at position N into xmm0 vpinsrb xmm0, xmm0, [rTableBase + rBits*2 + 1], N

On the Jaguar, this decomposes into 3 macro-ops and 5 uops: 2 loads, 2 integer ops, 1 SIMD. The bit extract to grab rBits from rBitBuf takes a single cycle; the bit position update takes 3 cycles to load the value from the table and an extra cycle to complete the addition. We don’t actually care about the top bytes being preserved here, since we don’t expect overflows, but we do care about our load-operand being byte-sized. Either way, that’s 5 cycles critical path latency from one decoded symbol to the next. Finally, the vector byte inserts to collect the output bytes are not on the critical path. They need to be fast enough to keep up with our decoding bytes once the table loads finish (and they are, single they can complete at a rate of 1 per cycle) but that’s about it. With the Jaguar frontend supporting at most 2 macro-ops per cycle, this code takes at least 1.5 cycles per symbol decoded in the frontend, and 2 cycles per symbol decoded in the load pipeline. Meaning that as given, this code is limited more by the load pipeline than the frontend. However, this is not the only work that needs to happen in this loop, and the Jaguar is out-of-order, so we can build up a backlog of load pipeline work; if we later need to do more integer work in the loop that does not take many loads (spoiler: we will), the load pipeline will get to catch up.

Finally, as mentioned above, our critical path between back-to-back loads from the same stream is 5 cycles on the Jaguar. If we use 3 streams and interleave their processing during decode, then the frontend will get around to the first instruction for the second byte of stream 0 about 4.5 cycles in (although the load pipeline will take about 6 cycles to work through its backlog before then). In other words, the timing here can roughly work out, but it’s not perfectly matched; we will build up a bit of backlog in the load pipeline and the reorder buffer before this is done, but as long as we choose our instructions carefully and don’t go too lopsided, we can make this work while keeping the core nice and busy the whole time through.

I was pretty excited when I first realized this 3-instruction sequence was a viable candidate for the core of our Huffman decoder on Jaguar, but to get a real decoder we also need to deal with bit buffer refills, pointer advancing, and end-of-buffer checks.

### The actual decoder

As mentioned above (and in the previous part), we use 3 separate bitstreams for parallelism. Of these, two bitstreams are regular “forward” bitstreams in increasing address order, and one is written backwards. The numbering of these is a bit odd: in the physical Oodle format, the layout is `strm0-> | strm2-> | <-strm1`

, i.e. stream 0 is forward and comes first (as you would expect), stream 1 is backward and comes last, and stream 2 is also forward and somewhat awkwardly sandwiched in the middle, for “historical reasons”. Namely, Kraken uses forward-backward stream pairs in many places.^{7} The Huffman decoder used to be the same way; when we noticed (while working on this Jaguar decoder, in fact) that three streams would be advantageous, we had to put the third stream *somewhere*. Putting stream 2 in the middle turned out to be slightly easier.^{8} The advantage of the odd-looking backward stream is that it saves us a bit of signaling in the container format (not a trivial concern for a compression format) and also gives us a nice way to do end-of-buffer checks. Namely, the three are contiguous, and all three read pointers (called `in0`

, `in1`

and `in2`

in the following) are in that single contiguous region. Furthermore, in0 keeps increasing, in1 keeps decreasing, and at any point in a well-formed stream, we have `buffer_begin`

≤ `in0`

≤ `in2`

≤ `in1`

≤ `buffer_end`

. During decoding, we do the two interior checks of the read pointers against each other; the end-of-buffer checks on either end are implied by transitivity, and we don’t need to actually do them, or keep those extra pointers around. The only pointers we need to check are the ones we already keep around anyway. Neat!

Now, loads aren’t zero-sized; we use the (common in C) convention that “end” pointers point one past the last element of arrays. So we don’t want to start loading from `in1`

, and with 64-bit (8-byte) loads, the largest address we can ever safely load from is at `buffer_end - 8`

, assuming the buffer is at least 8 bytes to begin with (which we check beforehand). Decrementing `in1`

by 8 before the loop takes care of both issues: now in1 points to the last address we can do a valid 64-bit load from, and as a side effect `in0`

≤ `in2`

≤ `in1`

ends up guaranteeing that in0 and in2 are also good to safely load 8 bytes from without overrunning the buffer. Finally, the optimized decoder loop described here decodes 5 bytes each from 3 streams and writes the results using a 16-byte SIMD store, so it can only safely run until 16 bytes before the intended end of the buffer. All the remaining special cases (less than 8 bytes left in some of the streams, very short input streams, or close to the end of the output buffer) are left to a dedicated safe loop that generally handles the last few bytes, needs to do more careful checking, and is certainly not using hand-tweaked assembly. There would be no point for speed since it only ever handles very few bytes, and besides that’s the exact loop where you very much want a higher-level language for better debugging facilities and good integrations with sanitizers, fuzzers etc.

With a plan for all those details, all we need to take care of now is the refill logic and sort out the remaining plumbing. Looking at the decoder sketch above, we see that we need at least 2 registers worth of state per bitstream: one register to contain `bit_buf`

(rBitBuf in the pseudo-ASM), and one for `bit_pos`

. Once we consider refilling, we also need the corresponding read pointer (the `inN`

I was just taking about). For 3 streams, 3 registers of state per stream works out to 9 registers, a bit more than half of our general-purpose register name pool, which is workable.

As for refill, that is luckily straightforward in a “bit extract” style scheme. At the top of every iteration, we want to load the next 8 bytes from the current input pointer:

mov rBitBuf0, [in0]

For the reverse byte order `in1`

stream, we use a big-endian load (`MOVBE`

) instead, which is the same cost as the regular load on the Jaguars.^{9}

Then we decode 5 values from each of the 3 streams. With our 11 bits code length limit, that means we end up consuming at most 55 bits from each stream. Most relevant bit reading techniques support at most either 56 or 57 bits in a row without a refill when using 64-bit registers, so this fits well.^{10} Decoding 3×5 = 15 symbols also works out very nicely with our 128-bit vector registers, so we do a single unaligned vector store every 15 bytes.^{11}

Finally, after each stream has decoded 5 symbols, we need to check how many bytes to advance the read pointer by, and what the new start position within the byte is. The number of bytes we need to advance the read pointer by is `(bit_pos >> 3) & 7`

, which on the Jaguar, we can compute using a single BEXTR if we can afford a register just to store the constant 0x303, which we can.^{12} We then either add or subtract this from the corresponding `in`

pointer. Finally, we need to clear the bits corresponding to the byte position (that we just took care of) in `bit_pos`

, which is an AND by ~0x38. This keeps the high bits, containing the 11 bitfield length that we need, intact. The actual code below does this masking at the start of the next iteration instead of at the end of the current iteration, but conceptually this belongs with the pointer advance.

And that’s pretty much it. Here’s the full decoder loop, written in NASM. We originally tried to write this in C++ with intrinsics, but that got nasty, so we eventually switched to a real assembler. The original version has the comments laid out differently but I need to fit this into an annoyingly narrow blog CMS theme, so this will look a bit clunky:

; main decode loop ; rax = scratch ; rbx = bitextr0 ; rcx = bitextr1 ; rdx = bitextr2 ; rbp = bextr const ; rsi = table ptr ; rdi = -bytes_left_to_decode ; r8 = in0 ; r9 = in1 ; r10 = in2 ; r11 = bits0 (only live in inner loop) ; r12 = bits1 (only live in inner loop) ; r13 = bits2 (only live in inner loop) ; r14 = decodeend ; r15 = (unused) sub r9, 8 ; in1 -= 8 mov ebx, 0xb00 ; 11 field width mov ecx, 0xb00 mov edx, 0xb00 mov ebp, 0x303 ; for byte step align 16 .bulk_inner: ; non-crossing invariant: in0 <= in2 && in2 <= in1 cmp r8, r10 ja .bulk_done cmp r10, r9 ja .bulk_done ; refill stream 0 ; read next bits0, keep bit offset within byte mov r11, [r8] and ebx, ~0x38 ; refill stream 1 movbe r12, [r9] and rcx, ~0x38 ; refill stream 2 mov r13, [r10] and rdx, ~0x38 %assign i 0 %rep N_DECS_PER_REFILL ; stream 0 ; peek bextr rax, r11, rbx ; consume add bl, [rsi+rax*2] ; grab sym vpinsrb xmm0, xmm0, [rsi+rax*2+1], i+0 ; stream 1 bextr rax, r12, rcx add cl, [rsi+rax*2] vpinsrb xmm0, xmm0, [rsi+rax*2+1], i+1 ; stream 2 bextr rax, r13, rdx add dl, [rsi+rax*2] vpinsrb xmm0, xmm0, [rsi+rax*2+1], i+2 %assign i i+3 %endrep %undef i ; final advances ; num_bytes_step0 bextr rax, rbx, rbp ; in0 += num_bytes_step0 add r8, rax bextr rax, rcx, rbp sub r9, rax bextr rax, rdx, rbp add r10, rax vmovdqu [rdi+r14], xmm0 add rdi, 15 ; loop while bytes_to_decode > 0 js .bulk_inner

That’s the core 3-stream Huffman decoder loop. Time to quit it with the hand-waving and do an actual analysis (if only back of the envelope) to make sure we’re on the right track here.

### Analysis

We already looked at the core decode step earlier and noted that it has 3 macro-ops (I’ll write 3M in the following), and for the backend: 2 integer 0/1 ops (just 2I for short), 2 load unit cycles for aligned loads (2L for short), and 1 FP/SIMD op (1F for short). We do this 5 times per stream. Also per stream is the refill/advance logic, which we now know the instructions for: 1 load for the refill, and 3 integer ALU ops for the byte advance and bitpos update. The load in the refill is almost always unaligned, though. It’s a 64-bit load, and as noted in the introduction, unaligned loads are free if they stay within an aligned 16-byte sector, and cost at least 1 cycle extra when they don’t. Out of the possible load offsets mod 16, 9 (0 through 8 inclusive) stay within a 16-byte sector, the other 7 do not. That’s 7/16=0.4375 odds of at least one cycle extra, and some of those cases (such as crossing cache lines and pages) get more expensive than just adding a cycle. For sanity in the following, let’s just say that we bake this all down to somewhat simpler numbers and expect around 1.44 cycles average case (but probably closer to 1.5 in realistic conditions) for those unaligned refill loads, 2 cycles for a much more pessimistic estimate. In other words, we want to bill the unaligned refill loads at costing more than single aligned load, since the expected number the load pipelines are occupied with them is larger.

Taking that into account, the four instructions involved in refill and advance for a single stream boil down to 4M, 1.44-2L, and 3I.

Then, we have some cross-stream shared instructions: the two compare/jump pairs for our pointer-crossing check at the beginning account for 4M 4I, the final store accounts for 1M 1.44-2S (since it’s also unaligned), and the final `ADD`

/`JS`

pair contribute another 2M 2I to the tally. That’s all instructions in the loop accounted for.

For an overall throughput estimate, we get:

- 15 × 3M (decodes) + 3 × 4M (stream refill/advance) + 7M (shared rest) = 64M total, so 64 macro-ops, enough to occupy the front-end for at least 32 cycles.
- 15 × 2L (decodes) + 3 × 1.44-2L (stream refills) = 34.3-36L total, so the load unit is busy for 34.3-36 cycles.
- 15 × 2I (decodes) + 3 × 3I (stream refills) + 6I (shared) = 45I total, evenly distributes over both integer ALU pipes for 22.5 cycles worth of pressure.
- 15 × 1F (decodes) = 15F total, distributed over both FP/SIMD pipes for 7.5 cycles worth of pressure, so they’re loafing.
- 1.44-2S for 1.44-2 cycles worth of pressure on the store pipe which I assume is sitting on the sidelines munching popcorn.

Purely in terms of pressure on the execution resources, we’re mainly limited by the load pipes which are busy for around 34.5-36 cycles every iteration, closely followed by the frontend which is occupied for at least 32 cycles if everything goes perfectly. 34.3-36 cycles to decode 15 bytes works out to 2.287-2.4 cycles per byte decoded. This is assuming we can ever get throughput-bound to begin with, and is budgeting absolutely no time for L1 cache misses and such.

How does the critical path look? By my reckoning, the most likely candidate takes a freshly updated in pointer from the end of a previous iteration, does an unaligned load to refill which takes 4 cycles for the data to show up, then does 5 back-to-back decodes from that stream which we know have a critical path latency of 5 cycles each, and then finally needs to do a `BEXTR`

on the resulting bitpos followed by an integer add/subtract to produce the next load address. That’s 4 + 5 × 5 + 2 = 31 cycles of critical path latency through the stream decodes, worse if anything bad happens, like extra delays due to page crossings on a load or similar. 31 cycles is close enough to our other 2 limiters for it to be considered a 3-way near-tie. A hitch in the front-end or load pipes or any extra delay along the critical path is likely to end up delaying any given iteration if it occurs. Note I’m purely looking an ideal throughput estimates here, there is no modeling or simulation of machine details going on, all we’re doing is tallying up some figures based on known machine characteristics.

In short, from this rough estimate, we would expect somewhere around 2.3-2.4 cycles per byte for this decoder under very idealized circumstances where there’s not a single cache miss or hitch anywhere along the way, and “a bit worse” (to be decided what that means) when less idealized.

### The rubber hits the road

So what happens when we actually run it?

One of the nice things about coding for game consoles is that the hardware is known and tends to have very predictable, repeatable performance. With that said, here’s stats for the exact loop quoted above running on a PS4 on a synthetic test set (decoding a random stream with a very boring “every symbol is 8 bits” Huffman table, which of course you’d never do, but makes for a test run that’s very easy to validate the results of) 1000 times, and reporting 1st, 50th (median) and 95th percentile cycles per byte:

huff3_jaguar_asm: med 2.28/b, 1st% 2.25/b, 95th% 2.38/b

I swear I did not fudge this in any way, that’s the actual figures I got on a real test run just now. So that much, ahem, very encouraging, to say the least. But what happens when you time it in the middle of an actual Kraken decode of ~250MB of real non-synthetic test data?

SimpleProf :seconds calls count : clk/call clk/count get_array_huff : 0.3879 17689 178617138 : 34946.1 3.46 huff_x64jaguar_loop : 0.3035 31350 178617138 : 15427.6 2.71

Here `get_array_huff`

includes everything the Huffman decoder has to do, including reading the headers, the Huffman table descriptions, validating the code lengths, setting up the tables, the fast decode loops, and the slower near-end-of-data tail decoders, and `huff_x64jaguar_loop`

is just the core optimized decode loop that handles most of the bulk data. “Calls” is the number of calls to either function and “count” is the number of bytes decoded. In this case, 250MB of data “only” decode about 178MB through the Huffman decoders; less than 250MB because we also do LZ-style dictionary compression (not covered in this series). So in this particular real-world use case (which very much does not have all the buffers already nicely in the L1/L2 caches as it’s running), we’re about 17% slower than our ideal average-case throughput estimate for this loop, which is still respectable. Also visible from these two lines is that the core decode loop where around 75-80% of the overall Huffman decoding time is spent with the rest being in setup or tail handling. That is fairly typical for our decoder implementations on various platforms. And you can infer from the figures given that our average array of bytes that use a single Huffman table is about 10k long. For this part, looking at averages is misleading: the distribution is quite wide. Many arrays are 60k+, but many others are well below 3k. The former spend more time in the core decoder (always nice since that part is flat), the latter spend a lot more time proportionately in header parsing and table initialization, which is why we can’t neglect it.

This last part is also why the Jaguar decoder (or, for that matter, all other decoders in Oodle) doesn’t bother with trying to set up tables to decode multiple symbols at once. This sounds enticing but it makes table setup more complicated and slower, and also adds many complications to the decoders because everything emits a variable number of bytes now. For example, using `PINSRB`

to group output bytes would not work if each decode step produce either 1 or 2 bytes; we would need to do individual stores for every symbol, and also increment the destination pointer after every decode. This would add at least 2 instructions per byte decoded (a store and a destination pointer add), probably more. When your single-byte-at-a-time decode kernel runs 3 instructions per byte to begin with and instruction count is a major limiting factor, adding 2 extra more instructions to maybe decode 2 bytes at a time isn’t all that tempting. We can decode another byte in an extra three instructions with the single-byte-at-a-time decoder *guaranteed*, and we don’t need to do any expensive extra work during table setup to do so. We also don’t build up a debt of 2 instructions every time we don’t manage to decode 2 symbols at once that we later have to make up just to break even. Multi-symbol decoding is an old standby when decoding from a single stream, because you can hide a lot of extra work in the shadow of that nasty long critical path, but decoding from multiple streams simultaneously gives you more productive ways to spend those CPU cycles and maybe even get to the holy grail of being primarily throughput bound.

And that’s all I got for this post! I’m not sure which of the other decoder variants I’ll tackle next. Apologies for the long delay, but writing these up takes more effort than my usual blog post, and I need to be in the right headspace to even try doing it.

### Footnotes

[1] If you’re familiar with Intel microarchitectures but not AMD, macro-ops are roughly comparable to what Intel calls “fused-domain micro-ops”, except they’re “even CISCier”, in the sense that even read-modify-write instructions like `add [rdx], rax`

count as a single macro-op, where Intel would split them into a add-from-memory and a store internally. I’ll also add that historically speaking describing AMD macro-ops as similar to Intel fused-domain uops is backwards; AMD has been using “fat” macro-ops as part of their x86 instruction decomposition for a long time, since at least the K7 (original Athlon) architectures. Intel added fused micro-ops (which are more restricted) years later to their microarchitectures when they realized that having the frontend deal with these “chunkier” units was beneficial.

[2] Typical L1D load-to-use times for contemporary designs were 4 or 5 cycles. For that matter, they still are at time of writing, 9 years after Jaguar-based HW hit the shelves. That said, the Jaguars target much lower clock rates than those other designs—the fastest Jaguar descendants ran a bit above 2GHz, other OoO x86 cores from the same timeframe have similar L1D sizes and were typically designed to hit 4GHz or above, so presumably the Jaguar cores can fit a lot more logic into a pipeline stage.

[3] To editorialize even more, the Jaguar’s nearly complete lack of sharp edges and huge performance cliffs was a welcome change after the PS3/Xbox 360 generation, where the main CPU cores seemed at times like they had nothing but.

[4] Presumably due to a problem with the way immediate operand encoding in x86 works. Specifying a bitfield position and width needs at least 6 + 6 = 12 bits to be generally useful on a 64-bit machine. But x86 immediate operands for instructions with 32- or 64-bit operands only come in two sizes: 8 bits and 32 bits. The former is not enough, the latter is very wasteful. 16-bit immediates only exist for 16-bit register instructions, and this part of the encoding would be quite expensive to add exceptions to. Interestingly AMD added a short-lived immediate-operand version of `BEXTR`

that indeed spends a full 32 bits, but this version of the instruction decodes to 2 macro-ops on Jaguar and was removed from Zen 3. It never shipped on any Intel CPU.

[5] All “big core” Intel CPUs that support `BEXTR`

(that I’m aware of, anyway) decode it into 2 uops. The same CPUs can usually eliminate register-register moves during register renaming and would also take 2 uops for a shift-and-mask combination, so `BEXTR`

has never been particularly interesting on them, since it offers at best some minor advantages in the frontend. It’s more interesting on the Intel Atom-derived cores such as the Alder Lake E-cores (which like Jaguar have a 1 uop, 1 cycle version) and AMD Zen cores, though.

[6] Yet another AMD/Intel difference: Intel has been renaming the 8-bit parts of registers such as AL and AH of RAX or R9B of R9 separately for a long time, meaning AL and AH can reside in separate locations in the physical register file. When referencing the merged halves as a single register (such as AX, EAX or RAX) later, Intel CPUs used to either stall (the “partial register stall” of long ago) or, later, started inserting merge operations that combined the results into the instruction stream. AMD has never done this, and instead seems to do partial updates on every operation. That means that on Intel CPUs, code that alternately updates AL and AH can execute as two independent dependency streams, whereas on AMD CPUs all these updates run in series. However, AMD never needs to insert any merge ops either, and has no special penalty for referring to the full register after a partial update, which is handy in our use case: on the Jaguar we’re always concerned with macro-ops through the front-end, so injecting merge operations would suck for us. Good that we don’t get any!

[7] Two streams so that we can alternate decoding from them, because (as seen in this post already and also mentioned in previous parts) sequential decoding from a single bitstream tends to result in very long dependency chains and is a bottleneck. Having pairs of forward and backwards streams with the two meeting in the middle allows us to store a single size in the bitstream to signal the start position of the reverse stream and the combined size as a single value. They also can, in some contexts, act as padding for each other. During decoding, ensuring the pointers don’t cross corresponds to a end-of-buffer check; we don’t know the exact size of either stream up front, but we know that the read pointers may never cross, and once decoding is done they should point to the same location.

[8] Arguably, we should have at least renumbered the streams and swapped labels of streams 1 and 2; but as is often the case, this was quickly prototyped, found to be working, then for a while we were concerned with other things such as buffer overflow safety and such, and by the time we realized it was pretty odd for stream 2 to appear in the bitstream before stream 1, it had long shipped to customers and was very much not worth a format-breaking change to rectify.

[9] Very convenient how the bitstream layout chosen works out so nicely for the most constrained of the important target platforms for Oodle.

[10] Yet another very-much-not-a-coincidence.

[11] This one actually is a coincidence, but I’ll take it.

[12] Yes, the decoder loop keeps 0x303 pinned in a register the entire time it’s running. Long-time friends will realize why this delights me; sometimes the universe just smiles at you like that. This one’s for you, Felix.

BC1-7 (variously also known by the older names S3TC, DXTC, RGTC and BPTC) are the standard compressed texture formats on PC. The newer BC6H and BC7 formats have precise specs that require bit-exact decoding, which means that encoders know exactly what results they’re going to get. The older BC1-5 are a lot more loosely specified, which has been a source of annoyance and problems for BCn encoder authors for some time.

While working on Oodle Texture, I did some experimentation trying to figure out what various GPU hardware decoders actually did. I did get good results for BC1-3 but BC4 and BC5 turned out to be a bit more troublesome. More recently, my colleague Sean Barrett spend some more time on this and we managed to get bit-exact equivalent expressions for Intel and NVidia hardware, but not AMD, where we were off by the last bit on some 32-bit float results in a small number of cases. This is completely irrelevant in practice but it bothered me, so last weekend I spent some more time trying to figure out those results too. I now have satisfying models for all three vendor decoders which match bit-exactly in my tests. That level of precision is unlikely to ever matter to anyone, but along the way there are several common mistakes in BC1-5 implementations and misunderstandings about the format to clear up, so it seems worth it to do a write-up.

### Overview

BC1-7 encode fixed-size blocks of 4×4 pixels to a fixed-size output: 64 bits (8 bytes) per block for BC1 and BC4, 128 bits (16 bytes) for all the other formats. The basic idea is to approximate the colors of pixels in a block by a small number of colors spaced evenly along one or more line segments in RGBA space. Per block, we send the endpoints of the line segment (quantized to a relatively low number of bits), plus an index per pixel that selects which of the colors along the line segment to use. These indices are usually between 2 and 4 bits each. Additionally, almost all of these formats have multiple modes that distribute the values slightly differently. Let’s briefly go over each format in turn:

BC1 sends color endpoints in RGB565 encoding (5 bits for red and blue, 6 bits for green). Somewhat unfortunately, this choice of encoding contains no exact grayscale values other than black and white, which means that slow grayscale gradients in this format tend to oscillate between areas tinged green and areas tinged purple. There are two modes, both of which use a 4-color palette and 2 index bits per pixel to select the palette entry (2×16 bits for the endpoints plus 16×2 bits for 16 2-bit indices, one per pixel, makes 64 bits total). The more commonly used mode uses the two specified endpoint colors, plus another two interpolated colors located 1/3rd and 2/3rds of the way along the line segment between them respectively, for a palette of 4 colors (all 4 colors have 1.0 in the alpha channel). The second mode has only one interpolated color halfway between the two endpoints; the third color has R=G=B=A=0, “transparent black” (in some cases, this is considered a separate “BC1A” format, and stock “BC1” decodes as R=G=B=0 and A=1; this distinction is not available everywhere though). This was intended for use with 1-bit alpha textures. If whoever reads the texture doesn’t care about the alpha channel, this second mode can also be used to get a “free black” even when the line between endpoints goes nowhere near actual black.

BC2 is for textures with an alpha channel that needs more than 1 bit. It consists of what is basically a BC1 color block (except here, all blocks are always in the first mode with 4 interpolated colors; the secondary mode is not available) and 4 bits of alpha per pixel, stored in memory before the BC1-esque color bits. BC2 has existed since the original S3TC, but I’ve never actually seen it used in practice. My guess is that the main reason it’s still around is that it costs very little hardware if you already support BC1 and 3, so despite little practical use, it’s cheap enough to support that nobody has lobbied to get it deprecated.

BC3 is the other format for RGBA textures. Once again, it contains a BC1 color block that is locked into the 4-color interpolation mode, and a different 64-bit encoding for the alpha bits in front. This one specifies two endpoints in the A channel as 8-bit values, followed by 48 bits (3 bits per pixel) to select the palette index. Once again, there are two modes: the more common one has the 8 colors spaced evenly along the line segment, the secondary mode only spaces 6 points along the line and adds two special indices that always decode to 0.0 and 1.0 no matter what the endpoint values are. BC3 is one of the two primary formats used for RGBA texture data.

BC4, unlike the previous formats, is intended for single-channel textures. It is often described as “just a BC3 alpha block”, which is close but not quite true as we’ll see later. There are unsigned and signed variants; the unsigned variants use the regular 8-bit UNorm encoding for values between 0 and 1, the signed variants use 8-bit SNorm for values between -1 and 1. In the signed variant, the special indices in the secondary mode correspond to -1 and +1, respectively.

BC5 also has unsigned and signed variants. A BC5 block is really just two independent BC4 blocks, one for the red and one for the green channel.

BC6H is meant for HDR color data. It only encodes RGB channels, the alpha channel always decodes to 1. There are unsigned and signed variants. The format decodes to 16-bit half-float format and color interpolation works by essentially treating the 16-bit half-floats as a 16-bit sign-magnitude integer value and interpolating those linearly, which works decently for unsigned data, not so well for signed data for blocks containing sign changes. BC6H has 14 modes, sort of, but most of these work essentially the same except for allocating endpoint bits slightly differently, depending on how close the values are to each other. There are two “major” modes in the format, one which specifies 2 sets of line segments (plus a code denoting which pixels use which line segment, from a small set of supported patterns) and has 3 bits of index data per pixel, and one of which specifies just one line segment with 4 bits of index data per pixel. BC6H decoding is complicated but specified bit-exactly so I won’t be talking about it in the rest of this post.

BC7 is a more complicated version of BC3 that adds several more options. It has 8 modes, 4 of which are RGB-only (with A decoding to 1.0), the idea being that even textures with alpha often have large opaque regions, and it helps everything to have more options for coding these color-only regions well. The different modes in BC7 contain various combination of either 1, 2, or 3 different line segments (and corresponding sets of endpoints), once again with the assignment of pixels to line segments chosen from a small, fixed set; either one index per pixel (when RGB and A behave similarly) or two (when they do not); and various index pixel depth, plus several other features that I won’t get into here. BC7 decoding is also complicated but specified bit-exactly and won’t appear in the rest of this post.

### Common (software) mistakes

This is the part that’s probably the most applicable to the majority of readers: before we even get to specification details, many encoders and decoders make two small mistakes in how they implement the basics.

The first and most common one is about the different modes in the BC1 format. The way a BC1 encoder selects which mode it wants a block to use is by sending the coded endpoints in a different order. Note that both color endpoints get packed to RGR565 format. When the first endpoint, interpreted as an unsigned 16-bit integer, is larger than the second one, this selects the more common four-color mode. If the first endpoint is less than or equal to the second, that selects three-color + transparent black mode. Note that when both colors are equal, you have no choice; you always end up in three-color mode (where the 3rd palette entry is special). This is not much of a problem because both endpoints being equal is a degenerate case anyway; all three other palette entries are equal in that situation, so just avoid the third. In all other cases, when the colors you picked result in the wrong order, the encoder can just swap them around and adjust the indices to match. Linear interpolation is symmetric so this does not change the set of available colors. This much, essentially all implementations get right. The tricky part involves BC2 and BC3: BC2 and BC3 blocks *almost* contain regular BC1 blocks, except they don’t have any mode-switching. The color values in BC2/3 always use the four-color mode. (This is pointed out in all specifications of the format, but easy to miss.) Several encoders only ever use the four-color mode to begin with, and swapping endpoints into the order you would need for BC1 never hurts; just be aware that three-color mode is not an option for BC2 or 3, and encoding assuming you have it available will give the wrong results.

The second mistake involves BC4 and 5. BC4 encoders/decoders often take “it’s just like a BC3 alpha block” too far and actually encode or decode unsigned BC4 (or BC5) as BC3 8-bit data. This is not far off, but it’s an approximation. BC4 and BC5 actually decode somewhat differently than BC3 alpha does, with measurably different results and higher internal precision. For most textures this doesn’t matter too much, but the extra precision is definitely useful for content like normal maps. BC5 two-channel normal maps can get up to nearly 11 bits of precision in regions with small value range, as opposed to the at most 8 bits you would get out of regular R8G8B8A8 or BC3 alpha encoding. It’s highly contextual whether this matters or not, and especially for rough surfaces there’s usually not much difference, but it can be useful especially when rendering things with reflective, smooth surfaces like machine parts, cars and so forth.

### What do the specs say?

The premise of this post is that GPU decoders don’t exactly agree with each other, but of course they at least approximately agree or the formats would be useless in practice. Can we say something more about what “approximately agree” means? The two primary sources for this sort of thing in the PC space are the Khronos Data Format Specification and the D3D11 Functional Spec. The Khronos spec is in force for Vulkan and supplants the older OpenGL extensions saying the same thing; as of this writing, the current version is 1.3 and the description of the BC1 formats starts here. It short and to the point, but unfortunately fairly light on details and completely missing any compliance requirements; it specifies what results an ideal decoder should produce, but not how close practical implementations need to be to be considered good enough.

Our secondary primary source is the D3D11 Functional Spec (a descendant of the D3D10 spec and the D3D12 spec is given as differences from D3D11, so this is still in force). The relevant sections start here. In theory, GL/Vulkan devices could do something completely different, but in practice the same hardware is designed to work for both (for obvious reasons), so whichever spec has the more detailed requirements ends up determining the requirements that the hardware is actually trying to satisfy. And the D3D11 version of the spec is a lot more detailed; in particular it has the short section 19.5.2 “Error Tolerance” which I’ll quote verbatim:

Valid implementations of BC formats other than BC6H and BC7 may optionally promote or do round-to-nearest division, so long as they meet the following equation for all channels of all texels:

| generated - reference | < absolute_error + 0.03 *MAX( | endpoint_0 - endpoint_1 |, | endpoint_0_promoted - endpoint_1_promoted | )absolute_error is defined in the description of each format.

endpoint_0, endpoint_1, and their promoted counterparts have been converted to float from either UNORM or SNORM as specified in the Integer Conversion rules. Values that the reference decodes to 0.0, 1.0 or -1.0 must always be exact.

For BC6H and BC7, decompression hardware is required to be bit accurate; the hardware must give results that are identical to the decoder described in this specification.

The part about promotion is clarified in the next 2 subsections and boils down to implementations being allowed to do fixed-point computations at an intermediate wider bit width, as long as values are extended to that intermediate bit width as specified. For the rest, I’ll note that all formats under consideration permit endpoint_0=endpoint_1, in which case the error term reduces to `| generated - reference | < absolute_error`

, so the absolute error term (I’ll get to that in a minute) primarily affects how precise decoding of the endpoint values themselves is. This value is 1/255 for BC1-3, 1/32767 for the signed BC4/BC5 formats, and 1/65535 for unsigned BC4/5 (indicating that BC4/5 have higher internal resolution). The second, relative error term requires that implementations have to be within ±3% of the reference, which is really not a tight bound: even for usual 8-bit texels in a range of [0,255], ±3% works out to ±7 in the integer space (actually 7.65, but when only considering integers, we can round down here) in the worst case when we choose endpoint_0 so that it decodes 0 to and endpoint_1 to decode to 1, or vice versa. This relative error term matters most when the endpoints are far from each other in a given channel, and the rationale for it is that the BC1 quantization is pretty harsh to begin with (you only get 4 values to cover that range). An extra 3% is less than one fifth the expected quantization error from getting to use at most 4 colors, so in that sense, it’s not a big deal. The problem for encoders is that there are often many candidate encodings for a block that have similar error; more possible encodings gives us more chances to get overall errors low if we can manage to find just the right combination of endpoints and indices, including spacing endpoints much further apart than necessary given the range of values in the block because it makes one of the interpolated values land exactly where we need it. Having this relative error term in the mix means we have to play it safe here.

More to the point, having this level of allowed differences between hardware encoders also means that the same texture wil decode differently on different GPUs, even if we’re not trying to do anything tricky. Sometimes blocks legitimately have a large value range because that’s what those blocks’ contents are, and having large differences possible between decoder implementations is a bummer, because we don’t even get an accurate picture of what the error from encoding any particular texture in BC1-5 is; it depends on what GPU it’s decoded on!

The reason I originally looked into this for Oodle Texture was because I wanted tighter error bounds than just the spec requirements. While in theory there is a large space of possible decoder implementations permitted by the spec, in practice there is a very small number of GPU vendors active in the PC space, and once they’ve settled on a particular decoder implementation, they really have no reason to ever change it unless the spec changes again. So I did some testing, and as far as I can tell, indeed neither NVidia, AMD nor Intel seem to have touched their BCn decoder blocks in the last 10 years or so (and why would they?). That means that both in terms of the current marketplace and installed base, there’s mostly just 3 decoders we have to worry about; that excludes really old hardware as well as vendors with a tiny market share, but if nothing else it will give us an idea of what we should really expect.

### Probing what decoders do

Luckily, the BCn formats have many properties that make it fairly tractable to completely capture decoder behavior. For the BC1/BC2-3 color portion, the R, G, B channels are decoded independently from each other, and the pixels don’t influence each other. R and B endpoints are specified with 5 bits of precision and G with 6. That means we can look at all possible combinations of R/B endpoint values with a 128×128 pixel texture (32×32 blocks). In each block, we make sure to use each of the 4 indices at least once, and then write a compute shader that does a `Load`

/ `texelFetch`

operation for every pixel (so that filtering doesn’t confuse the results) and writes the result to a 32-bit float/channel texture. For G we need a texture with 256×256 pixels for 64×64 blocks. That’s a set of 3 small images that completely captures BC1 decoder behavior. For BC3 alpha and BC4 unsigned/signed we have 8-bit endpoints in a single channel, so we need an entire 1024×1024 pixels (256×256 blocks). For BC2 and BC5 we can do some quick tests just to confirm that they behave as expected (BC2: alpha is a 4-bit encoding that decodes as expected, BC5: same as two BC4 blocks, one for R and one for G).

Some more quick tests indicated that:

- As far as I can tell, on all three vendors tested (AMD, NV, Intel), BC1 internally decodes to 8-bit fixed point, which then, depending on whether the format is UNorm or UNorm_sRGB, gets converted to float either using the same conversion the GPU uses for regular 8-bit UNorm textures, or a 256-entry LUT (or maybe LUT + math combination?) used for sRGB->float conversions (hard to be sure, but tested by noting that all values obtained for 8-bit BC1 RGB channels appear in the list of 256 values output by regular UNorm or UNorm_sRGB conversions; if this is not what’s happening the observed behavior appears to be equivalent).
- On all three vendors, BC3 alpha appears to decode to 8-bit fixed point, once again matching the 8-bit UNorm->float conversions, which in turn for all 3 vendors exactly match the results of rounding “x/255” to the nearest floating-point value. Setting e.g. endpoint_0=0 and endpoint_1=1 (8-bit integer values) did
*not*result in any interpolated values inside the open interval (0,1/255). - On all three vendors, BC4 and BC5 decode to more than 8 bits of precision. For Intel and NVidia, all interpolated values for unsigned/signed BC4/5 formats occur on the list of decoded 16-bit UNorm and 16-bit SNorm values, respectively, and they are consistent with first decoding to 16-bit int then converting to float via the UNorm/SNorm conversions. On all three vendors, 16-bit UNorm decode as correctly rounded
`x/65535`

and 16-bit SNorm as correctly rounded`max(x,-32767)/32767`

, respectively. On AMD, the resulting floats do*not*all appear on the list of UNorm16/SNorm16 values; instead the decoder appears to use an internal 14-bit format, details below. Setting e.g. endpoint_0=0, endpoint_1=1 (integer byte values) for BC4 unsigned*does*result in interpolated values inside (0,1/255).

Furthermore, Ignacio Castaño wrote a 2009 blog post on how NVidia GPUs decompress BC1 (with mention of the relevant patent), and it was straightforward to verify that more recent NV GPUs (tested on GeForce 1080 and 2070 series devices) still appear to use the same logic. For Intel, testing was done on a late-2013 laptop with Haswell integrated GPU, and we contacted Tom Forsyth at Intel to confirm that the BCn decoder logic was still essentially unchanged and answer some questions regarding the exact BC4/5 decoder behavior. For AMD, testing was done with an old Radeon HD 7000 series GPU we had kicking around the office as well as game console devkits for the previous and current generation (GCN2 and RDNA2 variants, respectively), the latter on account of them already being on the network and set up for remote testing. (So no pre-GCN AMD GPU in the mix, not sure if that makes a difference).

### What all decoders seem to agree on

For all vendors, the endpoint_0 and endpoint_1 values in all formats (which always appear as palette entries 0 and 1, respectively) appear to be reproduced exactly the same. BC1 as written in the D3D11 functional spec first expands the endpoint values from 5 or 6 bits to 8 bits by replicating the top bits; all three vendors appear to do this or something equivalent, and then convert the result from 8-bit UNorm to float exactly. For BC4/5, likewise palette entries 0 and 1 always seem to exactly make the result of the respective correctly rounded UNorm8 or SNorm8 to float conversion, respectively. Furthermore the D3D11 functional spec requires (quoted above) that the values specified to decode to exactly 0, 1 or -1 indeed decode to these values in all implementations, and this appears to hold.

Taking these all together, this means that for BC1/BC2-3 color, the parts that decoders actually differ in are how exactly the colors 1/3rd and 2/3rds of the way (in four-color mode) or 1/2 of the way (in 3-color mode) are computed, and for BC3 alpha and BC4/5, how the 6 (respectively 4, depending on the mode) actually interpolated values are determined. I’ll cover these vendor by vendor, in order of increasing complexity (of description though not necessarily hardware).

### Intel GPUs

Intel’s decoders are the most straightforward to describe. BC1-3 color is computed in fixed point, from the endpoint values extended to 8 bits using bit replication, using the formula `((256-w)*a + w*b + 128) >> 8`

or something equivalent, i.e. 8-bit fixed-point linear interpolation with round-to-nearest. For the 1/2-of-the-way color in 3-color mode, the weight is w=128 (of course); in 4-color mode, the interpolation weights are w=85 (for the 1/3rd point) and w=171 (for 2/3rds), respectively. These weights are just `round(k*256 / 3)`

for k=1, 2.

BC3 alpha appears to use the weights `round(k*256/7)`

for k=1,…,6 in 6-interpolated color mode, and `round(k*256/5)`

for k=1,…,4 in 4-interpolated-color mode.

BC4, signed and unsigned both, takes the integer UNorm/SNorm endpoint values (for SNorm, -128 is an alternative encoding for -127, they mean the same thing), and use the 16-bit interpolation formula `t = ((65536-w)*a + w*b + 128) >> 8`

to get the interpolation result, where `w=round(k*65536/7)`

, k=1,…,6 in 6-interpolated-color mode, `w=round(k*65536/5)`

, k=1,…,4 in 4-interpolated-color mode.

For BC4/5 unsigned, the GPU then computes `t + (t >> 8)`

; the resulting integer is treated as a 16-bit UNorm value and finally converted to float. For signed, the GPU computes `abs(t) + (abs(t) >> 7) + (abs(t) >> 14)`

with the sign of the original t (i.e. if t was negative, this result is then negated). This number is treated as 16-bit SNorm value and converted to float.

### AMD GPUs

AMDs GPUs still behave in a way that is straightforward to describe, with a wrinkle involving BC4/5. But let’s start with BC1-3: BC1-3 is computed in fixed point, from the endpoint values extended to 8 bits using bit replication, using the formula `((64-w)*a + w*b + 32) >> 6`

or something equivalent, i.e. 6-bit fixed-point linear interpolation with round-to-nearest. This is the exact same computation that is prescribed for interpolation in BC7 decoders, so it likely uses the same hardware. In BC1 3-color mode halfway points, we have w=32 (again, no surprise), and for 4-color mode the weights are 21 and 43, again matching BC7 and equivalent to `w=round(k*64/3)`

for k=1, 2.

BC3 alpha uses the weights `round(k*64/7)`

for k=1,…,6 in 6-color mode, `round(k*64/5)`

for k=1,…4 in 4-color mode.

BC4/5, signed and unsigned both, tales the integer UNorm/SNorm endpoint values (for SNorm, after the -128 -> -127 correction) and uses the same interpolation formula, just without the shift or rounding term: `t = (64-w)*a + w*b`

, again suggesting that it’s probably all using the same hardware. The resulting product fits in a 14-bit integer. Note that by construction, t is either in [-127*64,127*64] for signed values or [0,255*64] for unsigned ones.

For BC4/5 SNorm, this result is then converted to float by effectively doing a correctly rounded division by 127*64 = 8128. This turns out to be somewhat more complicated than the straight division by 127 used for 8-bit SNorm values, mainly due to the extra 6 bits in the input. For BC4/5 UNorm, the result is converted to float by doing an *almost* correctly rounded division by 255*64 = 16320. Once again, this would ordinarily be a bit more complicated than the division by 255 used for UNorm8->float, again due to the extra 6 bits of input; however, here, for whatever reason, be it just a bug or a small hardware optimization that is not clear to me, a very small subset of values (13 total of the 16077 unique possible results for all combinations of inputs) is rounded incorrectly.

This result differs in final mantissa bit of the 32-bit float values only and it truly does not matter for any use case I can imagine since actual BC4/5 encoding-induced errors are many orders of magnitude larger than this, but in case anyone cares about reproducing the results exactly, my hacky conversion function that manually detects the incorrectly-rounded cases is here. (The version for signed decoding is unnecessary because that case is correctly rounded. I thought implementing it might give me an idea of what’s going on with the unsigned version, but no dice.)

OK, so both Intel and AMD are reasonably easy to explain, and probably just reuse hardware that is already there; it’s worth pointing out that BC6H also uses the same interpolation formula as BC7 and works on signed 17-bit integers internally, so there’s probably at least 3 color channels worth of 6×17-bit multipliers lying around somewhere in the decoder path. I was initially very surprised about the 16-bit weights used by the Intel version, but if they just swap the operands between BC4/5 and BC6 and build something like a 8×17-bit multiplier instead, that might be totally fine. (Or maybe they even have their own wider multiplier for BC4/5. Don’t know, I’m just guessing at likely implementation options from the observed behavior here.)

### NVidia GPUs

The formulas used by NVidia GPUs are nothing like the other two vendors, and do not look like they can share any hardware with BC6H/BC7 decoding at all. They do, however, very much look like they’re designed to share as much as possible between BC1-5, and by someone who is dead set to spend as little logic on it as they can reasonably get away with. My best guess is that this design probably predates BC6H/BC7; BC6 forces you into having at least a few rectangular multipliers in the BCn decoder block and once you have them, there’s not much reason to avoid using them.

Anyway, the decoders we’ve seen before use simple linear interpolation formulas with round-to-nearest, or at least give results that are equivalent to them. The NVidia decoders use different formulas for the 5-bit-endpoint red/blue channels in BC1 than they use for the 6-bit-endpoint green channel, and then green/alpha channel logic seems to be the same, with BC4 and BC5 using just the green/alpha channel decoders (and ignoring red/blue). This in turn means that red/blue only ever work from 5-bit inputs, whereas green needs to support both 6-bit and 8-bit endpoints (the latter for use in BC5), and alpha always has 8-bit endpoints (for BC3).

Let’s do the simpler red/blue channels first. Ignacio’s 2009 article gives the formulas for palette entries 0/1 as well, but those are just equivalent ways of writing the 5->8 bit expand using bit replication. The interesting palette entries are 2 and 3, and those are (again, not following Ignacio’s presentation here, but this gives bit-equivalent results):

// red/blue, four-color mode; a, b are 5-bit col2 = ((2*a + b) * 22) >> 3; col3 = ((2*b + a) * 22) >> 3;

The rationale here is that expanding from 5 bits to 8 is effectively multiplying by 255/31, and if we compute (2*a + b) we need to divide the result by 3; the final fraction of 255/(31*3) = 2.741935 is quite close to 11/4, so that’s what it’s approximated by. So this calculation folds bit-expansion and linear interpolation into one and can work with slightly smaller intermediate values as a result (the multiplications by known constants are just shifts and adds).

For red and blue, the halfway point in three-color mode is computed similarly:

// red/blue, three-color mode; a, b are 5-bit col2 = ((a + b) * 33) >> 3;

This is a decent approximation, but somewhat annoyingly, it just behaves differently from the more straightforward AMD/Intel ones; folding the scaling and lerp into one makes the weight constants slightly smaller but also makes the overall rounding behavior different, which is awkward. Oh well.

Anyway, green and alpha is when the fun really starts. Both green and alpha do use a more explicit linear interpolation formulation with round-to-nearest, but the values being multiplied by and the overall way things are put together are still decidedly odd. But let’s start at the beginning. First, BC1 green, four-color mode (again, presentation different from Ignacio’s, but equivalent)

// green, four-color mode, a, b are 6-bit ae = (a << 2) | (a >> 4); // bit expand be = (b << 2) | (b >> 4); diff = be - ae; scaled_diff = 80*diff + (diff >> 2); col2 = ae + ((128 + scaled_diff) >> 8); col3 = be + ((128 - scaled_diff) >> 8);

and in three-color mode:

// green, three-color mode, ae, be as above diff = be - ae; scaled_diff = 128*diff + (diff >> 2); col2 = ae + ((128 * scaled_diff) >> 8);

Here, the effective factor is something like 80.25/256, which is pretty far off (the correct fraction for a real 1/3rd here would be around 85.33/256).

On to BC4/5. Here, our interpolation factor are multiples of 1/7th and 1/5th, respectively. NVidia conceptually splits this into two parts: we first multiply by the small integer numerator, and then by a fixed-point approximation of the reciprocal of 1/5 and 1/7, and all of that is sandwiched into our interpolation formula above.

The set of numbers we might want to multiply by here are thus 0,1,2,3,4,5,6, and 7. For one final trick, note that instead of saying that we want “1/5th of the way from a to b”, we might equivalently say (by symmetry) “4/5ths of the way from b to a”, and NV decoder uses this to avoid doing any multiplies in the “multiply by numerator” portion: of the values 0 to 5 we need for the multiples of one fifth, 0 is 0 (which is very cheap to multiply by indeed) and 1, 2, and 4 are powers of 2 (shifts). Of the remaining numbers, 5=5-0 (so instead of going 5/5ths from a to b, we can go 0/5ths from b to a) and 3=5-2 (turn 3/5ths from a to b into 2/5ths from b to a). For the multiples of 1/7th, it works similarly: 0,1,2,4 are already easy, and for the remaining ones, 7-3=4 (which is easy), 7-5=2 (easy), 7-6=1 (easy), and 7-7=0 (easy). So the HW here can switch “a” and “b” around so that the factor is either 0 or a power of 2, choosing which end of the line segment to interpolate from.

Either way, the interpolation procedure for BC4 in 6-interpolated-color mode works out to:

// BC4; a, b are 8-bit, 6-interp mode ae = expand_to_16(a); be = expand_to_16(b); diff = b - a; // b, a not be, ae! col0 = ae; col1 = be; col2 = ae + 36*(diff); // 1/7 col3 = ae + 36*(diff<<1); // 2/7 col4 = be - 36*(diff<<2); // 3/7 col5 = ae + 36*(diff<<2); // 4/7 col6 = be - 36*(diff<<1); // 5/7 col7 = be - 36*(diff); // 6/7

where `expand_to_16(x) = (x<<8) | x`

for unsigned formats and the way less nice `expand_to_16(x) = copysign((abs(x) * ((1<<14) + (1<<7) + 1) >> 6, x)`

(aside from the sign manipulation this is just shifts and ORs, I just wrote it more compactly).

Here 36=32+4 is a convenient number with just 2 set bits which makes it easy to multiply by with just shifts and adds. The actual factor we would want to approximate for the reciprocal here is 65535/(255 * 7) is about 36.714, so this is not terrible. In 4-interpolated-color mode for BC4, we get:

// BC4; a, b are 8-bit, 4-interp mode ae = expand_to_16(a); be = expand_to_16(b); diff = b - a; // b, a not be, ae! col0 = ae; col1 = be; col2 = ae + 48*(diff); // 1/5 col3 = ae + 48*(diff<<1); // 2/5 col4 = be - 48*(diff<<1); // 3/5 col5 = be - 48*(diff); // 4/5

48 = 32 + 16 is again just 2 set bits, the factor to approximate is 65535/(255*5) = 51.4, so that one’s somewhat dubious, but oh well, it is what it is.

BC5 is just two of those. Finally, for BC3 alpha, we get variation of the BC4 unsigned formulas, but this time we want a 8-bit result not a 16-bit result, so the mystery shift along with the rounding bias is back. Let `scale_and_round(k,d) = ((k*d + (d >> 3) + 128) >> 8)`

, then

// BC3 alpha; a, b are 8-bit, 6-interp mode diff = b - a; col0 = a; col1 = b; col2 = a + scale_and_round(36, diff); col3 = a + scale_and_round(36, diff*2); col4 = b + scale_and_round(36, -diff*4); col5 = a + scale_and_round(36, diff*4); col6 = b + scale_and_round(36, -diff*2); col7 = b + scale_and_round(36, -diff);

and finally:

// BC3 alpha; a, b are 8-bit, 4-interp mode diff = b - a; col0 = a; col1 = b; col2 = a + scale_and_round(48, diff); col3 = a + scale_and_round(48, diff*2); col4 = b + scale_and_round(48, -diff*2); col5 = b + scale_and_round(48, -diff);

And that, at long last, is it.

### Conclusions

If you got here, you have no-one to blame but yourself. If you’re one of the handful of people actively working on BCn encoders (hi!), this is probably actionable information to you. If you read this far out of curiosity or because you want to know exactly what you’re getting on the various GPUs, hopefully now you know. And in the not unlikely scenario that you are someone from the future with a question about weird BCn decoding behavior that I referred to this post, well, hopefully this makes it a bit clearer what’s going on. It’s also hopefully usable if you want a software model of the given three GPU vendors’ decoders without having to juggle multiple machines or video cards around all the time.

If you’re an encoder author and not sure what to do about this, one of the main takeaways from this for me was that even though the D3D spec formulation of BC1 decoding explicitly works on 8-bit integer values and divides by 2 or 3 with truncation, and the Khronos spec treats everything as real numbers with unspecified precision, both Intel and AMD actually use the (somewhat nicer) division with round to nearest, and NVidia uses it as well in the green channel, while in the red/blue channels “it’s complicated” (because of the scaling factor folded into the linear interpolation). My older rygdxt (then turned into stb_dxt) used round-to-nearest, Oodle Texture used to default to the D3D spec behavior of truncating, but between AMD and Intel explicitly rounding and NVidia being somewhere in the middle, the next Oodle Texture release (2.9.5) will probably switch to round to nearest.

Also, instead of using the same weights for red, green, and blue, we are currently planning to use 85/256ths as our new approximation to 1/3 (we used to divide by 3 exactly) in the red and blue channels, and 83/256ths in the green channel, in the spirit of making everyone equally unhappy, to land somewhere between Intel’s choice of 85/256ths, AMDs of 21/64ths = 84/256ths, and NVidias sort-of 80.25/256th’s. It was chosen as a convenient dyadic fraction with not too many significant digits that had OK error estimates over the relevant range for all three vendors; somewhat higher for NV than for AMD or Intel, but they chose this crappy approximation themselves, so that’s on them. (For the halfway point, we use 1/2 with round to nearest, same as AMD and Intel.)

The main reason for this change is that Oodle Texture used to solely encode for the a D3D reference decoder that does not actually ship anywhere; we worked to stay within the relevant tolerances (the 3% relative error etc.) to avoid truly nasty surprises, but at the end of the day nobody ships against the D3D reference rasterizer, they ship on actual GPUs. The new approximate decoder model is not particularly complicated (the main weirdness is the different weights for G and R/B) and targeting it made the errors as computed with the accurate hardware models go down across the board, even though the error as measured against the D3D reference increased (it would have to, seeing as that was the sole metric we used to target before). As of this writing we haven’t done anything about BC3 alpha or BC4/5 yet, but we might change our reference/target decoder for that as well in the foreseeable future.

All three decoders are well within the error bounds required by the D3D spec (and Khronos doesn’t have any particular requirements to begin with). In an ideal world, we’d just pick one of these options (doesn’t matter which) and standardize on it; we have standardized the behavior of pretty much all other numbers formats that GPUs work with and that’s been good for everyone. It wouldn’t do much good for any new software for a while, as long as the multiple HW variants are around, but I don’t see texture memory bandwidth concerns or the BCn formats going away any time soon, the more complex BC6-7 and ASTC already nail down their respective decoder behavior, and it sure would be nice to put the current mess behind us *eventually*, even if it’s another 10 years from now.

Finally, the formulas and explanations given here are collated from data collected over several runs and experiments conducted over a period of one and a half years or so; I tried to unify the presentation somewhat for this blog post but I probably introduced mistakes along the way. If you find any problems, it’s probably a bug. Ping me and I’ll try to fix it.

Many thanks to Sean Barrett who did a lot of the legwork to figure out the last few details of Intels and AMDs BC4/5 decoders, Tom Forsyth who crucially told us that yes indeed the Intel decoders for BC4/5 used full 16-bit weights (something I dismissed as implausible for a good long while), and Ignacio Castaño for his 2009 write-up on the NVidia BCn decoders.

Last time I covered the big picture, so we know the ground rules for the modular entropy coding layer in Oodle Data: bytestream consisting of several independent streams, pluggable algorithms, bytes in and bytes out, and entropy decoding is done as a separate pass, not inlined into the code that consumes the decoded data.

There are good and bad news here: the good news is that the encoders and decoders need to do just one thing, they have homogeneous data to work with, and they ultimately boil down to very simple operations in extremely predictable loops so they can pull out all the stops when it comes to optimization. The bad news is that since they are running on their own, if they’re inefficient, there’s no other useful work being accomplished when the decoders are stalled. Careful design is a necessity if we want to ensure that things go smoothly come decode time.

In this post, I’ll start by looking at Huffman coding, or rather, Huffman decoding; encoding, we won’t have to worry about much, since it’s an inherently easier problem to begin with and the design decisions we’ll make to enable fast decoding will also make encoding faster. Required reading for everything that follows will be my earlier posts “A whirlwind introduction to dataflow graphs” as well as my series “Reading bits in far too many ways” (part 2, part 3). I will assume the contents of these posts as known, so if something is unclear and has to do with bit IO, it’s in there.

### Huffman (de)coding basics

Huffman coding is a form of variable-length coding that in its usual presentation is given a finite symbol alphabet *Σ* along with a positive count *f _{s}* (the

*frequency*) of how often each symbol

*s*occurs in the source. Huffman’s classic algorithm then assigns variable-length binary codes (meaning strings of 0s and 1s) of length

*c*to each symbol that guarantee unique decodability and minimize the number of bits in the coded version of the source; we want to send the symbols from the source in fewer bits, and the way to do so is to assign shorter codes to more frequent symbols. To be precise, it solves the optimization problem

_{s}(i.e. minimize the number of bits sent for the payload, because each *c _{s}*-bit code word occurs

*f*times in the source) for integer

_{s}*c*subject to the constraints

_{s}The first of these constraints just requires that code lengths be positive integers and is fairly self-explanatory, the second requires equality in the Kraft-McMillan inequality which guarantees that the resulting code is both complete and uniquely decodable; in fact, let’s just call the quantity the “Kraft defect”. When *K*<0, the code has redundancy (i.e. there free code space left for assignment), for *K*=0 it is complete, and *K*>0 is over-complete (not uniquely decodable or using more code space than is available).

Huffman’s algorithm to perform this construction is a computer science classic, intuitive, a literal textbook example for greedy algorithms and matroids, and it even gives us not just the sequence of code lengths, but an actual code assignment that achieves those code lengths. Nevertheless, actual Huffman codes are of limited use in applications, primarily due to one big problem: the above problem places no upper bound on the code lengths, and indeed a *n*-symbol alphabet can reach a maximum code length of up to *n*-1 given the right (adversarial) frequency distribution that produces a fully left- or right-leaning tree.^{1} This is bad for both encoders and decoders, which generally *have* to adhere at the very least to limits derived from the choice of target platform (mostly register width) and bit I/O refill strategy (see my long series on the topic mentioned earlier), although it is often desirable to enforce much tighter limits (more on that below). One way to accomplish this is by taking the results of the standard Huffman construction and using heuristics and local rewrite rules on either the tree or the code lengths to make the longest codes shorter at the expense of making some other (more frequent) codes longer than they would otherwise have to be. Another easy heuristic option is to just use the regular Huffman construction and if codes end up over-long, divide all symbol frequencies by 2 while rounding up (so that non-zero symbol frequencies never drop to zero) and keep trying again until success. This works because it keeps the frequencies of more popular symbols approximately correct and in the right relation to each other, while it flattens out the distribution of the less popular symbols, which all end up eventually having symbol frequencies of 1, resulting in a uniform symbol distribution at the bottom end. A non-heuristic, actually optimal option is to directly solve a constrained version of the above optimized problem that adds a maximum codeword length constraint. The standard solution to the resulting combinatorial optimization problem is called the package-merge algorithm.

A related observation is that there is really no particular reason to prefer the code assignment produced by Huffman’s algorithm. Even for cases where no length-limiting is required to fall within usual limits, there are usually many possible code assignments that all result in the exact same bit count for the payload portion; and given an assignment of code length to symbols, it is easy to come up with a codeword assignment that matches those code lengths. In short, the per-symbol code lengths make a difference to how efficient a given code is, but we can freely choose which codes we assign subject to the length constraint, and can do so in a way that is convenient for us.

This has the effect of deciding on a single *canonical* Huffman tree for each valid sequence of code lengths (valid here meaning *K*=0), even in cases where Huffman’s algorithm generates distinct trees and code assignments that happen to have the same code lengths. The actual coded bits are just part of the puzzle; an actual codec needs to transmit not just the coded bits, but also enough information for the decoder to work out what codes were assigned to which symbols, and specifying a canonical Huffman code requires less information than its non-canonical counterpart would.^{2} There are several options as to how exactly we assign codes from code lengths, the most common of which is to assign code words sequentially to symbols in lexicographic order for the (*c _{s}*,

*s*) pairs—that is, code lengths paired up with symbol values, ordered by ascending code length and then symbol value.

^{3}

Using canonical, length-limited codes is a common choice. For example, Deflate (the algorithm used in zlib/ZIP) uses canonical codes limited to a maximum length of 15 bits, and JPEG limits the maximum code length to 16 bits.

In principle, such a data stream can be decoded by reading a bit at a time and walking the Huffman tree accordingly, starting from the root. Whenever a leaf is reached, the corresponding symbol is emitted and the tree walk resets back to the root. However, processing data one bit at a time in this fashion is very slow, and implementations generally avoid it. Instead, decoders try to decode many bits at once. At the extreme, when all code lengths are below some upper bound *N*, a decoder can just “peek ahead” *N* bits into the bitstream, and prepare a table that contains the results of the tree walks for all 2^{N} possible bit patterns; table entries store which symbol was eventually reached and how many bits were actually consumed along the way. The decoder emits the symbol, advances its read cursor by the given number of bits, and reads the next symbol.

This method can be much faster and is easy to implement; however, when *N* is large, the resulting tables will be too, and there are steep penalties whenever table sizes start exceeding the sizes of successive cache levels, since the table lookups themselves are essentially random.^{4} Furthermore, most sources being compressed aren’t static, so Huffman tables aren’t used forever. A typical Huffman table will

last somewhere between 5 and 100 kilobytes of data. For larger *N*, it’s easy for table initialization to take as long, or longer, than actual bitstream decoding with that table, certainly for shorter-lived Huffman tables. The compromise most commonly used in practical implementations is to use a multi-level table scheme, where individual tables cover something like 9 to 12 bits worth of input. All the shorter symbols are decoded in a single table access, but the entries for longer symbols don’t contain a symbol value; instead, they reference a secondary table that resolves the remaining bits (there can be many such secondary tables). The point of Huffman coding is to assign shorter codes to more common symbols, so having to take a secondary table walk is relative uncommon. And of course there are many possible variants of this technique: table walks can do more than 2 levels, and in the other direction, if we have many shorter codes (say at most 4 bits long), then a single access to a 512-entry, 9-bit table can resolve two symbols (or even more) at once. This can accelerate decoding of very frequent symbols, at the expense of a more complicated decoder inner loop and slower table build step.^{5}

### Huffman coding in Oodle

Our older codecs, namely Oodle LZH and LZHLW, used a 15-bit code length limit and multi-level tables, both common choices. The problem with this design is that every symbol decode needs to check whether a second-level table lookup is necessary; a related problem is that efficient bit IO implementations maintain a small buffer that is periodically refilled to contain somewhere between 23 and 57 bits of data (details depend on the choice of bit IO implementation and whether we’re on a 32-bit or 64-bit platform, see the bit-IO series I referenced earlier). When we know that codes are guaranteed to be short enough, we can decode multiple symbols per refill. This is significant because refilling the bit buffer is about as expensive as decoding a symbol itself; being able to always decode two symbols per refill, as opposed to one, has the potential to reduce refill overhead from around 50% of decode time to about 33% on 32-bit targets!

When Charles first prototyped what later turned into Kraken, he was using our standard bit IO implementation, which uses byte-wise refill so post-refill, the 32-bit version guarantees at least 25 usable bits, while for in 64-bit the number is 57 bits, all typical limits.

Therefore, if codes are limited to at most 12 bits, the 32-bit decoder can sustain a guaranteed two symbols per refill; for 64-bit, we can manage 4 decodes per refill (using 48 out of 57 bits). As a side effect, the tables with a 12-bit limit are small enough that even without a multi-level table scheme, everything fits in the L1D cache. And if codes are limited to 11 bits, the 32-bit decoders still manage 2 symbols per refill, but the 64-bit decoders now go up to 5 decodes per refill, which is a nice win. Therefore, the newer Oodle codecs use Huffman codes length-limited to only 11 bits, which allows branch-free decoding of 2 symbols per refill on 32-bit platforms, and 5 symbols per refill on 64-bit targets.^{6}

The other question is, of course, how much limiting the code lengths aggressively in this way costs in terms of compression ratio, to which the answer is: not much, although I don’t have good numbers for the 11-bit limit we eventually ended up with. For a 12-bit limit, the numbers in our tests were below a 0.1% hit vs. the original 15-bit limit (albeit with a note that this was when using proper package-merge, and that the hit was appreciably worse with some of the hackier heuristics). We ended up getting a fairly significant speed-up from this choice, so that slight hit is well worth it.

With these simplifications applied, our core Huffman decoder uses pre-baked table entries like this:

struct HuffTableEntry { uint8_t len; // length of code in bits uint8_t sym; // symbol value };

and the decoder loop looks something like this (32-bit version):

while (!done) { bitbuf.refill(); // can decode up to 25 bits without refills! // Decode first symbol { intptr_t index = bitbuf.peek(11); HuffTableEntry e = table[index]; bitbuf.consume(e.len); output.append(e.sym); } // Decode second symbol { intptr_t index = bitbuf.peek(11); HuffTableEntry e = table[index]; bitbuf.consume(e.len); output.append(e.sym); } }

Note that with a typical bit buffer implementation, the `peek`

operation uses a single bit-wise AND or shift instruction; the table access is a 16-bit load (and load address calculation, which depending on the target CPU might or might not be an addressing mode). The bit buffer `consume`

operation usually takes an integer shift and an add, and finally the “append” step boils down to an integer store, possibly an integer shift to get the symbol value in the right place prior to the store (the details of this very much depend on the target ISA, I’ll get into such details in later posts in this series), and some pointer updating that are easily amortized if the loop is unrolled more than just twice.

That means that in terms of the work we’re doing, we’re already starting out with something around 5-7 machine instructions per symbol, and if we want good performance, it matters a lot what exactly those instructions are and how they relate. But before we get there, there’s a few other details about the resulting dataflow to understand first.

The first and most important thing to note here is what the critical path for the entire decoder operation is: everything is in serialized through the bit buffer access. The `peek`

operation depends on the bit buffer state after the previous `consume`

, the table access depends on the result of the `peek`

operation, and the `consume`

depends on the result of the table access (namely the `len`

field) which completes our cycle. This is the critical dependency chain for the entire decoder, and the actual symbol decoding or output isn’t even part of it. Nothing in this loop depends on or cares about `e.sym`

at all; its load and eventual store to the output buffer should be computed eventually, but there’s no rush. The code length determination, on the other hand, is our primary bottleneck. If the critical computation for `peek`

and `consume`

works out to a single 1-cycle latency integer instruction each (which it will turn out it does on most of the targets we’ll see later in this series), then everything hinges on how fast we can do that table access.

On current out-of-order CPUs (no matter the ISA), the answer usually turns out to be 4 or 5 clock cycles, making the critical path from one symbol decoded to the next either 6 or 7 clock cycles. This is very bad news, because as just pointed out, we’re executing around 5-7 instructions per symbol decoded, total. This means that, purely from the dependency structure, we’re limited to around 1 instruction per cycle average on CPUs that can easily sustain more than 3 instructions per cycle given the right code. Viewed another way, in those 7 cycles per symbol decoded, we could easily run 21 independent instructions (or even more) if we had them, but all we do is something like 6 or 7.^{7}

The obvious solution here is to not have a single bit buffer (or indeed bitstream), but several at once. Once again we were not the first or only ones to realize this, and e.g. Yann Collet’s Huff0 uses the same idea. If the symbols we want to decode are evenly distributed between multiple bitstreams, we can have multiple one-symbol-at-a-time dependency chains in flight at once, and hopefully get to a state where we’re mostly (or even completely) limited by instruction throughput (i.e. total instruction count) and not latency.

Our high-level plan of attack will be exactly that. Huff0 uses 4 parallel bitstreams. For reasons that will become clearer in subsequent parts of this series, Huffman coding in the newer Oodle codecs has two different modes, a 3-bitstream mode and a 6-bitstream mode. Using more bitstreams enables more parallelism but the flipside is that it also requires more registers to store the per-bitstream state and is more sensitive to other micro-architectural issues; the actual counts were chosen to be a good compromise between the desires of several different targets.^{8}

For the rest of this post, I’ll go over other considerations; in the next part of this series, I plan to start looking at some actual decoder inner loops.

### Aside: tANS vs. Huffman

The ANS family of algorithms is the new kid on the block among entropy coders. Originally we meant to use tANS in the original Kraken release back in 2016, but the shipping version of Kraken doesn’t contain it; it was later added as a new entropy coding variant in 2018 along with Leviathan, but is rarely used.

However, one myth that was repeatedly claimed around the appearance of tANS was that it is “faster than Huffman”. While it is true that some tANS decoders are faster than some Huffman decoders, this is mostly an artifact of other implementation choices and the resulting dependency structure of the decoder, so I want to spend a few paragraphs explaining when and why this effect appears, and why despite it, it’s generally not a good idea to replace Huffman decoders with tANS decoders.

I’m not going to explain any of the theory or construction of tANS here; all I’ll note is that the individual per-symbol decode step uses tables similar to a Huffman decode table, and paraphrasing old code from Charles here, the key decode step ends up being:

HuffTableEntry e = table[state]; state = next_state[state] | bitbuffer.get(e.len); output.append(e.sym);

In essence, this is a “lookahead Huffman” decoder, where each stream keeps *N* bits of lookahead at all times. Instead of looking at the next few bits in the bit buffer to figure out the next code length, we already have enough bits to figure out what the next code is stored in our `state`

, and after decoding a symbol we shift out a number of bits corresponding to the length of the code and replenish the bottom of our state from the bitstream.^{9}

The bits come from a different place, and we now have the extra `state`

de-facto shift register in the mix, so why would this run faster, at least sometimes? Again we need to look at the data dependencies.

In the regular Huffman decode, we had a single critical dependency chain with the peek/table lookup/consume loop, as well as the bit-buffer refill every few symbols. In this version of the loop, we get not one but two separate dependency chains:

- The
`state`

update has a table lookup (two in fact, but they’re independent and can overlap), the bit buffer`get`

operation to get the next set of low bits for the next state (usually something like an integer add and a shift), and an OR to combine them. This is not any better than the regular Huffman decoder (in fact it’s somewhat worse in multiple ways), but note that the high-latency memory loads happen*before*any interactions with the bit buffer. - The bit buffer update has the periodic refill, and then the
`get`

operations after the`state`

table lookups (which are a combined`peek`

and`consume`

).

The crucial difference here is that in the regular Huffman decoder, most of the “refill” operation (which needs to know how many bits were consumed, load extra bytes and shift them into place, then do some accounting) is on the same dependency chain as the symbol decoding, and decoding the next symbol after a refill can make no meaningful progress until that refill is complete. In the TANS-style decoder, the refill can overlap with the state-dependent table lookups; and then once those table lookups complete, the updates to the bit buffer state involve only ALU operations. In throughput terms the amount of work per symbol in the TANS-style decoder is not better (in fact it’s appreciably worse), but the critical-path latency is improved, and when not using enough streams to keep the machine busy, that latency is all that matters.

A related technique that muddles this further is that ANS-based coders can use interleaving, where (in essence) we just keep multiple `state`

values around, all of which use the same underlying bitstream. This gives us twice the parallelism: two TANS decoders with their own `state`

are independent, and even making them share the same underlying bitstream keeps much of the advantage, because the high-latency portion (the table lookup) is fully overlapped, and individual reads from the bit buffer (once all the lengths are known) can usually be staggered just 1 cycle apart from each other.

A 2× interleaved tANS decoder is substantially faster than a single-stream Huffman decoder, and has lower incremental cost (a single extra integer state is easier to add and track than the registers required for a second bitstream state). However it also does more work per symbol than a 2-stream Huffman decoder does, so asymptotically it is slower—if you ever get close to the asymptote, that is.

A major disadvantage of tANS-style decoding is that interleaving multiple states into the same bitstream means that there is essentially only a single viable decoder implementation; bits need to be read in the same way and at the same time by all decoders to produce the same results, fast implementations need larger and more expensive to set up tables than a straight Huffman decoder, plus other overheads in the bitstream. In contrast, as we’ll see later, the 6-stream Huffman

coding variant is decoded on some platforms as two 3-stream blocks, which would not easily map to a tANS-style implementation. In the end, the lower asymptotic performance and decreased flexibility is what made us stick with a regular Huffman decoder in Oodle Data, because as it turns out we can get decent wins from decoding the bitstreams somewhat differently on different machines.

### Huffman table initialization

As mentioned several times throughout this post, it’s easy for Huffman table initialization to become a major time sink, especially when tables are large, used for just a short time, or the table builder is unoptimized. Luckily, optimizing the table builder isn’t too hard, and it also gives a nice and intuitive (to me anyway) view of how canonical Huffman codes work in practice.

For now, suppose that decoding uses a MSB-first bit packing convention, i.e. a bit-at-a-time Huffman decoder would read the MSB of a codeword first and keep reading additional less significant bits until the symbol is determined. Although our 11-bit code length limit is quite short, that’s still a large number of bits to work through in an example, so let’s instead consider at a toy scenario with a code length limit of 4 bits and four symbols a, b, c, d with code lengths 1, 3, 2, and 3 bits, respectively.

The code length limit of 4 bits means we have a total space of 16 possible 4-bit patterns to cover; I’ll refer to each such bit pattern as a “slot”. Our 1-bit symbol “a” is the first and shortest, which means it gets assigned first, and as such gets a 1-bit code of “0”, with the next 3 bits irrelevant. In short, all slots with a leading 0 bit correspond to the symbol ‘a’; there’s 2^{4-1} = 8 of them. The next-shortest symbol is ‘c’, which get the next available 2-bit code. All codes starting with 0 are taken; the next available slot starts with a 2-bit pattern of “10”, so that’s what we use. There are 2^{4-2} = 4 slots corresponding to a bit pattern starting “10”, so that’s what we assign to c. The two remaining symbols ‘b’ and ‘d’ get 3-bit codes each. Anything starting with “0” or “10” is taken, so they need to start with “11”, which means our choice is forced: ‘b’ being earlier in the alphabet (since we break ties in code length by which symbol is lower-valued) gets the codes starting “110”, ‘d’ gets “111”, and both get 2^{4-3}=2 table slots each. If we write up the results, we get:

Code | Sym | Len |
---|---|---|

0000 |
a | 1 |

0001 |
a | 1 |

0010 |
a | 1 |

0011 |
a | 1 |

0100 |
a | 1 |

0101 |
a | 1 |

0110 |
a | 1 |

0111 |
a | 1 |

1000 |
c | 2 |

1001 |
c | 2 |

1010 |
c | 2 |

1011 |
c | 2 |

1100 |
b | 3 |

1101 |
b | 3 |

1110 |
d | 3 |

1111 |
d | 3 |

Note that we simply assigned all of the code space in linear order to symbols sorted by their code length and breaking ties by symbol value, where each symbol gets 2^{N–k} slots, where *N* is the overall code length limit and *k* is the code length for the symbol in question. At any step, the next code we assign simply maps to the first few bits of the next available table entry. It’s trivial to discover malformed codes in this form: we need to assign exactly 2^{N} slots total. If any symbol has a code length larger than *N* or we end up assigning more or less than that number of slots total, the code is, respectively, over- or under-complete; the current running total of slots assigned is a scaled version of the number in the Kraft inequality.

Moreover, it should also be obvious how the canonical table initialization for a MSB-first code works out in practice: the table is simply written top to bottom in linear order, with each entry repeated 2^{N–k} times; in essence, a slightly more complicated `memset`

. Eventually, each slot is just duplicated a small number of times, like 4, 2 or just 1, so as code lengths get closer to the limit it’s better to use specialized loops and not treat it as a big memset anymore, but the basic idea remains the same.

This approach requires a list of symbols sorted by code length; for Oodle, the code that decodes Huffman table descriptions already produces the symbol list in this form. Other than that, it is straightforward.

The only wrinkle is that, for reasons to do with the details of the decoder implementation (which will be covered in future posts), the fast Huffman decoders in Oodle do not actually work MSB-first; they work LSB-first. In effect, we produce the table as above, but then reverse all the bit strings. While it is possible to directly build a canonical Huffman decode table in a LSB-first fashion, all the nice (and SIMD-friendly) sequential memory writes we had before now turn into scatters, which is a lot messier. It turns out to be easier and faster to first build a MSB-first table as described above and then apply a separate bit-reverse permutation to the elements to obtain the corresponding LSB-first decoding table. This is extra work, but bit-reverse permutations have a very regular structure and also admit a fast SIMD implementation that ends up being essentially a variant of a matrix transpose operation. In our experiments, the SIMD-friendly (and memset-able) MSB-first table initialization followed by a MSB→LSB bit-reverse was much faster than a direct LSB-first canonical table initialization, and LSB-first decoding saved enough work in the core decode loops to be worth the extra step during table building for the vast majority of our Huffman tables, so that’s what we settled on.

And that’s it for the high-level overview of Huffman decoding in Oodle Data. For the next few upcoming posts, I plan to get into detail about a bunch of the decoder loops we use and the CPUs they are meant for, which should keep us busy for a while!

### Footnotes

[1] An easy way to construct a pathological frequency distribution is to pick the symbol frequencies as Fibonacci numbers. They grow exponentially with *n*, but not so quickly as to make the symbol frequencies required to exceed common practical code length limits impossible to occur for typical block sizes. The existence of such cases for at least certain input blocks means that implementations need to handle them.

[2] There are strictly more Huffman trees than there are canonical Huffman trees. For example, if symbol ‘a’ has a 1-bit code and symbols ‘b’ through ‘e’ all get 3-bit codes, then our corresponding canonical tree for that length distribution might be `(a ((b c) (d e)))`

, but it is equally possible (with the right symbol frequencies) to get say the tree `(a ((b d) (c e)))`

or one of many other possible trees that all result in the same code lengths. The encoded size depends only on the chosen code lengths, and spending extra bits on distinguishing between these equivalent trees is wasteful unless the tree shape carries other useful information, which it can in some applications, but not in usual Huffman coding. Along the same lines, for any given Huffman tree there are in general many possible sets of symbol frequencies that produce it, so sending an encoding of the original symbol frequencies rather than the tree shape they result in is even more wasteful.

[3] When assigning codewords this way and using a MSB-first bit packing convention, the blocks of codes assigned to a given code length form intervals. When codes are limited to *N* bits, that means the length of a variable-length code word in the bitstream can be determined by figuring out which interval a code falls into, which takes at most *N*-1 comparisons against constants of width ≤*N*-1 bits. This allows for different, and often more efficient, implementation options for the latency-critical length determination step.

[4] This is inherent: the entire point of entropy coding is to make bits in the output bitstream approach an actual information content around 1 bit. In the process of Huffman coding, we are purposefully making the output bitstream “more random”, so we can’t rely on locality of reference to give us better cache hit rates here.

[5] Decoding multiple symbols at once may sound like an easy win, but it does add extra work and complications to decoding symbols that are long enough not to participate in pair-at-a-time decoding, and once again the table build time needs to be carefully watched; generally speaking multi-symbol decoding tends to help on extremely redundant input but slows down decoding of data that achieves maybe 7 bits per byte from Huffman coding, and the latter tends to be more common than the former. The actual evaluation ends up far from straightforward and ends up highly dependent on other implementation details. Multi-symbol decoding tends to look good when plugged into otherwise inefficient decoders, but as these inefficiencies are eliminated the trade-offs get more complicated.

[6] No doubt we were neither the first nor last to arrive at that particular design point; with an 8-bit alphabet, the set of viable options for at least two decodes per refill if 32b targets matter essentially boils down to 9b to 12b, with 11b being the first where 5 decodes/refill on 64b targets are possible.

[7] This gap is one of the things that makes multi-symbol decoding look great when plugged into a simple toy decoder. Decoding multiple symbols at once requires extra bookkeeping and instructions, and also introduces extra dependencies between the stores since the number of symbols decoded in each step is no longer known at compile time and folded into an addressing mode, but when we’re only using about a third of the capacity of the machine, inserting extra instructions into the symbol decode is very cheap.

[8] Oodle has the same bitstream format for everyone; while we can (and do) make choices to avoid pathologies on all of the targets we care about, any choice we make that ends up in the bitstream needs to work reasonably well for all targets.

[9] `next_state`

here essentially just tabulates a shift and mask operation; in a “real” tANS decoder this can be more complicated, but when we’re looking at a case corresponding to a Huffman table and symbol distribution, that’s all it does.

April 26, 2016 was the release date of Oodle 2.1.5 which introduced Kraken, so it celebrated its 5-year anniversary recently. A few months later we launched Selkie and Mermaid, which wetre already deep in development at the time of the Kraken release but not quite finished yet, and in early 2018 we added Leviathan, adding a higher-compression option to our current suite of codecs, the “sea beastiary”.

In those 5 years, these codecs have seen wide adoption in their intended market (which is video games).

There’s a few interesting things to talk about, and the 5-year anniversary seems as good a reason as any to get started; this is the first in what will be a series of yet to be determined length, in which I’ll do a deep-dive on one interesting aspect of these codecs, namely the way they handle entropy coding.

Before I can get into any details, first some general notes on how things fit together. Kraken, Mermaid, Selkie, and Leviathan are lossless data compression algorithms, all variations on the basic LZ77 + entropy coding formula that has been the de facto standard in general-purpose compressors since the late 80s, because such codecs can achieve a good balance of compression ratio and compression/decompression speed for practical applications. Other codecs belonging to this family include Deflate (ZIP/gzip), LZX (Amiga LZX/CAB), LZMA (7zip/xz), Zstd, Brotli, LZHAM, and many others, including most of the older Oodle Data codecs (Oodle LZH/LZHLW, LZA, LZNA, and BitKnit).

The “LZ77” portion here refers to the LZ77 algorithm, which, broadly speaking, compresses data by replacing repeated byte sequences in a stream with references to prior occurrences; as long as the back-reference is smaller than the bytes themselves, this will result in compression. Nobody actually uses the original LZ77 algorithm per se (which has a very inefficient encoding by today’s standards), but the general idea remains the same. Some codecs, especially ones designed for faster decoding or encoding, use just this byte/string matching approach without any entropy decoding; this includes many well-known faster codecs such as LZ4, Snappy, LZO, and again many others, including the remaining older Oodle Data

codecs (Oodle LZB/LZBLW and LZNib).

I won’t talk about the LZ77 portion here (that’s its whole own thing), and instead focus on the “entropy coding” portion of things. In essence, what all these codecs do is identify repeated strings, resulting in some mixture of back-references (“matches”) and bytes that couldn’t be profitably matched to anything earlier in the data stream (“literals”). All the codecs mentioned differ in how exactly these things are encoded, but all of them eventually boil it down to one

or more data streams that each have symbols from a finite-size and not too large alphabet (typical sizes include a few dozen to a few hundred distinct symbols in the alphabet), plus maybe a few extra streams that contain raw bits for data that is essentially random.

Entropy coding, then, processes these streams of symbols from a finite alphabet and tries to turn them into an equivalent form that is smaller. There are numerous ways to do this; one well-known method, and indeed a mainstay in data compression to this day (with some caveats), is Huffman coding, which counts how often individual symbols occur and assigns short bit codes to more likely symbols and longer symbols to less likely symbols, in a way that remains unambiguous and uniquely decodable.^{1}

The second major class of techniques is arithmetic coding, which compresses slightly better than Huffman for typical data and can give significant improvements on either small alphabets or highly skewed distributions, and is also more suitable for applications when there is not an a priori known fixed symbol distribution, but rather an on-line model that is being updated as new data is seen (“adaptive coding”).

Finally, the new kid on the block is the ANS (“Asymmetric Numeral System”) family of methods, which is conceptually related to arithmetic coding but works differently, and has different trade-offs. Kraken and Mermaid use ANS sometimes and Leviathan frequently; I’m sure I’ll cover its usage in those codecs at some point, and for now there are plenty of my older writings on the topic from around 2014-2016.

Anyway, it is common for LZ77-derived algorithms with a Huffman or arithmetic coding backend to use odd alphabet sizes; when you’re assigning variable-length codes to everything anyway, it doesn’t really matter whether your alphabet has just 256 distinct byte values in it, or if the limit is 290 or 400 or something else non-power-of-2. It’s likewise common to interleave all kinds of data into a single contiguous bitstream. The sea beastiary family does things a bit differently than most codecs.

Namely, the input (uncompressed) data is processed in 128KiB chunks, which all have self-contained bytestreams with an explicitly signaled size.^{2} These chunk bytestreams are themselves made up out of multiple independent bitstreams for different types of data, and most of these streams (except for a few “misc”/leftover data streams) use a single shared entropy coding layer to handle what we now call “primary streams”.^{3}

This layer works on streams that always use a byte alphabet, i.e. unsigned values between 0 and 255. That means that everything that goes through this layer needs to be presented in a byte-packed format. This kind of restriction is common in fast byte-aligned LZ77 coders without a final entropy coding stage (like LZ4 or Snappy), but atypical outside of that space.

Working in bytes needs some care to be taken in other parts of the codec design, but comes with a massive advantage, because it allows us to have an uncompressed mode that just stores streams as uncompressed bytestreams and doesn’t need any decoding at all. That means that streams that are nearly random, or just short, don’t have to waste CPU time on setting up a more complicated decoder and then decoding bytes individually through a more complicated algorithm when just grabbing uncompressed bytes straight from the bytestream is good enough.

In short, for most codecs, the choice of entropy coding scheme is a design-time decision. Algorithms like Deflate which uses LZ+Huffman *always* perform Huffman coding, even when doing so isn’t actually worthwhile, because they use extended alphabets with more than 256 symbols in them that mean pass-through coding would still need to use larger-than-8-bit units, both introducing extra waste and making it more expensive in terms of CPU time. Sticking with a byte alphabet means that the Oodle sea beastiary codecs can either use a conventional entropy coding stage or choose to send data uncompressed and act like a faster byte-aligned LZ codec. In fact, that is exactly how Selkie works: under the hood, Selkie is the same codec as Mermaid, but with all streams in uncompressed mode at all times (and some other features disabled).

The final big departure from most codecs is that the sea beastiary codecs run all entropy decoding as separate passes that consume an input bytestream with a common header and output an array of bytes. Uncompressed streams can skip the decoding step and consume data straight from the input bytestream. These passes just handle whatever entropy coding scheme is used and nothing else. This means that instead of the choice of entropy coders being fundamentally built into the algorithm, it’s a collection of small loops that can easily be interchanged or added to, and indeed that’s what we’ve been doing; the original version of Kraken shipped in Oodle 2.1.5 with a choice between just uncompressed streams and one particular version of Huffman coding. We’ve added several more options since then, including an alternative Huffman coding variant, an optional RLE pre-pass, TANS, and the ability to mix-and-match all of these within a single stream, all without it being a big redesign, and we’re quite happy with the results.

There’s many interesting things in here that I plan to cover in small installments, starting with the way we do Huffman (de)coding, which is worth several posts on its own. I’ll try to keep updating this at a reasonable cadence, but no promises since my blogging time is pretty limited these days!

### Footnotes

[1] Huffman coding technically refers to such variable-length

coding (meaning not all symbols need to be coded with the same number of bits) only when the code in question is actually generated via Huffman’s algorithm (which minimizes the length of the coded stream given a certain number of assumptions), but in practice most implementations don’t actually use Huffman’s original algorithm which can assign awkwardly long code lengths to rare symbols. Instead the maximum code length is usually capped, which is a different problem and yields codes that are not actually Huffman codes (and are very slightly less efficient), but just general variable-length codes. Having said that, approximately everyone calls it Huffman coding even when the codes are length-limited anyway, and I won’t sweat it either in the following.

[2] The chunks being independent is how we implement what is called “thread-phased” decoding. As is explained further down in the article, Oodle decodes the bitstreams to temporary memory buffers in a first phase, and the chunks can be processed completely independently for this part of the process, making it a prime candidate to move off to a separate thread (or threads). After phase 1 completes, the actual LZ decoding takes place in “phase 2”; this portion can make references to any bytes earlier in the stream, so it must be run in order and can’t as easily be threaded. Customers that want wider parallelism (or easy seeking) usually chop data into fixed-size blocks, typically between 64KiB and 512KiB, that are encoded independently with no back-references permitted across block boundaries. This decreases compression ratio but makes it trivial to have many workers decoding different blocks at the same time.

[3] In the original code, these streams of bytes are, very non-distinctly, just called “arrays”. When specifying the bitstream more formally, a lot of things suddenly needed proper names to avoid confusion, and “arrays” was just too indistinct and overloaded, so “primary streams” was what we settled on. Sometimes additional data streams exist within a chunk for information not contained in the primary streams, and these are therefore called “secondary streams”.