Skip to content

Entropy decoding in Oodle Data: x86-64 6-stream Huffman decoders

It’s been a while! Last time, I went over how the 3-stream Huffman decoders in Oodle Data work. The 3-stream layout is what we originally went with. It gives near-ideal performance on the last game console generation’s AMD Jaguar cores and is also a good fit for 32-bit x86 and ARM targets, both of which were still a pretty important target for us at the time the scheme was designed (today, pretty much anything new uses 64-bit code, so that’s what we’re designing for; sub-optimal performance on 32b targets is not a reason for us to keep features out of the bitstream anymore).

As per the analysis in the last part, the 3-stream decoders are getting decent issue slot utilization out of a Skylake-era x86 core, but are still fundamentally limited by the critical path latency for individual symbol decodes and the bit buffer refill. We’re also using a lot of extra instructions compared to the minimum necessary to shave a few cycles off that critical path latency. Instead, what we can do is optimize for instruction/issue slot count and throughput at the expense of a slight increase in latency, and then use a lot more streams. Ideally, if we have enough independent work to keep the machine filled, the critical path latency stops being the primary limiter altogether.

That said, we still have those 32-bit and Jaguar targets who prefer the 3-stream layout and could maybe use 4 streams without introducing a lot of extra spills, but certainly not a lot more. The compromise we use in Oodle Data is that we introduced a 6-stream layout that uses 3 streams for the front half of the data being coded, and an independent 3-stream group for the back half. In effect, we’re doing two 3-stream decodes (with two separate output buffers) at the same time. The platforms that would rather have 3 streams decode the two in sequence instead.1

Many-stream logistics

The number 6 was chosen because it cleanly divides into two sets of 3 (for the platforms that prefer 3), but it also happened to work out in other ways: it turns out both 4 and 8 streams are a bit awkward in terms of register use on x86-64 and 32-bit ARM, and would introduce several extra spills. Probably wouldn’t be a huge deal, but the odd counts are there for a reason.

Anyway, register use is the watchword in general here: the scheme described in the 3-stream article uses 3 registers worth of state per stream: the actual bit buffer, the “number of bits present” count, and the stream read pointer. On x86-64, we have 16 integer GPRs, sort of. The stack pointer is really spoken for, though, which leaves us with 15. 6 streams at 3 registers of state per stream would need 18 registers’ worth just for the bit stream state; the stream read pointer is only used during refill so spilling those during the core decoding loop isn’t the end of the world, but even so, that’s 12 registers allocated, and we still need an output pointer (actually two in our case), a pointer to the decoding table, and at least one or two registers worth of temporary space… and that’s more registers than we have available already. Spilling and reloading around the bit buffer refill isn’t ideal, but workable. When just cycling between our 6 streams means some amount of register thrashing, though, we’re paying an extra tax not just per every 5 symbols decoded, but for every single symbol. The idea with using more streams was to be able to use fewer instructions per decode to use the machine more efficiently. If we’re saving one instructions’ worth of work per symbol and then adding two instructions worth of spill and reload, we’re not doing great. This can still be worth it if we get better instruction-level parallelism (ILP) potential out of it, but we’d rather not.

Therefore, our goal is to be efficient both in terms of instructions executed and in terms of register state we need to keep around. This old post of mine conveniently presents a solution, in the form of bit reader “variant 5”, the throughput-optimized one. The actual scheme is explained in that post, so I won’t reiterate it here. Just know that the basic idea is to get rid of the “bits consumed” counter per stream by setting the MSB of the 64-bit bit buffer upon refill (then shifting the whole word down if parts of the first byte were already consumed earlier), and determining the consumed bit count using a leading zero count. Conveniently, shifting out just-consumed low bits using an unsigned bit-shift right also updates the consumed bit count for us. Therefore, we save both the extra register for the counter and the extra instructions for maintaining it. The trade-off is that this version of bit-buffer IO doesn’t have the “lookahead” property where the refill load can start before the final consumed bit count of the previous iteration is done. Therefore, we pick up one more load latency along the critical path.

Enough build-up, how does a single-symbol decode look like in this form? Assuming BMI2 is available, it’s this:2

  andn  rcx, rMask, rBits             ; peek (rMask = ~2047)
  movzx ecx, word [rTablePtr + rcx*2] ; table fetch
  shrx  rBits, rBits, rcx             ; bit buffer update
  mov   [rDest + <index>], ch         ; emit byte

Four instructions and 5 uops (4 fused-domain; the store is 2 ops, one store address and one store data which counts as a single uop in the fused domain, the others are 1 uop each) per symbol decoded, which is pretty sweet. Note we splurge on an extra register to keep the constant ~2047 just so we can do the masking with ANDN, which has a non-destructive 3-register form, instead of MOV/AND. The extra MOV doesn’t matter that much when MOV elimination is active, but due to CPU bugs this was actually later disabled in several CPU models, so this is more consistent. Also note both our destination pointers need to be kept within the low 8 registers so the move-from-CH instruction encoding is available, or else it takes an extra shift instruction and uop. (That’s weird quirks of the x64 encoding for you.) The scheme from the previous article takes 5 instructions and 6 uops per symbol decoded (if we use the same ANDN trick, one more of each otherwise), so this change saves us one instruction and one register’s worth of state per stream, with the latter saving further instructions that would otherwise go into spills and reloads if we used 6 streams naively.

The critical path for this is simple: the first three instructions are on it (for any given stream), the last one is not. The AND and shift have 1 cycle of latency each, the indexed load has 5 cycles (on Skylake), for 7 cycles total. Continuing with last article’s assumption of around 4 fused-domain uop issue slots per cycle (the reality is a bit more complicated, but this is good enough), over these 7 cycles, we get 4×7 = 28 issue slots, of which we’re filling 24 when we’re doing 6 streams. When we’re doing this 5 times in a row, that means 5×4 = 20 unused issue slots by the core decode; we do have other things in the loop such as constant loads, pointer increments, the loop condition, the stream overlap checks, and maybe some spills and reloads around the refill stage, so these remaining slots don’t necessarily have to go to waste either. We don’t get to fill every issue slot every cycle anyway because the scheduling and load balancing in the CPU is not perfect, but this is still a promising setup.

Refill

How do we do the bit buffer refill with this? Well, this again follows the basic logic from “Reading bits in far too many ways (part 3)”. If we implement that naively, we get something like this (per stream):

  lzcnt    rBits, rBits           ; # bits consumed
  mov      rcx, rBits
  shr      rcx, 3                 ; # of full bytes consumed
  add      rBitsPtr, rcx          ; advance ptr
  and      rBits, 7               ; leftover bit count
  mov      rcx, [rBitsPtr]        ; load next
  or       rcx, rMarker           ; set marker bit (MSB)
  shrx     rBits, rcx, rBits      ; consume leftover bits

This is already not bad, but note that most of this is on the critical path for the refill as written. The most annoying part of that is the lzcnt which has a 3-cycle latency on most Intel parts. However, note that the rBits value we do the lzcnt on is always the shrx result of the final decode from that stream, and nothing else looks at the contents of the bit buffer after said shift. Therefore, we can juggle things around to do the leading zero count early. The final decode for each of the 6 streams does:

  andn  rcx, rMask, rBits             ; peek (rMask = ~2047)
  lzcnt rBits, rBits                  ; early lzcnt
  movzx ecx, word [rTablePtr + rcx*2] ; table fetch
  add   rBits, rcx                    ; lzcnt update
  mov   [rDest + <index>], ch         ; emit byte

Instead of a final shrx, we use an add instruction instead (making the bits past the initial 8 bits garbage, but we can deal with that), and we move the lzcnt up so it happens concurrent with the table load (which has 5 cycles latency and is on the critical path). That takes the leading zero count off the critical path entirely and doesn’t need us to execute any extra instructions. We do need to clear the garbage bits that end up at bit 8 and above in rBits afterwards. However, if we look at the refill code given above, that’s a trivial fix (funny how that works out):

  ; # bits consumed already in low 8 bits of rBits
  movzx    rcx, low_byte(rBits)
  shr      rcx, 3                 ; # of full bytes consumed
  add      rBitsPtr, rcx          ; advance ptr
  and      rBits, 7               ; leftover bit count
  mov      rcx, [rBitsPtr]        ; load next
  or       rcx, rMarker           ; set marker bit (MSB)
  shrx     rBits, rcx, rBits      ; consume leftover bits

The code still has some other details to take care of. Because we treat 6-stream blocks as a pair of 3-stream blocks, we still have one backwards, reverse-endian stream in there, we need to maintain 2 separate output pointers, we still need to do the stream-crossing checks and have our loop condition, and so forth. All of that is routine and I won’t go into it here, other than noting that the reverse-endian streams require a 64-bit bswap for their refill, which adds 2 uops and 2 cycle latency to their respective critical paths (on Intel), so those streams have the longest critical-path latency.

Analysis and results

I already went into the core symbol decode step before, noting that it takes 4 instructions, 4 fused-domain uops, and 5 unfused-domain uops per symbol. For the rest of our critical path candidate, that leaves the refill.

With the tweaked final decode described above, the leading zero count is off the critical path, but we still count it as 1 uop that effectively belongs with the refill (it also changes one shrx to an add, but that doesn’t make a difference at the level of analysis I’m doing in this series, although I will note that this is also slightly beneficial since there are more execution ports that can take integer adds than there are ports that execute shifts).

For the refill, that leaves us with 8 instructions executed (including the lzcnt), all single-uop. The critical path for the refill (after the final rBits update for a stream) goes movzx, shr rcx, add rBitsPtr, mov [rBitsPtr], or rBits, shrx rBits. All of these except for the load are 1 cycle latency, the load is 4 cycles on Skylake and derivatives, 5 on newer uArchs. That’s 9 or 10 cycles critical path latency through the refill, respectively, 11/12 cycles on the reversed streams that need the bswap.

We’re doing this for 6 streams; 8 uops × 6 streams = 48 uops and issue slots – plus another 4 uops for the two bswaps implied by the two backwards streams, implying a grand total of 52 uops. By our previous “about 4 unfused domain uop issue slots per cycle” rule of thumb, that’s around 52/4 = 13 cycles worth of work – we’re actually throughput limited on this portion, for once. And since we’re running this in a loop, we can expect the throughput-limited refill and “other loop overhead” portions to collapse with the latency-limited symbol decode step to some extent.

In the “production” version of the loop, it turns out there’s 15-20 fused-domain uops worth of other work (that’s not symbol decoding or refill) in this loop. I’m giving a range because 5 of these are compare-branch instruction pairs that can be macro-fused but whether that actually happens in any given instance is again a bit complicated and depends on other factors such as code alignment that are outside the scope of the level of analysis I’m doing here.

If we take the higher count of 20, this happens to line up precisely with the 20 “wasted” issue slots we noticed during the symbol decode portion. I did not notice this before writing this post and highly doubt that things shake up anywhere close to this in practice, because machine behavior gets fairly “twitchy” in the high-ILP regime and any stray disturbance such as non-ideal scheduler instruction picks or individual L1 cache misses cause long ripple effects. Nevertheless, it does suggest that on Skylake and its many process tweaks, we expect to be, just about, throughput-bound.

This time we’re decoding from 6 streams, 5 symbols each, for 30 symbols decoded per loop iteration. We expect to take 4 issue slots per symbol decode (120 slots total), 8 issue slots per refill (48 slots total), and 20 issue slots worth of bookkeeping etc. per loop iteration, for a total of 188 issue slots. With our 4 issue slots per cycle rule of thumb, if everything works out perfectly, that comes out to 47 cycles per iteration exactly.

For critical path latency, we have 6 independent streams, and the slowest ones are the two byte-reversed ones. Their critical path goes through the symbol decode and refill. We have 5 decodes of 7 cycles each plus a refill of 11 cycles, giving us a latency estimate of about 46 cycles if everything works out right.

In short, under very idealized circumstances, we expect to be throughput-bound (or close to), and hitting around 47/30 = 1.57 cycles/symbol decoded.

How fast does the “production” version of this loop actually run on a Skylake machine, in a real workload, on real data? This is where I would ordinarily post results of a test run, but as I’m writing this, the Skylake test machine I’ve been using for this series isn’t letting me remote in for some reason, so I don’t have same-machine results to post just yet. The values I remember are around 1.83 cycles/symbol for 55 cycles per loop iteration, off from our ideal throughput estimate by 8 cycles or about 17%. (I’ll try to remember to update that with recent measured figures once I get that machine back up and running). That’s a higher discrepancy than I’m comfortable with, but we’re running into the limits of what we can do with our very basic back-of-the-envelope estimates here. A more detailed analysis using uICA gives a lower-bound estimate of 49.75 cycles for Skylake, with the listed bottleneck being “Decoder”. That is about 10% smaller than the actual observed figures on real workloads that I remember, which as noted in the previous installment of this series is a pretty typical ratio for real-world results (involving cache misses, branch mispredictions, other process/OS interference etc.) vs. throughput estimates. Either way, at around 1.83 cycles/symbol observed for 6-stream compared to 2.79 cycles/symbol observed for 3-stream, we get a 1.53x speed-up from doubling the number of streams.

I do have actual measured results on my home desktop machine, which as of this writing is an AMD Ryzen 9 7950X3D (Zen 4) at 4.2GHz nominal. On that machine, when doing single-threaded Huffman decoding using the 3-stream BMI2 decoder, I get around 2.08 RDTSC ticks (which are not actually clock cycles due to dynamic frequency scaling) per symbol, while the 6-stream BMI2 variant gets around 1.37 RDTSC ticks per symbol, for a speed-up factor of around 1.52x from using 6 streams – almost exactly the same ratio, on a different microarchitecture. (There are also dedicated Zen-optimized versions of the inner loop, which exploit some specifics of AMDs Zen microarchitectures, and average around 1.79 RDTSC ticks/symbol for 3-stream and 1.29 ticks/symbol for 6-stream on the same machine; in this case, the speed-up works out to “only” 1.39x)

For what it’s worth, around 1.5x is also about the level of speed-up I see on wider 64-bit ARM machines, although the really wide ones (e.g. Apple M1/M2) can see even larger speed-ups from 6-stream.

On said Zen 4 machine, the pure 6-stream Huffman decoding BMI2 inner loop hits a rate of around 3.07 GB/s decoded per second (in our case, 1 symbol=1 byte), on a single core, with the system otherwise mostly idle, i.e., running the OS and desktop but nothing particularly taxing. When using the Zen-optimized variant, it hits 3.25GB/s.

The entire Huffman decode on the same data, same machine averages 2.52GB/s on a single core (using the straight BMI2 variant, 2.65GB/s for the Zen-optimized variant); this includes other tasks like reading the Huffman code lengths from the bit stream, setting up decode tables, as well as the careful end-of-stream processing.

Discussion and conclusion

The original 6-stream design started as a thought experiment about how many streams a purely throughput-bound Huffman decoder would have to be using on a theoretical x86 with as many registers as we wanted, but the same basic instruction latencies as the Haswell/Broadwell/Skylake machines that were current at the time (early 2016). Once it became clear that the 4-instruction/symbol decode step was not hard to do and the resulting stream count worked out to “around 6”, exactly twice what we were shipping previously and without bumping into register count limits, the (admittedly slightly off-kilter) pair-of-3-streams design coalesced as something that was easy to fit into the existing bitstream structure and could just be ignored on targets that didn’t benefit from it, while giving us a considerable speed boost on the rest.

In aggregate, this series of posts is meant both as an explanation of the methods used and more generally, as an illustrative worked example about what designing a data format for higher ILP potential looks like. A “standard” single-stream Huffman decoder is fundamentally bottlenecked by its single serial dependency chain, but the “right” number of dependency chains for a given task is not necessarily straightforward, nor is it always easy to figure out by experimentation. For example, the throughput-optimized decode step used in this version of the Huffman decoder is strictly worse than the latency-optimized decoder for typical 3- or 4-stream setups, so unless you’re aware of the underlying dynamics, you wouldn’t even try it for anything larger. Likewise, the latency-optimized approaches that work best for smaller stream counts run out of steam as more streams are added (mostly from extra spills), suggesting that there is no further win to be had from increased stream counts when in fact there is.

The particular numbers here are always subject to change; the original 3-stream count in the Oodle Kraken format was motivated primarily by the AMD Jaguar cores that were our most important target at the time, and the 6-stream version, although it happens to work out fine, was just a convenient multiple. As CPU designs get wider, the optimal number of streams keeps increasing, provided other likely bottlenecks keep getting fixed alongside.3 That said, a not-quite-ideal 6-stream arrangement on a machine that would do even better with 9 is still much preferable to your usual single serial bitstream decode.

If nothing else, takes this series as a practical demonstration of just how much wasted potential there is on current hardware (“current” here really meaning “anything built in the 21st century”) in serialization formats with a single stream of variable-sized data where each token needs to be decoded to be able to start decoding the next. Of course decoding speed is far from the only concern in practice, but it should at least be considered, when too often it is ignored entirely. There are definitely problems with multi-stream formats (especially when consuming data incrementally or in a streaming fashion), so they are not a panacea, but in many use cases it is completely fine to decode data in small batches (on the order of tens of kilobytes) with some buffering, and in those cases a multi-stream design can offer substantial benefits, especially in write-once read-many scenarios.

Finally, one lengthy digression on a complementary strategy: decoding multiple Huffman symbols at once. The basic idea here is straightforward: if we’re doing a table lookup per symbol anyway, and if the whole point of variable-length codes is to assign shorter code to more common symbols, can’t we decode multiple symbols from a single table lookup, provided those symbols are all fully contained in the portion of the bitstream we peeked at? And yes, of course we can, but there are considerable logistical difficulties if we look again at the final per-symbol decode step we ended up with:

  andn  rcx, rMask, rBits             ; peek (rMask = ~2047)
  movzx ecx, word [rTablePtr + rcx*2] ; table fetch
  shrx  rBits, rBits, rcx             ; bit buffer update
  mov   [rDest + <index>], ch         ; emit byte

This design uses 2 bytes per table entry so it doesn’t have space to decode multiple symbols at once. The easiest generalization is to allow decoding either 1 or 2 symbols per step. That means per table entry, we now need the combined length of all symbols decoded, the values for the first and (potentially) second symbol decoded, and a symbol count (or other flag) for whether the table entry describes 1 or 2 symbols. Take a basic layout of a 32-bit table entry containing 4 bytes (len, nsyms, sym1, sym2) for a concrete example (that’s not the only possible design point, but it is a sensible one). Note that our table entries are now twice the size (and much trickier to initialize), and furthermore, for each table lookup, we don’t know anymore whether we’re going to decode 1 or 2 symbols, so we can’t use immediate offsets for our stores anymore; in every decode step, we need to figure out what the symbol count is and advance the output pointer. We also either need multiple loads or do a single combined 32-bit load and shift to get access to the various fields; pick your poison. My best bet at a workable 1-or-2 symbol decode step looks something like this:

  andn  rcx, rMask, rBits              ; peek (rMask = ~2047)
  mov   ecx, dword [rTablePtr + rcx*4] ; table fetch
  shrx  rBits, rBits, rcx              ; bit buffer update
  movzx rNumSyms, ch                   ; extract nsyms
  shr   ecx, 16                        ; align output byte(s)
  mov   [rDest], cx                    ; emit output byte(s)
  add   rDest, rNumSyms                ; advance write pointer

Note that despite still using oddball tricks like the movzx-from-ch to get (effectively) a bitfield extract, which comes with strings attached (rNumSyms can’t be just any register for this to work), we’re now using 7 instructions and 7 fused-domain uops (8 unfused-domain) to decode either 1 or 2 symbols. Furthermore, we’re adding a secondary dependency chain on the outputs between adjacent logical streams: the add rDest, rNumSyms needs to complete before the next stream can finish its output. That’s not the end of the world since only the stores depend on it, but it’s an extra potential for slowdown.

This is just not all that appealing. As noted above, we already get to be mostly throughput-bound (actually frontend, but at least we’re not constantly waiting on dependency chains anymore) on the original Skylake target, using 4 instructions per symbol decoded. This modified version will in the best case only ever write 2-symbol pairs and use 7 instructions to do so, for 3.5 instructions per symbol decoded. But that’s with a 100% success rate. Even if we manage to decode a pair of symbols a staggering 75% of the time, we’re still taking 7 instructions to decode a single symbol in the remaining cases, which works out to an average of 0.25 × 7 + 0.75 × 3.5 = 4.375 instructions per symbol – still worse than a single symbol decode, although to be fair, the double-symbol decodes end up doing a refill less often when they hit. Considering that, it’s likely that around 75% pair decode rate, we’d probably break even or see a slight speed-up using multi-symbol decoding.

That’s still not particularly promising though; from multi-symbol decoding, we’d like to see a lot more than “break even around the point where we decode 2 symbols at once 3/4 of the time”. Moreover, from the entropy of the symbol stream, we can work out how big the decode table needs to be to be able to successfully decode symbol pairs on most decode steps. Oodle Data uses a (very low) code-length limit of 11 bits to facilitate infrequent refills. In our existing table size of 2048 entries, we can expect to see frequent pairs of symbols if the entropy is around 5.5 bits per symbol or so, which is very low. Our typical data for things like literals after LZ compression is more frequently around 7.2-7.8 bits per symbol (each of which is from an 8-bit alphabet). We would need something on the order of 215 = 32768 table entries for pair decodes to be common; at 4 bytes per table entry, that’s 128KiB, which is definitely outside the L1D cache, and in fact takes up a sizeable fraction of the L2 cache as well. So now we’re expecting to miss the L1D cache on those table accesses as well, meaning our expected latency per symbol decode for the memory access goes from around 5 cycles to around 16-20, which completely blows up our assumptions. Now we have nowhere near enough streams to hide that latency again, and we also need to spend the time to set up these giant tables; as you can see from the numbers quoted in the preceding section, the time spent setting up the decoding tables even for our small (2048-entry, 2 bytes/entry) configuration is not negligible, and multi-symbol decode tables need more space per element, more elements, and are more expensive to set up.

Add all these together, and the case for table-driven multi-symbol decoding in the general case just doesn’t look great: it increases the amount of work per decode step, which is disproportionately felt when the decode steps are otherwise lightweight or there are multiple streams available to compensate for critical-path latency. It increases table build time, which can be a significant fraction of overall decode time if tables change frequently, table building isn’t optimized (it often isn’t) or tables are large, and it often ends up requiring impractically large tables to be able to decode pairs to begin with.

That doesn’t mean that it doesn’t have its uses, especially in cases where you know the expected entropy is low. For example, the JPEG decoder in stb_image normally decodes AC coefficients in two steps, first reading a (run,level) pair (this is encoded into a single 8-bit value as per the JPEG spec) and then reading an extra number of bits determined by the “level” value to figure out the coefficient value. Most AC coefficients are very small (mostly ±1 or ±2), so stb_image’s JPEG decoder additionally has the “fast AC” table that combines decoding (run,level,extra_bits) all into one table lookup (into a 1KiB table) in the typical case when the AC coeffs are very small.

However, for general-purpose lossless (de)coding in software, multi-symbol decoding just isn’t very compelling in my experience.

And that’s it for this post and the Huffman portion of this series, unless there’s interest in some of the other targets. (There’s also a suite of optimized ARM decoders, both 32- and 64-bit, and while they have some cool tricks I’m happy with, they don’t add anything substantially new to the material already covered so far.) I’m leaving the door open to do another follow-up on the non-Huffman portions of Oodle Data’s entropy coding—especially TANS—but I don’t have any concrete plans yet. We’ll see!

Footnotes

[1] This does give us a full 6 independent streams when batch-decoding the entire contents of a Huffman block up front, but is still effectively 3 streams when decoding incrementally. Not a problem in the regular SW Oodle decoders, but there are other “streaming” implementations that don’t get a benefit out of the 6-stream layout. The platforms we support that greatly prefer 3-stream batch decoding were much more important to our customers than the streaming use case at the time this was designed, so this was a sensible trade-off.

[2] You might notice that this code has a tendency to keep all shift counts in explicitly rcx, that exact register, not just a named temporary like I do for the other register references. That’s because even though I’m showing the BMI2 versions of everything, we also support non-BMI2 paths for machines that don’t have it, and in those we want to use the shr reg, cl forms. We also use the store-from-ch instruction, so the register receiving the table load needs to be one of eax, ebx, ecx or edx anyhow; it’s easiest to just make rcx the dedicated temporary register in this code and make sure everything that needs to be a shift count ends up in there in, that way the BMI2 and non-BMI2 paths can be very similar.

[3] Dougall Johnson has done some experimentation on Apple M1/M2 hardware and according to him, going up to even 12 streams is still beneficial, but I haven’t tried to replicate this myself.

Computational complexity of texture encoding

Most standard texture compression formats use a type of algorithmic vector quantization (meaning that instead of storing an explicit codebook of possible blocks, the codebook entries are determined by an algorithm that uses the encoded block as an input). This is the case for all the BCn formats, ETC/EAC, and ASTC, but not PVRTC, where the decoded pixels near the edge of a block also depend on adjacent blocks, which makes the encoding process more of a global problem (so the notes in this post do not apply).

Furthermore, the contents of the encoded blocks can usually be partitioned into three categories:

  1. Header/mode bits, which determine how the remaining contents of the block are to be interpreted.
  2. Endpoints, which specify some of the colors present in the block in some form. The endpoints are then expanded into a larger palette of colors, and pixels select which color they want using
  3. Indices into the palette. Pixels generally pick the index that gives the closest match in the palette according to some error metric.

The expansion process from specified endpoints into the full palette frequently involves some form of linear interpolation, and the error metric is usually some variant of squared error (possibly weighted), for numerous reasons.

I have now several times run into people who thought that encoding textures at reasonable speed was impressive because this problem looks like certain well-known hard problems such as Integer Least Squares. This is not the case, for at least two orthogonal reasons:

  1. Being able to reduce a problem to some variant of Integer Least Squares, or Integer Linear Programming, does not automatically make it hard in a computational complexity sense. The reduction in a hardness proof needs to go the other way: you need to be able to turn an arbitrary instance of a known hard problem into an instance of your problem to prove that your problem is hard. It is entirely possible, and frequently quite easy, to go the other way and turn trivial problems into instances of NP-hard or worse problems (usually by accident).
  2. Hyper-polynomial growth of the best-known solution to a problem as a function of problem size N matters when the problem size depends on the input, but this is not the case for block-based texture compression. When the N for some hyper-polynomial algorithm is bounded by a constant independent of the size of the input (in this case that would be the dimensions of the texture), the asymptotic complexity of whatever algorithm you use to solve a NP-hard subproblem does not really matter.

In particular, if you chop an image/texture into fixed-size blocks (i.e. block size doesn’t depend on the input) that are then encoded fully independently, and encoding any given block takes a bounded amount of time, that makes encoding of the full image linear time in the number of input pixels, even if encoding any individual block is a combinatorial nightmare. You might not like the constant factors in front, but it’s still going to be linear time.

For the purposes of computational complexity, behold the following universal algorithm to encode compressed textures:

best, best_dist = None, None
for E in enumerate_coded_blocks():
  D = decode_block(E)
  dist = distance_metric(D, original_block)
  if best_dist is None or dist < best_dist:
    best_dist = dist
    best = E
return best

This is a perfectly fine linear-time algorithm with the only problem being that even with something as modest as 64-bit blocks, it’s not expected to finish before the heat death of the universe. Don’t worry though, this post is not just an exercise in pedantry.

Namely, the other thing we can immediately do from the decomposition of a block into its constituent parts above is to shrink the search space drastically, and of course practical encoding algorithms do just that.

  • Not all header/mode bits need to be explored. If desired, methods that still want to guarantee optimality can use branch-and-bound techniques to prune the search. In practice, nobody actually wants to spend the extra time (or cares about the generally tiny improvements from) guaranteeing optimality, so more commonly heuristics are used here as well.
  • Formats that split their blocks into subsets let you do parts of the solve separately, which shrinks the search space considerably where it applies.
  • Given a set of endpoints, the source pixels, and a distance metric, it is pointless to try all possible choices of indices. Once we fix the endpoints for a subset (and possibly some index constraints resulting from that choice), the optimal indices can generally be determined directly. This direction is usually cheap to solve exactly, but the endpoints are a somewhat awkward search space.
  • In the other direction, if we fix a set of indices, the source pixels, and the distance metric, we can try to solve for the optimal endpoints given that set of indices. This direction is usually tricky to solve exactly, but the search space is easier to prune to a small subset. It’s the basis of cluster fit methods like the original algorithm in the S3TC patent and methods derived from it.

The actual reason I’m writing this post is that for several of the simpler formats, the resulting search space is small enough that exhaustive search is completely viable. Not so much for real-world texture encoding, it’s too slow and the benefits are way too small for that. For BC1-5, it’s also pointless because the decoder isn’t exactly specified. But it’s definitely doable to get reference results if you want to see what optimality actually looks like.

In particular, for BC3 alpha or BC4 blocks, the set of possible endpoint/mode combinations is just 16 bits (the ordering of the endpoints selects one of two modes), so 65536 possible choices of endpoint pairs. Determining the optimal endpoints in 1D isn’t even hard (and can be done exactly using fairly straightforward math), and it’s all very easy to accelerate using SIMD, using a GPU, or both. This is not only eminently viable, it’s viable enough that not only can this be used to test individual blocks (or small image snippets), you can encode full textures this way just fine. Sure, encoding a 1024×1024 texture that way takes seconds instead of milliseconds but that’s not long to wait for a reference result at all.

BC5 is just two independent BC4 blocks, one per channel, and they’re completely independent, so this also applies to BC5. BC1 color blocks are still 64-bit blocks and have a 32-bit space of possible endpoint pairs. However, 232 = about 4 billion “interesting” encodings (since we don’t care about sub-optimal choices of indices) is still small enough that exhaustive testing is totally viable on either CPU or GPU if you’re using an optimized search, a beefy CPU/GPU, parallelization and are willing to wait a bit to encode a single 4×4 pixel block. (In this case, you probably would not want to encode a full 1024×1024 texture this way.) And if we can do BC1 and BC4, we can also do BC2 (the alpha part is trivial) and BC3 (effectively separate BC1+BC4, they decompose). In short, BC1-5 are actually totally practical to do exhaustive encoding these days, and the same applies to ETC and EAC.

Would you use this for a practical encoder? Definitely not. Likewise, it’s not viable for the more complex formats like BC6-7 or ASTC. Nevertheless, it felt worth pointing out that optimal BC4-5/EAC encoding (in the least-squares sense anyway) for a known decoder is eminently possible even at home for real-world-sized textures if you’re willing to wait for a few seconds (and that’s with no pruning whatsoever). BC1-3/ETC are more involved, but still at a timescale where determining optimal solutions (given a known decoder) for a small image patch is something you can easily do on a regular home computer even without using any pruning strategies if you’re willing to wait a few hours.

Now the actual application given here is obviously very specific to what I happen to be working on, but the reason I’m writing it up is that I’m trying to make a few more general points: first, “this looks like what I know is a NP-hard problem” is no reason to assume something is actually hard (never mind infeasible) in practice, even if you actually want optimality – hardness reductions work the other way round, and this is by no means the only instance of NP-hard subproblems of bounded size that I’ve run into, which are a very different beast. Second, exploiting even some very basic symmetries can sometimes take combinatorial optimization problems from “never going to happen” to “that’ll be 5 minutes”, so it’s worth looking. Third, computers are fast. Trying on the order of 100,000 things if said trials don’t take long is just not that big a deal and hasn’t been for a long time. For that matter, trying 4 billion things has also been entirely practical for a while, as long as individual trials are in the “100s to 1000s of cycles” range (i.e. don’t try this with slow interpreted languages) and you have a few cores. For the past 15 years or so, pretty much every unary function on 32-bit inputs (especially floats) and every binary function on pairs of 16-bit inputs, I’ve just tested exhaustively on all possible inputs unless there was a very compelling reason why this wasn’t possible.

A very brief BitKnit retrospective

UPDATE May 7, 2023: I wrote this post yesterday somewhat in a huff (for reasons not worth going into) and the original post contained several inaccuracies. These have been corrected in this version and I’ve marked the places where I was wrong [like this]. Let’s just say I wish I hadn’t posted this in its original form, but I did, and I’d rather make the corrections publicly now than pretend it didn’t happen.

BitKnit is a LZ77+RANS lossless general-purpose compressor that was designed 8 years ago (1.0 release on May 20, 2015) to replace the aging and very slow to encode LZ77+arithmetic coding codec built into Granny 3D, and also worked – somewhat unintentionally – as a LZ77/entropy coding test vehicle. The codec did end up in Granny and later a small variation (with slightly different bitstream to work within the different container) in Oodle 2.1.2. In Oodle it was quickly superseded by the “oceanic cryptozoology” (first Kraken, then later Mermaid/Selkie and Leviathan) codecs, so BitKnit the actual codec is just a historical curiosity these days (and deprecated in current Oodle versions), but it was our first released code to use several ideas that later ended up in other Oodle codecs [the original version implied that these ideas started in BitKnit; that’s incorrect], and might be of general interest. So I’ll keep it brief and focus mainly on the parts of it that actually got successfully used elsewhere.

Basic overview

BitKnit (I’ll just write BK from here on out) is basically your usual LZ77 with entropy coding backend. Like LZX and then LZMA, it keeps a history of recent match offsets that can be cheaply reused; BK was designed mainly for Granny files, which are chock full of fixed-size records with highly structured offsets, so unlike LZXs LRU history of 3 and LZMAs of 4 recent match offsets, it keeps 7. This is somewhat expensive to maintain in the encoder/decoder and mostly a waste for ASCII text and text-based formats, but turns out to be very useful for the structured binary data that was its intended use case.

For entropy coding, it uses a 2-way interleaved rANS coder with 32-bit state that emits 16-bit words at a time, and 15-bit “semi-adaptive” (aka deferred summation) multi-symbol (i.e. not binary) models. Semi-adaptive meaning that they change over time, but not after every symbol; model updates are batched and the model remains static in between, which lets it amortize update cost and build some auxiliary data structures to accelerate decoding. The particular model it ended up with was a compromise to hit its speed targets, and its update mechanism wasn’t great. It was significantly faster than LZMA/CABAC/etc.-style “update after every symbol” fully adaptive models, but with appreciably worse compression, and slower to decode than a fully static model would’ve been (which would’ve also enabled tANS usage). Our later codecs jettisoned this part of the design; Oodle LZNA (as the name suggests, a LZMA-derived design) used multi-symbol rANS but with full adaptation after every symbol. Kraken, Mermaid and Leviathan (but not Selkie, which is a byte-aligned LZ with no entropy coding) support multiple choices of entropy coders but they’re all based on static models.

BitKnits use of a semi-adaptive model makes it easy to do the modelling and most of the encoding in a single front-to-back pass. Static models are trickier to encode for because there are interdependencies between e.g. the LZ parse and the statistics of the symbols emitted, which makes encoding harder and somewhat slower, but such is life. The actual rANS bitstream needs to be encoded in reverse though (we prefer our decoders to consume data in increasing address order). To limit encoder space memory usage, BitKnit processes data in chunks of 64k (uncompressed) at a time; the rANS bitstream is flushed after every chunk, which adds a small bit of overhead but lets us use a guaranteed bounded amount of memory in the encoder for arbitrary input sizes, improves its memory access patterns, and is also handy from a file format perspective. (Old post about rANS in practice, derived from how we used it in BK then LZNA, here). All the Oodle formats are also internally chunked, and have been forever (it’s how the framing format works) and I can recommend this strategy in general.

Entropy coding experiments

While working on BitKnit, we were experimenting with a lot with different entropy coding options, and much of it ended up in our other formats in some way or other [my colleague Charles Bloom also did a lot of experimentation at the same time, there was a lot of bouncing ideas back and forth in a bunch of mail threads, and the previous version of this post implied I did all that work, which is not true.] My blog post “Mixing discrete probability distributions” was written while working on BK and underlies both its semi-adaptive models and the follow-up “Models for adaptive arithmetic coding”. The latter are what’s used in Oodle LZNA, mostly with 8-, 9-, 16- or 17-symbol alphabets, using a SIMD decoder. This style of model with a larger radix than binary needs to sustain a much smaller number of symbol decodes per second than the equivalent binary arithmetic coder does, and especially combined with interleaved rANS coding gives quite a nice speed boost in the decoder. The caveat is that they’re usually not quite as good in terms of compression ratio as the “binary tree”-style adaptive binary models, but I consider it a great trade-off that even now, 8 years after writing it up, feels under-explored to me. (LZNA did use it; our follow-up formats didn’t simply because we’ve generally moved away from design points where adaptive coding makes sense entirely, since in our application space, decoder speed is very important.)

The actual models that ended up in LZNA were designed by my colleague Charles Bloom; ultimately, the adaptive coding direction was rejected as being too slow for the rates BK was targeting (not least because some of the platforms BK was meant to work on didn’t have suitable integer SIMD either), but it was a great fit for LZNAs slower target. Alas, it too didn’t last too long; Leviathan, which came out a few years later, usually gets quite close – sometimes better – at something like 10x the decode speed, using purely static models.

LZ literals

The main problem with batched model adaptation is that it takes quite a bit longer for the encoder (or decoder) to respond to changes in symbol statistics. LZMAs backend can pivot rapidly on such changes: when an unexpected (i.e. low-probability) symbol is seen, it’s probability is immediately boosted drastically. In my experiments, even batched updates that run every 2 symbols (i.e. update after every second symbol) did much worse than immediate updates/full adaptation on LZ literals, i.e. all the bytes not covered by string matches. Other types of symbols fared much better in my testing, but that’s of little use, because literals usually make up most of a compressed LZ data stream both by symbol count and by encoded size. If the literals have to be fully adaptive, that most of the runtime cost of full adaptation already, which as already mentioned was slower than what BK was aiming for.

Because BK was meant for Granny (a 3D mesh and animation exporter/runtime toolkit), most of the data I was looking at for mesh exports in particular was mesh vertex data, topology (in the form of index buffers), and textures. Vertex data is laid out in the form of fixed-size structs containing fields like vertex positions, normals, UV coordinates, and so forth. Meshes often have a fair bit of redundancy between vertices, where the bytes encoding positions of vertices frequently match positions of other nearby vertices, it’s common for authored vertex data to have meshes that are axis-aligned planar components that have a lot more structure still, and both the normals and UV coordinates in such cases tend to be highly repetitive as well, although usually with some variation. Indices are list of small integers that are usually “more or less” increasing (meaning the average value of the indices you see in a sliding window of recent indices will steadily go over up time, although there’s of course outliers). Some textures are quite noisy, and LZs don’t do great on those (you can improve things slightly with methods like PNGs filters). Some are “vector-y” with very geometric features and large flat areas, and LZs do much better on those. And some are “vector-y” with gradients, which again LZs don’t do awesomely on (it somewhat depends on gradient orientation), but they obviously have very nice structure in the byte values.

What most of these have in common is that delta coding (i.e. sending values as differences to a prior value, rather than directly) tends to work well on them. And from a different direction, LZMA (the main codec I was comparing against for BK) has a special case in its literal coding for the literal immediately following a match: the idea is that the decoder can be fairly sure that right after a match ends, the next literal in the data stream is not what the next byte after the final matched location would have been – otherwise we would just have sent a longer match to begin with. LZMA does not strictly enforce this, and the optimal parser can do such choices sometimes because it turns out to be advantageous, but the probability of the literal byte being the same as the match byte is still quite low. So what LZMA uses after a match is use a probability model for the XOR of the literal byte and the match byte.

This does two things: first, we expect that XOR to not be 0 with high likelihood, so even if everything else is completely random, we now end up a with an alphabet of 255 uniformly random symbols (with a very low likelihood of a 0), instead of 256 random symbols, which is slightly better even assuming nothing about the distribution of values at all. But second, the match byte and the literal byte are not uncorrelated in practice. The first literal after a match, especially in scenarios like ours where we have arrays of fixed-size records with very regular structure, is often a “near miss”, and the match byte still ends up being a good predictor. LZMA uses a bit-wise binary tree model, so XOR is a natural predictor for that topology.

BitKnit, however, does not; it uses byte-based coding, and as explained above we have a lot of data where we expect delta coding to do well. Hence, BitKnit sends the byte difference (mod 256) between the match byte and the literal byte. [This is a part where I really got the history wrong, there’s an entire correction below.]. This worked great, and some experimentation later, it just ended up sending all literals this way, not just the first literal after a match – the first literal after a match sends the difference to the byte right after the match ended, the second literal after a match a difference to the byte after that, and so on.

This solved several problems at once: first, it acts as an automatic delta-coder for this kind of data, and is neatly integrated with the LZ so there is no fiddly setting up good byte distances to use for delta-ing or similar. Because these distances are determined by the match distances, it’s all automatic within the format. It’s not generally as good as a custom format-aware delta coder, but it tends to get a lot of the benefit and Just Works. Second, it turns out that (for the data I was looking at anyway) the delta-literal distribution was usually a lot less variable than the raw literal distribution, which was the key enabler for BitKnits use of semi-adaptive models throughout.

BitKnit itself always codes literals this way; this is not ideal for other types of data, but – somewhat surprisingly, to me anyway – worked decently even more text-based data like JSON or XML. Right around the time of BKs original release, Charles wrote this up in his post on what he ended up calling LZ-Sub. He was using this at the same time in LZQ1, an experimental codec of his that never shipped but was used for lots of experiments that led to Kraken. Kraken, Mermaid and Leviathan all used this method and later expanded on it. Notably, neither LZQ1 nor the newer codecs go strictly one way or the other. LZQ1 could switch whenever it re-transmitted Huffman tables; Kraken and Mermaid work in chunks (128k of uncompressed data), and both of them have a per-chunk flag whether literals should be taken as raw values or as deltas from the match bytes. The delta decoding is slightly more expensive than just copying raw literals but does tend to give decent compression ratio improvements on many sources. As with essentially all other decisions in these codecs, the decision is made from an explicit space-speed trade-off (i.e. evaluate both compression ratio gains, if any, of using delta-literals, estimate decompression cost, and make the decision by weighing both factors according to a user-configurable trade-off). Leviathan goes further down this path and adds several other literal modes, so there’s not just two. Leviathan doesn’t have any notion of per-byte difference other than subtraction – we tried XOR but never found any real compelling use case – and instead can optionally include other types of context during its literal modeling.

The trade-off here is that the more complicated modes are both more expensive to decode and need to be evaluated by the encoder, which adds a bit to the encode-time cost.

For an example of how this works out in practice, you can look at for example Aras Pranckevičius’ recent (at time of writing) series of blog posts. In post 2 of the series he compares several of the Oodle codecs to a bunch of other competitors and finds out that we compare very well in terms of compression ratio, speed and decompression speed. The differences in his first test are as drastic as they are largely because Aras’ test set (structured binary data, namely several large-ish arrays of data items that are 4 floats each) plays exactly to our formats’ strengths. As you might notice, this is very similar to what you would see for the “arrays of vertex data” situation that was the design goal for BitKnit and prompted many of its design decisions. Kraken and Mermaid improved on BKs design considerably, but the delta-literals in particular come straight from BK and work very well for this kind of data. Later on in the series in part 6, Aras experiments with his own data-specific delta filtering, at which point the big differences between Zstd and Kraken essentially disappear (worth noting his results are exactly on the fast to medium part of the compression speed scale; Zstd gets better bang for the buck in that range, we tend to gain some more compression ratio at the slower settings, but overall I’d describe the two as comparable). In terms of absolute numbers, the delta literals in Kraken mean that Kraken at “Optimal1” (level 5) in the first test hits about 3.71:1 compression ratio while zstd level 18 (comparable encode speed) manages around 3.04:1. The latter test has Kraken Optimal1 around 4.34:1 and zstd level 15 (closest quoted) around 4.17:1, which I’m going to call roughly even. (We could make both codecs trade places here by playing around more with Zstd’s compression levels and Kraken’s configurable space/speed trade-off setting). In short, the delta literals manage to extract about half the gain a specialized difference filter does, with no manual data-specific work required.

Lots of repeat offsets

The main other fairly unique design choice in BitKnit is to have many repeat offset slots; where LZX, LZMA and Zstd (and also Kraken, for that matter) use 3, BitKnit uses an extravagant 7. [Turns out, Charles had also experimented with this some years prior, but that one I legitimately didn’t remember, and either way it’s a matter of changing a single constant, I’m sure it’s been independently discovered many times.] As with everything else in BitKnit, this comes down to its genesis as a codec for a data format leaning towards fixed-size records. In short, the extra repeat offset slots are useful for the data it’s designed for, whereas for many other types of data they don’t really help. The extra repeat offsets rarely outright hurt compression, but maintaining them does take time in the decoder.

One weird wrinkle in BitKnit that we didn’t take forward into other formats is the replacement policy. The standard method is to use “move to front” coding, i.e. repeat offset slots are numbered from most to least recently used. With BitKnits many offset slots, I found that I got some slight wins by not inserting new offsets into the precious “0” (frontmost) slot; instead, new offsets got inserted into slot 2, and the only way for an offset to make it into slots 0 or 1 is to get reused at least once. This was a slight win in BitKnit but didn’t make it into any of our newer codecs; notably this was designed into BitKnit before I had an optimal (or optimizing anyway) parser working, and was a small but solid win with the heuristics-driven parse I was using in the encoder at the time. When we re-evaluated this idea for other formats later, we found that optimizing parsers are good at managing the recent offset slots anyway so there wasn’t much benefit to this. Leviathan did end up using 7 recent match slots (largely because it provided benefits on certain types of highly structured data, same as I saw in BK) but with standard MTF/LRU update rules. (I wrote up how BitKnit implements its offset history years ago.)

And that’s it! I’m not sure if these notes are valuable to anyone else, but BK was a fun little codec to work on (and it was little – decoder, heuristic encoder with multiple speed levels and optimizing encoder all fit into a 2700-line file C++ file, with extensive comments) and it led to many ideas that we elaborated on further in our follow-up codecs.

More details on the literal models

I got the history wrong in the first version of this post, and some digging into old emails and commits later the details are actually more interesting than I remembered, so for what it’s worth, here’s some more history on how the various RAD codecs ended up having delta-coded literals as a mode.

Early BitKnit started with the idea of building something with adaptive multi-symbol arithmetic coding built on rANS. Multi-symbol arithmetic coders have existed for a long but, but they were mostly avoided for a long time due to requiring a divide, which is fairly expensive. At the time (2014), new-ish PCs had 32-bit integer divide latencies that were not unreasonable (clock cycle counts in the high 20s, on a dedicated divider that didn’t block anything else), but some of the targets we were still shipping code for had division that took 100+ cycles and blocked all integer execution for the entire time, which was problematic to say the least. Encoding for rANS still needed divisions but decoding did not, so this was very attractive as simultaneously being faster than a regular multi-symbol decoder on newer PCs, and not being unacceptably bad on more constrained targets.

The constraint was that to be able to use a divide-free approach, we needed to track estimates for a probability distribution in a way that kept the sum of all scaled probabilities at a fixed power of 2 at all times. That’s where the earlier “mixing discrete probability distributions” comes in. Early BK first used this to build a variant of LZMAs binary-tree probability models that worked on groups of 4 bits (nibbles) at a time, and also a variant of LZMAs literal-after-match (“LAM”) model, “Nib-LAM”.

The regular nibble models maintain 17 probability distributions over 16 symbols each: one distribution for the high nibble of a byte, and 16 more for the low nibble depending on what the value of the high nibble was. “Nib-LAM” keeps 48 distributions over 16 symbols: 16 distributions for the high nibble of the literal given the high nibble of the match byte, 16 distributions for the low nibble of the literal if those high 4 bits all agreed (indexed by low nibble of the match byte), and 16 distributions for the nibble of the literal if those high 4 bits didn’t agree (indexed by the high nibble of the literal that we just sent).

Early versions of BK used this Nib-LAM model, but unlike LZMA (which only uses this type of model for the first byte after a match), even the very first versions of BK I could find in source control always used this for later literals too (just extending the most recent match forward). I honestly don’t recall whether this was intentional or just a mistake, but literal coding in BK always worked this way, from the very beginning, and this gelled nicely with the delta literals.

Around the same time, my colleague Charles Bloom was experimenting with what turned into LZNA, and also played around with the Nib-LAM models. These were sometimes better than the bit-wise binary tree LAM models used by LZMA, sometimes worse; them being a bit worse makes sense if you think of them as a “chunkier” approximation of a bit-by-bit model, but more interesting was why they were sometimes doing a lot better.

He noticed a property that is easiest to see in the case when the high nibbles between the literal and match byte agree: we have 16 possible contexts (one for each of the possible high nibbles of the match byte, which agrees with the high nibble of the literal) each storing a distribution over 16 possible values. That is, when those high nibbles match, we track probabilities for all 16*16 possible combinations of (literal low nibble, match low nibble), and in particular can pick up any correlation between those, even when we have a mismatching bit somewhere in the middle (LZMAs bitwise-LAM models “detour” to a different part of the model immediately when that happens). And in particular, he noticed that on files where the Nib-LAM models did surprisingly well, he could do even better by sending straight-up delta literals.

As far as I can tell, that’s how the delta literals in their described form ended up in BK: generalizing the LZMA LAM model idea to a higher radix to get Nib-LAM, using Nib-LAM on all literals (not just the first after a match), Charles noticing that on the cases where that really improves on LZMA you can do even better by sending straight deltas, and then the deltas ended up in BK as the primary literal mechanism and replacing Nib-LAM for good as part of a push for speed, since Nib-LAM was just too slow for BKs target speeds and the delta literals worked fine with semi-adaptive models.

Most of the significant work on all three of LZNA, LZQ1 and BitKnit happened concurrently (and with many mails being sent back and forth…) from late February to late May of 2015. The original version of this post was poorly written; for the record, I’m not trying to claim sole credit for anything described in here, nor do I want to erase the contributions of others on those same emails threads at the time (which included not just Charles and I, but at the very least also Jeff Roberts, Sean Barrett, and Won Chun).

Notes on FFTs: for implementers

In the previous post I’ve talked about things you might want to know as someone who uses FFTs, this part covers all kinds of FFT implementation details, including the underlying reasons for a lot of the API complexities that showed up last time. I’ll also give some recommendations on what I think are good ways write a FFT right now, and presumably also going forward.

Algorithm zoo

There’s a ton of FFT algorithms to choose from, but practically speaking, the broad Cooley-Tukey family of algorithms (I include variants like split-radix as part of the general family) is by far the most important.

If you don’t already know how CT FFTs work, I would recommend to avoid any sources that derive the algorithm by manipulating sums of trigonometric functions (doubly so if they do it with real sin/cos instead of complex exponentials). Way too much noise and index manipulation, nearly impossible to see what’s going on. I would also not recommend the other extreme, sources that explain FFTs purely in terms of signal flow charts/butterfly diagrams, because while that form can be useful to look at particular transforms (until they get too large to comfortably fit in a sensibly-sized diagram anyway), the first few (or last few, depending) levels of the decomposition are in some sense special cases (because their twiddles are trivial), and with a flow-chart form that is most of what you see. Since you’re forced to look at a concrete realization that is generally not too big and thus mostly special cases, it can be quite hard to see what the actual general construction is supposed to look like.

The least “noisy” introductory explanation I’ve ever seen for a CT radix-2 FFT (provided you know some basic linear algebra) wrote down a 8×8 DFT matrix as powers ω0, …, ω7 of the 8th root of unity and proceeded to find and use the symmetries on it, then explained the general rule once you’d seen the particular example (it’s not too hard to figure out yourself once you start that way). I like this form: having the powers of the root of unity exposes the critical structure that is being used, but obscures the fact that of course ω2 is i and ω4 is -1, which is precisely the property that makes small signal-flow-graph based descriptions confusing to me because it leads to simplifications that break up the regularity somewhat.

If you have the required background (meaning you know what polynomial rings are and how they work), I’ve found ring-theoretic approaches the most insightful in terms of showing the actual structure of the problem without getting bogged down in the details (and in index manipulation) as tends to happen to varying degrees when working with sums of trigonometric terms, signal flow graphs or matrices. Honestly I think the most I’ve ever gotten out of a single source on the subject would be D. J. Bernstein’s “Multidigit multiplication for mathematicians” (section 7), which covers the underlying algebra for many of the major FFT variants in about 2 pages.

DIT vs. DIF

All Cooley-Tukey family algorithms come in two dual variants, Decimation-in-Time vs. Decimation-in-Frequency. With the recursive decomposition as introduced by Gentleman and Sande, a DIT FFT for even N is decomposed into first computing two size-N/2 DFTs on the even and odd elements respectively. These two DFTs on the even and odd halves of the data can then be combined into a DFT for all N samples by first multiplying the odd-half coefficients by so-called “twiddle factors” and then adding and subtracting the even and twiddled odd halves from each other (“butterflies”).

Splitting into even and odd halves is done at every level of this recursion, and always before any of the data is modified. We would like the actual computation parts of the kernel (at the various sizes) to work on contiguous elements; what this ultimately results in is a big data permutation at the start that does all of the even/odd splits at once (which turns out to be a bit-reversal permutation, where if indices i are in { 0, …, N-1 }, the item at position i is sent to the position bit_reverse(i) that literally reverses the order of bits in the binary expansion of the number.

Therefore this (now standard) version of Cooley-Tukey consists of first a permutation on the input, and then all the follow-up work can be conveniently done in-place. (We’ll go into memory access patterns later.)

The dual version of this approach is Decimation-in-Frequency (DIF). In DIF, you start with butterflies, this time between the first and second half of the input. Then you twiddle the second half and finally recurse into size-N/2 DFTs on the first and second halves (which are already contiguous), and once those partial DFTs are done, you interleave them. Again this interleaving happens as the last step at every level of the recursion, so you can skip it in the middle of the recursion and end up with FFT coefficients out of order – bit-reversed in fact, so you have a bit-reversal permutation at the end. In short, it’s the exact opposite of the DIT decomposition (if you invert a DIT FFT by reversing the flow graph, you end up with the corresponding DIF FFT, and vice versa).

The bit-reversal permutation is possible to do in-place, but it’s annoying and relatively slow. DIT FFTs, which need to do the reordering first, often prefer to be out-of-place: in their first pass, they apply the input permutation while copying to the destination, and then can work in-place for the rest of the computation. This can be viewed as a feature by saying that they leave the original data intact. DIF FFTs also want to do their computation in-place and have the reordering last; so they either overwrite the input completely leaving a completed but permuted FFT in the original input, and then do a final reordering step that copies to the destination, or they can be – very conveniently – fully in-place and leave the coefficients in a permuted order. This latter variant is the type of FFT that can give you permuted-order coefficients more quickly, which you might not mind as long as you’re only doing convolutions with it. The inverse would then use the corresponding DIT factorization, skipping the initial permutation step, and in fact ending up with zero shuffling whatsoever.

This is nice in principle, however the bit-reversed order is annoying for library users that do want FFT coefficients in the conventional order, and it gets worse when other types of decomposition (e.g. radix-4 or split-radix variants) are used which have a different “natural” order for their coefficients to end up in. In short, using DIF with whatever radix you want and then returning the unpermuted coefficients does save some work, but it does so at the expense of exposing a lot of implementation details, so it’s a bit dicey to expose as part of a library API.

Finally, the elephant in the room: multiply-adds. A radix-2 DIT step ultimately boils down to a twiddle of the odd half then a butterfly, which works out to (the additions, subtractions and multiplications here are using complex numbers, and w is a twiddle factor)

  out0 = in0 + w*in1
  out1 = in0 - w*in1

This looks as though it might nicely map to multiply-add operations where available, and in fact it does, although not exactly in the obvious way (I’ll get to that later). For a DIF version, remember that the order of operations was reversed, and we first add/subtract and then twiddle, so we get

  out0 = in0 + in1
  out1 = (in0 - in1) * w

which is not in multiply-add form. We might think this is a temporary setback and we just have to factor this w into the next operation using out1 to get our multiply-adds, but that doesn’t work either: out1 ends up going into another DFT layer that looks just like ours, and both of its inputs need to be twiddled, so if we delay multiplying by w, we get something like

  out0 = deferred_w0*in0 + deferred_w1*in1
  out1 = deferred_w0*in0 - deferred_w1*in1

Note we have two twiddles now because in general the twiddle factors feeding into the two inputs at the next level will not be the same. We can do one multiply and then two multiply-adds, but this is not great at all, and we still are deferring one twiddle multiply for the next iteration. (Maybe there is some good trick here that I’m missing?)

In short, with DIT we get a form that is amenable to using FMAs (or other types of multiply-adds) throughout; with DIF, not so much. In my opinion that severely limits the usefulness DIF + out-of-order coefficients as an attractive strategy to avoid permutations for convolutions, because now your DIF FFT potentially performs quite differently from its DIT inverse that shows up in the IFFT.

Therefore, these days, I’d stay away from permuted coefficient orders at API boundaries and just eat the permutation (and/or copy) to produce conventional-order results.

Different radix, arithmetic complexity

The “canonical” way to write a Cooley-Tukey FFT for power-of-2 sizes uses a radix-2 decomposition, meaning at every level we subdivide into 2 sub-FFTs of size N/2. Another popular variant uses radix-4 steps, which do a slightly different calculation to immediately subdivide into 4 sub-FFTs of size N/4. This has slightly more complicated logistics, a different output permutation (corresponding to reversing the digits in a base-4 expansion of the index this time), and can reduce the number of multiplies relative to radix-2, which is what originally sparked the interest in it – the number of multiplications used to be the primary cost in the FFT algorithm, especially in the absence of hardware multipliers, much less pipelined ones.

There’s further refinements such as split-radix, which is a hybrid between radix 2 and 4, using the fact that part of the computation in a recursive decomposition is more efficiently done as radix-2 and the remainder is more efficient as radix-4 (as you might expect, this introduces further complications to the coefficient permutation, which for split-radix is quite complex). There’s a whole sub-genre of hunting down individual stray multiplications: complex multiplications with twiddle factors that are 1, -1, i, or -i are obviously unnecessary and can just be turned into real-valued additions and subtractions (optional with swapped or sign-flipped real/imaginary parts) with no actual multiplications. Slightly less obvious, multiplications with twiddles that correspond to +-45 and +-135 degrees (+-1 +- i) / sqrt(2) multiply both the real and imaginary part by the same value and can use a slightly cheaper rotation. And then there’s some more tricks with leaving values scaled to reduce the number of multiplies further.

All these tricks can be used systematically in the first few (or last few, depending) steps where these are all the twiddle factors that appear. Scalar implementations can also make use of them at other levels of the decomposition to special-case individual values, but for SIMD/SIMT architectures trying to special-case these rotations in the middle of larger blocks hurts way more than it helps.

More to the point, all of these techniques are about minimizing arithmetic complexity, and more specifically, reducing the number of multiplications, since the number of additions is the same for all of them – and also larger than the number of multiplies. And that’s a crucial caveat since, generally speaking, we’re on a very different cost model from counting number of individual floating-point operations these days. Let’s give a few examples to see how this matters:

  • Intel Ivy Bridge (2012). Older uArchs like Sandy Bridge, Nehalem etc. are similar, all the way back down to the original Pentium Pro (1995). FP/SIMD multiplier on port 0, FP/SIMD adder on part 1, both fully pipelined, can independently receive an instruction every cycle. Thus, in the limit, FP ALU throughput is limited by whichever type of operation we have more of, and that’s additions in this case.
  • Intel Haswell (2013), Broadwell (2014). FP/SIMD fused multiply-add unit on ports 0 and 1; FP adder on port 1. That’s right, HSW and BDW can do two FMAs per cycle (256b vector ones at that, too!) but only one FP add. Once again, multiplies are less of a problem than adds, funnily enough.
  • Intel Skylake (2015) through Rocket Lake (2021). FP FMA on ports 0 and 1, adds also accepted on ports 0 and 1. No more add imbalance, but if we’re chasing op counts for throughput, it’s far more helpful to use FMAs efficiently than it is to minimize multiplies. (As of 2022 Alder Lake, we do get a dedicated FP adder on port 5.)
  • ARM Cortex-A55 (2017). Extremely common mobile device “little” core. Two SIMD/FP pipes, the FP portion is effectively 64b wide, can do two multiplies/adds/FMAs per cycle, so again the biggest bang for the buck is by using FMAs. The newer “medium” Cortex-A710 (2021) is still two FMA-based FP/SIMD pipes although they are fully 128-bit now.
  • Pretty much any GPU or DSP built within the last 20 years (and that’s conservative): built around pipelined multiply-accumulate operations. Not necessarily fused floating-point multiply-adds (which is a specific thing meaning there is only one rounding stage), but some form of multiply-accumulator, maybe fused, maybe with intermediate rounding, maybe with weird truncations in the middle, maybe fixed point, but there is some kind of multiply-add in there. If we do a multiply, we get a corresponding add for ~free, as long as we have a suitable candidate.

You get the idea. For a multiply-add based architecture, the operations we’re counting – if we’re counting anything – should be multiply-adds and not individual multiplies or adds. For a machine without multiply-adds but with separate superscalar mul and add pipes, it matters what the ratio of mul to add resources is; for integer (fixed-point), you usually get more adders than multipliers so the number of multiplications is the thing to count, but for floating-point, it was a common design point for a long time to have one pipelined FP multiplier and one pipelined FP adder, and in that case the bottleneck is whichever pipeline gets more work, which for FFTs is generally the adder. Now if your target is single-issue, has a different pipeline, or doesn’t have pipelined multipliers (or adders for that matter), the classic cost model might be relevant to you, but for newer targets with beefy vector pipelines, nickel-and-diming individual multiplies is worse than useless (since it actively introduces irregularities that are more expensive than just doing the math) and utilizing FMAs (where available) well is more important.

For what it’s worth, the classical operation costs of a power-of-2 sized FFT for the basic algorithm variants – just the leading terms:

  • Radix-2: ~5N log2(N) operations
  • Radix-4: ~4.25N log2(N) operations
  • Split-radix: ~4N log2(N) operations

In my experience anyway, radix-2 and radix-4 can make use of all-FMAs quite easily, split-radix gets a bit iffy, so these days, if you’re gonna pick one, I think radix-4 is a solid choice.

How do I use FMAs in this anyway?

Right. So, let’s write this out. For a DIT radix-2 step (radix-4 is analogous, just more of everything, you can figure it out), the complex-valued version as given above is

  out0 = in0 + w*in1
  out1 = in0 - w*in1

but of course we actually need to implement this in terms of individual real operations and expand out that complex arithmetic, which turns this into

  out0.re = in0.re + w.re*in1.re - w.im*in1.im
  out0.im = in0.im + w.re*in1.im + w.im*in1.re
  out1.re = in0.re - w.re*in1.re + w.im*in1.im
  out1.im = in0.im - w.re*in1.im - w.im*in1.re

which makes it look like it’s 8 multiply-adds (I’ll use multiply-adds and FMAs interchangeably in the following, even though it’s not technically correct), but that’s clearly not a good way to do it. We could compute the shared “w*in1” part once, which takes 2 multiplies and 2 FMAs, and then we still have 2 adds and 2 subtractions left, so that’s not better in the “multiply-adds cost the same as isolated multiplies or adds” model. Now, what goes wrong here? Clearly, through the first half (the computation of out0), we can use 4 FMAs, and there’s no redundant computations so far, so that all seems fine. But in that second half we’re mostly recomputing stuff we already computed in the first half. Luckily, there is a trick, which is actually easier to see in the complex-valued version above: we can compute out1 via

  out1 = 2*in0 - out0

instead. This is a multiplication by a pure real number so we can do this using 2 more FMAs, giving us this computation once we expand out into reals:

  out0.re = (in0.re + w.re*in1.re) - w.im*in1.im
  out0.im = (in0.im + w.re*in1.im) + w.im*in1.re
  out1.re = 2*in0.re - out0.re
  out1.im = 2*in1.re - out1.im

6 FMAs total. For comparison, the “classic” version using just regular multiplies and adds ends up with 4 multiplies and 6 adds/subtracts total, so our op count goes down from 10 to 6, quite the a big change (NOTE: none of this is new and certainly not my invention, I’m just trying to spare you the time of digging it all up).

Oh, and regarding that complex multiplication in the middle: there are old tricks to replace the 4-mul-2-add complex multiplication with 3 muls and 5 adds, 2 of which can be precomputed if one of the factors is known, as is the case for twiddle factors, so you end up with 3 muls and 3 adds. However, as you might expect, this algorithm is more irregular, needs 3 floats for the twiddle factors instead of 2 (and also prevents you from using tricks that might shrink this further), and even though it’s 3 multiplies and 3 adds it’s not in the form of 3 multiply-adds, although you can reduce it to an add, a multiply, and then 2 FMAs, which sounds good in theory, except…

Is minimizing arithmetic operations actually the key?

In a word? No, not usually.

Here’s the problem. Take the 6-FMA snippet above, and assume we’re working scalar for now. That’s all the math to process a pair of complex values in a radix-2 DIT FFT step. You have to load the real and imaginary parts of both inputs (4 loads), load the real and imaginary parts of the twiddle factor (2 loads), and finally store the real and imaginary parts of both results, normally to the same place you just loaded them from (4 stores). In total, 6 loads, 4 stores and 6 arithmetic operations per 4 floats processed in each iteration.

This is not really a compute kernel so much as a memory streaming exercise with a part-time gig in arithmetic.

I was looking at a scalar version for simplicity, but the picture is exactly the same when looking at a vectorized version that processes multiple samples at once (because FFTs are so regular, vectorizing them is, for the most part, straightforward). It also doesn’t actually matter here whether a vectorized version uses a split or interleaved layout internally; interleaved has the real and imaginary values next to each other, but having complex numbers packed that way in SIMD registers means that instead of N “real-valued” SIMD lanes, we effectively end up with N/2 “complex-valued” SIMD lanes. The amount useful work per individual float remains unchanged (at best; in practice, using an interleaved format inside the kernel often picks up overheads from elsewhere. I’ll get to it later).

What we’re looking at here is the real problem with radix-2 factorizations, which admittedly is especially amplified when using the FMA versions of the core kernel: the ratio of computation to memory operations is unfavorable. Now exactly how bad this is or isn’t depends on what machine you’re targeting, and what the proportion of compute to memory resources is, but more loads/stores than arithmetic ops (or even a 1:1 ratio as you would get without FMAs) is never good news. This is also the real reason why the 3-mul 3-add complex multiplication which needs 3 floats per twiddle is not that attractive: it decreases arithmetic and increases memory load even further.

Radix-4 happens to have fewer multiplies, but much more importantly, it does the work in roughly half the number of passes, and thus halves the number of coefficient loads/stores. You can reduce this even further by using yet larger radices (if you have enough registers available to keep all those values around), but the biggest single jump is from radix 2 to radix 4.

Another note on radices: there’s a distinction between radix-4 and what I’ve seen called radix-22, a name that feels clunky but that I’ll keep using for lack of a better suggestion. The former is an actual radix-4 step, which is to say, for DIT, we have 3 twiddles “in front” into a 4-point DFT kernel (all multiplications by 1, i, -1 and -i, which are all not actual complex multiplications, just swapping values around and some negating). The latter is essentially two unrolled radix-2 passes chained back to back. In terms of arithmetic complexity, the difference is that a “proper” radix-4 kernel has 3 twiddle multiplies, while the radix-22 variant contains 4 instances of a radix-2 sub-kernel that each does one twiddle multiply. However, classic radix-4 needs to load those 3 twiddles from somewhere; radix-22 only needs to load 2 different twiddle factors (which both get used twice). Considering our woes with the amount of computation vs. amount of memory operations required, and that FMAs change the trade-offs for the effective cost of multiplications in a radix-2 kernel, radix-22 is a very interesting candidate!1

Point being: we can save a fair amount of unnecessary memory operations by using larger radices. This can use actual larger-radix algorithms, or it can use a radix-2k-style approach where we may just load a bunch of values into registers and complete 2, 3 or more levels of the factorization at once while they’re there. The difference in twiddle factor loads really starts to ramp up as we do more of these: a “real” radix-8 FFT needs to load 7 complex twiddles, a radix-23 step only needs 3 (and classic radix-4 into radix-2 would need 4). Combining steps like this, provided we have sufficient registers, lets us increase the amount of computation in between loads and stores to a level where we’re actually keeping the FP/SIMD units respectably busy and not just spamming memory operations.

Memory operations, the second

So far, I’ve been looking at memory operations purely in terms of the number of instructions that need to execute. But what about actual cache effects?

Hoo boy. A lot has been written about this subject, too much of it overgeneralizing, using really broad strokes or just working from unstated, questionable assumptions. Lest I go full rant mode, I’ll try to avoid editorializing for the most part. A lot of what you will read on this topic, be it in textbooks, on the web, in forums (remember forums?) or papers, is either questionable or just plain bad.

The most important news first: if you’re using either the standard DIT or DIF factorizations which are in-place except for that initial (or final, depending) permutation step, then as long as you’re working on a contiguous array and the transform size is small enough so the coefficients fit in about half your L1 data cache, you’re pretty much going to be fine no matter what. Your data comfortably fits in the L1D cache, will be repeatedly accessed and kept hot, and nothing bad is going to happen.

L1D caches on CPUs these days are usually 32KB or larger, and that’s been true for a good while (it was already true on desktops 15 years ago, for example; these days mobile devices have also caught up in that regard). That’s plenty for a size-2048 complex FFT with 32-bit floats. That covers a lot of the sizes you’re going to see in practice! Beyond that, here’s some of the statements I’ve seen, and what I think are the important caveats to all of them:

“DIF has better cache behavior than DIT”: this one’s a bit tricky. With power-of-2 sizes, all FFT algorithms are going to do a lot of memory accesses with exact power-of-2 strides between them, and that is a recipe for running into “fun” memory subsystem problems. As discussed before, both DIF and DIT run the same steps, just in reverse order (relative to each other). What is true is that a DIT FFT has a mandatory permutation step in front, whereas DIF can avoid it entirely if reordered coefficients are acceptable. Also, in-place bit-reverse permutations are a tricky mess so DIT FFTs are more likely to work out-of-place and do the permutation as part of a copy – and it is easy to do that permutation in a way that has exceedingly poor memory access patterns. So, yes, a reordered-coefficient DIF FFT is easy to implement fully in-place and efficient, and it can avoid the reordering step with its memory access troubles entirely. A DIT FFT is more likely to be out-of-place and do extra copies, and if the reordering isn’t done carefully, it can have bad cache performance – this is a solvable problem, though.2 And DIF has its own problems which I already went into earlier.

“Iterative FFTs are bad for the cache, use a recursive decomposition”: iterative FFTs will generally have an outer loop over the pass index (which determines current transform size), and a nested loop inside it that does all sub-FFTs of that size (either as function calls or just have that one loop handle all of them). Recursive approaches use, well, recursion; a size-N transform will internally call into smaller transforms for part of the data, and then either pre- or post-process them. That means that for transforms too large to fit into the L1D (or L2, or L3, …) cache, a straight iterative approach will stream the entire dataset into and out of the cache once per pass, while a recursion subdivision will eventually narrow down to a transform size that fits in the cache, and then all the sub-transforms on that subset will have the data warm in the data cache. Don’t get me wrong, this is good, but the straight linear loop structure of iterative FFTs is also good, and one recursive call per length-8 or length-16 or whatever your “leaf” transform size FFT is isn’t awesome. Mostly, I disagree with the unspoken assumption that they’re mutually exclusive. Iterative FFTs are great at handling the tons of small sub-FFTs near the leaf levels of the decomposition. Recursive FFTs are cache-oblivious and automatically “right-size” themselves. So just use both! Recurse down to a medium transform size that you’re sure is going to fit in the cache (I’d say somewhere from 256 to 1024 complex elements, depending on just how low-spec your smallest target is), and you can let the iterative approach make short work of all the tiny sub-FFTs while still getting all the nice automatic adaptation that recursive decomposition provides for larger sizes.

Over-generalization from old HW: this is in some textbooks, papers and forum posts. Especially late-80s and early-90s RISC workstations tended to have small, direct-mapped data caches that had completely pathological behavior on memory accesses with the right (or rather wrong) power-of-2 strides, and FFTs tend to hit all of these potholes one by one. People learned some very painful lessons in those days when they got burned quite badly by this, but set-associative caches with multiple ways are not a luxury feature these days, and they have substantially defused this particular problem. It’s still possible to hit pathological behavior, but you have to try a bit harder, and there’s decent ways to work around it.

“You should use exclusively cache-oblivious approaches/the four-step algorithm/insert other plug here” – use whatever algorithms you want, but I would recommend doing your own research and not just blindly following recommendations that some rando typed into a forum in 2002. There’s a lot of algorithms out there but many of them are quite old and written in a very different situation without floating-point hardware, with slow multipliers, little or no access to instruction-level parallelism, and very different relative costs of compute to memory. The good news is that large FFTs used to be a big-iron HPC problem and these days, even legitimately large FFTs over many million data points can be computed in a fraction of a second on a battery-powered cellphone. It’s pretty hard these days to get into a problem size for FFTs that actually runs into some serious hardware limits. The bad news is that a lot of what’s been written about FFTs dates back to times when this was a much bigger concern and hasn’t been updated since. This post is, in part, me writing up what I still remember from about 4 weeks worth of reading the literature back in 2015, trying to give you the bits that I think are still useful and applicable and intentionally avoiding all the material that is, I think, past its sell-by date.

Finally, a disclaimer: I have personally written and used FFT kernels for small to medium problem sizes (everything still fits comfortably in a 256k L2 cache). I have never actually had reason to use FFTs with transform sizes above 32768 in an application, nor have I optimized or debugged an implementation that provides them, so anything that concerns large transforms specifically (which shifts us even more into a memory-bound scenario) is not something I can comment on.

Stockham’s auto-sorter

There’s a Cooley-Tukey variant usually credited to Stockham that avoids the initial (or final, for DIF) bit-reversal, digit-reversal or whatever permutation by instead performing the reordering incrementally, interleaved with the processing passes. This was popular on vector machines and is sometimes suggested for SIMD.

It works just fine, but I’m not a big fan, for two reasons. The first is that Stockham gives up on any in-place processing entirely and instead “ping-pongs” between two buffers on every pass, which means it has about twice the active working set in the cache as an in-place algorithm (and more traffic in upstream caches or memory when it doesn’t fit anymore). The second is that this picks up extra data-shuffling work in every single pass; a batched reorder can be combined with other input processing (for example to turn data into an internal format), maybe an initial radix-2 or radix-4 pass, and also be done in a way that avoids thrashing the cache unnecessarily, because it’s a data-movement pass at heart and can be designed with that in mind. Then the actual workhorse FFT passes can focus on the math and memory access and don’t have data shuffling in the mix as well.

SIMD

So far, I’ve discussed everything pretending we’re working with scalars. But FFTs (power-of-2 sized ones, anyway) are generally pretty easy to vectorize.

If you have the data given as separate arrays for real and imaginary parts, it can be almost as simple as replacing every scalar load/store with a vector load/store and every scalar arithmetic operation with the corresponding vector arithmetic operation. You might need to handle the first few (or last few, depending) passes specially, where you’re working first with pairs of adjacent elements and then groups of 2, 4 etc., but once the distance between elements you’re processing becomes larger than your vector width (which is what most of your passes will be working with), it’s straightforward.

It does mean that your initial (or final) few passes end up special. However, these are the passes right next to the input/output permutation, and combining the early special-case passes with the data permutation tends to solve problems in both: the data permutation usually involves some kind of SIMD transpose, and moving some of the math across that transpose boundary separates what would otherwise be intra-vector “horizontal” dependencies out into multiple vectors. As for the data movement, that is ordinarily just loads, stores and shuffling; having some arithmetic in there helps the instruction mix and makes it less likely to be bottlenecked purely on memory operations or shuffling.

The pure deinterleaved format is probably the easiest to understand conceptually, but it’s not very commonly used. Fully interleaved (all real, imaginary value pairs) is the other extreme. That’s a very common format for complex-valued input data, and it usually works reasonably well to do the FFT in that form directly – for complex multiplications specifically, you do tend to need some amount of shuffling and special instructions for complex multiplication (“addsub” and “fused multiply with alternating subtract/add”, most notoriously), but these instructions are often available. That said certain things which can just be dealt with implicitly in a fully deinterleaved form (such as swapping real and imaginary parts, or computing a complex conjugate) turn into actual instructions when working interleaved, so there’s a lot of small random fix-ups involved; again, I’m not a big fan.

Yet another option is to keep the data interleaved, but at the granularity of full vectors (or some larger granularity that is a multiple of the vector size). For example, for 8-wide SIMD, you might store real parts of elements 0-7, then imaginary parts of elements 0-7, then real parts of elements 8-15, and so forth. This is still addressed by a single base pointer and has nice data access locality characteristics, but is just as (if not more) convenient as fully deinterleaved for SIMD code. The trade-off here is that your vector width (or other interleaving granule size) now becomes part of your interface; I like this format inside FFT kernels, less so as part of their public interface. (I’ve been experimenting a bit with this though. Maybe I’ll write a follow-up on this in particular, some day.)

Whenever you don’t work internally with the format you have in your API, there’s going to be some translation at the ends. One end already has the input permutation and can generally rewrite the complex number format into whatever you’d prefer with no trouble. Then for the output you need to translate it back; ideally this is fused into the last pass, but one problem that does crop up with these SIMD variations is that you already have 2 or 3 versions of everything (data permutation combined with initial passes, then a standalone radix-4 or whatever pass, and adding another “final pass” with a different permutation on output can get to be a bit of a chore. It’s up to you how many variations you want to generate (or hand-write if you’re into that sort of thing); personally I tend to avoid having too many pre-built combinations and would rather live with a bit of inefficiency at the interface boundaries than dealing with the headache that is a combinatorial explosion of algorithm variants.

Non-pow4 sizes, inverses, real-input or real-output FFTs, trig transforms, etc.

Non-pow4 first: anything based on radix-4 (or radix-22) factorizations needs to deal with the fact that half of all powers of 2 aren’t powers of 4. For those, you’re going to need one radix-2 (or radix-8, or radix-23) pass in there. There’s not much more to it than that, but it seemed worth writing up just to be explicit.

IFFTs are usually the same transform, just with the twiddles conjugated (scaling with 1/N or similar is often left to the caller). There’s other ways to reduce IFFTs to regular FFTs by swapping around coeffs etc. but I’d advise against it; depending on how your twiddles are stored, you might be able to produce the conjugate twiddles easily, but it’s also not too bad to just have a second version of the FFT kernel around for inverse transforms.

Real-input/real-output, well, there’s multiple ways to handle that, but the easiest is to “bitcast” the real input array into a size-N/2 interleaved complex number array, compute a complex size-N/2 FFT (which due to linearity gives FFT[even_reals] + i*FFT[odd_reals]), and then do a final radix-2 pass to combine the even/odd halves and compute the actual outputs. The inverse of that does, well, the exact same steps in reverse. I’ll not go into the details here but this is a classic algorithm. It’s not the most efficient but it’s reasonable and easy.

Other trigonometric transforms can also be reduced to inner FFTs with pre- and post-passes like that. If you have a decent FFT around, I’ve found this generally preferable to using specialized O(N log2(N)) algorithms specific to these transforms; trigonometric transforms generally have such direct algorithms, but their structure is almost always less regular than a FFT, so reduction to whatever-size FFT doesn’t have a lot of wasted work and then pre-/post-passes is usually my go-to.

Fixed point

I haven’t mentioned it at all. This omission is by design, not out of neglect. Floating-point and SIMD floating-point is widely available these days, and fixed-point signal processing is becoming increasingly niche. Some of the trade-offs are different with integers (adds are cheaper; DSPs usually have integer multiply-accumulate but many integer or SIMD integer instruction sets do not) and all I’ll say on the topic is that the last time I’ve had to deal with a fixed-point FFT is about 15 years ago, and I haven’t missed it. Good luck to everyone who still has to deal with it on a regular basis, but I’m just happy that’s not me anymore.

Twiddle storage

Twiddle factors are, essentially, a sine/cosine table. There is definitely a temptation to try and get cute here – maybe try to save some memory by only storing part of the values and inferring the rest by symmetry, or by realizing that the twiddles for FFTs of size N/2 are just all the even-numbered elements of the twiddles for size N here, and so forth.

In short, resist the temptation to do anything fancy with them. You want them in a form that is easy to load and doesn’t require a lot of bookkeeping, and trying to exploit symmetries too much to save on your sine table storage will probably not save you all that much memory but very well might make your hair go prematurely gray.

There’s an exemption for really straightforward stuff. For example, if you’re working in deinterleaved form, you need imaginary parts of twiddles (=sines) and real parts of twiddles (=cosines) both. Since cos(x) = sin(pi/2 + x), you can just store a sine table and use it for both, starting the cosine lookups a quarter of the way in. That is simple enough to not cause trouble (in my experience anyway), and it will also helps your data cache footprint, so it’s actually worthwhile.

Twiddles are one aspect where my lack of experience with large transform sizes shows: I expect that for large enough sizes, once your working data set and your twiddles are both streaming through not staying in the relevant cache levels, the extra traffic is an issue and you’d much rather spend some more arithmetic in the FFT kernel to compute twiddles (incrementally or otherwise) than eat avoidable extra loads from memory. However, I’ve not been in a situation where I’ve had to care, so I don’t really know one way or the other.

Takeaways

This is a long post and I’ve been at it all day, but I’ll try to give you the summary of my personal recommendations:

  • Cooley-Tukey DIT is, on balance, probably your best bet.
  • Nonstandard-coefficient-order FFTs are cute but probably more trouble than they’re worth these days.
  • Do use FMAs where available (which is quite common these days), and if you design any FFT code now, plan with them in mind. Also, FMAs are the great equalizer as far as the cost of different factorizations is concerned.
  • Don’t sweat old-school operation counts, they’re optimizing for the wrong things these days.
  • Prefer radix-4 (or radix-22), less so for lower number of multiplies, and more because having fewer passes and fewer loads/stores is good and memory operation count is not negligible for this.
  • You don’t have to pick strictly between iterative and recursive approaches, and you can get the best of both worlds by recursing for larger to medium sizes, and handling the small sub-FFTs iteratively.
  • Do worry about number of memory accesses and your access patterns, especially for things like digit-reversal permutations.
  • SIMD: yes please; Stockham: probably not.
  • Data layout-wise, it’s really up to you.
  • Twiddles: Keep it simple, stupid!

And I think that’s it for today!

Footnotes

[1] You can also get a “proper” radix-4 kernel with 2 complex twiddle loads instead of 4. The high concept is that instead of twiddling with ωNk, ωN2k and ωN3k, you can substitute the 3k twiddle with -k – which is just the complex conjugate of the first twiddle, so that doesn’t need a separate load. This substitution works fine but does affect the coefficient permutation (are you starting to see a pattern emerge?); see e.g. Bouguezel, Ahmad, Swamy, “Improved radix-4 and radix-8 FFT algorithms”, 2004 IEEE International Symposium on Circuits and Systems. The same trick can be used on split-radix factorizations to yield what is now usually called the “conjugate-pair split-radix factorization”. That said, as I keep repeating, with a FMA-counting cost model, the arithmetic operation difference between a radix-4 and radix-2 decomposition doesn’t really materialize, and radix-22 comes out looking pretty good.
[2] Not to leave you hanging with a mysterious statement, one possible approach is in Blake, Witten, Cree, “The Fastest Fourier Transform in the South”, IEEE Transactions on Signal Processing, Vol. 61 No. 19, Oct 2013 – in short, be careful about how you implement the permutation, be sure to grab some adjacent data while you’re there, and also, once you have that data in registers, might as well do an initial radix-2 or radix-4 pass along with the reordering.

Notes on FFTs: for users

I was just looking over SIMD FFT code I wrote in 2015 for Bink Audio and Miles Sound System to replace the old, all-scalar implementation we had been using since (presumably) the late 90s. That in turn reminded me of writing that code originally, reading a lot of papers on the subject, and how I wished at the time that there were a write-up going into what mattered (and why) and what didn’t. It’s 8 years too late to save myself some time, but it might still be handy for you!

I’ll try to cover everything interesting to users of FFTs first. I’ll do a follow-up post that is only really interesting for FFT implementers. All of this assumes some familiarity with DSP concepts.

Big picture: what do you need that FFT for, and what kind?

FFTs get used for many things, but there’s 3 main classes of use cases that all have somewhat different requirements:

  1. Actual spectrum analysis. You have some signal that you want to see the discrete Fourier spectrum of. Generally uses windowed FFTs (aka Short-Time Fourier Transforms, STFTs). Typically wants FFT coefficients in the “proper” (natural) order. Large transform sizes (let’s draw an arbitrary line in the sand and call any transform size above 4096 “large”) are common. All-real inputs are more common by far, and usually just one direction of transform (let’s call it “forward”) is used.
  2. Convolution or polynomial multiplication. Here, you’re just using the FFT as an intermediate step in a convolution algorithm. For convolution, real inputs are more common by far. For polynomials, you see complex a bit more often, but still not that common. If you don’t want cyclical convolution (and you usually don’t), not only are your inputs and outputs often real, the second half of FFT inputs is usually all zeroes and the second half of IFFT outputs is usually discarded; this can be used to speed up the transform if you care to exploit it. When used for convolutions, you don’t typically care whether coefficients are in the right order or not (for convolution, all you’re doing is element-wise multiplications in frequency space; reordered coefficients cause no real problems.) Can use quite large transforms but often doesn’t, even for uses cases like fairly long convolution reverbs (see below). Uses both forward and inverse transforms, in about equal proportion if one side of the convolution (e.g. a filter kernel) is fixed and pre-processed in advance; twice as many forward as inverse transforms if not.
  3. As building block in other DSP algorithms. Many other trigonometric transforms used in practice, e.g. DCTs and MDCTs, reduce to FFTs with some pre- or post-processing, sometimes both. The inner FFT here is usually a complex-valued one. Usually wants coefficients in order. It’s rare to see large transform sizes for this use case.

FFTs “want” to work in the complex domain and this is the most natural form for them. Real-valued signals can be turned into complex-valued signals by just adding all-0 imaginary parts, and then you can use a complex FFT, however this wastes memory and work; a “proper” real-input FFT doesn’t require blowing up the input into complex numbers, can compute just half the spectrum (the other half is implied by symmetry), and as a result uses about half the memory and is nearly twice as fast. One caveat: for a FFT that works on complex numbers, the difference between a “forward” and “inverse” FFT is very small (literally a sign flip), and there is not any universal agreement on which sign the “forward” transform should be (sigh), but it also doesn’t actually matter for many applications. Once you do a real FFT and corresponding IFFT, well the real FFT takes N real-valued inputs and outputs N/2 complex numbers (really N/2-1 complex numbers and 2 real numbers, but the two reals are frequently packed together into one complex value) and the real IFFT takes those N/2 complex numbers and outputs N reals, so they are not interchangeable and you need to use the right one.

Big picture: transform sizes

In principle, any size FFT can be computed efficiently, in the sense that you get an O(N log N) asymptotic complexity bound no matter what. In practice, the practical FFT algorithms that you actually want to use in most cases need the transform size to be of a certain form to be efficient, and the arbitrary-size FFT algorithms work by reducing a size-N FFT to a different FFT of a nearby “convenient” size, again with pre-/post-processing.

The most important class of algorithms in practice (Cooley-Tukey and descendants) wants highly composite numbers with few different prime factors; powers of two (just factors of 2) are best. Having a single factor of 3 or 5 in there is also often supported, which gives you more convenient spacing of transform sizes: for example, if you just need slightly more than 1024 elements, you don’t have to jump all the way to 2048 but might instead use 5*256 = 1280.

If possible, and there’s no strong reasons to do anything else, stick with just powers of 2. If that’s not an option, try to make do with exactly one factor of 3 or 5 in there. Any other “odd” sizes will likely end up getting reduced to one of those sizes in the end anyhow.

Big picture: libraries, interfaces, what to call?

If you’re a user of FFTs, and you’re interested mostly in real-valued data (you often are), check if your FFT implementation of choice supports real-valued input or has inverse transforms that produce real outputs from half a spectrum, instead of manually expanding real values to complex values with zero imaginary part. That’s a fairly easy near-2x speed-up, usually.

FFTs usually need some description of the transform on the side, depending on the transform type and size. This is a data structure that usually contains twiddle factors (which effectively means a sine/cosine table, sometimes in funny order) and some auxiliary data (tables or otherwise) used during coefficient reordering.

Some FFT libraries only support a small, restricted set of transform sizes (usually just certain powers of 2), and these can usually avoid a “setup” structure of “FFT plan” or similar entirely. It’s a simpler API but more restricted.

There’s in-place vs. out-of-place. This is really more of an implementation detail, but sometimes it matters. Some approaches lend themselves to a fully in-place implementation (where the input and output are the same array), others don’t. “Real” in-place needs less memory but has more complicated logistics and sometimes inefficiencies. Asa a result, often in-place functions exist but are just an API entry point you can call that internally allocates workspace memory, works out-of-place, then copies the results back. Sometimes you also have API functions that take an optional pointer to workspace memory; if those exist, that usually means they need it and if you don’t give them workspace memory, they’ll allocate it themselves and free it once they’re done, so allocating the workspace once and passing it in can save a lot of allocation churn if you’re doing a lot of transforms. In short, there’s a lot of variations here, and this part is often poorly documented, but this is something you’ll want to look into because eating a lot of unnecessary allocations and copies in this type of DSP functionality can contribute a significant cost.

Then there’s complex number format. There’s two main contenders: interleaved and split. Interleaved has complex numbers stored as pairs of real part and imaginary part right next to each other. This is the format most commonly used for complex number data types (like C99 built-in complex type or C++ std::complex). FFT algorithms can work in this form directly, and some do, but it can be advantageous to split the data into two arrays, one containing just the real parts and one just the imaginary parts of each value. In general, interleaved form is pretty much always supported, but if the library supports the split format, that is usually the faster one, and it might be more convenient for your code as well, so it could be worth looking into.

Finally, permuted coefficients vs. not. If you only want to use a FFT for convolutions, you might not care about the order coefficients are in. Some algorithms can produce a FFT with coefficients in a non-standard order at a lower cost. You don’t see them that much anymore for reasons I’ll get into later, but if you use a FFT library that supports this option, and you only care about convolutions, it might be interesting for you.

With that all said, I know the question of library recommendations is going to come up, so for what it’s worth: if you’re on an Apple platform, vDSP ships as part of the Accelerate framework, is always present, and looks fine to me (although I haven’t tried the DFT/FFT functionality in particular), so there’s at least a few platforms where there’s an obvious choice.

There’s always FFTW, which I have last used in the early 2010s and don’t have fond memories of. It’s a big library with external dependencies, not easy to build or use, a ton of code (around half a megabyte last I checked), and for a while anyway it seemed to lag behind quite a bit in SIMD instruction set support (that might have been fixed by now, or maybe that was just a configuration problem on my part). FFTW does have pretty much everything once you get over that integration hump, though.

Intel has FFT functionality as part of their MKL (Math Kernel Library). I haven’t used it for FFTs but MKL in general tends to offer excellent performance on Intel CPUs. They detect CPUs by model number and mysteriously fall back to their slowest unoptimized kernels on any x86 device that doesn’t report as Intel (unless you override it with an environment variable), which is a reason for me not to use it on general principle but YMMV. MKL is also fairly big (and of course x86-specific) so if distribution size is a problem for you and you just want a FFT this is probably not what you want.

The one that’s easiest for me to recommend is Mark Borgerding’s KISS FFT. It’s not the fastest FFT around but it’s small, simple, and easy to build and integrate. If you find yourself bottlenecked by FFT speed you’ll want to look into something more advanced but if you just need a FFT-type algorithm to replace a calculation that would otherwise be O(N2) with something that’s O(N log N) and doesn’t have a huge amount of wasted work, this fits the bill. You’re losing a constant factor relative to what you could be getting with proper vectorization so be advised that there’s faster options when that becomes a problem.

Big picture: convolutions

Convolutions are a fundamental signal processing operation for filtering (and also for effects such as convolution reverb in audio processing), and by what’s usually called the Convolution Theorem, time-domain convolution (the operation we’re interested in) corresponds to frequency-domain point-wise multiplication. (Vice versa, frequency-domain convolution also corresponds to time-domain point-wise multiplication, which is the math behind how windowing affects the FFT).

The theory is this: convolving a length-N signal X by a length-M signal Y directly takes N*M multiply-adds (the first one is just a multiply, but close enough). If we take the first signal to be our data and the second to be our filter, that means convolving by Y takes M operations per sample. (It stands to reason that we have to some work for every output sample on average.) This gets bad if Y is long. In the aforementioned convolution reverb, Y could be a 5-second sample at 48kHz sampling rate. That’s 240,000 multiplies per sample, which is what we in the business refer to as “not ideal”. The convolution theorem tells us we can do a Fourier transform, then one multiply per sample in the frequency domain, which is 4 real multiplies, and finally we do a inverse Fourier transform back. Fourier transforms are (mumble mumble) something something O(N log N), or I guess in our case O(M log M), so maybe this is O(log M) arithmetic operations per sample, and log2(M) is 18. Even assuming some fairly high constant factors here we’d expect this to work out to ballpark a few hundreds of operations per sample, roughly a 3 orders of magnitude speed-up.

Only problem being: I glossed over all the details of how exactly that actually works. This is not the subject of this post, so I’ll keep it brief, but there’s a bunch of important techniques here to actually get this to work, and this is another one of those things that are often glossed over, so I want to at least put all the search keywords in here so you know what to look for if you actually want (or need) to implement this.

First, we want “discrete convolution”, which is the convolution operation that belongs with what’s usually called the DTFT (Discrete-Time Fourier Transform). The FFT is a realization of the DFT (Discrete Fourier Transform) not DTFT, so it’s a different type of transform. DFT is like DTFT but for periodic signals, and it also has a convolution operation it’s associated with for its version of the Convolution Theorem, but the convolution we get with the DFT/FFT is what’s called cyclical convolution – it wraps around the signal (which is assumed to be periodic). If we want the no-wraparound (discrete) version, we can get it from the cyclical version by padding the signal with extra zeroes and using a long enough transform length that no actual wrap-around takes place. For our length-N signal, length-M filter setup, we can pad both out to be N+M-1 samples long, and then use cyclical convolution (which is just point-wise multiplication of DFT/FFT coefficients) to get our discrete convolution result.

That works, but if we did that directly, we’d need the whole input signal in advance, and we also get O((N+M) log2(N+M)) operations for the whole job, so the amount of operations per output sample has a dependency on the input length. That’s not only worse than advertised, it also fundamentally doesn’t work in a setting where you get the signal in real-time, because N (the eventual length of your input signal) is not something you know while it’s happening, and you don’t know all the future samples yet. So the next step is to realize that no, of course you don’t need full knowledge of all samples past and future to do a convolution with a finite-length filter. What you can do is cut your source signal into small pieces, say also of size M, filter those individually, and then combine the results. This works just fine and the search term for the main approaches are “overlap-save” and “overlap-add” method. This gets the dependency on N out of there entirely, and we actually do have O(log(M)) asymptotic operation count per output sample now. Nice.

This is now good enough for offline editing or batch processing, but still doesn’t work for something like real-time audio, because we work in units of big blocks, and for our M around 240,000, we need blocks of roughly that many input samples to do the job. That’s 5 seconds of audio. If we need to wait until we have more than 5 seconds of audio to do another batch, that means we can’t do it without stuttering unless we introduce 5 seconds of latency. Once again we’re off by several orders of magnitude – we might be willing to live with 20-30ms of latency, although for Pro Audio a real 5ms or thereabouts would actually be nice. Direct convolution doesn’t have this delay. As soon as we have a new input sample, we can compute the corresponding output sample – although it might take us a lot of operations to do so. FFT-based convolution amortizes the effort greatly, but introduces unacceptably large latency when used directly with large kernels. So what do we do?

Well, several tricks. First, we can use the same “cut into pieces” trick we used on the input signal on the filter as well; for example, we might cut it into pieces 1024 samples long (duration 21.3ms at 48kHz sampling rate), which is getting towards the upper end of what we’re willing to tolerate. Then we can use overlap-add/overlap-save style techniques on these pieces to build our convolution with a 240,000-sample signal out of 235 convolutions by a length-1024 signal (this is effectively just another direct convolution, just between corresponding coefficients of 1024-sample blocks). We don’t need to redo all these every time, every 1024 samples we transform one new block of 1024 and keep the older ones around – the search term here is “Frequency-Domain Delay Line”. That adds a few hundred multiply-adds per sample to our tally, but a few hundred extra we can live with.

Next, that 1024-sample latency is also not actually necessary. It’s true it takes 21.3ms to complete a 1024-sample block, but that doesn’t mean you have to twiddle your thumbs while that’s going. The problem is really only with the “beginning” parts of Y – which get multiplied by the most recent samples from X. Once you’re 1024 samples into Y, those values get multiplied by “old” samples in X (at least 1024 samples old), so there’s not nearly as much of a rush. So we can use direct convolution with just the initial part of Y to keep our latency down (it can now again go about as low as direct convolution can), and then use optimized FFT-based convolution for the contribution from more distant parts of the signal. That contribution from the direct convolution is now contributing something like a 1024 multiply-adds per sample, which is starting to hurt, but at least we can hit low latency now.

Finally, the actual solution in practice: this realization that the big blocks are only a problem in terms of latency for recent operation is true in general, and lets us use this whole process recursively. For example, in our initial 1024 samples, the first 64 samples (1.3ms worth at 48kHz) are latency-critical and probably need direct convolution. Past that point, we might switch to length-64 blocks with FFT-based convolution. At some point past 1024 samples into the “past”, we might switch over to length-1024 blocks. And for further back-in-time samples, we might switch to length-16384 blocks or so (our 5s impulse response fits into 15 16,384-sample blocks, so the 15 resulting overlap-adds/overlap-saves we don’t worry about much, but if it were longer we might do this a few more time with increasing sizes). There’s some subtlety about what block sizes to use and when to switch (you might think it would be good to use pow2 sizes and greedily double the block size as soon as we can every time, but turns out that’s not work-optimal), and there’s also other concerns such as how much processing time the various transform sizes take and when to schedule them etc., but this is the actual “good” convolution algorithm for real-time applications, and it’s called “Non-Uniform Partitioned Block Convolution”.

All right. This was just a very brief, very rough overview, and I would absolutely not recommend trying to fill in the details from my few paragraphs, but especially with convolutions, I’ve several times seen code that really does a 2D FFT on a whole image or similar to convolve with a 50×50 pixel kernel, and anecdotally it seems that outside the domain of real-time audio, the better techniques are not nearly as well-known as they ought to be, hence this extended detour into convolution algorithms.

This concludes my notes on the subjects that might be interesting for FFT users. Next part will be for FFT implementers only, pretty much.

What’s that magic computation in stb__RefineBlock?

Back in 2007 I wrote my DXT1/5 (aka BC1/3) encoder rygdxt, originally for “fr-041: debris” (so it was size-constrained). A bit later I put up the source and Sean Barrett adapted it into “stb_dxt”, which is probably the form that most know it in today.

It’s a simple BC1 encoder that gives decent quality, the underlying algorithm is reasonably fast (fast enough to say bake textures you produce once per session in a game from a character creator, say, which is one of the cases I’ve ended up using it in in a professional context), and is easy to integrate.

The basic algorithm uses the same primitives most BC1 encoders use (I’ll assume in the following you know how BC1 works): compute the average and covariance matrix of the block of pixels, compute the principal component of the covariance to get an initial guess for what direction the vector between the two endpoints should point in. Then we project all pixel values onto that vector to find the min/max support points in that direction as the initial seed endpoints (which determines the initial palette), assign each pixel the palette entry closest to it, and do some iterative refinement of the whole thing.

rygdxt/stb_dxt only uses the 4-color mode. It did not bother with the 3-color + 1-bit transparency mode. It’s much more niche, increases the search space appreciably and complicates the solver, and is rarely useful. The original code was written for 64k intros and the like and this was an easy small potential win to give up on to save on code size.

Nothing in there was invented by me, but at the time I wrote it anyway, DXT1/BC1 encoders (at least the ones I was looking at) were doing some of these steps in a way that was more complicated than necessary. For example, one thing I vividly recall is that one encoder I was looking at at the time (I believe it was Squish?) was determining the principal component of the covariance matrix by forming the characteristic polynomial (which for a 3×3 matrix is cubic), using Cardano’s formula to find the roots of the polynomial to determine eigenvalues, and finally using Gaussian Elimination (I think it was) to find vectors spanning the nullspace of the covariance matrix minus the eigenvalue times the identity. That’s a very “undergrad Linear Algebra homework” way of attacking the problem; it works, but it’s complicated with a fair amount of tricky code and numerical issues to wrestle with.

rygdxt, with its focus on size, goes for a much simpler approach: use power iteration. Which is to say, pick a start vector (the only tricky bit), and then iterate multiplying covariance matrix by that vector and normalizing the result a few times. This gives you a PCA vector directly. In practice, 3-6 iterations usually sufficient to get as good a direction vector as makes sense for BC1 encoding (stb_dxt uses 4), and the whole thing fits in a handful lines of code with absolutely nothing subtle or numerically tricky going on. I don’t claim to have invented this, but I will say that before rygdxt I saw a bunch of encoders futzing around with more complicated and messier approaches like Squish, or even bothering with computing a general 3×3 (or even NxN) eigendecomposition, and these days power iteration seems to be the go-to, so if nothing else I think rygdxt helped popularize a much simpler technique in that space.

Another small linear algebra trick in there is with the color matching. We have a 4-entry palette with colors that lie (approximately, because everything is quantized to an 8-bit integer grid) on a line segment through RGB space. The brute-force way to find the best match is to compute the 4 squared distances between the target pixel value and each of the 4 palette entries. This is not necessarily a bad way to do it (especially if you use narrow data types and SIMD instructions, because the dataflow is very simple and regular), but it is essentially computing four 4-element dot products per pixel. What rygdxt/stb_dxt uses instead is the fact that if it were a perfect line segment in a continuous space, we could just use an orthogonal projection to find the nearest point on the line, which with appropriate normalization boils down to a single dot product per pixel. In that continuous simplification, the two interpolated colors sit exactly 1/3rd and 2/3rds along the way between the two endpoints. However working on the aforementioned 8-bit integer grid means that the interpolated colors can sometimes be noticeably off from their ideal placement, especially when the two endpoints are close together. What rygdxt therefore does is compute where the actual interpolated 1/3rd-of-the-way and 2/3rds-of-the-way colors land on the line (two more dot products), and then we can do our single dot product with the line direction and use the values we computed earlier to figure out which of these 4 colors is closest in the 1D space along the line, which is just a few comparisons and can be done branchlessly if desired.

The result doesn’t always match with the distances code using the brute-force solution would get, but it’s very close in practice, and reducing the computations by a factor of nearly four sped up the BC1 encoding process nicely (old 2008 evaluation by my now-colleague Charles here).

That leaves us with the actual subject of this blog post, the iterative refinement logic! I just answered an email by someone asking for an explanation of what that code does and why, so here goes.

The refinement function

The code in question is here.

Ultimately, what rygdxt/stb_dxt does to refine the results is linear least-squares minimization. (Another idea that’s not mine, this one I definitely got from Squish). We’re holding the DXT indices (interpolation weights) constant and solving for optimal endpoints in a least-squares sense. In each of the RGB color channels, the i’th target pixel is approximated by a linear interpolation

\displaystyle (1-w_i) x_0 + w_i x_1

where x0, x1 are the endpoints we’re solving for and wi is one of {0, 1/3, 2/3, 3/3} depending on which of the 4 indices is used for that pixel. Writing that out for the whole block in say the red channel turns into an overdetermined system of 16 linear equations

\displaystyle \begin{pmatrix} 1-w_0 & w_0 \\ 1-w_1 & w_1 \\ \vdots & \vdots \\ 1-w_{15} & w_{15} \end{pmatrix} \begin{pmatrix} x_{0r} \\ x_{1r} \end{pmatrix} = \begin{pmatrix} r_0 \\ r_1 \\ \vdots \\ r_{15} \end{pmatrix}

to be solved for x0r, x1r (the first/second endpoint’s R value).

Let’s call that tall and skinny matrix on the left A; x = (x_{0r}, x_{1r})^T is the 2D column vector we’re solving for, and the RHS vector of the pixel r values we can just call “r’.

That means our problem is to minimize ||Ax - r||_2 (2-norm), or equivalently ||Ax - r||^2.

||Ax-r||^2 is quadratic; the way you find its minimum is by computing the derivative and equating it to 0, which leads us to what’s called the Normal Equations, which for this problem are

\displaystyle A^T A x - A^T r = 0 \\ \Leftrightarrow A^T A x = A^T r

A is a 16×2 matrix, so ATA is a tiny symmetric matrix (2×2) and contains the dot products of the columns of A with each other.

We have 3 color channels, not just r but g and b as well. That means we have 3 copies of the same linear system with 3 different right-hand sides, or equivalently we can view the whole thing as a matrix equation with a 3-wide right-hand side. Either way, all 3 systems have the same A matrix, the only thing that differs is the right-hand sides.

We accumulate A^T r (and g and b as well) in the first pass, and also compute the entries of A^T A. To solve the system above, because we just have a 2×2 matrix, we can use Cramer’s rule to solve it directly, no need to mess around with factorizations or Gaussian Elimination or similar.

That’s the basic idea. The RefineBlock function uses two more tricks:

  1. instead of the weights being {0, 1/3, 2/3, 3/3}, we multiply them by 3 and also scale the RHS by 3 (the latter, we never actually do explicitly). Getting the extra scaling is essentially free during  the linear system solve, especially since we already need to do some scaling per-channel anyway, because the R/B endpoint values we solve for are in [0,31] (instead of [0,255] for the input pixel values), and the G values are in [0,63]. Scaling everything by 3 means there are no fractions involved in the computation, it’s all small integers, which will be useful for the second trick. It also means that when we compute the determinant of the 2×2 system for Cramer’s rule, it’s an integer computation, so we don’t have to worry about near-zero values and the like. (Although in this case, it’s not too hard to prove that A is singular, i.e. has determinant 0, if and only if all the wi are the same, which is easy enough to detect up front.)
  2. now our weights wi (matrix entries in A) are all in {0,1,2,3}. The three entries in A^T A sum, respectively (note we scaled by 3, so 1-w_i turns into 3-w_i): (3-w_i)^2, (3-w_i) w_i, and w_i^2. Note that all three values we’re summing only depend on wi, and wi is one of 4 possible values (depending on the index), so we can just precompute all of them. Also note they’re all quite small: (3-w_i)^2 and w_i^2 are at most 9, and (3-w_i) w_i is either 0 or 2, so they all comfortably fit in 4 bits. We’re summing 16 of these values (one per pixel in the block), and they’re all positive. That means the final sums fit into 8 bits no problem. Therefore the actual table we have packs the 3 values into one integer:
    \displaystyle ((3-w_i)^2 \ll 16) + ((3-w_i) w_i \ll 8) + (w_i)^2.
    We sum that one integer per pixel. All of the individual sums are guaranteed to be <256 so we can extract the corresponding bits after the accumulation loop. That means the cost of computing the entries of A^T A becomes quite cheap (a single 4-entry table lookup and 32-bit integer add per pixel), and the bulk of the work in that first loop is just computing the right-hand sides.

And that’s about it. I know that approach was later also used by NVidia Texture Tools, beyond that I have no idea of its reach, but if it’s handy for someone, cool!

On AlphaTensor’s new matrix multiplication algorithms

Two acquaintances independently asked about this today, so it seems worth a write-up: recently (as of this writing), DeepMind published a new paper about a new practical fast matrix multiplication algorithm, along with a press release that is a bit misleading and seems to have led to some confusion already.

In particular, while the paper does introduce many new “practical” (not scare quotes, I’m using the word in a specific sense here that I’ll make more precise in a minute) fast small-matrix multiplication algorithms, that does not mean that you should replace your small-matrix library routines for 3×3 or 4×4 matrix multiplication with them. That’s not actually what they’re meant for, and they wouldn’t be good at it.

If you just want the executive summary, here it is: these are definitely interesting algorithms from an arithmetic complexity theory standpoint – especially for the case of 4×4 matrices over finite fields, where (to the best of my knowledge) Strassen’s algorithm from 1969 was still the reigning champion. These algorithms are also practically relevant, meaning that not only do they have better asymptotic lower bounds than Strassen’s algorithm, they are still algorithms you might actually use in practice, unlike essentially everything else that has been written on the topic in the last 50 years: these algorithms are correct, and will in principle win over Strassen’s algorithm with large enough matrices, but that cut-off is well beyond the sizes that anyone is actually doing computations with.

And if you know what Strassen’s algorithm is, you might be in the market for the results from this paper, In fact, I’ll go one further and say that if you are currently using Strassen’s algorithm somewhere, you should definitely check the paper out. For the rest of you, I’ll try to give a very short summary of the topic and explain why actual small matrices are not really the intended use case.

Strassen’s algorithm

Regular matrix multiplication of 2×2 matrices uses the standard “dot products of rows with columns” algorithm:

\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{pmatrix}

As written, this does 8 multiplications and 4 additions. In 1969, Volker Strassen discovered an algorithm for this that only uses 7 multiplications but 18 additions (follow the link for more details, I won’t go into it here); Winograd later presented a variant that only uses 15 additions. This is interesting in principle if multiplications are much more expensive than additions, something which is true in some settings but does not commonly apply to hardware floating-point math these days. Hardware floating-point now commonly implements fused multiply-add (FMA) instructions, and the 2×2 matrix multiplication above can be implemented using 4 regular multiplications, 4 fused multiply-adds, and no separate additions at all. Moreover, Strassen’s algorithm has some numerical stability issues when used with floating-point (these concerns do not exist when it’s used for finite field arithmetic, though!) that means it also must be used carefully. What this means is that you would never actually consider using Strassen’s algorithm on 2×2 matrices, and that is in fact not how it’s normally presented.

The form of Strassen’s algorithm that is of practical interest works not with 2×2 matrices, but with 2×2 block matrices, that is, 2×2 matrices whose elements are themselves matrices. Matrix multiplication has a very regular structure that looks exactly the same when multiplying block matrices as it does when multiplying matrices of scalars, you just need to be careful about operand order for multiplications because matrix multiplications, unlike multiplications in the scalar ring or field, are not commutative:

\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}

This looks identical to what we had before (except it’s now all upper case), but the operations mean something different: before we were multiplying scalars with each other, so we had something like individual real number multiplications (or floating-point multiplications in actual numerical code), now the products are matrix-matrix products (which are O(N³) operations when multiplying two square NxN matrices using the standard algorithm, a mix of multiplications and additions or possibly fused-multiply-adds) and the sums are matrix-matrix sums (which are O(N²) additions). And because what we’re describing here is a matrix multiplication algorithm to begin with, the smaller sub-matrix multiplications can also use Strassen’s algorithm if they want to.

In short, not only does each multiplication and addition in this block matrix represent many underlying scalar operations, but the relative cost of multiplications compared to additions keeps growing as N (the size of the blocks in our block matrix) increases. And that’s where Strassen’s algorithm is actually used in practice: eventually, once N becomes large enough, the many extra additions and irregular structure become worthwhile. It’s important to note that arithmetic operation count is not the only factor here: the extremely regular structure of conventional matrix multiplication, and the ease with which it can use FMAs, means that the arithmetic operations in a tuned matrix multiplication kernel are a lot cheaper than you might expect, because these kernels tend to do an extremely good job of keeping the machine busy. With small matrices (and actual 4×4 matrices definitely fit that description), other overheads dominate. Somewhat larger blocks are mostly limited by memory subsystem effects and carefully manage their footprint in the various cache levels and TLBs, which tends to include techniques such as extra memory copying and packing stages that might seem like a waste because they’re not spamming floating-point math, but are worth the cost because they make everything else go faster. With much larger blocks, eventually Strassen can become attractive, although the actual cut-off varies wildly between architectures. I’ve seen reports of Strassen multiplication becoming useful for blocks as small as 128×128, but more usually, it’s used for blocks that are at least 512×512 elements, if not much more. All this assuming that its less-than-ideal numerical properties are not a show-stopper for the given application.

AlphaTensor

That brings us back to AlphaTensor. What DeepMind has implemented is, in essence, a variant of the neural-net-guided Monte Carlo Tree Search they’ve been using with great success to build Chess and Go AIs. This time, they use it to search for small-matrix multiplication algorithms. I’m not the right person to go into the details here, and it’s not important for the purposes of this post. We can just treat this procedure as a black-box optimizer that can do a guided search for matrix-multiplication algorithms meeting a user-specified set of criteria.

One obvious criterion would be optimizing for minimum multiplication count, and in fact one of the discoveries reported is a “Strassen-like” factorization that uses 47 multiplications to multiply two 4×4 matrices (compared to 7*7=49 multiplications for two nested applications of Strassen’s 2×2 algorithm, or 64 multiplications for the direct algorithm). And separate from the more theoretical concern of minimum operation count for a “practical” algorithm, the same optimizer can also incorporate measured runtimes on actual devices into its operation, and thus be used to perform a guided search for algorithms that are fast on certain devices.

That’s the process used to yield figure 5 in the paper, which reports speed-ups of optimized 4×4 matrix multiplication factorizations against the baseline (which is the regular algorithm). Note that what this does is one high-level 4×4 block matrix multiply using the algorithm in question at the top level; all the smaller lower-level matrix multiplies use regular matrix multiplication. Also note that the reported matrix sizes start at 8192×8192, i.e. in this case, the 2048×2048-element blocks are multiplied regularly.

Furthermore, note that the reported speed-ups that the press release refers to as “10-20%” are compared to the baseline of using regular matrix multiplication, not against Strassen (in this case, “Strassen-square”, the aforementioned 4×4 factorization obtained by applying Strassen’s algorithm twice to a 4×4 matrix). Strassen results are reported in the figure as well, they’re the red bars.

In short, the new algorithms are a sizeable improvement, especially on the TPU, but the perspective here is important: this type of algorithm is interesting when individual multiplications are quite expensive, in this case because they are actually multiplications of fairly large matrices themselves (2048×2048 blocks are nothing to sneeze at).

If you’re actually multiplying 4×4 matrices of scalars, the standard algorithm remains the way to go, and is likely to stay that way for the foreseeable future.

Morton codes addendum

I wrote about Morton codes long ago, and I see that post and the accompanying code referenced frequently, but there’s a few points I want to make about it.

First, if you’re on a x86 CPU with BMI2 support, you have access to PDEP and PEXT, which make Morton encoding/decoding trivial. 2D Morton encoding one 16-bit coordinate to a 32-bit word is a single 32-bit PDEP with 0x55555555, decoding it is a PEXT with the same mask, and you can also use the same thing to encode 64-bit values with appropriate longer masks extended in the obvious way. The only fly in the ointment is that earlier AMD Zen CPUs technically support PDEP and PEXT but with an incredibly slow implementation, so this is not a good implementation if your target HW includes those CPUs. That said on Zen 3 and later the two instructions are fine, and ARM has added an equivalent to their vector pipes with SVE2, so while a bit iffy right now, I expect this might be the method of choice eventually, for CPUs anyway. I’ve been seeing this a fair amount in GPU shaders too, though, which brings me to the next subject for today.

The methods presented in the original post assume that the Morton-coded index value is 32 bits. I did not point this out explicitly in the text back in 2009, but you can either add another pass to produce a 64-bit result, or remove passes if you don’t need full pairs of 16-bit coordinates (or triples of 10-bit coordinates) interleaved into one 32-bit value.

For example, I gave this code for Morton-encoding a pair of unsigned 16-bit values to form a 32-bit index: (Dropping the comments that detail the bit layout after each step because they cause formatting issues)

// "Insert" a 0 bit after each of the 16 low bits of x
uint32_t Part1By1(uint32_t x)
{
  x &= 0x0000ffff;
  x = (x ^ (x <<  8)) & 0x00ff00ff;
  x = (x ^ (x <<  4)) & 0x0f0f0f0f;
  x = (x ^ (x <<  2)) & 0x33333333;
  x = (x ^ (x <<  1)) & 0x55555555;
  return x;
}

uint32 EncodeMorton2(uint32_t x, uint32_t y)
{
  return (Part1By1(y) << 1) + Part1By1(x);
}

If we only care about 8 bits worth of x and y, that first shift-and-xor pass (which tries to move the high 8 bits of x into place) is useless and we can just skip it:

// "Insert" a 0 bit after each of the 16 low bits of x
uint32_t Part1By1_8b(uint32_t x)
{
  x &= 0xff; // 8 bits only this time!
  x = (x ^ (x <<  4)) & 0x0f0f0f0f; // 0x0f0f would suffice
  x = (x ^ (x <<  2)) & 0x33333333; // 0x3333 would suffice
  x = (x ^ (x <<  1)) & 0x55555555; // 0x5555 would suffice
  return x;
}

// Only uses low 8 bits of x, y
uint32 EncodeMorton2_8b(uint32_t x, uint32_t y)
{
  return (Part1By1_8b(y) << 1) + Part1By1_8b(x);
}

Lastly, if you’re encoding or decoding multiple values, which is extremely common because after all the whole point is to turn a pair (or triple) of values into a single integer, and the values are sufficiently narrow, we can handle x and y at once, by packing them into a suitable value before we do the rest of the computation.

For example, the case above where we only have 8-bit x and y coordinates (not that uncommon, say within a tile) can do a variant of the trick above to handle the full pair at only slightly more work than just handling a single coordinate in the original code would be:

uint32_t EncodeMorton2_8b_better(uint32_t x, uint32_t y)
{
  // Pack x and y into t, starting at bits 0 and 16 respectively
  uint32_t t = (x & 0xff) | ((y & 0xff) << 16);

  // Do the full interleave as above (this time the full
  // 32-bit masks are necessary)
  t = (t ^ (t <<  4)) & 0x0f0f0f0f;
  t = (t ^ (t <<  2)) & 0x33333333;
  t = (t ^ (t <<  1)) & 0x55555555;

  // Merge the halves.
  // All the odd bits are clear, so instead of shifting t right by
  // 16 to get y and then shifting left by 1, we can just shift right
  // by 15.
  return (t >> 15) | (t & 0xffff);
}

All in all is this is roughly the same cost of a single invocation of the original Part1By1. If you need a full 16-bit x and y but have 64-bit registers available, you can get a 32-bit result using a 64-bit extension of the original code (you just need to expand the masks appropriately).

This technique is independent of how exactly you do the bit-(de-)interleaving, so you can in principle combine this with PDEP and PEXT. There’s not much point though, because if you have a good implementation then doing two PDEP with different masks and a final OR will be cheaper, and if you might encounter one of the bad implementations you just want to avoid these instructions entirely and use the above SIMD-within-a-register (SWAR) algorithms.

I’ve only showed 2D encoding here, the same trick works for decoding:

pair<uint32_t,uint32_t> DecodeMorton2_8b(uint32_t code)
{
  // Separate even and odd bits to top and bottom half, respectively
  uint32_t t = (code & 0x5555) | ((code & 0xaaaa) << 15);

  // Decode passes
  t = (t ^ (t >> 1)) & 0x33333333;
  t = (t ^ (t >> 2)) & 0x0f0f0f0f;
  t ^= t >> 4; // No final mask, we mask next anyway:

  // Return x and y
  return make_pair(t & 0xff, (t >> 16) & 0xff);
}

And of course the same idea works for the 3D variants as well.

In summary, To Whom It May Concern: 1. PDEP/PEXT and friends (when available and reliably good) make this trivial, 2. if you only need a few bits you can skip passes and make it cheaper, 3. the bit movement is completely regular so you can also pack multiple values ahead of time and get two (de)interleaves for the price of one.

Entropy decoding in Oodle Data: x86-64 3-stream Huffman decoders

The adventure continues! Last time I went over the core decoder loop we use on AMD Jaguar-based consoles, i.e. Xbox One and PS4. I felt that one was a good example to start with since the target hardware is fixed, behaves quite predictably, and the various features and limitations of the machine conspire to make the solution essentially canonical: the narrow CPU front-end makes it so that minimizing macro-ops issued is a significant concern, and the BEXTR-based decode is a very natural fit.

This time, our hardware target is “any 64-bit x86”, used on less specialized targets like Linux, Windows and Mac 64-bit x86. We don’t have a single CPU type we’re optimizing for; it needs to run on everything, and we’d like to have good performance on anything released within the last 10 years (at the time this code was written in 2016) or so. We do get to have variants for newer ISA extensions when they help but we’d like a good baseline version for “any 64-bit x86 CPU”.

As with the previous two parts, you should read the introduction to the Huffman decoders first, continue with the “Reading bits in far too many ways” series on this blog (part 2, part 3) which is also a prerequisite, and if you can’t read and understand x86-64 assembly language this will be hard to follow.

You know how this goes by now

I’m not going into as much detail in this post as I did in the Jaguar one, both because our target this time is somewhat fuzzier and because we’re exploring variations on a theme here. It’s interesting to cover the key ideas, not so interesting to re-do variants of the same analysis many times in a row.

Our setting, then, is the same as it was last time: we have a Huffman code stream where the code lengths are constrained to be relatively short so we can do 5 in a row between refills, we use a 4KiB table with 2048 entries covering all possible 11-bit codewords, and decoding proceeds without having to do any significant data-dependent branches.1 We know that decoding back-to-back values from the same bitstream is limited by critical path latency, so we use multiple bitstreams.

On the AMD Jaguar, 3 independent bitstreams was sufficient to get us out of the miserable latency-bound hole and into the more desirable throughput-bound regime where we’re doing a decent job keeping the execution units busy. On the wider out-of-order cores that can sustain 4 or more instructions per cycle, 3 streams is not going to be enough for that, which is why we also have 6-stream variants that will be covered in a later part.2

Once again, I’ll go over this from the inside out: first figure out how to decode individual bytes, then sort out the refill strategy, then deal with the remaining plumbing.

The basic decode step

Our game plan hasn’t changed: peek ahead at the next 11 bits, do our table lookup, consume the right number of bits, emit the resulting byte. The most basic version of this, assuming we keep an explicit “remaining bits” count per stream, works out to something like this:

  ; peek at low 11 bits
  mov rcx, rBits
  and rcx, 2047
  ; table fetch
  movzx ecx, word [rTablePtr + rcx*2]
  ; shift out consumed bits, update bit count
  shr rBits, cl
  sub rBitCount, cl
  ; emit decoded byte
  shr ecx, 8
  mov [rDest + <index>], cl

So far, there should be no surprises here.3 As given, this works out to 7 instructions per byte decoded, quite a lot more than the 3 we had in the Jaguar version, but this time we don’t have guaranteed access to BMI instructions – and even if we did, they would not be as good as they are on Jaguar.

Two more small tweaks

Part of x86’s historical baggage is that bits 8 through 15 inclusive of the a, b, c and d registers can be accessed as their own register, and this can be used on the emit sequence to save an instruction, getting us down to 6:

  mov [rDest + <index>], ch

Due to the way the x86-64 instruction encoding works out, rDest in this code can’t be just anything. It has to be from a limited set of registers4 but other than that, we’re still not doing anything special here.

Note that we want to be using ECX as the register we load the table entry into because the variable-distance shifts in basic x86-64 all require the shift amount to live in CL.5 We don’t particularly care what register name we use for our scratch temporary; RCX/ECX is as good as any other. Nevertheless, using RCX (and only RCX) as the temporary to load table entries into is clearly the right choice here, but compilers would frequently insert extra moves (sometimes more than one extra on the critical path) here, which is how we ended up hand-writing these loops in assembly to begin with.6

The second fun wrinkle is that even though we can easily arrange for our shift count to always be in CL to begin with, on Intel CPUs with BMI2 available, we always want to use the new SHLX instruction instead, even when we’re shifting by the low bits of RCX, because SHLX takes 1 uop instead of 2 or 3 on newer Intel uArchs.7

The final non-obvious change we do for the “real” version of the decode is that instead of the peek logic presented above, we actually do this instead:

  mov ecx, 2047
  and rcx, rBits

Why load the immediate first using a separate instruction and then AND with rBits? Well, the important realization here is that moving an immediate value into ecx/rcx has no dependencies and can execute at any time, whereas mov rcx, rBits does depend on rBits and thus ends up becoming a part of the critical path. Now this register-register move is generally “free” on microarchitectures that can implement reg-reg moves via renaming at an effective 0-cycle latency, which on the Intel side is Ivy Bridge and later, but there are two important caveats: first, not every machine we’re targeting (or at least were targeting at the time this code was written, 6 years ago) is Ivy Bridge and later, and second, some later microarchitectures had subtle bugs in their implementation of MOV elimination and ended up disabling it later. Hence, as a general rule, the load-immediate-first version is preferable since it takes 1 cycle off the critical path for every decode when MOV elimination is not performed, for whatever reason.8

Refill

As mentioned above, for this version we prefer the methods with explicit “remaining bits” counts, specifically what I described as variant 4 in my write-up on bit reading.

The reason to prefer this approach here is that it keeps the load for the bit buffer refill independent of the final bit count, meaning that refill load’s latency does not necessarily end up on the critical path. Since the latency along the critical path is the primary concern for this loop, this is definitely what we want.

In the “production” version of the loop, we keep the remaining bit count for stream 0 in al, the refill pointer in r8, and the bit buffer for stream 0 in r11. The version we use is this (with the instructions scheduled slightly differently, but that’s extremely minor):

  ; refill streams
  mov    rbp, [r8] ; next0 = load(in0)
  movzx  rcx, al   ; bitc0
  shl    rbp, cl   ; next0 << bitc0; use shlx with BMI2+!
  or     r11, rbp  ; bits0 |= next0 << bitc0
  shr    ecx, 3    ; 7 - bytes_consumed0
  sub    r8, rcx   ; in0 += bytes_consumed0 - 7
  or     al, 56    ; bitc0 |= 56
  add    r8, 7     ; in0 += 7 

That’s a fair number of instructions, but the dependency structure is quite favorable:

  • The load of the next bits to refill only depends on r8, which is known as soon as the previous iterations’ refill is complete (all the instructions that update r8 are in the above code snippet!) and in particular is completely independent of the 5 table loads per stream.
  • Insertion of the new bits depends on the final bit count being known, but once the 3 instructions (MOVZX, SHL, OR, all pure int ALU) for that part of the dependency chain are done, we can start decoding the next byte.
  • Updating the bit count and the read pointers takes several more instructions, but neither of these are as critical because the bit count doesn’t determine any load offsets and the read pointer is only used for the next iterations’ refill.

Once again, using SHLX for the shift when available is preferred. In this case, in addition to the 2 uops saved, it also lets us remove the 1 cycle for the MOVZX from the critical path.

Beyond that, there’s a few more instructions in the loop to increment the destination pointer, maintain/test the “bytes to decode” count and check whether the input pointers have crossed, but all of this works the same as in the last part and is not on the critical part so I won’t spend time on it here.

Analysis

With all that said, what’s the expected cost of this decoder? Unlike the Jaguar example which targets a single known microarchitecture, this version of the loop targets a general PC target and thus should be good for numerous uArch generations from multiple vendors over something like a 10-year window.

We have in fact done just that kind of testing, and there are some minor variants (mainly depending on whether BMI2 and thus SHLX is available, as noted above), but the details here get very repetitive and tedious. I don’t particularly feel like writing all of this down, nor would it make for interesting reading.

Instead, I’m going to take a single uArch as our proxy. I’ll be using Intel’s Skylake (SKL), which Intel essentially (in progressively tweaked versions) kept re-releasing in between 2015 and 2021, meaning there’s a nearly 7-year window of Intel consumer CPUs that are all essentially SKL spin-offs. In the notebook space we saw Ice Lake in late 2019 but for desktop, 2021’s Rocket Lake (the first “real” SKL successor for desktop) was neither well-received nor sold well, and was soon replaced by Alder Lake in late 2021.

In practice, this means that at the time of this writing, the mean, median and mode Intel desktop x86 that is at least a year or so old is some SKL variant; during that time AMD made major inroads into the desktop space with their Zen cores (which, fortunately for our analysis, have quite similar performance characteristics). Notebooks are a bit more varied, but if I’m going to look at only one x86 uArch, SKL seems like a solid choice.

As mentioned multiple times throughout this post already, we’re mainly concerned about latency here. As such, I won’t go into many uArch details here and just point out that the relevant cores are a lot wider than the Jaguars we looked at last time (namely, they support 4+ instructions decoded/dispatched/retired per cycle), have 4 integer ALUs, can (under the right conditions) support 2 loads and 1 store per cycle, up to 2 branches per cycle, and in general have just more of essentially everything compared to the small Jaguar cores we looked at last time. (They also typically clock at twice the frequency, or more.)

Ironically, precisely because these cores are so much wider, the analysis gets a lot simpler. Here’s the gist of it: the all important critical path between iterations for a single stream starts with the final bit count for the last iteration and then goes through the refill and 5 subsequent symbols decodes until the bit count for the current iteration is known.

In the code as presented above, that means we start with the MOVZX, SHL and OR from the refill (all basic ALU ops that are 1-cycle latency on pretty much every x86 core you can buy these days) until we know the new value of rBits, for 3 cycles refill latency (can be 2 if we use SHLX).

Then we have 5 subsequent decodes where we care about the latency until we have the updated (shifted) rBits. In the version with the hoisted immediate that uses and rcx, rBits, the 3 relevant instructions on that dependency chain are this AND, the table load, and then the shift by CL. The AND and shift are also 1-cycle latency. In the final iteration, we also need to know the final bit count; this is computed by another independent instruction (the SUB) that depends on the table load and is likely to issue at the same time as our shift, so it doesn’t change the latency estimate. Finally, the load is the longest-latency instruction. Now SKL has 4-cycle-latency loads for certain very restricted cases, if the address itself originated on the load unit (i.e. for pointer chasing), but for our table lookup, we have a complicated addressing mode and the index register originates on the ALU not the load unit, so 5-cycle load latency is what we get.

Everything else is not critical, so for now, we’re going to ignore it. What does that leave us with? For one of our 3 streams, the latency along that critical path ends up being 3 + 5×(1 + 5 + 1) = 38 cycles.

Skylake can sustain 4+ fused-domain9 uOps per cycle during those 38 cycles, giving us 38 × 4 = 152 issue slots. One refill takes around 8 fused-domain uOps (with SHLX), each decode step takes 6. That means one stream’s worth of decodes takes around 8 + 5×6 = 38 uops, so for 3 streams that’s 114 uops out of our budget, leaving another 38 issue slots remaining.

That leaves the work we haven’t accounted for so far: one of our streams needs to do a BSWAP (another 3 uops), we have the two pointer-crossing checks (2-4 uops depending on what macro-fusion happens), the destination pointer increment (1 uop), and the final loop condition/test (1-2 uops, again depending on macro-fusion). And that’s it. At worst, that takes another 10 uops out of our budget.

We’re left with about 124 issue slots used over 38 cycles, about 81% of what the machine can sustain. That leaves enough air in the schedule that we don’t need to be too worried about exactly what the schedulers and port load-balancing logic do; we really do expect to be mostly latency-limited still, but at least latency-limited with a decent uArch utilization. Finally, our instruction mix is balanced enough, and the machine has enough resources, that we needn’t worry too much about any particular type of resource (like say load units or shifters) getting overcrowded.

In short, the best we can expect out of the loop described above is to decode 15 bytes (5 bytes each from 3 streams) every 38 cycles, which works out to 2.53 cycles per byte. In practice, on my Skylake test machine, benchmarks hover around 41.8 cycles per 15 bytes (i.e. around 2.79 cycles/byte), both in the idealized test rig and on real-world data.

That’s approximately 10% above our rough lower-bound estimate, purely from operation latencies, assuming scheduling is perfect, there are no cache misses, and so forth. At this point, we could dig deeper and try to figure out where we’re losing these 3.8 cycles relative to our critical path estimate, and if there’s something we can do about it, but the writing’s already on the wall: with one stream, we would be utterly latency-limited with a mostly-empty machine. The 3 streams we’re using now get us to a more respectable level of uArch utilization, but while they were enough to get us into our target throughput-bound regime on Jaguar, they’re not enough, not even in a lower-bound estimate where there are never any delays due to cache misses or suboptimal uop scheduling.

Discussion

Really, this should not be a surprise. On a machine twice as wide, with many more execution resources, instructions per cycle, and 2 cycle longer load latency, of course we need more streams to hide those latencies.

However, we’re starting to bump into other problems. Our bit-IO method uses 3 registers worth of state per stream: the bits, the bit count, and the refill pointer. That’s 9 registers used. We also definitely need a stack pointer, a destination buffer pointer, at least one scratch register (rcx), and also a base pointer for our Huffman table and preferably also an “end-of-buffer” pointer (or, equivalently, a counter) in a register for the final loop test. That’s 14 registers spoken for, and x86-64 only has 16 GPRs.

Now, some of those registers don’t need to be live for most of the loop. The “end-of-buffer” pointer never changes and is only used once at the end so it’s not a big problem to leave in memory, and our refill pointers are only used during refill so we can spill some of them without wreaking havoc if we don’t mind the extra instructions to spill and reload them, but we’re getting on thin ice here.

Using more streams would be interesting, but we need to be careful not to introduce a lot of extra instructions. Potentially troublesome spills/refills introducing latency in the wrong places would suck as well. In short, we’d like a few more streams (just one extra is cutting it too close), but that in turn means 3 GPRs per stream is not going to work. We need something more compact, which leads us to a quite different design for the 6-stream decoders we added in Oodle 2.3.0. But that will have to wait for another post. Until then!

Footnotes

[1] The “how many bytes are left” loop condition is just a counted loop and extremely predictable. The “did the pointers cross” check is technically data-dependent, but it is also extremely predictable (in the fast decoder loops anyway), because the decoder loops only run until our already-advanced pointers first cross, at which point we exit the fast loops and switch to the more careful variants.

[2] The trickiest part to negotiate when doing this experimentally is the transition from latency- to throughput-bound. As long as you’re latency-bound, your best move is to use strategies that minimize latency, even when doing so increases dynamic instruction count or utilization of limited execution resources. Once you have enough streams to cover the critical path latency with some slack left over, you need to worry about those bottlenecks instead, even when doing so adds latency. Options that were clearly inferior in the latency-bound regime will end up preferable in the throughput-bound regime.

[3] When writing this code in C/C++, you would probably keep the remaining bit count for the bit buffer in a 32- or 64-bit integer, not a byte, but done naively that would result in an extra instruction to extract and zero-extend the “length” field from the low 8 bits of the table entry. Using a byte register avoids the extra instruction, or alternatively, you could subtract the full 16-bit table entry from the bit count and mask the final bit count down to 8 bits (effectively reducing modulo 256) before using it. The two approaches are equivalent; for this presentation I went with the byte register variant because it’s more straightforward.

[4] x86-64 was designed to share as much of the instruction encoding (and hence the decoder logic) between the 32-bit and 64-bit versions. 32-bit x86 has 8 general-purpose registers and 3-bit register numbers; x64-64 increases this to 16 GPRs (which take 4 bits), and the high bits of the up to 3 register numbers in an instruction are grouped together in what’s called a REX prefix byte. Only some instructions have a REX prefix: if all 3 register number high bits are zero and the instruction is not 64-bit either, it’s not required. However, the register numbering for 8-bit registers worked differently than it did for the other registers: for “classic” x86 code, the 8 8-bit registers are AL, CL, DL, BL, AH, CH, DH, and BH (in that order), where those last 4 are refer to [15:8] in the corresponding 32-bit register. With the REX prefix present, every 64-bit register has its low 8 bits addressable as a corresponding “L” register, but the “H” register names are inaccessible. Thus, if you want to store CH to [rDest + immed], rDest needs to be one of the low 8 registers.

[5] Other (non-x86) architectures mostly don’t care about which architectural register the shift amount lives in, but almost everyone looks only at the low 5-8 bits (depending on architecture and context) of the shift count. This makes it so that arranging our (len,sym) pairs so that len is in the low byte (which for LE byte order means first byte) is almost universally the best choice.

[6] Hand-writing in ASM also made it practical to use the “store from CH” trick, which is not a huge deal in this particular version since there are other problems that we’ll get to in a minute, but this particular quirk of x86 is one that compilers generally won’t even try to use, even though it’s advantageous in cases like this one.

[7] This is one of my “favorite” x86 warts. The reason SHL by CL is awkward, and often more uOps than you would expect, is because the architectural definition of this (and other variable-shift instructions) is quite strange for backwards-compatibility reasons. Namely, all the shift-by-CL instructions set different flags (and set them in different ways) depending on whether the effective shift count ends up being 0, 1, or something else. This goes back to a problem with the original 8086. The 8086 didn’t have a barrel shifter or any other way to do fast distant shifts; “SHL AX, CL” existed but effectively just repeated SHL AX, 1 CL times. This definition was presumably convenient given the extremely limited microcode ROM space the 8086 worked with but has the two unfortunate side effects that a) you could make shifts take a very long time on the 8086 by setting CL to 255 and b) when CL is 0, no shifting takes place at all, and in particular the flags stay unchanged from before. Moreover, there’s another wrinkle in that SHL AX, 1 sets the overflow flag “correctly” (probably mostly by accident, because for left-shifts specifically the natural implementation of it on a 8086-like ALU data path is ADD AX, AX, which does, and right-shifts can’t overflow), but when doing an iterated sequence of such shifts, the final overflow flag will indicate whether the last shift-by-1 overflowed, which is useless when the count is not 1, because for a “shift by 3” operation what you really want to know is whether there was an overflow at any point during the process, not whether the last instruction in a lengthy expansion overflowed (which is neither here nor there). The tightened definition that appears in the original 80286 manuals specifies that only the last 5 bits of the shift count matter (no doubt anticipating the coming expansion to 32-bit with the 80386) and further says that 1-bit-shifts set the overflow flag correctly but multi-bit shifts leave the overflow flags in an undefined state (not mentioned in these early manuals is that an effective shift count of 0 leaves the flags unchanged, but all x86 CPUs I’m aware of nevertheless behave this way). Long story short, the actual data portion of x86 SHL reg, CL is exactly what you would expect, but which flags get updated and how is a tangled mess that, on most Intel CPUs, takes another 1 or 2 uops to resolve. Interestingly, no AMD uArchs I’m aware of, present or historical, have a problem with it.

[8] Another option available on machines with BMI1 or later is to use ANDN to perform our AND operation. The idea is to put the ones’ complement of our desired mask (so ~2047) into a scratch register and then use andn rcx, rInvertedMask, rBits to initialize rcx. In short, we don’t actually care about the “not” part of the ANDN, but it conveniently comes as a non-destructive 3-operand instruction that we can use to avoid the move. This does the job in one instruction, which is great, but requires us to sacrifice another register, which is not so great. In this particular code it’s not really worthwhile since we’re entirely latency-limited on sufficiently wide cores, so having a MOV/AND pair is not a problem as long as we don’t extend the critical path unnecessarily.

[9] I won’t go into what exactly fused-domain uOps are; it’s there for precision, but if you don’t know what that means, just ignore it, it’s way beyond the scope of what we need for this post.

Entropy decoding in Oodle Data: Huffman decoding on the Jaguar

In the last part we went over the general ideas of Huffman coding as implemented in the newer Oodle Data coders, this time we’ll be looking at one particular implementation that is both interesting and “historically relevant”: Oodle was designed with games in mind, an important class of hardware to consider for game middleware is game consoles, and versions of the AMD Jaguar CPU were in both the PS4 and Xbox One (mostly unmodified except for a bump in clock rate in the “mid-lifecycle upgrade” models of both). We wanted Kraken to perform well on those machines, so we spent some time optimizing Oodle for it. Before I go into the details, let’s do a bit of background on the machine itself, but be advised that this will be in-depth and that you may need to re-read the previous part first. Furthermore, this post contains plenty of x86 assembly; if you’re uncomfortable or unfamiliar with that, you probably won’t get much out of it, sorry.

Meet the AMD Jaguar

The Jaguar, or less prosaically “Family 16h”, is a small, low-power, out-of-order 64-bit x86 CPU core designed for small systems and embedded applications such as, well, game consoles. In the game console variants, the Jaguar CPUs appear on the main SoC along with the GPU (and most other components). It’s designed for multi-core operation and cores usually appear in clusters of four that share a common L2 cache, usually 512kb of L2 per core. In the Xbox/PS4 versions, there are two such clusters and thus two L2 cache slices. This post is only concerned with tasks that run on a single thread so I won’t be spending time on this part of the architecture.

Each core has 32KiB of L1 instruction and 32KiB of L1 data cache. The frontend decode/dispatch/retire logic is 2 instructions wide, and the relevant unit for most of it is what is variously (depending on the source) called “macro-ops” or “cops” (complex ops), depending on the source. I’ll stick with macro-ops. Macro-ops are typically a data-processing instruction along with a memory reference, so something like the x86 instruction add rax, [rsi] would be a single macro-op.1 Macro-ops get broken into either one or two micro-ops (I’ll write uops in the following) for execution, but instruction decoding, tracking and retirement all works on macro-ops. The backend has six execution units, each of which can accept one micro-op per cycle: two integer (which I’ll refer to as I0 and I1), one load (L), one store (S), and two SIMD/floating point (which I’ll refer to as F0 and F1). The pipelines are very symmetric: almost all integer instructions can execute in either I0 and I1 (the biggest exceptions being multiplies and divides, which are I1 only), and most FP/SIMD instructions can execute in either F0 or F1 (FP addition and SIMD integer multiplication are only supported in F0, and FP multiplies and store/convert are F1 only). Consequently, most pairings of two independent instructions can execute in the same cycle, if they’re not both contending for the same resource. Of these backend limitations, in my experience the one you’re most likely to hit is the one load per cycle limit.

That said, whenever I’ve looked, the two instructions per cycle decode/dispatch limit is usually the more relevant one. On the Jaguars, using the load-operate and even read-modify-write instructions where possible is a good idea (because it gives you two uops per macro-op), and generally preferable to splitting loads out.

Speaking of loads, the L1 data cache is 8-way associative, write-back, and internally splits 64-byte cache lines into 16-byte sectors. Unaligned loads and stores that stay within a single sector are free, crossing a sector boundary occupies the load/store pipes for an extra cycle (potentially more if it also crosses a page boundary etc.). The load-to-use latency for the L1 data cache is 3 cycles to the integer pipes, 5 cycles to the FP/SIMD pipes, both of which are quite low numbers compared to most of its contemporaries.2 The theme of low latencies continues for other parts of the backend: FP32 multiplies and SIMD integer multiplies complete in 2 clock cycles, FP32 and FP64 adds in 3, and most SIMD ALU operations take a single cycle. L2 misses take relatively long though, at a minimum load-to-use latency of 25 cycles.

Lots of console developers found these cores underwhelming, mostly due to the narrow design and fairly low clock rates (around 1.6 and 1.7GHz in the original PS4 and Xbox One, respectively). On the other hand, these cores are quite small, power-efficient, and the PS4/Xb1 console generation came with 8 of them, at a time when more than 4 cores was a rarity in the consumer space. Personally, I quite like them: they’re not the fastest but what they are is extremely even-tempered and predictable. They have a relatively low ceiling on the instructions per cycle and peak performance they can achieve, but getting there is generally a fairly straightforward process, and there’s not much in the way of gotchas or nasty surprises. They’re a pleasant core to optimize for3, and AMD helped by providing good documentation for it.

The Plan

Because of the aforementioned decode/dispatch/retire limits and low instruction latencies, optimizing code with reasonably nice memory access patterns for the Jaguars is, more often than not, an exercise in minimizing number of instructions executed for a given task. (As I said, they’re fairly straightforward to optimize for!) Therefore, if we want a fast Huffman decoder on these machines, it’s a good idea to see if we can do it with as few instructions as possible.

While reviewing the above-quoted docs, one thing I noticed was that BEXTR, an instruction from BMI1, turns into one uop, is supported on both integer pipes, and has 1-cycle latency. BEXTR is an odd duck: it extracts a given number of bits from a given starting point in the first source operand, and as such is essentially a counterpart to PowerPCs rlwinm or ARMs UBFM, but while these latter two instructions have the bitfield position and width given as an immediate operand, BEXTR takes a register operand for the bitfield specification.4 Code that wants to do repeated bitfield extraction with the same operands can burn a register on a constant (itself a fairly steep cost on the relatively register-starved x86) and then use BEXTR, which replaces a move, shift, and an bitwise AND instruction.5 The second source register operand to BEXTR contains, itself, bit-packed values: the lower 8 bits give the index of the LSB of the bitfield to extract, the next 8 bits give the width in bits.

This is usable for the bitstream decoding part of our Huffman decoder. Using a “bit extraction” style decoder (variant 3 in this post) means we repeatedly do operations of the form (bit_buf >> bit_pos) & ((1 << 11) - 1)) to peek at our next 11 bits, and that is just BEXTR(bit_buf, bit_pos + (11 << 8)). It doesn’t cause any problems to have a constant bias that shows up only in the high bytes added to our bit position, so we can just declare our bit positions to have that offset added at all times while in registers, and that lets us do our bit buffer peek in a single 1-uop instruction on Jaguar cores. Because of another x86 quirk, namely that byte-sized instructions exist and preserve the remaining bits of the register, we can do updates of bit_pos using byte-sized additions or subtractions that leave the high bits alone, if we want to.6

Finally, we don’t want to do a store for every byte we decode, because that’s an extra instruction and we’re easily limited by instructions (or rather, macro-ops) executed. Fortunately we can use the SSE4.1 instruction PINSRB (packed insert byte), which inserts a byte value from an integer register or memory into a given lane of a vector register. Vector registers hold 128 bits (16 bytes), which means we can amortize the number of stores and do one every 16 or so bytes instead of for every codeword. Finally, because Jaguar cores treat memory references inside an instruction as separate uops but not separate macro-ops, and macro-ops are one of our main limiters, we want to use memory references liberally if doing so lets us reduce the number of macro-ops we need.

Putting this all together, note that the pseudocode for the per-symbol processing in a LSB-first Huffman decoder, as outlined in the previous part, looks something like this:

  // peek
  uint32_t bits = (bit_buf >> bit_pos) & 2047;
  // consume bits
  bit_pos += table[bits].len;  
  // decode symbol
  emit(table[bits].sym);

and using the various techniques outlined above, we can turn this into a mere 3 x86 instructions:

  ; peek. rBitPos[15:8] = 11
  bextr   rBits, rBitBuf, rBitPos
  ; advance bit offset (update low byte only)
  add     rBitPosb, [rTableBase + rBits*2]
  ; put table[bits].sym at position N into xmm0
  vpinsrb xmm0, xmm0, [rTableBase + rBits*2 + 1], N

On the Jaguar, this decomposes into 3 macro-ops and 5 uops: 2 loads, 2 integer ops, 1 SIMD. The bit extract to grab rBits from rBitBuf takes a single cycle; the bit position update takes 3 cycles to load the value from the table and an extra cycle to complete the addition. We don’t actually care about the top bytes being preserved here, since we don’t expect overflows, but we do care about our load-operand being byte-sized. Either way, that’s 5 cycles critical path latency from one decoded symbol to the next. Finally, the vector byte inserts to collect the output bytes are not on the critical path. They need to be fast enough to keep up with our decoding bytes once the table loads finish (and they are, single they can complete at a rate of 1 per cycle) but that’s about it. With the Jaguar frontend supporting at most 2 macro-ops per cycle, this code takes at least 1.5 cycles per symbol decoded in the frontend, and 2 cycles per symbol decoded in the load pipeline. Meaning that as given, this code is limited more by the load pipeline than the frontend. However, this is not the only work that needs to happen in this loop, and the Jaguar is out-of-order, so we can build up a backlog of load pipeline work; if we later need to do more integer work in the loop that does not take many loads (spoiler: we will), the load pipeline will get to catch up.

Finally, as mentioned above, our critical path between back-to-back loads from the same stream is 5 cycles on the Jaguar. If we use 3 streams and interleave their processing during decode, then the frontend will get around to the first instruction for the second byte of stream 0 about 4.5 cycles in (although the load pipeline will take about 6 cycles to work through its backlog before then). In other words, the timing here can roughly work out, but it’s not perfectly matched; we will build up a bit of backlog in the load pipeline and the reorder buffer before this is done, but as long as we choose our instructions carefully and don’t go too lopsided, we can make this work while keeping the core nice and busy the whole time through.

I was pretty excited when I first realized this 3-instruction sequence was a viable candidate for the core of our Huffman decoder on Jaguar, but to get a real decoder we also need to deal with bit buffer refills, pointer advancing, and end-of-buffer checks.

The actual decoder

As mentioned above (and in the previous part), we use 3 separate bitstreams for parallelism. Of these, two bitstreams are regular “forward” bitstreams in increasing address order, and one is written backwards. The numbering of these is a bit odd: in the physical Oodle format, the layout is strm0-> | strm2-> | <-strm1, i.e. stream 0 is forward and comes first (as you would expect), stream 1 is backward and comes last, and stream 2 is also forward and somewhat awkwardly sandwiched in the middle, for “historical reasons”. Namely, Kraken uses forward-backward stream pairs in many places.7 The Huffman decoder used to be the same way; when we noticed (while working on this Jaguar decoder, in fact) that three streams would be advantageous, we had to put the third stream somewhere. Putting stream 2 in the middle turned out to be slightly easier.8 The advantage of the odd-looking backward stream is that it saves us a bit of signaling in the container format (not a trivial concern for a compression format) and also gives us a nice way to do end-of-buffer checks. Namely, the three are contiguous, and all three read pointers (called in0, in1 and in2 in the following) are in that single contiguous region. Furthermore, in0 keeps increasing, in1 keeps decreasing, and at any point in a well-formed stream, we have buffer_beginin0in2in1buffer_end. During decoding, we do the two interior checks of the read pointers against each other; the end-of-buffer checks on either end are implied by transitivity, and we don’t need to actually do them, or keep those extra pointers around. The only pointers we need to check are the ones we already keep around anyway. Neat!

Now, loads aren’t zero-sized; we use the (common in C) convention that “end” pointers point one past the last element of arrays. So we don’t want to start loading from in1, and with 64-bit (8-byte) loads, the largest address we can ever safely load from is at buffer_end - 8, assuming the buffer is at least 8 bytes to begin with (which we check beforehand). Decrementing in1 by 8 before the loop takes care of both issues: now in1 points to the last address we can do a valid 64-bit load from, and as a side effect in0in2in1 ends up guaranteeing that in0 and in2 are also good to safely load 8 bytes from without overrunning the buffer. Finally, the optimized decoder loop described here decodes 5 bytes each from 3 streams and writes the results using a 16-byte SIMD store, so it can only safely run until 16 bytes before the intended end of the buffer. All the remaining special cases (less than 8 bytes left in some of the streams, very short input streams, or close to the end of the output buffer) are left to a dedicated safe loop that generally handles the last few bytes, needs to do more careful checking, and is certainly not using hand-tweaked assembly. There would be no point for speed since it only ever handles very few bytes, and besides that’s the exact loop where you very much want a higher-level language for better debugging facilities and good integrations with sanitizers, fuzzers etc.

With a plan for all those details, all we need to take care of now is the refill logic and sort out the remaining plumbing. Looking at the decoder sketch above, we see that we need at least 2 registers worth of state per bitstream: one register to contain bit_buf (rBitBuf in the pseudo-ASM), and one for bit_pos. Once we consider refilling, we also need the corresponding read pointer (the inN I was just taking about). For 3 streams, 3 registers of state per stream works out to 9 registers, a bit more than half of our general-purpose register name pool, which is workable.

As for refill, that is luckily straightforward in a “bit extract” style scheme. At the top of every iteration, we want to load the next 8 bytes from the current input pointer:

  mov  rBitBuf0, [in0]

For the reverse byte order in1 stream, we use a big-endian load (MOVBE) instead, which is the same cost as the regular load on the Jaguars.9

Then we decode 5 values from each of the 3 streams. With our 11 bits code length limit, that means we end up consuming at most 55 bits from each stream. Most relevant bit reading techniques support at most either 56 or 57 bits in a row without a refill when using 64-bit registers, so this fits well.10 Decoding 3×5 = 15 symbols also works out very nicely with our 128-bit vector registers, so we do a single unaligned vector store every 15 bytes.11

Finally, after each stream has decoded 5 symbols, we need to check how many bytes to advance the read pointer by, and what the new start position within the byte is. The number of bytes we need to advance the read pointer by is (bit_pos >> 3) & 7, which on the Jaguar, we can compute using a single BEXTR if we can afford a register just to store the constant 0x303, which we can.12 We then either add or subtract this from the corresponding in pointer. Finally, we need to clear the bits corresponding to the byte position (that we just took care of) in bit_pos, which is an AND by ~0x38. This keeps the high bits, containing the 11 bitfield length that we need, intact. The actual code below does this masking at the start of the next iteration instead of at the end of the current iteration, but conceptually this belongs with the pointer advance.

And that’s pretty much it. Here’s the full decoder loop, written in NASM. We originally tried to write this in C++ with intrinsics, but that got nasty, so we eventually switched to a real assembler. The original version has the comments laid out differently but I need to fit this into an annoyingly narrow blog CMS theme, so this will look a bit clunky:

        ; main decode loop
        ; rax = scratch
        ; rbx = bitextr0
        ; rcx = bitextr1
        ; rdx = bitextr2
        ; rbp = bextr const
        ; rsi = table ptr
        ; rdi = -bytes_left_to_decode
        ; r8  = in0
        ; r9  = in1
        ; r10 = in2
        ; r11 = bits0 (only live in inner loop)
        ; r12 = bits1 (only live in inner loop)
        ; r13 = bits2 (only live in inner loop)
        ; r14 = decodeend
        ; r15 = (unused)

        sub             r9, 8 ; in1 -= 8
        mov             ebx, 0xb00 ; 11 field width
        mov             ecx, 0xb00
        mov             edx, 0xb00
        mov             ebp, 0x303 ; for byte step

        align           16
.bulk_inner:
        ; non-crossing invariant: in0 <= in2 && in2 <= in1
        cmp             r8, r10
        ja              .bulk_done
        cmp             r10, r9
        ja              .bulk_done

        ; refill stream 0
        ; read next bits0, keep bit offset within byte
        mov             r11, [r8]
        and             ebx, ~0x38

        ; refill stream 1
        movbe           r12, [r9]
        and             rcx, ~0x38

        ; refill stream 2
        mov             r13, [r10]
        and             rdx, ~0x38

        %assign i 0
%rep N_DECS_PER_REFILL
        ; stream 0
        ; peek
        bextr           rax, r11, rbx
        ; consume
        add             bl, [rsi+rax*2]
        ; grab sym
        vpinsrb         xmm0, xmm0, [rsi+rax*2+1], i+0

        ; stream 1
        bextr           rax, r12, rcx
        add             cl, [rsi+rax*2]
        vpinsrb         xmm0, xmm0, [rsi+rax*2+1], i+1

        ; stream 2
        bextr           rax, r13, rdx
        add             dl, [rsi+rax*2]
        vpinsrb         xmm0, xmm0, [rsi+rax*2+1], i+2

        %assign i i+3
%endrep
        %undef i

        ; final advances
        ; num_bytes_step0
        bextr           rax, rbx, rbp
        ; in0 += num_bytes_step0
        add             r8, rax
        bextr           rax, rcx, rbp
        sub             r9, rax
        bextr           rax, rdx, rbp
        add             r10, rax

        vmovdqu         [rdi+r14], xmm0
        add             rdi, 15
        ; loop while bytes_to_decode > 0
        js              .bulk_inner 

That’s the core 3-stream Huffman decoder loop. Time to quit it with the hand-waving and do an actual analysis (if only back of the envelope) to make sure we’re on the right track here.

Analysis

We already looked at the core decode step earlier and noted that it has 3 macro-ops (I’ll write 3M in the following), and for the backend: 2 integer 0/1 ops (just 2I for short), 2 load unit cycles for aligned loads (2L for short), and 1 FP/SIMD op (1F for short). We do this 5 times per stream. Also per stream is the refill/advance logic, which we now know the instructions for: 1 load for the refill, and 3 integer ALU ops for the byte advance and bitpos update. The load in the refill is almost always unaligned, though. It’s a 64-bit load, and as noted in the introduction, unaligned loads are free if they stay within an aligned 16-byte sector, and cost at least 1 cycle extra when they don’t. Out of the possible load offsets mod 16, 9 (0 through 8 inclusive) stay within a 16-byte sector, the other 7 do not. That’s 7/16=0.4375 odds of at least one cycle extra, and some of those cases (such as crossing cache lines and pages) get more expensive than just adding a cycle. For sanity in the following, let’s just say that we bake this all down to somewhat simpler numbers and expect around 1.44 cycles average case (but probably closer to 1.5 in realistic conditions) for those unaligned refill loads, 2 cycles for a much more pessimistic estimate. In other words, we want to bill the unaligned refill loads at costing more than single aligned load, since the expected number the load pipelines are occupied with them is larger.

Taking that into account, the four instructions involved in refill and advance for a single stream boil down to 4M, 1.44-2L, and 3I.

Then, we have some cross-stream shared instructions: the two compare/jump pairs for our pointer-crossing check at the beginning account for 4M 4I, the final store accounts for 1M 1.44-2S (since it’s also unaligned), and the final ADD/JS pair contribute another 2M 2I to the tally. That’s all instructions in the loop accounted for.

For an overall throughput estimate, we get:

  • 15 × 3M (decodes) + 3 × 4M (stream refill/advance) + 7M (shared rest) = 64M total, so 64 macro-ops, enough to occupy the front-end for at least 32 cycles.
  • 15 × 2L (decodes) + 3 × 1.44-2L (stream refills) = 34.3-36L total, so the load unit is busy for 34.3-36 cycles.
  • 15 × 2I (decodes) + 3 × 3I (stream refills) + 6I (shared) = 45I total, evenly distributes over both integer ALU pipes for 22.5 cycles worth of pressure.
  • 15 × 1F (decodes) = 15F total, distributed over both FP/SIMD pipes for 7.5 cycles worth of pressure, so they’re loafing.
  • 1.44-2S for 1.44-2 cycles worth of pressure on the store pipe which I assume is sitting on the sidelines munching popcorn.

Purely in terms of pressure on the execution resources, we’re mainly limited by the load pipes which are busy for around 34.5-36 cycles every iteration, closely followed by the frontend which is occupied for at least 32 cycles if everything goes perfectly. 34.3-36 cycles to decode 15 bytes works out to 2.287-2.4 cycles per byte decoded. This is assuming we can ever get throughput-bound to begin with, and is budgeting absolutely no time for L1 cache misses and such.

How does the critical path look? By my reckoning, the most likely candidate takes a freshly updated in pointer from the end of a previous iteration, does an unaligned load to refill which takes 4 cycles for the data to show up, then does 5 back-to-back decodes from that stream which we know have a critical path latency of 5 cycles each, and then finally needs to do a BEXTR on the resulting bitpos followed by an integer add/subtract to produce the next load address. That’s 4 + 5 × 5 + 2 = 31 cycles of critical path latency through the stream decodes, worse if anything bad happens, like extra delays due to page crossings on a load or similar. 31 cycles is close enough to our other 2 limiters for it to be considered a 3-way near-tie. A hitch in the front-end or load pipes or any extra delay along the critical path is likely to end up delaying any given iteration if it occurs. Note I’m purely looking an ideal throughput estimates here, there is no modeling or simulation of machine details going on, all we’re doing is tallying up some figures based on known machine characteristics.

In short, from this rough estimate, we would expect somewhere around 2.3-2.4 cycles per byte for this decoder under very idealized circumstances where there’s not a single cache miss or hitch anywhere along the way, and “a bit worse” (to be decided what that means) when less idealized.

The rubber hits the road

So what happens when we actually run it?

One of the nice things about coding for game consoles is that the hardware is known and tends to have very predictable, repeatable performance. With that said, here’s stats for the exact loop quoted above running on a PS4 on a synthetic test set (decoding a random stream with a very boring “every symbol is 8 bits” Huffman table, which of course you’d never do, but makes for a test run that’s very easy to validate the results of) 1000 times, and reporting 1st, 50th (median) and 95th percentile cycles per byte:

    huff3_jaguar_asm: med 2.28/b, 1st% 2.25/b, 95th% 2.38/b

I swear I did not fudge this in any way, that’s the actual figures I got on a real test run just now. So that much, ahem, very encouraging, to say the least. But what happens when you time it in the middle of an actual Kraken decode of ~250MB of real non-synthetic test data?

SimpleProf              :seconds  calls     count :     clk/call    clk/count
get_array_huff          : 0.3879  17689 178617138 :      34946.1         3.46
huff_x64jaguar_loop     : 0.3035  31350 178617138 :      15427.6         2.71

Here get_array_huff includes everything the Huffman decoder has to do, including reading the headers, the Huffman table descriptions, validating the code lengths, setting up the tables, the fast decode loops, and the slower near-end-of-data tail decoders, and huff_x64jaguar_loop is just the core optimized decode loop that handles most of the bulk data. “Calls” is the number of calls to either function and “count” is the number of bytes decoded. In this case, 250MB of data “only” decode about 178MB through the Huffman decoders; less than 250MB because we also do LZ-style dictionary compression (not covered in this series). So in this particular real-world use case (which very much does not have all the buffers already nicely in the L1/L2 caches as it’s running), we’re about 17% slower than our ideal average-case throughput estimate for this loop, which is still respectable. Also visible from these two lines is that the core decode loop where around 75-80% of the overall Huffman decoding time is spent with the rest being in setup or tail handling. That is fairly typical for our decoder implementations on various platforms. And you can infer from the figures given that our average array of bytes that use a single Huffman table is about 10k long. For this part, looking at averages is misleading: the distribution is quite wide. Many arrays are 60k+, but many others are well below 3k. The former spend more time in the core decoder (always nice since that part is flat), the latter spend a lot more time proportionately in header parsing and table initialization, which is why we can’t neglect it.

This last part is also why the Jaguar decoder (or, for that matter, all other decoders in Oodle) doesn’t bother with trying to set up tables to decode multiple symbols at once. This sounds enticing but it makes table setup more complicated and slower, and also adds many complications to the decoders because everything emits a variable number of bytes now. For example, using PINSRB to group output bytes would not work if each decode step produce either 1 or 2 bytes; we would need to do individual stores for every symbol, and also increment the destination pointer after every decode. This would add at least 2 instructions per byte decoded (a store and a destination pointer add), probably more. When your single-byte-at-a-time decode kernel runs 3 instructions per byte to begin with and instruction count is a major limiting factor, adding 2 extra more instructions to maybe decode 2 bytes at a time isn’t all that tempting. We can decode another byte in an extra three instructions with the single-byte-at-a-time decoder guaranteed, and we don’t need to do any expensive extra work during table setup to do so. We also don’t build up a debt of 2 instructions every time we don’t manage to decode 2 symbols at once that we later have to make up just to break even. Multi-symbol decoding is an old standby when decoding from a single stream, because you can hide a lot of extra work in the shadow of that nasty long critical path, but decoding from multiple streams simultaneously gives you more productive ways to spend those CPU cycles and maybe even get to the holy grail of being primarily throughput bound.

And that’s all I got for this post! I’m not sure which of the other decoder variants I’ll tackle next. Apologies for the long delay, but writing these up takes more effort than my usual blog post, and I need to be in the right headspace to even try doing it.

Footnotes

[1] If you’re familiar with Intel microarchitectures but not AMD, macro-ops are roughly comparable to what Intel calls “fused-domain micro-ops”, except they’re “even CISCier”, in the sense that even read-modify-write instructions like add [rdx], rax count as a single macro-op, where Intel would split them into a add-from-memory and a store internally. I’ll also add that historically speaking describing AMD macro-ops as similar to Intel fused-domain uops is backwards; AMD has been using “fat” macro-ops as part of their x86 instruction decomposition for a long time, since at least the K7 (original Athlon) architectures. Intel added fused micro-ops (which are more restricted) years later to their microarchitectures when they realized that having the frontend deal with these “chunkier” units was beneficial.

[2] Typical L1D load-to-use times for contemporary designs were 4 or 5 cycles. For that matter, they still are at time of writing, 9 years after Jaguar-based HW hit the shelves. That said, the Jaguars target much lower clock rates than those other designs—the fastest Jaguar descendants ran a bit above 2GHz, other OoO x86 cores from the same timeframe have similar L1D sizes and were typically designed to hit 4GHz or above, so presumably the Jaguar cores can fit a lot more logic into a pipeline stage.

[3] To editorialize even more, the Jaguar’s nearly complete lack of sharp edges and huge performance cliffs was a welcome change after the PS3/Xbox 360 generation, where the main CPU cores seemed at times like they had nothing but.

[4] Presumably due to a problem with the way immediate operand encoding in x86 works. Specifying a bitfield position and width needs at least 6 + 6 = 12 bits to be generally useful on a 64-bit machine. But x86 immediate operands for instructions with 32- or 64-bit operands only come in two sizes: 8 bits and 32 bits. The former is not enough, the latter is very wasteful. 16-bit immediates only exist for 16-bit register instructions, and this part of the encoding would be quite expensive to add exceptions to. Interestingly AMD added a short-lived immediate-operand version of BEXTR that indeed spends a full 32 bits, but this version of the instruction decodes to 2 macro-ops on Jaguar and was removed from Zen 3. It never shipped on any Intel CPU.

[5] All “big core” Intel CPUs that support BEXTR (that I’m aware of, anyway) decode it into 2 uops. The same CPUs can usually eliminate register-register moves during register renaming and would also take 2 uops for a shift-and-mask combination, so BEXTR has never been particularly interesting on them, since it offers at best some minor advantages in the frontend. It’s more interesting on the Intel Atom-derived cores such as the Alder Lake E-cores (which like Jaguar have a 1 uop, 1 cycle version) and AMD Zen cores, though.

[6] Yet another AMD/Intel difference: Intel has been renaming the 8-bit parts of registers such as AL and AH of RAX or R9B of R9 separately for a long time, meaning AL and AH can reside in separate locations in the physical register file. When referencing the merged halves as a single register (such as AX, EAX or RAX) later, Intel CPUs used to either stall (the “partial register stall” of long ago) or, later, started inserting merge operations that combined the results into the instruction stream. AMD has never done this, and instead seems to do partial updates on every operation. That means that on Intel CPUs, code that alternately updates AL and AH can execute as two independent dependency streams, whereas on AMD CPUs all these updates run in series. However, AMD never needs to insert any merge ops either, and has no special penalty for referring to the full register after a partial update, which is handy in our use case: on the Jaguar we’re always concerned with macro-ops through the front-end, so injecting merge operations would suck for us. Good that we don’t get any!

[7] Two streams so that we can alternate decoding from them, because (as seen in this post already and also mentioned in previous parts) sequential decoding from a single bitstream tends to result in very long dependency chains and is a bottleneck. Having pairs of forward and backwards streams with the two meeting in the middle allows us to store a single size in the bitstream to signal the start position of the reverse stream and the combined size as a single value. They also can, in some contexts, act as padding for each other. During decoding, ensuring the pointers don’t cross corresponds to a end-of-buffer check; we don’t know the exact size of either stream up front, but we know that the read pointers may never cross, and once decoding is done they should point to the same location.

[8] Arguably, we should have at least renumbered the streams and swapped labels of streams 1 and 2; but as is often the case, this was quickly prototyped, found to be working, then for a while we were concerned with other things such as buffer overflow safety and such, and by the time we realized it was pretty odd for stream 2 to appear in the bitstream before stream 1, it had long shipped to customers and was very much not worth a format-breaking change to rectify.

[9] Very convenient how the bitstream layout chosen works out so nicely for the most constrained of the important target platforms for Oodle.

[10] Yet another very-much-not-a-coincidence.

[11] This one actually is a coincidence, but I’ll take it.

[12] Yes, the decoder loop keeps 0x303 pinned in a register the entire time it’s running. Long-time friends will realize why this delights me; sometimes the universe just smiles at you like that. This one’s for you, Felix.