This is a result I must have re-derived at least 4 times by now in various ways, but this time I’m writing it down so I just have a link next time. All right. If you’re encoding a BCn or ASTC block and are trying to find optimal endpoints (in a least-squares sense) for a given set of weights, you are led to the linear least-squares problem

$$\min_X \; \| A X - B \|^2$$

where A is an n-by-2 matrix of weights, X is a 2-by-c matrix of endpoints we’re solving for where c is the number of color channels (most commonly 3 or 4), B is an n-by-c matrix of desired pixel values, and A looks like

$$A = \begin{pmatrix} w_1 & 1 - w_1 \\ w_2 & 1 - w_2 \\ \vdots & \vdots \\ w_n & 1 - w_n \end{pmatrix}$$

Solving this leads us to the Normal Equations

$$(A^T A) \, X = A^T B$$
which is ultimately a regular linear system with multiple right-hand sides. is just a 2×2 matrix, so solving this is very easy (Cramer’s rule is the most popular approach). The only problem being, of course, that this matrix can end up being singular. There’s one obvious case where
ends up singular: when all weights are the same (making the two columns of A linearly dependent). The question is, are there any other cases where we end up with a singular system?
In short: no. Write

$$A^T A = \begin{pmatrix} a & b \\ b & c \end{pmatrix} = \begin{pmatrix} \sum_i w_i^2 & \sum_i w_i (1 - w_i) \\ \sum_i w_i (1 - w_i) & \sum_i (1 - w_i)^2 \end{pmatrix}$$

(because it’s symmetric), then its determinant is

$$a c - b^2 = \Bigl(\sum_i w_i^2\Bigr) \Bigl(\sum_i (1 - w_i)^2\Bigr) - \Bigl(\sum_i w_i (1 - w_i)\Bigr)^2$$

which per Lagrange’s Identity equals

$$\sum_{1 \le i < j \le n} \bigl( w_i (1 - w_j) - w_j (1 - w_i) \bigr)^2$$

That’s a sum of squares, therefore non-negative terms, and the determinant can only be 0 if all the summands are 0. However, we can simplify

$$w_i (1 - w_j) - w_j (1 - w_i) = w_i - w_i w_j - w_j + w_j w_i = w_i - w_j$$
and therefore the determinant is 0 (and our system singular) if and only if all the weights are the same, since we’re summing over all possible pairs of weights and a single pair with a non-zero difference is sufficient to make our sum positive.
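For concreteness, here’s a small C sketch of the resulting solve for one channel (the names and the exact weight convention – interpolated value = w*e1 + (1 - w)*e0 – are my choice for illustration, not anything from a particular codebase): it accumulates the three unique entries of A^T A plus the right-hand side, then applies Cramer’s rule, bailing out when the determinant is (numerically) zero, i.e. when all weights agree.

#include <math.h>
#include <stddef.h>

// Least-squares endpoints for one channel: minimize
// sum_i (w[i]*e1 + (1 - w[i])*e0 - b[i])^2 by solving the 2x2 normal
// equations (A^T A) x = A^T b, with A's rows being (w_i, 1 - w_i).
// Returns 0 if the system is singular (all weights equal).
static int solve_endpoints_1ch(const float *w, const float *b, size_t n,
                               float *e0, float *e1)
{
    float ww = 0.0f, wv = 0.0f, vv = 0.0f;  // entries of A^T A, with v_i = 1 - w_i
    float wb = 0.0f, vb = 0.0f;             // entries of A^T b
    for (size_t i = 0; i < n; ++i) {
        float wi = w[i], vi = 1.0f - w[i];
        ww += wi * wi;  wv += wi * vi;  vv += vi * vi;
        wb += wi * b[i];
        vb += vi * b[i];
    }

    // det = sum over pairs of (w_i - w_j)^2, so it's 0 iff all weights match.
    float det = ww * vv - wv * wv;
    if (fabsf(det) < 1e-12f)
        return 0;  // singular; caller has to pick endpoints some other way

    *e1 = (wb * vv - vb * wv) / det;  // Cramer's rule
    *e0 = (vb * ww - wb * wv) / det;
    return 1;
}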
Hi. I’m Fabian “ryg” Giesen, one of the co-authors of Oodle, originally made and sold by RAD Game Tools, now (after an acquisition in late 2020) officially Epic Games Tools as per the business license. Everyone (even at Epic) still mostly refers to us as RAD though; it’s been a long-standing name and nobody sees much point in a rebranding campaign. There are several misconceptions about Oodle etc. floating around. There are a lot more technical details on the RAD home page and in the docs (if you’re a licensee) but this is a shorter version that tries to give the executive summary of what is what in a few paragraphs each.
What is Oodle?
Oodle is a suite of data compression-related libraries, with three main ones: Oodle Data, which does lossless data compression – think ZIP/Deflate, 7zip/LZMA, Zstd, LZ4, that kind of thing. It was originally designed for storing game assets on disk, but it’s lossless and can be used for whatever data you want. The algorithms in Oodle Data generally emphasize fast decompression. For example, Kraken decodes around 4x faster than Deflate/ZIP for typical data.
Then there’s Oodle Network, which focuses especially on UDP network packets. These are often much smaller than typical file data (frequently around 100 bytes or less). Oodle Network is also lossless, but is optimized for small, independent amounts of data. It’s slower to decode but has essentially no per-packet overhead.
Finally, we have Oodle Texture, which takes images and encodes them to the “BCn” family of GPU compressed texture formats. Oodle Texture provides high-quality compressors for these formats and supports RDO (Rate-Distortion Optimization), which encodes textures in a different way that takes the same amount of space in GPU memory but will be much smaller on disk after applying a standard lossless compressor like Deflate, Zstd or, of course, Oodle Data.
Kraken
Oodle Kraken, or just Kraken when the context is clear, is one of the algorithms supported by Oodle Data. It’s a good jack-of-all-trades that usually offers much better compression than say Deflate/ZIP and slightly better compression than Zstd, while also being very fast to decode.
Kraken (in our usual software implementation as part of Oodle Data) turned out to be quite popular among PlayStation 4 games and as a result was licensed by Sony for use in the PlayStation 5, which implements a decoder in hardware, making Kraken decompression on PS5 essentially “free”. This decoder was developed by Sony and AMD.
Sometimes, PS5 game package sizes are compared with the sizes of the corresponding game packages on PS4. While Kraken is likely part of any observed size differences, the typical difference between Kraken and a much older format like Deflate/ZIP is around 10-15% or so. In the lossless compression world, even a one-percentage-point difference is considered a big deal, but we’re not talking huge differences here. Some PS4 games are much smaller on PS5, and although the details depend on the individual game, in general, when that happens, it’s almost certainly not primarily due to Kraken. PS4 games are designed to load from a fairly slow hard drive and often duplicate data on disk multiple times to avoid long seek times. PS5 game packages, on the other hand, are always stored on an SSD where seeks are not a major concern. The PS5 packaging tools automatically find and eliminate large duplicated chunks of data, so when PS5 versions of PS4 games are much smaller, this is likely to be the single biggest contributor. All credit here is due to the people at Sony who design and maintain these packaging tools.
Mermaid, Selkie, Leviathan
Oodle Data provides algorithms other than Kraken. Mermaid is faster to decode than Kraken but provides lower compression. Selkie is even faster and even lower compression. In the other direction, Leviathan takes longer to decode than Kraken but provides higher compression.
All of these algorithms provide different trade-offs. Game developers can and do choose between them based on their CPU and disk space budget. The latter is often still subject to certain “magic numbers” for games that get physical releases on Blu-Ray discs or similar, where say a 99GB game fits on a 100GB Blu-Ray disc while a 105GB game does not. So even though we’re usually talking about differences of a few percentage points here, sometimes those few percent really do make a big difference.
Oodle Texture
Oodle Texture is, first and foremost, an encoder for BC1-BC7 format textures (collectively referred to as BCn). These texture formats are always lossy. BCn textures are designed for random access on the GPU and chop up images into small, fixed-size blocks. BCn textures are already compressed, but usually get stored on disk with an extra layer of lossless compression applied. Intuitively, think of it like this: because BCn blocks are fixed-size, “easy” regions of an image are sent with more bits than they need, which is the price we pay for fast random access. We’re OK with this once textures are in memory, but on disk it’s just a pointless waste of space, and the extra layer of compression fixes it.
Oodle Texture aims to provide very high-quality encoding while still being relatively fast (to keep artist iteration times manageable). Its main feature is RDO. Oodle Texture RDO makes the BCn encoder aware of that second layer of compression happening, and explicitly tries to make the “easy” blocks more compressible. This does introduce some extra error but can reduce on-disk and download sizes of textures considerably, often by more than 2x in shipping games. It does not make a difference either way in VRAM usage, since the data in VRAM is in the fixed-bit-rate BCn formats.
Oodle Texture RDO results are still just regular BCn textures – they just happen to compress much better when passed into most lossless compressors. There are no extra decode steps required at runtime.
We try quite hard to provide really good BCn encoders. The goal is that at typical settings, Oodle Texture RDO results have similar error to most other (non-RDO) encoders. BCn are lossy formats, so some amount of error is in general unavoidable. But our goal is to provide the same level of fidelity that you’d get without Oodle Texture, just with smaller disk and download footprint.
It’s been a while! Last time, I went over how the 3-stream Huffman decoders in Oodle Data work. The 3-stream layout is what we originally went with. It gives near-ideal performance on the last game console generation’s AMD Jaguar cores and is also a good fit for 32-bit x86 and ARM targets, both of which were still pretty important targets for us at the time the scheme was designed (today, pretty much anything new uses 64-bit code, so that’s what we’re designing for; sub-optimal performance on 32b targets is not a reason for us to keep features out of the bitstream anymore).
As per the analysis in the last part, the 3-stream decoders are getting decent issue slot utilization out of a Skylake-era x86 core, but are still fundamentally limited by the critical path latency for individual symbol decodes and the bit buffer refill. We’re also using a lot of extra instructions compared to the minimum necessary to shave a few cycles off that critical path latency. Instead, what we can do is optimize for instruction/issue slot count and throughput at the expense of a slight increase in latency, and then use a lot more streams. Ideally, if we have enough independent work to keep the machine filled, the critical path latency stops being the primary limiter altogether.
That said, we still have those 32-bit and Jaguar targets who prefer the 3-stream layout and could maybe use 4 streams without introducing a lot of extra spills, but certainly not a lot more. The compromise we use in Oodle Data is that we introduced a 6-stream layout that uses 3 streams for the front half of the data being coded, and an independent 3-stream group for the back half. In effect, we’re doing two 3-stream decodes (with two separate output buffers) at the same time. The platforms that would rather have 3 streams decode the two in sequence instead.[1]
Many-stream logistics
The number 6 was chosen because it cleanly divides into two sets of 3 (for the platforms that prefer 3), but it also happened to work out in other ways: it turns out both 4 and 8 streams are a bit awkward in terms of register use on x86-64 and 32-bit ARM, and would introduce several extra spills. Probably wouldn’t be a huge deal, but the odd counts are there for a reason.
Anyway, register use is the watchword in general here: the scheme described in the 3-stream article uses 3 registers worth of state per stream: the actual bit buffer, the “number of bits present” count, and the stream read pointer. On x86-64, we have 16 integer GPRs, sort of. The stack pointer is really spoken for, though, which leaves us with 15. 6 streams at 3 registers of state per stream would need 18 registers’ worth just for the bit stream state; the stream read pointer is only used during refill so spilling those during the core decoding loop isn’t the end of the world, but even so, that’s 12 registers allocated, and we still need an output pointer (actually two in our case), a pointer to the decoding table, and at least one or two registers worth of temporary space… and that’s more registers than we have available already. Spilling and reloading around the bit buffer refill isn’t ideal, but workable. When just cycling between our 6 streams means some amount of register thrashing, though, we’re paying an extra tax not just per every 5 symbols decoded, but for every single symbol. The idea with using more streams was to be able to use fewer instructions per decode to use the machine more efficiently. If we’re saving one instruction’s worth of work per symbol and then adding two instructions’ worth of spill and reload, we’re not doing great. This can still be worth it if we get better instruction-level parallelism (ILP) potential out of it, but we’d rather not.
Therefore, our goal is to be efficient both in terms of instructions executed and in terms of register state we need to keep around. This old post of mine conveniently presents a solution, in the form of bit reader “variant 5”, the throughput-optimized one. The actual scheme is explained in that post, so I won’t reiterate it here. Just know that the basic idea is to get rid of the “bits consumed” counter per stream by setting the MSB of the 64-bit bit buffer upon refill (then shifting the whole word down if parts of the first byte were already consumed earlier), and determining the consumed bit count using a leading zero count. Conveniently, shifting out just-consumed low bits using an unsigned bit-shift right also updates the consumed bit count for us. Therefore, we save both the extra register for the counter and the extra instructions for maintaining it. The trade-off is that this version of bit-buffer IO doesn’t have the “lookahead” property where the refill load can start before the final consumed bit count of the previous iteration is done. Therefore, we pick up one more load latency along the critical path.
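In case it helps to see the whole scheme in one place, here’s a rough C rendering of that throughput-optimized bit reader (my paraphrase of the idea, not code from Oodle; bounds/overlap checks are omitted, the stream is assumed little-endian and padded so the 8-byte loads are safe, and __builtin_clzll is the GCC/Clang intrinsic):

#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t bits;       // bit buffer: data in the low bits, marker bit above it
    const uint8_t *ptr;  // read cursor into the (forward) byte stream
} BitReader5;            // "variant 5"-style reader; initialize with bits = 1ull << 63

// Refill: the leading zero count of the old buffer is the total number of
// bits consumed since the last refill (the marker started at the MSB).
// Advance by whole bytes, reload, set a fresh marker, then shift out the
// few bits of the current byte that were already consumed.
static void br5_refill(BitReader5 *br)
{
    unsigned consumed = (unsigned)__builtin_clzll(br->bits);
    br->ptr += consumed >> 3;            // advance pointer by full bytes
    uint64_t v;
    memcpy(&v, br->ptr, 8);              // unaligned little-endian load
    v |= 1ull << 63;                     // marker bit
    br->bits = v >> (consumed & 7);      // drop leftover consumed bits
}

// Peek at the next 'count' bits (they sit in the low bits of the buffer).
static uint32_t br5_peek(const BitReader5 *br, unsigned count)
{
    return (uint32_t)(br->bits & ((1u << count) - 1));
}

// Consume 'count' bits: a plain right shift both removes the bits and
// moves the marker down, which is what stands in for the separate
// "number of bits consumed" counter.
static void br5_consume(BitReader5 *br, unsigned count)
{
    br->bits >>= count;
}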
Enough build-up: what does a single-symbol decode look like in this form? Assuming BMI2 is available, it’s this:[2]
andn rcx, rMask, rBits ; peek (rMask = ~2047)
movzx ecx, word [rTablePtr + rcx*2] ; table fetch
shrx rBits, rBits, rcx ; bit buffer update
mov [rDest + <index>], ch ; emit byte
Four instructions and 5 uops (4 fused-domain; the store is 2 ops, one store address and one store data which counts as a single uop in the fused domain, the others are 1 uop each) per symbol decoded, which is pretty sweet. Note we splurge on an extra register to keep the constant ~2047 just so we can do the masking with ANDN, which has a non-destructive 3-register form, instead of MOV/AND. The extra MOV doesn’t matter that much when MOV elimination is active, but due to CPU bugs this was actually later disabled in several CPU models, so this is more consistent. Also note both our destination pointers need to be kept within the low 8 registers so the move-from-CH instruction encoding is available, or else it takes an extra shift instruction and uop. (That’s weird quirks of the x64 encoding for you.) The scheme from the previous article takes 5 instructions and 6 uops per symbol decoded (if we use the same ANDN trick, one more of each otherwise), so this change saves us one instruction and one register’s worth of state per stream, with the latter saving further instructions that would otherwise go into spills and reloads if we used 6 streams naively.
The critical path for this is simple: the first three instructions are on it (for any given stream), the last one is not. The AND and shift have 1 cycle of latency each, the indexed load has 5 cycles (on Skylake), for 7 cycles total. Continuing with last article’s assumption of around 4 fused-domain uop issue slots per cycle (the reality is a bit more complicated, but this is good enough), over these 7 cycles, we get 4×7 = 28 issue slots, of which we’re filling 24 when we’re doing 6 streams. When we’re doing this 5 times in a row, that means 5×4 = 20 unused issue slots by the core decode; we do have other things in the loop such as constant loads, pointer increments, the loop condition, the stream overlap checks, and maybe some spills and reloads around the refill stage, so these remaining slots don’t necessarily have to go to waste either. We don’t get to fill every issue slot every cycle anyway because the scheduling and load balancing in the CPU is not perfect, but this is still a promising setup.
Refill
How do we do the bit buffer refill with this? Well, this again follows the basic logic from “Reading bits in far too many ways (part 3)”. If we implement that naively, we get something like this (per stream):
lzcnt rBits, rBits ; # bits consumed
mov rcx, rBits
shr rcx, 3 ; # of full bytes consumed
add rBitsPtr, rcx ; advance ptr
and rBits, 7 ; leftover bit count
mov rcx, [rBitsPtr] ; load next
or rcx, rMarker ; set marker bit (MSB)
shrx rBits, rcx, rBits ; consume leftover bits
This is already not bad, but note that most of this is on the critical path for the refill as written. The most annoying part of that is the lzcnt which has a 3-cycle latency on most Intel parts. However, note that the rBits value we do the lzcnt on is always the shrx result of the final decode from that stream, and nothing else looks at the contents of the bit buffer after said shift. Therefore, we can juggle things around to do the leading zero count early. The final decode for each of the 6 streams does:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
lzcnt rBits, rBits ; early lzcnt
movzx ecx, word [rTablePtr + rcx*2] ; table fetch
add rBits, rcx ; lzcnt update
mov [rDest + <index>], ch ; emit byte
Instead of a final shrx, we use an add instruction (making the bits past the initial 8 bits garbage, but we can deal with that), and we move the lzcnt up so it happens concurrently with the table load (which has 5 cycles latency and is on the critical path). That takes the leading zero count off the critical path entirely and doesn’t need us to execute any extra instructions. We do need to clear the garbage bits that end up at bit 8 and above in rBits afterwards. However, if we look at the refill code given above, that’s a trivial fix (funny how that works out):
; # bits consumed already in low 8 bits of rBits
movzx rcx, low_byte(rBits)
shr rcx, 3 ; # of full bytes consumed
add rBitsPtr, rcx ; advance ptr
and rBits, 7 ; leftover bit count
mov rcx, [rBitsPtr] ; load next
or rcx, rMarker ; set marker bit (MSB)
shrx rBits, rcx, rBits ; consume leftover bits
The code still has some other details to take care of. Because we treat 6-stream blocks as a pair of 3-stream blocks, we still have one backwards, reverse-endian stream in each half, we need to maintain 2 separate output pointers, we still need to do the stream-crossing checks and have our loop condition, and so forth. All of that is routine and I won’t go into it here, other than noting that the reverse-endian streams require a 64-bit bswap for their refill, which adds 2 uops and 2 cycles of latency to their respective critical paths (on Intel), so those streams have the longest critical-path latency.
Analysis and results
I already went into the core symbol decode step before, noting that it takes 4 instructions, 4 fused-domain uops, and 5 unfused-domain uops per symbol. That leaves the refill as the other critical-path candidate.
With the tweaked final decode described above, the leading zero count is off the critical path, but we still count it as 1 uop that effectively belongs with the refill (it also changes one shrx to an add, but that doesn’t make a difference at the level of analysis I’m doing in this series, although I will note that this is also slightly beneficial since there are more execution ports that can take integer adds than there are ports that execute shifts).
For the refill, that leaves us with 8 instructions executed (including the lzcnt), all single-uop. The critical path for the refill (after the final rBits update for a stream) goes movzx, shr rcx, add rBitsPtr, the mov load, or rcx, shrx rBits. All of these except for the load are 1 cycle latency, the load is 4 cycles on Skylake and derivatives, 5 on newer uArchs. That’s 9 or 10 cycles critical path latency through the refill, respectively, 11/12 cycles on the reversed streams that need the bswap.
We’re doing this for 6 streams; 8 uops × 6 streams = 48 uops and issue slots – plus another 4 uops for the two bswaps implied by the two backwards streams, implying a grand total of 52 uops. By our previous “about 4 unfused domain uop issue slots per cycle” rule of thumb, that’s around 52/4 = 13 cycles worth of work – we’re actually throughput limited on this portion, for once. And since we’re running this in a loop, we can expect the throughput-limited refill and “other loop overhead” portions to collapse with the latency-limited symbol decode step to some extent.
In the “production” version of the loop, it turns out there’s 15-20 fused-domain uops worth of other work (that’s not symbol decoding or refill) in this loop. I’m giving a range because 5 of these are compare-branch instruction pairs that can be macro-fused but whether that actually happens in any given instance is again a bit complicated and depends on other factors such as code alignment that are outside the scope of the level of analysis I’m doing here.
If we take the higher count of 20, this happens to line up precisely with the 20 “wasted” issue slots we noticed during the symbol decode portion. I did not notice this before writing this post and highly doubt that things shake out anywhere near this cleanly in practice, because machine behavior gets fairly “twitchy” in the high-ILP regime and any stray disturbance such as non-ideal scheduler instruction picks or individual L1 cache misses causes long ripple effects. Nevertheless, it does suggest that on Skylake and its many derivatives, we expect to be, just about, throughput-bound.
This time we’re decoding from 6 streams, 5 symbols each, for 30 symbols decoded per loop iteration. We expect to take 4 issue slots per symbol decode (120 slots total), 8 issue slots per refill (48 slots total), and 20 issue slots worth of bookkeeping etc. per loop iteration, for a total of 188 issue slots. With our 4 issue slots per cycle rule of thumb, if everything works out perfectly, that comes out to 47 cycles per iteration exactly.
For critical path latency, we have 6 independent streams, and the slowest ones are the two byte-reversed ones. Their critical path goes through the symbol decode and refill. We have 5 decodes of 7 cycles each plus a refill of 11 cycles, giving us a latency estimate of about 46 cycles if everything works out right.
In short, under very idealized circumstances, we expect to be throughput-bound (or close to), and hitting around 47/30 = 1.57 cycles/symbol decoded.
How fast does the “production” version of this loop actually run on a Skylake machine, in a real workload, on real data? This is where I would ordinarily post results of a test run, but as I’m writing this, the Skylake test machine I’ve been using for this series isn’t letting me remote in for some reason, so I don’t have same-machine results to post just yet. The values I remember are around 1.83 cycles/symbol for 55 cycles per loop iteration, off from our ideal throughput estimate by 8 cycles or about 17%. (I’ll try to remember to update that with recent measured figures once I get that machine back up and running). That’s a higher discrepancy than I’m comfortable with, but we’re running into the limits of what we can do with our very basic back-of-the-envelope estimates here. A more detailed analysis using uICA gives a lower-bound estimate of 49.75 cycles for Skylake, with the listed bottleneck being “Decoder”. That is about 10% smaller than the actual observed figures on real workloads that I remember, which as noted in the previous installment of this series is a pretty typical ratio for real-world results (involving cache misses, branch mispredictions, other process/OS interference etc.) vs. throughput estimates. Either way, at around 1.83 cycles/symbol observed for 6-stream compared to 2.79 cycles/symbol observed for 3-stream, we get a 1.53x speed-up from doubling the number of streams.
I do have actual measured results on my home desktop machine, which as of this writing is an AMD Ryzen 9 7950X3D (Zen 4) at 4.2GHz nominal. On that machine, when doing single-threaded Huffman decoding using the 3-stream BMI2 decoder, I get around 2.08 RDTSC ticks (which are not actually clock cycles due to dynamic frequency scaling) per symbol, while the 6-stream BMI2 variant gets around 1.37 RDTSC ticks per symbol, for a speed-up factor of around 1.52x from using 6 streams – almost exactly the same ratio, on a different microarchitecture. (There are also dedicated Zen-optimized versions of the inner loop, which exploit some specifics of AMD’s Zen microarchitectures, and average around 1.79 RDTSC ticks/symbol for 3-stream and 1.29 ticks/symbol for 6-stream on the same machine; in this case, the speed-up works out to “only” 1.39x.)
For what it’s worth, around 1.5x is also about the level of speed-up I see on wider 64-bit ARM machines, although the really wide ones (e.g. Apple M1/M2) can see even larger speed-ups from 6-stream.
On said Zen 4 machine, the pure 6-stream Huffman decoding BMI2 inner loop decodes around 3.07GB/s (in our case, 1 symbol = 1 byte) on a single core, with the system otherwise mostly idle, i.e., running the OS and desktop but nothing particularly taxing. When using the Zen-optimized variant, it hits 3.25GB/s.
The entire Huffman decode on the same data, same machine averages 2.52GB/s on a single core (using the straight BMI2 variant, 2.65GB/s for the Zen-optimized variant); this includes other tasks like reading the Huffman code lengths from the bit stream, setting up decode tables, as well as the careful end-of-stream processing.
Discussion and conclusion
The original 6-stream design started as a thought experiment about how many streams a purely throughput-bound Huffman decoder would have to be using on a theoretical x86 with as many registers as we wanted, but the same basic instruction latencies as the Haswell/Broadwell/Skylake machines that were current at the time (early 2016). Once it became clear that the 4-instruction/symbol decode step was not hard to do and the resulting stream count worked out to “around 6”, exactly twice what we were shipping previously and without bumping into register count limits, the (admittedly slightly off-kilter) pair-of-3-streams design coalesced as something that was easy to fit into the existing bitstream structure and could just be ignored on targets that didn’t benefit from it, while giving us a considerable speed boost on the rest.
In aggregate, this series of posts is meant both as an explanation of the methods used and more generally, as an illustrative worked example about what designing a data format for higher ILP potential looks like. A “standard” single-stream Huffman decoder is fundamentally bottlenecked by its single serial dependency chain, but the “right” number of dependency chains for a given task is not necessarily straightforward, nor is it always easy to figure out by experimentation. For example, the throughput-optimized decode step used in this version of the Huffman decoder is strictly worse than the latency-optimized decoder for typical 3- or 4-stream setups, so unless you’re aware of the underlying dynamics, you wouldn’t even try it for anything larger. Likewise, the latency-optimized approaches that work best for smaller stream counts run out of steam as more streams are added (mostly from extra spills), suggesting that there is no further win to be had from increased stream counts when in fact there is.
The particular numbers here are always subject to change; the original 3-stream count in the Oodle Kraken format was motivated primarily by the AMD Jaguar cores that were our most important target at the time, and the 6-stream version, although it happens to work out fine, was just a convenient multiple. As CPU designs get wider, the optimal number of streams keeps increasing, provided other likely bottlenecks keep getting fixed alongside.[3] That said, a not-quite-ideal 6-stream arrangement on a machine that would do even better with 9 is still much preferable to your usual single serial bitstream decode.
If nothing else, take this series as a practical demonstration of just how much wasted potential there is on current hardware (“current” here really meaning “anything built in the 21st century”) in serialization formats with a single stream of variable-sized data where each token needs to be decoded to be able to start decoding the next. Of course decoding speed is far from the only concern in practice, but it should at least be considered, when too often it is ignored entirely. There are definitely problems with multi-stream formats (especially when consuming data incrementally or in a streaming fashion), so they are not a panacea, but in many use cases it is completely fine to decode data in small batches (on the order of tens of kilobytes) with some buffering, and in those cases a multi-stream design can offer substantial benefits, especially in write-once read-many scenarios.
Finally, one lengthy digression on a complementary strategy: decoding multiple Huffman symbols at once. The basic idea here is straightforward: if we’re doing a table lookup per symbol anyway, and if the whole point of variable-length codes is to assign shorter code to more common symbols, can’t we decode multiple symbols from a single table lookup, provided those symbols are all fully contained in the portion of the bitstream we peeked at? And yes, of course we can, but there are considerable logistical difficulties if we look again at the final per-symbol decode step we ended up with:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
movzx ecx, word [rTablePtr + rcx*2] ; table fetch
shrx rBits, rBits, rcx ; bit buffer update
mov [rDest + <index>], ch ; emit byte
This design uses 2 bytes per table entry so it doesn’t have space to decode multiple symbols at once. The easiest generalization is to allow decoding either 1 or 2 symbols per step. That means per table entry, we now need the combined length of all symbols decoded, the values for the first and (potentially) second symbol decoded, and a symbol count (or other flag) for whether the table entry describes 1 or 2 symbols. Take a basic layout of a 32-bit table entry containing 4 bytes (len, nsyms, sym1, sym2) for a concrete example (that’s not the only possible design point, but it is a sensible one). Note that our table entries are now twice the size (and much trickier to initialize), and furthermore, for each table lookup, we don’t know anymore whether we’re going to decode 1 or 2 symbols, so we can’t use immediate offsets for our stores anymore; in every decode step, we need to figure out what the symbol count is and advance the output pointer. We also either need multiple loads or do a single combined 32-bit load and shift to get access to the various fields; pick your poison. My best bet at a workable 1-or-2 symbol decode step looks something like this:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
mov ecx, dword [rTablePtr + rcx*4] ; table fetch
shrx rBits, rBits, rcx ; bit buffer update
movzx rNumSyms, ch ; extract nsyms
shr ecx, 16 ; align output byte(s)
mov [rDest], cx ; emit output byte(s)
add rDest, rNumSyms ; advance write pointer
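In rough C, mirroring the asm above (this is just my restatement of the (len, nsyms, sym1, sym2) layout for clarity, not shipping code):

#include <stdint.h>

// 32-bit table entry, byte layout (len, nsyms, sym1, sym2):
//   bits  0.. 7: combined code length of the 1 or 2 symbols
//   bits  8..15: number of symbols in this entry (1 or 2)
//   bits 16..23: first decoded symbol
//   bits 24..31: second decoded symbol (don't-care when nsyms == 1)
static uint8_t *decode_1or2(uint64_t *bits, const uint32_t *table, uint8_t *dest)
{
    uint32_t e = table[*bits & 2047];   // peek 11 bits, fetch entry
    *bits >>= (e & 0xff);               // consume the combined length
    uint32_t nsyms = (e >> 8) & 0xff;   // 1 or 2
    dest[0] = (uint8_t)(e >> 16);       // first symbol
    dest[1] = (uint8_t)(e >> 24);       // second symbol (harmlessly overwritten later if unused)
    return dest + nsyms;                // advance by however many symbols were real
}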
Note that despite still using oddball tricks like the movzx-from-ch to get (effectively) a bitfield extract, which comes with strings attached (rNumSyms can’t be just any register for this to work), we’re now using 7 instructions and 7 fused-domain uops (8 unfused-domain) to decode either 1 or 2 symbols. Furthermore, we’re adding a secondary dependency chain on the outputs between adjacent logical streams: the add rDest, rNumSyms needs to complete before the next stream can finish its output. That’s not the end of the world since only the stores depend on it, but it’s an extra potential for slowdown.
This is just not all that appealing. As noted above, we already get to be mostly throughput-bound (actually frontend, but at least we’re not constantly waiting on dependency chains anymore) on the original Skylake target, using 4 instructions per symbol decoded. This modified version will in the best case only ever write 2-symbol pairs and use 7 instructions to do so, for 3.5 instructions per symbol decoded. But that’s with a 100% success rate. Even if we manage to decode a pair of symbols a staggering 75% of the time, we’re still taking 7 instructions to decode a single symbol in the remaining cases, which works out to an average of 0.25 × 7 + 0.75 × 3.5 = 4.375 instructions per symbol – still worse than a single symbol decode, although to be fair, the double-symbol decodes end up doing a refill less often when they hit. Considering that, at around a 75% pair decode rate we’d probably break even or see a slight speed-up using multi-symbol decoding.
That’s still not particularly promising though; from multi-symbol decoding, we’d like to see a lot more than “break even around the point where we decode 2 symbols at once 3/4 of the time”. Moreover, from the entropy of the symbol stream, we can work out how big the decode table needs to be to be able to successfully decode symbol pairs on most decode steps. Oodle Data uses a (very low) code-length limit of 11 bits to facilitate infrequent refills. With our existing table size of 2048 entries, we can expect to see frequent pairs of symbols if the entropy is around 5.5 bits per symbol or so, which is very low. Our typical data for things like literals after LZ compression is more frequently around 7.2-7.8 bits per symbol (each of which is from an 8-bit alphabet). We would need something on the order of 2^15 = 32768 table entries for pair decodes to be common; at 4 bytes per table entry, that’s 128KiB, which is definitely outside the L1D cache, and in fact takes up a sizeable fraction of the L2 cache as well. So now we’re expecting to miss the L1D cache on those table accesses as well, meaning our expected latency per symbol decode for the memory access goes from around 5 cycles to around 16-20, which completely blows up our assumptions. Now we have nowhere near enough streams to hide that latency again, and we also need to spend the time to set up these giant tables; as you can see from the numbers quoted in the preceding section, the time spent setting up the decoding tables even for our small (2048-entry, 2 bytes/entry) configuration is not negligible, and multi-symbol decode tables need more space per element, more elements, and are more expensive to set up.
Add all these together, and the case for table-driven multi-symbol decoding in the general case just doesn’t look great: it increases the amount of work per decode step, which is disproportionately felt when the decode steps are otherwise lightweight or there are multiple streams available to compensate for critical-path latency. It increases table build time, which can be a significant fraction of overall decode time if tables change frequently, table building isn’t optimized (it often isn’t) or tables are large, and it often ends up requiring impractically large tables to be able to decode pairs to begin with.
That doesn’t mean that it doesn’t have its uses, especially in cases where you know the expected entropy is low. For example, the JPEG decoder in stb_image normally decodes AC coefficients in two steps, first reading a (run,level) pair (this is encoded into a single 8-bit value as per the JPEG spec) and then reading an extra number of bits determined by the “level” value to figure out the coefficient value. Most AC coefficients are very small (mostly ±1 or ±2), so stb_image’s JPEG decoder additionally has the “fast AC” table that combines decoding (run,level,extra_bits) all into one table lookup (into a 1KiB table) in the typical case when the AC coeffs are very small.
However, for general-purpose lossless (de)coding in software, multi-symbol decoding just isn’t very compelling in my experience.
And that’s it for this post and the Huffman portion of this series, unless there’s interest in some of the other targets. (There’s also a suite of optimized ARM decoders, both 32- and 64-bit, and while they have some cool tricks I’m happy with, they don’t add anything substantially new to the material already covered so far.) I’m leaving the door open to do another follow-up on the non-Huffman portions of Oodle Data’s entropy coding (especially tANS), but I don’t have any concrete plans yet. We’ll see!
Footnotes
[1] This does give us a full 6 independent streams when batch-decoding the entire contents of a Huffman block up front, but is still effectively 3 streams when decoding incrementally. Not a problem in the regular SW Oodle decoders, but there are other “streaming” implementations that don’t get a benefit out of the 6-stream layout. The platforms we support that greatly prefer 3-stream batch decoding were much more important to our customers than the streaming use case at the time this was designed, so this was a sensible trade-off.
[2] You might notice that this code has a tendency to keep all shift counts explicitly in rcx, that exact register, not just a named temporary like I do for the other register references. That’s because even though I’m showing the BMI2 versions of everything, we also support non-BMI2 paths for machines that don’t have it, and in those we want to use the shr reg, cl forms. We also use the store-from-ch instruction, so the register receiving the table load needs to be one of eax, ebx, ecx or edx anyhow; it’s easiest to just make rcx the dedicated temporary register in this code and make sure everything that needs to be a shift count ends up in there; that way, the BMI2 and non-BMI2 paths can be very similar.
[3] Dougall Johnson has done some experimentation on Apple M1/M2 hardware and according to him, going up to even 12 streams is still beneficial, but I haven’t tried to replicate this myself.
Most standard texture compression formats use a type of algorithmic vector quantization (meaning that instead of storing an explicit codebook of possible blocks, the codebook entries are determined by an algorithm that uses the encoded block as an input). This is the case for all the BCn formats, ETC/EAC, and ASTC, but not PVRTC, where the decoded pixels near the edge of a block also depend on adjacent blocks, which makes the encoding process more of a global problem (so the notes in this post do not apply).
Furthermore, the contents of the encoded blocks can usually be partitioned into three categories:
- Header/mode bits, which determine how the remaining contents of the block are to be interpreted.
- Endpoints, which specify some of the colors present in the block in some form. The endpoints are then expanded into a larger palette of colors, and pixels select which color they want using
- Indices into the palette. Pixels generally pick the index that gives the closest match in the palette according to some error metric.
The expansion process from specified endpoints into the full palette frequently involves some form of linear interpolation, and the error metric is usually some variant of squared error (possibly weighted), for numerous reasons.
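As a small illustration of what that looks like, index selection against an interpolated palette boils down to something like the following sketch (one channel, evenly spaced palette, plain squared error; real formats differ in palette size, rounding and per-channel weighting):

#include <stddef.h>
#include <stdint.h>

// Build an evenly spaced palette between two endpoints and pick, for each
// pixel, the palette index with the smallest squared error; returns the
// total error so a caller can compare different endpoint candidates.
// palette_size is assumed to be between 2 and 16.
static double fit_indices(float e0, float e1, int palette_size,
                          const float *pixels, size_t n, uint8_t *indices)
{
    float palette[16];
    for (int i = 0; i < palette_size; ++i)
        palette[i] = e0 + (e1 - e0) * (float)i / (float)(palette_size - 1);

    double total_err = 0.0;
    for (size_t p = 0; p < n; ++p) {
        int best = 0;
        float best_err = 0.0f;
        for (int i = 0; i < palette_size; ++i) {
            float d = pixels[p] - palette[i];
            if (i == 0 || d * d < best_err) { best = i; best_err = d * d; }
        }
        indices[p] = (uint8_t)best;
        total_err += best_err;
    }
    return total_err;
}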
I have now several times run into people who thought that encoding textures at reasonable speed was impressive because this problem looks like certain well-known hard problems such as Integer Least Squares. This is not the case, for at least two orthogonal reasons:
- Being able to reduce a problem to some variant of Integer Least Squares, or Integer Linear Programming, does not automatically make it hard in a computational complexity sense. The reduction in a hardness proof needs to go the other way: you need to be able to turn an arbitrary instance of a known hard problem into an instance of your problem to prove that your problem is hard. It is entirely possible, and frequently quite easy, to go the other way and turn trivial problems into instances of NP-hard or worse problems (usually by accident).
- Super-polynomial growth of the best-known solution to a problem as a function of problem size N matters when the problem size depends on the input, but this is not the case for block-based texture compression. When the N for some super-polynomial algorithm is bounded by a constant independent of the size of the input (in this case that would be the dimensions of the texture), the asymptotic complexity of whatever algorithm you use to solve an NP-hard subproblem does not really matter.
In particular, if you chop an image/texture into fixed-size blocks (i.e. block size doesn’t depend on the input) that are then encoded fully independently, and encoding any given block takes a bounded amount of time, that makes encoding of the full image linear time in the number of input pixels, even if encoding any individual block is a combinatorial nightmare. You might not like the constant factors in front, but it’s still going to be linear time.
For the purposes of computational complexity, behold the following universal algorithm to encode compressed textures:
best, best_dist = None, None
for E in enumerate_coded_blocks():
    D = decode_block(E)
    dist = distance_metric(D, original_block)
    if best_dist is None or dist < best_dist:
        best_dist = dist
        best = E
return best
This is a perfectly fine linear-time algorithm with the only problem being that even with something as modest as 64-bit blocks, it’s not expected to finish before the heat death of the universe. Don’t worry though, this post is not just an exercise in pedantry.
Namely, the other thing we can immediately do from the decomposition of a block into its constituent parts above is to shrink the search space drastically, and of course practical encoding algorithms do just that.
- Not all header/mode bits need to be explored. If desired, methods that still want to guarantee optimality can use branch-and-bound techniques to prune the search. In practice, nobody actually wants to spend the extra time (or cares about the generally tiny improvements from) guaranteeing optimality, so more commonly heuristics are used here as well.
- Formats that split their blocks into subsets let you do parts of the solve separately, which shrinks the search space considerably where it applies.
- Given a set of endpoints, the source pixels, and a distance metric, it is pointless to try all possible choices of indices. Once we fix the endpoints for a subset (and possibly some index constraints resulting from that choice), the optimal indices can generally be determined directly. This direction is usually cheap to solve exactly, but the endpoints are a somewhat awkward search space.
- In the other direction, if we fix a set of indices, the source pixels, and the distance metric, we can try to solve for the optimal endpoints given that set of indices. This direction is usually tricky to solve exactly, but the search space is easier to prune to a small subset. It’s the basis of cluster fit methods like the original algorithm in the S3TC patent and methods derived from it.
The actual reason I’m writing this post is that for several of the simpler formats, the resulting search space is small enough that exhaustive search is completely viable. Not so much for real-world texture encoding – it’s too slow and the benefits are way too small for that. For BC1-5, it’s also pointless because the decoder isn’t exactly specified. But it’s definitely doable to get reference results if you want to see what optimality actually looks like.
In particular, for BC3 alpha or BC4 blocks, the set of possible endpoint/mode combinations is just 16 bits (the ordering of the endpoints selects one of two modes), so 65536 possible choices of endpoint pairs. Determining the optimal endpoints in 1D isn’t even hard (and can be done exactly using fairly straightforward math), and it’s all very easy to accelerate using SIMD, using a GPU, or both. This is eminently viable – viable enough that you can not only test individual blocks (or small image snippets) this way, you can encode full textures this way just fine. Sure, encoding a 1024×1024 texture that way takes seconds instead of milliseconds but that’s not long to wait for a reference result at all.
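A brute-force reference search along those lines might look like the sketch below. I’m using the commonly documented BC4 palette construction here with plain integer truncation; as noted above, the decoder isn’t exactly specified for these formats, so treat the interpolation details as an assumption of this particular reference rather than gospel.

#include <stdint.h>

// Reference BC4 palette for an endpoint pair: e0 > e1 selects the 8-value
// interpolated mode, e0 <= e1 the 6-value mode with 0 and 255 appended.
// (Integer truncation stands in for whatever rounding a real decoder does.)
static void bc4_palette(int e0, int e1, int pal[8])
{
    pal[0] = e0; pal[1] = e1;
    if (e0 > e1) {
        for (int i = 2; i < 8; ++i)
            pal[i] = ((8 - i) * e0 + (i - 1) * e1) / 7;
    } else {
        for (int i = 2; i < 6; ++i)
            pal[i] = ((6 - i) * e0 + (i - 1) * e1) / 5;
        pal[6] = 0; pal[7] = 255;
    }
}

// Try all 65536 endpoint pairs for one 16-pixel block and return the pair
// with the smallest total squared error (the optimal indices fall out of
// the per-pixel minimum).
static uint32_t bc4_exhaustive(const uint8_t px[16], int *best_e0, int *best_e1)
{
    uint32_t best_err = ~0u;
    for (int e0 = 0; e0 < 256; ++e0) {
        for (int e1 = 0; e1 < 256; ++e1) {
            int pal[8];
            bc4_palette(e0, e1, pal);
            uint32_t err = 0;
            for (int p = 0; p < 16; ++p) {
                int best = 255 * 255 + 1;
                for (int i = 0; i < 8; ++i) {
                    int d = (int)px[p] - pal[i];
                    if (d * d < best) best = d * d;
                }
                err += (uint32_t)best;
            }
            if (err < best_err) { best_err = err; *best_e0 = e0; *best_e1 = e1; }
        }
    }
    return best_err;
}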
BC5 is just two BC4 blocks, one per channel, and they’re completely independent, so this also applies to BC5. BC1 color blocks are still 64-bit blocks and have a 32-bit space of possible endpoint pairs. However, 2^32 = about 4 billion “interesting” encodings (since we don’t care about sub-optimal choices of indices) is still small enough that exhaustive testing is totally viable on either CPU or GPU if you’re using an optimized search, a beefy CPU/GPU, parallelization and are willing to wait a bit to encode a single 4×4 pixel block. (In this case, you probably would not want to encode a full 1024×1024 texture this way.) And if we can do BC1 and BC4, we can also do BC2 (the alpha part is trivial) and BC3 (effectively separate BC1+BC4, they decompose). In short, BC1-5 are actually totally practical to do exhaustive encoding these days, and the same applies to ETC and EAC.
Would you use this for a practical encoder? Definitely not. Likewise, it’s not viable for the more complex formats like BC6-7 or ASTC. Nevertheless, it felt worth pointing out that optimal BC4-5/EAC encoding (in the least-squares sense anyway) for a known decoder is eminently possible even at home for real-world-sized textures if you’re willing to wait for a few seconds (and that’s with no pruning whatsoever). BC1-3/ETC are more involved, but still at a timescale where determining optimal solutions (given a known decoder) for a small image patch is something you can easily do on a regular home computer even without using any pruning strategies if you’re willing to wait a few hours.
Now the actual application given here is obviously very specific to what I happen to be working on, but the reason I’m writing it up is that I’m trying to make a few more general points: first, “this looks like what I know is an NP-hard problem” is no reason to assume something is actually hard (never mind infeasible) in practice, even if you actually want optimality – hardness reductions work the other way round, and this is by no means the only instance of NP-hard subproblems of bounded size that I’ve run into, which are a very different beast. Second, exploiting even some very basic symmetries can sometimes take combinatorial optimization problems from “never going to happen” to “that’ll be 5 minutes”, so it’s worth looking. Third, computers are fast. Trying on the order of 100,000 things if said trials don’t take long is just not that big a deal and hasn’t been for a long time. For that matter, trying 4 billion things has also been entirely practical for a while, as long as individual trials are in the “100s to 1000s of cycles” range (i.e. don’t try this with slow interpreted languages) and you have a few cores. For the past 15 years or so, I’ve just tested pretty much every unary function on 32-bit inputs (especially floats) and every binary function on pairs of 16-bit inputs exhaustively on all possible inputs, unless there was a very compelling reason why that wasn’t possible.
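For the unary-function case, “test exhaustively” really is as mundane as it sounds – something along these lines, with f_test/f_ref as hypothetical stand-ins for whatever approximation and reference you’re comparing, and a bit-exact comparison as the (strictest possible) pass criterion:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float f_ref(float x)  { return x * 0.5f; }  // placeholder reference
static float f_test(float x) { return x / 2.0f; }  // placeholder function under test

int main(void)
{
    uint64_t mismatches = 0;
    uint32_t bits = 0;
    do {
        float x, a, b;
        memcpy(&x, &bits, 4);         // visit every possible float bit pattern
        a = f_test(x);
        b = f_ref(x);
        uint32_t ua, ub;
        memcpy(&ua, &a, 4);
        memcpy(&ub, &b, 4);
        if (ua != ub)                 // bit-exact compare (also distinguishes NaN payloads)
            ++mismatches;
    } while (++bits != 0);            // wraps back to 0 after 0xffffffff
    printf("%llu mismatches\n", (unsigned long long)mismatches);
    return 0;
}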
UPDATE May 7, 2023: I wrote this post yesterday somewhat in a huff (for reasons not worth going into) and the original post contained several inaccuracies. These have been corrected in this version and I’ve marked the places where I was wrong [like this]. Let’s just say I wish I hadn’t posted this in its original form, but I did, and I’d rather make the corrections publicly now than pretend it didn’t happen.
BitKnit is an LZ77+rANS lossless general-purpose compressor that was designed 8 years ago (1.0 release on May 20, 2015) to replace the aging and very slow to encode LZ77+arithmetic coding codec built into Granny 3D, and also worked – somewhat unintentionally – as an LZ77/entropy coding test vehicle. The codec did end up in Granny and later a small variation (with slightly different bitstream to work within the different container) in Oodle 2.1.2. In Oodle it was quickly superseded by the “oceanic cryptozoology” (first Kraken, then later Mermaid/Selkie and Leviathan) codecs, so BitKnit the actual codec is just a historical curiosity these days (and deprecated in current Oodle versions), but it was our first released code to use several ideas that later ended up in other Oodle codecs [the original version implied that these ideas started in BitKnit; that’s incorrect], and might be of general interest. So I’ll keep it brief and focus mainly on the parts of it that actually got successfully used elsewhere.
Basic overview
BitKnit (I’ll just write BK from here on out) is basically your usual LZ77 with entropy coding backend. Like LZX and then LZMA, it keeps a history of recent match offsets that can be cheaply reused; BK was designed mainly for Granny files, which are chock full of fixed-size records with highly structured offsets, so unlike LZX’s LRU history of 3 and LZMA’s of 4 recent match offsets, it keeps 7. This is somewhat expensive to maintain in the encoder/decoder and mostly a waste for ASCII text and text-based formats, but turns out to be very useful for the structured binary data that was its intended use case.
For entropy coding, it uses a 2-way interleaved rANS coder with 32-bit state that emits 16-bit words at a time, and 15-bit “semi-adaptive” (aka deferred summation) multi-symbol (i.e. not binary) models. Semi-adaptive meaning that they change over time, but not after every symbol; model updates are batched and the model remains static in between, which lets it amortize update cost and build some auxiliary data structures to accelerate decoding. The particular model it ended up with was a compromise to hit its speed targets, and its update mechanism wasn’t great. It was significantly faster than LZMA/CABAC/etc.-style “update after every symbol” fully adaptive models, but with appreciably worse compression, and slower to decode than a fully static model would’ve been (which would’ve also enabled tANS usage). Our later codecs jettisoned this part of the design; Oodle LZNA (as the name suggests, a LZMA-derived design) used multi-symbol rANS but with full adaptation after every symbol. Kraken, Mermaid and Leviathan (but not Selkie, which is a byte-aligned LZ with no entropy coding) support multiple choices of entropy coders but they’re all based on static models.
BitKnit’s use of a semi-adaptive model makes it easy to do the modelling and most of the encoding in a single front-to-back pass. Static models are trickier to encode for because there are interdependencies between e.g. the LZ parse and the statistics of the symbols emitted, which makes encoding harder and somewhat slower, but such is life. The actual rANS bitstream needs to be encoded in reverse though (we prefer our decoders to consume data in increasing address order). To limit encoder-side memory usage, BitKnit processes data in chunks of 64k (uncompressed) at a time; the rANS bitstream is flushed after every chunk, which adds a small bit of overhead but lets us use a guaranteed bounded amount of memory in the encoder for arbitrary input sizes, improves its memory access patterns, and is also handy from a file format perspective. (Old post about rANS in practice, derived from how we used it in BK then LZNA, here). All the Oodle formats are also internally chunked, and have been forever (it’s how the framing format works) and I can recommend this strategy in general.
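To make “semi-adaptive” a bit more concrete, here’s roughly what a deferred-summation style model update can look like (the batch size, the halving of old counts and the renormalization policy are all made up for illustration – BitKnit’s actual update rule differed, and as mentioned it wasn’t great; initialization of the tables is also omitted):

#include <stdint.h>

#define NSYMS     256   // alphabet size
#define PROB_BITS 15    // coding frequencies target a sum of 1 << PROB_BITS
#define BATCH     1024  // symbols between model rebuilds (illustrative)

typedef struct {
    uint32_t freq[NSYMS];     // current coding frequencies (static within a batch)
    uint32_t cum[NSYMS + 1];  // cumulative table derived from freq
    uint32_t pending[NSYMS];  // counts gathered since the last rebuild
    uint32_t since_update;
} SemiAdaptiveModel;

// Per-symbol cost is just an increment; the actual table rebuild only
// happens once every BATCH symbols.
static void model_update(SemiAdaptiveModel *m, int sym)
{
    m->pending[sym]++;
    if (++m->since_update < BATCH)
        return;

    // Fold the batch into the frequencies, halving the old counts so the
    // model can still drift, then renormalize to (roughly) the fixed total.
    uint64_t total = 0;
    for (int i = 0; i < NSYMS; ++i) {
        m->freq[i] = (m->freq[i] >> 1) + m->pending[i] + 1;  // +1 keeps every symbol codable
        m->pending[i] = 0;
        total += m->freq[i];
    }
    uint32_t running = 0;
    for (int i = 0; i < NSYMS; ++i) {
        uint32_t f = (uint32_t)(((uint64_t)m->freq[i] << PROB_BITS) / total);
        m->freq[i] = f ? f : 1;
        m->cum[i] = running;
        running += m->freq[i];
    }
    m->cum[NSYMS] = running;  // may be slightly off 1 << PROB_BITS; a real
                              // implementation distributes the rounding error
    m->since_update = 0;
}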
Entropy coding experiments
While working on BitKnit, we were experimenting a lot with different entropy coding options, and much of it ended up in our other formats in some way or other [my colleague Charles Bloom also did a lot of experimentation at the same time, there was a lot of bouncing ideas back and forth in a bunch of mail threads, and the previous version of this post implied I did all that work, which is not true.] My blog post “Mixing discrete probability distributions” was written while working on BK and underlies both its semi-adaptive models and the follow-up “Models for adaptive arithmetic coding”. The latter are what’s used in Oodle LZNA, mostly with 8-, 9-, 16- or 17-symbol alphabets, using a SIMD decoder. This style of model with a larger radix than binary needs to sustain a much smaller number of symbol decodes per second than the equivalent binary arithmetic coder does, and especially combined with interleaved rANS coding gives quite a nice speed boost in the decoder. The caveat is that they’re usually not quite as good in terms of compression ratio as the “binary tree”-style adaptive binary models, but I consider it a great trade-off that even now, 8 years after writing it up, feels under-explored to me. (LZNA did use it; our follow-up formats didn’t simply because we’ve generally moved away from design points where adaptive coding makes sense entirely, since in our application space, decoder speed is very important.)
The actual models that ended up in LZNA were designed by my colleague Charles Bloom; ultimately, the adaptive coding direction was rejected as being too slow for the rates BK was targeting (not least because some of the platforms BK was meant to work on didn’t have suitable integer SIMD either), but it was a great fit for LZNA’s slower target. Alas, it too didn’t last too long; Leviathan, which came out a few years later, usually gets quite close – sometimes better – at something like 10x the decode speed, using purely static models.
LZ literals
The main problem with batched model adaptation is that it takes quite a bit longer for the encoder (or decoder) to respond to changes in symbol statistics. LZMA’s backend can pivot rapidly on such changes: when an unexpected (i.e. low-probability) symbol is seen, its probability is immediately boosted drastically. In my experiments, even batched updates that run every 2 symbols (i.e. update after every second symbol) did much worse than immediate updates/full adaptation on LZ literals, i.e. all the bytes not covered by string matches. Other types of symbols fared much better in my testing, but that’s of little use, because literals usually make up most of a compressed LZ data stream both by symbol count and by encoded size. If the literals have to be fully adaptive, that’s most of the runtime cost of full adaptation already, which as already mentioned was slower than what BK was aiming for.
Because BK was meant for Granny (a 3D mesh and animation exporter/runtime toolkit), most of the data I was looking at for mesh exports in particular was mesh vertex data, topology (in the form of index buffers), and textures. Vertex data is laid out in the form of fixed-size structs containing fields like vertex positions, normals, UV coordinates, and so forth. Meshes often have a fair bit of redundancy between vertices, where the bytes encoding positions of vertices frequently match positions of other nearby vertices; it’s common for authored meshes to have axis-aligned planar components that have a lot more structure still, and both the normals and UV coordinates in such cases tend to be highly repetitive as well, although usually with some variation. Indices are lists of small integers that are usually “more or less” increasing (meaning the average value of the indices you see in a sliding window of recent indices will steadily go up over time, although there’s of course outliers). Some textures are quite noisy, and LZs don’t do great on those (you can improve things slightly with methods like PNG’s filters). Some are “vector-y” with very geometric features and large flat areas, and LZs do much better on those. And some are “vector-y” with gradients, which again LZs don’t do awesomely on (it somewhat depends on gradient orientation), but they obviously have very nice structure in the byte values.
What most of these have in common is that delta coding (i.e. sending values as differences to a prior value, rather than directly) tends to work well on them. And from a different direction, LZMA (the main codec I was comparing against for BK) has a special case in its literal coding for the literal immediately following a match: the idea is that the decoder can be fairly sure that right after a match ends, the next literal in the data stream is not what the next byte after the final matched location would have been – otherwise we would just have sent a longer match to begin with. LZMA does not strictly enforce this, and the optimal parser sometimes makes such choices because it turns out to be advantageous, but the probability of the literal byte being the same as the match byte is still quite low. So what LZMA does after a match is use a probability model for the XOR of the literal byte and the match byte.
This does two things: first, we expect that XOR to not be 0 with high likelihood, so even if everything else is completely random, we now end up with an alphabet of 255 uniformly random symbols (with a very low likelihood of a 0), instead of 256 random symbols, which is slightly better even assuming nothing about the distribution of values at all. But second, the match byte and the literal byte are not uncorrelated in practice. The first literal after a match, especially in scenarios like ours where we have arrays of fixed-size records with very regular structure, is often a “near miss”, and the match byte still ends up being a good predictor. LZMA uses a bit-wise binary tree model, so XOR is a natural predictor for that topology.
BitKnit, however, does not; it uses byte-based coding, and as explained above we have a lot of data where we expect delta coding to do well. Hence, BitKnit sends the byte difference (mod 256) between the match byte and the literal byte. [This is a part where I really got the history wrong, there’s an entire correction below.]. This worked great, and some experimentation later, it just ended up sending all literals this way, not just the first literal after a match – the first literal after a match sends the difference to the byte right after the match ended, the second literal after a match a difference to the byte after that, and so on.
This solved several problems at once: first, it acts as an automatic delta-coder for this kind of data, and is neatly integrated with the LZ, so there is no fiddly setting up of good byte distances to use for delta-ing or similar. Because these distances are determined by the match distances, it’s all automatic within the format. It’s not generally as good as a custom format-aware delta coder, but it tends to get a lot of the benefit and Just Works. Second, it turns out that (for the data I was looking at anyway) the delta-literal distribution was usually a lot less variable than the raw literal distribution, which was the key enabler for BitKnit’s use of semi-adaptive models throughout.
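To make the mechanism concrete, here’s a minimal sketch of what decoding a run of delta literals looks like once the match distance is known. This is purely illustrative (not BitKnit’s actual code); the buffer layout and function shape are just assumptions.

```cpp
#include <cstdint>
#include <cstddef>

// Each literal was sent as (literal - predicted) mod 256, where the prediction is the
// byte one match distance back. As we walk forward, the prediction naturally continues
// past the end of the previous match, giving the "first, second, third byte after the
// match" behavior described above.
static void decode_delta_literals(uint8_t* out, size_t pos, size_t match_dist,
                                  const uint8_t* deltas, size_t count)
{
    for (size_t i = 0; i < count; ++i, ++pos)
        out[pos] = uint8_t(deltas[i] + out[pos - match_dist]); // mod-256 add
}
```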
BitKnit itself always codes literals this way; this is not ideal for other types of data, but – somewhat surprisingly, to me anyway – it worked decently even on more text-based data like JSON or XML. Right around the time of BK’s original release, Charles wrote this up in his post on what he ended up calling LZ-Sub. He was using this at the same time in LZQ1, an experimental codec of his that never shipped but was used for lots of experiments that led to Kraken. Kraken, Mermaid and Leviathan all used this method and later expanded on it. Notably, neither LZQ1 nor the newer codecs go strictly one way or the other. LZQ1 could switch whenever it re-transmitted Huffman tables; Kraken and Mermaid work in chunks (128k of uncompressed data), and both of them have a per-chunk flag for whether literals should be taken as raw values or as deltas from the match bytes. The delta decoding is slightly more expensive than just copying raw literals but does tend to give decent compression ratio improvements on many sources. As with essentially all other decisions in these codecs, the decision is made from an explicit space-speed trade-off (i.e. evaluate the compression ratio gains, if any, of using delta-literals, estimate decompression cost, and make the decision by weighing both factors according to a user-configurable trade-off). Leviathan goes further down this path and adds several other literal modes, so there’s not just two. Leviathan doesn’t have any notion of per-byte difference other than subtraction – we tried XOR but never found any real compelling use case – and instead can optionally include other types of context during its literal modeling.
The trade-off here is that the more complicated modes are both more expensive to decode and need to be evaluated by the encoder, which adds a bit to the encode-time cost.
For an example of how this works out in practice, you can look at for example Aras Pranckevičius’ recent (at time of writing) series of blog posts. In post 2 of the series he compares several of the Oodle codecs to a bunch of other competitors and finds out that we compare very well in terms of compression ratio, speed and decompression speed. The differences in his first test are as drastic as they are largely because Aras’ test set (structured binary data, namely several large-ish arrays of data items that are 4 floats each) plays exactly to our formats’ strengths. As you might notice, this is very similar to what you would see for the “arrays of vertex data” situation that was the design goal for BitKnit and prompted many of its design decisions. Kraken and Mermaid improved on BKs design considerably, but the delta-literals in particular come straight from BK and work very well for this kind of data. Later on in the series in part 6, Aras experiments with his own data-specific delta filtering, at which point the big differences between Zstd and Kraken essentially disappear (worth noting his results are exactly on the fast to medium part of the compression speed scale; Zstd gets better bang for the buck in that range, we tend to gain some more compression ratio at the slower settings, but overall I’d describe the two as comparable). In terms of absolute numbers, the delta literals in Kraken mean that Kraken at “Optimal1” (level 5) in the first test hits about 3.71:1 compression ratio while zstd level 18 (comparable encode speed) manages around 3.04:1. The latter test has Kraken Optimal1 around 4.34:1 and zstd level 15 (closest quoted) around 4.17:1, which I’m going to call roughly even. (We could make both codecs trade places here by playing around more with Zstd’s compression levels and Kraken’s configurable space/speed trade-off setting). In short, the delta literals manage to extract about half the gain a specialized difference filter does, with no manual data-specific work required.
Lots of repeat offsets
The main other fairly unique design choice in BitKnit is to have many repeat offset slots; where LZX, LZMA and Zstd (and also Kraken, for that matter) use 3, BitKnit uses an extravagant 7. [Turns out, Charles had also experimented with this some years prior, but that one I legitimately didn’t remember, and either way it’s a matter of changing a single constant, I’m sure it’s been independently discovered many times.] As with everything else in BitKnit, this comes down to its genesis as a codec for a data format leaning towards fixed-size records. In short, the extra repeat offset slots are useful for the data it’s designed for, whereas for many other types of data they don’t really help. The extra repeat offsets rarely outright hurt compression, but maintaining them does take time in the decoder.
One weird wrinkle in BitKnit that we didn’t take forward into other formats is the replacement policy. The standard method is to use “move to front” coding, i.e. repeat offset slots are numbered from most to least recently used. With BitKnit’s many offset slots, I found that I got some slight wins by not inserting new offsets into the precious “0” (frontmost) slot; instead, new offsets got inserted into slot 2, and the only way for an offset to make it into slots 0 or 1 is to get reused at least once. This was a slight win in BitKnit but didn’t make it into any of our newer codecs; notably this was designed into BitKnit before I had an optimal (or optimizing anyway) parser working, and was a small but solid win with the heuristics-driven parse I was using in the encoder at the time. When we re-evaluated this idea for other formats later, we found that optimizing parsers are good at managing the recent offset slots anyway so there wasn’t much benefit to this. Leviathan did end up using 7 recent match slots (largely because it provided benefits on certain types of highly structured data, same as I saw in BK) but with standard MTF/LRU update rules. (I wrote up how BitKnit implements its offset history years ago.)
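For illustration, here’s a rough sketch of that update policy; again not BitKnit’s actual code, and the struct layout and function names are mine.

```cpp
#include <cstdint>
#include <cstring>

// 7-slot repeat-offset history with the "insert new offsets at slot 2" policy described
// above. Reusing a repeat offset is plain move-to-front; a brand-new offset has to be
// reused at least once before it can reach slots 0 or 1.
struct OffsetHistory {
    uint32_t slot[7];

    void use_repeat(int i) {             // the offset in slot i was reused
        uint32_t off = slot[i];
        memmove(&slot[1], &slot[0], i * sizeof(uint32_t));
        slot[0] = off;
    }

    void insert_new(uint32_t off) {      // a fresh offset was coded
        memmove(&slot[3], &slot[2], 4 * sizeof(uint32_t));  // slots 2..5 shift to 3..6
        slot[2] = off;
    }
};
```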
And that’s it! I’m not sure if these notes are valuable to anyone else, but BK was a fun little codec to work on (and it was little – decoder, heuristic encoder with multiple speed levels and optimizing encoder all fit into a 2700-line C++ file, with extensive comments) and it led to many ideas that we elaborated on further in our follow-up codecs.
More details on the literal models
I got the history wrong in the first version of this post, and some digging into old emails and commits later the details are actually more interesting than I remembered, so for what it’s worth, here’s some more history on how the various RAD codecs ended up having delta-coded literals as a mode.
Early BitKnit started with the idea of building something with adaptive multi-symbol arithmetic coding built on rANS. Multi-symbol arithmetic coders have existed for a long time, but they were mostly avoided due to requiring a divide, which is fairly expensive. At the time (2014), new-ish PCs had 32-bit integer divide latencies that were not unreasonable (clock cycle counts in the high 20s, on a dedicated divider that didn’t block anything else), but some of the targets we were still shipping code for had division that took 100+ cycles and blocked all integer execution for the entire time, which was problematic to say the least. Encoding for rANS still needed divisions but decoding did not, so this was very attractive as simultaneously being faster than a regular multi-symbol decoder on newer PCs, and not being unacceptably bad on more constrained targets.
The constraint was that to be able to use a divide-free approach, we needed to track estimates for a probability distribution in a way that kept the sum of all scaled probabilities at a fixed power of 2 at all times. That’s where the earlier “mixing discrete probability distributions” comes in. Early BK first used this to build a variant of LZMA’s binary-tree probability models that worked on groups of 4 bits (nibbles) at a time, and also a variant of LZMA’s literal-after-match (“LAM”) model, “Nib-LAM”.
The regular nibble models maintain 17 probability distributions over 16 symbols each: one distribution for the high nibble of a byte, and 16 more for the low nibble depending on what the value of the high nibble was. “Nib-LAM” keeps 48 distributions over 16 symbols: 16 distributions for the high nibble of the literal given the high nibble of the match byte, 16 distributions for the low nibble of the literal if those high 4 bits all agreed (indexed by the low nibble of the match byte), and 16 distributions for the low nibble of the literal if those high 4 bits didn’t agree (indexed by the high nibble of the literal that we just sent).
Early versions of BK used this Nib-LAM model, but unlike LZMA (which only uses this type of model for the first byte after a match), even the very first versions of BK I could find in source control always used this for later literals too (just extending the most recent match forward). I honestly don’t recall whether this was intentional or just a mistake, but literal coding in BK always worked this way, from the very beginning, and this gelled nicely with the delta literals.
Around the same time, my colleague Charles Bloom was experimenting with what turned into LZNA, and also played around with the Nib-LAM models. These were sometimes better than the bit-wise binary tree LAM models used by LZMA, sometimes worse; them being a bit worse makes sense if you think of them as a “chunkier” approximation of a bit-by-bit model, but more interesting was why they were sometimes doing a lot better.
He noticed a property that is easiest to see in the case when the high nibbles of the literal and the match byte agree: we have 16 possible contexts (one for each of the possible high nibbles of the match byte, which agrees with the high nibble of the literal), each storing a distribution over 16 possible values. That is, when those high nibbles match, we track probabilities for all 16*16 possible combinations of (literal low nibble, match low nibble), and in particular can pick up any correlation between those, even when we have a mismatching bit somewhere in the middle (LZMA’s bitwise LAM models “detour” to a different part of the model immediately when that happens). And in particular, he noticed that on files where the Nib-LAM models did surprisingly well, he could do even better by sending straight-up delta literals.
As far as I can tell, that’s how the delta literals in their described form ended up in BK: generalizing the LZMA LAM model idea to a higher radix to get Nib-LAM; using Nib-LAM on all literals (not just the first after a match); Charles noticing that in the cases where that really improves on LZMA you can do even better by sending straight deltas; and then the deltas ended up in BK as the primary literal mechanism, replacing Nib-LAM for good as part of a push for speed, since Nib-LAM was just too slow for BK’s target speeds and the delta literals worked fine with semi-adaptive models.
Most of the significant work on all three of LZNA, LZQ1 and BitKnit happened concurrently (and with many emails being sent back and forth…) from late February to late May of 2015. The original version of this post was poorly written; for the record, I’m not trying to claim sole credit for anything described in here, nor do I want to erase the contributions of others on those same email threads at the time (which included not just Charles and me, but at the very least also Jeff Roberts, Sean Barrett, and Won Chun).
In the previous post I talked about things you might want to know as someone who uses FFTs; this part covers all kinds of FFT implementation details, including the underlying reasons for a lot of the API complexities that showed up last time. I’ll also give some recommendations on what I think are good ways to write an FFT right now, and presumably also going forward.
Algorithm zoo
There’s a ton of FFT algorithms to choose from, but practically speaking, the broad Cooley-Tukey family of algorithms (I include variants like split-radix as part of the general family) is by far the most important.
If you don’t already know how CT FFTs work, I would recommend avoiding any sources that derive the algorithm by manipulating sums of trigonometric functions (doubly so if they do it with real sin/cos instead of complex exponentials). Way too much noise and index manipulation; it’s nearly impossible to see what’s going on. I would also not recommend the other extreme, sources that explain FFTs purely in terms of signal flow charts/butterfly diagrams, because while that form can be useful to look at particular transforms (until they get too large to comfortably fit in a sensibly-sized diagram anyway), the first few (or last few, depending) levels of the decomposition are in some sense special cases (because their twiddles are trivial), and with a flow-chart form that is most of what you see. Since you’re forced to look at a concrete realization that is generally not too big and thus mostly special cases, it can be quite hard to see what the actual general construction is supposed to look like.
The least “noisy” introductory explanation I’ve ever seen for a CT radix-2 FFT (provided you know some basic linear algebra) wrote down an 8×8 DFT matrix as powers ω^0, …, ω^7 of the 8th root of unity and proceeded to find and use the symmetries in it, then explained the general rule once you’d seen the particular example (it’s not too hard to figure out yourself once you start that way). I like this form: having the powers of the root of unity exposes the critical structure that is being used, but obscures the fact that of course ω^2 is i and ω^4 is -1, which is precisely the property that makes small signal-flow-graph based descriptions confusing to me, because it leads to simplifications that break up the regularity somewhat.
If you have the required background (meaning you know what polynomial rings are and how they work), I’ve found ring-theoretic approaches the most insightful in terms of showing the actual structure of the problem without getting bogged down in the details (and in index manipulation) as tends to happen to varying degrees when working with sums of trigonometric terms, signal flow graphs or matrices. Honestly I think the most I’ve ever gotten out of a single source on the subject would be D. J. Bernstein’s “Multidigit multiplication for mathematicians” (section 7), which covers the underlying algebra for many of the major FFT variants in about 2 pages.
DIT vs. DIF
All Cooley-Tukey family algorithms come in two dual variants, Decimation-in-Time vs. Decimation-in-Frequency. With the recursive decomposition as introduced by Gentleman and Sande, a DIT FFT for even N is decomposed into first computing two size-N/2 DFTs on the even and odd elements respectively. These two DFTs on the even and odd halves of the data can then be combined into a DFT for all N samples by first multiplying the odd-half coefficients by so-called “twiddle factors” and then adding and subtracting the even and twiddled odd halves from each other (“butterflies”).
Splitting into even and odd halves is done at every level of this recursion, and always before any of the data is modified. We would like the actual computation parts of the kernel (at the various sizes) to work on contiguous elements; what this ultimately results in is a big data permutation at the start that does all of the even/odd splits at once (which turns out to be a bit-reversal permutation: if indices i are in { 0, …, N-1 }, the item at position i is sent to position bit_reverse(i), which literally reverses the order of bits in the binary expansion of the number).
Therefore this (now standard) version of Cooley-Tukey consists of first a permutation on the input, and then all the follow-up work can be conveniently done in-place. (We’ll go into memory access patterns later.)
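For reference, a minimal sketch of that reorder step (out-of-place, power-of-2 size; this is the simple version, not a cache-friendly one):

```cpp
#include <cstddef>
#include <complex>

// Reverse the low log2n bits of i.
static size_t bit_reverse(size_t i, unsigned log2n)
{
    size_t r = 0;
    for (unsigned b = 0; b < log2n; ++b) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

// Out-of-place bit-reversal copy: after this, the in-place DIT butterfly passes can
// run directly on "out".
static void bit_reverse_copy(const std::complex<float>* in, std::complex<float>* out,
                             size_t n, unsigned log2n)
{
    for (size_t i = 0; i < n; ++i)
        out[bit_reverse(i, log2n)] = in[i];
}
```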
The dual version of this approach is Decimation-in-Frequency (DIF). In DIF, you start with butterflies, this time between the first and second half of the input. Then you twiddle the second half and finally recurse into size-N/2 DFTs on the first and second halves (which are already contiguous), and once those partial DFTs are done, you interleave them. Again this interleaving happens as the last step at every level of the recursion, so you can skip it in the middle of the recursion and end up with FFT coefficients out of order – bit-reversed in fact, so you have a bit-reversal permutation at the end. In short, it’s the exact opposite of the DIT decomposition (if you invert a DIT FFT by reversing the flow graph, you end up with the corresponding DIF FFT, and vice versa).
The bit-reversal permutation is possible to do in-place, but it’s annoying and relatively slow. DIT FFTs, which need to do the reordering first, often prefer to be out-of-place: in their first pass, they apply the input permutation while copying to the destination, and then can work in-place for the rest of the computation. This can be viewed as a feature by saying that they leave the original data intact. DIF FFTs also want to do their computation in-place and have the reordering last; so they either overwrite the input completely leaving a completed but permuted FFT in the original input, and then do a final reordering step that copies to the destination, or they can be – very conveniently – fully in-place and leave the coefficients in a permuted order. This latter variant is the type of FFT that can give you permuted-order coefficients more quickly, which you might not mind as long as you’re only doing convolutions with it. The inverse would then use the corresponding DIT factorization, skipping the initial permutation step, and in fact ending up with zero shuffling whatsoever.
This is nice in principle, however the bit-reversed order is annoying for library users that do want FFT coefficients in the conventional order, and it gets worse when other types of decomposition (e.g. radix-4 or split-radix variants) are used which have a different “natural” order for their coefficients to end up in. In short, using DIF with whatever radix you want and then returning the unpermuted coefficients does save some work, but it does so at the expense of exposing a lot of implementation details, so it’s a bit dicey to expose as part of a library API.
Finally, the elephant in the room: multiply-adds. A radix-2 DIT step ultimately boils down to a twiddle of the odd half then a butterfly, which works out to (the additions, subtractions and multiplications here are using complex numbers, and w is a twiddle factor)
out0 = in0 + w*in1
out1 = in0 - w*in1
This looks as though it might nicely map to multiply-add operations where available, and in fact it does, although not exactly in the obvious way (I’ll get to that later). For a DIF version, remember that the order of operations was reversed, and we first add/subtract and then twiddle, so we get
out0 = in0 + in1
out1 = (in0 - in1) * w
which is not in multiply-add form. We might think this is a temporary setback and we just have to factor this w into the next operation using out1 to get our multiply-adds, but that doesn’t work either: out1 ends up going into another DFT layer that looks just like ours, and both of its inputs need to be twiddled, so if we delay multiplying by w, we get something like
out0 = deferred_w0*in0 + deferred_w1*in1
out1 = deferred_w0*in0 - deferred_w1*in1
Note we have two twiddles now because in general the twiddle factors feeding into the two inputs at the next level will not be the same. We can do one multiply and then two multiply-adds, but this is not great at all, and we still are deferring one twiddle multiply for the next iteration. (Maybe there is some good trick here that I’m missing?)
In short, with DIT we get a form that is amenable to using FMAs (or other types of multiply-adds) throughout; with DIF, not so much. In my opinion that severely limits the usefulness of DIF + out-of-order coefficients as an attractive strategy to avoid permutations for convolutions, because now your DIF FFT potentially performs quite differently from its DIT inverse that shows up in the IFFT.
Therefore, these days, I’d stay away from permuted coefficient orders at API boundaries and just eat the permutation (and/or copy) to produce conventional-order results.
Different radix, arithmetic complexity
The “canonical” way to write a Cooley-Tukey FFT for power-of-2 sizes uses a radix-2 decomposition, meaning at every level we subdivide into 2 sub-FFTs of size N/2. Another popular variant uses radix-4 steps, which do a slightly different calculation to immediately subdivide into 4 sub-FFTs of size N/4. This has slightly more complicated logistics, a different output permutation (corresponding to reversing the digits in a base-4 expansion of the index this time), and can reduce the number of multiplies relative to radix-2, which is what originally sparked the interest in it – the number of multiplications used to be the primary cost in the FFT algorithm, especially in the absence of hardware multipliers, much less pipelined ones.
There’s further refinements such as split-radix, which is a hybrid between radix 2 and 4, using the fact that part of the computation in a recursive decomposition is more efficiently done as radix-2 and the remainder is more efficient as radix-4 (as you might expect, this introduces further complications to the coefficient permutation, which for split-radix is quite complex). There’s a whole sub-genre of hunting down individual stray multiplications: complex multiplications with twiddle factors that are 1, -1, i, or -i are obviously unnecessary and can just be turned into real-valued additions and subtractions (optionally with swapped or sign-flipped real/imaginary parts) with no actual multiplications. Slightly less obviously, multiplications with twiddles that correspond to ±45 and ±135 degrees, i.e. (±1 ± i)/sqrt(2), have real and imaginary parts of the same magnitude and can use a slightly cheaper rotation. And then there’s some more tricks with leaving values scaled to reduce the number of multiplies further.
All these tricks can be used systematically in the first few (or last few, depending) steps where these are all the twiddle factors that appear. Scalar implementations can also make use of them at other levels of the decomposition to special-case individual values, but for SIMD/SIMT architectures trying to special-case these rotations in the middle of larger blocks hurts way more than it helps.
More to the point, all of these techniques are about minimizing arithmetic complexity, and more specifically, reducing the number of multiplications, since the number of additions is the same for all of them – and also larger than the number of multiplies. And that’s a crucial caveat since, generally speaking, we’re on a very different cost model from counting number of individual floating-point operations these days. Let’s give a few examples to see how this matters:
- Intel Ivy Bridge (2012). Older uArchs like Sandy Bridge, Nehalem etc. are similar, all the way back down to the original Pentium Pro (1995). FP/SIMD multiplier on port 0, FP/SIMD adder on port 1, both fully pipelined, can independently receive an instruction every cycle. Thus, in the limit, FP ALU throughput is limited by whichever type of operation we have more of, and that’s additions in this case.
- Intel Haswell (2013), Broadwell (2014). FP/SIMD fused multiply-add unit on ports 0 and 1; FP adder on port 1. That’s right, HSW and BDW can do two FMAs per cycle (256b vector ones at that, too!) but only one FP add. Once again, multiplies are less of a problem than adds, funnily enough.
- Intel Skylake (2015) through Rocket Lake (2021). FP FMA on ports 0 and 1, adds also accepted on ports 0 and 1. No more add imbalance, but if we’re chasing op counts for throughput, it’s far more helpful to use FMAs efficiently than it is to minimize multiplies. (As of 2022 Alder Lake, we do get a dedicated FP adder on port 5.)
- ARM Cortex-A55 (2017). Extremely common mobile device “little” core. Two SIMD/FP pipes, the FP portion is effectively 64b wide, can do two multiplies/adds/FMAs per cycle, so again the biggest bang for the buck is by using FMAs. The newer “medium” Cortex-A710 (2021) is still two FMA-based FP/SIMD pipes although they are fully 128-bit now.
- Pretty much any GPU or DSP built within the last 20 years (and that’s conservative): built around pipelined multiply-accumulate operations. Not necessarily fused floating-point multiply-adds (which is a specific thing meaning there is only one rounding stage), but some form of multiply-accumulator, maybe fused, maybe with intermediate rounding, maybe with weird truncations in the middle, maybe fixed point, but there is some kind of multiply-add in there. If we do a multiply, we get a corresponding add for ~free, as long as we have a suitable candidate.
You get the idea. For a multiply-add based architecture, the operations we’re counting – if we’re counting anything – should be multiply-adds and not individual multiplies or adds. For a machine without multiply-adds but with separate superscalar mul and add pipes, it matters what the ratio of mul to add resources is; for integer (fixed-point), you usually get more adders than multipliers so the number of multiplications is the thing to count, but for floating-point, it was a common design point for a long time to have one pipelined FP multiplier and one pipelined FP adder, and in that case the bottleneck is whichever pipeline gets more work, which for FFTs is generally the adder. Now if your target is single-issue, has a different pipeline, or doesn’t have pipelined multipliers (or adders for that matter), the classic cost model might be relevant to you, but for newer targets with beefy vector pipelines, nickel-and-diming individual multiplies is worse than useless (since it actively introduces irregularities that are more expensive than just doing the math) and utilizing FMAs (where available) well is more important.
For what it’s worth, the classical operation costs of a power-of-2 sized FFT for the basic algorithm variants – just the leading terms:
- Radix-2: ~5N log2(N) operations
- Radix-4: ~4.25N log2(N) operations
- Split-radix: ~4N log2(N) operations
In my experience anyway, radix-2 and radix-4 can make use of all-FMAs quite easily, split-radix gets a bit iffy, so these days, if you’re gonna pick one, I think radix-4 is a solid choice.
How do I use FMAs in this anyway?
Right. So, let’s write this out. For a DIT radix-2 step (radix-4 is analogous, just more of everything, you can figure it out), the complex-valued version as given above is
out0 = in0 + w*in1
out1 = in0 - w*in1
but of course we actually need to implement this in terms of individual real operations and expand out that complex arithmetic, which turns this into
out0.re = in0.re + w.re*in1.re - w.im*in1.im
out0.im = in0.im + w.re*in1.im + w.im*in1.re
out1.re = in0.re - w.re*in1.re + w.im*in1.im
out1.im = in0.im - w.re*in1.im - w.im*in1.re
which makes it look like it’s 8 multiply-adds (I’ll use multiply-adds and FMAs interchangeably in the following, even though it’s not technically correct), but that’s clearly not a good way to do it. We could compute the shared “w*in1” part once, which takes 2 multiplies and 2 FMAs, and then we still have 2 adds and 2 subtractions left, so that’s not better in the “multiply-adds cost the same as isolated multiplies or adds” model. Now, what goes wrong here? Clearly, through the first half (the computation of out0), we can use 4 FMAs, and there’s no redundant computations so far, so that all seems fine. But in that second half we’re mostly recomputing stuff we already computed in the first half. Luckily, there is a trick, which is actually easier to see in the complex-valued version above: we can compute out1 via
out1 = 2*in0 - out0
instead. This is a multiplication by a pure real number so we can do this using 2 more FMAs, giving us this computation once we expand out into reals:
out0.re = (in0.re + w.re*in1.re) - w.im*in1.im
out0.im = (in0.im + w.re*in1.im) + w.im*in1.re
out1.re = 2*in0.re - out0.re
out1.im = 2*in0.im - out0.im
6 FMAs total. For comparison, the “classic” version using just regular multiplies and adds ends up with 4 multiplies and 6 adds/subtracts total, so our op count goes down from 10 to 6, quite a big change (NOTE: none of this is new and certainly not my invention, I’m just trying to spare you the time of digging it all up).
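Spelled out in code, the 6-FMA form looks something like this (a scalar sketch using std::fma; a SIMD version does the same thing lane-wise):

```cpp
#include <cmath>

struct Cpx { float re, im; };

// Radix-2 DIT butterfly in the 6-FMA form derived above:
// out0 = in0 + w*in1, out1 = 2*in0 - out0.
static void dit_butterfly_fma(Cpx& a, Cpx& b, const Cpx& w)
{
    Cpx out0, out1;
    out0.re = std::fma(-w.im, b.im, std::fma(w.re, b.re, a.re));
    out0.im = std::fma( w.im, b.re, std::fma(w.re, b.im, a.im));
    out1.re = std::fma(2.0f, a.re, -out0.re);
    out1.im = std::fma(2.0f, a.im, -out0.im);
    a = out0;
    b = out1;
}
```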
Oh, and regarding that complex multiplication in the middle: there are old tricks to replace the 4-mul-2-add complex multiplication with 3 muls and 5 adds, 2 of which can be precomputed if one of the factors is known, as is the case for twiddle factors, so you end up with 3 muls and 3 adds. However, as you might expect, this algorithm is more irregular, needs 3 floats for the twiddle factors instead of 2 (and also prevents you from using tricks that might shrink this further), and even though it’s 3 multiplies and 3 adds it’s not in the form of 3 multiply-adds, although you can reduce it to an add, a multiply, and then 2 FMAs, which sounds good in theory, except…
Is minimizing arithmetic operations actually the key?
In a word? No, not usually.
Here’s the problem. Take the 6-FMA snippet above, and assume we’re working scalar for now. That’s all the math to process a pair of complex values in a radix-2 DIT FFT step. You have to load the real and imaginary parts of both inputs (4 loads), load the real and imaginary parts of the twiddle factor (2 loads), and finally store the real and imaginary parts of both results, normally to the same place you just loaded them from (4 stores). In total, 6 loads, 4 stores and 6 arithmetic operations per 4 floats processed in each iteration.
This is not really a compute kernel so much as a memory streaming exercise with a part-time gig in arithmetic.
I was looking at a scalar version for simplicity, but the picture is exactly the same when looking at a vectorized version that processes multiple samples at once (because FFTs are so regular, vectorizing them is, for the most part, straightforward). It also doesn’t actually matter here whether a vectorized version uses a split or interleaved layout internally; interleaved has the real and imaginary values next to each other, but having complex numbers packed that way in SIMD registers means that instead of N “real-valued” SIMD lanes, we effectively end up with N/2 “complex-valued” SIMD lanes. The amount of useful work per individual float remains unchanged (at best; in practice, using an interleaved format inside the kernel often picks up overheads from elsewhere. I’ll get to it later).
What we’re looking at here is the real problem with radix-2 factorizations, which admittedly is especially amplified when using the FMA versions of the core kernel: the ratio of computation to memory operations is unfavorable. Now exactly how bad this is or isn’t depends on what machine you’re targeting, and what the proportion of compute to memory resources is, but more loads/stores than arithmetic ops (or even a 1:1 ratio as you would get without FMAs) is never good news. This is also the real reason why the 3-mul 3-add complex multiplication which needs 3 floats per twiddle is not that attractive: it decreases arithmetic and increases memory load even further.
Radix-4 happens to have fewer multiplies, but much more importantly, it does the work in roughly half the number of passes, and thus halves the number of coefficient loads/stores. You can reduce this even further by using yet larger radices (if you have enough registers available to keep all those values around), but the biggest single jump is from radix 2 to radix 4.
Another note on radices: there’s a distinction between radix-4 and what I’ve seen called radix-2^2, a name that feels clunky but that I’ll keep using for lack of a better suggestion. The former is an actual radix-4 step, which is to say, for DIT, we have 3 twiddles “in front” feeding into a 4-point DFT kernel (whose multiplications are all by 1, i, -1 and -i, which are not actual complex multiplications, just swapping values around and some negating). The latter is essentially two unrolled radix-2 passes chained back to back. In terms of arithmetic complexity, the difference is that a “proper” radix-4 kernel has 3 twiddle multiplies, while the radix-2^2 variant contains 4 instances of a radix-2 sub-kernel that each does one twiddle multiply. However, classic radix-4 needs to load those 3 twiddles from somewhere; radix-2^2 only needs to load 2 different twiddle factors (which both get used twice). Considering our woes with the amount of computation vs. amount of memory operations required, and that FMAs change the trade-offs for the effective cost of multiplications in a radix-2 kernel, radix-2^2 is a very interesting candidate! [1]
Point being: we can save a fair amount of unnecessary memory operations by using larger radices. This can use actual larger-radix algorithms, or it can use a radix-2^k-style approach where we may just load a bunch of values into registers and complete 2, 3 or more levels of the factorization at once while they’re there. The difference in twiddle factor loads really starts to ramp up as we do more of these: a “real” radix-8 FFT needs to load 7 complex twiddles, a radix-2^3 step only needs 3 (and classic radix-4 into radix-2 would need 4). Combining steps like this, provided we have sufficient registers, lets us increase the amount of computation in between loads and stores to a level where we’re actually keeping the FP/SIMD units respectably busy and not just spamming memory operations.
Memory operations, the second
So far, I’ve been looking at memory operations purely in terms of the number of instructions that need to execute. But what about actual cache effects?
Hoo boy. A lot has been written about this subject, too much of it overgeneralizing, using really broad strokes or just working from unstated, questionable assumptions. Lest I go full rant mode, I’ll try to avoid editorializing for the most part. A lot of what you will read on this topic, be it in textbooks, on the web, in forums (remember forums?) or papers, is either questionable or just plain bad.
The most important news first: if you’re using either the standard DIT or DIF factorizations which are in-place except for that initial (or final, depending) permutation step, then as long as you’re working on a contiguous array and the transform size is small enough so the coefficients fit in about half your L1 data cache, you’re pretty much going to be fine no matter what. Your data comfortably fits in the L1D cache, will be repeatedly accessed and kept hot, and nothing bad is going to happen.
L1D caches on CPUs these days are usually 32KB or larger, and that’s been true for a good while (it was already true on desktops 15 years ago, for example; these days mobile devices have also caught up in that regard). That’s plenty for a size-2048 complex FFT with 32-bit floats. That covers a lot of the sizes you’re going to see in practice! Beyond that, here’s some of the statements I’ve seen, and what I think are the important caveats to all of them:
“DIF has better cache behavior than DIT”: this one’s a bit tricky. With power-of-2 sizes, all FFT algorithms are going to do a lot of memory accesses with exact power-of-2 strides between them, and that is a recipe for running into “fun” memory subsystem problems. As discussed before, both DIF and DIT run the same steps, just in reverse order (relative to each other). What is true is that a DIT FFT has a mandatory permutation step in front, whereas DIF can avoid it entirely if reordered coefficients are acceptable. Also, in-place bit-reverse permutations are a tricky mess so DIT FFTs are more likely to work out-of-place and do the permutation as part of a copy – and it is easy to do that permutation in a way that has exceedingly poor memory access patterns. So, yes, a reordered-coefficient DIF FFT is easy to implement fully in-place and efficient, and it can avoid the reordering step with its memory access troubles entirely. A DIT FFT is more likely to be out-of-place and do extra copies, and if the reordering isn’t done carefully, it can have bad cache performance – this is a solvable problem, though. [2] And DIF has its own problems which I already went into earlier.
“Iterative FFTs are bad for the cache, use a recursive decomposition”: iterative FFTs will generally have an outer loop over the pass index (which determines the current transform size), and a nested loop inside it that does all sub-FFTs of that size (either as function calls or with that one loop handling all of them). Recursive approaches use, well, recursion; a size-N transform will internally call into smaller transforms for part of the data, and then either pre- or post-process them. That means that for transforms too large to fit into the L1D (or L2, or L3, …) cache, a straight iterative approach will stream the entire dataset into and out of the cache once per pass, while a recursive subdivision will eventually narrow down to a transform size that fits in the cache, and then all the sub-transforms on that subset will have the data warm in the data cache. Don’t get me wrong, this is good, but the straight linear loop structure of iterative FFTs is also good, and one recursive call per length-8 or length-16 or whatever your “leaf” transform size FFT is isn’t awesome. Mostly, I disagree with the unspoken assumption that they’re mutually exclusive. Iterative FFTs are great at handling the tons of small sub-FFTs near the leaf levels of the decomposition. Recursive FFTs are cache-oblivious and automatically “right-size” themselves. So just use both! Recurse down to a medium transform size that you’re sure is going to fit in the cache (I’d say somewhere from 256 to 1024 complex elements, depending on just how low-spec your smallest target is), and you can let the iterative approach make short work of all the tiny sub-FFTs while still getting all the nice automatic adaptation that recursive decomposition provides for larger sizes. (A sketch of this hybrid structure follows after this list of points.)
Over-generalization from old HW: this is in some textbooks, papers and forum posts. Especially late-80s and early-90s RISC workstations tended to have small, direct-mapped data caches that had completely pathological behavior on memory accesses with the right (or rather wrong) power-of-2 strides, and FFTs tend to hit all of these potholes one by one. People learned some very painful lessons in those days when they got burned quite badly by this, but set-associative caches with multiple ways are not a luxury feature these days, and they have substantially defused this particular problem. It’s still possible to hit pathological behavior, but you have to try a bit harder, and there’s decent ways to work around it.
“You should use exclusively cache-oblivious approaches/the four-step algorithm/insert other plug here” – use whatever algorithms you want, but I would recommend doing your own research and not just blindly following recommendations that some rando typed into a forum in 2002. There’s a lot of algorithms out there but many of them are quite old and written in a very different situation without floating-point hardware, with slow multipliers, little or no access to instruction-level parallelism, and very different relative costs of compute to memory. The good news is that large FFTs used to be a big-iron HPC problem and these days, even legitimately large FFTs over many million data points can be computed in a fraction of a second on a battery-powered cellphone. It’s pretty hard these days to get into a problem size for FFTs that actually runs into some serious hardware limits. The bad news is that a lot of what’s been written about FFTs dates back to times when this was a much bigger concern and hasn’t been updated since. This post is, in part, me writing up what I still remember from about 4 weeks worth of reading the literature back in 2015, trying to give you the bits that I think are still useful and applicable and intentionally avoiding all the material that is, I think, past its sell-by date.
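As promised above, here’s a minimal sketch of the hybrid recursive/iterative structure. It’s plain scalar radix-2, twiddles are computed on the fly for brevity (real code would use a table and a larger radix), the data is assumed to already be bit-reverse permuted, and the e^(-2πij/len) forward convention plus the leaf size are assumptions.

```cpp
#include <cstddef>
#include <complex>

using Cpx = std::complex<float>;

// One radix-2 DIT combine pass at block size "len" over n elements; assumes the data
// was bit-reverse permuted up front and all smaller block sizes are already done.
static void dit_pass(Cpx* data, size_t n, size_t len)
{
    const float kPi = 3.14159265358979f;
    for (size_t base = 0; base < n; base += len) {
        for (size_t j = 0; j < len / 2; ++j) {
            Cpx w = std::polar(1.0f, -2.0f * kPi * float(j) / float(len)); // sign = convention
            Cpx t = w * data[base + len / 2 + j];
            Cpx a = data[base + j];
            data[base + j] = a + t;
            data[base + len / 2 + j] = a - t;
        }
    }
}

// Iterative part: run every pass from size 2 up to n. Fine once a block fits in cache.
static void fft_iterative(Cpx* data, size_t n)
{
    for (size_t len = 2; len <= n; len *= 2)
        dit_pass(data, n, len);
}

// Recursive driver: subdivide until a block comfortably fits in cache, then go iterative.
// With bit-reversed input, the first half holds the even-index sub-FFT and the second
// half the odd-index one, so the halves are contiguous.
static void fft_recursive(Cpx* data, size_t n)
{
    const size_t kLeafSize = 1024;   // assumption: tune to your targets' cache sizes
    if (n <= kLeafSize) {
        fft_iterative(data, n);
        return;
    }
    fft_recursive(data, n / 2);
    fft_recursive(data + n / 2, n / 2);
    dit_pass(data, n, n);            // combine the two halves at this level
}
```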
Finally, a disclaimer: I have personally written and used FFT kernels for small to medium problem sizes (everything still fits comfortably in a 256k L2 cache). I have never actually had reason to use FFTs with transform sizes above 32768 in an application, nor have I optimized or debugged an implementation that provides them, so anything that concerns large transforms specifically (which shifts us even more into a memory-bound scenario) is not something I can comment on.
Stockham’s auto-sorter
There’s a Cooley-Tukey variant usually credited to Stockham that avoids the initial (or final, for DIF) bit-reversal, digit-reversal or whatever permutation by instead performing the reordering incrementally, interleaved with the processing passes. This was popular on vector machines and is sometimes suggested for SIMD.
It works just fine, but I’m not a big fan, for two reasons. The first is that Stockham gives up on any in-place processing entirely and instead “ping-pongs” between two buffers on every pass, which means it has about twice the active working set in the cache as an in-place algorithm (and more traffic in upstream caches or memory when it doesn’t fit anymore). The second is that this picks up extra data-shuffling work in every single pass; a batched reorder can be combined with other input processing (for example to turn data into an internal format), maybe an initial radix-2 or radix-4 pass, and also be done in a way that avoids thrashing the cache unnecessarily, because it’s a data-movement pass at heart and can be designed with that in mind. Then the actual workhorse FFT passes can focus on the math and memory access and don’t have data shuffling in the mix as well.
SIMD
So far, I’ve discussed everything pretending we’re working with scalars. But FFTs (power-of-2 sized ones, anyway) are generally pretty easy to vectorize.
If you have the data given as separate arrays for real and imaginary parts, it can be almost as simple as replacing every scalar load/store with a vector load/store and every scalar arithmetic operation with the corresponding vector arithmetic operation. You might need to handle the first few (or last few, depending) passes specially, where you’re working first with pairs of adjacent elements and then groups of 2, 4 etc., but once the distance between elements you’re processing becomes larger than your vector width (which is what most of your passes will be working with), it’s straightforward.
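As a sketch of what one of those straightforward passes looks like with split re/im arrays and AVX+FMA: the per-pass twiddle table layout is an assumption, and this only covers passes where the butterfly distance is at least the vector width.

```cpp
#include <immintrin.h>  // AVX + FMA; compile with e.g. -mavx2 -mfma
#include <cstddef>

// One radix-2 DIT pass over split re/im arrays, vectorized 8 samples at a time.
// "half" is the butterfly distance (half the current block size) and must be a
// multiple of 8 here; tw_re/tw_im hold the "half" twiddles used by this pass.
static void radix2_pass_avx(float* re, float* im,
                            const float* tw_re, const float* tw_im,
                            size_t n, size_t half)
{
    for (size_t base = 0; base < n; base += 2 * half) {
        for (size_t j = 0; j < half; j += 8) {
            __m256 wr = _mm256_loadu_ps(tw_re + j);
            __m256 wi = _mm256_loadu_ps(tw_im + j);
            __m256 ar = _mm256_loadu_ps(re + base + j);
            __m256 ai = _mm256_loadu_ps(im + base + j);
            __m256 br = _mm256_loadu_ps(re + base + half + j);
            __m256 bi = _mm256_loadu_ps(im + base + half + j);
            // t = w * b (complex multiply), using FMAs
            __m256 tr = _mm256_fmsub_ps(wr, br, _mm256_mul_ps(wi, bi));
            __m256 ti = _mm256_fmadd_ps(wr, bi, _mm256_mul_ps(wi, br));
            // out0 = a + t, out1 = a - t
            _mm256_storeu_ps(re + base + j,        _mm256_add_ps(ar, tr));
            _mm256_storeu_ps(im + base + j,        _mm256_add_ps(ai, ti));
            _mm256_storeu_ps(re + base + half + j, _mm256_sub_ps(ar, tr));
            _mm256_storeu_ps(im + base + half + j, _mm256_sub_ps(ai, ti));
        }
    }
}
```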
It does mean that your initial (or final) few passes end up special. However, these are the passes right next to the input/output permutation, and combining the early special-case passes with the data permutation tends to solve problems in both: the data permutation usually involves some kind of SIMD transpose, and moving some of the math across that transpose boundary separates what would otherwise be intra-vector “horizontal” dependencies out into multiple vectors. As for the data movement, that is ordinarily just loads, stores and shuffling; having some arithmetic in there helps the instruction mix and makes it less likely to be bottlenecked purely on memory operations or shuffling.
The pure deinterleaved format is probably the easiest to understand conceptually, but it’s not very commonly used. Fully interleaved (all real, imaginary value pairs) is the other extreme. That’s a very common format for complex-valued input data, and it usually works reasonably well to do the FFT in that form directly – for complex multiplications specifically, you do tend to need some amount of shuffling and special instructions (“addsub” and “fused multiply with alternating subtract/add”, most notoriously), but these instructions are often available. That said, certain things which can just be dealt with implicitly in a fully deinterleaved form (such as swapping real and imaginary parts, or computing a complex conjugate) turn into actual instructions when working interleaved, so there’s a lot of small random fix-ups involved; again, I’m not a big fan.
Yet another option is to keep the data interleaved, but at the granularity of full vectors (or some larger granularity that is a multiple of the vector size). For example, for 8-wide SIMD, you might store real parts of elements 0-7, then imaginary parts of elements 0-7, then real parts of elements 8-15, and so forth. This is still addressed by a single base pointer and has nice data access locality characteristics, but is just as (if not more) convenient as fully deinterleaved for SIMD code. The trade-off here is that your vector width (or other interleaving granule size) now becomes part of your interface; I like this format inside FFT kernels, less so as part of their public interface. (I’ve been experimenting a bit with this though. Maybe I’ll write a follow-up on this in particular, some day.)
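Index-wise this layout is simple; a tiny sketch with groups of 8 floats (the group width is just an example):

```cpp
#include <cstddef>

// "Interleaved at vector granularity": 8 real parts, then the 8 matching imaginary
// parts, then the next 8 real parts, and so on. W = 8 is an assumption; any multiple
// of the SIMD width works.
constexpr size_t W = 8;
inline size_t re_index(size_t i) { return (i / W) * (2 * W) + (i % W); }
inline size_t im_index(size_t i) { return re_index(i) + W; }
```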
Whenever you don’t work internally with the format you have in your API, there’s going to be some translation at the ends. One end already has the input permutation and can generally rewrite the complex number format into whatever you’d prefer with no trouble. Then for the output you need to translate it back; ideally this is fused into the last pass, but one problem that does crop up with these SIMD variations is that you already have 2 or 3 versions of everything (data permutation combined with initial passes, then a standalone radix-4 or whatever pass), and adding another “final pass” with a different permutation on output can get to be a bit of a chore. It’s up to you how many variations you want to generate (or hand-write if you’re into that sort of thing); personally I tend to avoid having too many pre-built combinations and would rather live with a bit of inefficiency at the interface boundaries than deal with the headache that is a combinatorial explosion of algorithm variants.
Non-pow4 sizes, inverses, real-input or real-output FFTs, trig transforms, etc.
Non-pow4 first: anything based on radix-4 (or radix-2^2) factorizations needs to deal with the fact that half of all powers of 2 aren’t powers of 4. For those, you’re going to need one radix-2 (or radix-8, or radix-2^3) pass in there. There’s not much more to it than that, but it seemed worth writing up just to be explicit.
IFFTs are usually the same transform, just with the twiddles conjugated (scaling with 1/N or similar is often left to the caller). There’s other ways to reduce IFFTs to regular FFTs by swapping around coeffs etc. but I’d advise against it; depending on how your twiddles are stored, you might be able to produce the conjugate twiddles easily, but it’s also not too bad to just have a second version of the FFT kernel around for inverse transforms.
Real-input/real-output, well, there’s multiple ways to handle that, but the easiest is to “bitcast” the real input array into a size-N/2 interleaved complex number array, compute a complex size-N/2 FFT (which due to linearity gives FFT[even_reals] + i*FFT[odd_reals]), and then do a final radix-2 pass to combine the even/odd halves and compute the actual outputs. The inverse of that does, well, the exact same steps in reverse. I’ll not go into the details here but this is a classic algorithm. It’s not the most efficient but it’s reasonable and easy.
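For completeness, here’s a hedged sketch of that final combine step, using the e^(-2πi·nk/N) forward convention (flip the signs for the other convention). It produces X[0..N/2] of the length-N real-input spectrum from the size-N/2 complex FFT of the bit-cast input; function and variable names are mine.

```cpp
#include <complex>
#include <vector>
#include <cstddef>

// Post-pass for a real-input FFT of length N: Z is the size-N/2 complex FFT of the
// input reinterpreted as N/2 interleaved complex numbers. Returns X[0..N/2].
static std::vector<std::complex<float>>
real_fft_post(const std::vector<std::complex<float>>& Z, size_t N)
{
    const float kPi = 3.14159265358979f;
    const size_t H = N / 2;                       // Z.size() == H
    std::vector<std::complex<float>> X(H + 1);
    X[0] = { Z[0].real() + Z[0].imag(), 0.0f };   // DC bin
    X[H] = { Z[0].real() - Z[0].imag(), 0.0f };   // Nyquist bin
    for (size_t k = 1; k < H; ++k) {
        std::complex<float> a = Z[k];
        std::complex<float> b = std::conj(Z[H - k]);
        std::complex<float> even = 0.5f * (a + b);                            // FFT of even samples
        std::complex<float> odd  = std::complex<float>(0.0f, -0.5f) * (a - b); // FFT of odd samples
        X[k] = even + std::polar(1.0f, -2.0f * kPi * float(k) / float(N)) * odd;
    }
    return X;
}
```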
Other trigonometric transforms can also be reduced to inner FFTs with pre- and post-passes like that. If you have a decent FFT around, I’ve found this generally preferable to using specialized O(N log2(N)) algorithms specific to these transforms: trigonometric transforms generally have such direct algorithms, but their structure is almost always less regular than an FFT’s, so reducing to a suitably-sized FFT (which doesn’t waste a lot of work) plus pre-/post-passes is usually my go-to.
Fixed point
I haven’t mentioned it at all. This omission is by design, not out of neglect. Floating-point and SIMD floating-point is widely available these days, and fixed-point signal processing is becoming increasingly niche. Some of the trade-offs are different with integers (adds are cheaper; DSPs usually have integer multiply-accumulate but many integer or SIMD integer instruction sets do not) and all I’ll say on the topic is that the last time I’ve had to deal with a fixed-point FFT is about 15 years ago, and I haven’t missed it. Good luck to everyone who still has to deal with it on a regular basis, but I’m just happy that’s not me anymore.
Twiddle storage
Twiddle factors are, essentially, a sine/cosine table. There is definitely a temptation to try and get cute here – maybe try to save some memory by only storing part of the values and inferring the rest by symmetry, or by realizing that the twiddles for FFTs of size N/2 are just all the even-numbered elements of the twiddles for size N, and so forth.
In short, resist the temptation to do anything fancy with them. You want them in a form that is easy to load and doesn’t require a lot of bookkeeping, and trying to exploit symmetries too much to save on your sine table storage will probably not save you all that much memory but very well might make your hair go prematurely gray.
There’s an exception for really straightforward stuff. For example, if you’re working in deinterleaved form, you need both the imaginary parts of twiddles (=sines) and the real parts (=cosines). Since cos(x) = sin(pi/2 + x), you can just store a sine table and use it for both, starting the cosine lookups a quarter of the way in. That is simple enough to not cause trouble (in my experience anyway), and it also helps your data cache footprint, so it’s actually worthwhile.
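In code that can be as simple as the following (assuming a full-period table whose length N is a power of two, so it’s divisible by 4):

```cpp
#include <cstddef>

// One full-period sine table serving both lookups: sin_table[k] = sin(2*pi*k/N).
// cos(2*pi*k/N) = sin(2*pi*(k + N/4)/N), so cosine lookups start a quarter period in.
inline float tw_sin(const float* sin_table, size_t N, size_t k) { return sin_table[k & (N - 1)]; }
inline float tw_cos(const float* sin_table, size_t N, size_t k) { return sin_table[(k + N / 4) & (N - 1)]; }
```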
Twiddles are one aspect where my lack of experience with large transform sizes shows: I expect that for large enough sizes, once your working data set and your twiddles are both streaming through rather than staying resident in the relevant cache levels, the extra traffic is an issue and you’d much rather spend some more arithmetic in the FFT kernel to compute twiddles (incrementally or otherwise) than eat avoidable extra loads from memory. However, I’ve not been in a situation where I’ve had to care, so I don’t really know one way or the other.
Takeaways
This is a long post and I’ve been at it all day, but I’ll try to give you the summary of my personal recommendations:
- Cooley-Tukey DIT is, on balance, probably your best bet.
- Nonstandard-coefficient-order FFTs are cute but probably more trouble than they’re worth these days.
- Do use FMAs where available (which is quite common these days), and if you design any FFT code now, plan with them in mind. Also, FMAs are the great equalizer as far as the cost of different factorizations is concerned.
- Don’t sweat old-school operation counts, they’re optimizing for the wrong things these days.
- Prefer radix-4 (or radix-2^2), less so for the lower number of multiplies, and more because having fewer passes and fewer loads/stores is good and memory operation count is not negligible for this.
- You don’t have to pick strictly between iterative and recursive approaches, and you can get the best of both worlds by recursing for larger to medium sizes, and handling the small sub-FFTs iteratively.
- Do worry about number of memory accesses and your access patterns, especially for things like digit-reversal permutations.
- SIMD: yes please; Stockham: probably not.
- Data layout-wise, it’s really up to you.
- Twiddles: Keep it simple, stupid!
And I think that’s it for today!
Footnotes
[1] You can also get a “proper” radix-4 kernel with 2 complex twiddle loads instead of 3. The high concept is that instead of twiddling with ω_N^k, ω_N^(2k) and ω_N^(3k), you can substitute the 3k twiddle with a -k one – which is just the complex conjugate of the first twiddle, so that doesn’t need a separate load. This substitution works fine but does affect the coefficient permutation (are you starting to see a pattern emerge?); see e.g. Bouguezel, Ahmad, Swamy, “Improved radix-4 and radix-8 FFT algorithms”, 2004 IEEE International Symposium on Circuits and Systems. The same trick can be used on split-radix factorizations to yield what is now usually called the “conjugate-pair split-radix factorization”. That said, as I keep repeating, with an FMA-counting cost model, the arithmetic operation difference between a radix-4 and radix-2 decomposition doesn’t really materialize, and radix-2^2 comes out looking pretty good.
[2] Not to leave you hanging with a mysterious statement, one possible approach is in Blake, Witten, Cree, “The Fastest Fourier Transform in the South”, IEEE Transactions on Signal Processing, Vol. 61 No. 19, Oct 2013 – in short, be careful about how you implement the permutation, be sure to grab some adjacent data while you’re there, and also, once you have that data in registers, might as well do an initial radix-2 or radix-4 pass along with the reordering.
I was just looking over SIMD FFT code I wrote in 2015 for Bink Audio and Miles Sound System to replace the old, all-scalar implementation we had been using since (presumably) the late 90s. That in turn reminded me of writing that code originally, reading a lot of papers on the subject, and how I wished at the time that there were a write-up going into what mattered (and why) and what didn’t. It’s 8 years too late to save myself some time, but it might still be handy for you!
I’ll try to cover everything interesting to users of FFTs first. I’ll do a follow-up post that is only really interesting for FFT implementers. All of this assumes some familiarity with DSP concepts.
Big picture: what do you need that FFT for, and what kind?
FFTs get used for many things, but there’s 3 main classes of use cases that all have somewhat different requirements:
- Actual spectrum analysis. You have some signal that you want to see the discrete Fourier spectrum of. Generally uses windowed FFTs (aka Short-Time Fourier Transforms, STFTs). Typically wants FFT coefficients in the “proper” (natural) order. Large transform sizes (let’s draw an arbitrary line in the sand and call any transform size above 4096 “large”) are common. All-real inputs are more common by far, and usually just one direction of transform (let’s call it “forward”) is used.
- Convolution or polynomial multiplication. Here, you’re just using the FFT as an intermediate step in a convolution algorithm. For convolution, real inputs are more common by far. For polynomials, you see complex a bit more often, but it’s still not that common. If you don’t want cyclical convolution (and you usually don’t), not only are your inputs and outputs often real, the second half of FFT inputs is usually all zeroes and the second half of IFFT outputs is usually discarded; this can be used to speed up the transform if you care to exploit it. When used for convolutions, you don’t typically care whether coefficients are in the right order or not (for convolution, all you’re doing is element-wise multiplications in frequency space; reordered coefficients cause no real problems). Can use quite large transforms but often doesn’t, even for use cases like fairly long convolution reverbs (see below). Uses both forward and inverse transforms, in about equal proportion if one side of the convolution (e.g. a filter kernel) is fixed and pre-processed in advance; twice as many forward as inverse transforms if not.
- As building block in other DSP algorithms. Many other trigonometric transforms used in practice, e.g. DCTs and MDCTs, reduce to FFTs with some pre- or post-processing, sometimes both. The inner FFT here is usually a complex-valued one. Usually wants coefficients in order. It’s rare to see large transform sizes for this use case.
FFTs “want” to work in the complex domain and this is the most natural form for them. Real-valued signals can be turned into complex-valued signals by just adding all-0 imaginary parts, and then you can use a complex FFT; however, this wastes memory and work. A “proper” real-input FFT doesn’t require blowing up the input into complex numbers, can compute just half the spectrum (the other half is implied by symmetry), and as a result uses about half the memory and is nearly twice as fast. One caveat: for an FFT that works on complex numbers, the difference between a “forward” and “inverse” FFT is very small (literally a sign flip), and there is no universal agreement on which sign the “forward” transform should use (sigh), but it also doesn’t actually matter for many applications. Once you use a real FFT and corresponding IFFT, though, the real FFT takes N real-valued inputs and outputs N/2 complex numbers (really N/2-1 complex numbers and 2 real numbers, but the two reals are frequently packed together into one complex value) and the real IFFT takes those N/2 complex numbers and outputs N reals, so they are not interchangeable and you need to use the right one.
Big picture: transform sizes
In principle, any size FFT can be computed efficiently, in the sense that you get an O(N log N) asymptotic complexity bound no matter what. In practice, the FFT algorithms that you actually want to use in most cases need the transform size to be of a certain form to be efficient, and the arbitrary-size FFT algorithms work by reducing a size-N FFT to a different FFT of a nearby “convenient” size, again with pre-/post-processing.
The most important class of algorithms in practice (Cooley-Tukey and descendants) wants highly composite numbers with few different prime factors; powers of two (just factors of 2) are best. Having a single factor of 3 or 5 in there is also often supported, which gives you more convenient spacing of transform sizes: for example, if you just need slightly more than 1024 elements, you don’t have to jump all the way to 2048 but might instead use 5*256 = 1280.
If possible, and there’s no strong reasons to do anything else, stick with just powers of 2. If that’s not an option, try to make do with exactly one factor of 3 or 5 in there. Any other “odd” sizes will likely end up getting reduced to one of those sizes in the end anyhow.
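As an illustration (a minimal sketch, not tied to any particular FFT library; the function name is mine), picking the smallest “convenient” size of the form 2^k, 3·2^k or 5·2^k that is at least the requested length might look like this:

#include <stdint.h>

// Smallest size >= n of the form 2^k, 3*2^k or 5*2^k.
static uint32_t next_convenient_fft_size(uint32_t n)
{
    uint64_t best = UINT64_MAX;
    for (uint64_t pow2 = 1; pow2 < best; pow2 <<= 1) {
        const uint64_t candidates[3] = { pow2, 3 * pow2, 5 * pow2 };
        for (int i = 0; i < 3; ++i) {
            if (candidates[i] >= n && candidates[i] < best)
                best = candidates[i];
        }
    }
    return (uint32_t)best;
}

For example, asking for 1100 elements yields 1280 (= 5·256) rather than jumping all the way to 2048.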
Big picture: libraries, interfaces, what to call?
If you’re a user of FFTs, and you’re interested mostly in real-valued data (you often are), check if your FFT implementation of choice supports real-valued input or has inverse transforms that produce real outputs from half a spectrum, instead of manually expanding real values to complex values with zero imaginary part. That’s a fairly easy near-2x speed-up, usually.
FFTs usually need some description of the transform on the side, depending on the transform type and size. This is a data structure that usually contains twiddle factors (which effectively means a sine/cosine table, sometimes in funny order) and some auxiliary data (tables or otherwise) used during coefficient reordering.
Some FFT libraries only support a small, restricted set of transform sizes (usually just certain powers of 2), and these can usually avoid a “setup” structure or “FFT plan” or similar entirely. It’s a simpler API but more restricted.
There’s in-place vs. out-of-place. This is really more of an implementation detail, but sometimes it matters. Some approaches lend themselves to a fully in-place implementation (where the input and output are the same array), others don’t. “Real” in-place needs less memory but has more complicated logistics and sometimes inefficiencies. As a result, often in-place functions exist but are just an API entry point you can call that internally allocates workspace memory, works out-of-place, then copies the results back. Sometimes you also have API functions that take an optional pointer to workspace memory; if those exist, that usually means they need it, and if you don’t give them workspace memory, they’ll allocate it themselves and free it once they’re done, so allocating the workspace once and passing it in can save a lot of allocation churn if you’re doing a lot of transforms. In short, there’s a lot of variation here, and this part is often poorly documented, but it’s something you’ll want to look into, because eating a lot of unnecessary allocations and copies in this type of DSP functionality can add significant cost.
Then there’s complex number format. There’s two main contenders: interleaved and split. Interleaved has complex numbers stored as pairs of real part and imaginary part right next to each other. This is the format most commonly used for complex number data types (like C99 built-in complex type or C++ std::complex). FFT algorithms can work in this form directly, and some do, but it can be advantageous to split the data into two arrays, one containing just the real parts and one just the imaginary parts of each value. In general, interleaved form is pretty much always supported, but if the library supports the split format, that is usually the faster one, and it might be more convenient for your code as well, so it could be worth looking into.
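To make the two layouts concrete, here’s a minimal sketch (type and function names are mine, not any particular library’s):

#include <stddef.h>

// Interleaved: real/imaginary parts stored as adjacent pairs
// (this is the layout of C99 complex and C++ std::complex).
typedef struct { float re, im; } ComplexInterleaved;

// Split: one array of real parts, one of imaginary parts.
typedef struct { float *re; float *im; size_t count; } ComplexSplit;

// Converting interleaved data to split form, e.g. to feed a library
// that is faster with (or requires) the split layout.
static void interleaved_to_split(const ComplexInterleaved *src, ComplexSplit dst)
{
    for (size_t i = 0; i < dst.count; ++i) {
        dst.re[i] = src[i].re;
        dst.im[i] = src[i].im;
    }
}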
Finally, permuted coefficients vs. not. If you only want to use a FFT for convolutions, you might not care about the order coefficients are in. Some algorithms can produce a FFT with coefficients in a non-standard order at a lower cost. You don’t see them that much anymore for reasons I’ll get into later, but if you use a FFT library that supports this option, and you only care about convolutions, it might be interesting for you.
With that all said, I know the question of library recommendations is going to come up, so for what it’s worth: if you’re on an Apple platform, vDSP ships as part of the Accelerate framework, is always present, and looks fine to me (although I haven’t tried the DFT/FFT functionality in particular), so there’s at least a few platforms where there’s an obvious choice.
There’s always FFTW, which I have last used in the early 2010s and don’t have fond memories of. It’s a big library with external dependencies, not easy to build or use, a ton of code (around half a megabyte last I checked), and for a while anyway it seemed to lag behind quite a bit in SIMD instruction set support (that might have been fixed by now, or maybe that was just a configuration problem on my part). FFTW does have pretty much everything once you get over that integration hump, though.
Intel has FFT functionality as part of their MKL (Math Kernel Library). I haven’t used it for FFTs but MKL in general tends to offer excellent performance on Intel CPUs. They detect CPUs by model number and mysteriously fall back to their slowest unoptimized kernels on any x86 device that doesn’t report as Intel (unless you override it with an environment variable), which is a reason for me not to use it on general principle but YMMV. MKL is also fairly big (and of course x86-specific) so if distribution size is a problem for you and you just want a FFT this is probably not what you want.
The one that’s easiest for me to recommend is Mark Borgerding’s KISS FFT. It’s not the fastest FFT around but it’s small, simple, and easy to build and integrate. If you find yourself bottlenecked by FFT speed you’ll want to look into something more advanced, but if you just need a FFT-type algorithm to replace a calculation that would otherwise be O(N²) with something that’s O(N log N) and doesn’t have a huge amount of wasted work, this fits the bill. You’re losing a constant factor relative to what you could be getting with proper vectorization, so be advised that there are faster options when that becomes a problem.
Big picture: convolutions
Convolutions are a fundamental signal processing operation for filtering (and also for effects such as convolution reverb in audio processing), and by what’s usually called the Convolution Theorem, time-domain convolution (the operation we’re interested in) corresponds to frequency-domain point-wise multiplication. (Vice versa, frequency-domain convolution also corresponds to time-domain point-wise multiplication, which is the math behind how windowing affects the FFT).
The theory is this: convolving a length-N signal X by a length-M signal Y directly takes N*M multiply-adds (the first one is just a multiply, but close enough). If we take the first signal to be our data and the second to be our filter, that means convolving by Y takes M operations per sample. (It stands to reason that we have to do some work for every output sample on average.) This gets bad if Y is long. In the aforementioned convolution reverb, Y could be a 5-second sample at 48kHz sampling rate. That’s 240,000 multiplies per sample, which is what we in the business refer to as “not ideal”. The convolution theorem tells us we can do a Fourier transform, then one multiply per sample in the frequency domain, which is 4 real multiplies, and finally we do an inverse Fourier transform back. Fourier transforms are (mumble mumble) something something O(N log N), or I guess in our case O(M log M), so maybe this is O(log M) arithmetic operations per sample, and log2(M) is about 18. Even assuming some fairly high constant factors here we’d expect this to work out to ballpark a few hundred operations per sample, roughly a three-orders-of-magnitude speed-up.
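For reference, the direct time-domain convolution we’re comparing against looks like this (a minimal sketch, names mine):

#include <stddef.h>

// Direct convolution of a length-n signal x with a length-m filter y, producing
// n+m-1 output samples: one multiply-add per (sample, tap) pair, i.e. O(n*m) total.
static void convolve_direct(const float *x, size_t n,
                            const float *y, size_t m,
                            float *out /* n+m-1 samples */)
{
    for (size_t i = 0; i < n + m - 1; ++i)
        out[i] = 0.0f;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            out[i + j] += x[i] * y[j];
}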
Only problem being: I glossed over all the details of how exactly that actually works. This is not the subject of this post, so I’ll keep it brief, but there’s a bunch of important techniques here to actually get this to work, and this is another one of those things that are often glossed over, so I want to at least put all the search keywords in here so you know what to look for if you actually want (or need) to implement this.
First, we want “discrete convolution”, which is the convolution operation that belongs with what’s usually called the DTFT (Discrete-Time Fourier Transform). The FFT is a realization of the DFT (Discrete Fourier Transform) not DTFT, so it’s a different type of transform. DFT is like DTFT but for periodic signals, and it also has a convolution operation it’s associated with for its version of the Convolution Theorem, but the convolution we get with the DFT/FFT is what’s called cyclical convolution – it wraps around the signal (which is assumed to be periodic). If we want the no-wraparound (discrete) version, we can get it from the cyclical version by padding the signal with extra zeroes and using a long enough transform length that no actual wrap-around takes place. For our length-N signal, length-M filter setup, we can pad both out to be N+M-1 samples long, and then use cyclical convolution (which is just point-wise multiplication of DFT/FFT coefficients) to get our discrete convolution result.
That works, but if we did that directly, we’d need the whole input signal in advance, and we also get O((N+M) log2(N+M)) operations for the whole job, so the amount of operations per output sample has a dependency on the input length. That’s not only worse than advertised, it also fundamentally doesn’t work in a setting where you get the signal in real-time, because N (the eventual length of your input signal) is not something you know while it’s happening, and you don’t know all the future samples yet. So the next step is to realize that no, of course you don’t need full knowledge of all samples past and future to do a convolution with a finite-length filter. What you can do is cut your source signal into small pieces, say also of size M, filter those individually, and then combine the results. This works just fine, and the search terms for the main approaches are the “overlap-save” and “overlap-add” methods. This gets the dependency on N out of there entirely, and we actually do have O(log M) asymptotic operation count per output sample now. Nice.
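Here’s a minimal sketch of the overlap-add bookkeeping. For clarity it reuses convolve_direct from the sketch above for the per-block work; a real implementation would instead do each per-block convolution as FFT, point-wise multiply, IFFT, with a transform size of at least B+m-1.

#include <stdlib.h>

// Overlap-add: cut x into length-B blocks, convolve each block with the filter y,
// and add the overlapping tails into the shared output buffer.
static void convolve_overlap_add(const float *x, size_t n,
                                 const float *y, size_t m,
                                 size_t B, float *out /* n+m-1 samples */)
{
    float *block = (float *)malloc((B + m - 1) * sizeof(float));
    for (size_t i = 0; i < n + m - 1; ++i)
        out[i] = 0.0f;
    for (size_t start = 0; start < n; start += B) {
        size_t len = (n - start < B) ? (n - start) : B;
        convolve_direct(x + start, len, y, m, block);
        // The last m-1 samples of this block's result overlap the next block's
        // output range; simply adding them in is the "overlap-add" part.
        for (size_t i = 0; i < len + m - 1; ++i)
            out[start + i] += block[i];
    }
    free(block);
}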
This is now good enough for offline editing or batch processing, but still doesn’t work for something like real-time audio, because we work in units of big blocks, and for our M around 240,000, we need blocks of roughly that many input samples to do the job. That’s 5 seconds of audio. If we need to wait until we have more than 5 seconds of audio to do another batch, that means we can’t do it without stuttering unless we introduce 5 seconds of latency. Once again we’re off by several orders of magnitude – we might be willing to live with 20-30ms of latency, although for Pro Audio a real 5ms or thereabouts would actually be nice. Direct convolution doesn’t have this delay. As soon as we have a new input sample, we can compute the corresponding output sample – although it might take us a lot of operations to do so. FFT-based convolution amortizes the effort greatly, but introduces unacceptably large latency when used directly with large kernels. So what do we do?
Well, several tricks. First, we can use the same “cut into pieces” trick we used on the input signal on the filter as well; for example, we might cut it into pieces 1024 samples long (duration 21.3ms at 48kHz sampling rate), which is getting towards the upper end of what we’re willing to tolerate. Then we can use overlap-add/overlap-save style techniques on these pieces to build our convolution with a 240,000-sample signal out of 235 convolutions by a length-1024 signal (this is effectively just another direct convolution, just between corresponding coefficients of 1024-sample blocks). We don’t need to redo all of these every time; every 1024 samples, we transform one new block of 1024 and keep the older ones around – the search term here is “Frequency-Domain Delay Line”. That adds a few hundred multiply-adds per sample to our tally, but a few hundred extra we can live with.
Next, that 1024-sample latency is also not actually necessary. It’s true it takes 21.3ms to complete a 1024-sample block, but that doesn’t mean you have to twiddle your thumbs while that’s going on. The problem is really only with the “beginning” parts of Y – which get multiplied by the most recent samples from X. Once you’re 1024 samples into Y, those values get multiplied by “old” samples in X (at least 1024 samples old), so there’s not nearly as much of a rush. So we can use direct convolution with just the initial part of Y to keep our latency down (it can now again go about as low as direct convolution can), and then use optimized FFT-based convolution for the contribution from more distant parts of the signal. The direct convolution part now contributes something like 1024 multiply-adds per sample, which is starting to hurt, but at least we can hit low latency now.
Finally, the actual solution in practice: the realization that the big blocks are only a latency problem for the most recent samples is true in general, and lets us use this whole process recursively. For example, in our initial 1024 samples, the first 64 samples (1.3ms worth at 48kHz) are latency-critical and probably need direct convolution. Past that point, we might switch to length-64 blocks with FFT-based convolution. At some point past 1024 samples into the “past”, we might switch over to length-1024 blocks. And for further back-in-time samples, we might switch to length-16384 blocks or so (our 5s impulse response fits into 15 16,384-sample blocks, so the 15 resulting overlap-adds/overlap-saves we don’t worry about much, but if it were longer we might do this a few more times with increasing sizes). There’s some subtlety about what block sizes to use and when to switch (you might think it would be good to use pow2 sizes and greedily double the block size as soon as we can every time, but it turns out that’s not work-optimal), and there’s also other concerns such as how much processing time the various transform sizes take and when to schedule them etc., but this is the actual “good” convolution algorithm for real-time applications, and it’s called “Non-Uniform Partitioned Block Convolution”.
All right. This was just a very brief, very rough overview, and I would absolutely not recommend trying to fill in the details from my few paragraphs, but especially with convolutions, I’ve several times seen code that really does a 2D FFT on a whole image or similar to convolve with a 50×50 pixel kernel, and anecdotally it seems that outside the domain of real-time audio, the better techniques are not nearly as well-known as they ought to be, hence this extended detour into convolution algorithms.
This concludes my notes on the subjects that might be interesting for FFT users. Next part will be for FFT implementers only, pretty much.
Back in 2007 I wrote my DXT1/5 (aka BC1/3) encoder rygdxt, originally for “fr-041: debris” (so it was size-constrained). A bit later I put up the source and Sean Barrett adapted it into “stb_dxt”, which is probably the form that most know it in today.
It’s a simple BC1 encoder that gives decent quality, the underlying algorithm is reasonably fast (fast enough to, say, bake textures produced once per session in a game from a character creator, which is one of the cases I’ve ended up using it in, in a professional context), and it’s easy to integrate.
The basic algorithm uses the same primitives most BC1 encoders use (I’ll assume in the following you know how BC1 works): compute the average and covariance matrix of the block of pixels, compute the principal component of the covariance to get an initial guess for what direction the vector between the two endpoints should point in. Then we project all pixel values onto that vector to find the min/max support points in that direction as the initial seed endpoints (which determines the initial palette), assign each pixel the palette entry closest to it, and do some iterative refinement of the whole thing.
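Not the actual rygdxt code, just a minimal sketch of that first step: mean and 3×3 covariance of a 16-pixel block, with pixels as float RGB for simplicity (names are mine).

typedef struct { float r, g, b; } PixelRGB;

static void block_mean_covariance(const PixelRGB px[16], float mean[3], float cov[3][3])
{
    mean[0] = mean[1] = mean[2] = 0.0f;
    for (int i = 0; i < 16; ++i) {
        mean[0] += px[i].r;
        mean[1] += px[i].g;
        mean[2] += px[i].b;
    }
    for (int c = 0; c < 3; ++c)
        mean[c] *= 1.0f / 16.0f;

    for (int j = 0; j < 3; ++j)
        for (int k = 0; k < 3; ++k)
            cov[j][k] = 0.0f;
    for (int i = 0; i < 16; ++i) {
        const float d[3] = { px[i].r - mean[0], px[i].g - mean[1], px[i].b - mean[2] };
        // Accumulate the outer product of the centered pixel with itself.
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                cov[j][k] += d[j] * d[k];
    }
}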
rygdxt/stb_dxt only uses the 4-color mode. It did not bother with the 3-color + 1-bit transparency mode. It’s much more niche, increases the search space appreciably and complicates the solver, and is rarely useful. The original code was written for 64k intros and the like and this was an easy small potential win to give up on to save on code size.
Nothing in there was invented by me, but at the time I wrote it anyway, DXT1/BC1 encoders (at least the ones I was looking at) were doing some of these steps in a way that was more complicated than necessary. For example, one thing I vividly recall is that one encoder I was looking at at the time (I believe it was Squish?) was determining the principal component of the covariance matrix by forming the characteristic polynomial (which for a 3×3 matrix is cubic), using Cardano’s formula to find the roots of the polynomial to determine eigenvalues, and finally using Gaussian Elimination (I think it was) to find vectors spanning the nullspace of the covariance matrix minus the eigenvalue times the identity. That’s a very “undergrad Linear Algebra homework” way of attacking the problem; it works, but it’s complicated with a fair amount of tricky code and numerical issues to wrestle with.
rygdxt, with its focus on size, goes for a much simpler approach: use power iteration. Which is to say, pick a start vector (the only tricky bit), and then iterate multiplying the covariance matrix by that vector and normalizing the result a few times. This gives you a PCA vector directly. In practice, 3-6 iterations are usually sufficient to get as good a direction vector as makes sense for BC1 encoding (stb_dxt uses 4), and the whole thing fits in a handful of lines of code with absolutely nothing subtle or numerically tricky going on. I don’t claim to have invented this, but I will say that before rygdxt I saw a bunch of encoders futzing around with more complicated and messier approaches like Squish, or even bothering with computing a general 3×3 (or even NxN) eigendecomposition, and these days power iteration seems to be the go-to, so if nothing else I think rygdxt helped popularize a much simpler technique in that space.
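A minimal sketch of power iteration on the 3×3 covariance (names are mine; the start vector here is just a simple placeholder choice, and as noted above, picking it is the only subtle part):

#include <math.h>

// Repeatedly multiply the covariance by a vector and renormalize; the result
// converges towards the principal component direction.
static void principal_direction(const float cov[3][3], float dir[3], int iterations)
{
    dir[0] = dir[1] = dir[2] = 1.0f;
    for (int it = 0; it < iterations; ++it) {
        float next[3];
        for (int j = 0; j < 3; ++j)
            next[j] = cov[j][0] * dir[0] + cov[j][1] * dir[1] + cov[j][2] * dir[2];
        const float len = sqrtf(next[0]*next[0] + next[1]*next[1] + next[2]*next[2]);
        if (len <= 0.0f)
            break; // degenerate covariance (e.g. all pixels in the block identical)
        for (int j = 0; j < 3; ++j)
            dir[j] = next[j] / len;
    }
}

A handful of iterations (say, principal_direction(cov, dir, 4)) is plenty for this purpose.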
Another small linear algebra trick in there is with the color matching. We have a 4-entry palette with colors that lie (approximately, because everything is quantized to an 8-bit integer grid) on a line segment through RGB space. The brute-force way to find the best match is to compute the 4 squared distances between the target pixel value and each of the 4 palette entries. This is not necessarily a bad way to do it (especially if you use narrow data types and SIMD instructions, because the dataflow is very simple and regular), but it is essentially computing four 4-element dot products per pixel. What rygdxt/stb_dxt uses instead is the fact that if it were a perfect line segment in a continuous space, we could just use an orthogonal projection to find the nearest point on the line, which with appropriate normalization boils down to a single dot product per pixel. In that continuous simplification, the two interpolated colors sit exactly 1/3rd and 2/3rds along the way between the two endpoints. However working on the aforementioned 8-bit integer grid means that the interpolated colors can sometimes be noticeably off from their ideal placement, especially when the two endpoints are close together. What rygdxt therefore does is compute where the actual interpolated 1/3rd-of-the-way and 2/3rds-of-the-way colors land on the line (two more dot products), and then we can do our single dot product with the line direction and use the values we computed earlier to figure out which of these 4 colors is closest in the 1D space along the line, which is just a few comparisons and can be done branchlessly if desired.
The result doesn’t always match what the brute-force distance comparison would pick, but it’s very close in practice, and reducing the computations by a factor of nearly four sped up the BC1 encoding process nicely (old 2008 evaluation by my now-colleague Charles here).
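Here’s an illustration of the projection idea, not rygdxt’s exact code: it assumes the four palette colors are ordered along the segment (end0, the two interpolants, end1 – the real BC1 index order differs), projects them and the pixel onto the endpoint direction, and classifies by comparing against the midpoints between adjacent palette projections.

static int closest_index_along_line(const float pal[4][3], const float pixel[3])
{
    const float dir[3] = { pal[3][0] - pal[0][0], pal[3][1] - pal[0][1], pal[3][2] - pal[0][2] };
    float dots[4];
    for (int i = 0; i < 4; ++i)
        dots[i] = pal[i][0]*dir[0] + pal[i][1]*dir[1] + pal[i][2]*dir[2];
    // Decision thresholds sit halfway between consecutive palette projections.
    const float t01 = 0.5f * (dots[0] + dots[1]);
    const float t12 = 0.5f * (dots[1] + dots[2]);
    const float t23 = 0.5f * (dots[2] + dots[3]);
    // One dot product plus three comparisons per pixel; branchless if desired.
    const float d = pixel[0]*dir[0] + pixel[1]*dir[1] + pixel[2]*dir[2];
    return (d > t01) + (d > t12) + (d > t23);
}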
That leaves us with the actual subject of this blog post, the iterative refinement logic! I just answered an email by someone asking for an explanation of what that code does and why, so here goes.
The refinement function
The code in question is here.
Ultimately, what rygdxt/stb_dxt does to refine the results is linear least-squares minimization. (Another idea that’s not mine, this one I definitely got from Squish). We’re holding the DXT indices (interpolation weights) constant and solving for optimal endpoints in a least-squares sense. In each of the RGB color channels, the i’th target pixel is approximated by a linear interpolation

$$(1 - w_i)\, x_0 + w_i\, x_1$$

where x0, x1 are the endpoints we’re solving for and wi is one of {0, 1/3, 2/3, 3/3} depending on which of the 4 indices is used for that pixel. Writing that out for the whole block in say the red channel turns into an overdetermined system of 16 linear equations

$$\begin{pmatrix} 1 - w_1 & w_1 \\ 1 - w_2 & w_2 \\ \vdots & \vdots \\ 1 - w_{16} & w_{16} \end{pmatrix} \begin{pmatrix} x_{0r} \\ x_{1r} \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_{16} \end{pmatrix}$$

to be solved for x0r, x1r (the first/second endpoint’s R value).
Let’s call that tall and skinny matrix on the left A; $x_r = (x_{0r}, x_{1r})^T$ is the 2D column vector we’re solving for, and the RHS vector of the pixel r values we can just call r.
That means our problem is to minimize $\|A x_r - r\|_2$ (2-norm), or equivalently

$$\|A x_r - r\|_2^2.$$

This is quadratic in $x_r$; the way you find its minimum is by computing the derivative and equating it to 0, which leads us to what’s called the Normal Equations, which for this problem are

$$A^T A \, x_r = A^T r.$$
A is a 16×2 matrix, so $A^T A$ is a tiny symmetric matrix (2×2) and contains the dot products of the columns of A with each other.
We have 3 color channels, not just r but g and b as well. That means we have 3 copies of the same linear system with 3 different right-hand sides, or equivalently we can view the whole thing as a matrix equation with a 3-wide right-hand side. Either way, all 3 systems have the same A matrix, the only thing that differs is the right-hand sides.
We accumulate $A^T r$ (and g and b as well) in the first pass, and also compute the entries of $A^T A$. To solve the system above, because we just have a 2×2 matrix, we can use Cramer’s rule to solve it directly, no need to mess around with factorizations or Gaussian Elimination or similar.
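For concreteness, here’s a minimal sketch of that solve step (not the actual RefineBlock code; names and the use of floats are mine): given the accumulated normal-equation terms, solve the 2×2 system per color channel via Cramer’s rule.

typedef struct {
    float a00, a01, a11;      // entries of the symmetric 2x2 matrix A^T A
    float atb0[3], atb1[3];   // the two rows of A^T b, one entry per color channel
} LsqAccum;

// Returns 0 if the system is singular (all weights equal), which needs a fallback.
static int solve_endpoints(const LsqAccum *acc, float end0[3], float end1[3])
{
    const float det = acc->a00 * acc->a11 - acc->a01 * acc->a01;
    if (det == 0.0f)
        return 0;
    const float inv_det = 1.0f / det;
    for (int c = 0; c < 3; ++c) {
        // Cramer's rule for [[a00, a01], [a01, a11]] * (end0, end1) = (atb0, atb1).
        end0[c] = (acc->atb0[c] * acc->a11 - acc->atb1[c] * acc->a01) * inv_det;
        end1[c] = (acc->atb1[c] * acc->a00 - acc->atb0[c] * acc->a01) * inv_det;
    }
    return 1;
}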
That’s the basic idea. The RefineBlock function uses two more tricks:
- instead of the weights being {0, 1/3, 2/3, 3/3}, we multiply them by 3 and also scale the RHS by 3 (the latter, we never actually do explicitly). Getting the extra scaling is essentially free during the linear system solve, especially since we already need to do some scaling per-channel anyway, because the R/B endpoint values we solve for are in [0,31] (instead of [0,255] for the input pixel values), and the G values are in [0,63]. Scaling everything by 3 means there are no fractions involved in the computation, it’s all small integers, which will be useful for the second trick. It also means that when we compute the determinant of the 2×2 system for Cramer’s rule, it’s an integer computation, so we don’t have to worry about near-zero values and the like. (Although in this case, it’s not too hard to prove that $A^T A$ is singular, i.e. has determinant 0, if and only if all the wi are the same, which is easy enough to detect up front.)
- now our weights wi (matrix entries in A) are all in {0,1,2,3}. The three distinct entries of $A^T A$ are the sums, respectively (note we scaled by 3, so $1 - w_i$ turns into $3 - w_i$):

$$\sum_i (3 - w_i)^2, \quad \sum_i (3 - w_i) w_i, \quad \text{and} \quad \sum_i w_i^2.$$

Note that all three values we’re summing only depend on wi, and wi is one of 4 possible values (depending on the index), so we can just precompute all of them. Also note they’re all quite small: $(3 - w_i)^2$ and $w_i^2$ are at most 9, and $(3 - w_i) w_i$ is either 0 or 2, so they all comfortably fit in 4 bits. We’re summing 16 of these values (one per pixel in the block), and they’re all positive. That means the final sums fit into 8 bits no problem. Therefore the actual table we have packs the 3 values into one integer:

$$(3 - w_i)^2 \cdot 2^{16} + (3 - w_i) w_i \cdot 2^8 + w_i^2.$$

We sum that one integer per pixel. All of the individual sums are guaranteed to be <256 so we can extract the corresponding bits after the accumulation loop. That means the cost of computing the entries of $A^T A$ becomes quite cheap (a single 4-entry table lookup and 32-bit integer add per pixel), and the bulk of the work in that first loop is just computing the right-hand sides.
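A sketch of that packed accumulation (not stb_dxt’s actual table; the index-to-weight mapping and field order in the real code differ, and the names are mine):

#include <stdint.h>

// With weights w in {0,1,2,3}, pack (3-w)^2, (3-w)*w and w^2 into 8-bit fields of one
// integer, so a single table lookup and 32-bit add per pixel accumulates all three
// A^T A entries at once.
static void accumulate_ata(const int weights[16], int *a00, int *a01, int *a11)
{
    uint32_t table[4], acc = 0;
    for (int w = 0; w < 4; ++w)
        table[w] = (uint32_t)((((3 - w) * (3 - w)) << 16) | (((3 - w) * w) << 8) | (w * w));
    for (int i = 0; i < 16; ++i)
        acc += table[weights[i]]; // weights[i] in {0,1,2,3}
    // Each field sums to at most 16*9 = 144 < 256, so fields never overflow into each other.
    *a00 = (int)(acc >> 16);         // sum of (3-w)^2
    *a01 = (int)((acc >> 8) & 0xff); // sum of (3-w)*w
    *a11 = (int)(acc & 0xff);        // sum of w^2
}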
And that’s about it. I know that approach was later also used by NVidia Texture Tools, beyond that I have no idea of its reach, but if it’s handy for someone, cool!
Two acquaintances independently asked about this today, so it seems worth a write-up: recently (as of this writing), DeepMind published a new paper about a new practical fast matrix multiplication algorithm, along with a press release that is a bit misleading and seems to have led to some confusion already.
In particular, while the paper does introduce many new “practical” (not scare quotes, I’m using the word in a specific sense here that I’ll make more precise in a minute) fast small-matrix multiplication algorithms, that does not mean that you should replace your small-matrix library routines for 3×3 or 4×4 matrix multiplication with them. That’s not actually what they’re meant for, and they wouldn’t be good at it.
If you just want the executive summary, here it is: these are definitely interesting algorithms from an arithmetic complexity theory standpoint – especially for the case of 4×4 matrices over finite fields, where (to the best of my knowledge) Strassen’s algorithm from 1969 was still the reigning champion. These algorithms are also practically relevant, meaning that they don’t just improve on Strassen’s algorithm on paper, they are algorithms you might actually use in practice – unlike essentially everything else that has been written on the topic in the last 50 years: those algorithms are correct, and will in principle win over Strassen’s algorithm for large enough matrices, but the cut-off is well beyond the sizes that anyone is actually doing computations with.
And if you know what Strassen’s algorithm is, you might be in the market for the results from this paper. In fact, I’ll go one further and say that if you are currently using Strassen’s algorithm somewhere, you should definitely check the paper out. For the rest of you, I’ll try to give a very short summary of the topic and explain why actual small matrices are not really the intended use case.
Strassen’s algorithm
Regular matrix multiplication of 2×2 matrices uses the standard “dot products of rows with columns” algorithm:

$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{pmatrix}$$
As written, this does 8 multiplications and 4 additions. In 1969, Volker Strassen discovered an algorithm for this that only uses 7 multiplications but 18 additions (follow the link for more details, I won’t go into it here); Winograd later presented a variant that only uses 15 additions. This is interesting in principle if multiplications are much more expensive than additions, something which is true in some settings but does not commonly apply to hardware floating-point math these days. Hardware floating-point now commonly implements fused multiply-add (FMA) instructions, and the 2×2 matrix multiplication above can be implemented using 4 regular multiplications, 4 fused multiply-adds, and no separate additions at all. Moreover, Strassen’s algorithm has some numerical stability issues when used with floating-point (these concerns do not exist when it’s used for finite field arithmetic, though!) that means it also must be used carefully. What this means is that you would never actually consider using Strassen’s algorithm on 2×2 matrices, and that is in fact not how it’s normally presented.
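To make the FMA point concrete, here’s the standard 2×2 multiply written with fmaf from <math.h>: 4 multiplies plus 4 fused multiply-adds, no separate additions.

#include <math.h>

static void mul2x2(const float a[2][2], const float b[2][2], float c[2][2])
{
    // c[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j], each as one mul feeding one FMA.
    c[0][0] = fmaf(a[0][1], b[1][0], a[0][0] * b[0][0]);
    c[0][1] = fmaf(a[0][1], b[1][1], a[0][0] * b[0][1]);
    c[1][0] = fmaf(a[1][1], b[1][0], a[1][0] * b[0][0]);
    c[1][1] = fmaf(a[1][1], b[1][1], a[1][0] * b[0][1]);
}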
The form of Strassen’s algorithm that is of practical interest works not with 2×2 matrices, but with 2×2 block matrices, that is, 2×2 matrices whose elements are themselves matrices. Matrix multiplication has a very regular structure that looks exactly the same when multiplying block matrices as it does when multiplying matrices of scalars, you just need to be careful about operand order for multiplications because matrix multiplications, unlike multiplications in the scalar ring or field, are not commutative:

$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}$$
This looks identical to what we had before (except it’s now all upper case), but the operations mean something different: before we were multiplying scalars with each other, so we had something like individual real number multiplications (or floating-point multiplications in actual numerical code), now the products are matrix-matrix products (which are O(N³) operations when multiplying two square NxN matrices using the standard algorithm, a mix of multiplications and additions or possibly fused-multiply-adds) and the sums are matrix-matrix sums (which are O(N²) additions). And because what we’re describing here is a matrix multiplication algorithm to begin with, the smaller sub-matrix multiplications can also use Strassen’s algorithm if they want to.
In short, not only does each multiplication and addition in this block matrix represent many underlying scalar operations, but the relative cost of multiplications compared to additions keeps growing as N (the size of the blocks in our block matrix) increases. And that’s where Strassen’s algorithm is actually used in practice: eventually, once N becomes large enough, the many extra additions and irregular structure become worthwhile. It’s important to note that arithmetic operation count is not the only factor here: the extremely regular structure of conventional matrix multiplication, and the ease with which it can use FMAs, means that the arithmetic operations in a tuned matrix multiplication kernel are a lot cheaper than you might expect, because these kernels tend to do an extremely good job of keeping the machine busy. With small matrices (and actual 4×4 matrices definitely fit that description), other overheads dominate. Somewhat larger blocks are mostly limited by memory subsystem effects and carefully manage their footprint in the various cache levels and TLBs, which tends to include techniques such as extra memory copying and packing stages that might seem like a waste because they’re not spamming floating-point math, but are worth the cost because they make everything else go faster. With much larger blocks, eventually Strassen can become attractive, although the actual cut-off varies wildly between architectures. I’ve seen reports of Strassen multiplication becoming useful for blocks as small as 128×128, but more usually, it’s used for blocks that are at least 512×512 elements, if not much more. All this assuming that its less-than-ideal numerical properties are not a show-stopper for the given application.
AlphaTensor
That brings us back to AlphaTensor. What DeepMind has implemented is, in essence, a variant of the neural-net-guided Monte Carlo Tree Search they’ve been using with great success to build Chess and Go AIs. This time, they use it to search for small-matrix multiplication algorithms. I’m not the right person to go into the details here, and it’s not important for the purposes of this post. We can just treat this procedure as a black-box optimizer that can do a guided search for matrix-multiplication algorithms meeting a user-specified set of criteria.
One obvious criterion would be optimizing for minimum multiplication count, and in fact one of the discoveries reported is a “Strassen-like” factorization that uses 47 multiplications to multiply two 4×4 matrices (compared to 7*7=49 multiplications for two nested applications of Strassen’s 2×2 algorithm, or 64 multiplications for the direct algorithm). And separate from the more theoretical concern of minimum operation count for a “practical” algorithm, the same optimizer can also incorporate measured runtimes on actual devices into its operation, and thus be used to perform a guided search for algorithms that are fast on certain devices.
That’s the process used to yield figure 5 in the paper, which reports speed-ups of optimized 4×4 matrix multiplication factorizations against the baseline (which is the regular algorithm). Note that what this does is one high-level 4×4 block matrix multiply using the algorithm in question at the top level; all the smaller lower-level matrix multiplies use regular matrix multiplication. Also note that the reported matrix sizes start at 8192×8192, i.e. in this case, the 2048×2048-element blocks are multiplied regularly.
Furthermore, note that the reported speed-ups that the press release refers to as “10-20%” are compared to the baseline of using regular matrix multiplication, not against Strassen (in this case, “Strassen-square”, the aforementioned 4×4 factorization obtained by applying Strassen’s algorithm twice to a 4×4 matrix). Strassen results are reported in the figure as well, they’re the red bars.
In short, the new algorithms are a sizeable improvement, especially on the TPU, but the perspective here is important: this type of algorithm is interesting when individual multiplications are quite expensive, in this case because they are actually multiplications of fairly large matrices themselves (2048×2048 blocks are nothing to sneeze at).
If you’re actually multiplying 4×4 matrices of scalars, the standard algorithm remains the way to go, and is likely to stay that way for the foreseeable future.
I wrote about Morton codes long ago, and I see that post and the accompanying code referenced frequently, but there’s a few points I want to make about it.
First, if you’re on an x86 CPU with BMI2 support, you have access to PDEP and PEXT, which make Morton encoding/decoding trivial. 2D Morton encoding one 16-bit coordinate to a 32-bit word is a single 32-bit PDEP with 0x55555555, decoding it is a PEXT with the same mask, and you can also use the same thing to encode 64-bit values with appropriate longer masks extended in the obvious way. The only fly in the ointment is that earlier AMD Zen CPUs technically support PDEP and PEXT but with an incredibly slow implementation, so this is not a good option if your target HW includes those CPUs. That said, on Zen 3 and later the two instructions are fine, and ARM has added an equivalent to their vector pipes with SVE2, so while a bit iffy right now, I expect this might be the method of choice eventually, for CPUs anyway. I’ve been seeing this a fair amount in GPU shaders too, though, which brings me to the next subject for today.
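Before moving on, this is what the BMI2 path looks like as a minimal sketch (function names are mine; needs a BMI2-enabled target, e.g. -mbmi2):

#include <stdint.h>
#include <immintrin.h>

static uint32_t EncodeMorton2_BMI2(uint32_t x, uint32_t y)
{
    // Deposit x into the even bits, y into the odd bits.
    return _pdep_u32(x, 0x55555555u) | _pdep_u32(y, 0xaaaaaaaau);
}

static void DecodeMorton2_BMI2(uint32_t code, uint32_t *x, uint32_t *y)
{
    // Extract the even bits back into x, the odd bits back into y.
    *x = _pext_u32(code, 0x55555555u);
    *y = _pext_u32(code, 0xaaaaaaaau);
}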
The methods presented in the original post assume that the Morton-coded index value is 32 bits. I did not point this out explicitly in the text back in 2009, but you can either add another pass to produce a 64-bit result, or remove passes if you don’t need full pairs of 16-bit coordinates (or triples of 10-bit coordinates) interleaved into one 32-bit value.
For example, I gave this code for Morton-encoding a pair of unsigned 16-bit values to form a 32-bit index: (Dropping the comments that detail the bit layout after each step because they cause formatting issues)
// "Insert" a 0 bit after each of the 16 low bits of x
uint32_t Part1By1(uint32_t x)
{
x &= 0x0000ffff;
x = (x ^ (x << 8)) & 0x00ff00ff;
x = (x ^ (x << 4)) & 0x0f0f0f0f;
x = (x ^ (x << 2)) & 0x33333333;
x = (x ^ (x << 1)) & 0x55555555;
return x;
}
uint32_t EncodeMorton2(uint32_t x, uint32_t y)
{
return (Part1By1(y) << 1) + Part1By1(x);
}
If we only care about 8 bits worth of x and y, that first shift-and-xor pass (which tries to move the high 8 bits of x into place) is useless and we can just skip it:
// "Insert" a 0 bit after each of the 16 low bits of x
uint32_t Part1By1_8b(uint32_t x)
{
x &= 0xff; // 8 bits only this time!
x = (x ^ (x << 4)) & 0x0f0f0f0f; // 0x0f0f would suffice
x = (x ^ (x << 2)) & 0x33333333; // 0x3333 would suffice
x = (x ^ (x << 1)) & 0x55555555; // 0x5555 would suffice
return x;
}
// Only uses low 8 bits of x, y
uint32_t EncodeMorton2_8b(uint32_t x, uint32_t y)
{
return (Part1By1_8b(y) << 1) + Part1By1_8b(x);
}
Lastly, if you’re encoding or decoding multiple values, which is extremely common because after all the whole point is to turn a pair (or triple) of values into a single integer, and the values are sufficiently narrow, we can handle x and y at once, by packing them into a suitable value before we do the rest of the computation.
For example, the case above where we only have 8-bit x and y coordinates (not that uncommon, say within a tile) can use a variant of the trick above to handle the full pair for only slightly more work than handling a single coordinate with the original code:
uint32_t EncodeMorton2_8b_better(uint32_t x, uint32_t y)
{
// Pack x and y into t, starting at bits 0 and 16 respectively
uint32_t t = (x & 0xff) | ((y & 0xff) << 16);
// Do the full interleave as above (this time the full
// 32-bit masks are necessary)
t = (t ^ (t << 4)) & 0x0f0f0f0f;
t = (t ^ (t << 2)) & 0x33333333;
t = (t ^ (t << 1)) & 0x55555555;
// Merge the halves.
// All the odd bits are clear, so instead of shifting t right by
// 16 to get y and then shifting left by 1, we can just shift right
// by 15.
return (t >> 15) | (t & 0xffff);
}
All in all, this is roughly the same cost as a single invocation of the original Part1By1. If you need a full 16-bit x and y but have 64-bit registers available, you can get a 32-bit result using a 64-bit extension of the original code (you just need to expand the masks appropriately).
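For reference, a sketch of that 64-bit variant (masks widened in the obvious way; the name is mine):

#include <stdint.h>

// Full 16-bit x and y handled together by placing them 32 bits apart.
static uint32_t EncodeMorton2_16b(uint32_t x, uint32_t y)
{
    uint64_t t = (uint64_t)(x & 0xffff) | ((uint64_t)(y & 0xffff) << 32);
    t = (t ^ (t << 8)) & 0x00ff00ff00ff00ffull;
    t = (t ^ (t << 4)) & 0x0f0f0f0f0f0f0f0full;
    t = (t ^ (t << 2)) & 0x3333333333333333ull;
    t = (t ^ (t << 1)) & 0x5555555555555555ull;
    // x now occupies the even bits of the low half, y the even bits of the high half;
    // shifting the high half right by 31 lands y on the odd bits of the result.
    return (uint32_t)((t >> 31) | (t & 0xffffffffull));
}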
This technique is independent of how exactly you do the bit-(de-)interleaving, so you can in principle combine this with PDEP and PEXT. There’s not much point though, because if you have a good implementation then doing two PDEP with different masks and a final OR will be cheaper, and if you might encounter one of the bad implementations you just want to avoid these instructions entirely and use the above SIMD-within-a-register (SWAR) algorithms.
I’ve only shown 2D encoding here; the same trick works for decoding:
std::pair<uint32_t, uint32_t> DecodeMorton2_8b(uint32_t code)
{
// Separate even and odd bits to top and bottom half, respectively
uint32_t t = (code & 0x5555) | ((code & 0xaaaa) << 15);
// Decode passes
t = (t ^ (t >> 1)) & 0x33333333;
t = (t ^ (t >> 2)) & 0x0f0f0f0f;
t ^= t >> 4; // No final mask, we mask next anyway:
// Return x and y
return std::make_pair(t & 0xff, (t >> 16) & 0xff);
}
And of course the same idea works for the 3D variants as well.
In summary, To Whom It May Concern: 1. PDEP/PEXT and friends (when available and reliably good) make this trivial, 2. if you only need a few bits you can skip passes and make it cheaper, 3. the bit movement is completely regular so you can also pack multiple values ahead of time and get two (de)interleaves for the price of one.