In the last part we went over the general ideas of Huffman coding as implemented in the newer Oodle Data coders, this time we’ll be looking at one particular implementation that is both interesting and “historically relevant”: Oodle was designed with games in mind, an important class of hardware to consider for game middleware is game consoles, and versions of the AMD Jaguar CPU were in both the PS4 and Xbox One (mostly unmodified except for a bump in clock rate in the “mid-lifecycle upgrade” models of both). We wanted Kraken to perform well on those machines, so we spent some time optimizing Oodle for it. Before I go into the details, let’s do a bit of background on the machine itself, but be advised that this will be in-depth and that you may need to re-read the previous part first. Furthermore, this post contains plenty of x86 assembly; if you’re uncomfortable or unfamiliar with that, you probably won’t get much out of it, sorry.

### Meet the AMD Jaguar

The Jaguar, or less prosaically “Family 16h”, is a small, low-power, out-of-order 64-bit x86 CPU core designed for small systems and embedded applications such as, well, game consoles. In the game console variants, the Jaguar CPUs appear on the main SoC along with the GPU (and most other components). It’s designed for multi-core operation and cores usually appear in clusters of four that share a common L2 cache, usually 512kb of L2 per core. In the Xbox/PS4 versions, there are two such clusters and thus two L2 cache slices. This post is only concerned with tasks that run on a single thread so I won’t be spending time on this part of the architecture.

Each core has 32KiB of L1 instruction and 32KiB of L1 data cache. The frontend decode/dispatch/retire logic is 2 instructions wide, and the relevant unit for most of it is what is variously (depending on the source) called “macro-ops” or “cops” (complex ops), depending on the source. I’ll stick with macro-ops. Macro-ops are typically a data-processing instruction along with a memory reference, so something like the x86 instruction `add rax, [rsi]`

would be a single macro-op.^{1} Macro-ops get broken into either one or two micro-ops (I’ll write uops in the following) for execution, but instruction decoding, tracking and retirement all works on macro-ops. The backend has six execution units, each of which can accept one micro-op per cycle: two integer (which I’ll refer to as I0 and I1), one load (L), one store (S), and two SIMD/floating point (which I’ll refer to as F0 and F1). The pipelines are very symmetric: almost all integer instructions can execute in either I0 and I1 (the biggest exceptions being multiplies and divides, which are I1 only), and most FP/SIMD instructions can execute in either F0 or F1 (FP addition and SIMD integer multiplication are only supported in F0, and FP multiplies and store/convert are F1 only). Consequently, most pairings of two independent instructions can execute in the same cycle, if they’re not both contending for the same resource. Of these backend limitations, in my experience the one you’re most likely to hit is the one load per cycle limit.

That said, whenever I’ve looked, the two instructions per cycle decode/dispatch limit is usually the more relevant one. On the Jaguars, using the load-operate and even read-modify-write instructions where possible is a good idea (because it gives you two uops per macro-op), and generally preferable to splitting loads out.

Speaking of loads, the L1 data cache is 8-way associative, write-back, and internally splits 64-byte cache lines into 16-byte sectors. Unaligned loads and stores that stay within a single sector are free, crossing a sector boundary occupies the load/store pipes for an extra cycle (potentially more if it also crosses a page boundary etc.). The load-to-use latency for the L1 data cache is 3 cycles to the integer pipes, 5 cycles to the FP/SIMD pipes, both of which are quite low numbers compared to most of its contemporaries.^{2} The theme of low latencies continues for other parts of the backend: FP32 multiplies and SIMD integer multiplies complete in 2 clock cycles, FP32 and FP64 adds in 3, and most SIMD ALU operations take a single cycle. L2 misses take relatively long though, at a minimum load-to-use latency of 25 cycles.

Lots of console developers found these cores underwhelming, mostly due to the narrow design and fairly low clock rates (around 1.6 and 1.7GHz in the original PS4 and Xbox One, respectively). On the other hand, these cores are quite small, power-efficient, and the PS4/Xb1 console generation came with 8 of them, at a time when more than 4 cores was a rarity in the consumer space. Personally, I quite like them: they’re not the fastest but what they are is extremely even-tempered and predictable. They have a relatively low ceiling on the instructions per cycle and peak performance they can achieve, but getting there is generally a fairly straightforward process, and there’s not much in the way of gotchas or nasty surprises. They’re a pleasant core to optimize for^{3}, and AMD helped by providing good documentation for it.

### The Plan

Because of the aforementioned decode/dispatch/retire limits and low instruction latencies, optimizing code with reasonably nice memory access patterns for the Jaguars is, more often than not, an exercise in minimizing number of instructions executed for a given task. (As I said, they’re fairly straightforward to optimize for!) Therefore, if we want a fast Huffman decoder on these machines, it’s a good idea to see if we can do it with as few instructions as possible.

While reviewing the above-quoted docs, one thing I noticed was that `BEXTR`

, an instruction from BMI1, turns into one uop, is supported on both integer pipes, and has 1-cycle latency. `BEXTR`

is an odd duck: it extracts a given number of bits from a given starting point in the first source operand, and as such is essentially a counterpart to PowerPCs `rlwinm`

or ARMs `UBFM`

, but while these latter two instructions have the bitfield position and width given as an immediate operand, `BEXTR`

takes a register operand for the bitfield specification.^{4} Code that wants to do repeated bitfield extraction with the same operands can burn a register on a constant (itself a fairly steep cost on the relatively register-starved x86) and then use `BEXTR`

, which replaces a move, shift, and an bitwise AND instruction.^{5} The second source register operand to `BEXTR`

contains, itself, bit-packed values: the lower 8 bits give the index of the LSB of the bitfield to extract, the next 8 bits give the width in bits.

This is usable for the bitstream decoding part of our Huffman decoder. Using a “bit extraction” style decoder (variant 3 in this post) means we repeatedly do operations of the form `(bit_buf >> bit_pos) & ((1 << 11) - 1))`

to peek at our next 11 bits, and that is just `BEXTR(bit_buf, bit_pos + (11 << 8))`

. It doesn’t cause any problems to have a constant bias that shows up only in the high bytes added to our bit position, so we can just declare our bit positions to have that offset added at all times while in registers, and that lets us do our bit buffer peek in a single 1-uop instruction on Jaguar cores. Because of another x86 quirk, namely that byte-sized instructions exist and preserve the remaining bits of the register, we can do updates of bit_pos using byte-sized additions or subtractions that leave the high bits alone, if we want to.^{6}

Finally, we don’t want to do a store for every byte we decode, because that’s an extra instruction and we’re easily limited by instructions (or rather, macro-ops) executed. Fortunately we can use the SSE4.1 instruction `PINSRB`

(packed insert byte), which inserts a byte value from an integer register or memory into a given lane of a vector register. Vector registers hold 128 bits (16 bytes), which means we can amortize the number of stores and do one every 16 or so bytes instead of for every codeword. Finally, because Jaguar cores treat memory references inside an instruction as separate uops but not separate macro-ops, and macro-ops are one of our main limiters, we want to use memory references liberally if doing so lets us reduce the number of macro-ops we need.

Putting this all together, note that the pseudocode for the per-symbol processing in a LSB-first Huffman decoder, as outlined in the previous part, looks something like this:

// peek uint32_t bits = (bit_buf >> bit_pos) & 2047; // consume bits bit_pos += table[bits].len; // decode symbol emit(table[bits].sym);

and using the various techniques outlined above, we can turn this into a mere 3 x86 instructions:

; peek. rBitPos[15:8] = 11 bextr rBits, rBitBuf, rBitPos ; advance bit offset (update low byte only) add rBitPosb, [rTableBase + rBits*2] ; put table[bits].sym at position N into xmm0 vpinsrb xmm0, xmm0, [rTableBase + rBits*2 + 1], N

On the Jaguar, this decomposes into 3 macro-ops and 5 uops: 2 loads, 2 integer ops, 1 SIMD. The bit extract to grab rBits from rBitBuf takes a single cycle; the bit position update takes 3 cycles to load the value from the table and an extra cycle to complete the addition. We don’t actually care about the top bytes being preserved here, since we don’t expect overflows, but we do care about our load-operand being byte-sized. Either way, that’s 5 cycles critical path latency from one decoded symbol to the next. Finally, the vector byte inserts to collect the output bytes are not on the critical path. They need to be fast enough to keep up with our decoding bytes once the table loads finish (and they are, single they can complete at a rate of 1 per cycle) but that’s about it. With the Jaguar frontend supporting at most 2 macro-ops per cycle, this code takes at least 1.5 cycles per symbol decoded in the frontend, and 2 cycles per symbol decoded in the load pipeline. Meaning that as given, this code is limited more by the load pipeline than the frontend. However, this is not the only work that needs to happen in this loop, and the Jaguar is out-of-order, so we can build up a backlog of load pipeline work; if we later need to do more integer work in the loop that does not take many loads (spoiler: we will), the load pipeline will get to catch up.

Finally, as mentioned above, our critical path between back-to-back loads from the same stream is 5 cycles on the Jaguar. If we use 3 streams and interleave their processing during decode, then the frontend will get around to the first instruction for the second byte of stream 0 about 4.5 cycles in (although the load pipeline will take about 6 cycles to work through its backlog before then). In other words, the timing here can roughly work out, but it’s not perfectly matched; we will build up a bit of backlog in the load pipeline and the reorder buffer before this is done, but as long as we choose our instructions carefully and don’t go too lopsided, we can make this work while keeping the core nice and busy the whole time through.

I was pretty excited when I first realized this 3-instruction sequence was a viable candidate for the core of our Huffman decoder on Jaguar, but to get a real decoder we also need to deal with bit buffer refills, pointer advancing, and end-of-buffer checks.

### The actual decoder

As mentioned above (and in the previous part), we use 3 separate bitstreams for parallelism. Of these, two bitstreams are regular “forward” bitstreams in increasing address order, and one is written backwards. The numbering of these is a bit odd: in the physical Oodle format, the layout is `strm0-> | strm2-> | <-strm1`

, i.e. stream 0 is forward and comes first (as you would expect), stream 1 is backward and comes last, and stream 2 is also forward and somewhat awkwardly sandwiched in the middle, for “historical reasons”. Namely, Kraken uses forward-backward stream pairs in many places.^{7} The Huffman decoder used to be the same way; when we noticed (while working on this Jaguar decoder, in fact) that three streams would be advantageous, we had to put the third stream *somewhere*. Putting stream 2 in the middle turned out to be slightly easier.^{8} The advantage of the odd-looking backward stream is that it saves us a bit of signaling in the container format (not a trivial concern for a compression format) and also gives us a nice way to do end-of-buffer checks. Namely, the three are contiguous, and all three read pointers (called `in0`

, `in1`

and `in2`

in the following) are in that single contiguous region. Furthermore, in0 keeps increasing, in1 keeps decreasing, and at any point in a well-formed stream, we have `buffer_begin`

≤ `in0`

≤ `in2`

≤ `in1`

≤ `buffer_end`

. During decoding, we do the two interior checks of the read pointers against each other; the end-of-buffer checks on either end are implied by transitivity, and we don’t need to actually do them, or keep those extra pointers around. The only pointers we need to check are the ones we already keep around anyway. Neat!

Now, loads aren’t zero-sized; we use the (common in C) convention that “end” pointers point one past the last element of arrays. So we don’t want to start loading from `in1`

, and with 64-bit (8-byte) loads, the largest address we can ever safely load from is at `buffer_end - 8`

, assuming the buffer is at least 8 bytes to begin with (which we check beforehand). Decrementing `in1`

by 8 before the loop takes care of both issues: now in1 points to the last address we can do a valid 64-bit load from, and as a side effect `in0`

≤ `in2`

≤ `in1`

ends up guaranteeing that in0 and in2 are also good to safely load 8 bytes from without overrunning the buffer. Finally, the optimized decoder loop described here decodes 5 bytes each from 3 streams and writes the results using a 16-byte SIMD store, so it can only safely run until 16 bytes before the intended end of the buffer. All the remaining special cases (less than 8 bytes left in some of the streams, very short input streams, or close to the end of the output buffer) are left to a dedicated safe loop that generally handles the last few bytes, needs to do more careful checking, and is certainly not using hand-tweaked assembly. There would be no point for speed since it only ever handles very few bytes, and besides that’s the exact loop where you very much want a higher-level language for better debugging facilities and good integrations with sanitizers, fuzzers etc.

With a plan for all those details, all we need to take care of now is the refill logic and sort out the remaining plumbing. Looking at the decoder sketch above, we see that we need at least 2 registers worth of state per bitstream: one register to contain `bit_buf`

(rBitBuf in the pseudo-ASM), and one for `bit_pos`

. Once we consider refilling, we also need the corresponding read pointer (the `inN`

I was just taking about). For 3 streams, 3 registers of state per stream works out to 9 registers, a bit more than half of our general-purpose register name pool, which is workable.

As for refill, that is luckily straightforward in a “bit extract” style scheme. At the top of every iteration, we want to load the next 8 bytes from the current input pointer:

mov rBitBuf0, [in0]

For the reverse byte order `in1`

stream, we use a big-endian load (`MOVBE`

) instead, which is the same cost as the regular load on the Jaguars.^{9}

Then we decode 5 values from each of the 3 streams. With our 11 bits code length limit, that means we end up consuming at most 55 bits from each stream. Most relevant bit reading techniques support at most either 56 or 57 bits in a row without a refill when using 64-bit registers, so this fits well.^{10} Decoding 3×5 = 15 symbols also works out very nicely with our 128-bit vector registers, so we do a single unaligned vector store every 15 bytes.^{11}

Finally, after each stream has decoded 5 symbols, we need to check how many bytes to advance the read pointer by, and what the new start position within the byte is. The number of bytes we need to advance the read pointer by is `(bit_pos >> 3) & 7`

, which on the Jaguar, we can compute using a single BEXTR if we can afford a register just to store the constant 0x303, which we can.^{12} We then either add or subtract this from the corresponding `in`

pointer. Finally, we need to clear the bits corresponding to the byte position (that we just took care of) in `bit_pos`

, which is an AND by ~0x38. This keeps the high bits, containing the 11 bitfield length that we need, intact. The actual code below does this masking at the start of the next iteration instead of at the end of the current iteration, but conceptually this belongs with the pointer advance.

And that’s pretty much it. Here’s the full decoder loop, written in NASM. We originally tried to write this in C++ with intrinsics, but that got nasty, so we eventually switched to a real assembler. The original version has the comments laid out differently but I need to fit this into an annoyingly narrow blog CMS theme, so this will look a bit clunky:

; main decode loop ; rax = scratch ; rbx = bitextr0 ; rcx = bitextr1 ; rdx = bitextr2 ; rbp = bextr const ; rsi = table ptr ; rdi = -bytes_left_to_decode ; r8 = in0 ; r9 = in1 ; r10 = in2 ; r11 = bits0 (only live in inner loop) ; r12 = bits1 (only live in inner loop) ; r13 = bits2 (only live in inner loop) ; r14 = decodeend ; r15 = (unused) sub r9, 8 ; in1 -= 8 mov ebx, 0xb00 ; 11 field width mov ecx, 0xb00 mov edx, 0xb00 mov ebp, 0x303 ; for byte step align 16 .bulk_inner: ; non-crossing invariant: in0 <= in2 && in2 <= in1 cmp r8, r10 ja .bulk_done cmp r10, r9 ja .bulk_done ; refill stream 0 ; read next bits0, keep bit offset within byte mov r11, [r8] and ebx, ~0x38 ; refill stream 1 movbe r12, [r9] and rcx, ~0x38 ; refill stream 2 mov r13, [r10] and rdx, ~0x38 %assign i 0 %rep N_DECS_PER_REFILL ; stream 0 ; peek bextr rax, r11, rbx ; consume add bl, [rsi+rax*2] ; grab sym vpinsrb xmm0, xmm0, [rsi+rax*2+1], i+0 ; stream 1 bextr rax, r12, rcx add cl, [rsi+rax*2] vpinsrb xmm0, xmm0, [rsi+rax*2+1], i+1 ; stream 2 bextr rax, r13, rdx add dl, [rsi+rax*2] vpinsrb xmm0, xmm0, [rsi+rax*2+1], i+2 %assign i i+3 %endrep %undef i ; final advances ; num_bytes_step0 bextr rax, rbx, rbp ; in0 += num_bytes_step0 add r8, rax bextr rax, rcx, rbp sub r9, rax bextr rax, rdx, rbp add r10, rax vmovdqu [rdi+r14], xmm0 add rdi, 15 ; loop while bytes_to_decode > 0 js .bulk_inner

That’s the core 3-stream Huffman decoder loop. Time to quit it with the hand-waving and do an actual analysis (if only back of the envelope) to make sure we’re on the right track here.

### Analysis

We already looked at the core decode step earlier and noted that it has 3 macro-ops (I’ll write 3M in the following), and for the backend: 2 integer 0/1 ops (just 2I for short), 2 load unit cycles for aligned loads (2L for short), and 1 FP/SIMD op (1F for short). We do this 5 times per stream. Also per stream is the refill/advance logic, which we now know the instructions for: 1 load for the refill, and 3 integer ALU ops for the byte advance and bitpos update. The load in the refill is almost always unaligned, though. It’s a 64-bit load, and as noted in the introduction, unaligned loads are free if they stay within an aligned 16-byte sector, and cost at least 1 cycle extra when they don’t. Out of the possible load offsets mod 16, 9 (0 through 8 inclusive) stay within a 16-byte sector, the other 7 do not. That’s 7/16=0.4375 odds of at least one cycle extra, and some of those cases (such as crossing cache lines and pages) get more expensive than just adding a cycle. For sanity in the following, let’s just say that we bake this all down to somewhat simpler numbers and expect around 1.44 cycles average case (but probably closer to 1.5 in realistic conditions) for those unaligned refill loads, 2 cycles for a much more pessimistic estimate. In other words, we want to bill the unaligned refill loads at costing more than single aligned load, since the expected number the load pipelines are occupied with them is larger.

Taking that into account, the four instructions involved in refill and advance for a single stream boil down to 4M, 1.44-2L, and 3I.

Then, we have some cross-stream shared instructions: the two compare/jump pairs for our pointer-crossing check at the beginning account for 4M 4I, the final store accounts for 1M 1.44-2S (since it’s also unaligned), and the final `ADD`

/`JS`

pair contribute another 2M 2I to the tally. That’s all instructions in the loop accounted for.

For an overall throughput estimate, we get:

- 15 × 3M (decodes) + 3 × 4M (stream refill/advance) + 7M (shared rest) = 64M total, so 64 macro-ops, enough to occupy the front-end for at least 32 cycles.
- 15 × 2L (decodes) + 3 × 1.44-2L (stream refills) = 34.3-36L total, so the load unit is busy for 34.3-36 cycles.
- 15 × 2I (decodes) + 3 × 3I (stream refills) + 6I (shared) = 45I total, evenly distributes over both integer ALU pipes for 22.5 cycles worth of pressure.
- 15 × 1F (decodes) = 15F total, distributed over both FP/SIMD pipes for 7.5 cycles worth of pressure, so they’re loafing.
- 1.44-2S for 1.44-2 cycles worth of pressure on the store pipe which I assume is sitting on the sidelines munching popcorn.

Purely in terms of pressure on the execution resources, we’re mainly limited by the load pipes which are busy for around 34.5-36 cycles every iteration, closely followed by the frontend which is occupied for at least 32 cycles if everything goes perfectly. 34.3-36 cycles to decode 15 bytes works out to 2.287-2.4 cycles per byte decoded. This is assuming we can ever get throughput-bound to begin with, and is budgeting absolutely no time for L1 cache misses and such.

How does the critical path look? By my reckoning, the most likely candidate takes a freshly updated in pointer from the end of a previous iteration, does an unaligned load to refill which takes 4 cycles for the data to show up, then does 5 back-to-back decodes from that stream which we know have a critical path latency of 5 cycles each, and then finally needs to do a `BEXTR`

on the resulting bitpos followed by an integer add/subtract to produce the next load address. That’s 4 + 5 × 5 + 2 = 31 cycles of critical path latency through the stream decodes, worse if anything bad happens, like extra delays due to page crossings on a load or similar. 31 cycles is close enough to our other 2 limiters for it to be considered a 3-way near-tie. A hitch in the front-end or load pipes or any extra delay along the critical path is likely to end up delaying any given iteration if it occurs. Note I’m purely looking an ideal throughput estimates here, there is no modeling or simulation of machine details going on, all we’re doing is tallying up some figures based on known machine characteristics.

In short, from this rough estimate, we would expect somewhere around 2.3-2.4 cycles per byte for this decoder under very idealized circumstances where there’s not a single cache miss or hitch anywhere along the way, and “a bit worse” (to be decided what that means) when less idealized.

### The rubber hits the road

So what happens when we actually run it?

One of the nice things about coding for game consoles is that the hardware is known and tends to have very predictable, repeatable performance. With that said, here’s stats for the exact loop quoted above running on a PS4 on a synthetic test set (decoding a random stream with a very boring “every symbol is 8 bits” Huffman table, which of course you’d never do, but makes for a test run that’s very easy to validate the results of) 1000 times, and reporting 1st, 50th (median) and 95th percentile cycles per byte:

huff3_jaguar_asm: med 2.28/b, 1st% 2.25/b, 95th% 2.38/b

I swear I did not fudge this in any way, that’s the actual figures I got on a real test run just now. So that much, ahem, very encouraging, to say the least. But what happens when you time it in the middle of an actual Kraken decode of ~250MB of real non-synthetic test data?

SimpleProf :seconds calls count : clk/call clk/count get_array_huff : 0.3879 17689 178617138 : 34946.1 3.46 huff_x64jaguar_loop : 0.3035 31350 178617138 : 15427.6 2.71

Here `get_array_huff`

includes everything the Huffman decoder has to do, including reading the headers, the Huffman table descriptions, validating the code lengths, setting up the tables, the fast decode loops, and the slower near-end-of-data tail decoders, and `huff_x64jaguar_loop`

is just the core optimized decode loop that handles most of the bulk data. “Calls” is the number of calls to either function and “count” is the number of bytes decoded. In this case, 250MB of data “only” decode about 178MB through the Huffman decoders; less than 250MB because we also do LZ-style dictionary compression (not covered in this series). So in this particular real-world use case (which very much does not have all the buffers already nicely in the L1/L2 caches as it’s running), we’re about 17% slower than our ideal average-case throughput estimate for this loop, which is still respectable. Also visible from these two lines is that the core decode loop where around 75-80% of the overall Huffman decoding time is spent with the rest being in setup or tail handling. That is fairly typical for our decoder implementations on various platforms. And you can infer from the figures given that our average array of bytes that use a single Huffman table is about 10k long. For this part, looking at averages is misleading: the distribution is quite wide. Many arrays are 60k+, but many others are well below 3k. The former spend more time in the core decoder (always nice since that part is flat), the latter spend a lot more time proportionately in header parsing and table initialization, which is why we can’t neglect it.

This last part is also why the Jaguar decoder (or, for that matter, all other decoders in Oodle) doesn’t bother with trying to set up tables to decode multiple symbols at once. This sounds enticing but it makes table setup more complicated and slower, and also adds many complications to the decoders because everything emits a variable number of bytes now. For example, using `PINSRB`

to group output bytes would not work if each decode step produce either 1 or 2 bytes; we would need to do individual stores for every symbol, and also increment the destination pointer after every decode. This would add at least 2 instructions per byte decoded (a store and a destination pointer add), probably more. When your single-byte-at-a-time decode kernel runs 3 instructions per byte to begin with and instruction count is a major limiting factor, adding 2 extra more instructions to maybe decode 2 bytes at a time isn’t all that tempting. We can decode another byte in an extra three instructions with the single-byte-at-a-time decoder *guaranteed*, and we don’t need to do any expensive extra work during table setup to do so. We also don’t build up a debt of 2 instructions every time we don’t manage to decode 2 symbols at once that we later have to make up just to break even. Multi-symbol decoding is an old standby when decoding from a single stream, because you can hide a lot of extra work in the shadow of that nasty long critical path, but decoding from multiple streams simultaneously gives you more productive ways to spend those CPU cycles and maybe even get to the holy grail of being primarily throughput bound.

And that’s all I got for this post! I’m not sure which of the other decoder variants I’ll tackle next. Apologies for the long delay, but writing these up takes more effort than my usual blog post, and I need to be in the right headspace to even try doing it.

### Footnotes

[1] If you’re familiar with Intel microarchitectures but not AMD, macro-ops are roughly comparable to what Intel calls “fused-domain micro-ops”, except they’re “even CISCier”, in the sense that even read-modify-write instructions like `add [rdx], rax`

count as a single macro-op, where Intel would split them into a add-from-memory and a store internally. I’ll also add that historically speaking describing AMD macro-ops as similar to Intel fused-domain uops is backwards; AMD has been using “fat” macro-ops as part of their x86 instruction decomposition for a long time, since at least the K7 (original Athlon) architectures. Intel added fused micro-ops (which are more restricted) years later to their microarchitectures when they realized that having the frontend deal with these “chunkier” units was beneficial.

[2] Typical L1D load-to-use times for contemporary designs were 4 or 5 cycles. For that matter, they still are at time of writing, 9 years after Jaguar-based HW hit the shelves. That said, the Jaguars target much lower clock rates than those other designs—the fastest Jaguar descendants ran a bit above 2GHz, other OoO x86 cores from the same timeframe have similar L1D sizes and were typically designed to hit 4GHz or above, so presumably the Jaguar cores can fit a lot more logic into a pipeline stage.

[3] To editorialize even more, the Jaguar’s nearly complete lack of sharp edges and huge performance cliffs was a welcome change after the PS3/Xbox 360 generation, where the main CPU cores seemed at times like they had nothing but.

[4] Presumably due to a problem with the way immediate operand encoding in x86 works. Specifying a bitfield position and width needs at least 6 + 6 = 12 bits to be generally useful on a 64-bit machine. But x86 immediate operands for instructions with 32- or 64-bit operands only come in two sizes: 8 bits and 32 bits. The former is not enough, the latter is very wasteful. 16-bit immediates only exist for 16-bit register instructions, and this part of the encoding would be quite expensive to add exceptions to. Interestingly AMD added a short-lived immediate-operand version of `BEXTR`

that indeed spends a full 32 bits, but this version of the instruction decodes to 2 macro-ops on Jaguar and was removed from Zen 3. It never shipped on any Intel CPU.

[5] All “big core” Intel CPUs that support `BEXTR`

(that I’m aware of, anyway) decode it into 2 uops. The same CPUs can usually eliminate register-register moves during register renaming and would also take 2 uops for a shift-and-mask combination, so `BEXTR`

has never been particularly interesting on them, since it offers at best some minor advantages in the frontend. It’s more interesting on the Intel Atom-derived cores such as the Alder Lake E-cores (which like Jaguar have a 1 uop, 1 cycle version) and AMD Zen cores, though.

[6] Yet another AMD/Intel difference: Intel has been renaming the 8-bit parts of registers such as AL and AH of RAX or R9B of R9 separately for a long time, meaning AL and AH can reside in separate locations in the physical register file. When referencing the merged halves as a single register (such as AX, EAX or RAX) later, Intel CPUs used to either stall (the “partial register stall” of long ago) or, later, started inserting merge operations that combined the results into the instruction stream. AMD has never done this, and instead seems to do partial updates on every operation. That means that on Intel CPUs, code that alternately updates AL and AH can execute as two independent dependency streams, whereas on AMD CPUs all these updates run in series. However, AMD never needs to insert any merge ops either, and has no special penalty for referring to the full register after a partial update, which is handy in our use case: on the Jaguar we’re always concerned with macro-ops through the front-end, so injecting merge operations would suck for us. Good that we don’t get any!

[7] Two streams so that we can alternate decoding from them, because (as seen in this post already and also mentioned in previous parts) sequential decoding from a single bitstream tends to result in very long dependency chains and is a bottleneck. Having pairs of forward and backwards streams with the two meeting in the middle allows us to store a single size in the bitstream to signal the start position of the reverse stream and the combined size as a single value. They also can, in some contexts, act as padding for each other. During decoding, ensuring the pointers don’t cross corresponds to a end-of-buffer check; we don’t know the exact size of either stream up front, but we know that the read pointers may never cross, and once decoding is done they should point to the same location.

[8] Arguably, we should have at least renumbered the streams and swapped labels of streams 1 and 2; but as is often the case, this was quickly prototyped, found to be working, then for a while we were concerned with other things such as buffer overflow safety and such, and by the time we realized it was pretty odd for stream 2 to appear in the bitstream before stream 1, it had long shipped to customers and was very much not worth a format-breaking change to rectify.

[9] Very convenient how the bitstream layout chosen works out so nicely for the most constrained of the important target platforms for Oodle.

[10] Yet another very-much-not-a-coincidence.

[11] This one actually is a coincidence, but I’ll take it.

[12] Yes, the decoder loop keeps 0x303 pinned in a register the entire time it’s running. Long-time friends will realize why this delights me; sometimes the universe just smiles at you like that. This one’s for you, Felix.

BC1-7 (variously also known by the older names S3TC, DXTC, RGTC and BPTC) are the standard compressed texture formats on PC. The newer BC6H and BC7 formats have precise specs that require bit-exact decoding, which means that encoders know exactly what results they’re going to get. The older BC1-5 are a lot more loosely specified, which has been a source of annoyance and problems for BCn encoder authors for some time.

While working on Oodle Texture, I did some experimentation trying to figure out what various GPU hardware decoders actually did. I did get good results for BC1-3 but BC4 and BC5 turned out to be a bit more troublesome. More recently, my colleague Sean Barrett spend some more time on this and we managed to get bit-exact equivalent expressions for Intel and NVidia hardware, but not AMD, where we were off by the last bit on some 32-bit float results in a small number of cases. This is completely irrelevant in practice but it bothered me, so last weekend I spent some more time trying to figure out those results too. I now have satisfying models for all three vendor decoders which match bit-exactly in my tests. That level of precision is unlikely to ever matter to anyone, but along the way there are several common mistakes in BC1-5 implementations and misunderstandings about the format to clear up, so it seems worth it to do a write-up.

### Overview

BC1-7 encode fixed-size blocks of 4×4 pixels to a fixed-size output: 64 bits (8 bytes) per block for BC1 and BC4, 128 bits (16 bytes) for all the other formats. The basic idea is to approximate the colors of pixels in a block by a small number of colors spaced evenly along one or more line segments in RGBA space. Per block, we send the endpoints of the line segment (quantized to a relatively low number of bits), plus an index per pixel that selects which of the colors along the line segment to use. These indices are usually between 2 and 4 bits each. Additionally, almost all of these formats have multiple modes that distribute the values slightly differently. Let’s briefly go over each format in turn:

BC1 sends color endpoints in RGB565 encoding (5 bits for red and blue, 6 bits for green). Somewhat unfortunately, this choice of encoding contains no exact grayscale values other than black and white, which means that slow grayscale gradients in this format tend to oscillate between areas tinged green and areas tinged purple. There are two modes, both of which use a 4-color palette and 2 index bits per pixel to select the palette entry (2×16 bits for the endpoints plus 16×2 bits for 16 2-bit indices, one per pixel, makes 64 bits total). The more commonly used mode uses the two specified endpoint colors, plus another two interpolated colors located 1/3rd and 2/3rds of the way along the line segment between them respectively, for a palette of 4 colors (all 4 colors have 1.0 in the alpha channel). The second mode has only one interpolated color halfway between the two endpoints; the third color has R=G=B=A=0, “transparent black” (in some cases, this is considered a separate “BC1A” format, and stock “BC1” decodes as R=G=B=0 and A=1; this distinction is not available everywhere though). This was intended for use with 1-bit alpha textures. If whoever reads the texture doesn’t care about the alpha channel, this second mode can also be used to get a “free black” even when the line between endpoints goes nowhere near actual black.

BC2 is for textures with an alpha channel that needs more than 1 bit. It consists of what is basically a BC1 color block (except here, all blocks are always in the first mode with 4 interpolated colors; the secondary mode is not available) and 4 bits of alpha per pixel, stored in memory before the BC1-esque color bits. BC2 has existed since the original S3TC, but I’ve never actually seen it used in practice. My guess is that the main reason it’s still around is that it costs very little hardware if you already support BC1 and 3, so despite little practical use, it’s cheap enough to support that nobody has lobbied to get it deprecated.

BC3 is the other format for RGBA textures. Once again, it contains a BC1 color block that is locked into the 4-color interpolation mode, and a different 64-bit encoding for the alpha bits in front. This one specifies two endpoints in the A channel as 8-bit values, followed by 48 bits (3 bits per pixel) to select the palette index. Once again, there are two modes: the more common one has the 8 colors spaced evenly along the line segment, the secondary mode only spaces 6 points along the line and adds two special indices that always decode to 0.0 and 1.0 no matter what the endpoint values are. BC3 is one of the two primary formats used for RGBA texture data.

BC4, unlike the previous formats, is intended for single-channel textures. It is often described as “just a BC3 alpha block”, which is close but not quite true as we’ll see later. There are unsigned and signed variants; the unsigned variants use the regular 8-bit UNorm encoding for values between 0 and 1, the signed variants use 8-bit SNorm for values between -1 and 1. In the signed variant, the special indices in the secondary mode correspond to -1 and +1, respectively.

BC5 also has unsigned and signed variants. A BC5 block is really just two independent BC4 blocks, one for the red and one for the green channel.

BC6H is meant for HDR color data. It only encodes RGB channels, the alpha channel always decodes to 1. There are unsigned and signed variants. The format decodes to 16-bit half-float format and color interpolation works by essentially treating the 16-bit half-floats as a 16-bit sign-magnitude integer value and interpolating those linearly, which works decently for unsigned data, not so well for signed data for blocks containing sign changes. BC6H has 14 modes, sort of, but most of these work essentially the same except for allocating endpoint bits slightly differently, depending on how close the values are to each other. There are two “major” modes in the format, one which specifies 2 sets of line segments (plus a code denoting which pixels use which line segment, from a small set of supported patterns) and has 3 bits of index data per pixel, and one of which specifies just one line segment with 4 bits of index data per pixel. BC6H decoding is complicated but specified bit-exactly so I won’t be talking about it in the rest of this post.

BC7 is a more complicated version of BC3 that adds several more options. It has 8 modes, 4 of which are RGB-only (with A decoding to 1.0), the idea being that even textures with alpha often have large opaque regions, and it helps everything to have more options for coding these color-only regions well. The different modes in BC7 contain various combination of either 1, 2, or 3 different line segments (and corresponding sets of endpoints), once again with the assignment of pixels to line segments chosen from a small, fixed set; either one index per pixel (when RGB and A behave similarly) or two (when they do not); and various index pixel depth, plus several other features that I won’t get into here. BC7 decoding is also complicated but specified bit-exactly and won’t appear in the rest of this post.

### Common (software) mistakes

This is the part that’s probably the most applicable to the majority of readers: before we even get to specification details, many encoders and decoders make two small mistakes in how they implement the basics.

The first and most common one is about the different modes in the BC1 format. The way a BC1 encoder selects which mode it wants a block to use is by sending the coded endpoints in a different order. Note that both color endpoints get packed to RGR565 format. When the first endpoint, interpreted as an unsigned 16-bit integer, is larger than the second one, this selects the more common four-color mode. If the first endpoint is less than or equal to the second, that selects three-color + transparent black mode. Note that when both colors are equal, you have no choice; you always end up in three-color mode (where the 3rd palette entry is special). This is not much of a problem because both endpoints being equal is a degenerate case anyway; all three other palette entries are equal in that situation, so just avoid the third. In all other cases, when the colors you picked result in the wrong order, the encoder can just swap them around and adjust the indices to match. Linear interpolation is symmetric so this does not change the set of available colors. This much, essentially all implementations get right. The tricky part involves BC2 and BC3: BC2 and BC3 blocks *almost* contain regular BC1 blocks, except they don’t have any mode-switching. The color values in BC2/3 always use the four-color mode. (This is pointed out in all specifications of the format, but easy to miss.) Several encoders only ever use the four-color mode to begin with, and swapping endpoints into the order you would need for BC1 never hurts; just be aware that three-color mode is not an option for BC2 or 3, and encoding assuming you have it available will give the wrong results.

The second mistake involves BC4 and 5. BC4 encoders/decoders often take “it’s just like a BC3 alpha block” too far and actually encode or decode unsigned BC4 (or BC5) as BC3 8-bit data. This is not far off, but it’s an approximation. BC4 and BC5 actually decode somewhat differently than BC3 alpha does, with measurably different results and higher internal precision. For most textures this doesn’t matter too much, but the extra precision is definitely useful for content like normal maps. BC5 two-channel normal maps can get up to nearly 11 bits of precision in regions with small value range, as opposed to the at most 8 bits you would get out of regular R8G8B8A8 or BC3 alpha encoding. It’s highly contextual whether this matters or not, and especially for rough surfaces there’s usually not much difference, but it can be useful especially when rendering things with reflective, smooth surfaces like machine parts, cars and so forth.

### What do the specs say?

The premise of this post is that GPU decoders don’t exactly agree with each other, but of course they at least approximately agree or the formats would be useless in practice. Can we say something more about what “approximately agree” means? The two primary sources for this sort of thing in the PC space are the Khronos Data Format Specification and the D3D11 Functional Spec. The Khronos spec is in force for Vulkan and supplants the older OpenGL extensions saying the same thing; as of this writing, the current version is 1.3 and the description of the BC1 formats starts here. It short and to the point, but unfortunately fairly light on details and completely missing any compliance requirements; it specifies what results an ideal decoder should produce, but not how close practical implementations need to be to be considered good enough.

Our secondary primary source is the D3D11 Functional Spec (a descendant of the D3D10 spec and the D3D12 spec is given as differences from D3D11, so this is still in force). The relevant sections start here. In theory, GL/Vulkan devices could do something completely different, but in practice the same hardware is designed to work for both (for obvious reasons), so whichever spec has the more detailed requirements ends up determining the requirements that the hardware is actually trying to satisfy. And the D3D11 version of the spec is a lot more detailed; in particular it has the short section 19.5.2 “Error Tolerance” which I’ll quote verbatim:

Valid implementations of BC formats other than BC6H and BC7 may optionally promote or do round-to-nearest division, so long as they meet the following equation for all channels of all texels:

| generated - reference | < absolute_error + 0.03 *MAX( | endpoint_0 - endpoint_1 |, | endpoint_0_promoted - endpoint_1_promoted | )absolute_error is defined in the description of each format.

endpoint_0, endpoint_1, and their promoted counterparts have been converted to float from either UNORM or SNORM as specified in the Integer Conversion rules. Values that the reference decodes to 0.0, 1.0 or -1.0 must always be exact.

For BC6H and BC7, decompression hardware is required to be bit accurate; the hardware must give results that are identical to the decoder described in this specification.

The part about promotion is clarified in the next 2 subsections and boils down to implementations being allowed to do fixed-point computations at an intermediate wider bit width, as long as values are extended to that intermediate bit width as specified. For the rest, I’ll note that all formats under consideration permit endpoint_0=endpoint_1, in which case the error term reduces to `| generated - reference | < absolute_error`

, so the absolute error term (I’ll get to that in a minute) primarily affects how precise decoding of the endpoint values themselves is. This value is 1/255 for BC1-3, 1/32767 for the signed BC4/BC5 formats, and 1/65535 for unsigned BC4/5 (indicating that BC4/5 have higher internal resolution). The second, relative error term requires that implementations have to be within ±3% of the reference, which is really not a tight bound: even for usual 8-bit texels in a range of [0,255], ±3% works out to ±7 in the integer space (actually 7.65, but when only considering integers, we can round down here) in the worst case when we choose endpoint_0 so that it decodes 0 to and endpoint_1 to decode to 1, or vice versa. This relative error term matters most when the endpoints are far from each other in a given channel, and the rationale for it is that the BC1 quantization is pretty harsh to begin with (you only get 4 values to cover that range). An extra 3% is less than one fifth the expected quantization error from getting to use at most 4 colors, so in that sense, it’s not a big deal. The problem for encoders is that there are often many candidate encodings for a block that have similar error; more possible encodings gives us more chances to get overall errors low if we can manage to find just the right combination of endpoints and indices, including spacing endpoints much further apart than necessary given the range of values in the block because it makes one of the interpolated values land exactly where we need it. Having this relative error term in the mix means we have to play it safe here.

More to the point, having this level of allowed differences between hardware encoders also means that the same texture wil decode differently on different GPUs, even if we’re not trying to do anything tricky. Sometimes blocks legitimately have a large value range because that’s what those blocks’ contents are, and having large differences possible between decoder implementations is a bummer, because we don’t even get an accurate picture of what the error from encoding any particular texture in BC1-5 is; it depends on what GPU it’s decoded on!

The reason I originally looked into this for Oodle Texture was because I wanted tighter error bounds than just the spec requirements. While in theory there is a large space of possible decoder implementations permitted by the spec, in practice there is a very small number of GPU vendors active in the PC space, and once they’ve settled on a particular decoder implementation, they really have no reason to ever change it unless the spec changes again. So I did some testing, and as far as I can tell, indeed neither NVidia, AMD nor Intel seem to have touched their BCn decoder blocks in the last 10 years or so (and why would they?). That means that both in terms of the current marketplace and installed base, there’s mostly just 3 decoders we have to worry about; that excludes really old hardware as well as vendors with a tiny market share, but if nothing else it will give us an idea of what we should really expect.

### Probing what decoders do

Luckily, the BCn formats have many properties that make it fairly tractable to completely capture decoder behavior. For the BC1/BC2-3 color portion, the R, G, B channels are decoded independently from each other, and the pixels don’t influence each other. R and B endpoints are specified with 5 bits of precision and G with 6. That means we can look at all possible combinations of R/B endpoint values with a 128×128 pixel texture (32×32 blocks). In each block, we make sure to use each of the 4 indices at least once, and then write a compute shader that does a `Load`

/ `texelFetch`

operation for every pixel (so that filtering doesn’t confuse the results) and writes the result to a 32-bit float/channel texture. For G we need a texture with 256×256 pixels for 64×64 blocks. That’s a set of 3 small images that completely captures BC1 decoder behavior. For BC3 alpha and BC4 unsigned/signed we have 8-bit endpoints in a single channel, so we need an entire 1024×1024 pixels (256×256 blocks). For BC2 and BC5 we can do some quick tests just to confirm that they behave as expected (BC2: alpha is a 4-bit encoding that decodes as expected, BC5: same as two BC4 blocks, one for R and one for G).

Some more quick tests indicated that:

- As far as I can tell, on all three vendors tested (AMD, NV, Intel), BC1 internally decodes to 8-bit fixed point, which then, depending on whether the format is UNorm or UNorm_sRGB, gets converted to float either using the same conversion the GPU uses for regular 8-bit UNorm textures, or a 256-entry LUT (or maybe LUT + math combination?) used for sRGB->float conversions (hard to be sure, but tested by noting that all values obtained for 8-bit BC1 RGB channels appear in the list of 256 values output by regular UNorm or UNorm_sRGB conversions; if this is not what’s happening the observed behavior appears to be equivalent).
- On all three vendors, BC3 alpha appears to decode to 8-bit fixed point, once again matching the 8-bit UNorm->float conversions, which in turn for all 3 vendors exactly match the results of rounding “x/255” to the nearest floating-point value. Setting e.g. endpoint_0=0 and endpoint_1=1 (8-bit integer values) did
*not*result in any interpolated values inside the open interval (0,1/255). - On all three vendors, BC4 and BC5 decode to more than 8 bits of precision. For Intel and NVidia, all interpolated values for unsigned/signed BC4/5 formats occur on the list of decoded 16-bit UNorm and 16-bit SNorm values, respectively, and they are consistent with first decoding to 16-bit int then converting to float via the UNorm/SNorm conversions. On all three vendors, 16-bit UNorm decode as correctly rounded
`x/65535`

and 16-bit SNorm as correctly rounded`max(x,-32767)/32767`

, respectively. On AMD, the resulting floats do*not*all appear on the list of UNorm16/SNorm16 values; instead the decoder appears to use an internal 14-bit format, details below. Setting e.g. endpoint_0=0, endpoint_1=1 (integer byte values) for BC4 unsigned*does*result in interpolated values inside (0,1/255).

Furthermore, Ignacio Castaño wrote a 2009 blog post on how NVidia GPUs decompress BC1 (with mention of the relevant patent), and it was straightforward to verify that more recent NV GPUs (tested on GeForce 1080 and 2070 series devices) still appear to use the same logic. For Intel, testing was done on a late-2013 laptop with Haswell integrated GPU, and we contacted Tom Forsyth at Intel to confirm that the BCn decoder logic was still essentially unchanged and answer some questions regarding the exact BC4/5 decoder behavior. For AMD, testing was done with an old Radeon HD 7000 series GPU we had kicking around the office as well as game console devkits for the previous and current generation (GCN2 and RDNA2 variants, respectively), the latter on account of them already being on the network and set up for remote testing. (So no pre-GCN AMD GPU in the mix, not sure if that makes a difference).

### What all decoders seem to agree on

For all vendors, the endpoint_0 and endpoint_1 values in all formats (which always appear as palette entries 0 and 1, respectively) appear to be reproduced exactly the same. BC1 as written in the D3D11 functional spec first expands the endpoint values from 5 or 6 bits to 8 bits by replicating the top bits; all three vendors appear to do this or something equivalent, and then convert the result from 8-bit UNorm to float exactly. For BC4/5, likewise palette entries 0 and 1 always seem to exactly make the result of the respective correctly rounded UNorm8 or SNorm8 to float conversion, respectively. Furthermore the D3D11 functional spec requires (quoted above) that the values specified to decode to exactly 0, 1 or -1 indeed decode to these values in all implementations, and this appears to hold.

Taking these all together, this means that for BC1/BC2-3 color, the parts that decoders actually differ in are how exactly the colors 1/3rd and 2/3rds of the way (in four-color mode) or 1/2 of the way (in 3-color mode) are computed, and for BC3 alpha and BC4/5, how the 6 (respectively 4, depending on the mode) actually interpolated values are determined. I’ll cover these vendor by vendor, in order of increasing complexity (of description though not necessarily hardware).

### Intel GPUs

Intel’s decoders are the most straightforward to describe. BC1-3 color is computed in fixed point, from the endpoint values extended to 8 bits using bit replication, using the formula `((256-w)*a + w*b + 128) >> 8`

or something equivalent, i.e. 8-bit fixed-point linear interpolation with round-to-nearest. For the 1/2-of-the-way color in 3-color mode, the weight is w=128 (of course); in 4-color mode, the interpolation weights are w=85 (for the 1/3rd point) and w=171 (for 2/3rds), respectively. These weights are just `round(k*256 / 3)`

for k=1, 2.

BC3 alpha appears to use the weights `round(k*256/7)`

for k=1,…,6 in 6-interpolated color mode, and `round(k*256/5)`

for k=1,…,4 in 4-interpolated-color mode.

BC4, signed and unsigned both, takes the integer UNorm/SNorm endpoint values (for SNorm, -128 is an alternative encoding for -127, they mean the same thing), and use the 16-bit interpolation formula `t = ((65536-w)*a + w*b + 128) >> 8`

to get the interpolation result, where `w=round(k*65536/7)`

, k=1,…,6 in 6-interpolated-color mode, `w=round(k*65536/5)`

, k=1,…,4 in 4-interpolated-color mode.

For BC4/5 unsigned, the GPU then computes `t + (t >> 8)`

; the resulting integer is treated as a 16-bit UNorm value and finally converted to float. For signed, the GPU computes `abs(t) + (abs(t) >> 7) + (abs(t) >> 14)`

with the sign of the original t (i.e. if t was negative, this result is then negated). This number is treated as 16-bit SNorm value and converted to float.

### AMD GPUs

AMDs GPUs still behave in a way that is straightforward to describe, with a wrinkle involving BC4/5. But let’s start with BC1-3: BC1-3 is computed in fixed point, from the endpoint values extended to 8 bits using bit replication, using the formula `((64-w)*a + w*b + 32) >> 6`

or something equivalent, i.e. 6-bit fixed-point linear interpolation with round-to-nearest. This is the exact same computation that is prescribed for interpolation in BC7 decoders, so it likely uses the same hardware. In BC1 3-color mode halfway points, we have w=32 (again, no surprise), and for 4-color mode the weights are 21 and 43, again matching BC7 and equivalent to `w=round(k*64/3)`

for k=1, 2.

BC3 alpha uses the weights `round(k*64/7)`

for k=1,…,6 in 6-color mode, `round(k*64/5)`

for k=1,…4 in 4-color mode.

BC4/5, signed and unsigned both, tales the integer UNorm/SNorm endpoint values (for SNorm, after the -128 -> -127 correction) and uses the same interpolation formula, just without the shift or rounding term: `t = (64-w)*a + w*b`

, again suggesting that it’s probably all using the same hardware. The resulting product fits in a 14-bit integer. Note that by construction, t is either in [-127*64,127*64] for signed values or [0,255*64] for unsigned ones.

For BC4/5 SNorm, this result is then converted to float by effectively doing a correctly rounded division by 127*64 = 8128. This turns out to be somewhat more complicated than the straight division by 127 used for 8-bit SNorm values, mainly due to the extra 6 bits in the input. For BC4/5 UNorm, the result is converted to float by doing an *almost* correctly rounded division by 255*64 = 16320. Once again, this would ordinarily be a bit more complicated than the division by 255 used for UNorm8->float, again due to the extra 6 bits of input; however, here, for whatever reason, be it just a bug or a small hardware optimization that is not clear to me, a very small subset of values (13 total of the 16077 unique possible results for all combinations of inputs) is rounded incorrectly.

This result differs in final mantissa bit of the 32-bit float values only and it truly does not matter for any use case I can imagine since actual BC4/5 encoding-induced errors are many orders of magnitude larger than this, but in case anyone cares about reproducing the results exactly, my hacky conversion function that manually detects the incorrectly-rounded cases is here. (The version for signed decoding is unnecessary because that case is correctly rounded. I thought implementing it might give me an idea of what’s going on with the unsigned version, but no dice.)

OK, so both Intel and AMD are reasonably easy to explain, and probably just reuse hardware that is already there; it’s worth pointing out that BC6H also uses the same interpolation formula as BC7 and works on signed 17-bit integers internally, so there’s probably at least 3 color channels worth of 6×17-bit multipliers lying around somewhere in the decoder path. I was initially very surprised about the 16-bit weights used by the Intel version, but if they just swap the operands between BC4/5 and BC6 and build something like a 8×17-bit multiplier instead, that might be totally fine. (Or maybe they even have their own wider multiplier for BC4/5. Don’t know, I’m just guessing at likely implementation options from the observed behavior here.)

### NVidia GPUs

The formulas used by NVidia GPUs are nothing like the other two vendors, and do not look like they can share any hardware with BC6H/BC7 decoding at all. They do, however, very much look like they’re designed to share as much as possible between BC1-5, and by someone who is dead set to spend as little logic on it as they can reasonably get away with. My best guess is that this design probably predates BC6H/BC7; BC6 forces you into having at least a few rectangular multipliers in the BCn decoder block and once you have them, there’s not much reason to avoid using them.

Anyway, the decoders we’ve seen before use simple linear interpolation formulas with round-to-nearest, or at least give results that are equivalent to them. The NVidia decoders use different formulas for the 5-bit-endpoint red/blue channels in BC1 than they use for the 6-bit-endpoint green channel, and then green/alpha channel logic seems to be the same, with BC4 and BC5 using just the green/alpha channel decoders (and ignoring red/blue). This in turn means that red/blue only ever work from 5-bit inputs, whereas green needs to support both 6-bit and 8-bit endpoints (the latter for use in BC5), and alpha always has 8-bit endpoints (for BC3).

Let’s do the simpler red/blue channels first. Ignacio’s 2009 article gives the formulas for palette entries 0/1 as well, but those are just equivalent ways of writing the 5->8 bit expand using bit replication. The interesting palette entries are 2 and 3, and those are (again, not following Ignacio’s presentation here, but this gives bit-equivalent results):

// red/blue, four-color mode; a, b are 5-bit col2 = ((2*a + b) * 22) >> 3; col3 = ((2*b + a) * 22) >> 3;

The rationale here is that expanding from 5 bits to 8 is effectively multiplying by 255/31, and if we compute (2*a + b) we need to divide the result by 3; the final fraction of 255/(31*3) = 2.741935 is quite close to 11/4, so that’s what it’s approximated by. So this calculation folds bit-expansion and linear interpolation into one and can work with slightly smaller intermediate values as a result (the multiplications by known constants are just shifts and adds).

For red and blue, the halfway point in three-color mode is computed similarly:

// red/blue, three-color mode; a, b are 5-bit col2 = ((a + b) * 33) >> 3;

This is a decent approximation, but somewhat annoyingly, it just behaves differently from the more straightforward AMD/Intel ones; folding the scaling and lerp into one makes the weight constants slightly smaller but also makes the overall rounding behavior different, which is awkward. Oh well.

Anyway, green and alpha is when the fun really starts. Both green and alpha do use a more explicit linear interpolation formulation with round-to-nearest, but the values being multiplied by and the overall way things are put together are still decidedly odd. But let’s start at the beginning. First, BC1 green, four-color mode (again, presentation different from Ignacio’s, but equivalent)

// green, four-color mode, a, b are 6-bit ae = (a << 2) | (a >> 4); // bit expand be = (b << 2) | (b >> 4); diff = be - ae; scaled_diff = 80*diff + (diff >> 2); col2 = ae + ((128 + scaled_diff) >> 8); col3 = be + ((128 - scaled_diff) >> 8);

and in three-color mode:

// green, three-color mode, ae, be as above diff = be - ae; scaled_diff = 128*diff + (diff >> 2); col2 = ae + ((128 * scaled_diff) >> 8);

Here, the effective factor is something like 80.25/256, which is pretty far off (the correct fraction for a real 1/3rd here would be around 85.33/256).

On to BC4/5. Here, our interpolation factor are multiples of 1/7th and 1/5th, respectively. NVidia conceptually splits this into two parts: we first multiply by the small integer numerator, and then by a fixed-point approximation of the reciprocal of 1/5 and 1/7, and all of that is sandwiched into our interpolation formula above.

The set of numbers we might want to multiply by here are thus 0,1,2,3,4,5,6, and 7. For one final trick, note that instead of saying that we want “1/5th of the way from a to b”, we might equivalently say (by symmetry) “4/5ths of the way from b to a”, and NV decoder uses this to avoid doing any multiplies in the “multiply by numerator” portion: of the values 0 to 5 we need for the multiples of one fifth, 0 is 0 (which is very cheap to multiply by indeed) and 1, 2, and 4 are powers of 2 (shifts). Of the remaining numbers, 5=5-0 (so instead of going 5/5ths from a to b, we can go 0/5ths from b to a) and 3=5-2 (turn 3/5ths from a to b into 2/5ths from b to a). For the multiples of 1/7th, it works similarly: 0,1,2,4 are already easy, and for the remaining ones, 7-3=4 (which is easy), 7-5=2 (easy), 7-6=1 (easy), and 7-7=0 (easy). So the HW here can switch “a” and “b” around so that the factor is either 0 or a power of 2, choosing which end of the line segment to interpolate from.

Either way, the interpolation procedure for BC4 in 6-interpolated-color mode works out to:

// BC4; a, b are 8-bit, 6-interp mode ae = expand_to_16(a); be = expand_to_16(b); diff = b - a; // b, a not be, ae! col0 = ae; col1 = be; col2 = ae + 36*(diff); // 1/7 col3 = ae + 36*(diff<<1); // 2/7 col4 = be - 36*(diff<<2); // 3/7 col5 = ae + 36*(diff<<2); // 4/7 col6 = be - 36*(diff<<1); // 5/7 col7 = be - 36*(diff); // 6/7

where `expand_to_16(x) = (x<<8) | x`

for unsigned formats and the way less nice `expand_to_16(x) = copysign((abs(x) * ((1<<14) + (1<<7) + 1) >> 6, x)`

(aside from the sign manipulation this is just shifts and ORs, I just wrote it more compactly).

Here 36=32+4 is a convenient number with just 2 set bits which makes it easy to multiply by with just shifts and adds. The actual factor we would want to approximate for the reciprocal here is 65535/(255 * 7) is about 36.714, so this is not terrible. In 4-interpolated-color mode for BC4, we get:

// BC4; a, b are 8-bit, 4-interp mode ae = expand_to_16(a); be = expand_to_16(b); diff = b - a; // b, a not be, ae! col0 = ae; col1 = be; col2 = ae + 48*(diff); // 1/5 col3 = ae + 48*(diff<<1); // 2/5 col4 = be - 48*(diff<<1); // 3/5 col5 = be - 48*(diff); // 4/5

48 = 32 + 16 is again just 2 set bits, the factor to approximate is 65535/(255*5) = 51.4, so that one’s somewhat dubious, but oh well, it is what it is.

BC5 is just two of those. Finally, for BC3 alpha, we get variation of the BC4 unsigned formulas, but this time we want a 8-bit result not a 16-bit result, so the mystery shift along with the rounding bias is back. Let `scale_and_round(k,d) = ((k*d + (d >> 3) + 128) >> 8)`

, then

// BC3 alpha; a, b are 8-bit, 6-interp mode diff = b - a; col0 = a; col1 = b; col2 = a + scale_and_round(36, diff); col3 = a + scale_and_round(36, diff*2); col4 = b + scale_and_round(36, -diff*4); col5 = a + scale_and_round(36, diff*4); col6 = b + scale_and_round(36, -diff*2); col7 = b + scale_and_round(36, -diff);

and finally:

// BC3 alpha; a, b are 8-bit, 4-interp mode diff = b - a; col0 = a; col1 = b; col2 = a + scale_and_round(48, diff); col3 = a + scale_and_round(48, diff*2); col4 = b + scale_and_round(48, -diff*2); col5 = b + scale_and_round(48, -diff);

And that, at long last, is it.

### Conclusions

If you got here, you have no-one to blame but yourself. If you’re one of the handful of people actively working on BCn encoders (hi!), this is probably actionable information to you. If you read this far out of curiosity or because you want to know exactly what you’re getting on the various GPUs, hopefully now you know. And in the not unlikely scenario that you are someone from the future with a question about weird BCn decoding behavior that I referred to this post, well, hopefully this makes it a bit clearer what’s going on. It’s also hopefully usable if you want a software model of the given three GPU vendors’ decoders without having to juggle multiple machines or video cards around all the time.

If you’re an encoder author and not sure what to do about this, one of the main takeaways from this for me was that even though the D3D spec formulation of BC1 decoding explicitly works on 8-bit integer values and divides by 2 or 3 with truncation, and the Khronos spec treats everything as real numbers with unspecified precision, both Intel and AMD actually use the (somewhat nicer) division with round to nearest, and NVidia uses it as well in the green channel, while in the red/blue channels “it’s complicated” (because of the scaling factor folded into the linear interpolation). My older rygdxt (then turned into stb_dxt) used round-to-nearest, Oodle Texture used to default to the D3D spec behavior of truncating, but between AMD and Intel explicitly rounding and NVidia being somewhere in the middle, the next Oodle Texture release (2.9.5) will probably switch to round to nearest.

Also, instead of using the same weights for red, green, and blue, we are currently planning to use 85/256ths as our new approximation to 1/3 (we used to divide by 3 exactly) in the red and blue channels, and 83/256ths in the green channel, in the spirit of making everyone equally unhappy, to land somewhere between Intel’s choice of 85/256ths, AMDs of 21/64ths = 84/256ths, and NVidias sort-of 80.25/256th’s. It was chosen as a convenient dyadic fraction with not too many significant digits that had OK error estimates over the relevant range for all three vendors; somewhat higher for NV than for AMD or Intel, but they chose this crappy approximation themselves, so that’s on them. (For the halfway point, we use 1/2 with round to nearest, same as AMD and Intel.)

The main reason for this change is that Oodle Texture used to solely encode for the a D3D reference decoder that does not actually ship anywhere; we worked to stay within the relevant tolerances (the 3% relative error etc.) to avoid truly nasty surprises, but at the end of the day nobody ships against the D3D reference rasterizer, they ship on actual GPUs. The new approximate decoder model is not particularly complicated (the main weirdness is the different weights for G and R/B) and targeting it made the errors as computed with the accurate hardware models go down across the board, even though the error as measured against the D3D reference increased (it would have to, seeing as that was the sole metric we used to target before). As of this writing we haven’t done anything about BC3 alpha or BC4/5 yet, but we might change our reference/target decoder for that as well in the foreseeable future.

All three decoders are well within the error bounds required by the D3D spec (and Khronos doesn’t have any particular requirements to begin with). In an ideal world, we’d just pick one of these options (doesn’t matter which) and standardize on it; we have standardized the behavior of pretty much all other numbers formats that GPUs work with and that’s been good for everyone. It wouldn’t do much good for any new software for a while, as long as the multiple HW variants are around, but I don’t see texture memory bandwidth concerns or the BCn formats going away any time soon, the more complex BC6-7 and ASTC already nail down their respective decoder behavior, and it sure would be nice to put the current mess behind us *eventually*, even if it’s another 10 years from now.

Finally, the formulas and explanations given here are collated from data collected over several runs and experiments conducted over a period of one and a half years or so; I tried to unify the presentation somewhat for this blog post but I probably introduced mistakes along the way. If you find any problems, it’s probably a bug. Ping me and I’ll try to fix it.

Many thanks to Sean Barrett who did a lot of the legwork to figure out the last few details of Intels and AMDs BC4/5 decoders, Tom Forsyth who crucially told us that yes indeed the Intel decoders for BC4/5 used full 16-bit weights (something I dismissed as implausible for a good long while), and Ignacio Castaño for his 2009 write-up on the NVidia BCn decoders.

Last time I covered the big picture, so we know the ground rules for the modular entropy coding layer in Oodle Data: bytestream consisting of several independent streams, pluggable algorithms, bytes in and bytes out, and entropy decoding is done as a separate pass, not inlined into the code that consumes the decoded data.

There are good and bad news here: the good news is that the encoders and decoders need to do just one thing, they have homogeneous data to work with, and they ultimately boil down to very simple operations in extremely predictable loops so they can pull out all the stops when it comes to optimization. The bad news is that since they are running on their own, if they’re inefficient, there’s no other useful work being accomplished when the decoders are stalled. Careful design is a necessity if we want to ensure that things go smoothly come decode time.

In this post, I’ll start by looking at Huffman coding, or rather, Huffman decoding; encoding, we won’t have to worry about much, since it’s an inherently easier problem to begin with and the design decisions we’ll make to enable fast decoding will also make encoding faster. Required reading for everything that follows will be my earlier posts “A whirlwind introduction to dataflow graphs” as well as my series “Reading bits in far too many ways” (part 2, part 3). I will assume the contents of these posts as known, so if something is unclear and has to do with bit IO, it’s in there.

### Huffman (de)coding basics

Huffman coding is a form of variable-length coding that in its usual presentation is given a finite symbol alphabet *Σ* along with a positive count *f _{s}* (the

*frequency*) of how often each symbol

*s*occurs in the source. Huffman’s classic algorithm then assigns variable-length binary codes (meaning strings of 0s and 1s) of length

*c*to each symbol that guarantee unique decodability and minimize the number of bits in the coded version of the source; we want to send the symbols from the source in fewer bits, and the way to do so is to assign shorter codes to more frequent symbols. To be precise, it solves the optimization problem

_{s}(i.e. minimize the number of bits sent for the payload, because each *c _{s}*-bit code word occurs

*f*times in the source) for integer

_{s}*c*subject to the constraints

_{s}The first of these constraints just requires that code lengths be positive integers and is fairly self-explanatory, the second requires equality in the Kraft-McMillan inequality which guarantees that the resulting code is both complete and uniquely decodable; in fact, let’s just call the quantity the “Kraft defect”. When *K*<0, the code has redundancy (i.e. there free code space left for assignment), for *K*=0 it is complete, and *K*>0 is over-complete (not uniquely decodable or using more code space than is available).

Huffman’s algorithm to perform this construction is a computer science classic, intuitive, a literal textbook example for greedy algorithms and matroids, and it even gives us not just the sequence of code lengths, but an actual code assignment that achieves those code lengths. Nevertheless, actual Huffman codes are of limited use in applications, primarily due to one big problem: the above problem places no upper bound on the code lengths, and indeed a *n*-symbol alphabet can reach a maximum code length of up to *n*-1 given the right (adversarial) frequency distribution that produces a fully left- or right-leaning tree.^{1} This is bad for both encoders and decoders, which generally *have* to adhere at the very least to limits derived from the choice of target platform (mostly register width) and bit I/O refill strategy (see my long series on the topic mentioned earlier), although it is often desirable to enforce much tighter limits (more on that below). One way to accomplish this is by taking the results of the standard Huffman construction and using heuristics and local rewrite rules on either the tree or the code lengths to make the longest codes shorter at the expense of making some other (more frequent) codes longer than they would otherwise have to be. Another easy heuristic option is to just use the regular Huffman construction and if codes end up over-long, divide all symbol frequencies by 2 while rounding up (so that non-zero symbol frequencies never drop to zero) and keep trying again until success. This works because it keeps the frequencies of more popular symbols approximately correct and in the right relation to each other, while it flattens out the distribution of the less popular symbols, which all end up eventually having symbol frequencies of 1, resulting in a uniform symbol distribution at the bottom end. A non-heuristic, actually optimal option is to directly solve a constrained version of the above optimized problem that adds a maximum codeword length constraint. The standard solution to the resulting combinatorial optimization problem is called the package-merge algorithm.

A related observation is that there is really no particular reason to prefer the code assignment produced by Huffman’s algorithm. Even for cases where no length-limiting is required to fall within usual limits, there are usually many possible code assignments that all result in the exact same bit count for the payload portion; and given an assignment of code length to symbols, it is easy to come up with a codeword assignment that matches those code lengths. In short, the per-symbol code lengths make a difference to how efficient a given code is, but we can freely choose which codes we assign subject to the length constraint, and can do so in a way that is convenient for us.

This has the effect of deciding on a single *canonical* Huffman tree for each valid sequence of code lengths (valid here meaning *K*=0), even in cases where Huffman’s algorithm generates distinct trees and code assignments that happen to have the same code lengths. The actual coded bits are just part of the puzzle; an actual codec needs to transmit not just the coded bits, but also enough information for the decoder to work out what codes were assigned to which symbols, and specifying a canonical Huffman code requires less information than its non-canonical counterpart would.^{2} There are several options as to how exactly we assign codes from code lengths, the most common of which is to assign code words sequentially to symbols in lexicographic order for the (*c _{s}*,

*s*) pairs—that is, code lengths paired up with symbol values, ordered by ascending code length and then symbol value.

^{3}

Using canonical, length-limited codes is a common choice. For example, Deflate (the algorithm used in zlib/ZIP) uses canonical codes limited to a maximum length of 15 bits, and JPEG limits the maximum code length to 16 bits.

In principle, such a data stream can be decoded by reading a bit at a time and walking the Huffman tree accordingly, starting from the root. Whenever a leaf is reached, the corresponding symbol is emitted and the tree walk resets back to the root. However, processing data one bit at a time in this fashion is very slow, and implementations generally avoid it. Instead, decoders try to decode many bits at once. At the extreme, when all code lengths are below some upper bound *N*, a decoder can just “peek ahead” *N* bits into the bitstream, and prepare a table that contains the results of the tree walks for all 2^{N} possible bit patterns; table entries store which symbol was eventually reached and how many bits were actually consumed along the way. The decoder emits the symbol, advances its read cursor by the given number of bits, and reads the next symbol.

This method can be much faster and is easy to implement; however, when *N* is large, the resulting tables will be too, and there are steep penalties whenever table sizes start exceeding the sizes of successive cache levels, since the table lookups themselves are essentially random.^{4} Furthermore, most sources being compressed aren’t static, so Huffman tables aren’t used forever. A typical Huffman table will

last somewhere between 5 and 100 kilobytes of data. For larger *N*, it’s easy for table initialization to take as long, or longer, than actual bitstream decoding with that table, certainly for shorter-lived Huffman tables. The compromise most commonly used in practical implementations is to use a multi-level table scheme, where individual tables cover something like 9 to 12 bits worth of input. All the shorter symbols are decoded in a single table access, but the entries for longer symbols don’t contain a symbol value; instead, they reference a secondary table that resolves the remaining bits (there can be many such secondary tables). The point of Huffman coding is to assign shorter codes to more common symbols, so having to take a secondary table walk is relative uncommon. And of course there are many possible variants of this technique: table walks can do more than 2 levels, and in the other direction, if we have many shorter codes (say at most 4 bits long), then a single access to a 512-entry, 9-bit table can resolve two symbols (or even more) at once. This can accelerate decoding of very frequent symbols, at the expense of a more complicated decoder inner loop and slower table build step.^{5}

### Huffman coding in Oodle

Our older codecs, namely Oodle LZH and LZHLW, used a 15-bit code length limit and multi-level tables, both common choices. The problem with this design is that every symbol decode needs to check whether a second-level table lookup is necessary; a related problem is that efficient bit IO implementations maintain a small buffer that is periodically refilled to contain somewhere between 23 and 57 bits of data (details depend on the choice of bit IO implementation and whether we’re on a 32-bit or 64-bit platform, see the bit-IO series I referenced earlier). When we know that codes are guaranteed to be short enough, we can decode multiple symbols per refill. This is significant because refilling the bit buffer is about as expensive as decoding a symbol itself; being able to always decode two symbols per refill, as opposed to one, has the potential to reduce refill overhead from around 50% of decode time to about 33% on 32-bit targets!

When Charles first prototyped what later turned into Kraken, he was using our standard bit IO implementation, which uses byte-wise refill so post-refill, the 32-bit version guarantees at least 25 usable bits, while for in 64-bit the number is 57 bits, all typical limits.

Therefore, if codes are limited to at most 12 bits, the 32-bit decoder can sustain a guaranteed two symbols per refill; for 64-bit, we can manage 4 decodes per refill (using 48 out of 57 bits). As a side effect, the tables with a 12-bit limit are small enough that even without a multi-level table scheme, everything fits in the L1D cache. And if codes are limited to 11 bits, the 32-bit decoders still manage 2 symbols per refill, but the 64-bit decoders now go up to 5 decodes per refill, which is a nice win. Therefore, the newer Oodle codecs use Huffman codes length-limited to only 11 bits, which allows branch-free decoding of 2 symbols per refill on 32-bit platforms, and 5 symbols per refill on 64-bit targets.^{6}

The other question is, of course, how much limiting the code lengths aggressively in this way costs in terms of compression ratio, to which the answer is: not much, although I don’t have good numbers for the 11-bit limit we eventually ended up with. For a 12-bit limit, the numbers in our tests were below a 0.1% hit vs. the original 15-bit limit (albeit with a note that this was when using proper package-merge, and that the hit was appreciably worse with some of the hackier heuristics). We ended up getting a fairly significant speed-up from this choice, so that slight hit is well worth it.

With these simplifications applied, our core Huffman decoder uses pre-baked table entries like this:

struct HuffTableEntry { uint8_t len; // length of code in bits uint8_t sym; // symbol value };

and the decoder loop looks something like this (32-bit version):

while (!done) { bitbuf.refill(); // can decode up to 25 bits without refills! // Decode first symbol { intptr_t index = bitbuf.peek(11); HuffTableEntry e = table[index]; bitbuf.consume(e.len); output.append(e.sym); } // Decode second symbol { intptr_t index = bitbuf.peek(11); HuffTableEntry e = table[index]; bitbuf.consume(e.len); output.append(e.sym); } }

Note that with a typical bit buffer implementation, the `peek`

operation uses a single bit-wise AND or shift instruction; the table access is a 16-bit load (and load address calculation, which depending on the target CPU might or might not be an addressing mode). The bit buffer `consume`

operation usually takes an integer shift and an add, and finally the “append” step boils down to an integer store, possibly an integer shift to get the symbol value in the right place prior to the store (the details of this very much depend on the target ISA, I’ll get into such details in later posts in this series), and some pointer updating that are easily amortized if the loop is unrolled more than just twice.

That means that in terms of the work we’re doing, we’re already starting out with something around 5-7 machine instructions per symbol, and if we want good performance, it matters a lot what exactly those instructions are and how they relate. But before we get there, there’s a few other details about the resulting dataflow to understand first.

The first and most important thing to note here is what the critical path for the entire decoder operation is: everything is in serialized through the bit buffer access. The `peek`

operation depends on the bit buffer state after the previous `consume`

, the table access depends on the result of the `peek`

operation, and the `consume`

depends on the result of the table access (namely the `len`

field) which completes our cycle. This is the critical dependency chain for the entire decoder, and the actual symbol decoding or output isn’t even part of it. Nothing in this loop depends on or cares about `e.sym`

at all; its load and eventual store to the output buffer should be computed eventually, but there’s no rush. The code length determination, on the other hand, is our primary bottleneck. If the critical computation for `peek`

and `consume`

works out to a single 1-cycle latency integer instruction each (which it will turn out it does on most of the targets we’ll see later in this series), then everything hinges on how fast we can do that table access.

On current out-of-order CPUs (no matter the ISA), the answer usually turns out to be 4 or 5 clock cycles, making the critical path from one symbol decoded to the next either 6 or 7 clock cycles. This is very bad news, because as just pointed out, we’re executing around 5-7 instructions per symbol decoded, total. This means that, purely from the dependency structure, we’re limited to around 1 instruction per cycle average on CPUs that can easily sustain more than 3 instructions per cycle given the right code. Viewed another way, in those 7 cycles per symbol decoded, we could easily run 21 independent instructions (or even more) if we had them, but all we do is something like 6 or 7.^{7}

The obvious solution here is to not have a single bit buffer (or indeed bitstream), but several at once. Once again we were not the first or only ones to realize this, and e.g. Yann Collet’s Huff0 uses the same idea. If the symbols we want to decode are evenly distributed between multiple bitstreams, we can have multiple one-symbol-at-a-time dependency chains in flight at once, and hopefully get to a state where we’re mostly (or even completely) limited by instruction throughput (i.e. total instruction count) and not latency.

Our high-level plan of attack will be exactly that. Huff0 uses 4 parallel bitstreams. For reasons that will become clearer in subsequent parts of this series, Huffman coding in the newer Oodle codecs has two different modes, a 3-bitstream mode and a 6-bitstream mode. Using more bitstreams enables more parallelism but the flipside is that it also requires more registers to store the per-bitstream state and is more sensitive to other micro-architectural issues; the actual counts were chosen to be a good compromise between the desires of several different targets.^{8}

For the rest of this post, I’ll go over other considerations; in the next part of this series, I plan to start looking at some actual decoder inner loops.

### Aside: tANS vs. Huffman

The ANS family of algorithms is the new kid on the block among entropy coders. Originally we meant to use tANS in the original Kraken release back in 2016, but the shipping version of Kraken doesn’t contain it; it was later added as a new entropy coding variant in 2018 along with Leviathan, but is rarely used.

However, one myth that was repeatedly claimed around the appearance of tANS was that it is “faster than Huffman”. While it is true that some tANS decoders are faster than some Huffman decoders, this is mostly an artifact of other implementation choices and the resulting dependency structure of the decoder, so I want to spend a few paragraphs explaining when and why this effect appears, and why despite it, it’s generally not a good idea to replace Huffman decoders with tANS decoders.

I’m not going to explain any of the theory or construction of tANS here; all I’ll note is that the individual per-symbol decode step uses tables similar to a Huffman decode table, and paraphrasing old code from Charles here, the key decode step ends up being:

HuffTableEntry e = table[state]; state = next_state[state] | bitbuffer.get(e.len); output.append(e.sym);

In essence, this is a “lookahead Huffman” decoder, where each stream keeps *N* bits of lookahead at all times. Instead of looking at the next few bits in the bit buffer to figure out the next code length, we already have enough bits to figure out what the next code is stored in our `state`

, and after decoding a symbol we shift out a number of bits corresponding to the length of the code and replenish the bottom of our state from the bitstream.^{9}

The bits come from a different place, and we now have the extra `state`

de-facto shift register in the mix, so why would this run faster, at least sometimes? Again we need to look at the data dependencies.

In the regular Huffman decode, we had a single critical dependency chain with the peek/table lookup/consume loop, as well as the bit-buffer refill every few symbols. In this version of the loop, we get not one but two separate dependency chains:

- The
`state`

update has a table lookup (two in fact, but they’re independent and can overlap), the bit buffer`get`

operation to get the next set of low bits for the next state (usually something like an integer add and a shift), and an OR to combine them. This is not any better than the regular Huffman decoder (in fact it’s somewhat worse in multiple ways), but note that the high-latency memory loads happen*before*any interactions with the bit buffer. - The bit buffer update has the periodic refill, and then the
`get`

operations after the`state`

table lookups (which are a combined`peek`

and`consume`

).

The crucial difference here is that in the regular Huffman decoder, most of the “refill” operation (which needs to know how many bits were consumed, load extra bytes and shift them into place, then do some accounting) is on the same dependency chain as the symbol decoding, and decoding the next symbol after a refill can make no meaningful progress until that refill is complete. In the TANS-style decoder, the refill can overlap with the state-dependent table lookups; and then once those table lookups complete, the updates to the bit buffer state involve only ALU operations. In throughput terms the amount of work per symbol in the TANS-style decoder is not better (in fact it’s appreciably worse), but the critical-path latency is improved, and when not using enough streams to keep the machine busy, that latency is all that matters.

A related technique that muddles this further is that ANS-based coders can use interleaving, where (in essence) we just keep multiple `state`

values around, all of which use the same underlying bitstream. This gives us twice the parallelism: two TANS decoders with their own `state`

are independent, and even making them share the same underlying bitstream keeps much of the advantage, because the high-latency portion (the table lookup) is fully overlapped, and individual reads from the bit buffer (once all the lengths are known) can usually be staggered just 1 cycle apart from each other.

A 2× interleaved tANS decoder is substantially faster than a single-stream Huffman decoder, and has lower incremental cost (a single extra integer state is easier to add and track than the registers required for a second bitstream state). However it also does more work per symbol than a 2-stream Huffman decoder does, so asymptotically it is slower—if you ever get close to the asymptote, that is.

A major disadvantage of tANS-style decoding is that interleaving multiple states into the same bitstream means that there is essentially only a single viable decoder implementation; bits need to be read in the same way and at the same time by all decoders to produce the same results, fast implementations need larger and more expensive to set up tables than a straight Huffman decoder, plus other overheads in the bitstream. In contrast, as we’ll see later, the 6-stream Huffman

coding variant is decoded on some platforms as two 3-stream blocks, which would not easily map to a tANS-style implementation. In the end, the lower asymptotic performance and decreased flexibility is what made us stick with a regular Huffman decoder in Oodle Data, because as it turns out we can get decent wins from decoding the bitstreams somewhat differently on different machines.

### Huffman table initialization

As mentioned several times throughout this post, it’s easy for Huffman table initialization to become a major time sink, especially when tables are large, used for just a short time, or the table builder is unoptimized. Luckily, optimizing the table builder isn’t too hard, and it also gives a nice and intuitive (to me anyway) view of how canonical Huffman codes work in practice.

For now, suppose that decoding uses a MSB-first bit packing convention, i.e. a bit-at-a-time Huffman decoder would read the MSB of a codeword first and keep reading additional less significant bits until the symbol is determined. Although our 11-bit code length limit is quite short, that’s still a large number of bits to work through in an example, so let’s instead consider at a toy scenario with a code length limit of 4 bits and four symbols a, b, c, d with code lengths 1, 3, 2, and 3 bits, respectively.

The code length limit of 4 bits means we have a total space of 16 possible 4-bit patterns to cover; I’ll refer to each such bit pattern as a “slot”. Our 1-bit symbol “a” is the first and shortest, which means it gets assigned first, and as such gets a 1-bit code of “0”, with the next 3 bits irrelevant. In short, all slots with a leading 0 bit correspond to the symbol ‘a’; there’s 2^{4-1} = 8 of them. The next-shortest symbol is ‘c’, which get the next available 2-bit code. All codes starting with 0 are taken; the next available slot starts with a 2-bit pattern of “10”, so that’s what we use. There are 2^{4-2} = 4 slots corresponding to a bit pattern starting “10”, so that’s what we assign to c. The two remaining symbols ‘b’ and ‘d’ get 3-bit codes each. Anything starting with “0” or “10” is taken, so they need to start with “11”, which means our choice is forced: ‘b’ being earlier in the alphabet (since we break ties in code length by which symbol is lower-valued) gets the codes starting “110”, ‘d’ gets “111”, and both get 2^{4-3}=2 table slots each. If we write up the results, we get:

Code | Sym | Len |
---|---|---|

0000 |
a | 1 |

0001 |
a | 1 |

0010 |
a | 1 |

0011 |
a | 1 |

0100 |
a | 1 |

0101 |
a | 1 |

0110 |
a | 1 |

0111 |
a | 1 |

1000 |
c | 2 |

1001 |
c | 2 |

1010 |
c | 2 |

1011 |
c | 2 |

1100 |
b | 3 |

1101 |
b | 3 |

1110 |
d | 3 |

1111 |
d | 3 |

Note that we simply assigned all of the code space in linear order to symbols sorted by their code length and breaking ties by symbol value, where each symbol gets 2^{N–k} slots, where *N* is the overall code length limit and *k* is the code length for the symbol in question. At any step, the next code we assign simply maps to the first few bits of the next available table entry. It’s trivial to discover malformed codes in this form: we need to assign exactly 2^{N} slots total. If any symbol has a code length larger than *N* or we end up assigning more or less than that number of slots total, the code is, respectively, over- or under-complete; the current running total of slots assigned is a scaled version of the number in the Kraft inequality.

Moreover, it should also be obvious how the canonical table initialization for a MSB-first code works out in practice: the table is simply written top to bottom in linear order, with each entry repeated 2^{N–k} times; in essence, a slightly more complicated `memset`

. Eventually, each slot is just duplicated a small number of times, like 4, 2 or just 1, so as code lengths get closer to the limit it’s better to use specialized loops and not treat it as a big memset anymore, but the basic idea remains the same.

This approach requires a list of symbols sorted by code length; for Oodle, the code that decodes Huffman table descriptions already produces the symbol list in this form. Other than that, it is straightforward.

The only wrinkle is that, for reasons to do with the details of the decoder implementation (which will be covered in future posts), the fast Huffman decoders in Oodle do not actually work MSB-first; they work LSB-first. In effect, we produce the table as above, but then reverse all the bit strings. While it is possible to directly build a canonical Huffman decode table in a LSB-first fashion, all the nice (and SIMD-friendly) sequential memory writes we had before now turn into scatters, which is a lot messier. It turns out to be easier and faster to first build a MSB-first table as described above and then apply a separate bit-reverse permutation to the elements to obtain the corresponding LSB-first decoding table. This is extra work, but bit-reverse permutations have a very regular structure and also admit a fast SIMD implementation that ends up being essentially a variant of a matrix transpose operation. In our experiments, the SIMD-friendly (and memset-able) MSB-first table initialization followed by a MSB→LSB bit-reverse was much faster than a direct LSB-first canonical table initialization, and LSB-first decoding saved enough work in the core decode loops to be worth the extra step during table building for the vast majority of our Huffman tables, so that’s what we settled on.

And that’s it for the high-level overview of Huffman decoding in Oodle Data. For the next few upcoming posts, I plan to get into detail about a bunch of the decoder loops we use and the CPUs they are meant for, which should keep us busy for a while!

### Footnotes

[1] An easy way to construct a pathological frequency distribution is to pick the symbol frequencies as Fibonacci numbers. They grow exponentially with *n*, but not so quickly as to make the symbol frequencies required to exceed common practical code length limits impossible to occur for typical block sizes. The existence of such cases for at least certain input blocks means that implementations need to handle them.

[2] There are strictly more Huffman trees than there are canonical Huffman trees. For example, if symbol ‘a’ has a 1-bit code and symbols ‘b’ through ‘e’ all get 3-bit codes, then our corresponding canonical tree for that length distribution might be `(a ((b c) (d e)))`

, but it is equally possible (with the right symbol frequencies) to get say the tree `(a ((b d) (c e)))`

or one of many other possible trees that all result in the same code lengths. The encoded size depends only on the chosen code lengths, and spending extra bits on distinguishing between these equivalent trees is wasteful unless the tree shape carries other useful information, which it can in some applications, but not in usual Huffman coding. Along the same lines, for any given Huffman tree there are in general many possible sets of symbol frequencies that produce it, so sending an encoding of the original symbol frequencies rather than the tree shape they result in is even more wasteful.

[3] When assigning codewords this way and using a MSB-first bit packing convention, the blocks of codes assigned to a given code length form intervals. When codes are limited to *N* bits, that means the length of a variable-length code word in the bitstream can be determined by figuring out which interval a code falls into, which takes at most *N*-1 comparisons against constants of width ≤*N*-1 bits. This allows for different, and often more efficient, implementation options for the latency-critical length determination step.

[4] This is inherent: the entire point of entropy coding is to make bits in the output bitstream approach an actual information content around 1 bit. In the process of Huffman coding, we are purposefully making the output bitstream “more random”, so we can’t rely on locality of reference to give us better cache hit rates here.

[5] Decoding multiple symbols at once may sound like an easy win, but it does add extra work and complications to decoding symbols that are long enough not to participate in pair-at-a-time decoding, and once again the table build time needs to be carefully watched; generally speaking multi-symbol decoding tends to help on extremely redundant input but slows down decoding of data that achieves maybe 7 bits per byte from Huffman coding, and the latter tends to be more common than the former. The actual evaluation ends up far from straightforward and ends up highly dependent on other implementation details. Multi-symbol decoding tends to look good when plugged into otherwise inefficient decoders, but as these inefficiencies are eliminated the trade-offs get more complicated.

[6] No doubt we were neither the first nor last to arrive at that particular design point; with an 8-bit alphabet, the set of viable options for at least two decodes per refill if 32b targets matter essentially boils down to 9b to 12b, with 11b being the first where 5 decodes/refill on 64b targets are possible.

[7] This gap is one of the things that makes multi-symbol decoding look great when plugged into a simple toy decoder. Decoding multiple symbols at once requires extra bookkeeping and instructions, and also introduces extra dependencies between the stores since the number of symbols decoded in each step is no longer known at compile time and folded into an addressing mode, but when we’re only using about a third of the capacity of the machine, inserting extra instructions into the symbol decode is very cheap.

[8] Oodle has the same bitstream format for everyone; while we can (and do) make choices to avoid pathologies on all of the targets we care about, any choice we make that ends up in the bitstream needs to work reasonably well for all targets.

[9] `next_state`

here essentially just tabulates a shift and mask operation; in a “real” tANS decoder this can be more complicated, but when we’re looking at a case corresponding to a Huffman table and symbol distribution, that’s all it does.

April 26, 2016 was the release date of Oodle 2.1.5 which introduced Kraken, so it celebrated its 5-year anniversary recently. A few months later we launched Selkie and Mermaid, which wetre already deep in development at the time of the Kraken release but not quite finished yet, and in early 2018 we added Leviathan, adding a higher-compression option to our current suite of codecs, the “sea beastiary”.

In those 5 years, these codecs have seen wide adoption in their intended market (which is video games).

There’s a few interesting things to talk about, and the 5-year anniversary seems as good a reason as any to get started; this is the first in what will be a series of yet to be determined length, in which I’ll do a deep-dive on one interesting aspect of these codecs, namely the way they handle entropy coding.

Before I can get into any details, first some general notes on how things fit together. Kraken, Mermaid, Selkie, and Leviathan are lossless data compression algorithms, all variations on the basic LZ77 + entropy coding formula that has been the de facto standard in general-purpose compressors since the late 80s, because such codecs can achieve a good balance of compression ratio and compression/decompression speed for practical applications. Other codecs belonging to this family include Deflate (ZIP/gzip), LZX (Amiga LZX/CAB), LZMA (7zip/xz), Zstd, Brotli, LZHAM, and many others, including most of the older Oodle Data codecs (Oodle LZH/LZHLW, LZA, LZNA, and BitKnit).

The “LZ77” portion here refers to the LZ77 algorithm, which, broadly speaking, compresses data by replacing repeated byte sequences in a stream with references to prior occurrences; as long as the back-reference is smaller than the bytes themselves, this will result in compression. Nobody actually uses the original LZ77 algorithm per se (which has a very inefficient encoding by today’s standards), but the general idea remains the same. Some codecs, especially ones designed for faster decoding or encoding, use just this byte/string matching approach without any entropy decoding; this includes many well-known faster codecs such as LZ4, Snappy, LZO, and again many others, including the remaining older Oodle Data

codecs (Oodle LZB/LZBLW and LZNib).

I won’t talk about the LZ77 portion here (that’s its whole own thing), and instead focus on the “entropy coding” portion of things. In essence, what all these codecs do is identify repeated strings, resulting in some mixture of back-references (“matches”) and bytes that couldn’t be profitably matched to anything earlier in the data stream (“literals”). All the codecs mentioned differ in how exactly these things are encoded, but all of them eventually boil it down to one

or more data streams that each have symbols from a finite-size and not too large alphabet (typical sizes include a few dozen to a few hundred distinct symbols in the alphabet), plus maybe a few extra streams that contain raw bits for data that is essentially random.

Entropy coding, then, processes these streams of symbols from a finite alphabet and tries to turn them into an equivalent form that is smaller. There are numerous ways to do this; one well-known method, and indeed a mainstay in data compression to this day (with some caveats), is Huffman coding, which counts how often individual symbols occur and assigns short bit codes to more likely symbols and longer symbols to less likely symbols, in a way that remains unambiguous and uniquely decodable.^{1}

The second major class of techniques is arithmetic coding, which compresses slightly better than Huffman for typical data and can give significant improvements on either small alphabets or highly skewed distributions, and is also more suitable for applications when there is not an a priori known fixed symbol distribution, but rather an on-line model that is being updated as new data is seen (“adaptive coding”).

Finally, the new kid on the block is the ANS (“Asymmetric Numeral System”) family of methods, which is conceptually related to arithmetic coding but works differently, and has different trade-offs. Kraken and Mermaid use ANS sometimes and Leviathan frequently; I’m sure I’ll cover its usage in those codecs at some point, and for now there are plenty of my older writings on the topic from around 2014-2016.

Anyway, it is common for LZ77-derived algorithms with a Huffman or arithmetic coding backend to use odd alphabet sizes; when you’re assigning variable-length codes to everything anyway, it doesn’t really matter whether your alphabet has just 256 distinct byte values in it, or if the limit is 290 or 400 or something else non-power-of-2. It’s likewise common to interleave all kinds of data into a single contiguous bitstream. The sea beastiary family does things a bit differently than most codecs.

Namely, the input (uncompressed) data is processed in 128KiB chunks, which all have self-contained bytestreams with an explicitly signaled size.^{2} These chunk bytestreams are themselves made up out of multiple independent bitstreams for different types of data, and most of these streams (except for a few “misc”/leftover data streams) use a single shared entropy coding layer to handle what we now call “primary streams”.^{3}

This layer works on streams that always use a byte alphabet, i.e. unsigned values between 0 and 255. That means that everything that goes through this layer needs to be presented in a byte-packed format. This kind of restriction is common in fast byte-aligned LZ77 coders without a final entropy coding stage (like LZ4 or Snappy), but atypical outside of that space.

Working in bytes needs some care to be taken in other parts of the codec design, but comes with a massive advantage, because it allows us to have an uncompressed mode that just stores streams as uncompressed bytestreams and doesn’t need any decoding at all. That means that streams that are nearly random, or just short, don’t have to waste CPU time on setting up a more complicated decoder and then decoding bytes individually through a more complicated algorithm when just grabbing uncompressed bytes straight from the bytestream is good enough.

In short, for most codecs, the choice of entropy coding scheme is a design-time decision. Algorithms like Deflate which uses LZ+Huffman *always* perform Huffman coding, even when doing so isn’t actually worthwhile, because they use extended alphabets with more than 256 symbols in them that mean pass-through coding would still need to use larger-than-8-bit units, both introducing extra waste and making it more expensive in terms of CPU time. Sticking with a byte alphabet means that the Oodle sea beastiary codecs can either use a conventional entropy coding stage or choose to send data uncompressed and act like a faster byte-aligned LZ codec. In fact, that is exactly how Selkie works: under the hood, Selkie is the same codec as Mermaid, but with all streams in uncompressed mode at all times (and some other features disabled).

The final big departure from most codecs is that the sea beastiary codecs run all entropy decoding as separate passes that consume an input bytestream with a common header and output an array of bytes. Uncompressed streams can skip the decoding step and consume data straight from the input bytestream. These passes just handle whatever entropy coding scheme is used and nothing else. This means that instead of the choice of entropy coders being fundamentally built into the algorithm, it’s a collection of small loops that can easily be interchanged or added to, and indeed that’s what we’ve been doing; the original version of Kraken shipped in Oodle 2.1.5 with a choice between just uncompressed streams and one particular version of Huffman coding. We’ve added several more options since then, including an alternative Huffman coding variant, an optional RLE pre-pass, TANS, and the ability to mix-and-match all of these within a single stream, all without it being a big redesign, and we’re quite happy with the results.

There’s many interesting things in here that I plan to cover in small installments, starting with the way we do Huffman (de)coding, which is worth several posts on its own. I’ll try to keep updating this at a reasonable cadence, but no promises since my blogging time is pretty limited these days!

### Footnotes

[1] Huffman coding technically refers to such variable-length

coding (meaning not all symbols need to be coded with the same number of bits) only when the code in question is actually generated via Huffman’s algorithm (which minimizes the length of the coded stream given a certain number of assumptions), but in practice most implementations don’t actually use Huffman’s original algorithm which can assign awkwardly long code lengths to rare symbols. Instead the maximum code length is usually capped, which is a different problem and yields codes that are not actually Huffman codes (and are very slightly less efficient), but just general variable-length codes. Having said that, approximately everyone calls it Huffman coding even when the codes are length-limited anyway, and I won’t sweat it either in the following.

[2] The chunks being independent is how we implement what is called “thread-phased” decoding. As is explained further down in the article, Oodle decodes the bitstreams to temporary memory buffers in a first phase, and the chunks can be processed completely independently for this part of the process, making it a prime candidate to move off to a separate thread (or threads). After phase 1 completes, the actual LZ decoding takes place in “phase 2”; this portion can make references to any bytes earlier in the stream, so it must be run in order and can’t as easily be threaded. Customers that want wider parallelism (or easy seeking) usually chop data into fixed-size blocks, typically between 64KiB and 512KiB, that are encoded independently with no back-references permitted across block boundaries. This decreases compression ratio but makes it trivial to have many workers decoding different blocks at the same time.

[3] In the original code, these streams of bytes are, very non-distinctly, just called “arrays”. When specifying the bitstream more formally, a lot of things suddenly needed proper names to avoid confusion, and “arrays” was just too indistinct and overloaded, so “primary streams” was what we settled on. Sometimes additional data streams exist within a chunk for information not contained in the primary streams, and these are therefore called “secondary streams”.

In the previous post, I looked at repeated application of FIR filters on the same signal and laid out why their frequency response is the thing to look at if we’re wondering about their long-term stability under repeated application, because the Discrete-Time Fourier Transform (DTFT) of the filter coefficients corresponds to the eigenvalues of the linear map that convolves a signal with the filter once.

Now, let’s go back to the original problem that Casey asked about: half-pixel (or half-pel) interpolation filters for motion compensation. We know from Casey’s post which of the filters he tried “blow up” under repeated application, and which ones don’t, and we now know some of the theory. The question is, does it track? Let’s find out!

### A small filter zoo: 6 taps

We’ll be looking at the frequency responses of various filters here. As a reminder, what we’re looking at are the values of the function

where the are our filter coefficients, which are real numbers. It’s easy to prove from this form that the function is continuous, 2π-periodic, and has (where the bar denotes complex conjugation, as usual). Therefore the “interesting” part of the frequency response (for our purposes anyway) is contained in the values of for ω in [0,π]. Furthermore, we only care about the “magnitude response”, the absolute values of the (complex) frequency response, which is conventionally plotted in decibels (i.e. effectively on a logarithmic scale); the angle denotes phase response, which is not relevant to our question (and also happens to be relatively boring in this case, because all filters under consideration are symmetric and thus linear-phase FIR filters). Finally I’ll crop the plots I’m showing to only show frequencies from 0 to 0.8π (up to 80% Nyquist frequency) because all filters under consideration sharply decay past that point (if not much earlier), so the portion from 0.8π to π contributes very little information and makes the y axis scaling awkward.

With all that out of the way, let’s start with our first filter: the 2-tap linear interpolation filter [0.5, 0.5] (it’s “bilinear” if you do it in both the X and Y directions, but all filters under consideration are separable and we’re only looking at 1D versions of them, so it’s just straight-up linear interpolation).

The dotted line at 0dB is the line we’re not allowed to cross if we want repeated application to be stable. It’s also the line we want to be *exactly on* for an ideal interpolation filter. As we can see from the diagram, linear interpolation is good at the first part, not so good at the second part: it’s unit gain at a frequency of 0 but immediately rolls off, which is what causes it to over-blur. Not great.

The next filter that Casey looks at jumps from 2 taps all the way to 6: it’s the half-pixel interpolation filter from H.264/AVC. Here’s its magnitude response:

This one is quite different. We can immediately see that it has above-unit gain for a significant fraction of its spectrum, peaking around 0.5π. That tells us it should blow up in Casey’s repeated-filtering test, and indeed it does. On the plus side, it has a much wider passband, and is in general much closer to an ideal interpolation filter than basic linear interpolation is.

The image for the 6-tap Lanczos filter is not that different:

In fact, this one’s qualitatively so similar that it seems fair to assume that the H.264 filter was designed as an approximation to Lanczos6 with reduced-precision coefficients. (It would certainly be plausible, but I haven’t checked.)

Next up, let’s look at Casey’s filter with quantized coefficients (numerator of 32):

Casey’s filter stays at or below unit gain; it has noticeably below-unit gain for much of its passband which is not ideal, but at least its passband is significantly wider than basic linear interpolation, and looks decent out to about 0.5π before it really starts to cut off.

And while we’re at it, here’s Casey’s other 6-tap filter from the second part of his series:

This one *very* slightly pokes above unit gain around 0.5π, but evidently little enough not to blow up when the output signal is quantized to 8 bits after every step, as Casey’s test does. Other than that, we have some passband ripple, but are really quite close to unit gain until at least half-Nyquist, after which our performance starts to deteriorate, as it does for all 6-tap filters.

Our final 6-tap filter in this round is a 6-tap Lagrange interpolator. This one’s not in Casey’s series; its coefficients are [ 0.011719, -0.097656, 0.585938, 0.585938, -0.097656, 0.011719 ]. Lagrange interpolators are a classic family of interpolating filters (closely related to Lagrange polynomials) and have extremely flat passbands:

On the good side, yes, the passband really is flat. The trade-off is that the filter response visibly starts to dip around 0.4π: the price we pay for that flat passband is losing more of the high frequencies than we would with other filters. On the other hand, this type of filter is not going to explode with repeated application. (Really, this is the second Lagrange-type filter I’ve shown, since the initial linear interpolation filter coincides with a 2-tap Lagrange interpolator.)

### 8-tap filters

I’m putting these in their own category: the two extra taps make a significant difference in the attainable filter quality, so they’re not on even footing with the 6-tap contenders. Let’s start with the filter from H.265/HEVC:

Some ripple in the passband and an above-unit peak around 0.6π. Looking at the frequency response, this filter should blow up in Casey’s tests, and indeed it does. However it also manages to pass through frequencies all the way out to nearly 0.7π, not something we’ve seen so far.

Here’s 8-tap Lanczos:

Again qualitatively similar to what we see from the HEVC filter, although I overall like the “real” HEVC filter a bit better than this one. Another filter that we would expect to blow up (and Casey’s testing confirms it does).

At the other extreme, 8-tap Lagrange:

Passband straight as a ruler, but not much use past 0.5π. The good news is that, once again, it’s perfectly stable.

Here’s Casey’s contender:

Again some passband ripple and it’s poking slightly above unity around 0.55π, but it manages good results out to about 0.6π before it starts to cut.

### Conclusions

That’s a couple of different filter types, and at this point we’ve seen enough to start drawing some conclusions, namely:

First, adding more taps gives us frequency responses closer to our ideal, which is not exactly surprising. The trade-offs are that filters with more taps are more expensive to evaluate, and that codecs also care about things other than frequency response. Among other things, we generally want reduced-precision fixed-point approximations to the filter coefficients both for efficiency (particularly in hardware decoders) and to ensure different codec implementations agree (which, for various reasons, gets significantly nastier once floating-point is in the mix).

Second, among the filters I showed, there’s a fairly clear spectrum: at one end, we have the Lagrange interpolators with strictly unit-or-below gain. They are completely stable under repeated application but also tend to lose high frequencies fairly quickly. In the other direction, we have filters like the Lanczos-derived ones or those used in H.264 or HEVC that have wider pass bands but also clearly above-unit gain for at least some frequencies, meaning that frequency content in those regions will grow and ultimately explode under repeated application. Finally, Casey’s filters are somewhere in between; they have wider pass bands than pure Lagrange interpolators but keep their maximum gain close enough to 1 to avoid blow-ups when applied to data that is quantized to 8 bit fixed point.

This last part is something I originally intended to do a more in-depth analysis of, which is the reason this post is so delayed. But I just never felt inspired to actually write this part and didn’t make any headway the 3 times I tried to just sit down and write it *without* inspiration either, so, well, sorry. No proper analysis here. I figured it’s better to at least publish the rest of the article without it.

The gist of it is this: if your samples are quantized to say 8-bit fixed point after every step, it seems plausible that you should be able to get away with slightly above unit gain at some frequencies. Essentially, all the inputs (being quantized) are integers, which means they need to change by at least 0.5 steps in either direction to actually round to a different value. If the gain at a given frequency isn’t high enough to accomplish this change, nothing will happen, and even a filter with above-unit gain for some frequencies that should in theory blow up eventually, won’t. Experimentally this seems to be true and I meant to do a proper proof but as said, you’ll have to live without it for the time being. (I might do another post in the future if I do come up with a nice proof.)

Finally: how much does it matter? Casey’s original posts framed not diverging over repeated application as an essential requirement for a motion interpolation filter in a video codec, but it’s really not that simple.

There’s no question that stability under iteration is desirable, the same way that a perfectly flat response over all frequencies is desirable: ideally we’d like interpolation filters to behave like a perfect all-pass filter. But we care about computational cost and memory bandwidth too, so we don’t get to use infinitely wide filters, which means we can’t get a perfect frequency response. This is somewhat hidden by the cropping I did, but all filters shown decay very sharply above around 80% Nyquist. The stability issue is similar: if we care about stability above all else, there is a known answer, which is to use Lagrange interpolators, which give us perfect stability but have subpar response at the high frequencies.

There is another thing to consider here: in a video codec, motion compensation is part of the coding loop. The encoder knows exactly what the decoder will do with a given frame and set of motion vectors, and will use that result to code further updates against. In short, the interpolation filter is not left to its own devices in a feedback loop. The encoder is continuously monitoring what the current state of the frame in a compliant decoder will be. If there is a build-up of high-frequency energy over time, all that really happens is that at some point, the error will become high enough for the encoder to decide to do something about it, which means sending not just a pure motion vector, but instead a motion vector with a residual (a correction applied to the image data after motion compensation). Said residual is also coded in frequency space (essentially), most commonly using a DCT variant. In short, the end result of using a filter that has resonances in the high frequencies is that a few times per second, the encoder will have to send residuals to cancel out these resonances. This costs extra bits, and correcting errors in the high frequencies tends to be more expensive than for the lower frequencies (image and video codecs can generally code lower frequencies more cheaply than high frequencies).

But suppose we didn’t do that, and instead used say a perfectly stable Lagrange interpolator. It doesn’t have any resonances that would cause the image to blow up, but it does act like a bit of a low-pass filter. In short, instead of steadily gaining energy in the high frequencies, we end up steadily losing energy in the high frequencies. And this too ends up with the encoder having to send high-frequency heavy residuals periodically, this time to add in the missing high frequencies (instead of removing excess ones).

Neither of these is obviously preferable to, or cheaper than, the other. Sending high-frequency heavy residuals is relatively expensive, but we end up having to do it periodically regardless of which type we choose.

That’s not to say that it doesn’t matter at all; it’s just to point out that the actual decision of which type of filter to use in any real codec is not made in a vacuum and depends on other factors. For example, H.264/AVC and HEVC at low bit rates rely aggressively on their deblocking filters, which are essentially low-pass filters applied over block edges. In that context, a somewhat sharpening motion interpolation filter can help mitigate some of the damage, whereas a Lagrange interpolator would make things even more one-sided.

In short, there is no such thing as a single “best” interpolation filter. The decision is made in context and depends on what the rest of the codec does.

Casey Muratori posted on his blog about half-pixel interpolation filters this week, where he ends up focusing on a particular criterion: whether the filter in question is stable under repeated application or not.

There are many things about filters that are more an art than a science, especially where perceptual factors are concerned, but this particular question is both free of tricky perceptual evaluations and firmly in the realm of things we have excellent theory for, albeit one that will require me to start with a linear algebra infodump. So let’s get into it!

### Analyzing iterated linear maps

Take any vector space V over some field and any linear map from that space to itself. An *eigenvector* v of T is a nonzero element of V such that for some – that is, the result of applying the map T to v is a scaled version of v itself. The scale factor λ is the corresponding *eigenvalue*.

Now when we’re iterating the map T multiple times, eigenvectors of T behave in a very simple way under the iterated map: we know that applying T to v gives back a scaled version of v, and then linearity of T allows us to conclude that:

and more generally for any .

The best possible case is that we find lots of eigenvectors – enough to fully characterize the map purely by what it does on its eigenvectors. For example, if V is a finite-dimensional vector space with , then if we can find n linearly independent eigenvectors, we’re golden: we can select a basis entirely made of eigenvectors, and then written in that basis, T will have a very simple form: we will have where for some Q (whose columns contain n linearly independent eigenvectors of T).

That is, in the right basis (made of eigenvectors), T is just a diagonal matrix, which is to say, a (non-uniform) scale. This makes analysis of repeated applications of T easy, since:

and in general

and : viewed in the basis made of eigenvectors, repeated application of T is just repeated scaling, and behaviour over lots of iterations ultimately just hinges on what the eigenvalues are.

Not every matrix can be written that way; ones that can are called *diagonalizable*. But there is a very important class of transforms (and now we allow infinite-dimensional spaces again) that is guaranteed to be diagonalizable: so called self-adjoint transforms. In the finite-dimensional real case, these correspond to symmetric matrices (matrices A such that ). Such transforms are guaranteed to be diagonalizable, and even better, their eigenvectors are guaranteed to be pairwise orthogonal to each other, meaning the transform Q is an orthogonal matrix (a rotation or reflection), which among other things makes the whole process numerically quite well-behaved.

As a small aside, if you’ve ever wondered why iterative solvers for linear systems usually require symmetric (or, in the complex case, Hermitian) matrices: this is why. If a matrix is symmetric, it it diagonalizable, which allows us to build an iterative process to solve linear equations that we can analyze easily and *know* will converge (if we do it right). It’s not that we can’t possibly do anything iterative on non-symmetric linear systems; it just becomes a lot trickier to make any guarantees, especially if we allow arbitrary matrices. (Which could be quite pathological.)

Anyway, that’s a bit of background on eigendecompositions of linear maps. But what does any of this have to do with filtering?

### Enter convolution

Convolution itself is a *bilinear map*, meaning it’s linear in both arguments. That means that if we fix either of the arguments, we get a linear map. Suppose we have a FIR filter f given by its coefficients . Then we can define an associated linear map on a suitable space, say something like (writing for the set of bounded sequences of complex numbers) by setting

.

If this is all a bit dense on notation for you, all I’m doing here is holding one of the two arguments to the convolution operator constant, and trying to at least specify what set our map is working on (in this case, bounded sequences of real numbers).

And now we’re just about ready for the punchline: we now have a linear map from a set to itself, although in this case we’re dealing with infinite sequences, not finite ones. Luckily the notions of eigenvectors (eigensequences in this case) and eigenvalues generalize just fine. What’s even better is that for *all* discrete convolutions, we get a full complement of eigensequences, and we know exactly what they are. Namely, define the family of sequences by:

That’s a cosine wave with frequency ω in the real part and the corresponding sine wave in the imaginary part, if you are so inclined, although I much prefer to stick with the complex exponentials, especially when doing algebra (it makes things easier). Anyway, if we apply our FIR filter f to that signal, we get (this is just expanding out the definition of discrete convolution for our filter and input signal, using the convention that unqualified summation is over all values of k where the sum is well-defined)

There’s very little that happens here. The first line is just expanding the definition; then in the second line we use the properties of the exponential function (and the linearity of sums) to pull out the constant factor of . And it turns out the entire rest of the formula doesn’t depend on n at all, so it turns into a constant factor for the whole sequence. It does depend on f and ω, so we label it . The final line states exactly what we wanted, namely that the result of applying to is just a scaled copy of itself—we have an eigensequence (with eigenvalue ).

Also note that the formula for the eigenvalue isn’t particularly scary either in our case, since we’re dealing with a FIR filter f, meaning it’s a regular finite sum:

Oh, and there’s one more minor detail I’ve neglected to mention so far: that’s just the discrete-time Fourier transform (DTFT, not to be confused with the DFT, although they’re related) of f. Yup, we started out with a digital FIR filter, asked what happens when we iterate it a bunch, did a brief detour into linear algebra, and ended up in Fourier theory.

Long story short, if you want to know whether a linear digital filter is stable under repeated application, you want to look at its eigenvalues, which in turn are just given by its frequency response. In particular, for any given frequency ω, we have exactly three options:

- . In this case, the amplitude at that frequency is preserved exactly under repeated application.
- . If the filter dampens a given frequency, no matter how little, then the amplitude of the signal at that frequency will eventually be driven to zero. This is stable but causes the signal to degrade. Typical interpolation filters tend to do this for the higher frequencies, which is why signals tend to lose such frequencies (in visual terms, get blurrier) over time.
- . If a filter amplifies any frequency by more than 1, even by just a tiny bit, then any signal containing a nonzero amount of that frequency will eventually blow up.

The proof for all three cases is simply observing that k-fold application of the filter f to the signal will result in the signal . To generalize this to a wider class of signals (not just complex exponentials) we would need to represent said signals as sum of complex exponentials, which is exactly what Fourier series are all about; so it can be done, but I won’t bother with the details here, since they’re out of the scope of this post.

Therefore, all you need to know about the stability of a given filter under repeated application is contained in its Fourier transform. I’ll try to do another post soon that shows the Fourier transforms of the filters Casey mentioned (or their magnitude response anyway, which is what we care about) and touches on other aspects such as the effect of rounding and quantization, but we’re at a good stopping point right now, so I’ll end this post here.

Hash tables are the most popular probabilistic data structure, by quite a margin. You compute some hash code then compute an array index from that hash code. If you’re using open addressing-class schemes, you then have a rule to find other array indices that a value with the given hash code might be found at; with separate chaining, your array entries point to the head of a linked list, or the root of a binary tree, or whatever other data structure you prefer for the given use case.

No matter what exactly you do, this entire class of schemes always gives you a Las Vegas algorithm, meaning this type of hash table will always retain values for all the keys you inserted, but you’re gambling on how long a lookup takes.

An alternative is to enforce a strict limit P≥1 on the number of probes performed (for open addressing) or the size of any of the secondary data structures (for separate chaining). The result is a “forgetful dictionary”: keys are allowed to disappear, or at least become unretrievable. That’s quite a limitation. In return, the worst-case lookup cost becomes bounded. In short, we now have a Monte Carlo algorithm: we’re now allowed to fail lookups (keys can disappear over time), but runtime variability is significantly reduced.

Let’s call this data structure a “cache table”, both because it’s useful for caching the results of queries and because it matches the most common organization of caches in hardware. We’ve been using that name at RAD for a while now, and I’ve also seen it used elsewhere, so it’s likely someone independently came up with the same name for the same thing, which I take to be a good sign.

In conventional hash tables, we have different probing schemes, or separate chaining with different data structures. For a cache table, we have our strict bound P on the number of entries we allow in the same bucket, and practical choices of P are relatively small, in the single or low double digits. That makes the most natural representation a simple 2D array: N rows of hash buckets with P columns, each with space for one entry, forming a N×P grid. Having storage for all entries in the same row right next to each other leads to favorable memory access patterns, better than most probe sequences used in open addressing or the pointer-heavy data structures typically used in separate chaining.

To look up a key, we calculate the row index from the hash code, using something like `row = hash % N`

. Then we check whether there is a matching entry in that row, by looping over all P columns. That’s also why you don’t want P to get too large. What I’ve just described matches the operation of a P-way set associative cache in hardware exactly, and we will sometimes call P the number of “ways” in the cache table.

P=1 is the simplest case and corresponds to a direct-mapped cache in hardware. Each entry has exactly one location in the cache array where it can go; it’s either there or it’s not present at all. Inserting an entry means replacing the previous entry at that location.

For P≠1, there are multiple choices of which item to replace on insertion; which one is picked is determined by the *replacement policy*. There are many candidates to choose from, with different implementation trade-offs; a decent option that doesn’t require any extra metadata stored alongside each row is to use random replacement, i.e. just picking a (pseudo-)random column within the given row to evict on every insertion. “Hang on”, you might say, “aren’t hash codes pseudo-random already?”. Yes, they are, but you want to use a random number generator independent of your hash function here, or else you’re really just building a direct-mapped cache with N×P rows. One of the benefits of keeping P entries per row is that it allows us to have two different keys with identical hash values (i.e. a hash collision) in the same table; as per the birthday problem, even with a well-distributed hash, you are likely to see collisions.

So what is this type of data structure useful for? I’ve come across two classes of use cases so far:

- Caching queries, as noted above. It’s used extensively to build (memory) caches in hardware, but it’s useful in software too. If you’re trying to cache results of operations in software, there is often a desire to keep the size of the cache bounded and have lookups take a predictable time; cache tables deliver both and are generally simpler than trying to manually limit the number of live entries in a regular hash table.
- Approximate searching tasks. For example, they’re quite useful in LZ77 match finding – and have been used that way since at least the late 80s (a 1986 patent covering the case P=1, now expired, was the subject of a rather famous lawsuit) and early 90s (P≠1), respectively.

This is a rehash of something I wrote in a forum post something like 10 years ago. It turns out that forum prohibited archiving in its `robots.txt`

so it’s not on the Internet Archive. The original operator of said forum hosted a copy (with broken formatting) of the original posts for a while, but at a different URL, breaking all links. And now that copy’s gone too, again with archiving disabled apparently. Sigh.

I got asked about this yesterday; I don’t have a copy of my original derivation anymore either, but here’s my best attempt at reconstructing what I probably wrote back then, and hopefully it won’t get lost again this time.

Suppose you’re given a unit quaternion and a vector . Quaternions are a common rotation representation in several fields (including computer graphics and numerical rigid-body dynamics) for reasons beyond the scope of this post. To apply a rotation to a vector, one computes the quaternion product , where is implicitly identified with the quaternion with real (scalar) part 0 and as its imaginary part, and denotes the conjugate of . Such quaternions with a real part of 0 are also referred to as “pure imaginary” quaternions. Anyway, the result of the above product is another pure imaginary quaternion, corresponding to the rotated vector.

This is all explained and motivated elsewhere, and I won’t bother doing so here. Now generally, you often want to apply the same transformation to many vectors. In that case, you’re better off turning the quaternion into the equivalent 3×3 rotation matrix first. You can look up the formula elsewhere or work it out from the expression above with some algebra. That’s not the topic of this post either.

But sometimes you really only want to transform a single vector with a quaternion, or have other reasons for not wanting to expand to an explicit 3×3 (or larger) matrix first, like for example trying to minimize live register count in a shader program. So let’s look at ways of doing that.

### The direct way

First, I’m going to split quaternions in their real (scalar) and imaginary (vector) parts, writing where is the real part and imaginary. The conjugate of is given by .

The product of two quaternions and is given by

where denotes the usual dot product and the cross product. If you’re not used to seeing this in vector notation, I recommend using this (or something more abstract like geometric algebra) for derivations; writing everything in terms of scalars and the basis elements makes you miss the forest for the trees.

Anyway, armed with this formula, we can compute our final product without too much trouble. Let’s start with the part:

Not so bad. Now we have to multiply it from the right by , which I’ll do in multiple steps. First, let’s take care of the real part, by multiplying our just-computed values for with using the general multiplication formula above:

because the first two dot products are identical and the cross product of and is perpendicular to . This proves that is indeed a pure imaginary quaternion again, just like the we started out with.

Nice to know, but of course we’re actually here for the vector part:

which so far has used nothing fancier than the antisymmetry of the cross product .

If we pull out and name the shared cross product, we get

which is the direct expression for the transformed vector from the formula. This is what you get if you just plug everything into the formulas and apply basic algebraic simplifications until you run out of obvious things to try (which is exactly what we did).

In terms of scalar operation count, this boils down to two cross products at 6 multiplies and 3 additions (well, subtractions) each; one 3D dot product at 3 multiplies and 2 additions; 3 vector-by-scalar multiplications at 3 multiplies each; two scalar multiplies (to form and , although the latter can also be computed via addition if you prefer); and finally 3 vector additions at 3 adds each. The total operation count is thus 26 scalar multiplies and 17 additions, unless I miscounted. For GPUs, a multiply-add “a*b+c” generally counts as a single operation, and in that case the scalar operation count is 9 scalar multiplies and 17 scalar multiply-adds.

### Stand back, I’m going to try algebra!

We can do better than that. So far, we haven’t used that is a unit quaternion, meaning that . We can thus plug in for , which yields:

This might look worse, but it’s progress, because we can now apply the vector triple product identity

with , , , leaving us with:

The two remaining terms involving both multiply by two, so we use a slightly different shared temporary for the final version of our formula:

Final scalar operation count: without multiply-adds, the two cross products sum to 12 multiplies and 6 adds, scaling t by two takes 3 multiplies (or adds, your choice), and the final vector-by-scalar multiply and summation take 3 multiplies and 6 adds, respectively. That’s 18 multiplies and 12 adds total, or 15 multiplies and 15 adds if you did the doubling using addition.

Either way, that’s a significant reduction on both counts.

With multiply-adds, I count a total of 3 multiplies and 9 multiply-adds for the cross products (the first cross product has nothing to sum to, but the second does); either 3 multiplies or 3 adds (your choice again) for the doubling; and another 3 multiply-adds for the portion. That’s either 6 multiplies and 12 multiply-adds or 3 multiplies, 3 adds and 12 multiply-adds. Furthermore some GPUs can fold a doubling straight into the operand without counting as another operation at all; in that case we get 3 multiplies and 12 multiply-adds.

For comparison, multiplying a vector by a 3×3 rotation matrix takes 9 multiplies and 6 additions (or 3 multiplies plus 6 multiply-adds). So even though this is much better than we started out with, it’s generally still worthwhile to form that rotation matrix explicitly if you plan on transforming lots of vectors by the same quaternion, and aren’t worried about GPU vector register counts or similar.

“Rate-distortion optimization” is a term in lossy compression that sounds way more complicated and advanced than it actually is. There’s an excellent chance that by the end of this post you’ll go “wait, that’s it?”. If so, great! My goal here is just to explain the basic idea, show how it plays out in practice, and maybe get you thinking about other applications. So let’s get started!

### What does “rate-distortion optimization” actually mean?

The mission statement for lossless data compression is pretty clear. A lossless compressor transforms one blob of data into another one, ideally (but not always) smaller, in a way that is reversible: there’s another transform (the decoder) that, when applied to the second blob, will return an *exact* copy of the original data.

Lossy compression is another matter. The output of the decoder is usually at least slightly different from the original data, and sometimes very different; and generally, it’s almost always possible to take an existing lossily-compressed piece of data, say a short video or MP3 file, make a slightly smaller copy by literally deleting a hundred bytes from the middle of the file with a hex editor, and get another version of the original video (or audio) file that is still clearly recognizable yet degraded. Seriously, if you’ve never done this before, try it, especially with MP3s: it’s quite hard to mangle a MP3 file in a way that will make it not play back anymore, because MP3s have no required file-level headers, tolerate arbitrary junk in the middle of the bitstream, and are self-synchronizing.

With lossless compression, it makes sense to ask “what is the smallest I can make this file?”. With lossy compression, less so; you can generally get files far smaller than you would ever want to use, because the data is degraded so much it’s barely recognizable (if that). Minimizing file size alone isn’t interesting; we want to minimize size while simultaneously maximizing the quality of the result. The way we do this is by coming up with some error metric that measures the distance between the original image and the result the decoder will actually produce given a candidate bitstream. Now our bitstream has two associated values: its length in bits, the (bit) *rate*, usually called R in formulas, and a measure of how much error was introduced as a result of the lossy coding process, the *distortion*, or D in formulas. R is almost always measured in bits or bytes; D can be in one of many units, depending on what type of error metric is used.

Rate-distortion optimization (RDO for short) then means that an encoder considers several possible candidate bit streams, evaluates their rate and distortion, and tries to make an optimal choice; if possible, we’d like it to be *globally* optimal (i.e. returning a true best-possible solution), but at the very least optimal among the candidates that were considered. Sounds great, but what does “better” mean here? Given a pair and another pair , how do we tell which one is better?

### The pareto frontier

Suppose what we have 8 possible candidate solutions we are considering, each with their own rate and distortion scores. If we prepare a scatter plot of rate on the x-axis versus distortion on the y-axis, we might get something like this:

The thin, light-gray candidates aren’t very interesting, because what they have in common is that there is at least one other candidate solution that is strictly better than them in every way. That is, some other candidate point has the same (or lower) rate, and also the same (or lower) distortion as the grayed-out points. There’s just no reason to ever pick any of those points, based on the criteria we’re considering, anyway. In the scatterplot, this means that there is at least one other point that is both to the left (lower rate) and below (lower distortion).

For any of the points in the diagram, imagine a horizontal and a vertical line going through it, partitioning the plane into four quadrants. Any point that ends up in the lower-left quadrant (lower rate and lower distortion) is clearly superior. Likewise, any point in the upper-right quadrant (higher rate and higher distortion) is clearly inferior. The situation with the other two quadrants isn’t as clear.

Which brings us to the other four points: the three fat black points, and the red point. These are all points that have no other points to the left and below them, meaning they are pareto efficient. This means that, unlike the situation with the light gray points, we can’t pick another candidate that improves one of the metrics without making the other worse. The set of points that are pareto efficient constitutes the *pareto frontier*.

These points are not all the same, though. The three fat black points are not just pareto efficient, but are also on the convex hull of the point set (the lower left contour of the convex hull here is drawn using blue lines), whereas the red point is not. The points that are both pareto and on the convex hull of the pareto frontier are particularly important, and can be characterized in a different way.

Namely, imagine taking a straightedge, angling it so that it’s either horizontal, “descending” (with reference to our graph) from left to right, or vertical, and then sliding it slowly “upwards” from the origin without changing its orientation until it hits one of the points in our set. It will look something like this:

The shading here is meant to suggest the motion of the green line; we keep sliding it up until it eventually catches on the middle of our three fat black points. If we change the angle of our line to something else, we can manage to hit the other two black points, but not the red one. This has nothing to do with this particular problem and is a general property of convex sets: any vertices of the set must be extremal in some direction.

Getting a bit more precise here, the various copies of the green line I’ve drawn correspond to lines

and the constraints I gave on the orientation of the line boil down to (with at least one being nonzero). Sliding our straightedge until we hit a point corresponds to the minimization problem

for a given choice of w_{1} and w_{2}, and the three black points are the three possible solutions we might get for our set of candidate points. So we’ve switched from purely minimizing rate or purely minimizing distortion (both of which, in general, tend to give somewhat pathological results) towards minimizing some linear combination of the two with non-negative weights; and doing so with various choices of the weights w_{1} and w_{2} will allow us to sweep out the lower left convex hull of the pareto frontier, which is often the “interesting” part (although, as the red point in our example illustrates, this process excludes some of the points on the pareto frontier).

This does not seem particularly impressive so far: we don’t want to purely minimize one quantity or the other, so instead we’re minimizing a linear combination of the two. That seems it would’ve been the obvious next thing to try. But looking at the characterization above does at least give us some idea on what looking at these linear combinations ends up doing, and exactly what we end up giving up (namely, the pareto points not on the convex hull). And there’s another massively important aspect to consider here.

### The Lagrangian connection

If we take our linear combination above and divide through by w_{1} or w_{2} (assuming they are non-zero; dividing our objective by a scalar constant does not change the results of the optimization problem), respectively, we get:

which are essentially the Lagrangians we would get for continuous optimization problems of the form “minimize R subject to D=const.” (L_{1}) and “minimize D subject to R=const.” (L_{2}); that is, if we were in a continuously differentiable setting (for data compression we usually aren’t), trying to solve the problems of either minimizing bit rate while hitting a set distortion target or minimizing distortion within a given bit rate limit woud lead us to study the same type of expression. Generalizing one step further, allowing not just equality but also inequality constraints (i.e. rate or distortion *at most* a certain value, rather then requiring exact match) still leads to essentially the same functions, this time by way of the KKT conditions.

In this post, I intentionally went “backwards” by starting with the Lagrangian-esque expressions and then mentioning the connection to continuous optimization problems because I want to emphasize that this type of linear combination of the different metrics arises naturally, even in a fully discrete setting; starting out with Lagrange or KKT multipliers would get us into technical difficulties with discrete decisions that do not necessary admit continuously differentiable objectives or constraint functions. Since the whole process makes sense even without explicitly mentioning Lagrange multipliers at all, that seemed like the most natural way to handle it.

### What this means in practice

I hopefully now have you convinced that looking at the minima of the linear combination

is a sensible thing to do, and both our direct derivation and the two Lagrange multiplier formulations for continuous problems I mentioned lead us towards it. Neither our direct derivation nor the Lagrange multiplier version tells us what to set our mystery weight parameters to, however. In fact, the Lagrange multiplier method flat-out tells you that for every instance of your optimization problem, there *exist* the right values for the Lagrange multipliers that correspond to an optimum of the problem you’re interested in, but it doesn’t tell you how to get them.

However, what’s nice is that the same thing also works in reverse, as we saw earlier with our line-sweeping procedure: picking the angle of the line we’re sliding around corresponds to picking a Lagrange multiplier. No matter which one we pick, we are going to end up finding an optimal point for *some* trade-off, namely one that is pareto; it just might not be the exact trade-off we wanted.

For example, suppose we decide that a single-variable parameterization like in the Lagrange multiplier scenario is convenient, and we pick something like L_{1}, namely w_{1} fixed at 1, w_{2} = λ. We were assuming from the beginning that we have some method of generating candidate solutions; all that’s left to do is to rank them and pick a winner. Let’s start with λ=1, which leads to a solution with some (R,D) pair that minimizes . Note it’s often useful to think of these quantities with units attached; we typically measure R in bits and [D] is whatever it is, so the unit of λ must be [λ] = bits/[D], i.e. λ is effectively an exchange rate that tells us how many units of distortion are worth as much as a single bit. We can then look at the R and D values of the solution we got back. If say we’re trying to hit a certain bit rate, then if R is close to that target, we might be happy and just stop. If R is way above the target bit rate, we overshot, and clearly need to penalize high distortions less; we might try another round with λ=0.1 next. Conversely, if say R is a few percent below the target rate, we might try another round with slightly higher lambda, say λ=1.02, trying to penalize distortion a bit more so we spend more bits to reduce it, and see if that gets us even closer.

With this kind of process, even knowing nothing else about the problem, you can systematically explore the options along the pareto frontier and try to find a better fit. What’s nice is that finding the minimum for a given choice of parameters (λ in our case) doesn’t require us to store all candidate options considered and rank them later; storing all candidates is not a big deal in our original example, where we were only trying to decide between a handful of options, but in practice you often end up with huge search spaces (exponential in the problem size is not unheard of), and being able to bake it down to a single linear function is convenient in other ways; for example, it tends to work well with efficient dynamic programming solutions to problems that would be infeasible to handle with brute force.

### Wait, that’s it?

Yes, pretty much. Instead of trying to purely minimize bit rate or distortion, you use a combination and vary λ to taste to hit your targets. Often, you know enough about your problem space to have a pretty good idea of what values λ should have; for example, in video compression, it’s pretty common to tie the λ used when coding residuals to quantizer step size, based on the (known) behavior of the quantizer and the (expected) characteristics of real-world signals. But even when you don’t know anything about your data, you can always use a search process for λ as outlined above (which is, of course, slower).

Now the one thing to note in practice is that you rarely use just a single distortion metric; for example, in video coding, it’s pretty common to use one distortion metric when quantizing residuals, a different one for motion search, and a third for overall block mode decision. In general, the more frequently something is done (or the bigger the search space is), the more willing codecs are to make compromises with their distortion measure to enable more efficient searches. In general, good results require both decent metrics and doing a good job exploring the search space, and accepting some defects in the metrics in exchange for a significant increase in search space covered in the same amount of time is often a good deal.

But the basic process is just this: measure bit rate and distortion (in your unit of choice) for every option you’re considering, and then rank your options based on their combined (or , which is a different but equivalent parameterization) scores. This gives you points on the lower left convex hull of the pareto frontier, which is what you want.

This applies in other settings as well. For example, the various lossless compressors in Oodle are, well, lossless, but they still have a bunch of decisions to make, some of which take more time in the decoder than others. For a lossless codec, measuring “distortion” doesn’t make any sense (it’s always 0), but measuring decode time does; so the Oodle encoders optimize for a trade-off between compressed size and decode time.

Of course, you can have more parameters too; for example, you might want to combine these two ideas and do joint optimization over bit rate, distortion, and decode time, leading to an expression of the type with two Lagrange multipliers, with λ as before, and a second multiplier μ that encodes the exchange rate from time units into bits.

Either way, the details of this quickly get complicated, but the basic idea is really quite simple. I hope this post de-mystifies it a bit.

(Continued from part 2. “A whirlwind introduction to dataflow graphs” is required reading.)

Last time, we saw a whole bunch of different bit reader implementations. This time, I’ll continue with a few more variants, implementation considerations on pipelined superscalar CPUs, and some ways to use the various degrees of freedom to our advantage.

### Dependency structure of bit readers

To get a better feel for where the bottlenecks in bit decoding are, let me restate some of the bit reading approaches we’ve covered in the previous parts again in our pseudo-assembly language, and then we can have a look at the corresponding dependency graphs.

Let’s start with variant 3 from last time, but I’ll do a LSB-first version this time:

refill3_lsb: rBytesConsumed = lsr(rBitPos, 3); rBitPtr = rBitPtr + rBytesConsumed; rBitBuf = load64LE(rBitPtr); rBitPos = rBitPos & 7; peekbits3_lsb(count): rBits = lsr(rBitBuf, rBitPos); rBitMask = lsl(1, count); rBitMask = rBitMask - 1; rBits = rBits & rBitMask; // result consume3_lsb(count): rBitPos = rBitPos + count;

Note that if `count`

is a compile-time constant, the computation for `rBitMask`

can be entirely constant-folded. Peeking ahead by a constant, fixed number of bits then working out from the result how many bits to actually consume is quite common in practice, so that’s what we’ll do. If we do a refill followed by two peek/consume cycles with the consume count being determined from the read bits “somehow”, followed by another refill (for the next loop iteration), the resulting pseudo-asm is like this:

// Initial refill rBytesConsumed = lsr(rBitPos, 3); // Consumed 0 rBitPtr = rBitPtr + rBytesConsumed; // Advance 0 rBitBuf = load64LE(rBitPtr); // Load 0 rBitPos = rBitPos & 7; // LeftoverBits 0 // First decode (peek count==19) rBits = lsr(rBitBuf, rBitPos); // BitsRemaining 0 rBits = rBits & 0x7ffff; // BitsMasked 0 rCount = determineCount(rBits); // DetermineCount 0 rBitPos = rBitPos + rCount; // PosInc 0 // Second decode rBits = lsr(rBitBuf, rBitPos); // BitsRemaining 1 rBits = rBits & 0x7ffff; // BitsMasked 1 rCount = determineCount(rBits); // DetermineCount 1 rBitPos = rBitPos + rCount; // PosInc 1 // Second refill rBytesConsumed = lsr(rBitPos, 3); // Consumed 1 rBitPtr = rBitPtr + rBytesConsumed; // Advance 1 rBitBuf = load64LE(rBitPtr); // Load 1 rBitPos = rBitPos & 7; // LeftoverBits 1

And the dependency graph looks dishearteningly long and skinny:

Ouch. That’s averaging less than one instruction per cycle, and it’s all in one big, serial dependency chain. Not depicted in this graph but also worth noting is that the 4-cycle latency edge from “Load” to “BitsRemaining” is a recurring delay that will occur on every refill, because the computation of the updated rBitPtr depends on the decode prior to the refill having been completed. Now this is not a full decoder, since I’m showing only the parts to do with the bitstream IO (presumably a real decoder also contains code to, you know, actually decode the bits and store the result somewhere), but it’s still somewhat disappointing. Note that the DetermineCount step is a placeholder: if the count is known in advance, for example because we’re reading a fixed-length field, you can ignore it completely. The single cycle depicted in the graph is OK for very simple cases; more complicated cases will often need multiple cycles here, for example because they perform a table lookup to figure out the count. Either way, even with our optimistic single-cycle single-operation DetermineCount step, the critical path through this graph is pretty long, and there’s very little latent parallelism in it.

Does variant 4 fare any better? The primitives look like this in pseudo-ASM:

refill4_lsb: rNext = load64LE(rBitPtr); rNextSh = lsl(rNext, rBitCount); rBitBuf = rBitBuf | rNextSh; // Most instruction sets don't have a subtract-from-immediate // but do have xor-by-immediate, so this is an advantageous // way to write 63 - rBitCount. (This works since we know that // rBitCount is in [0,63]). rBitsAdvance = rBitCount ^ 63; rBytesAdvance = lsr(rBitsAdvance, 3); rBitPtr = rBitPtr + rBytesAdvance; rBitCount = rBitCount | 56; peekbits4_lsb(count): rBitMask = lsl(1, count); rBitMask = rBitMask - 1; rBits = rBitBuf & rBitMask; // result consume4_lsb(count): rBitBuf = lsr(rBitBuf, count); rBitCount = rBitCount - count;

the pseudo-code for our “refill, do two peek/consume cycles, then refill again” scenario looks like this:

// Initial refill rNext = load64LE(rBitPtr); // LoadNext 0 rNextSh = lsl(rNext, rBitCount); // NextShift 0 rBitBuf = rBitBuf | rNextSh; // BitInsert 0 rBitsAdv = rBitCount ^ 63; // AdvanceBits 0 rBytesAdv = lsr(rBitsAdv, 3); // AdvanceBytes 0 rBitPtr = rBitPtr + rBytesAdv; // AdvancePtr 0 rBitCount = rBitCount | 56; // RefillCount 0 // First decode (peek count==19) rBits = rBitBuf & 0x7ffff; // BitsMasked 0 rCount = determineCount(rBits); // DetermineCount 0 rBitBuf = lsr(rBitBuf, rCount); // ConsumeShift 0 rBitCount = rBitCount - rCount; // ConsumeSub 0 // Second decode rBits = rBitBuf & 0x7ffff; // BitsMasked 1 rCount = determineCount(rBits); // DetermineCount 1 rBitBuf = lsr(rBitBuf, rCount); // ConsumeShift 1 rBitCount = rBitCount - rCount; // ConsumeSub 1 // Second refill rNext = load64LE(rBitPtr); // LoadNext 1 rNextSh = lsl(rNext, rBitCount); // NextShift 1 rBitBuf = rBitBuf | rNextSh; // BitInsert 1 rBitsAdv = rBitCount ^ 63; // AdvanceBits 1 rBytesAdv = lsr(rBitsAdv, 3); // AdvanceBytes 1 rBitPtr = rBitPtr + rBytesAdv; // AdvancePtr 1 rBitCount = rBitCount | 56; // RefillCount 1

with this dependency graph:

That’s a bunch of differences, and you might want to look at variant 3 and 4 in different windows side-by-side. The variant 4 refill does take 3 extra instructions, but we can immediately see that we get more latent instruction-level parallelism (ILP) in return:

- The variant 4 refill splits into three dependency chains, not two.
- The LoadNext for the second refill can start immediately after the AdvancePtr for the first refill, moving the load off the critical path for the second and subsequent iterations. Variant 3 has a 6-cycle latency from the determination of the final
`rBitPos`

in the first iteration to a refilled`rBitBuf`

; in variant 4, that latency shrinks to 2 cycles (one shift and an OR). In other words, while the refill takes more instructions, most of them are off the critical path. - The consume step in variant 4 has two parallel computations; in variant 3, the
`rBitPos`

update is critical and feeds into the shift in the next “peek” operation. Variant 4 has a single shift (to consume bits) on the critical path to the next peek; as a result, the latency between two subsequent decodes is one cycle less in variant 4: 3 cycles instead of 4.

In short, this version trades a slight increase in refill complexity for a noticeable latency reduction of several key steps, provided it’s running on a superscalar CPU. That’s definitely nice. On the other hand, the key decode steps are still very linear. We’re limited by the latency of a long chain of serial computations, which is a bad place to be: *if possible, it’s generally preferable to be limited by throughput (how many instructions we can execute), not latency (how fast we can complete them)*. Especially so if most of the latency in question comes from integer instructions that already have a single cycle of latency. Over the past 30 years, the number of executions units and instructions per cycle in mainstream CPU parts have steadily, if slowly, increased. But if we want to see any benefit from this, we need to write code that has a use for these extra execution resources.

### Multiple streams

As is often the case, the best solution to this problem is the straightforward one: if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once? And indeed, this is much better; there’s not much point to showing a graph here, since it’s literally just two copies of a single-stream graph next to each other. Even with a very serial decoder like variant 3 above, you can come a lot closer to filling up a wide out-of-order machine as long as you use enough streams. To a first-order approximation, using N streams will also give you N times the latent ILP—and given how serial a lot of the direct decoders are, this will translate into a substantial (usually not quite N-times, but still very noticeable) speed-up in the decoder on wide-enough processors. So what’s the catch? There are several:

- Using multiple streams is a change to the bitstream format, not just an implementation detail. In particular, in any long-term storage format, any change in the number of bitstreams is effectively a change in the protocol or file format.
- You need to define how to turn the multiple streams into a single output bytestream. This can be simple concatenation along with a header, it can be some form of interleaving or a sophisticated framing format, but no matter what it ends up being, it’s an increase in complexity (and usually also in storage overhead) relative to producing a single bitstream that contains everything in the order it’s read.
- For anything with short packets and low latency requirements (e.g. game packets or voice chat), you either have to interleave streams fairly finely-grained (increasing size overhead), or suffer latency increases.
- Decoding from N streams in parallel increases the amount of internal state in the decoder. In the decoder variants shown above, a N-wide variant needs N copies of
`rBitBuf`

,`rBitPos`

/`rBitCount`

and`rBitPtr`

, at the very least, plus several temporary registers. For N=2 this is usually not a big deal, but for large counts you will start to run out of registers at least on some targets. There’s relatively little work being done on any given individual data item; if values get spilled from registers, the resulting loads and stores tend to have a very noticeable cost and will easily negate the benefit from using more streams.

In short, it’s not a panacea, but one of the usual engineering trade-offs. So how many streams should you use? It depends. At this point, for *anything* that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine. Having two streams adds a relatively small amount of overhead to the bitstream format (to signal the start of the data for stream 2 in every coding unit, or something equivalent), needs a modest amount of extra state for the second bit decoder, and tends to result in sizeable wins on pretty much any current CPU.

Using more than 2 streams can be a significant win in tight loops that do nothing but bitstream decoding, but is overkill in most other cases. Before you commit to a specific (high) number, you ideally want to try implementations on at least a few different target devices; a good number on one device may be past a big performance cliff on another, and having that kind of thing enshrined in a protocol or file format is unfortunate.

### Aside: SIMD? GPU?

If you use many streams, can you use SIMD instructions, or offload work to a GPU? Yes, you can, but the trade-offs get a bit icky here.

Vectorizing the simple decoders outlined above directly is, generally speaking, not great. There’s not a lot of computation going on per iteration, and operations such as refills end up using gathers, which tend to have a high associated overhead. To hide this overhead, and the associated latencies, you generally still need to be running multiple instances of your SIMD decoder in parallel, so your total number of streams ends up being the number of SIMD lanes times two (or more, if you need more instances). Having a high number of streams may be OK if all your targets have good wide SIMD support, but can be a real pain if you need to decode on at least one that doesn’t.

The same thing goes for GPUs, but even more so. With single warps/wavefronts of usually 16-64 invocations, we’re talking *many* streams just to not be running a kernel at quarter utilization, and we generally need to dispatch multiple warps worth of work to hide memory access latency. Between these two factors, it’s easy to end up needing well over 100 parallel streams just to not be stalled most of the time. At that scale, the extra overhead for signaling individual stream boundaries is definitely not negligible anymore, and the magic numbers are different between different GPU vendors; striking a useful compromise between the needs of different GPUs while also retaining the ability to decode on a CPU if no suitable GPU is available starts to get quite tricky.

There are techniques to at least make the memory access patterns and interleaving overhead somewhat more palatable (I wrote about this elsewhere), but this is an area of ongoing research, and so far there’s no silver bullet I could point at and just say “do this”. This is definitely a challenge going forward.

### Tricks with multiple streams

If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams, with a header telling you how long the individual pieces are, here pictured for 3 streams:

Also pictured are the initial stream bit pointers before reading anything (pointers in a C-like or assembly-like setting; if you’re using something higher-level, probably indices into a byte slice). The beginning of stream 0 is implicit—right after the end of the header—and the end of the final stream is often supplied by an outer framing layer, but the initial positions of `bitptr1`

and `bitptr2`

need to be signaled in the bytestream somehow, usually by encoding the length of streams 0 and 1 in the header.

One thing I haven’t mentioned so far are bounds checks. Compressed data is normally untrusted since all the channels you might get that data from tend to be prone to either accidental (error during storage or in transit) or intentional (malicious attacker trying to craft harmful data) corruption, so careful input validation is not optional. What this usually boils down to in practice is that every load from the bitstream needs to be guarded by a range check that guarantees it’s in bounds. The overhead of this can be reduced in various ways. For example, one popular method is to unroll loops a few times and check at the top that there are enough bytes left for worst-case number of bytes consumed in the number of unrolled iterations, then only dropping to a careful loop that checks every single byte access at the very end of the stream. I’ve written about another useful technique before.

But why am I mentioning this here? Because it turns out that with multiple streams laid out sequentially, the overhead of bounds checking can be reduced. A direct range check for 3 streams that checks whether there are at least K bytes left would look like this:

// This needs to happen before we do any loads: // If any of the streams are close to exhausted // (fewer than K bytes left), drop to careful loop if (bitend0 - bitptr0 < K || bitend1 - bitptr1 < K || bitend2 - bitptr2 < K) break;

But when the three streams are sequential, we can use a simpler expression. First, we don’t actually need to worry about reading past the end of stream 0 or stream 1 as long as we still stay within the overall containing byte slice. And second, we can relax the check in the inner loop to use a much weaker test:

// Only check the last stream against the end; for // other streams, simply test whether an the read // pointer for an earlier stream is overtaking the // read ponter for a later stream (which is never // valid) if (bitptr0 > bitptr1 || bitptr1 > bitptr2 || bitend2 - bitptr2 < K) break;

The idea is that `bitptr1`

starts out pointing at `bitend0`

, and only keeps increasing from there. Therefore, if we ever have `bitptr0`

> `bitptr1`

, we know for sure that something went wrong and we read past the end of stream 0. That will give us garbage data (which we need to handle anyway), but not read out of bounds, since the checks maintain the invariant that `bitptr0`

≤ `bitptr1`

≤ `bitptr2`

≤ `bitend2 - K`

. A later careful loop should use more precise checking, but this variant of the test is simpler and doesn’t require most of the `bitend`

values to be reloaded in every iteration of our decoding loop.

Another interesting option is to reverse the order of some of the streams (which flips endianness as a side effect), and then glue pairs of forward and backward streams together, like shown here for streams 1 and 2:

I admit this sounds odd, but this has a few interesting properties. One of them is that it shrinks the amount of header space somewhat: in the image, the initial stream pointer for stream 2 is the same as the end of the buffer, and if there were 4 streams, the initial read pointers for stream 2 and 3 would start out in the same location (but going opposite directions). In general, we only need to denote the boundaries between stream *pairs* instead of individual streams. Then we let the decoder run as before, checking that the read cursors for the forward/backward pair don’t cross. If everything went right, once we’ve consumed the entire input stream, the final read cursors in a forward/backward pair should end up right next to each other. It’s a bit strange in that we don’t know the size of either stream in advance, just their sum, but it works fine.

Another consequence is that there’s no need to keep track of an explicit end pointer in the inner decoder loop if the final stream is a backwards stream; the pointer-crossing check takes care of it. In our running example, we’re now down to

// Check for pointer crossing; if done right, we get end-of-buffer // checks for free. if (bitptr0 > bitptr1 || bitptr1 > bitptr2) break;

In this version, `bitptr0`

and `bitptr1`

point at the next byte to be read in the forwards stream, whereas `bitptr2`

is offset by -K to ensure we don’t overrun the buffer; this is just a constant offset however, which folds into the memory access on regular load instructions. It’s all a bit tricky, but it saves a couple instructions, makes the bitstream slightly smaller and reduces the number of live variables in a hot loop, with the savings usually being larger the cost of a single extra endian swap. With a two-stream layout, generating the second bitstream in reverse also happens to be convenient on the encoder side, because we can reserve memory for the expected (or budgeted) size of the combined bitstream without having to guess how many bytes end up in either half; it’s just a regular double-ended stack. Once encoding is done, the two parts can be compacted in-place by moving the second half downwards.

None of these properties are a big deal in and of themselves, but they make for a nice package, and a two-stream setup with a forwards/backwards pair is now our default layout for most parts in most parts of the Oodle bitstream (Oodle is a lossless data compression library I work on).

Between the various tricks outlined so far, the size overhead and the extra CPU cost for wrangling multiple streams can be squeezed down quite far. But we still have to deal with the increased number of live variables that multiple streams imply. It turns out that if we’re willing to tolerate a moderate increase in critical path latency, we can reduce the amount of state variables per bit reader, in some cases while simultaneously (slightly) reducing the number of instructions executed. The advantage here is that we can fit more streams into a given number of working registers than we could otherwise; if we can use enough streams that we’re primarily limited by execution *throughput* and not critical path latency, increasing said latency is OK, and reducing the overall number of instructions helps us increase the throughput even more. So how does that work?

### Bit reader variant 5: minimal state, throughput-optimized

The bit reader variants I’ve shown so far generally split the bit buffer state across two variables: one containing the actual bits and another keeping track of how many bits are left in the buffer (or, equivalently, keeping track of the current read position within the buffer). But there’s a simple trick that allows us to reduce this to a single state variable: the bit shifts we use always shift in zeros. If we turn the MSB (for a LSB-first bit buffer) or the LSB (for a MSB-first bit buffer) into a marker bit that’s always set, we can use that marker to track how many bits we’ve consumed in total come the next refill. That allows us to get rid of the bit count and the instructions that manipulate it. That means one less variable in need of a register, and depending on which variant we’re comparing to, also fewer instructions executed per “consume”.

I’ll present this variant in the LSB-first version, and this time there’s an actual reason to (slightly) prefer LSB-first over MSB-first.

const uint8_t *bitptr; // Pointer to current byte uint64_t bitbuf = 1ull << 63; // Init to marker in MSB void refill5_lsb() { assert(bitbuf != 0); // Count how many bits we consumed using a "leading zero // count" instruction. See notes below. int bits_consumed = CountLeadingZeros64(bitbuf); // Advance the pointer bitptr += bits_consumed >> 3; // Refill and put the marker in the MSB bitbuf = read64LE(bitptr) | (1ull << 63); // Consume the bits in this byte that we've already used. bitbuf >>= bits_consumed & 7; } uint64_t peekbits5_lsb(int count) { assert(count >= 1 && count <= 56); // Just need to mask the low bits. return bitbuf & ((1ull << count) - 1); } void consume5_lsb(int count) { bitbuf >>= count; }

This “count leading zeros” operation might seem strange and weird if you haven’t seen it before, but it happens to be something that’s useful in other contexts as well, and most current CPU architectures have fast instructions that do this! Other than the strangeness going on in the refill, where we first have to figure out the number of bits consumed from the old marker bit, then insert a new marker bit and do a final shift to consume the partial bits from the first byte, this is like a hybrid between variants 3 and 4 from last time.

The pseudo-assembly for our running “refill, two decodes, then another refill” scenario goes like this: (not writing out the marker constant explicitly here)

// Initial refill rBitsConsumed = clz64(rBitBuf); // CountLZ 0 rBytesAdv = lsr(rBitsConsumed, 3); // AdvanceBytes 0 rBitPtr = rBitPtr + rBytesAdv; // AdvancePtr 0 rNext = load64LE(rBitPtr); // LoadNext 0 rMarked = rNext | MARKER; // OrMarker 0 rLeftover = rBitsConsumed & 7; // LeftoverBits 0 rBitBuf = lsr(rMarked, rLeftover); // ConsumeLeftover 0 // First decode (peek count==19) rBits = rBitBuf & 0x7ffff; // BitsMasked 0 rCount = determineCount(rBits); // DetermineCount 0 rBitBuf = lsr(rBitBuf, rCount); // Consume 0 // Second decode rBits = rBitBuf & 0x7ffff; // BitsMasked 1 rCount = determineCount(rBits); // DetermineCount 1 rBitBuf = lsr(rBitBuf, rCount); // Consume 1 // Second refill rBitsConsumed = clz64(rBitBuf); // CountLZ 1 rBytesAdv = lsr(rBitsConsumed, 3); // AdvanceBytes 1 rBitPtr = rBitPtr + rBytesAdv; // AdvancePtr 1 rNext = load64LE(rBitPtr); // LoadNext 1 rMarked = rNext | MARKER; // OrMarker 1 rLeftover = rBitsConsumed & 7; // LeftoverBits 1 rBitBuf = lsr(rMarked, rLeftover); // ConsumeLeftover 1

The refill has 7 integer operations, the same as variant 4 (“looakhead”) above, and 3 more than variant 3 (“bit extract”), while the decode step takes 3 operations (including the `determineCount`

step), one fewer than variants 3 (“bit extract”) and 4 (“lookahead”). The latter means that we equalize with the regular bit extract form in terms of instruction count when we perform at least 3 decodes per refill, and start to pull ahead if we manage more than 3. For completeness, here’s the dependency graph:

Easily the longest critical path of the variants we’ve seen so far, and very serial indeed. It doesn’t help that not only do we not know the load address early, we also have several more steps in the refill compared to the basic variant 3. But having the entire “hot” bit buffer state concentrated in a single register (`rBitBuf`

) during the decodes means that we can afford *many* streams at once, and with enough streams that extra latency can be hidden.

This one definitely needs to be deployed carefully, but it’s a powerful tool when used in the right place. Several of the fastest (and hottest) decoder loops in Oodle use it.

Note that with this variation, there’s a reason to stick with the LSB-first version: the equivalent MSB-first version needs a way to count the number of *trailing* zero bits, which is a much less common instruction, although it can be synthesized from a leading zero count and standard arithmetic/logical operations at acceptable extra cost. Which brings me to my final topic for this post.

### MSB-first vs. LSB-first: the final showdown

Throughout this 3-parter series, I’ve been continually emphasizing that there’s no *major* reason to prefer MSB-first or LSB-first for bit IO. Both are broadly equivalent and have efficient algorithms. But having now belabored that point sufficiently, if we can make both of them work, which one should we choose?

There are definitely differences that push you into one direction or another, depending on your intended use case. Here are some you might want to consider, in no particular order:

- As we saw in part 2, the natural MSB-first
`peekbits`

and`getbits`

implementations run into trouble (of the undefined-behavior and hardware-actually-behaving-in-surprising-ways kind) when`count == 0`

, whereas with the natural LSB-first implementation, this case is unproblematic. If you need to support counts of 0 (usefol for e.g. variable-length codes), LSB-first tends to be slightly more convenient. Alternatives for MSB-first are a rotate-based implementation (which has no problems with 0 count) or using an extra shift, turning`x >> (64 - count)`

into`(x >> 1) >> (63 - count)`

. - MSB-first coding tends to have a big edge for universal variable-length codes. Unary codes can be decoded quickly via the aforementioned “count leading zero” instructions; gamma codes and the closely related Exp-Golomb codes also admit direct decoding in a fairly slick way; and the same goes for Golomb-Rice codes and a few others. If you’re considering universal codes, MSB-first is definitely handier.
- At the other extreme, LSB-first coding often ends up slightly cheaper for the table-based decoders commonly used when a code isn’t fixed as part of the format; Huffman decoders for example.
- MSB-first meshes somewhat more naturally with big-endian byte order, and LSB-first with little-endian. If you’re deeply committed to either side in this particular holy war, this might drive you one way or the other.

Charles and me both tend to default to MSB-first but will switch to LSB-first where it’s a win on multiple target architectures (or on a single important target).

### Conclusion

That’s it for both this post and this mini-series; apologies for the long delay, caused by first a surprise deadline that got dropped in my lap right as I was writing the series originally, and then exacerbated by a combination of technical difficulties (alas, still ongoing) and me having gotten “out of the groove” in the intervening time.

This post ended up longer than my usual, and skips around topics a bit more than I’d like, but I really didn’t want to make this series a four-parter; I still have a few notes here and there, but I don’t want to drag this topic out much longer, not on this general level anyway. Instead, my plan is to write about some more down-to-earth case studies soon, so I can be less hand-wavy, and maybe even do some actual assembly-level analysis for an actual real-world CPU instead of an abstract idealized machine. We’ll see.

Until then!