For BC6H encoding in Oodle Texture, we needed a sensible error metric to use in the encoder core. BC6H is an HDR format, which is more challenging than LDR data since we need to handle vast differences in magnitude.
BC6H internally essentially treats the float16 bits as 16-bit integers (which is a semi-logarithmic mapping) and works with those integers throughout. It’s a bit more complicated than just grabbing the float bits, but not in a way that really matters here. The most straightforward approach is to do the same thing in the encoder, pretend we care about those 16-bit integers, and just work with squared errors etc. on those, which is easy to do and also quite fast.
This works reasonably well with white(-ish) light and not very oversaturated colors, but degrades horribly especially with bright, colored emissive light. My colleague Jon Olick wrote a post showing some examples back when we originally released Oodle Texture. The problem with working on log-space RGB data boils down to this: say we have some bright red glowing thing and the RGB tuples for a pixel (in whatever scaling we choose) are something like (2.34, 0.02, 0.01). Working with the logarithms per-channel in this scenario tells us that getting the value in the green channel off by 0.001 would be about as bad as getting the value in the red channel off by 0.1, so we end up producing large net brightness shifts to avoid imperceptible hue shifts, a bad choice.
We still would prefer something squared error-ish in the encoder core: the derivative of a squared error is linear, so minimizing a squared error by finding its critical points turns into solving a linear system. We find ourselves doing this hundreds of times for millions of pixels each, so a simple linear system is good.
I’ll cut to the chase: the main distance metric we use in Oodle Texture for HDR data is what we internally call MRSSE, “Mean Relative Sum of Squared Errors” (somewhat unfortunately different from the similar but distinct notion by the same name used in statistics). In essence we wanted something like our usual squared error (aka SSE, Sum of Squared Errors, or SSD, Sum of Squared Differences), but normalized to be more like a relative than an absolute error, so we used

||x - y||^2 / (||x||^2 + ||y||^2)
The numerator is self-explanatory; the denominator was chosen so that it is nonzero whenever either vector is nonzero, which avoids divisions by zero (in our case, there’s also an extra tiny bias term on the order of the smallest normalized float16 in the denominator to cover the all-zeros case), and to make the error symmetric in x and y. The latter is not required, but convenient. In our case, x and y are 3-component vectors. The normalization here is somewhat arbitrary; it would be perfectly sensible to throw an extra factor of 2 into the numerator to normalize by the average squared length of x and y rather than the sum of their squared lengths, but since we only ever care about comparing errors with each other, an overall scaling by 2 changes nothing, so we went with the simpler form.
Since this is a BC6H encoder, most commonly one of the two inputs, say x, is held constant (coming from the source image) while we try many candidate choices for y. We do this a lot, and in this case we expect x ≈ y anyway (after all, making the two close is our objective), so we use the asymmetric relative squared error

||x - y||^2 / ||x||^2
instead (again, with a tiny bias term in the denominator to avoid division by zero). This has the huge advantage that since x is fixed, we can compute all the 1/||x||^2 values for a block once up front, and then just have a regular weighted squared error (with a per-pixel weight factor) per pixel. That’s what’s used in our inner loops. It makes “minimize this error”-type optimization problems actually linear, which is extremely handy.
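To make the shape of that inner-loop computation concrete, here’s a minimal sketch (illustrative only; the names and the exact bias value are made up, this is not the actual Oodle Texture code) of precomputing the per-pixel weights once and then scoring candidates with a plain weighted SSE:

// Sketch: precompute per-pixel weights for the asymmetric relative squared
// error, then score each candidate decode of the block as a weighted SSE.
// BIAS is a stand-in; the text above only says it's a tiny term on the
// order of the smallest normalized float16.
#define NPIX 16            // 4x4 block
#define BIAS 6.1e-5f       // illustrative value

static void compute_weights(const float src[NPIX][3], float weight[NPIX])
{
    for (int i = 0; i < NPIX; i++) {
        float len2 = src[i][0]*src[i][0] + src[i][1]*src[i][1] + src[i][2]*src[i][2];
        weight[i] = 1.0f / (len2 + BIAS);   // done once per block; x is fixed
    }
}

static float block_error(const float src[NPIX][3], const float dec[NPIX][3],
                         const float weight[NPIX])
{
    float err = 0.0f;
    for (int i = 0; i < NPIX; i++)
        for (int c = 0; c < 3; c++) {
            float d = src[i][c] - dec[i][c];
            err += weight[i] * d * d;       // regular weighted squared error
        }
    return err;
}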
We did not attempt to do a full formal evaluation of different error metrics. We did, however, try something like 8 or 9 different metrics in the early stages of writing the BC6H encoder (this would’ve been mostly 2019/2020) and had 3 surviving metrics (one an ICtCp-ish variant using the SMPTE 2084 PQ transfer function, one the “natural” albeit problematic semi-log2 encoding implied by BC6H, and this one) that carried through all the way into shipping Oodle Texture, until we ultimately decided to ship with just the one always on because we never found an image that really benefited from anything else. Of the metrics we evaluated, this ended up being computationally one of the cheaper ones and simultaneously the best in terms of visual result quality in informal but careful testing; it was not close. The semi-log2 encoding is the cheapest by far, but its problems with highly saturated bright colors ultimately disqualified it.
This one was good, still fairly simple, and gave us a result quality that (in my opinion anyway) is appreciably better than all other BC6H encoders I’ve seen.
GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and divide it by 255. 8-bit integers are for sure machine numbers (exactly represented) in float32 and so is 255, so if you’re willing to do a “proper” divide, that’s the end of it; both inputs are exact, so the result of the division is the same as the result of the computation in exact arithmetic rounded to the nearest float32 (as per active rounding mode anyway), which is the best we can hope for.
The D3D11.3 functional spec chickened out here a bit (granted, this verbiage was already in the D3D10 version as I recall) and added the disclaimer that
Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.
For what it’s worth, I had reason to test this a while back (as part of my GPU BCn decoding experiments) and at the time anyway, all GPUs I got my hands on for testing seemed to do exact conversions anyway. It turns out that doing the conversion exactly is not particularly expensive in HW (that might be a post for another day) and certainly doesn’t require anything like a “real” divider, which usually does not exist in GPUs; correctly rounded float32 divisions, when requested, are usually done as a lengthy instruction sequence (plus possibly an even lengthier fallback handler in rare cases).
A conversion that is close enough for essentially all practical purposes is to simply multiply by the float32 reciprocal, i.e. 1.f / 255.f, instead. This is not the same as the exact calculation, but considering you’re quantizing data into an uint8 format to begin with, is almost certainly way more precision than you need.
What if, as a matter of principle, we want an exact solution, though?
Option 1: work in double (float64)
Not much to say here. Converting the 8-bit value x to double, multiplying by 1. / 255., then converting the result to float32 happens to give the same results as the exact calc (i.e. dividing x / 255.f). This is trivial to verify by exhaustive testing. It’s just 256 possible inputs.
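That exhaustive test is tiny; a sketch of it (my own test code, not from the original post) might look like this:

#include <stdio.h>

int main(void)
{
    for (int x = 0; x < 256; x++) {
        float exact   = (float)x / 255.f;                   // reference: exact divide
        float via_dbl = (float)((double)x * (1.0 / 255.0)); // option 1
        if (exact != via_dbl)
            printf("mismatch at %d\n", x);                  // per the claim above, never triggers
    }
    return 0;
}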
Option 2: but I don’t want to use doubles either!
Picky, are we? OK, fine, let’s get to the meat of this post. What if we’d like an exact conversion, would prefer to not use a divide, and would also prefer to avoid using doubles?
This is the fun part. First, let’s turn 1/255 into a geometric series:

1/255 = 1/(256 - 1) = (1/256) · 1/(1 - 1/256) = 1/256 + 1/256^2 + 1/256^3 + …

and therefore

x/255 = x/256 + x/256^2 + x/256^3 + …
Going from a single divide to an infinite series might seem like it’s making our problems worse, but it’s progress. The divides by 256^k are easy because those are powers of 2, hence exact (and, incidentally, non-overlapping bit shifts). Of course we can’t sum infinitely many terms, but we also don’t need to. All we need is an approximation that gets close enough to round to the right value. How many series terms is that?
The hardest to round cases turn out to be powers of 2, x = 2^k, k = 0, …, 7. They’re all basically the same, so let’s stick with the easiest example x=1 (k=0). Float32 has 24 effective significand bits (only 23 stored, since a leading 1 is implied). To round correctly, we need 24 correct bits after the leading 1 bit (23 actual bits plus the first truncated bit for rounding)… and then actually another term after that to make sure we don’t accidentally hit an exact halfway case. Our infinite series never terminates, so we need that extra term for non-0 values to make sure the sticky bit gets set pre-normalization to indicate unambiguously that the value ought to round up (and not break the tie towards evens).
Each term contributes 8 bits, and in our x=1 case, normalizing the floating-point number shifts out the first term entirely. In short, we need 1 “sacrificial” term that might get mostly normalized away (in the power-of-2 cases), 3 more terms for the bulk of the mantissa, and then 1 more term after that to avoid the halfway-case problem and ensure our sticky bit gets set correctly come rounding time.
So am I proposing summing 5 terms of a series? That doesn’t sound fast! And, fortunately, no. We don’t need to sum terms one by one; if we’re going to be multiplying anyway, we might as well combine multiple terms into one. Here’s the actual solution I’m thinking of (note this assumes compilation with proper IEEE semantics, not fast math shenanigans):
// Constants here written for posterity but assumed
// to be evaluated at compile time.
// All values exactly representable as float32.
const float k0 = (1.f + 256.f + 65536.f) / 16777216.f;
const float k1 = k0 / 16777216.f;
// suppose x is integer in [0,255], then:
float ref = x / 255.f;
// is exactly the same (in IEEE-754 float32 with RN) as
float tmp = (float)x;
float no_div = (tmp * k0) + (tmp * k1);
That first calculation of tmp * k0 creates 3 back-to-back copies of x. That’s 24 bits of significand which is (just) exactly representable in float32 thanks to the implicit 1 bit. tmp * k1 creates another 3 copies (and thus terms of the series) scaled down by 2^-24, which is likewise exact (it’s the same as the first term except for the exponent). That means we end up with two machine numbers, and adding them together has one final rounding step – the only rounding we’ll do. 6 series terms is more than we needed (and in fact 5 works fine), but the constants are easier to write down with 6 terms instead of 5 and it doesn’t really matter.
Final tally: after the int->float conversion, two multiplies and an add. When FMAs are available, the second multiply and the final add can also be replaced with a FMA, meaning this comes in at two FP ops with FMAs available, and 3 without. If there’s a way to do the exact conversion with less, I don’t know it.
UPDATE: Alexandre Mutel sent in a very slick suggestion on Mastodon based on a completely different approach that boils down to
// Constants evaluated at compile time.
const float k0 = 3.f;
const float k1 = 1.f / (255.f * 3.f);
return ((float)x * k0) * k1;
The realization here simply is that while the float32 reciprocal of 255 isn’t accurate enough to produce correct rounding, the reciprocal of 255*3 is, and multiplying integers in [0,255] by 3 is (easily) exact in float32. In this case I wrote the multiplication by 3 after the int->float conversion, but it can of course also be done on the integer side (which was Alexandre’s original suggestion); whichever is more convenient. This is two multiplications after the int->float conversion, even when no FMAs are available, so I think that’s our new winner.
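For completeness, here’s a brute-force check (again, my own test sketch) comparing both multiply-based variants against the exact divide over all 256 inputs; it needs to be compiled with strict IEEE float semantics, i.e. no fast-math:

#include <stdio.h>

int main(void)
{
    const float k0 = (1.f + 256.f + 65536.f) / 16777216.f;  // series version
    const float k1 = k0 / 16777216.f;
    const float r  = 1.f / (255.f * 3.f);                   // reciprocal-of-765 version

    for (int x = 0; x < 256; x++) {
        float ref    = (float)x / 255.f;        // exact divide
        float tmp    = (float)x;
        float no_div = (tmp * k0) + (tmp * k1); // two multiplies + add (or mul + FMA)
        float mul3   = (tmp * 3.f) * r;         // Alexandre's two-multiply version
        if (no_div != ref || mul3 != ref)
            printf("mismatch at %d\n", x);      // per the claims above, never triggers
    }
    return 0;
}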
That’s right, it’s another texture compression blog post! I’ll keep it short. By “solid-color block”, I mean a 4×4 block of pixels that all have the same color. ASTC has a dedicated encoding for these (“void-extent blocks”), BC7 does not. Therefore we have an 8-bit RGBA input color and want to figure out how to best encode that color with the encoding options we have. (The reason I’m writing this up now is because it came up in a private conversation.)
In BC1/3, hitting a single color exactly is not always possible, and we have the added complication of the decoder being under-specified. BC7 has neither problem. If we look at our options in table 109 of the Khronos data format spec, we see that mode 5 stores color endpoints with 7 bits of precision per channel and alpha endpoints with a full 8 bits, so it’s a very solid candidate for getting our 8-bit colors through unharmed.
The alpha portion is trivial: we can send our alpha value as the first endpoint and just use index 0 for the alpha portion (mode 5 is one of the BC7 modes that code RGB and A separately, similar to BC3), which leaves the colors. Can we use the color interpolation to turn our 7 endpoint bits per channel into an effective 8?
We have 2 color index bits per pixel. Indices 0 and 3 are not useful for us: they return the dequantized endpoint values, and those are 7 bits, so that only gives us 128 out of 256 possible options for each color channel. Index 1 is interpolated between the two at a 21/64 fraction; index 2 is symmetric (exactly symmetric in BC7, unlike the situation in BC1), i.e. it’s the same as using index 1 with the two endpoints swapped, and therefore doesn’t give us any new options over just using index 1. That means we only need to consider the case where all index values are 1: if the value we need in a color channel happens to be one of the 128 values we can represent directly, we just set both endpoints to that value, otherwise we want to select a pair of endpoints so that the value we actually want is between them, using that 21/64 interpolation factor (subject to the BC7 rounding rules).
For BC1, at this point, we would usually build a table where we search for ideal endpoint pairs for every possible target color. For BC7, we can do the same thing, but it turns out we don’t even need a table. Specifically, if we build that table (breaking ties to favor pairs that lie closer together) and look at it for a bit, it becomes quickly clear that we can not only hit each value in [0,255] exactly, but there’s a very simple endpoint pair that works:
// Determine e0, e1 such that (BC7 interpolation formula)
// target == (43*expand7(e0) + 21*expand7(e1) + 32) >> 6
// where expand7(x) = (x << 1) | (x >> 6)
e0 = target >> 1
e1 = ((target < 128) ? (target + 1) : (target - 1)) >> 1
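If you want to double-check that claim yourself, a brute-force verification over all 256 target values (my own sketch, using the interpolation formula from the comment above) looks like this:

#include <stdio.h>

static int expand7(int x)                 // 7-bit endpoint -> 8 bits, as in BC7
{
    return (x << 1) | (x >> 6);
}

static int interp_index1(int e0, int e1)  // BC7 2-bit index 1 interpolation
{
    return (43 * expand7(e0) + 21 * expand7(e1) + 32) >> 6;
}

int main(void)
{
    for (int target = 0; target < 256; target++) {
        int e0 = target >> 1;
        int e1 = ((target < 128) ? (target + 1) : (target - 1)) >> 1;
        if (interp_index1(e0, e1) != target)
            printf("miss at %d\n", target);   // never printed: all 256 values are hit exactly
    }
    return 0;
}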
And that’s it. Do this for each of the R, G, B channels of the input color, and set all the color indices to 1. As noted earlier, the A channel is trivial since we just get to send an 8-bit value there to begin with, so we just send it as one of the endpoints and leave all alpha index bits at 0.
This is exact and the encoding we’ve used in all our shipping BC7 encoders, regular and with rate-distortion optimization both. Often there are many other possible encodings using the other modes, especially for particularly easy cases like all-white or all-black. In our experiments it didn’t particularly matter which encoding is used, so we always use the above. The one thing that does matter is that whatever choice you make should be consistent, since solid-color blocks frequently occur in runs.
The x86 instruction set has a somewhat peculiar set of SIMD integer multiply operations, and Intel’s particular implementation of several of these operations in their headline core designs has certain idiosyncrasies that have been there for literally over 25 years at this point. I don’t actually have any inside information, but it’s fun to speculate, and it gives me an excuse to waffle about multiplier designs, so here we go!
MMX
x86 doesn’t have explicit SIMD integer operations before MMX, which first showed up in – no big surprise – the Pentium MMX. Said Pentium MMX offers exactly three SIMD integer multiply instructions, and all three of them originally took 3 cycles (fully pipelined).
The first and most basic one is PMULLW, “packed multiply low word”, which interprets its two 64-bit MMX register operands as containing four words (which in x86, if you’re not familiar, means 16-bit integers) each. The corresponding lanes in both operands are multiplied and the low 16 bits of the result written to the corresponding lane of the destination. We don’t need to say whether these integers are interpreted as signed or unsigned because for the low 16 bits, it doesn’t matter. In short, it’s a basic element-wise multiply working on 16-bit ints.
The second available integer multiply is PMULHW, “packed multiply high word”. Again, we multiply 16-bit lanes together, which (in general) yields a 32-bit product, and this time, we get the top 16 bits of the result. This time, we need to make up our mind about whether the integers in question are considered signed or unsigned; in this case, it’s signed. A fun fact about “high multiply” type operations (which exist in a lot of instruction sets) is that there’s no practical way (at least, not that I’m aware of) to compute just those high bits. Getting those high bits right generally means computing the full product (in this case, 32 bits per lane) and then throwing away the bottom half. Therefore, a datapath that can support both types of multiplies will usually end up having a full 16×16->32-bit multiplier, compute all product bits, and then throw half of them away in either case.
That brings us to the third and last of the original Pentium MMX’s multiply-type instructions, and the most fun one, which is PMADDWD. I think this originally stands for “packed multiply and add words to doublewords”. That makes it sound like it’s a multiply-add type operation, but really it’s more like a two-element dot product: in pseudocode, PMADDWD computes result.i32[i] = a.i16[i*2+0] * b.i16[i*2+0] + a.i16[i*2+1] * b.i16[i*2+1]. That is, it still does those same four signed 16×16-bit multiplies we’ve been doing for the other two instructions, but this time with a “use the whole pig” attitude where the full 32-bit results are most definitely not tossed out. If we can’t return the whole result in a 16-bit operation, well, just pair even and odd pairs of adjacent lanes together and sum across them. Because when we’re summing across pairs of adjacent lanes, we get 32 bits to return the result in, which is perfect (we don’t need to worry about overflow here because the two constituent products were signed; they can’t get too large).
Now, this description sounds like we’re still finishing computation of 32-bit results for each of the 16-bit lanes, and then doing a separate 32-bit addition after to combine the two. That’s a possible implementation, but not necessary; this is not a post about how multiplications are implemented (some other time, maybe!), but the gist of it is that multiplier hardware already breaks down N-bit by N-bit multiplies into many smaller multiplications (the “partial products”) of an N-bit number by a much smaller digit set. The obvious one would be N-bit by 1-bit products, which leaves just “x*0” and “x*1” products, but in practice other options are much cheaper. The partial products are then summed together in a kind of reduction tree, and again, there’s slicker ways to do it than just throwing down a whole bunch of fast adders, but the details don’t matter here. What does matter is that you can have either of the even/odd 16-bit multipliers do their normal thing until very close to the end, and then do the “cross-over” and final 32-bit summation very late (again with plenty of hardware reuse compared with the 16-bit result paths).
In short, not only does PMADDWD let us use both 32-bit results that we already computed anyway fully, it also doesn’t touch the first 90% of the datapath at all and can be made to share plenty of logic with the regular path for the final 10% too if desired. Which is why it’s fun.
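For reference, the 128-bit descendant of this instruction is exposed in C via the SSE2 intrinsic _mm_madd_epi16; a trivial wrapper with the scalar semantics spelled out as a comment:

#include <emmintrin.h>   // SSE2

// For each 32-bit output lane i:
//   result.i32[i] = a.i16[2*i+0] * b.i16[2*i+0] + a.i16[2*i+1] * b.i16[2*i+1]
static __m128i dot2_i16(__m128i a, __m128i b)
{
    return _mm_madd_epi16(a, b);   // compiles to PMADDWD
}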
SSE
The headline item for SSE was SIMD floating point operations (not my subject today), but it also patched a hole in the original MMX design by adding PMULHUW (packed multiply high unsigned word). This one does the multiply unsigned and gives you the high word result. Once again, this is a minor change to the hardware.
SSE2
This one added 128-bit integer SIMD whereas MMX was just 64-bit. It did so, in its initial implementations, by adding 128-bit registers but still using a 64-bit datapath, issuing instructions over two cycles. Not surprisingly, then, all the SSE2 integer multiply instructions (and in fact the vast majority of SSE/SSE2 instructions in general) can be understood as working on independent 64-bit blocks at a time. (AVX/AVX2 would later do the same thing with 128-bit blocks.)
It does however add the rather awkward-seeming PMULUDQ (packed multiply unsigned doubleword to quadword), which multiplies two pairs of unsigned 32-bit integers (in bits [31:0] and [95:64] of either source) to give two 64-bit results. And it does so with the same latency as our 16-bit multiplies! Is that a much wider multiplier at work?
Turns out, not necessarily! Let’s look at a single 32-bit product a * b, and split a = (a1 * 65536) + a0 and b = (b1 * 65536) + b0. 65536 is of course 2^16 and we’re really just chopping a and b in half. Multiplying that out, we get:
a * b
=((a1 * b1) << 32) + ((a1 * b0 + a0 * b1) << 16) + (a0 * b0)
Wait a second. Those are two 16-bit by 16-bit multiplies (unsigned this time, but we added that capability back with the first SSE) and a PMADDWD-style operation (albeit also on unsigned values) in the middle. We do need four 16×16-bit multiplies total… but we’re producing a 64-bit result, so our result area covers four 16-bit lanes’ worth. So this time we do need a bit more logistics up front to route the individual pieces of a and b to four separate multipliers over our 64-bit result area to line up our individual products, and we also have a somewhat more complicated final output stage (what with the different alignments of the results) and actually need a mode in the multiplier where we run a full 64-bit add, not just 32-bit adds, to produce our results. In short, it does get more complicated, but we’re still getting to build it all around the 16×16-bit multipliers we’ve had since MMX.
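To make the decomposition concrete, here’s a scalar sketch (purely illustrative, not how the hardware is organized) that builds a 32×32->64-bit product out of 16×16->32-bit pieces exactly as written above:

#include <stdint.h>

// 32x32 -> 64-bit unsigned multiply assembled from 16x16 -> 32-bit products,
// mirroring the PMULUDQ decomposition above.
static uint64_t mul32x32_from16(uint32_t a, uint32_t b)
{
    uint32_t a0 = a & 0xffff, a1 = a >> 16;
    uint32_t b0 = b & 0xffff, b1 = b >> 16;

    uint64_t hi  = (uint64_t)(a1 * b1) << 32;        // high-half product
    uint64_t mid = (uint64_t)(a1 * b0) + (a0 * b1);  // the PMADDWD-ish middle terms
    uint64_t lo  = a0 * b0;                          // low-half product

    return hi + (mid << 16) + lo;                    // == (uint64_t)a * b
}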
SSSE3
SSSE3 adds two more multiply operations, both of which are variations on themes we’ve seen so far, but let’s start with the simple one first.
That would be PMULHRSW, Packed Multiply High Rounding Shifting Words (again, not the official name). It’s another signed 16×16 bit multiply. This one computes (a * b + 0x4000) >> 15 on signed values (not the official way it’s specified, but it’s equivalent). This requires a slight tweak to the reduction tree in the multiplier to sum in one extra term somewhere that we can use for the rounding bias. Grabbing different output bits from the result is not a big deal.
The more complicated one is PMADDUBSW, which is like PMADDWD but on pairs of bytes, not words, and to keep things interesting, it’s an unsigned*signed multiply. I think this might have been inspired by AltiVec (or maybe there’s a common ancestor I’m not aware of?) which had this type of operation in its “msum” family of instructions (alongside a PMADDWD equivalent and some other operating modes), but the AltiVec version is nicer because it’s a 4-element dot product on bytes producing a 32-bit result. PMADDUBSW produces a word result which, it turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in the range [-128*255, 127*255] = [-32640, 32385]. Our result is supposed to be a signed word, which means its value range is [-32768,32767]. If the two individual products are both near the negative or both near the positive end of the possible range, the sum overflows. PMADDUBSW decides to saturate the result, i.e. clamp it to lie within [-32768,32767] instead. This is well-defined but frequently quite annoying to work with.
In terms of implementation, I’ve not actually worked out the details here. I will point out that one way of designing multipliers is to use a few levels of a recursive decomposition into smaller multipliers much as I just did with PMULUDQ; chopping the constituent 16-bit multipliers up into byte pieces would presumably work, although at this point, we don’t just have some extra muxing near the beginning or end of the datapath, we’re also starting to add a lot of constraints on the rest of the internal implementation (if we’re trying to share as much hardware as possible, that is). We’ve just about pushed this as far as we can go.
SSE4.1
SSE4.1 adds PMULDQ which is PMULUDQ, but signed. The same basic approach as PMULUDQ should work, so I’ll not get into it.
It also added the at that point long-awaited PMULLD, doubleword-by-doubleword low multiplies. To date, we have not gotten any high multiplies for them, not in SSE4.1 nor in any subsequent extension, and it seems unlikely at this point.
Curiously, with PMULLD, something’s different: these have half the throughput and twice the latency of all the other multiply-type operations on Intel’s big core designs, and take two uops whereas all the other multiply-type operations mentioned so far take one.
Once again, I think the divide-and-conquer approach described for PMULUDQ above explains both. Looking at the product terms:
a * b
=((a1 * b1) << 32) + ((a1 * b0 + a0 * b1) << 16) + (a0 * b0)
We don’t care about the high product term for a1 * b1 since we’re only returning the low 32 bits anyway. But we do need the other three product terms, and per each 32-bit result lane, we only have two 16×16 multipliers to work with. My best guess is that the first uop is a PMADDWD-like affair that computes the (a1 * b0 + a0 * b1) portion and stashes the low 16 bits of the result, and the second uop gets issued immediately after and computes a regular a0 * b0 32-bit product in the low 16-bit lane then adds it together with the stashed product (shifted by 16, but that’s just wiring) – this is again a fairly minor variation of the logic that is used for PMADDWD. In short, I think a possible implementation of PMULLD on top of the 16-bit-multiplier-centric datapath that Intel seems to have been bolting more and more features onto for the past 25 years is using 2 uops that are slight variations of the PMADDWD flow, and it would be consistent with the somewhat odd characteristics of the instruction on Intel’s big cores.
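In scalar terms, the arithmetic that hypothetical two-uop split would perform looks something like this (purely illustrative; it’s a guess about the dataflow, not a claim about the actual hardware):

#include <stdint.h>

// Low 32 bits of a 32x32 product from three 16x16 multiplies, matching the
// speculated two-step PMULLD flow.
static uint32_t mullo32_from16(uint32_t a, uint32_t b)
{
    uint32_t a0 = a & 0xffff, a1 = a >> 16;
    uint32_t b0 = b & 0xffff, b1 = b >> 16;

    uint32_t mid = a1 * b0 + a0 * b1;   // "uop 1": PMADDWD-like, only its low 16 bits survive
    uint32_t lo  = a0 * b0;             // "uop 2": regular 32-bit low product...
    return lo + (mid << 16);            // ...added to the stashed middle term, == a * b (mod 2^32)
}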
Is any of this definitive?
Oh hell no it’s not. But it would make sense given how the whole thing seems to have developed, and it fits available data.
Would I design a new unit this way now if I had to support all these instructions? Probably not. The original MMX/SSE-era design doesn’t need to do a lot of operations and is pretty sweet, but the doubleword->quadword multiplies in SSE2 started to muddy the waters, SSSE3 with PMADDUBSW (which although useful is very odd) made the 16-bit slices have to support some pretty odd things that really break out of a conventional multiplier dataflow (like the internal saturation), and as of SSE4.1 with PMULLD, honestly, instead of teaching this old dog the umpteenth new trick, maybe just throw some actual 32×32->32 multipliers in there as well instead of adding complications. That seems to be what AMD has done, as well as Intel’s E-cores. But it’s still fun to speculate!
What about AVX-512 VPMULLQ, IFMA or VNNI?
The original VNNI is a pretty straightforward generalization of the PMADDWD/PMADDUBSW designs to include a final accumulation too, much like the aforementioned vmsum family of instructions in PowerPC. I have a suspicion this might be due to a combination of a) x86 SIMD datapaths post-AVX2 (which added FMAs) having native support for 3-operand instructions and b) quite possibly some AltiVec-era instruction patents having expired in the meantime. For a), the extra addition is not the hard part (as mentioned above, the extra term is very natural to sneak into any multiplier datapath), the actual cost is all in the extra wiring and bypassing to have a 3-operand datapath. But once it’s a sunk cost because you want float FMAs anyway, might as well use it! For b), I have no idea whether this is the case or not, it’s just funny to me that AltiVec had these integer dot product instructions from the start while x86 took forever to add them (after people used PMADDUBSW with a follow-up PMADDWD by an all-1’s vector literally just to sum the pairs of words in a 32-bit lane together for something like a decade).
IFMA is a completely different story because even though it’s an integer multiply, it’s very clearly designed to use the double-precision multiplier in the floating-point datapath. Completely different multiplier with a different set of fun design constraints!
VPMULLQ, I have nothing to say about. Literally haven’t ever looked into, or tried to figure out, how it’s implemented. Might do so at some point, but not today!
And I think that’s all of them.
This one originally came up for me in Oodle Texture’s BC7 decoder. In the BC7 format, each pixel within a 4×4 block can choose from a limited set of between 4 and 16 colors (ignoring some caveats like the dual-index modes that don’t matter here) and consequently between 2 and 4 bits per pixel are used to store a color index. Some pixels are “anchors” which (for reasons outside the scope of this post) are forced to have the MSB of their color index be 0. Because these index bits are always 0, they aren’t actually transmitted in the compressed encoding. The end result makes for some rather hairy-looking index arithmetic in the spec a few paragraphs above table 114.
Needless to say, it’s inconvenient working in this form. Removing those always-0 index bits is among the last things a BC7 encoder usually does, and conversely, re-inserting them is among the first things a decoder wants to do, because there’s only ever between 1 and 3 anchor bits and having each index be the same bit width is definitely easier in the rest of the code.
Inserting a 0 bit in the middle of a value is not hard to do: we can split the original value into the bits below and at or above the target bit position, shift the top bits left by 1 more unit to make space, then reassemble everything together:
uint64 insert_zero_bit(uint64 value, int pos) {
uint64 bottom_mask = (1u64 << pos) - 1;
uint64 top_mask = ~bottom_mask;
uint64 bottom_bits = value & bottom_mask;
uint64 top_bits = value & top_mask;
return bottom_bits | (top_bits << 1);
}
This works fine, there’s nothing wrong with it, it’s not a bottleneck or anything, but it bothered me just how much of a production it was for what seemed like a simple operation. Some tinkering later, I found a slicker solution:
uint64 insert_zero_bit(uint64 value, int pos) {
uint64 top_mask = ~0u64 << pos;
return value + (value & top_mask);
}
The first part creates top_mask directly. This version doesn’t need or use bottom_mask, so creating top_mask from it is not a good way to do things here. In fact, even though creating a mask for the bottom N bits the way I did in the first code fragment is the more idiomatic way, creating the opposite mask that selects just the high bits is actually often cheaper, as this example shows: all you do is start with an all-1 mask (which is just a -1 constant in two’s complement) and shift it left. That’s not the point of this post, but I guess it counts as a bonus trick.
The actual point of this post is the second line. Adding a value to itself just gives two times that value, which is the same as left-shifting by 1; but in this case, we’re adding a copy of value that has its low bits masked off. The end result is that we add 0 to those low bits, i.e. they stay the same. At or above bit number pos, we do add the remaining bits of the value – which has the end consequence of shifting just those bits left and leaving the rest as it was. It only works for inserting exactly 1 bit, but it’s cute. (In the BC7 case with sometimes 2 or 3 anchors, we can just do it multiple times.)
We can also reverse this and remove a single bit in the middle when we know its value is 0:
uint64 remove_zero_bit(uint64 value, int pos) {
uint64 top_mask = ~0u64 << pos;
return value - ((value & top_mask) >> 1);
}
This version may look a bit funky because we build top_mask from pos. Shouldn’t we use pos + 1, or set top_mask = ~1u64 << pos, or something like that, since we start out with the extra zero bit there? But precisely because we already assume that bit is 0, it turns out not to matter. (Exercise to the reader.) Either way, this is not quite as nice as the insertion variant because of the extra shift.
Alternatively, if you don’t need the value aligned at the bottom of the uint (or are fine with shifting after), you can also use the dual of the bit insertion and compute value + (value & bottom_mask), which gives you the bit-removed value shifted left by 1 (the gap ends up at the bottom).
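As a quick sanity check, here’s a small self-contained test (my own sketch, with standard-C types spelled out) that the two insertion variants agree and that insertion and removal round-trip:

#include <stdint.h>
#include <stdio.h>

static uint64_t insert_zero_bit_v1(uint64_t value, int pos)
{
    uint64_t bottom_mask = (1ull << pos) - 1;
    return (value & bottom_mask) | ((value & ~bottom_mask) << 1);
}

static uint64_t insert_zero_bit_v2(uint64_t value, int pos)
{
    uint64_t top_mask = ~0ull << pos;
    return value + (value & top_mask);
}

static uint64_t remove_zero_bit(uint64_t value, int pos)
{
    uint64_t top_mask = ~0ull << pos;
    return value - ((value & top_mask) >> 1);
}

int main(void)
{
    // Exhaustive over 16-bit values and all positions within them.
    for (int pos = 0; pos < 16; pos++)
        for (uint64_t v = 0; v < 0x10000; v++) {
            uint64_t a = insert_zero_bit_v1(v, pos);
            uint64_t b = insert_zero_bit_v2(v, pos);
            if (a != b || remove_zero_bit(a, pos) != v)
                printf("mismatch v=%llx pos=%d\n", (unsigned long long)v, pos);
        }
    return 0;
}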
Anyway. In the BC7 case it really didn’t matter, it just bothered me. But it’s cute regardless and I’ve found other uses for it since (that would have taken even more of a preamble to introduce).
A while back I had to deal with a bit-packed format that contained a list of integer values encoded in one of a pre-defined set of bit widths, where both the allowed bit widths and the signed-ness were denoted in a header elsewhere. These values never got particularly long (the largest bit width I needed to support was around 14 bits I think?) but having a mixture of different bit widths and signedness-es (is that a word?) was annoying to deal with, particularly the latter. Signed values were stored in the usual two’s complement format.
In particular, sign-extending narrow types is surprisingly awkward. Most people’s go-to solution seems to be shifting the value up in a wider integer type to move the narrow value’s sign bit into the containing type’s sign bit, then using an arithmetic shift to repeat the sign bit on the way down. That’s a mouthful. I’m talking about code like this:
int32 sign_extend(int32 val_11b) {
int32 t = val_11b << (32 - 11);
return t >> (32 - 11);
}
In C/C++, this explicitly relies on shifting something into the sign bit, which depending on the exact flavor of language standard you’re using is either not allowed or at best fairly recently (voted into C++20) legal. It also explicitly needs to work with a fixed-size integer representation to know what to shift by. That’s not a problem, just not aesthetically pleasing.
There’s a different way to sign-extend that works directly from the definition of two’s complement and doesn’t use any bit-shifting at all, though: all that two’s complement really does is change the place value of the most significant bit. Take our example 11-bit integer, either signed or unsigned. No matter the type, bits 0 through 9 inclusive have place values of 1, 2, 4, .., 512. In an unsigned 11-bit integer, bit 10 has place value 1024; in a signed 11-bit integer, its place value is -1024 instead.
So if we have an unsigned 11-bit value in some integer variable and want to convert it to the corresponding signed 11-bit value (assuming the original value was in two’s complement form), we can do so without any shifting at all. Bits 0-9 retain their interpretation, but the value of bit 10 needs to have its sign flipped to turn unsigned into signed:
int sign_extend(int val) {
return (val & 0x3ff) - (val & 0x400);
}
We pass through bits 0 through 9, isolate bit 10 using val & 0x400 (which is either 0 if bit 10 wasn’t set or 0x400=1024 if it was), and subtract the result. This form doesn’t make any assumptions about the width of the used integer type or do anything undefined or implementation-specific, as long as the narrower values are small enough to never overflow the containing int type anyway. It doesn’t even assume the compilation target uses two’s complement! (Not that that’s a big ask these days.)
A minor variant is to skip the first mask entirely and think purely in terms of the correction we need to do. In our 11-bit example, if bit 10 was set, it contributes 1024 to the value when it should have contributed -1024, which is off by 2048 (twice the place value of that bit). This leads to a version with a single bit mask:
int sign_extend(int val) {
return val - (val & 0x400) * 2;
}
Going back to my original problem, I had a mixture of different bit widths and needed to support either signed or unsigned. This can all be done by having two decoding variants, one for signed and one for unsigned, but with the formulation above, it’s really easy to have the same code handle both:
int zero_or_sign_extend(int val, int sign_bit) {
return val - (val & sign_bit) * 2;
}
We read everything as unsigned initially. When they’re meant to be that way, we pass 0 for sign_bit. For two’s complement signed values, we pass 1 << (bit_width - 1). In my case this was determined at header decode time and stored along the bit widths. The code that actually interpreted the values used the above and didn’t need to do any explicit tests for whether values were meant to be signed or unsigned anywhere.
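As a usage sketch (hypothetical names; read_bits stands in for whatever bit reader is in use, and this is not the actual decoder code):

// Sketch: decode `count` values of a given bit width, signed or unsigned,
// without branching on signedness in the inner loop.
extern int read_bits(int nbits);   // stand-in for the real bit reader

static void decode_values(int *values, int count, int bit_width, int is_signed)
{
    int sign_bit = is_signed ? (1 << (bit_width - 1)) : 0;  // decided once, at header time
    for (int i = 0; i < count; i++) {
        int raw = read_bits(bit_width);          // always read as unsigned
        values[i] = raw - (raw & sign_bit) * 2;  // zero- or sign-extend as appropriate
    }
}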
It’s not earth-shattering, but it sure is neat!
UPDATE: even nicer formulation suggested by Harold Aptroot:
int zero_or_sign_extend(int val, int sign_bit) {
return (val ^ sign_bit) - sign_bit;
}
This one is most easily understood by looking at it case by case. If sign_bit is 0 (unsigned case), both operations do nothing. Otherwise, it gives the mask for an actual sign bit. If that sign bit was originally clear (positive value), the initial XOR effectively adds the value of sign_bit, and then we immediately subtract it again, for a net contribution of 0. If the sign bit started out set, the XOR clears it (effectively subtracting sign_bit from val), and then we explicitly subtract it a second time, exactly like we needed to change the place value of that bit from +sign_bit to -sign_bit.
This is a result I must have re-derived at least 4 times by now in various ways, but this time I’m writing it down so I just have a link next time. All right. If you’re encoding a BCn or ASTC block and are trying to find optimal endpoints (in a least-squares sense) for a given set of weights, you are led to the linear least-squares problem

min_X ||A X - B||^2
where A is an n-by-2 matrix of weights, X is a 2-by-c matrix of endpoints we’re solving for where c is the number of color channels (most commonly 3 or 4), B is an n-by-c matrix of desired pixel values, and A looks like

A = ( 1 - w_1   w_1 )
    ( 1 - w_2   w_2 )
    (   ...     ... )
    ( 1 - w_n   w_n )
Solving this leads us to the Normal Equations

(A^T A) X = A^T B
which is ultimately a regular linear system with multiple right-hand sides. A^T A is just a 2×2 matrix, so solving this is very easy (Cramer’s rule is the most popular approach). The only problem being, of course, that this matrix can end up being singular. There’s one obvious case where A^T A ends up singular: when all weights are the same (making the two columns of A linearly dependent). The question is, are there any other cases where we end up with a singular system?
In short: no. Write out the matrix (it’s symmetric, so the two off-diagonal entries agree):

A^T A = ( sum_i (1 - w_i)^2      sum_i (1 - w_i) w_i )
        ( sum_i (1 - w_i) w_i    sum_i w_i^2         )

Its determinant is then

det(A^T A) = (sum_i (1 - w_i)^2) (sum_j w_j^2) - (sum_i (1 - w_i) w_i)^2
which per Lagrange’s Identity equals

sum_{i<j} ((1 - w_i) w_j - (1 - w_j) w_i)^2
That’s a sum of squares, therefore non-negative terms, and the determinant can only be 0 if all the summands are 0. However, we can simplify

(1 - w_i) w_j - (1 - w_j) w_i = w_j - w_i w_j - w_i + w_i w_j = w_j - w_i
and therefore the determinant is 0 (and our system singular) if and only if all the weights are the same, since we’re summing over all possible pairs of weights and a single pair with a non-zero difference is sufficient to make our sum positive.
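For concreteness, here’s a sketch of the corresponding solve (assuming rows of A are (1 - w_i, w_i) as above; plain C, without the numerical care or fixed-point details a production encoder would add):

// Solve (A^T A) X = A^T B per channel via Cramer's rule.
// w[i] is the weight for pixel i, pix[i*channels + c] the desired value.
static void solve_endpoints(const float *w, const float *pix, int n, int channels,
                            float *ep0, float *ep1)
{
    float ata00 = 0.0f, ata01 = 0.0f, ata11 = 0.0f;
    for (int i = 0; i < n; i++) {
        float wi = w[i], vi = 1.0f - wi;
        ata00 += vi * vi;
        ata01 += vi * wi;
        ata11 += wi * wi;
    }
    float det = ata00 * ata11 - ata01 * ata01;   // 0 exactly when all weights are equal

    for (int c = 0; c < channels; c++) {
        float atb0 = 0.0f, atb1 = 0.0f;
        for (int i = 0; i < n; i++) {
            float v = pix[i * channels + c];
            atb0 += (1.0f - w[i]) * v;
            atb1 += w[i] * v;
        }
        if (det != 0.0f) {
            ep0[c] = (ata11 * atb0 - ata01 * atb1) / det;
            ep1[c] = (ata00 * atb1 - ata01 * atb0) / det;
        } else {
            // Singular system: all weights equal, so any endpoint pair with the
            // right interpolated value works; just use the mean for both.
            float mean = (atb0 + atb1) / (float)n;
            ep0[c] = ep1[c] = mean;
        }
    }
}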
Hi. I’m Fabian “ryg” Giesen, one of the co-authors of Oodle, originally made and sold by RAD Game Tools, now (after an acquisition in late 2020) officially Epic Games Tools as per the business license. Everyone (even at Epic) still mostly refers to us as RAD though; it’s been a long-standing name and nobody sees much point in a rebranding campaign. There’s several misconceptions about Oodle etc. floating around. There’s a lot more technical details on the RAD home page and in the docs (if you’re a licensee) but this is a shorter version that tries to give the executive summary of what is what in a few paragraphs each.
What is Oodle?
Oodle is a suite of data compression-related libraries, with three main ones: Oodle Data, which does lossless data compression – think ZIP/Deflate, 7zip/LZMA, Zstd, LZ4, that kind of thing. It was originally designed for storing game assets on disk, but it’s lossless and can be used for whatever data you want. The algorithms in Oodle Data generally emphasize fast decompression. For example, Kraken decodes around 4x faster than Deflate/ZIP for typical data.
Then there’s Oodle Network, which focuses especially on UDP network packets. These are often much smaller (frequently around 100 bytes or less). Oodle Network is also lossless, but is optimized for small, independent amounts of data. It’s slower to decode but has essentially no per-packet overhead.
Finally, we have Oodle Texture, which takes images and encodes them to the “BCn” family of GPU compressed texture formats. Oodle Texture provides high-quality compressors for these formats and supports RDO (Rate-Distortion Optimization), which encodes textures in a different way that takes the same amount of space in GPU memory but will be much smaller on disk after applying a standard lossless compressor like Deflate, Zstd or, of course, Oodle Data.
Kraken
Oodle Kraken, or just Kraken when the context is clear, is one of the algorithms supported by Oodle Data. It’s a good jack-of-all-trades that usually offers much better compression than say Deflate/ZIP and slightly better compression than Zstd, while also being very fast to decode.
Kraken (in our usual software implementation as part of Oodle Data) turned out to be quite popular among PlayStation 4 games and as a result was licensed by Sony for use in the PlayStation 5, which implements a decoder in hardware, making Kraken decompression on PS5 essentially “free”. This decoder was developed by Sony and AMD.
Sometimes, PS5 game package sizes are compared with the sizes of the corresponding game packages on PS4. While Kraken is likely part of any observed size differences, the typical difference between Kraken and a much older format like Deflate/ZIP is around 10-15% or so. In the lossless compression world, even a one-percentage-point difference is considered a big deal, but we’re not talking huge differences here. Some PS4 games are much smaller on PS5, and although the details depend on the individual game, in general, when that happens, it’s almost certainly not primarily due to Kraken. PS4 games are designed to load from a fairly slow hard drive and often duplicate data on disk multiple times to avoid long seek times. PS5 game packages, on the other hand, are always stored on an SSD where seeks are not a major concern. The PS5 packaging tools automatically find and eliminate large duplicated chunks of data, so when PS5 versions of PS4 games are much smaller, this is likely to be the single biggest contributor. All credit here is due to the people at Sony who design and maintain these packaging tools.
Mermaid, Selkie, Leviathan
Oodle Data provides algorithms other than Kraken. Mermaid is faster to decode than Kraken but provides lower compression. Selkie is even faster and even lower compression. In the other direction, Leviathan takes longer to decode than Kraken but provides higher compression.
All of these algorithms provide different trade-offs. Game developers can and do choose between them based on their CPU and disk space budget. The latter is often still subject to certain “magic numbers” for games that get physical releases on Blu-Ray discs or similar, where say a 99GB game fits on a 100GB Blu-Ray disc while a 105GB game does not. So even though we’re usually talking about differences of a few percentage points here, sometimes those few percent really do make a big difference.
Oodle Texture
Oodle Texture is, first and foremost, an encoder for BC1-BC7 format textures (collectively referred to as BCn). These texture formats are always lossy. BCn textures are designed for random access on the GPU and chop up images into small, fixed-size blocks. BCn textures are already compressed, but usually get stored on disk with an extra layer of lossless compression applied. Intuitively, think of it like this: because BCn blocks are fixed-size, “easy” regions of an image are sent with more bits than they need, which is the price we pay for fast random access. We’re OK with this once textures are in memory, but on disk it’s just a pointless waste of space, and the extra layer of compression fixes it.
Oodle Texture aims to provide very high-quality encoding while still being relatively fast (to keep artist iteration times manageable). Its main feature is RDO. Oodle Texture RDO makes the BCn encoder aware of that second layer of compression happening, and explicitly tries to make the “easy” blocks more compressible. This does introduce some extra error but can reduce on-disk and download sizes of textures considerably, often by more than 2x in shipping games. It does not make a difference either way in VRAM usage, since the data in VRAM is in the fixed-bit-rate BCn formats.
Oodle Texture RDO results are still just regular BCn textures – they just happen to compress much better when passed into most lossless compressors. There are no extra decode steps required at runtime.
We try quite hard to provide really good BCn encoders. The goal is that at typical settings, Oodle Texture RDO results have similar error to most other (non-RDO) encoders. BCn are lossy formats, so some amount of error is in general unavoidable. But our goal is to provide the same level of fidelity that you’d get without Oodle Texture, just with smaller disk and download footprint.
It’s been a while! Last time, I went over how the 3-stream Huffman decoders in Oodle Data work. The 3-stream layout is what we originally went with. It gives near-ideal performance on the last game console generation’s AMD Jaguar cores and is also a good fit for 32-bit x86 and ARM targets, both of which were still a pretty important target for us at the time the scheme was designed (today, pretty much anything new uses 64-bit code, so that’s what we’re designing for; sub-optimal performance on 32b targets is not a reason for us to keep features out of the bitstream anymore).
As per the analysis in the last part, the 3-stream decoders are getting decent issue slot utilization out of a Skylake-era x86 core, but are still fundamentally limited by the critical path latency for individual symbol decodes and the bit buffer refill. We’re also using a lot of extra instructions compared to the minimum necessary to shave a few cycles off that critical path latency. Instead, what we can do is optimize for instruction/issue slot count and throughput at the expense of a slight increase in latency, and then use a lot more streams. Ideally, if we have enough independent work to keep the machine filled, the critical path latency stops being the primary limiter altogether.
That said, we still have those 32-bit and Jaguar targets that prefer the 3-stream layout and could maybe use 4 streams without introducing a lot of extra spills, but certainly not a lot more. The compromise we use in Oodle Data is that we introduced a 6-stream layout that uses 3 streams for the front half of the data being coded, and an independent 3-stream group for the back half. In effect, we’re doing two 3-stream decodes (with two separate output buffers) at the same time. The platforms that would rather have 3 streams decode the two in sequence instead.
Many-stream logistics
The number 6 was chosen because it cleanly divides into two sets of 3 (for the platforms that prefer 3), but it also happened to work out in other ways: it turns out both 4 and 8 streams are a bit awkward in terms of register use on x86-64 and 32-bit ARM, and would introduce several extra spills. Probably wouldn’t be a huge deal, but the odd counts are there for a reason.
Anyway, register use is the watchword in general here: the scheme described in the 3-stream article uses 3 registers worth of state per stream: the actual bit buffer, the “number of bits present” count, and the stream read pointer. On x86-64, we have 16 integer GPRs, sort of. The stack pointer is really spoken for, though, which leaves us with 15. 6 streams at 3 registers of state per stream would need 18 registers’ worth just for the bit stream state; the stream read pointer is only used during refill so spilling those during the core decoding loop isn’t the end of the world, but even so, that’s 12 registers allocated, and we still need an output pointer (actually two in our case), a pointer to the decoding table, and at least one or two registers worth of temporary space… and that’s more registers than we have available already. Spilling and reloading around the bit buffer refill isn’t ideal, but workable. When just cycling between our 6 streams means some amount of register thrashing, though, we’re paying an extra tax not just per every 5 symbols decoded, but for every single symbol. The idea with using more streams was to be able to use fewer instructions per decode to use the machine more efficiently. If we’re saving one instruction’s worth of work per symbol and then adding two instructions’ worth of spill and reload, we’re not doing great. This can still be worth it if we get better instruction-level parallelism (ILP) potential out of it, but we’d rather not.
Therefore, our goal is to be efficient both in terms of instructions executed and in terms of register state we need to keep around. This old post of mine conveniently presents a solution, in the form of bit reader “variant 5”, the throughput-optimized one. The actual scheme is explained in that post, so I won’t reiterate it here. Just know that the basic idea is to get rid of the “bits consumed” counter per stream by setting the MSB of the 64-bit bit buffer upon refill (then shifting the whole word down if parts of the first byte were already consumed earlier), and determining the consumed bit count using a leading zero count. Conveniently, shifting out just-consumed low bits using an unsigned bit-shift right also updates the consumed bit count for us. Therefore, we save both the extra register for the counter and the extra instructions for maintaining it. The trade-off is that this version of bit-buffer IO doesn’t have the “lookahead” property where the refill load can start before the final consumed bit count of the previous iteration is done. Therefore, we pick up one more load latency along the critical path.
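In C, that bit reader looks roughly like this (a sketch of the idea, assuming a little-endian host and a forward byte stream; the backwards streams differ, bounds checks are omitted, and this is not the actual Oodle code):

#include <stdint.h>
#include <string.h>

// "Variant 5" bit reader sketch: no separate bit-count register. A marker 1
// bit is OR'd in at the MSB on refill; a leading-zero count later tells us
// how many bits were consumed since then. At least 56 valid bits are
// available after each refill.
typedef struct {
    uint64_t bits;        // unconsumed bits at the bottom, marker bit above them
    const uint8_t *ptr;   // points at the byte containing the current bit position
} BitReader5;

static uint64_t load64le(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, 8);     // little-endian host assumed to keep this short
    return v;
}

static void refill5(BitReader5 *br)
{
    int consumed = __builtin_clzll(br->bits);  // bits consumed since last refill (gcc/clang builtin; marker keeps bits nonzero)
    br->ptr += consumed >> 3;                  // advance by whole bytes
    // Reload, set the marker at the MSB, then shift out the already-consumed bits of the first byte.
    br->bits = (load64le(br->ptr) | (1ull << 63)) >> (consumed & 7);
}

static void init5(BitReader5 *br, const uint8_t *p)
{
    br->ptr = p;
    br->bits = 1ull << 63;   // "empty": marker only, zero bits consumed
    refill5(br);
}

static uint32_t peek5(const BitReader5 *br, int n)   // n <= 56
{
    return (uint32_t)(br->bits & ((1ull << n) - 1));
}

static void consume5(BitReader5 *br, int n)
{
    br->bits >>= n;   // removes the bits; the marker moving down tracks the count
}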
Enough build-up: what does a single-symbol decode look like in this form? Assuming BMI2 is available, it’s this:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
movzx ecx, word [rTablePtr + rcx*2] ; table fetch
shrx rBits, rBits, rcx ; bit buffer update
mov [rDest + <index>], ch ; emit byte
Four instructions and 5 uops (4 fused-domain; the store is 2 ops, one store address and one store data which counts as a single uop in the fused domain, the others are 1 uop each) per symbol decoded, which is pretty sweet. Note we splurge on an extra register to keep the constant ~2047 just so we can do the masking with ANDN, which has a non-destructive 3-register form, instead of MOV/AND. The extra MOV doesn’t matter that much when MOV elimination is active, but due to CPU bugs this was actually later disabled in several CPU models, so this is more consistent. Also note both our destination pointers need to be kept within the low 8 registers so the move-from-CH instruction encoding is available, or else it takes an extra shift instruction and uop. (That’s weird quirks of the x64 encoding for you.) The scheme from the previous article takes 5 instructions and 6 uops per symbol decoded (if we use the same ANDN trick, one more of each otherwise), so this change saves us one instruction and one register’s worth of state per stream, with the latter saving further instructions that would otherwise go into spills and reloads if we used 6 streams naively.
The critical path for this is simple: the first three instructions are on it (for any given stream), the last one is not. The AND and shift have 1 cycle of latency each, the indexed load has 5 cycles (on Skylake), for 7 cycles total. Continuing with last article’s assumption of around 4 fused-domain uop issue slots per cycle (the reality is a bit more complicated, but this is good enough), over these 7 cycles, we get 4×7 = 28 issue slots, of which we’re filling 24 when we’re doing 6 streams. When we’re doing this 5 times in a row, that means 5×4 = 20 unused issue slots by the core decode; we do have other things in the loop such as constant loads, pointer increments, the loop condition, the stream overlap checks, and maybe some spills and reloads around the refill stage, so these remaining slots don’t necessarily have to go to waste either. We don’t get to fill every issue slot every cycle anyway because the scheduling and load balancing in the CPU is not perfect, but this is still a promising setup.
Refill
How do we do the bit buffer refill with this? Well, this again follows the basic logic from “Reading bits in far too many ways (part 3)”. If we implement that naively, we get something like this (per stream):
lzcnt rBits, rBits ; # bits consumed
mov rcx, rBits
shr rcx, 3 ; # of full bytes consumed
add rBitsPtr, rcx ; advance ptr
and rBits, 7 ; leftover bit count
mov rcx, [rBitsPtr] ; load next
or rcx, rMarker ; set marker bit (MSB)
shrx rBits, rcx, rBits ; consume leftover bits
This is already not bad, but note that most of this is on the critical path for the refill as written. The most annoying part of that is the lzcnt which has a 3-cycle latency on most Intel parts. However, note that the rBits value we do the lzcnt on is always the shrx result of the final decode from that stream, and nothing else looks at the contents of the bit buffer after said shift. Therefore, we can juggle things around to do the leading zero count early. The final decode for each of the 6 streams does:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
lzcnt rBits, rBits ; early lzcnt
movzx ecx, word [rTablePtr + rcx*2] ; table fetch
add rBits, rcx ; lzcnt update
mov [rDest + <index>], ch ; emit byte
Instead of a final shrx, we use an add instruction instead (making the bits past the initial 8 bits garbage, but we can deal with that), and we move the lzcnt up so it happens concurrent with the table load (which has 5 cycles latency and is on the critical path). That takes the leading zero count off the critical path entirely and doesn’t need us to execute any extra instructions. We do need to clear the garbage bits that end up at bit 8 and above in rBits afterwards. However, if we look at the refill code given above, that’s a trivial fix (funny how that works out):
; # bits consumed already in low 8 bits of rBits
movzx rcx, low_byte(rBits)
shr rcx, 3 ; # of full bytes consumed
add rBitsPtr, rcx ; advance ptr
and rBits, 7 ; leftover bit count
mov rcx, [rBitsPtr] ; load next
or rcx, rMarker ; set marker bit (MSB)
shrx rBits, rcx, rBits ; consume leftover bits
The code still has some other details to take care of. Because we treat 6-stream blocks as a pair of 3-stream blocks, we still have one backwards, reverse-endian stream in there, we need to maintain 2 separate output pointers, we still need to do the stream-crossing checks and have our loop condition, and so forth. All of that is routine and I won’t go into it here, other than noting that the reverse-endian streams require a 64-bit bswap for their refill, which adds 2 uops and 2 cycle latency to their respective critical paths (on Intel), so those streams have the longest critical-path latency.
Analysis and results
I already went into the core symbol decode step before, noting that it takes 4 instructions, 4 fused-domain uops, and 5 unfused-domain uops per symbol. For the rest of our critical path candidate, that leaves the refill.
With the tweaked final decode described above, the leading zero count is off the critical path, but we still count it as 1 uop that effectively belongs with the refill (it also changes one shrx to an add, but that doesn’t make a difference at the level of analysis I’m doing in this series, although I will note that this is also slightly beneficial since there are more execution ports that can take integer adds than there are ports that execute shifts).
For the refill, that leaves us with 8 instructions executed (including the lzcnt), all single-uop. The critical path for the refill (after the final rBits update for a stream) goes movzx, shr rcx, add rBitsPtr, mov [rBitsPtr], or rBits, shrx rBits. All of these except for the load are 1 cycle latency, the load is 4 cycles on Skylake and derivatives, 5 on newer uArchs. That’s 9 or 10 cycles critical path latency through the refill, respectively, 11/12 cycles on the reversed streams that need the bswap.
We’re doing this for 6 streams; 8 uops × 6 streams = 48 uops and issue slots – plus another 4 uops for the two bswaps implied by the two backwards streams, implying a grand total of 52 uops. By our previous “about 4 unfused domain uop issue slots per cycle” rule of thumb, that’s around 52/4 = 13 cycles worth of work – we’re actually throughput limited on this portion, for once. And since we’re running this in a loop, we can expect the throughput-limited refill and “other loop overhead” portions to collapse with the latency-limited symbol decode step to some extent.
In the “production” version of the loop, it turns out there are 15-20 fused-domain uops worth of other work (that’s not symbol decoding or refill) in this loop. I’m giving a range because 5 of these are compare-branch instruction pairs that can be macro-fused, but whether that actually happens in any given instance is again a bit complicated and depends on other factors, such as code alignment, that are outside the scope of the level of analysis I’m doing here.
If we take the higher count of 20, this happens to line up precisely with the 20 “wasted” issue slots we noticed during the symbol decode portion. I did not notice this before writing this post and highly doubt that things shake out anywhere near this neatly in practice, because machine behavior gets fairly “twitchy” in the high-ILP regime and any stray disturbance such as non-ideal scheduler instruction picks or individual L1 cache misses causes long ripple effects. Nevertheless, it does suggest that on Skylake and its many derivatives, we expect to be, just about, throughput-bound.
This time we’re decoding from 6 streams, 5 symbols each, for 30 symbols decoded per loop iteration. We expect to take 4 issue slots per symbol decode (120 slots total), 8 issue slots per refill (48 slots total), and 20 issue slots worth of bookkeeping etc. per loop iteration, for a total of 188 issue slots. With our 4 issue slots per cycle rule of thumb, if everything works out perfectly, that comes out to 47 cycles per iteration exactly.
For critical path latency, we have 6 independent streams, and the slowest ones are the two byte-reversed ones. Their critical path goes through the symbol decode and refill. We have 5 decodes of 7 cycles each plus a refill of 11 cycles, giving us a latency estimate of about 46 cycles if everything works out right.
In short, under very idealized circumstances, we expect to be throughput-bound (or close to), and hitting around 47/30 = 1.57 cycles/symbol decoded.
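As a quick sanity check on that arithmetic, the whole estimate fits in a few lines of Python; the constants are just the counts from the preceding paragraphs:

SYMS_PER_STREAM, STREAMS = 5, 6
decode_slots = 4 * SYMS_PER_STREAM * STREAMS   # 4 uops per symbol decode = 120
refill_slots = 8 * STREAMS                     # 8 uops per stream refill  = 48
overhead_slots = 20                            # bookkeeping per iteration
total_slots = decode_slots + refill_slots + overhead_slots    # 188 issue slots
cycles_throughput = total_slots / 4            # ~4 issue slots/cycle -> 47.0
cycles_latency = 5 * 7 + 11                    # 5 decodes + bswap refill -> 46
print(cycles_throughput / (SYMS_PER_STREAM * STREAMS))        # ~1.57 cyc/symbol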
How fast does the “production” version of this loop actually run on a Skylake machine, in a real workload, on real data? This is where I would ordinarily post results of a test run, but as I’m writing this, the Skylake test machine I’ve been using for this series isn’t letting me remote in for some reason, so I don’t have same-machine results to post just yet. The values I remember are around 1.83 cycles/symbol for 55 cycles per loop iteration, off from our ideal throughput estimate by 8 cycles or about 17%. (I’ll try to remember to update that with recent measured figures once I get that machine back up and running). That’s a higher discrepancy than I’m comfortable with, but we’re running into the limits of what we can do with our very basic back-of-the-envelope estimates here. A more detailed analysis using uICA gives a lower-bound estimate of 49.75 cycles for Skylake, with the listed bottleneck being “Decoder”. That is about 10% smaller than the actual observed figures on real workloads that I remember, which as noted in the previous installment of this series is a pretty typical ratio for real-world results (involving cache misses, branch mispredictions, other process/OS interference etc.) vs. throughput estimates. Either way, at around 1.83 cycles/symbol observed for 6-stream compared to 2.79 cycles/symbol observed for 3-stream, we get a 1.53x speed-up from doubling the number of streams.
I do have actual measured results on my home desktop machine, which as of this writing is an AMD Ryzen 9 7950X3D (Zen 4) at 4.2GHz nominal. On that machine, when doing single-threaded Huffman decoding using the 3-stream BMI2 decoder, I get around 2.08 RDTSC ticks (which are not actually clock cycles due to dynamic frequency scaling) per symbol, while the 6-stream BMI2 variant gets around 1.37 RDTSC ticks per symbol, for a speed-up factor of around 1.52x from using 6 streams – almost exactly the same ratio, on a different microarchitecture. (There are also dedicated Zen-optimized versions of the inner loop, which exploit some specifics of AMD’s Zen microarchitectures, and average around 1.79 RDTSC ticks/symbol for 3-stream and 1.29 ticks/symbol for 6-stream on the same machine; in this case, the speed-up works out to “only” 1.39x.)
For what it’s worth, around 1.5x is also about the level of speed-up I see on wider 64-bit ARM machines, although the really wide ones (e.g. Apple M1/M2) can see even larger speed-ups from 6-stream.
On said Zen 4 machine, the pure 6-stream Huffman decoding BMI2 inner loop hits a decode rate of around 3.07 GB/s (in our case, 1 symbol = 1 byte), on a single core, with the system otherwise mostly idle, i.e., running the OS and desktop but nothing particularly taxing. When using the Zen-optimized variant, it hits 3.25GB/s.
The entire Huffman decode on the same data, same machine averages 2.52GB/s on a single core (using the straight BMI2 variant, 2.65GB/s for the Zen-optimized variant); this includes other tasks like reading the Huffman code lengths from the bit stream, setting up decode tables, as well as the careful end-of-stream processing.
Discussion and conclusion
The original 6-stream design started as a thought experiment about how many streams a purely throughput-bound Huffman decoder would have to be using on a theoretical x86 with as many registers as we wanted, but the same basic instruction latencies as the Haswell/Broadwell/Skylake machines that were current at the time (early 2016). Once it became clear that the 4-instruction/symbol decode step was not hard to do and the resulting stream count worked out to “around 6”, exactly twice what we were shipping previously and without bumping into register count limits, the (admittedly slightly off-kilter) pair-of-3-streams design coalesced as something that was easy to fit into the existing bitstream structure and could just be ignored on targets that didn’t benefit from it, while giving us a considerable speed boost on the rest.
In aggregate, this series of posts is meant both as an explanation of the methods used and more generally, as an illustrative worked example about what designing a data format for higher ILP potential looks like. A “standard” single-stream Huffman decoder is fundamentally bottlenecked by its single serial dependency chain, but the “right” number of dependency chains for a given task is not necessarily straightforward, nor is it always easy to figure out by experimentation. For example, the throughput-optimized decode step used in this version of the Huffman decoder is strictly worse than the latency-optimized decoder for typical 3- or 4-stream setups, so unless you’re aware of the underlying dynamics, you wouldn’t even try it for anything larger. Likewise, the latency-optimized approaches that work best for smaller stream counts run out of steam as more streams are added (mostly from extra spills), suggesting that there is no further win to be had from increased stream counts when in fact there is.
The particular numbers here are always subject to change; the original 3-stream count in the Oodle Kraken format was motivated primarily by the AMD Jaguar cores that were our most important target at the time, and the 6-stream version, although it happens to work out fine, was just a convenient multiple. As CPU designs get wider, the optimal number of streams keeps increasing, provided other likely bottlenecks keep getting fixed alongside.[3] That said, a not-quite-ideal 6-stream arrangement on a machine that would do even better with 9 is still much preferable to your usual single serial bitstream decode.
If nothing else, take this series as a practical demonstration of just how much wasted potential there is on current hardware (“current” here really meaning “anything built in the 21st century”) in serialization formats with a single stream of variable-sized data where each token needs to be decoded before decoding of the next one can even start. Of course decoding speed is far from the only concern in practice, but it should at least be considered, when all too often it is ignored entirely. There are definitely problems with multi-stream formats (especially when consuming data incrementally or in a streaming fashion), so they are not a panacea, but in many use cases it is completely fine to decode data in small batches (on the order of tens of kilobytes) with some buffering, and in those cases a multi-stream design can offer substantial benefits, especially in write-once read-many scenarios.
Finally, one lengthy digression on a complementary strategy: decoding multiple Huffman symbols at once. The basic idea here is straightforward: if we’re doing a table lookup per symbol anyway, and if the whole point of variable-length codes is to assign shorter codes to more common symbols, can’t we decode multiple symbols from a single table lookup, provided those symbols are all fully contained in the portion of the bitstream we peeked at? And yes, of course we can, but there are considerable logistical difficulties, which become apparent if we look again at the final per-symbol decode step we ended up with:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
movzx ecx, word [rTablePtr + rcx*2] ; table fetch
shrx rBits, rBits, rcx ; bit buffer update
mov [rDest + <index>], ch ; emit byte
This design uses 2 bytes per table entry so it doesn’t have space to decode multiple symbols at once. The easiest generalization is to allow decoding either 1 or 2 symbols per step. That means per table entry, we now need the combined length of all symbols decoded, the values for the first and (potentially) second symbol decoded, and a symbol count (or other flag) for whether the table entry describes 1 or 2 symbols. Take a basic layout of a 32-bit table entry containing 4 bytes (len, nsyms, sym1, sym2) for a concrete example (that’s not the only possible design point, but it is a sensible one). Note that our table entries are now twice the size (and much trickier to initialize), and furthermore, for each table lookup, we no longer know whether we’re going to decode 1 or 2 symbols, so we can’t use immediate offsets for our stores anymore; in every decode step, we need to figure out what the symbol count is and advance the output pointer. We also need to either do multiple loads or a single combined 32-bit load followed by shifting to get access to the various fields; pick your poison. My best bet at a workable 1-or-2 symbol decode step looks something like this:
andn rcx, rMask, rBits ; peek (rMask = ~2047)
mov ecx, dword [rTablePtr + rcx*4] ; table fetch
shrx rBits, rBits, rcx ; bit buffer update
movzx rNumSyms, ch ; extract nsyms
shr ecx, 16 ; align output byte(s)
mov [rDest], cx ; emit output byte(s)
add rDest, rNumSyms ; advance write pointer
Note that despite still using oddball tricks like the movzx-from-ch to get (effectively) a bitfield extract, which comes with strings attached (rNumSyms can’t be just any register for this to work), we’re now using 7 instructions and 7 fused-domain uops (8 unfused-domain) to decode either 1 or 2 symbols. Furthermore, we’re adding a secondary dependency chain on the outputs between adjacent logical streams: the add rDest, rNumSyms needs to complete before the next stream can finish its output. That’s not the end of the world, since only the stores depend on it, but it’s an extra potential source of slowdowns.
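To make the table entry layout concrete, here’s a rough Python model of the 1-or-2 symbol scheme described above. The packing and field names (len, nsyms, sym1, sym2) follow the layout from the text; the function names and the 11-bit peek are just my sketch, not code from the actual decoder:

def pack_entry(length, syms):
    # 32-bit entry, one byte each: (len, nsyms, sym1, sym2).
    sym2 = syms[1] if len(syms) > 1 else 0
    return length | (len(syms) << 8) | (syms[0] << 16) | (sym2 << 24)

def decode_step(bits, table, out, pos):
    entry = table[bits & 2047]            # peek 11 bits, fetch table entry
    bits >>= entry & 63                   # consume the combined code length
    nsyms = (entry >> 8) & 0xFF
    out[pos] = (entry >> 16) & 0xFF       # first symbol
    out[pos + 1] = (entry >> 24) & 0xFF   # second symbol; garbage if nsyms == 1,
                                          # overwritten by the next store, so the
                                          # output buffer needs a byte of slack
    return bits, pos + nsyms              # advance by however many we decoded

The mov [rDest], cx in the assembly corresponds to the pair of byte stores here: it always writes two bytes and relies on the pointer advance to decide whether the second one counts.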
This is just not all that appealing. As noted above, we already get to be mostly throughput-bound (actually frontend-bound, but at least we’re not constantly waiting on dependency chains anymore) on the original Skylake target, using 4 instructions per symbol decoded. This modified version will in the best case only ever write 2-symbol pairs and use 7 instructions to do so, for 3.5 instructions per symbol decoded. But that’s with a 100% success rate. Even if we manage to decode a pair of symbols a staggering 75% of the time, we’re still taking 7 instructions to decode a single symbol in the remaining cases, which works out to an average of 0.25 × 7 + 0.75 × 3.5 = 4.375 instructions per symbol – still worse than the single-symbol decode, although to be fair, the double-symbol decodes end up doing a refill less often when they hit. Considering that, it’s likely that at around a 75% pair decode rate, we’d break even or see a slight speed-up from multi-symbol decoding.
That’s still not particularly promising though; from multi-symbol decoding, we’d like to see a lot more than “break even around the point where we decode 2 symbols at once 3/4 of the time”. Moreover, from the entropy of the symbol stream, we can work out how big the decode table needs to be to be able to successfully decode symbol pairs on most decode steps. Oodle Data uses a (very low) code-length limit of 11 bits to facilitate infrequent refills. In our existing table size of 2048 entries, we can expect to see frequent pairs of symbols if the entropy is around 5.5 bits per symbol or so, which is very low. Our typical data for things like literals after LZ compression is more frequently around 7.2-7.8 bits per symbol (each of which is from an 8-bit alphabet). We would need something on the order of 2^15 = 32768 table entries for pair decodes to be common; at 4 bytes per table entry, that’s 128KiB, which is definitely outside the L1D cache, and in fact takes up a sizeable fraction of the L2 cache as well. So now we’re expecting to miss the L1D cache on those table accesses as well, meaning our expected latency per symbol decode for the memory access goes from around 5 cycles to around 16-20, which completely blows up our assumptions. Now we have nowhere near enough streams to hide that latency again, and we also need to spend the time to set up these giant tables; as you can see from the numbers quoted in the preceding section, the time spent setting up the decoding tables even for our small (2048-entry, 2 bytes/entry) configuration is not negligible, and multi-symbol decode tables need more space per element, more elements, and are more expensive to set up.
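The back-of-the-envelope version of that table-size argument, in a few lines of Python (7.5 bits/symbol is just a representative value from the range above):

pair_bits = 2 * 7.5                        # two typical ~7.5-bit symbols
needed_entries = 1 << round(pair_bits)     # 2**15 = 32768, vs. 2**11 = 2048 today
print(needed_entries * 4 // 1024, "KiB")   # at 4 bytes/entry: 128 KiB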
Add all these together, and the case for table-driven multi-symbol decoding in the general case just doesn’t look great: it increases the amount of work per decode step, which is disproportionately felt when the decode steps are otherwise lightweight or there are multiple streams available to compensate for critical-path latency. It increases table build time, which can be a significant fraction of overall decode time if tables change frequently, table building isn’t optimized (it often isn’t) or tables are large, and it often ends up requiring impractically large tables to be able to decode pairs to begin with.
That doesn’t mean that it doesn’t have its uses, especially in cases where you know the expected entropy is low. For example, the JPEG decoder in stb_image normally decodes AC coefficients in two steps, first reading a (run,level) pair (this is encoded into a single 8-bit value as per the JPEG spec) and then reading an extra number of bits determined by the “level” value to figure out the coefficient value. Most AC coefficients are very small (mostly ±1 or ±2), so stb_image’s JPEG decoder additionally has the “fast AC” table that combines decoding (run,level,extra_bits) all into one table lookup (into a 1KiB table) in the typical case when the AC coeffs are very small.
However, for general-purpose lossless (de)coding in software, multi-symbol decoding just isn’t very compelling in my experience.
And that’s it for this post and the Huffman portion of this series, unless there’s interest in some of the other targets. (There’s also a suite of optimized ARM decoders, both 32- and 64-bit, and while they have some cool tricks I’m happy with, they don’t add anything substantially new to the material already covered so far.) I’m leaving the door open to do another follow-up on the non-Huffman portions of Oodle Data’s entropy coding—especially TANS—but I don’t have any concrete plans yet. We’ll see!
Footnotes
[1] This does give us a full 6 independent streams when batch-decoding the entire contents of a Huffman block up front, but is still effectively 3 streams when decoding incrementally. Not a problem in the regular SW Oodle decoders, but there are other “streaming” implementations that don’t get a benefit out of the 6-stream layout. The platforms we support that greatly prefer 3-stream batch decoding were much more important to our customers than the streaming use case at the time this was designed, so this was a sensible trade-off.
[2] You might notice that this code has a tendency to keep all shift counts explicitly in rcx – that exact register, not just a named temporary like I do for the other register references. That’s because even though I’m showing the BMI2 versions of everything, we also support non-BMI2 paths for machines that don’t have it, and in those we want to use the shr reg, cl forms. We also use the store-from-ch instruction, so the register receiving the table load needs to be one of eax, ebx, ecx or edx anyhow; it’s easiest to just make rcx the dedicated temporary register in this code and make sure everything that needs to be a shift count ends up in there. That way, the BMI2 and non-BMI2 paths can be very similar.
[3] Dougall Johnson has done some experimentation on Apple M1/M2 hardware and according to him, going up to even 12 streams is still beneficial, but I haven’t tried to replicate this myself.
Most standard texture compression formats use a type of algorithmic vector quantization (meaning that instead of storing an explicit codebook of possible blocks, the codebook entries are determined by an algorithm that uses the encoded block as an input). This is the case for all the BCn formats, ETC/EAC, and ASTC, but not PVRTC, where the decoded pixels near the edge of a block also depend on adjacent blocks, which makes the encoding process more of a global problem (so the notes in this post do not apply).
Furthermore, the contents of the encoded blocks can usually be partitioned into three categories:
- Header/mode bits, which determine how the remaining contents of the block are to be interpreted.
- Endpoints, which specify some of the colors present in the block in some form. The endpoints are then expanded into a larger palette of colors, and pixels select which color they want using
- Indices into the palette. Pixels generally pick the index that gives the closest match in the palette according to some error metric.
The expansion process from specified endpoints into the full palette frequently involves some form of linear interpolation, and the error metric is usually some variant of squared error (possibly weighted), for numerous reasons.
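As a concrete (if simplified) example, here’s roughly what that looks like for BC1’s 4-color mode in Python. I’m ignoring the 5:6:5 endpoint quantization and the exact interpolation rounding, which differs between decoders, so treat this as a sketch of the structure rather than a reference decoder:

def bc1_style_palette(c0, c1):
    # Two RGB endpoints expand to a 4-entry palette: the endpoints plus
    # interpolated colors at 1/3 and 2/3 of the way between them.
    lerp = lambda t: tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))
    return [c0, c1, lerp(1 / 3), lerp(2 / 3)]

def best_index(pixel, palette):
    # Each pixel picks the palette entry with the smallest squared error.
    def sqerr(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(palette)), key=lambda i: sqerr(pixel, palette[i]))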
I have now several times run into people who thought that encoding textures at reasonable speed was impressive because this problem looks like certain well-known hard problems such as Integer Least Squares. This is not the case, for at least two orthogonal reasons:
- Being able to reduce a problem to some variant of Integer Least Squares, or Integer Linear Programming, does not automatically make it hard in a computational complexity sense. The reduction in a hardness proof needs to go the other way: you need to be able to turn an arbitrary instance of a known hard problem into an instance of your problem to prove that your problem is hard. It is entirely possible, and frequently quite easy, to go the other way and turn trivial problems into instances of NP-hard or worse problems (usually by accident).
- Hyper-polynomial growth of the running time of the best-known solution to a problem as a function of problem size N matters when the problem size depends on the input, but this is not the case for block-based texture compression. When the N for some hyper-polynomial algorithm is bounded by a constant independent of the size of the input (in this case that would be the dimensions of the texture), the asymptotic complexity of whatever algorithm you use to solve an NP-hard subproblem does not really matter.
In particular, if you chop an image/texture into fixed-size blocks (i.e. block size doesn’t depend on the input) that are then encoded fully independently, and encoding any given block takes a bounded amount of time, that makes encoding of the full image linear time in the number of input pixels, even if encoding any individual block is a combinatorial nightmare. You might not like the constant factors in front, but it’s still going to be linear time.
For the purposes of computational complexity, behold the following universal algorithm to encode compressed textures:
def encode_block(original_block):
    # Brute force: try every possible encoded block and keep the one whose
    # decoded result is closest to the source pixels.
    best, best_dist = None, None
    for E in enumerate_coded_blocks():
        D = decode_block(E)
        dist = distance_metric(D, original_block)
        if best_dist is None or dist < best_dist:
            best_dist = dist
            best = E
    return best
This is a perfectly fine linear-time algorithm with the only problem being that even with something as modest as 64-bit blocks, it’s not expected to finish before the heat death of the universe. Don’t worry though, this post is not just an exercise in pedantry.
Namely, the decomposition of a block into its constituent parts above also lets us immediately shrink the search space drastically, and of course practical encoding algorithms do just that:
- Not all header/mode bits need to be explored. If desired, methods that still want to guarantee optimality can use branch-and-bound techniques to prune the search. In practice, nobody actually wants to spend the extra time (or cares about the generally tiny improvements from) guaranteeing optimality, so more commonly heuristics are used here as well.
- Formats that split their blocks into subsets let you do parts of the solve separately, which shrinks the search space considerably where it applies.
- Given a set of endpoints, the source pixels, and a distance metric, it is pointless to try all possible choices of indices. Once we fix the endpoints for a subset (and possibly some index constraints resulting from that choice), the optimal indices can generally be determined directly. This direction is usually cheap to solve exactly, but the endpoints are a somewhat awkward search space.
- In the other direction, if we fix a set of indices, the source pixels, and the distance metric, we can try to solve for the optimal endpoints given that set of indices. This direction is usually tricky to solve exactly, but the search space is easier to prune to a small subset. It’s the basis of cluster fit methods like the original algorithm in the S3TC patent and methods derived from it; a small least-squares sketch of this step follows after this list.
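Here’s a minimal sketch of that least-squares step in Python, for the 1D (single-channel) case: given per-pixel interpolation weights implied by the chosen indices, the optimal (unquantized) endpoints fall out of a 2×2 linear system. This is my illustration of the general idea, not the exact algorithm from the S3TC patent, and the clamping and re-quantization a real encoder needs afterwards is omitted:

def solve_endpoints_1d(pixels, weights):
    # Decoded pixel i is (1 - w_i)*e0 + w_i*e1; minimize the sum of squared
    # errors over e0 and e1 by solving the 2x2 normal equations.
    a = sum((1 - w) ** 2 for w in weights)
    b = sum((1 - w) * w for w in weights)
    c = sum(w * w for w in weights)
    d0 = sum((1 - w) * p for w, p in zip(weights, pixels))
    d1 = sum(w * p for w, p in zip(weights, pixels))
    det = a * c - b * b
    if det == 0:                          # degenerate: all weights are equal
        avg = sum(pixels) / len(pixels)
        return avg, avg
    return (d0 * c - d1 * b) / det, (d1 * a - d0 * b) / det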
The actual reason I’m writing this post is that for several of the simpler formats, the resulting search space is small enough that exhaustive search is completely viable. Not so much for real-world texture encoding – it’s too slow and the benefits are way too small for that. For BC1-5, it’s also pointless because the decoder isn’t exactly specified. But it’s definitely doable to get reference results if you want to see what optimality actually looks like.
In particular, for BC3 alpha or BC4 blocks, the set of possible endpoint/mode combinations is just 16 bits (the ordering of the endpoints selects one of two modes), so 65536 possible choices of endpoint pairs. Determining the optimal endpoints in 1D isn’t even hard (and can be done exactly using fairly straightforward math), and it’s all very easy to accelerate using SIMD, using a GPU, or both. This is not just eminently viable, it’s viable enough that you can use it not only to test individual blocks (or small image snippets), but to encode full textures just fine. Sure, encoding a 1024×1024 texture that way takes seconds instead of milliseconds, but that’s not long to wait for a reference result at all.
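To make the BC4 case concrete, here’s what the brute-force reference search looks like in Python. The interpolation follows the ideal BC4 formulas; actual decoders may round slightly differently (which is exactly the caveat above), and being pure Python, this is more “seconds per block” than “seconds per texture” – it’s meant to show the structure, not to be the fast SIMD/GPU version:

def bc4_palette(e0, e1):
    # e0 > e1 selects the 8-value mode (6 interpolated values); otherwise
    # the 6-value mode with 4 interpolated values plus literal 0 and 255.
    if e0 > e1:
        return [e0, e1] + [((7 - i) * e0 + i * e1) // 7 for i in range(1, 7)]
    return [e0, e1] + [((5 - i) * e0 + i * e1) // 5 for i in range(1, 5)] + [0, 255]

def encode_bc4_exhaustive(pixels):
    # pixels: 16 values in [0, 255]. Try all 65536 endpoint pairs, assign
    # each pixel its closest palette entry, keep the pair with the best SSE.
    best, best_err = None, None
    for e0 in range(256):
        for e1 in range(256):
            pal = bc4_palette(e0, e1)
            err = sum(min((p - q) ** 2 for q in pal) for p in pixels)
            if best_err is None or err < best_err:
                best, best_err = (e0, e1), err
    return best, best_err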
BC5 is just two BC4 blocks, one per channel, encoded completely independently, so this also applies to BC5. BC1 color blocks are still 64-bit blocks and have a 32-bit space of possible endpoint pairs. However, 2^32 = about 4 billion “interesting” encodings (since we don’t care about sub-optimal choices of indices) is still small enough that exhaustive testing is totally viable on either CPU or GPU if you’re using an optimized search, a beefy CPU/GPU, parallelization and are willing to wait a bit to encode a single 4×4 pixel block. (In this case, you probably would not want to encode a full 1024×1024 texture this way.) And if we can do BC1 and BC4, we can also do BC2 (the alpha part is trivial) and BC3 (it decomposes into effectively separate BC1+BC4 problems). In short, BC1-5 are actually totally practical to encode exhaustively these days, and the same applies to ETC and EAC.
Would you use this for a practical encoder? Definitely not. Likewise, it’s not viable for the more complex formats like BC6-7 or ASTC. Nevertheless, it felt worth pointing out that optimal BC4-5/EAC encoding (in the least-squares sense anyway) for a known decoder is eminently possible even at home for real-world-sized textures if you’re willing to wait for a few seconds (and that’s with no pruning whatsoever). BC1-3/ETC are more involved, but still at a timescale where determining optimal solutions (given a known decoder) for a small image patch is something you can easily do on a regular home computer even without using any pruning strategies if you’re willing to wait a few hours.
Now the actual application given here is obviously very specific to what I happen to be working on, but the reason I’m writing it up is that I’m trying to make a few more general points: first, “this looks like what I know is an NP-hard problem” is no reason to assume something is actually hard (never mind infeasible) in practice, even if you actually want optimality – hardness reductions work the other way round, and this is by no means the only instance I’ve run into of NP-hard subproblems of bounded size, which are a very different beast. Second, exploiting even some very basic symmetries can sometimes take combinatorial optimization problems from “never going to happen” to “that’ll be 5 minutes”, so it’s worth looking. Third, computers are fast. Trying on the order of 100,000 things if said trials don’t take long is just not that big a deal and hasn’t been for a long time. For that matter, trying 4 billion things has also been entirely practical for a while, as long as individual trials are in the “100s to 1000s of cycles” range (i.e. don’t try this with slow interpreted languages) and you have a few cores. For the past 15 years or so, I’ve just exhaustively tested pretty much every unary function on 32-bit inputs (especially floats) and every binary function on pairs of 16-bit inputs on all possible inputs, unless there was a very compelling reason why that wasn’t possible.