A very brief BitKnit retrospective
UPDATE May 7, 2023: I wrote this post yesterday somewhat in a huff (for reasons not worth going into) and the original post contained several inaccuracies. These have been corrected in this version and I’ve marked the places where I was wrong [like this]. Let’s just say I wish I hadn’t posted this in its original form, but I did, and I’d rather make the corrections publicly now than pretend it didn’t happen.
BitKnit is an LZ77+rANS lossless general-purpose compressor that was designed 8 years ago (1.0 release on May 20, 2015) to replace the aging and very slow-to-encode LZ77+arithmetic-coding codec built into Granny 3D; it also worked – somewhat unintentionally – as an LZ77/entropy-coding test vehicle. The codec did end up in Granny, and later a small variation (with a slightly different bitstream to work within the different container) shipped in Oodle 2.1.2. In Oodle it was quickly superseded by the “oceanic cryptozoology” codecs (first Kraken, then later Mermaid/Selkie and Leviathan), so BitKnit the actual codec is just a historical curiosity these days (and deprecated in current Oodle versions), but it was our first released codec to use several ideas that later ended up in other Oodle codecs [the original version implied that these ideas started in BitKnit; that’s incorrect], and might be of general interest. So I’ll keep it brief and focus mainly on the parts of it that actually got used successfully elsewhere.
Basic overview
BitKnit (I’ll just write BK from here on out) is basically your usual LZ77 with an entropy-coding backend. Like LZX and later LZMA, it keeps a history of recent match offsets that can be cheaply reused; BK was designed mainly for Granny files, which are chock full of fixed-size records with highly structured offsets, so unlike LZMA’s LRU history of 3 recent match offsets, it keeps 7. This is somewhat expensive to maintain in the encoder/decoder and mostly a waste for ASCII text and text-based formats, but it turns out to be very useful for the structured binary data that was its intended use case.
For entropy coding, it uses a 2-way interleaved rANS coder with 32-bit state that emits 16-bit words at a time, and 15-bit “semi-adaptive” (aka deferred-summation) multi-symbol (i.e. not binary) models. Semi-adaptive means the models change over time, but not after every symbol: model updates are batched and the model remains static in between, which lets it amortize update costs and build auxiliary data structures to accelerate decoding. The particular model it ended up with was a compromise to hit its speed targets, and its update mechanism wasn’t great: significantly faster than LZMA/CABAC/etc.-style “update after every symbol” fully adaptive models, but with appreciably worse compression, and slower to decode than a fully static model would have been (which would also have enabled tANS usage). Our later codecs jettisoned this part of the design; Oodle LZNA (as the name suggests, an LZMA-derived design) used multi-symbol rANS but with full adaptation after every symbol. Kraken, Mermaid and Leviathan (but not Selkie, which is a byte-aligned LZ with no entropy coding) support multiple choices of entropy coder, but they’re all based on static models.
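To make the shape of this concrete, here’s a minimal sketch of this style of rANS decoder (32-bit state, 16-bit renormalization words, 15-bit probabilities); the names and layout are illustrative, not BitKnit’s actual code:

    #include <stdint.h>

    static const uint32_t kProbBits = 15;               // probabilities sum to 1 << 15
    static const uint32_t kProbMask = (1u << kProbBits) - 1;
    static const uint32_t kRansL    = 1u << 16;         // lower bound of normalized interval

    struct RansDecoder {
        uint32_t state;
        const uint16_t *ptr;        // compressed words, consumed in increasing address order

        void init(const uint16_t *p) {
            ptr = p;
            state = (uint32_t(ptr[0]) << 16) | ptr[1];
            ptr += 2;
        }
        uint32_t peek() const {                         // slot in [0, 1 << kProbBits)
            return state & kProbMask;
        }
        void advance(uint32_t start, uint32_t freq) {   // (start, freq) of the decoded symbol
            state = freq * (state >> kProbBits) + (state & kProbMask) - start;
            if (state < kRansL)                         // renormalize: pull in one 16-bit word
                state = (state << 16) | *ptr++;
        }
    };

    // Two such decoders alternate symbol by symbol ("2-way interleaved"), so the
    // work for one stream can overlap the other's serial dependency chain.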
BitKnit’s use of a semi-adaptive model makes it easy to do the modelling and most of the encoding in a single front-to-back pass. Static models are trickier to encode for because of interdependencies between e.g. the LZ parse and the statistics of the symbols emitted, which makes encoding harder and somewhat slower, but such is life. The actual rANS bitstream needs to be encoded in reverse, though (we prefer our decoders to consume data in increasing address order). To limit encoder memory usage, BitKnit processes data in chunks of 64k (uncompressed) at a time; the rANS bitstream is flushed after every chunk, which adds a small bit of overhead but lets the encoder use a guaranteed bounded amount of memory for arbitrary input sizes, improves its memory access patterns, and is also handy from a file-format perspective. (Old post about rANS in practice, derived from how we used it in BK and then LZNA, here.) All the Oodle formats are internally chunked too, and have been forever (it’s how the framing format works); I can recommend this strategy in general.
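The per-chunk encoder flow then looks something like the following sketch, under the assumptions just stated; Symbol, parse_chunk and RansEncoder are stand-ins, not real BitKnit interfaces:

    #include <stdint.h>
    #include <stddef.h>
    #include <vector>

    static const size_t kChunkSize = 64 * 1024;         // uncompressed bytes per chunk

    struct Symbol { /* token type, value, model id, ... */ };
    // Forward pass: LZ parse plus semi-adaptive model updates (not shown).
    void parse_chunk(const uint8_t *p, size_t len, std::vector<Symbol> *syms);
    struct RansEncoder {
        void put(const Symbol &s);                      // encode against the chunk's models
        void flush(std::vector<uint8_t> *out);          // emit final state + buffered words
    };

    void encode_stream(const uint8_t *data, size_t size, std::vector<uint8_t> *out) {
        for (size_t pos = 0; pos < size; pos += kChunkSize) {
            size_t len = (size - pos < kChunkSize) ? size - pos : kChunkSize;

            std::vector<Symbol> syms;
            parse_chunk(data + pos, len, &syms);        // modelling runs front to back...

            RansEncoder rans;
            for (size_t i = syms.size(); i-- > 0; )     // ...but rANS encodes in reverse,
                rans.put(syms[i]);                      // so the decoder can read forward
            rans.flush(out);                            // per-chunk flush bounds encoder memory
        }
    }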
Entropy coding experiments
While working on BitKnit, we experimented a lot with different entropy-coding options, and much of that work ended up in our other formats in some way or other [my colleague Charles Bloom also did a lot of experimentation at the same time; there was a lot of bouncing ideas back and forth in a bunch of mail threads, and the previous version of this post implied I did all that work, which is not true]. My blog post “Mixing discrete probability distributions” was written while working on BK and underlies both its semi-adaptive models and the follow-up “Models for adaptive arithmetic coding”. The latter are what’s used in Oodle LZNA, mostly with 8-, 9-, 16- or 17-symbol alphabets, using a SIMD decoder. This style of model, with a larger radix than binary, needs to sustain a much smaller number of symbol decodes per second than an equivalent binary arithmetic coder does, and especially combined with interleaved rANS coding gives quite a nice speed boost in the decoder. The caveat is that they’re usually not quite as good in terms of compression ratio as the “binary tree”-style adaptive binary models, but I consider it a great trade-off that even now, 8 years after writing it up, feels under-explored to me. (LZNA did use it; our follow-up formats didn’t, simply because we’ve generally moved away from design points where adaptive coding makes sense at all, since in our application space decoder speed is very important.)
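The core trick from that post, roughly: interpolate between the cumulative distributions rather than the individual symbol probabilities. Both CDFs start at 0 and end at the same power of 2, so a rounded lerp of the CDFs automatically keeps the total fixed. A minimal sketch with illustrative names (details like why monotonicity and nonzero frequencies are preserved are in the post itself):

    #include <stdint.h>

    static const int kMixBits = 4;   // blend factor f is in [0, 1 << kMixBits]

    // dst/a/b are CDFs with n+1 entries: x[0] = 0, x[n] = 1 << kProbBits,
    // and x[i+1] - x[i] is the scaled probability of symbol i.
    void mix_cdfs(uint16_t *dst, const uint16_t *a, const uint16_t *b, int f, int n) {
        for (int i = 0; i <= n; ++i)
            dst[i] = (uint16_t)(a[i] + (((int32_t)(b[i] - a[i]) * f) >> kMixBits));
        // The endpoints of a and b agree exactly, so the mixed sum stays at the
        // same power of 2, which is what the divide-free rANS decode needs.
    }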
The actual models that ended up in LZNA were designed by my colleague Charles Bloom. Ultimately, the adaptive-coding direction was rejected as too slow for the rates BK was targeting (not least because some of the platforms BK was meant to work on didn’t have suitable integer SIMD either), but it was a great fit for LZNA’s slower target. Alas, it too didn’t last long; Leviathan, which came out a few years later, usually gets quite close – sometimes better – at something like 10x the decode speed, using purely static models.
LZ literals
The main problem with batched model adaptation is that it takes quite a bit longer for the encoder (or decoder) to respond to changes in symbol statistics. LZMA’s backend can pivot rapidly on such changes: when an unexpected (i.e. low-probability) symbol is seen, its probability is immediately boosted drastically. In my experiments, even batched updates that ran every 2 symbols (i.e. update after every second symbol) did much worse than immediate updates/full adaptation on LZ literals, i.e. all the bytes not covered by string matches. Other types of symbols fared much better in my testing, but that’s of little use, because literals usually make up most of a compressed LZ data stream, both by symbol count and by encoded size. If the literals have to be fully adaptive, that’s most of the runtime cost of full adaptation already, which as mentioned was slower than what BK was aiming for.
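For reference, a hedged sketch of what a batched (“deferred summation”) update looks like, with illustrative constants: symbols are coded against a static distribution while raw counts accumulate on the side, and every kBatch symbols the counts get folded in and the decode tables rebuilt.

    #include <stdint.h>
    #include <string.h>

    struct SemiAdaptiveModel {
        static const int kNumSyms = 16;       // e.g. one nibble
        static const int kBatch   = 32;       // update interval (illustrative)

        uint16_t cdf[kNumSyms + 1];           // current static coding distribution
        uint32_t counts[kNumSyms];            // counts accumulated since last update
        int      until_update;

        void saw_symbol(int sym) {
            ++counts[sym];
            if (--until_update == 0) {
                fold_counts_into_cdf();       // renormalize to a 15-bit total (not shown)
                rebuild_decode_tables();      // e.g. slot -> symbol lookup (not shown)
                memset(counts, 0, sizeof(counts));
                until_update = kBatch;
            }
        }

        void fold_counts_into_cdf();
        void rebuild_decode_tables();
    };

    // Between updates the model is completely static, so the amortized
    // per-symbol cost is a counter increment plus a decrement-and-test.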
Because BK was meant for Granny (a 3D mesh and animation exporter/runtime toolkit), most of the data I was looking at was mesh vertex data, topology (in the form of index buffers), and textures. Vertex data is laid out as fixed-size structs containing fields like vertex positions, normals, UV coordinates, and so forth. Meshes often have a fair bit of redundancy between vertices: the bytes encoding the position of one vertex frequently match those of other nearby vertices, authored meshes commonly contain axis-aligned planar components with a lot more structure still, and both the normals and UV coordinates in such cases tend to be highly repetitive as well, although usually with some variation. Indices are lists of small integers that are usually “more or less” increasing (meaning the average value of the indices you see in a sliding window of recent indices will steadily go up over time, although there are of course outliers). Some textures are quite noisy, and LZs don’t do great on those (you can improve things slightly with methods like PNG’s filters). Some are “vector-y” with very geometric features and large flat areas, and LZs do much better on those. And some are “vector-y” with gradients, which again LZs don’t do awesomely on (it somewhat depends on gradient orientation), but those obviously have very nice structure in their byte values.
What most of these have in common is that delta coding (i.e. sending values as differences from a prior value, rather than directly) tends to work well on them. And from a different direction, LZMA (the main codec I was comparing against for BK) has a special case in its literal coding for the literal immediately following a match: the idea is that the decoder can be fairly sure that right after a match ends, the next literal in the data stream is not what the next byte after the final matched location would have been – otherwise we would just have sent a longer match to begin with. LZMA does not strictly enforce this, and the optimal parser sometimes makes such choices when they turn out to be advantageous, but the probability of the literal byte being the same as the match byte is still quite low. So what LZMA does after a match is use a probability model for the XOR of the literal byte and the match byte.
This does two things: first, we expect that XOR not to be 0 with high likelihood, so even if everything else is completely random, we now end up with an alphabet of 255 uniformly random symbols (plus a very low-likelihood 0), instead of 256 random symbols, which is slightly better even assuming nothing about the distribution of values at all. But second, the match byte and the literal byte are not uncorrelated in practice. The first literal after a match, especially in scenarios like ours with arrays of fixed-size, very regularly structured records, is often a “near miss”, and the match byte still ends up being a good predictor. LZMA uses a bit-wise binary tree model, so XOR is a natural predictor for that topology.
BitKnit, however, does not; it uses byte-based coding, and as explained above, we have a lot of data where we expect delta coding to do well. Hence, BitKnit sends the byte difference (mod 256) between the match byte and the literal byte. [This is a part where I really got the history wrong; there’s an entire correction below.] This worked great, and some experimentation later, it ended up sending all literals this way, not just the first literal after a match: the first literal after a match sends the difference from the byte right after the match ended, the second literal after a match the difference from the byte after that, and so on.
This solved several problems at once. First, it acts as an automatic delta coder for this kind of data, and is neatly integrated with the LZ, so there’s no fiddly setup of good byte distances to delta against or similar. Because these distances are determined by the match distances, it’s all automatic within the format. It’s not generally as good as a custom format-aware delta coder, but it tends to get a lot of the benefit and Just Works. Second, it turns out that (for the data I was looking at, anyway) the delta-literal distribution was usually a lot less variable than the raw literal distribution, which was the key enabler for BitKnit’s use of semi-adaptive models throughout.
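On the decode side this is just an add; a minimal sketch (names and sign convention are illustrative, not BitKnit’s actual code):

    #include <stdint.h>
    #include <stddef.h>

    // Decode one literal at position pos, given the most recent match offset.
    // 'delta' is what the entropy coder produced; the match byte just continues
    // wherever the last match left off, so this works identically for the
    // first, second, ... literal after a match.
    static inline uint8_t decode_delta_literal(uint8_t *out, size_t pos,
                                               size_t recent_offset, uint8_t delta) {
        uint8_t match_byte = out[pos - recent_offset];
        uint8_t lit = (uint8_t)(match_byte + delta);   // mod-256 add undoes the subtract
        out[pos] = lit;                                // (LZMA-style XOR would be ^ here)
        return lit;
    }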
BitKnit itself always codes literals this way; this is not ideal for other types of data, but – somewhat surprisingly, to me anyway – it worked decently even on more text-based data like JSON or XML. Right around the time of BK’s original release, Charles wrote this up in his post on what he ended up calling LZ-Sub. He was using this at the same time in LZQ1, an experimental codec of his that never shipped but was used for lots of experiments that led to Kraken. Kraken, Mermaid and Leviathan all used this method and later expanded on it. Notably, neither LZQ1 nor the newer codecs go strictly one way or the other. LZQ1 could switch whenever it re-transmitted Huffman tables; Kraken and Mermaid work in chunks (128k of uncompressed data), and both have a per-chunk flag for whether literals should be taken as raw values or as deltas from the match bytes. The delta decoding is slightly more expensive than just copying raw literals, but it does tend to give decent compression-ratio improvements on many sources. As with essentially all other decisions in these codecs, the choice is made from an explicit space-speed trade-off (i.e. evaluate the compression-ratio gains, if any, of using delta literals, estimate the decompression cost, and decide by weighing both factors according to a user-configurable trade-off). Leviathan goes further down this path and adds several other literal modes, so there aren’t just two. Leviathan doesn’t have any notion of per-byte difference other than subtraction – we tried XOR but never found a really compelling use case – and instead can optionally include other types of context in its literal modeling.
The trade-off here is that the more complicated modes are both more expensive to decode and need to be evaluated by the encoder, which adds a bit to the encode-time cost.
For an example of how this works out in practice, you can look at, for example, Aras Pranckevičius’ recent (at the time of writing) series of blog posts. In part 2 of the series he compares several of the Oodle codecs to a bunch of other competitors and finds that we compare very well in terms of compression ratio, compression speed and decompression speed. The differences in his first test are as drastic as they are largely because Aras’ test set (structured binary data, namely several large-ish arrays of data items that are 4 floats each) plays exactly to our formats’ strengths. As you might notice, this is very similar to the “arrays of vertex data” situation that was the design goal for BitKnit and prompted many of its design decisions. Kraken and Mermaid improved on BK’s design considerably, but the delta literals in particular come straight from BK and work very well for this kind of data. Later in the series, in part 6, Aras experiments with his own data-specific delta filtering, at which point the big differences between Zstd and Kraken essentially disappear (worth noting that his results are squarely in the fast-to-medium part of the compression-speed scale; Zstd gets better bang for the buck in that range, while we tend to gain some more compression ratio at the slower settings, but overall I’d describe the two as comparable). In terms of absolute numbers, the delta literals in Kraken mean that Kraken at “Optimal1” (level 5) hits about a 3.71:1 compression ratio in the first test, while zstd level 18 (comparable encode speed) manages around 3.04:1. The latter test has Kraken Optimal1 around 4.34:1 and zstd level 15 (closest quoted) around 4.17:1, which I’m going to call roughly even. (We could make the two codecs trade places here by playing around more with Zstd’s compression levels and Kraken’s configurable space/speed trade-off setting.) In short, the delta literals manage to extract about half the gain of a specialized difference filter, with no manual data-specific work required.
Lots of repeat offsets
The other main fairly unique design choice in BitKnit is having many repeat-offset slots: where LZX, LZMA and Zstd (and Kraken, for that matter) use 3, BitKnit uses an extravagant 7. [Turns out Charles had also experimented with this some years prior, but that one I legitimately didn’t remember; either way it’s a matter of changing a single constant, and I’m sure it’s been independently discovered many times.] As with everything else in BitKnit, this comes down to its genesis as a codec for a data format leaning towards fixed-size records. In short, the extra repeat-offset slots are useful for the data it’s designed for, whereas for many other types of data they don’t really help. The extra repeat offsets rarely outright hurt compression, but maintaining them does take time in the decoder.
One weird wrinkle in BitKnit that we didn’t carry forward into other formats is the replacement policy. The standard method is “move to front” coding, i.e. repeat-offset slots are numbered from most to least recently used. With BitKnit’s many offset slots, I found that I got some slight wins by not inserting new offsets into the precious slot 0 (the frontmost); instead, new offsets get inserted into slot 2, and the only way for an offset to make it into slots 0 or 1 is to get reused at least once. This was a slight win in BitKnit but didn’t make it into any of our newer codecs. Notably, this was designed into BitKnit before I had an optimal (or optimizing, anyway) parser working, and it was a small but solid win with the heuristics-driven parse I was using in the encoder at the time. When we re-evaluated the idea for other formats later, we found that optimizing parsers are good at managing the recent-offset slots anyway, so there wasn’t much benefit to it. Leviathan did end up using 7 recent-match slots (largely because it helped on certain types of highly structured data, same as I saw in BK) but with standard MTF/LRU update rules. (I wrote up how BitKnit implements its offset history years ago.)
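The shape of that policy, as a sketch (the real update rules are in the linked write-up; names here are illustrative): reused offsets move to the front as usual, but fresh offsets enter at slot 2.

    #include <stdint.h>

    struct OffsetHistory {
        uint32_t slot[7];                     // most recently used first

        void reuse(int i) {                   // offset in slot i used again: standard MTF
            uint32_t off = slot[i];
            for (int j = i; j > 0; --j)
                slot[j] = slot[j - 1];
            slot[0] = off;
        }

        void insert(uint32_t off) {           // brand-new offset: enters at slot 2, so
            for (int j = 6; j > 2; --j)       // slots 0 and 1 only ever hold offsets
                slot[j] = slot[j - 1];        // that have been reused at least once
            slot[2] = off;
        }
    };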
And that’s it! I’m not sure these notes are valuable to anyone else, but BK was a fun little codec to work on (and it was little – the decoder, a heuristic encoder with multiple speed levels, and an optimizing encoder all fit in one 2700-line C++ file, with extensive comments), and it led to many ideas that we elaborated on further in our follow-up codecs.
More details on the literal models
I got the history wrong in the first version of this post, and some digging into old emails and commits later, the details turn out to be more interesting than I remembered. So, for what it’s worth, here’s some more history on how the various RAD codecs ended up with delta-coded literals as a mode.
Early BitKnit started with the idea of building something with adaptive multi-symbol arithmetic coding built on rANS. Multi-symbol arithmetic coders have existed for a long time, but they were mostly avoided for most of that time because they require a divide, which is fairly expensive. At the time (2014), new-ish PCs had 32-bit integer divide latencies that were not unreasonable (clock-cycle counts in the high 20s, on a dedicated divider that didn’t block anything else), but some of the targets we were still shipping code for had division that took 100+ cycles and blocked all integer execution for the entire time, which was problematic, to say the least. Encoding for rANS still needs divisions, but decoding does not, so this was very attractive: simultaneously faster than a regular multi-symbol decoder on newer PCs, and not unacceptably bad on the more constrained targets.
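The asymmetry is visible directly in the coder’s update steps: encoding needs a division and a modulo by the symbol frequency, decoding only a multiply, shift and mask. A sketch (renormalization omitted; 15-bit probabilities as elsewhere in this post):

    #include <stdint.h>

    static const uint32_t kProbBits = 15;   // probabilities sum to 1 << 15

    // Encode: push symbol (start, freq) onto state x -- needs div and mod.
    static inline uint32_t rans_encode_step(uint32_t x, uint32_t start, uint32_t freq) {
        return ((x / freq) << kProbBits) + (x % freq) + start;
    }

    // Decode: pop it back off -- multiply, shift and mask only.
    static inline uint32_t rans_decode_step(uint32_t x, uint32_t start, uint32_t freq) {
        return freq * (x >> kProbBits) + (x & ((1u << kProbBits) - 1)) - start;
    }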
The constraint was that to be able to use a divide-free approach, we needed to track estimates for a probability distribution in a way that kept the sum of all scaled probabilities at a fixed power of 2 at all times. That’s where the earlier “Mixing discrete probability distributions” comes in. Early BK first used this to build a variant of LZMA’s binary-tree probability models that worked on groups of 4 bits (nibbles) at a time, and also a variant of LZMA’s literal-after-match (“LAM”) model, “Nib-LAM”.
The regular nibble models maintain 17 probability distributions over 16 symbols each: one distribution for the high nibble of a byte, and 16 more for the low nibble, depending on the value of the high nibble. “Nib-LAM” keeps 48 distributions over 16 symbols: 16 distributions for the high nibble of the literal given the high nibble of the match byte, 16 distributions for the low nibble of the literal if those high 4 bits agreed (indexed by the low nibble of the match byte), and 16 distributions for the low nibble of the literal if those high 4 bits didn’t agree (indexed by the high nibble of the literal that we just sent).
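In decoder-shaped pseudocode, the context selection reads like this (Model and decode_sym are stand-ins for the arithmetic-coder machinery, not real interfaces):

    #include <stdint.h>

    struct Model;                            // one 16-symbol adaptive distribution
    uint32_t decode_sym(Model &m);           // stand-in for the rANS symbol decode

    // models[0..15]:  high nibble of literal, context = high nibble of match byte
    // models[16..31]: low nibble, high nibbles agreed,   context = match byte's low nibble
    // models[32..47]: low nibble, high nibbles differed, context = literal's high nibble
    uint8_t decode_nib_lam(Model models[48], uint8_t match_byte) {
        uint32_t hi = decode_sym(models[match_byte >> 4]);
        uint32_t lo;
        if (hi == (uint32_t)(match_byte >> 4))
            lo = decode_sym(models[16 + (match_byte & 0xF)]);
        else
            lo = decode_sym(models[32 + hi]);
        return (uint8_t)((hi << 4) | lo);
    }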
Early versions of BK used this Nib-LAM model, but unlike LZMA (which only uses this type of model for the first byte after a match), even the very first versions of BK I could find in source control always used it for later literals too (just extending the most recent match forward). I honestly don’t recall whether this was intentional or just a mistake, but literal coding in BK always worked this way, from the very beginning, and it gelled nicely with the delta literals.
Around the same time, my colleague Charles Bloom was experimenting with what turned into LZNA, and also played around with the Nib-LAM models. These were sometimes better than the bit-wise binary tree LAM models used by LZMA, sometimes worse; them being a bit worse makes sense if you think of them as a “chunkier” approximation of a bit-by-bit model, but more interesting was why they were sometimes doing a lot better.
He noticed a property that is easiest to see in the case where the high nibbles of the literal and match byte agree: we have 16 possible contexts (one for each possible high nibble of the match byte, which agrees with the high nibble of the literal), each storing a distribution over 16 possible values. That is, when those high nibbles match, we track probabilities for all 16×16 possible combinations of (literal low nibble, match low nibble), and in particular can pick up any correlation between them, even when there’s a mismatching bit somewhere in the middle (LZMA’s bitwise LAM models “detour” to a different part of the model immediately when that happens). And in particular, he noticed that on files where the Nib-LAM models did surprisingly well, he could do even better by sending straight-up delta literals.
As far as I can tell, that’s how the delta literals in their described form ended up in BK: generalizing the LZMA LAM-model idea to a higher radix to get Nib-LAM; using Nib-LAM on all literals (not just the first after a match); Charles noticing that in the cases where that really improves on LZMA, you can do even better by sending straight deltas; and then the deltas ending up in BK as the primary literal mechanism, replacing Nib-LAM for good as part of a push for speed, since Nib-LAM was just too slow for BK’s target speeds and the delta literals worked fine with semi-adaptive models.
Most of the significant work on all three of LZNA, LZQ1 and BitKnit happened concurrently (with many emails being sent back and forth…) from late February to late May of 2015. The original version of this post was poorly written; for the record, I’m not trying to claim sole credit for anything described here, nor do I want to erase the contributions of others on those same email threads at the time (which included not just Charles and me, but at the very least also Jeff Roberts, Sean Barrett, and Won Chun).