End-of-buffer checks in decompressors

This post is about general techniques for handling end-of-buffer checks in code that processes an input stream a byte at a time, or a few bytes at a time at the most. Concretely, I’ll be talking about decompression code, but many of these ideas are also applicable to related sequential input processing tasks like lexical analysis.

A basic decoder

To show how the problem crops up, let’s look at a simple decompressor and at what happens when we try to make an efficient implementation. Here’s our simple decoder for a toy LZ77 variant:

while (!done) { // main loop
    if (get_bits(1) != 0) { // match
        int offset = 1 + get_bits(13);
        int len = 3 + get_bits(5);

        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = get_bits(8);
    }
}

This particular coding scheme is chosen arbitrarily to keep the example simple, by the way. It’s not one I would actually use.

What does get_bits look like? The design space of bit IO is a big topic on its own, and I won’t be spending any time on the trade-offs here; let’s just use a basic variant with MSB-first (big endian-like) bit packing, reading the input stream from a memory buffer, one byte at a time:

const uint8_t *input_cursor;    // current input cursor
const uint8_t *input_end;       // end of input buffer

uint8_t read_byte()
{
    // If we reached the end of the input buffer, return 0!
    if (input_cursor >= input_end)
        return 0;

    return *input_cursor++;
}

uint32_t bitcount; // number of bits in bitbuf
uint32_t bitbuf;   // values of bits (bitcount bits from MSB down)

uint32_t get_bits(uint32_t nbits)
{
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }

    // The requested bits are the top nbits bits of bitbuf.
    uint32_t ret = bitbuf >> (32 - nbits);

    // Shift them out.
    bitbuf <<= nbits;
    bitcount -= nbits;
    return ret;
}

Note we do an explicit end-of-buffer check in read_byte and return a defined value (0 in this case) past the end of the input stream. This kind of check is generally required to avoid crashes (or buffer overrun bugs!) if there is any chance the input stream might be invalid or corrupted – be it as the result of a deliberate attack, or just a transmission error. Returning 0 past the end of buffer is an arbitrary choice, but a convention I tend to stick with in my code.

Reducing overhead

As for get_bits, the implementation is a fairly typical one. However, as should be obvious, reading a few bits like this is still a relatively involved process, because every call to get_bits involves the refill check and an update of the bit buffer state. A key trick in many decompressors is to reduce this overhead by separating looking at bits from consuming them, which allows us to grab lots of bits at once (speculatively), and then later decide how far to move the input cursor. This basically boils down to splitting get_bits into two parts:

uint32_t peek_bits(uint32_t nbits)
{
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }

    // Return requested bits, starting from the MSB in bitbuf.
    return bitbuf >> (32 - nbits);
}

void consume_bits(uint32_t nbits)
{
    assert(bitcount >= nbits);
    bitbuf <<= nbits; // shift them out
    bitcount -= nbits;
}

Using this new interface, we can modify our decoder to reduce bit IO overhead, by doing a single peek_bits call early and then manually extracting the different sub-fields from it:

while (!done) { // main loop
    // We read up to 19 bits; grab them all at once!
    uint32_t bits = peek_bits(19);
    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        consume_bits(19); // 1b flag + 13b offs + 5b len
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        consume_bits(9); // 1b flag + 8b value
    }
}

This trick of peeking ahead and deciding later how many bits were actually consumed is very important in practice. The example given here is a simple one; a very important use case is decoding Huffman codes (or other variable-length codes) aided by a look-up table.
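
To make the table-driven case concrete, here's one possible shape for such a decode step, built directly on peek_bits/consume_bits. This is just a sketch: HuffEntry, huff_lut and HUFF_LUT_BITS are made-up names, and it assumes every code fits within HUFF_LUT_BITS bits (longer codes would need a slower fallback path).

#define HUFF_LUT_BITS 11 // hypothetical table size

struct HuffEntry {
    uint16_t symbol; // decoded symbol value
    uint8_t  length; // actual code length, in bits (<= HUFF_LUT_BITS)
};

// One entry per possible HUFF_LUT_BITS-bit pattern, built from the code
// lengths so that every index sharing a code's prefix maps to that code.
extern struct HuffEntry huff_lut[1 << HUFF_LUT_BITS];

static uint32_t decode_huff_symbol(void)
{
    uint32_t bits = peek_bits(HUFF_LUT_BITS); // speculative peek
    struct HuffEntry e = huff_lut[bits];      // top bits select the entry
    consume_bits(e.length);                   // consume only what was actually used
    return e.symbol;
}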

Note, however, that we changed the input behavior: before, we really only called read_byte when we knew it was necessary to complete reading the current code. Now, we peek ahead more aggressively, and will actually peek past the end of the input bitstream whenever the last token is a literal. It’s possible to avoid this type of problem by being more restrained in the usage of peek_bits: only ever peek ahead by the minimum number of bits that we know is going to get consumed no matter what. However, doing so forces us to do a bit more work at runtime than the code fragment shown above does.

But the variant shown above is still completely correct: our implementation of read_byte checks for the end of the input stream, and returns zeroes once we’ve passed it. However, this is no longer an exceptional condition: rather than being a “contingency plan” in case of corrupted input data, we can now expect to hit this path when decoding many valid bit streams.

In short, we’re taking a check we need for correctness (the end-of-buffer check) and making it serve double duty to simplify the rest of our decoder. So far, all the code we’ve seen is very standard and not remarkable at all. The resulting bit-IO implementation is fairly typical, more so once we stop trying to only call read_byte when strictly necessary and simplify the buffer refill logic slightly by always refilling to have >24 bits in the buffer no matter what the peek amount is.
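
As a sketch of that simplification (my illustration of the idea, not a new interface): pull the refill out of peek_bits into one unconditional step that tops the bit buffer up to more than 24 bits, and call it once per loop iteration before any peeks.

// Refill so that bitcount > 24 afterwards; peek_bits can then drop its own
// refill loop for any peek of up to 24 bits.
static void refill(void)
{
    while (bitcount <= 24) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;
    }
}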

Even beyond such details, though, this underlying idea is actually quite interesting: the end-of-buffer check is not one we can easily get rid of without losing correctness (or at least robustness in the face of invalid data). But we can leverage it to simplify other parts of the decoder, reducing the “sting”.

How far can we push this? If we take as granted that reading past the end of the buffer is never acceptable, what is the least amount of work we can do to enforce that invariant?

Relaxed requirements

In fact, let’s first go one further and just allow reading past the end-of-buffer too. You only live once, right? Let’s pull out all the stops and worry about correctness later!

It turns out that if we’re allowed to read a few bytes past the end of the buffer, we can use a nifty branch-free refill technique. At this point, I’m going to manually inline the bit IO so we can see more clearly what’s going on:

while (!done) { // main loop
    // how many bytes to read into bit buffer?
    uint32_t refill_bytes = (32 - bitcount) / 8;
    
    // refill!
    bitbuf |= read_be32_unaligned(input_cursor) >> bitcount;
    bitcount += refill_bytes * 8;
    input_cursor += refill_bytes;

    assert(bitcount > 24);

    // peek at next 19 bits
    uint32_t bits = bitbuf >> (32 - 19);

    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        // consume_bits(19);
        bitbuf <<= 19;
        bitcount -= 19;
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        // consume_bits(9);
        bitbuf <<= 9;
        bitcount -= 9;
    }
}

This style of branchless bit IO is used in e.g. Yann Collet’s FSE and works great when the target machine supports reading unaligned 32-bit big endian values quickly — the read_be32_unaligned function referenced above. This is the case on x86 (MOV and BSWAP or just MOVBE where supported), ARMv6 and later (LDR provided unaligned accesses are allowed, plus REV when in little-endian mode) and POWER/PPC; not sure about other architectures. And for what it’s worth, I’m only showing 32-bit IO here, but this technique really comes into its own on 64-bit architectures, since having at least 56 bits in the buffer means we can usually go for a long while without further refill checks.
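
For reference, here's one portable way to write read_be32_unaligned; this is merely an illustrative sketch, and compilers typically recognize the pattern and emit an unaligned load plus byte swap on targets where that's fast:

static uint32_t read_be32_unaligned(const uint8_t *p)
{
    // Assemble 4 bytes in big-endian order; correct regardless of host
    // endianness or alignment.
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}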

That’s a pretty nice decoder! The only problem being that we have no insurance against corrupted bit streams at all, and even valid streams will read past the end of the buffer as part of regular operation. This is, ahem, hardly ideal.

But all is not lost. We know exactly how this code behaves: every iteration, it will try reading 4 bytes starting at input_cursor. We just need to make sure that we don’t execute this load if we know it’s going to be trouble.

Let’s say we work out the location of the spot where we need to start being careful:

// Before the decoder runs:
const uint8_t *input_mark;

if (input_end - input_cursor >= 4)
    input_mark = input_end - 4;
else
    input_mark = input_cursor;

The simplest thing we can do with that information is to just switch over to a slower (but safe) decoder once we’re past that spot:

while (!done && input_cursor <= input_mark) {
    // fast decoder here: we know that reading 4 bytes
    // starting at input_cursor is safe, so we can use
    // branchless bit IO
}

while (!done) {
    // finish using safe decoder that refills one byte at
    // a time with careful checks!
}

This works just fine, and is the technique chosen in e.g. the zlib inflate implementation: one fast decoder that runs when the buffer pointers are well away from the boundaries, and a slower decoder that does precise checking.

Note that the input_cursor <= input_mark check is the only addition to our fast decoder that was necessary to make the overall process safe. We do have a bit more prep work, and it turns out we ended up with an entire extra copy of the decoder for the cold “near the end of the buffer” path, but the path we expect to be much more common — decoding while still being safely away from the end of the input stream — really only does that one extra compare (and branch) more than the “fast but unshippable” decoder does!

And now that I’ve done my due diligence and told you about the boring way that involves code duplication, let’s do something much more fun instead!

One decoder should be enough for anyone!

The problem we’re running into is that our buffer is running out of bytes to read. The “safe decoder” solution just tries to be really careful in that scenario. But if we’re not feeling very careful today, well, there’s always the ham-fisted alternative: just switch to a different input buffer that’s not as close to being exhausted yet!

Our input buffers are just arrays of bytes. If we start getting too close to the end of our “real” input buffer, we can just copy the remaining bytes over to a small temp buffer that ends with a few padding bytes:

uint8_t temp_buf[16]; // any size >=4 bytes will do.

while (!done) {
    if (input_cursor >= input_mark) {
        assert(input_cursor < input_end);

        // copy remaining bytes to temp_buf
        size_t bytes_left = (size_t) (input_end - input_cursor);
        assert(bytes_left < sizeof(temp_buf));
        memmove(temp_buf, input_cursor, bytes_left);

        // fill rest of temp_buf with zeros
        memset(temp_buf + bytes_left, 0, sizeof(temp_buf) - bytes_left);

        // and update our buffer pointers!
        input_cursor = temp_buf;
        input_end = temp_buf + sizeof(temp_buf);
        input_mark = input_end - 4;
    }

    assert(input_cursor <= input_mark);
    // rest of fast decoder using branchless bit IO
}

And with that little bit of extra logic, we can use our fast decoder for everything: note that we never read past the bounds of the original buffer. Also note that the logic given above can generate an arbitrary amount of trailing zero bytes: if after swapping buffers around, our input cursor hits the mark again, we just hit the refill path again to generate more zeroes. (This is why the copying code uses memmove).

This is nifty already, but we can push this idea much further still.

Switching input buffers

So far, we’re effectively switching from our regular input buffer to the conceptual equivalent of /dev/zero. But there’s no need for that restriction: we can use the same technique to switch over to a different input buffer altogether.

We again use a temporary transition buffer that we switch to when we reach the end of the current input buffer, but this time, we copy over the first few bytes from the next input buffer after the end of the current buffer, instead of filling the rest with zeroes. We still do this using our small temp buffer.

We place our input mark at the position in the temp buffer where data from the new input buffer starts. Once our input cursor is past that mark, we can change pointers again to resume reading from the new input buffer directly, instead of copying data to the temp buffer.

Note that handling cases like really short input buffers (shorter than our 4-byte “lookahead window”) requires some care here, whereas it’s not a big deal when we do the bounds checking on every consumed input byte. We’re not getting something for nothing here: our “sloppy” end-of-input window simplifies the core loop at the expense of adding some complexity in the boundary case handling.
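
Here's a rough sketch of such a transition, with hypothetical next_start/next_end pointers describing the next input buffer; to keep it short, it assumes the next buffer holds at least sizeof(temp_buf) bytes, so the short-buffer care just mentioned is left out:

uint8_t temp_buf[16];

void switch_to_next_buffer(const uint8_t *next_start, const uint8_t *next_end)
{
    // Bytes left in the current buffer; at most 4, since this runs once
    // input_cursor has passed input_mark = input_end - 4.
    size_t tail = (size_t) (input_end - input_cursor);
    size_t head = sizeof(temp_buf) - tail; // bytes borrowed from the next buffer
    assert(tail <= 4);
    assert((size_t) (next_end - next_start) >= head);

    memmove(temp_buf, input_cursor, tail);
    memcpy(temp_buf + tail, next_start, head);

    input_cursor = temp_buf;
    input_end    = temp_buf + sizeof(temp_buf);
    // Mark where data from the next buffer begins. Once the cursor passes
    // this point, we can resume reading the next buffer directly:
    //   input_cursor = next_start + (input_cursor - input_mark);
    // and then recompute input_end/input_mark for that buffer as before.
    input_mark   = temp_buf + tail;
}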

Once we reach the actual end of the input stream, we start zero-filling, just as before. This all dovetails nicely into my old post “Buffer-centric IO” which combines very well with this technique. Together, we get almost-zero-copy IO, except for the copies into the transition buffer near buffer boundaries, which only touch a small fraction of all bytes and are there to make our lives easier.

A final few generalizations

The example I’ve been using was based on a single get_bits (or later peek_bits) call. But this is really not essential. The crucial property we’re exploiting in the decoder above is that we have a known bound for the number of bytes that can be consumed by a single iteration of the loop. As long as we can establish such a bound, we can do a single check per iteration, and in general, we need to check our input cursor at least once inside every loop that consumes a potentially unbounded (or at least large) number of input bytes — which in this example is only the main loop.

For the final generalization, note that a lot of compressors use a stream interface similar to zlib. In essence, this is a buffer interface similar to the one described in “Buffer-centric IO” for both the input and output buffers; the decompressor then gets called and processes data until the input or output buffers are exhausted, the end of stream is reached, or an error occurs, whichever happens first. This type of interface makes the (de)compressor somewhat harder to write but is much more convenient for the client code, which can just hand in whatever buffers it happens to have.

A typical way to implement this type of interface is described in Simon Tatham’s old article “Coroutines in C” — the key property being that the called function needs to be able to save its state at any point where I/O happens, in case it runs out of buffer space; and furthermore it needs to be able to later resume at exactly that point.

The solution is to effectively turn the (de)compressor into a state machine, and Tatham’s article describes a way to do so using a variant of Duff’s Device, quite probably the most infamous coding trick in the C language. Most (de)compressors with a zlib-like interface end up using this technique (or an equivalent) so they can jump into the middle of the decoder and resume where they left off.

So why do I mention all this? Well, the technique I’ve outlined in this article is applicable here as well: Tatham’s description assumes byte-level granularity IO, which means there’s generally lots of points inside the decoder main loop where we might need to save our state and resume later. If the decoder instead ensures there’s enough bytes left in the buffer to make it through one full iteration of the main loop no matter what, that means we have many fewer points where we need to save our state and later resume, often only in a single location.

What’s particularly interesting about combining the relaxed-refill technique with a coroutine-style decoder is that all of the refill and transition buffer logic can be pulled outside of the decoder proper. In library code, that means it can be shared between multiple decoders; so the logic that deals with the transition buffers and short input buffers only needs to be implemented and debugged once.

Discussion

The key simplification in this scheme is relaxing the strict “check for end of buffer on every byte consumed” check. Instead, we establish an upper bound N on the number of input bytes that can be consumed in a single iteration through our decoder main loop, and make sure that our current input buffer always has at least N bytes left — by switching to a different temporary input buffer if necessary.

This allows us to reduce the number of end-of-buffer checks we need to execute substantially. More importantly, it greatly increases the applicability of branch-less refill techniques in bit IO and arithmetic coding, without having to keep a separate “safe” decoder around.

The net effect is one of concentrating a little complexity from several places in hot code paths (end-of-buffer checks on every byte consumed) into somewhat increased complexity in a single cold code path (buffer switching). This is often desirable.

The biggest single caveat with this technique is that as a result of the decoder requiring N bytes in the input buffer at all times, the decoder effectively “lags behind” by that many bytes – or, depending on your point of view, it “looks ahead” by N bytes, reading from the input stream sooner than strictly necessary.

This can be a problem when, for example, several compressed streams are concatenated into a single file: the decoder may only get to decoding the “end of stream” symbol for stream A after N bytes from stream B have already been submitted to the decoder. The decoder would then need to “un-read” (in the sense of ungetc) the last few bytes or seek backwards. No matter how you dice it, this is annoying and awkward.

As a result, this technique is not all that useful when this is a required feature (e.g. as part of a DEFLATE decoder obeying the zlib interface). However, there are ways to sidestep this problem: if the bitstream specifies the compressed size for either the entire stream or individual blocks, or if the framing format ends in N or more trailing “footer” bytes (a checksum or something similar), we can use this approach just fine.

UPDATE: As commenter derf_ notes on Hacker News, there’s a nice trick to produce implicit trailing zero bits in a bit reader like the one described above by just setting bitcount to a high value once the last byte’s been read into bitbuf. However, this only works with a decoder exactly like the one shown above. The nice part about switching to an explicit zero-padding buffer is that it works not just with all bit IO implementations I’m aware of, but also with byte-normalized (or larger) arithmetic coders like typical range coders or rANS.

rANS in practice

This year, we (RAD) shipped two new lossless codecs, both using rANS. One of the two is Oodle LZNA (released in May), which Charles has already written about. The other is called BitKnit, which first shipped in July as part of Granny, and is slated for inclusion into more RAD products.

So, with two production-quality versions written and successfully shipped, this seems like a good time to write up some of the things we’ve learned, especially in terms of implementation concerns. Let’s get cracking! (I’m assuming you’re familiar with ANS. If not, I wrote a paper that has a brief explanation, and various older blog posts that give more details on the proofs. I’m also assuming you’re familiar with “conventional” arithmetic coding; if not, you’re not gonna get much out of this.)

One small note before we start…

I’ll be referring to the ANS family as a class of arithmetic coders, because that’s what they are (and so are “range coders”, by the way). So here’s a small historical note before we get cracking: the “bottom-up” construction of ANS and the LIFO encoding seem quite foreign once you’re used to most “modern” arithmetic coders, but what’s interesting is that some of the earliest arithmetic coders actually looked very similar.

In particular, I’m talking about Rissanen’s 1976 paper “Generalized Kraft Inequality and Arithmetic Coding” (which coined the term!). Note the encoding and decoding functions C and D on the second page, very reminiscent of the “bottom-up” construction of ANS (with the code being represented by a number that keeps growing), and the decoder returning symbols in the opposite order they were encoded!

Rissanen’s coder (with its rather cumbersome manual truncated floating point arithmetic subject to careful rounding considerations) never saw any widespread application, as far as I can tell. The coders that actually got traction use the now familiar top-down interval subdivision approach, and a different strategy to adapt to fixed-precision operation. But reading Rissanen’s paper from today’s perspective is really interesting; it feels like a very natural precursor to ANS, and much closer in spirit to ANS than to most of the other algorithms that descended from it.

Why rANS (and not FSE/tANS)?

On my blog especially, I’ve been talking almost exclusively about rANS, and not so much FSE/tANS, the members of the ANS family that have probably been getting the most attention. Why is that?

Briefly, because they’re good at different things. FSE/tANS are (nearly) drop-in replacements for Huffman coding, and have similar strengths and weaknesses. They have very low (and similar) per-symbol decode overhead, but the fast decoders are table-driven, where the table depends on the symbol probabilities. Building Huffman decoding tables is somewhat faster; FSE/tANS offer better compression. Both Huffman and FSE/tANS can in theory support adaptive probabilities, but there’s little point in doing anything but periodic rebuilds; true incremental updates are too slow to be worthwhile. At that point you might as well use a coder which is more suited to incremental adaptation.

Which brings us to rANS. rANS is (again, nearly) a drop-in replacement for multi-symbol arithmetic coders (such as range coders). It uses fewer registers than most arithmetic coders, has good precision, and the decoder is division-free without introducing any approximations that hurt coding efficiency. Especially with the various tweaks I’ll describe throughout this post, rANS has what are easily the fastest practical multi-symbol alphabet arithmetic decoders I know. rANS coders are also quite simple in implementation, with none of the tricky overflow and underflow concerns that plague most arithmetic coders.

So rANS is a pretty sweet deal if you want a fast arithmetic coder that deals well with relatively fast-changing probabilities. Great! How do we make it work?

Reverse encoding

As mentioned above, ANS coders are LIFO: whatever order you encode symbols in, the decoder will produce them in the opposite order. All my ANS coders (including the public ryg_rans) use the convention that the encoder processes the data in reverse (working from the end towards the beginning), whereas the decoder works forwards (beginning towards end).

With a static model, this is odd, but not especially problematic. With an adaptive model, decoder and model have to process data in the same direction, since the decoder updates the model as it goes along and needs current model probabilities to even know what to read next. So the decoder and the model want to process data in the same direction (forward being the natural choice), and the rANS encoder needs to be processing symbols in the opposite order.

This is where it comes in handy that rANS has an interface very much like a regular arithmetic coder. For example, in my sample implementation, the symbol is described by two values, start and freq, which are equivalent to the symbol interval lower bound and size in a conventional arithmetic coder, respectively.

Most arithmetic coders perform the encoding operation right there and then. In rANS, we need to do the actual encoding backwards, which means we need to buffer the symbols first (the first description I’ve seen of this idea was in Matt Mahoney’s fpaqa):

// Say our probabilities use 16-bit fixed point.
struct RansSymbol {
  uint16_t start; // start of symbol interval
  uint16_t range; // size of symbol interval
};

class BufferedRansEncoder {
  std::vector<RansSymbol> syms; // or similar

public:
  void encode(uint16_t start, uint16_t range)
  {
    assert(range >= 1);
    assert(start + range <= 0x10000); // no wrap-around

    RansSymbol sym = { start, range };
    syms.push_back(sym);
  }

  void flush_to(RansEncoder &coder);
};

With this, we can use rANS exactly like we would any other arithmetic coder. However, it will not be generating the bitstream incrementally during calls to encode; instead, it simply buffers up operations to be performed later. Once we’re done we can then pop off the symbols one by one, in reverse order, and generate the output bitstream. Easy:

void BufferedRansEncoder::flush_to(RansEncoder &coder)
{
  // Replays the buffered symbols in reverse order to
  // the actual encoder.
  while (!syms.empty()) {
    RansSymbol sym = syms.back();
    coder.encode(sym.start, sym.range);
    syms.pop_back();
  }
}

Once you have this small piece of scaffolding, you really can use rANS as a drop-in replacement for a conventional arithmetic coder. There are two problems with this, though: if you use this to encode an entire large file, your symbol buffer can get pretty huge, and you won’t get a single bit of output until you’ve processed the entire input stream.

The solution is simple (and can also be found in the aforementioned fpaqa): instead of accumulating all symbols emitted over the entire input stream and doing one big flush at the end, you just process the input data in chunks and flush the coder periodically, resetting the rANS state every time. That reduces compression slightly but means the encoder memory usage remains bounded, which is an important practical consideration. It also guarantees that output is not delayed until the end of stream; finally, lots of compressors are already chunk-based anyway. (For the opportunity to send incompressible data uncompressed rather than wasting time on the decoder end, among other things.)
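
A sketch of what that chunked driver can look like, using the BufferedRansEncoder from above (the flush_and_reset call stands in for whatever "write final state, re-initialize" sequence your actual RansEncoder provides; it's a placeholder, not a real interface):

void encode_chunked(const RansSymbol *symbols, size_t count,
                    BufferedRansEncoder &buffered, RansEncoder &coder)
{
  const size_t kChunkSize = 65536; // symbols per chunk; pick to taste

  for (size_t base = 0; base < count; base += kChunkSize) {
    size_t n = (count - base < kChunkSize) ? (count - base) : kChunkSize;

    // Buffer this chunk's symbols in forward order...
    for (size_t i = 0; i < n; ++i)
      buffered.encode(symbols[base + i].start, symbols[base + i].range);

    // ...then replay them in reverse into the actual encoder, emit the
    // final state for this chunk, and reset the rANS state for the next.
    buffered.flush_to(coder);
    coder.flush_and_reset(); // placeholder for the real flush/reset calls
  }
}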

Basic interleaving

One thing that rANS makes easy is interleaving the output from several encoders into a single bitstream without needing any extra signaling. I’m not going into detail why that works here; I wrote a paper on the subject if you’re interested in details. But the upshot is that you can use multiple rANS encoders and decoders simultaneously, writing to the same output bitstream, rather than having a single one.

Why do we care? Because this is what decoding a single symbol via rANS looks like (adapted from my public ryg_rans code):

static const uint32_t kProbBits = 16;
static const uint32_t kProbMask = (1 << kProbBits) - 1;

class RansDecoder {
  uint32_t state; // current rANS state
  // (IO stuff omitted)

  uint32_t renormalize_state(uint32_t x)
  {
    // Byte-wise for simplicity; can use other ways.
    while (x < RANS_L) 
      x = (x << 8) | read_byte();

    return x;
  }

public:
  uint32_t decode_symbol()
  {
    uint32_t x = state; // Current state value

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // Save updated state and return symbol
    state = x;
    return sym.id;
  }
};

Note how literally every single line depends on the results of the previous one. This translates to machine code that has a single, very long, dependency chain with relatively low potential for instruction-level parallelism (ILP). This is fairly typical for all entropy coder inner loops, by the way, not just rANS. And because superscalar processors depend on ILP to deliver high performance, this is bad news; we’re not making good use of the machine.

Hence interleaving. The idea is that we have two RansDecoder instances, each with their own state, but implicitly sharing the same bitstream read pointer (referenced by read_byte). Now, when we have code like this:

  RansDecoder dec0, dec1;
  // ...
  uint32_t sym0 = dec0.decode_symbol();
  uint32_t sym1 = dec1.decode_symbol();

the processor’s out-of-order execution logic can overlap execution of both decodes, effectively running them at the same time. The renormalize step for dec0 needs to happen before the renormalize of dec1, but other than that, there’s no dependencies between the two. For what it’s worth, this does not actually require out-of-order execution; a compiler for an in-order architecture can also work with this, provided it has enough dataflow information to know that dec0 calling read_byte() does not influence anything that dec1 does before its renormalize step. So what interleaving does is convert a very serial task into one that’s much more amenable to superscalar execution.

What it boils down to is this: a regular rANS decoder is a fast, precise, divide-less arithmetic decoder (which is nice to begin with). Interleave two streams using what is otherwise the exact same code, and you get a pretty good boost in performance; (very roughly) around 1.4× faster, on both the decoder and encoder. But this is old hat by now; this was all in the initial release of ryg_rans.

Some of the early experiments leading up to what later became BitKnit used this directly, pretty much the same as in the ryg_rans example code, but it turns out it was a bit of a pain in the neck to work with: because the two streams need to interleave properly, the BufferedRansEncoder needs to keep track of which symbol goes to which stream, and both the encoder and decoder code need to (somewhat arbitrarily) assign symbols to either stream 0 or stream 1. You’d prefer the streams to keep alternating along any given control-flow path, but that’s not always possible, since sometimes you have conditionals where there’s an even number of symbols sent on one path, and an odd number sent on the other! So having two explicit streams: not so great. But we found a better way.

Implicit interleaving to the rescue

What we actually ended up doing was interleaving with a twist – literally. We give the underlying rANS encoders (and decoders) two separate state values, and simply swap the two stream states after every encoding and decoding operation (that’s where the “BitKnit” name comes from – it keeps two active states on the same “needle” and alternates between them). The modifications from the decoder shown above are pretty small:

class RansDecoder {
  uint32_t state1; // state for "thread 1"
  uint32_t state2; // state for "thread 2"

  // ...

public:
  uint32_t decode_symbol()
  {
    uint32_t x = state1; // Pick up thread 1

    // ---- BEGIN of code that's identical to the above

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // ---- END of code that's identical to the above

    // Save updated state, switch the threads, and return symbol
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return sym.id;
  }
};

The changes to the encoder are analogous and just as simple. It turns out that this really is enough to get all the performance benefits of 2× interleaving, with none of the extra interface complexity. It just looks like a regular arithmetic decoder (or encoder). And assuming you write your implementation carefully, compilers are able to eliminate the one extra register-register move instruction we get from swapping the threads on most paths. It’s all win, basically.

Bypass coding

Borrowing a term from CABAC here; the “bypass coding mode” refers to a mode in the arithmetic coder that just sends raw bits, which you use for data that’s known a priori to be essentially random/incompressible, or at least not worth modeling further. With conventional arithmetic coders, you really need special support for this, since interleaving an arithmetic code stream with a raw bitstream is not trivial.

With rANS, that’s much less of a problem: you can just use a separate bitbuffer and mix it into the target bitstream with no trouble. However, you may not want to: rANS has essentially all of the machinery you need to act as a bit buffer. Can you do it?

Well, first off, you can just use the arithmetic coder with a uniform distribution to send a set number of bits (up to the probability resolution). This works with any arithmetic coder, rANS included, and is fairly trivial:

  // write value "bits" using "numbits"
  coder.encode(bits << (kProbBits - numbits), 
                  1 << (kProbBits - numbits));

and the equivalent on the decoder side. However, this is not particularly fast. Fortunately, it’s actually really easy to throw raw bit IO into a rANS coder: we just add the bits at the bottom of our state variable (or remove them from there in the decoder). That’s it! The only thing we need to do is work out the renormalization condition in the encoder. Using the conventions from the bytewise ryg_rans, an example implementation of the encoder is:

static inline void RansEncPutBits(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
{
  assert(nbits <= 16);
  assert(val < (1u << nbits));

  // nbits <= 16!
  RansState x = RansEncRenorm(*r, pptr, 1 << (16 - nbits), 16);

  // x = C(s,x)
  *r = (x << nbits) | val;
}

and the corresponding getbits in our ongoing example decoder looks like this:

class RansDecoder {
  // ...

  uint32_t get_bits(uint32_t nbits)
  {
    uint32_t x = state1; // Pick up thread 1

    // Get value from low bits then shift them out and
    // renormalize
    uint32_t val = x & ((1u << nbits) - 1);
    x = renormalize_state(x >> nbits);

    // Save updated state, switch the threads, and return value
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return val;
  }
};

Note that except for the funky state swap (which we carry through for consistency), this is essentially just a regular bit IO implementation. So our dual-state rANS admits a “bypass mode” that is quite cheap; usually cheaper than having a separate bit buffer would be (which would occupy yet another CPU register in the decoder), at least in my tests.

Note that if you combine this with the buffering encoder described above, you need a way to flag whether you want to emit a regular symbol or a burst of raw bits, so our RansSymbol structure (and the code doing the actual encoding) gets slightly more complicated since we now have two separate types of “opcodes”.

The implementation above has a limit of 16 bits you can write in a single call to RansEncPutBits. How many bits you can send at once depends on the details of your renormalization logic, and how many bits of rANS state you keep. If you need to send more than 16, you need to split it into multiple operations.
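
For example, a simple wrapper that splits wider values into 16-bit pieces might look like this (a sketch, not part of ryg_rans; remember that rANS is LIFO, so the decoder must read the two halves in the opposite order they were put):

static inline void RansEncPutBitsWide(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
{
  assert(nbits <= 32);

  if (nbits <= 16) {
    RansEncPutBits(r, pptr, val, nbits);
  } else {
    // Put the low half first; since decoding runs in reverse order of
    // encoding, the decoder reads the high half first, then the low half.
    RansEncPutBits(r, pptr, val & 0xffff, 16);
    RansEncPutBits(r, pptr, val >> 16, nbits - 16);
  }
}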

Tying the knot

I got one more: a rANS encoder needs to write its final state to the bitstream, so that the decoder knows where to start. You can just send this state raw; it works just fine. That’s what the ryg_rans example code does.

However, rANS states are not equally likely. In fact, state x occurs with a probability proportional to 1/x. That means that an ideal code should spend approximately \log_2(x) bits to encode a final state of x. Charles has already written about this. Fortunately, the ideal coder for this distribution is easy: we simply send the index of the highest set bit in the state (using a uniform code), followed by the remaining bits.

One option is to do this using regular bit IO. But now you still need a separate bit IO implementation!

Fortunately, we just covered how to send raw bits through a rANS encoder. So one thing we can do is encode the final state value of stream 2 using the “stream 1” rANS as the output bit buffer, using the putbits functionality just described (albeit without the thread-switching this time). Then we send the final state of the “stream 1” rANS raw (or using a byte-aligned encoding).

This approach is interesting because it takes a pair of two rANS encoder threads and “ties them together” – making a knot, so to speak. In the decoder, undoing the knot is serial (and uses a single rANS decoder), but immediately after initialization, you have a working dual-stream coder. This saves a few bytes compared to the sloppier flushing and is just plain cute.

This technique really comes into its own for the wide-interleave SIMD rANS coders described in my paper, because it can be done in parallel on lots of simultaneous rANS coders in a reduction tree: group lanes into pairs, have each odd-indexed lane flush into its even-indexed neighbor. Now look at groups of 4 lanes; two have already been flushed, and we can flush the rightmost “live” lane into the leftmost lane coder. And so forth. This allows flushing a N× interleaved SIMD rANS coder in O(\log(N)) coding operations, and still has some parallelism while doing so. This is not very exciting for a 2× or 4× interleaved coder, but for GPU applications N is typically on the order of 32 or 64, and at that level it’s definitely interesting.

Conclusion and final notes

Using the techniques described in this post, you can write rANS encoders and decoders that have about the same amount of code as a conventional arithmetic coder with comparable feature set, have a similar interface (aside from the requirement to flush the encoder regularly), are significantly faster to decode (due to the combination of the already-fast rANS decoder with implicit interleaving), and have very cheap “bypass coding” modes.

This is a really sweet package, and we’re quite happy with it. Anyone interested in (de)compression using adaptive models should at least take a look. (For static models, FSE/tANS are stronger contenders.)

What example code there is in this article uses byte-wise renormalization. That’s probably the simplest way, but not the fastest. Oodle LZNA uses a 63-bit rANS state with 32-bits-at-a-time renormalization, just like rans64.h in ryg_rans. That’s a good choice if you’re primarily targeting 64-bit platforms and can afford a 64-bit divide in the encoder (which is quite a bit more expensive than a 32-bit divide on common CPUs). BitKnit uses a 32-bit rANS state with 16-bits-at-a-time renormalization, similar to the coder in ryg_rans rans_word_sse41.h. This is friendlier to 32-bit targets and admits a branch-free renormalize option, but also means the coder has somewhat lower precision. Using a probability range of 16 bits would not be wise in this case; BitKnit uses 14 bits.

Reading and writing are less symmetric than you (probably) think

I am talking about the I/O operations as used in computing here. A typical example of how this kind of thing is exposed are the POSIX syscalls read(2) and write(2), which have the following C function prototypes:

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count); 

Now these are raw system calls; user programs can use them directly, but they usually don’t. They normally go through some buffered IO layer; in the C standard library, this means FILE* and functions fread and fwrite, which split count into a product of two values in a vestigial nod to record-based IO but are otherwise equivalent. For concreteness, suppose we’re interfacing with actual storage (i.e. not a pipe, socket, virtual filesystem etc.). Then conceptually, a “read”-class operation (like read or fread) grabs bytes from a file say on a disk somewhere and puts them into the specific memory buffer, and a “write”-class operation takes bytes in a memory buffer and writes them to the disk. Which definitely sounds nice and symmetric—but there’s some important behavioral asymmetries between them, especially when errors are in the mix. The reasons have to do with buffering.

Buffered I/O

In general, file I/O operations in your program will not go directly to a storage device; data instead makes its way through several buffering layers (most of which can be disabled using various flags, but in normal usage these layers are on). These layers are there for good reason: on the kernel side, there’s what’s traditionally called the “buffer cache”. Storage devices are “block devices”, which means they store data in blocks. The block size depends on the device; on old hard disks it used to be 512 bytes, CDs, DVDs etc. tend to use 2k blocks, newer storage devices are now on 4k blocks. Block devices only read entire blocks at a time; that means random byte-aligned IO requests such as “read 100 bytes from disk at byte offset 1234567” or “write 2000 bytes to location 987654” can’t be directly passed to the device at all. The buffer cache is used to translate these requests into block-aligned read and write operations that the device understands; non-block-aligned writes also require reading the previous contents of the block that are not overwritten, and those go in the buffer cache as well. And of course, as the name suggests, it acts as a cache.

On the user-space side, we also have buffers, albeit for a different reason: read and write are system calls, and as such incur a transition to kernel space and back. They also need to check for and report errors every time they are invoked. And of course they actually need to do the work we want them to do – copy the data from (read) or to (write) the buffer cache. System call overhead varies between OSes, but it’s safe to assume that the whole process takes at least a couple hundred clock cycles in the best case. So for the overhead not to completely dominate the actual work being done, you generally want to be reading or writing at least a few kilobytes at a time. For scale reference, typical IO buffer sizes as of this writing are 4096 bytes (e.g. Visual C++ 2013 FILE*, Go bufio.Reader/bufio.Writer) or 8192 bytes (e.g. GNU libc FILE*, Java BufferedReader/BufferedWriter).

Often there are more buffers too. For example, most hard drives and RAID controllers have their own caches, and it is not uncommon for user-space code to have several layers of buffering for various reasons. But this is enough to illustrate the basic structure.

All of these buffers are used in much the same way for reading and writing. So where’s the behavioral asymmetry between reading and writing that I’m talking about? You need to think about the state of the world (so to speak) after you call a read-type call and how it differs from the state of the world after a write-type call.

What happens when you issue an IO operation

Let’s look at what goes into servicing a read-type call first: say you open a C FILE* and want to read the first 100 bytes via fread. The C standard I/O library notices that its buffers are currently empty, and tries to fill them up, issuing a read system call to read say 4k worth of data. The kernel in turn asks the file system where the data for the first 4k of the file is located, checks the buffer cache to see if it already has a copy in memory, and if not, it issues a block read command to the storage device. Either way, the kernel makes sure to get those 4k of data into the buffer cache and from there copies them into the standard IO buffers in user-space memory, then returns. The standard IO library looks at the result of the system call, updates the state of its IO buffers, and then copies the 100 requested bytes into the memory buffer the app supplied.

And what if anything goes wrong? Say the file is smaller than 100 bytes, or there was an error reading from disk, or the file is on a network file system that’s currently inaccessible. Well, if that happens, we catch it too: if something goes wrong filling up the buffer cache, the kernel notices and returns an appropriate error to the I/O library, which can in turn pass errors on to the app. Either way, anything that can go wrong will go wrong before the fread call returns. All the intervening layers need to do is make sure to keep the error information around so it can be passed to the app at the appropriate time.

Now let’s go the other way round: let’s open a fresh new file with a 4k write buffer[1] and issue a 100-byte fwrite. This time, the IO library copies the 100 bytes from the app buffer to the write buffer… and immediately returns, reporting success. The underlying write system call will not be executed until either the buffer fills up or is flushed as a result of calling fflush, fseek, fclose or similar.

Quick imaginary show of hands: who among those reading this habitually checks return codes of fread or fwrite at all? Of those saying “yes”, who also remembers to check return codes of fflush, fseek or fclose? Probably not a lot. Well, if you don’t, you’re not actually checking whether your writes succeeded at all. And while these remarks are C-specific, this general pattern holds for all buffered writer implementations. Buffered writing delays making the actual write system call; that’s kind of the point. But it implies that error reporting is delayed too!

More buffers

This type of problem is not restricted to user-space buffering either. The implementation of write itself has similar issues: generally, after a successful write call, your data made it to the buffer cache, but it hasn’t hit actual storage yet. The kernel will make its best effort to write that data to storage eventually (hopefully within the next few seconds), but if there’s a device error or a system crash, that data could still be lost. Both of these are relatively rare these days, so we don’t worry about them too much, right? Except for those of us who do.

Oh, and while write will go to some lengths to make sure there are no nasty surprises when writing to local filesystems (for example, even with delayed write-back, you want to make sure to reserve free space on the disk early[2], lest you run out during write-back), at least on POSIX systems there can still be write errors that you only get notified about on close, especially when network filesystems such as NFS or SMB/CIFS are in play (I’m not aware of any such late-reported error conditions on Windows, but that doesn’t mean there aren’t any). Something to be aware of: if you’re using these system calls and are not checking the return code of close, you might be missing errors.

Which brings up another point: even on local file systems, you only have the guarantee that the data made it to the buffer cache. It hasn’t necessarily made it to the storage device yet! If you want that (for example, you’ve just finished writing some important data and want to make sure it actually made it all the way), you need to call fsync[3] on the file descriptor before you close it. The Windows equivalent is FlushFileBuffers.
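
To make the earlier point about error checking concrete, here's roughly what a careful "write it and make sure it's on storage" path looks like in terms of the calls discussed above (a sketch; error handling boiled down to a single bool):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h> // fsync (POSIX)

bool write_file_carefully(const char *path, const void *data, size_t size)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return false;

    bool ok = fwrite(data, 1, size, f) == size; // data now in stdio's buffer
    ok = ok && fflush(f) == 0;                  // now in the kernel's buffer cache
    ok = ok && fsync(fileno(f)) == 0;           // now (hopefully) on the device
    ok = (fclose(f) == 0) && ok;                // always close, but keep earlier errors
    return ok;
}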

So, if you make sure to check error codes on every write, and you fsync before you close (again checking errors), that means that once you’ve done all that, you’re safe and the data has successfully made it to permanent storage, right?

Well, two final wrinkles. First, RAID controllers and storage devices themselves have caches too. They’re supposed to have enough capacitors so that if the system suddenly loses power, they still have sufficient power to actually get your data written safely. Hopefully that’s actually true. Good luck. Second, the data may have made it to storage, but that doesn’t necessarily mean it’s actually visible, because the metadata necessary to reach it might not have been written yet. Quoting the Linux man page on fsync(2):

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

For better or for worse, I can’t recall ever seeing code doing this in the wild, though. I’m honestly not sure what the actual guarantees are that popular Linux file systems provide about these things. If you’re handling really really important data, you should probably find out.
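
For completeness, the directory fsync the man page alludes to looks something like this on POSIX (again a sketch; whether you actually need it depends on the guarantees your file system makes):

#include <fcntl.h>
#include <unistd.h>

// After fsync()ing a freshly created file, also sync the containing
// directory so the directory entry itself reaches storage.
int sync_directory(const char *dir_path)
{
    int dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    int ret = fsync(dirfd);
    close(dirfd);
    return ret;
}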

Conclusion and summary

Buffering on the read side is great and pretty much transparent because if anything goes wrong, it will go wrong before you ever get to see the data, and you’ll get a proper error code.

Buffering on the write side is much trickier because it delays actual writing and error reporting in ways that most programmers are supposed to be aware of, but usually aren’t. Few are aware of the actual steps necessary to ensure that data made it to storage safely, and some of the OS abstractions involved don’t exactly make things easier (see the fsync quote above). Here be dragons.

Footnotes

[1] Full buffering not line buffering mode, in case anyone’s feeling nit-picky.
[2] Actual block allocation—as in, selecting which physical location on the device file writes will end up in—is often delayed in modern file systems, to make it easier to assign mostly-contiguous space to large files where possible. However, even with delayed allocation, you want to keep track of how much space is going to be available on the device once all currently pending writes complete, so that you can return “out of disk space” errors appropriately instead of discovering that you’re out of space 10 seconds after the user exited the app he was using to edit and save his Important Document. Because that would be bad. This sounds as though it’s just a matter of accounting, but it gets tricky with file systems that use extents and not bitmap-based block allocation: getting the last few discontinuous blocks on the device means that you might need extra space to store the file extents! All of which is to say: this stuff is tricky to get right.
[3] Yes, the name looks like it’s part of the C library buffered IO package, but it’s a proper syscall.

Intervals in modular arithmetic

I wrote about regular interval overlap checking before. Let’s consider a somewhat trickier case, where our intervals are not defined over the real numbers or regular integers, but are instead subsets of the integers modulo N (for some fixed N). In this post, I’ll just consider the math; but this is something I’ve used in several places, so I expect I’ll be writing about some concrete uses eventually. Throughout this document, I’ll be writing \mathbb{Z}_N for the set \mathbb{Z} / N\mathbb{Z} of integers mod N (feel free to ignore the notation if you’re not familiar with it).

Intervals mod N, first attempt

The first thing we need to do is agree on what we mean by an interval mod N. Suppose that N=16. In that setting, the meaning of the interval [5,7] is pretty clear: the set {5, 6, 7}. And we can also take something like the integer interval [14,17] representing the set {14, 15, 16, 17} and reduce it mod N to get {14, 15, 0, 1}. But doing this required us to use values outside of {0, …, N-1} to specify the end points. This is often a useful model – I talk about some advantages when doing this for ring buffers in this old post – but it’s a bit of a cheat; we’d like a way to specify intervals mod N that don’t require us to leave the set of integers mod N to handle the “wraparound” case.

This means that we want to define something like the interval [14,1]. For the regular integers, this would be the empty set; but mod N, it makes more sense to properly wrap around (for example, when you say “the trains don’t run from 11:30pm to 5:30am”, it’s understood that you mean 5:30am the next day, not an empty interval). One way to do this is by handling the wrap-around case separately. Assuming a, b are reduced mod N, we could for example define:

[a,b] \mod N := \begin{cases}  \{ a, a + 1, \hdots, b \} & \mbox{if } a \le b \\  \{ a, a + 1, \hdots, N - 1 \} \cup \{ 0, 1, \hdots, b \} & \mbox{if } a > b  \end{cases}

This works, but it’s messy, and chopping up our single connected interval mod N into two pieces for the wrap-around case feels like it’s missing something fundamental about arithmetic mod N. We’re thinking in terms of a number line:

Interval [14,1] on the number line

But the integers mod N aren’t very “line-like” at all, and they’re commonly (and more appropriately) drawn as a circle (like a clock, with 0 at the top and numbers increasing clockwise, to maximize familiarity—apologies to all mathematicians who expect positive angles to move counter-clockwise).

Interval [14,1] on a circle

And in this visualization, there is absolutely nothing special about our interval [14,1]. We happen to pass through 0, but 0 isn’t actually special. It’s just another point on the circle. The reason 0 becomes special when we think in terms of regular integers (and hence a number line) is that it’s usually the place where we decide to cut open the circle so we can flatten it into a line. But that has nothing to do with the circle (or the integers mod N); it’s an incidental artifact of the representation we’re picking.

A different approach

The key problem here is that intervals are normally defined in terms of ordering relationships. For example, real-number intervals are usually defined as sets

[a,b] = \{ x \in \mathbb{R} | a \le x \le b \}.

based on a total order “≤”. But in our circle, we don’t have a useful ordering relation. When you write “4 < 7”, this means that 4 is to the left of 7 on the real number line. If you start at 4 and keep walking right, you’re gonna come by 7 eventually. If you instead were to walk left, the numbers would just keep getting smaller indefinitely.

On our circle mod N, this is not true. If you start at 4 and keep walking in the positive direction (clockwise in our case), you’ll reach 7 after three steps. If you instead walk counterclockwise from 4, you still reach 7 – this time after taking the long way round, in thirteen steps. This is true for any pair of numbers on the circle, which after all represent congruence classes of integers mod N (if you don’t know the terminology, just ignore the rest of this paragraph). It makes sense to say that the integer 4 is less than the integer 7, but the set \bar{4} = 4 + N\mathbb{Z} = \{ 4 + kN | k \in \mathbb{Z} \} is not in any useful sense “less” or “greater” than the set \bar{7} = 7 + N\mathbb{Z}. Viewed in that light, the definition for [a,b] mod N given above is outright weird; the case distinction between “a ≤ b” and “a > b” is testing a condition that doesn’t really make sense in terms of the concepts being expressed.

So ordering relationships are on shaky footing. We can, however, usefully talk about distances. For example, “4” and “7” are definitely three steps apart when we take the shortest path, also three steps apart when we’re only allowed to move clockwise, and thirteen steps apart when we’re only allowed to move counterclockwise. In general, we can define distance functions for the “increasing” (clockwise), “decreasing” (counter-clockwise) and shortest-path distances between two points (we won’t actually be using that latter one, I just mention it for completeness):

d^{\,+}(a,b) := (b-a) \bmod N
d^{\,-}(a,b) := (a-b) \bmod N
d(a, b) := \min(d^{\,+}(a,b), d^{\,-}(a,b))

These distance functions are, technically speaking, mappings d : \mathbb{Z}_N \times \mathbb{Z}_N \rightarrow \mathbb{N}_0 from the integers mod N to the natural numbers (hence the explicit use of mod as a binary operation). The distances are regular non-negative integers; while points in the integers mod N can’t be meaningfully compared, distances can. That’s why they’re useful to us. Also note that the “mod N” here is the modulus under Euclidean/floor division – that is, a non-negative value smaller than N.

Most programming languages use truncating division instead, which means the modulus has absolute value less than N but might be negative; you need to consider this when turning any of the equations in here into code!
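
In code, one way to turn C's truncating % into the Euclidean version used below is the following (a small sketch; N is assumed to be a positive constant defined elsewhere):

// Euclidean (always non-negative) remainder mod N, for possibly negative x.
static int modN(int x)
{
    int m = x % N;              // truncating remainder: may be negative
    return (m < 0) ? m + N : m; // shift into [0, N)
}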

Given all this, let’s go back to our problem of defining intervals nicely. Well, how do we draw something like the interval [14,1] on our circle? We just move the pen to “14” and start drawing a clockwise arc until we hit the “1”. We can use that exact idea to define intervals: namely, a point x is inside the interval [a,b] if, starting from a and walking in increasing order (clockwise), we hit x before we leave the interval. That leads to our improved definition of a closed interval mod N:

[a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,+}(a,x) \le d^{\,+}(a,b) \}
or the equivalent
[a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,-}(b,x) \le d^{\,-}(b,a) \}

and the generalizations to half-open intervals are straightforward:

[a,b) \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,+}(a,x) < d^{\,+}(a,b) \}
(a,b] \pmod N := \{ x \in \mathbb{Z}_N | \;d^{\,-}(b,x) < d^{\,-}(b,a) \}

Things are simpler if we always start measuring from the closed (inclusive) end, so that’s what we do. I’ll drop the (mod N) for the rest of the article; we know that’s our setting.

Point-in-interval tests and symmetry

This definition can be turned into code immediately and leads to fairly elegant point-in-interval tests that don’t break down into multiple cases:

// modN(x) is assumed to calculate Euclidean (=non-negative) x % N.

// x in [a,b] (mod N)?
static bool point_in_closed_interval(int x, int a, int b)
{
    return modN(x - a) <= modN(b - a);
}

// x in [a,b) (mod N)?
static bool point_in_half_open_interval(int x, int a, int b)
{
    return modN(x - a) < modN(b - a);
}

At this point, you should also be noticing something funny about that code (it’s a bit harder to see in the definitions above since the detour through the distance functions obscures the actual computation going on); namely, the fact that we’re subtracting a and then reducing mod N on both sides.

It’s confession time. This whole notion with measuring distances along the circle is not how I derived this; it’s the best way I know to think about this problem, but that knowledge came in retrospect. I got here the long way round, with several false starts. The key idea turned out to be thinking about symmetries, which is always worth doing if the problem you’re working on has any; see for example this post I wrote 6 years ago!

In this case, the integers mod N are a cyclic group—they wrap around. That’s why it makes sense to draw them as a circle. And that’s why being attached to any particular point being 0 is a bit silly: a circle has continuous rotational symmetry, and our discrete cyclic group has N-fold rotational symmetry. We get the same image if we rotate by an N’th of a full turn. And going back to our setting, since we really only care about distances, we can cyclically rotate our points around any way we want without changing the results.

What these tests really do is exploit this symmetry, “translating” (or more appropriately, rotating) everything by -a. This turns testing x against the interval [a,b] (or the half-open variants) into testing (x-a) \bmod N against [0,(b-a) \bmod N). Once we move the start point of the interval to 0, we don’t have to worry about wrap-around happening in inconvenient places anymore; comparing the integers directly works fine. Score one for symmetry.

Interval overlap

This takes care of points. Now for something trickier: how do we test for interval overlap?

The standard tests for interval overlap are slick, but not really applicable in our situation: the center-extent trick actually generalizes just fine to the integers mod N and to much more general settings (it works in arbitrary metric spaces, provided the sets in question can be expressed as balls in the target metric), but it is not ideal in a discrete setting; and the direct tests rely heavily on an order structure we don’t have in our cyclic world.

But what we do have are reasonably simple point-in-interval tests. Can we build an interval overlap test out of them? Well, we can try. Suppose we have two intervals [a,b] and [c,d]. Surely, if the “left” endpoint c of [c,d] falls inside [a,b], the two intervals overlap – after all, we know a point that is in both of them! And by symmetry, swapping the roles of the two intervals, if a falls inside [c,d], we again must have overlap.

If you start drawing a few pictures of intervals in various configurations, you’ll soon notice that testing both of these conditions seems to detect all configurations where the intervals actually overlap (and this works with intervals in the real numbers as well as in our discrete setting). The question is, can we prove that testing these two points is sufficient, or are we merely lucky? It turns out we didn’t just luck out; here’s a quick proof:

Lemma: ([a,b] \cap [c,d]) \ne \emptyset \Leftrightarrow (c \in [a,b]) or (a \in [c,d]). In words, the two (non-empty) intervals overlap if and only if at least one of a or c is inside the respective other interval.
Proof: “\Leftarrow”: c \in [c,d], so if we also have c \in [a,b], then that gives us a point in the intersection of [a,b] and [c,d], which therefore can’t be empty. Likewise with a. Therefore, if either of the two conditions on the right-hand side holds, we indeed have a non-empty intersection.
“\Rightarrow”: The intersection isn’t empty, so take x \in ([a,b] \cap [c,d]). x is in both intervals. Informally, we now “slide” x to the “left” (in negative direction) through the intersection until we hit either of the interval end points a or c. Formally, consider the distances from the interval start points to x, d_a := d^{\,+}(a,x) and d_c := d^{\,+}(c,x). Suppose that d_a ≤ d_c. Then we have

0 \le d^{\,+}(c,a) = a - c = a - x + x - c
= (x - c) - (x - a) = d_c - d_a \le d_c \le d^{\,+}(c,d)

In words, a’s distance from c, in positive direction, is no more than x’s; since x already was inside of [c,d], surely a must be too. If instead d_a > d_c, we learn that c is inside [a,b]. In either case, this proves the claim.

This takes care of closed intervals. Note that this proof leans heavily on the intervals in question being non-empty. We can readily adapt it to half-open intervals of the form [a,b), but we do need to make sure to catch either interval being empty first, in which case the intersection of intervals is necessarily empty too. Likewise, you can also easily adapt it to half-open intervals of type (a,b], but in this case you want to be using the “right” end points of intervals for testing, not the left end points.

This may all sound complicated, but the implementation is actually quite short and simple:

// do [a,b] and [c,d] overlap?
static bool closed_intervals_overlap(int a, int b, int c, int d)
{
    return modN(c - a) <= modN(b - a) || modN(a - c) <= modN(d - c);
}

// do [a,b) and [c,d) overlap?
static bool half_open_intervals_overlap(int a, int b, int c, int d)
{
    int w0 = modN(b - a);
    int w1 = modN(d - c);

    return (w1 != 0 && modN(c - a) < w0) ||
           (w0 != 0 && modN(a - c) < w1);
}

And there we go. Interval overlap tests mod N.

Implementation notes and variations

The most common case in systems programming involves power-of-2 N. In that scenario, modN is readily implemented via bit masking as a single binary AND operation. If N is 2^32 or 2^64, 32- or 64-bit unsigned integers can (and should) be used directly, in which case there is no need for explicit masking in the code (although on some 64-bit architectures, unsigned 32-bit arithmetic compiles into 64-bit arithmetic with masking anyway). This is one of the rare cases where unsigned integer overflow works exactly the way we need. In this case, you want to be using unsigned integers throughout.
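
Here is a hedged sketch of the power-of-2 case (mine, not from the article): modN becomes a single mask, and for N = 2^32 plain uint32_t arithmetic already wraps the way we want, so the tests need no explicit reduction at all.

#include <stdbool.h>
#include <stdint.h>

#define N 4096  // any power of 2

// On two's-complement targets, masking the low bits gives the
// Euclidean (non-negative) remainder directly, even for negative x.
static int modN(int x)
{
    return x & (N - 1);
}

// For N = 2^32, the reduction is implicit in uint32_t arithmetic, so
// e.g. the half-open point-in-interval test needs no masking at all:
static bool point_in_half_open_u32(uint32_t x, uint32_t a, uint32_t b)
{
    return (x - a) < (b - a);
}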

As presented, we’re working with intervals given in terms of two end points, because that’s the most common presentation. But computationally, all of the functions in the code shown actually use a single endpoint (on the “closed” end) along with the width of the interval. That’s the modN(b - a) and modN(d - c) terms we keep computing. So if you’re working a lot with intervals mod N, or storing them, you probably want to consider working in that representation directly.
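
For illustration, a half-open interval in that start-plus-width representation might look like this (a sketch of mine, reusing the modN helper and the tests from above; the names are made up):

#include <stdbool.h>

// Half-open interval [start, start+width) mod N, with the width
// already reduced so that 0 <= width < N.
typedef struct { int start; int width; } IntervalModN;

static bool interval_contains(IntervalModN iv, int x)
{
    return modN(x - iv.start) < iv.width;
}

static bool intervals_overlap(IntervalModN p, IntervalModN q)
{
    return (q.width != 0 && modN(q.start - p.start) < p.width) ||
           (p.width != 0 && modN(p.start - q.start) < q.width);
}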

The intervals in this article are intentionally defined with their endpoints coming from \mathbb{Z}_N, to force us to think about the effects of the cyclical wrap-around in the integers mod N cleanly. Now that we’ve spent some time thinking it through, we can relax that requirement. In particular, when using ring buffers with what I call the “virtual stream” model (read/write cursors not reduced mod N), it can make sense to just not reduce the interval lengths mod N at all—that is, turn occurrences of modN(b - a) and modN(d - c) in the code into plain b - a and d - c, respectively, or reduce with respect to a modulus that’s a larger multiple of N. Among other things, this allows us to have half-open intervals covering the entirety of \mathbb{Z}_N, something the fully reduced variant cannot easily do.

And as a final closing remark, this article comes in at 2500 words to explain less than 20 lines of code doing nothing but straightforward integer arithmetic. That has to be some sort of personal record. I’m not sure if that’s good or bad.

The Modulith

Much has been written about all the myriad ways to go wrong when writing software. Poor management; scope creep; too little structure, not modular enough, and it’s a “big ball of mud”. Too much (or too rigid) and it’s a “software crystal”, impossible to alter. And so on.

Suppose you get all that right and actually ship a useful system to users: it solves their problems well enough, and the code is reasonably clean, with a sound design and a modular structure whose interfaces, while not perfect, work okay. That’s about as good as it gets! Well done.

Alas, you’re not out of the woods. Success has its own failure modes, and I want to talk about one in particular that affects modular designs.

The arguments for modularity are well known: separating concerns breaks large systems down into smaller constituent parts that can be understood individually, with clearly-defined interfaces between them. Ideally, modules are designed so they can be developed and tested in isolation, and if an individual module is found wanting (say it’s unreliable, faulty or there are simply better solutions available), it can be replaced with another module provided it has the same interface.

And there really are systems like that, where the interfaces are rigid and well-specified, components come only in a handful of “shapes”, and everything cleanly fits together, like Lego bricks. But more commonly, shipping systems look like this (prepare for an extended metaphor):

"Dry stone wall, Island of Mull". Photo by Jan Smith, CC-BY 2.0

The modules have irregular shapes and irregular sizes. Some are big, some are quite small. Some closely align with their neighbors; others have big gaps between them. They add up to a coherent whole, but it’s clear that for most of the development time, none of these components really had to have any particular shape. Occasionally you need a small piece with a specific shape to fill a gap, but for the most part, you just work with the materials you have.

The result is still “modular”; it’s built out of smaller pieces, each with their own clearly defined boundaries. But it’s not very regular, and outright weird in some places. That chipped corner on one of the bottom pieces was just an early mistake, but it made for a good place to stick that one flat rock on and somehow that ended up being one of the primary supports for the whole thing. And while building that wall, “I need a rock, about this big” was the only constraint you really had, and you just sort of piled it on. But when repairing it after one of the pieces has been damaged, working out the right shape, finding a replacement that meets that description and getting it in place is really tricky, fiddly work. (End of extended metaphor.)

Know any systems like that? I certainly do. And the end result is what I hereby dub a “modulith” (I am sure this has been observed and named before, but I haven’t seen it elsewhere yet). Made out of small, distinct, cleanly separable pieces, but still, everything but the topmost layer is actually kind of hard to disentangle from the rest, due to a myriad of small interactions with everything surrounding it. Because once you use a module as a building block for something else, there’s a disturbing tendency for all of its remaining quirks and bugs to effectively become part of the spec, as other modules (implicitly or explicitly) start to rely on them.

This is related to, but distinct from, other concepts such as software entropy and technical debt, which primarily deal with effects within a single codebase over time. Here we are dealing with something slightly different: as a particular component is successfully used or re-used (in unmodified form!), the users of said code tend to end up relying (often inadvertently) on various unspecified or underspecified behaviors, implicitly assuming a stronger contract than the component is actually supposed to provide. At that point, your choices are to either make those assumed behaviors actually contractual (not breaking existing code at the cost of severely constraining future evolution of said component), or to fix all users that make stronger assumptions than what is guaranteed (easier said than done if the component in question is popular; often causes ripple effects that break yet more code).

Either way, I don’t have any good solutions, but I’m feeling whimsical and haven’t seen this exact problem described before, so I’m naming it. In the extremely likely case that this has already been described and named by someone else, I’d appreciate a reference!

Models for adaptive arithmetic coding

My colleague Charles Bloom recently announced Oodle LZNA (a new compressor for RAD’s Oodle compression library), which usually beats LZMA on compression while being much faster to decode. The major innovation in LZNA is switching the coding back-end from bit-wise adaptive modeling to models with larger alphabets. With traditional arithmetic coders, you take a pretty big speed hit going from binary to larger alphabets, and you end up leaning heavily on integer divides (one per symbol usually), which is the neglected bastard stepchild of every integer instruction set. But as of last year, we have the ANS family of coders, which alters the cost landscape considerably. Most interesting for use with adaptive models is rANS, which is significantly cheaper than conventional large-alphabet arithmetic (and without divisions in the decoder), but can still code directly from a cumulative probability distribution, just like other arithmetic coders. In short, rANS speeds up non-binary alphabets enough that it makes sense to revisit adaptive modeling for larger alphabets in practical coder design, something that’s been pretty much on hold since the late 90s.

In this post, I will try to explain the high-level design behind the adaptive models that underlie LZNA, and why they are interesting. But let’s start with the models they are derived from.

Adaptive binary models

The canonical adaptive binary model simply keeps count of how many 0s and 1s have been seen:

int counts[2] = { 1, 1 }; // or different init
double prob_for_1 = 0.5;  // used for coding/decoding

void adapt(int bit)
{
    counts[bit]++;
    prob_for_1 = (double)counts[1] / (counts[0] + counts[1]);
}

This is the obvious construction. The problem with this is that probabilities move quickly after initialization, but very soon start to calcify; after a few thousand symbols, any individual observation barely changes the probabilities at all. This sucks for heterogeneous data, or just data where the statistics change over time: we spend a long time coding with a model that’s a poor fit, trying to “unlearn” the previous stats.

A better approach for data with statistics that change over time is to gradually “forget” old stats. The resulting models are much quicker to respond to changed input characteristics, which is usually a decent win in practice. The canonical “leaky” adaptive binary model is, essentially, an exponential moving average:

double prob_for_1 = 0.5;
const double f = 1.0 - 1.0 / 32.0;   // adaptation rate, usually 1/pow2

void adapt(int bit)
{
    if (bit == 0)
        prob_for_1 *= f;
    else
        prob_for_1 = prob_for_1 * f + (1.0 - f);
}

This type of model goes back to Howard and Vitter in “Practical implementations of arithmetic coding” (section 3.4). It is absolutely everywhere, albeit usually implemented in fixed point arithmetic and using shifts instead of multiplications. That means you will normally see the logic above implemented like this:

enum { scale_factor = 4096 }; // .12 fixed point
int prob_for_1_scaled = scale_factor / 2;

void adapt(int bit)
{
    if (bit == 0)
        prob_for_1_scaled -= prob_for_1_scaled >> 5;
    else
        prob_for_1_scaled += (scale_factor - prob_for_1_scaled) >> 5;
}

This looks like it’s a straightforward fixed-point version of the code above, but there’s a subtle but important difference: the top version, when evaluated in real-number arithmetic, can have prob_for_1 get arbitrarily close to 0 or 1. The bottom version cannot; when prob_for_1_scaled ≤ 31, it cannot shrink further, and likewise, it cannot grow past scale_factor – 31. So the fixed-point version (with a scale factor of 4096 and using a right-shift of 5) will keep scaled probabilities in the range [31,4065], corresponding to about [0.0076, 0.9924].

Note that with a shift of 5, we always stay 31 steps away from the top and bottom end of the interval – independent of what our scale factor is. That means that, somewhat counter-intuitively, both the scale factor and the adaptation rate determine what the minimum and maximum representable probabilities are. In compression terms, the clamped minimum and maximum probabilities are equivalent to mixing a uniform model into a “real” (infinite-precision) Howard/Vitter-style adaptive model, with low weight. Accounting for this fact is important when trying to analyze (and understand) the behavior of the fixed-point models.

A symmetric formulation

Implementations of binary adaptive models usually only track one probability; since there are two symbols and the probabilities must sum to 1, this is sufficient. But since we’re interested in models for larger than binary alphabets, let’s aim for a more symmetric formulation in the hopes that it will generalize nicely. To do this, we take the probabilities p0 and p1 for 0 and 1, respectively, and stack them into a column vector p:

\mathbf{p} = \begin{pmatrix} p_0 \\ p_1 \end{pmatrix}

Now, let’s look at what the Howard/Vitter update rule produces for the updated vector p‘ when we see a 0 bit (this is just plugging in the update formula from the first program above and using p1 = 1 – p0):

\mathbf{p}' = \begin{pmatrix} 1 - f p_1 \\ f p_1 \end{pmatrix}  = \begin{pmatrix} 1 - f (1 - p_0) \\ f p_1 \end{pmatrix}  = f \mathbf{p} + \begin{pmatrix} 1 - f \\ 0 \end{pmatrix}

And for the “bit is 1” case, we get:

\mathbf{p}' = \begin{pmatrix} 1 - (f p_1 + (1 - f)) \\ f p_1 + (1 - f) \end{pmatrix}  = \begin{pmatrix} f p_0 \\ f p_1 + (1 - f) \end{pmatrix}  = f \mathbf{p} + \begin{pmatrix} 0 \\ 1 - f \end{pmatrix}

And just like that, we have a nice symmetric formulation of our update rule: when the input bit is i, the updated probability vector is

\mathbf{p}' = f \mathbf{p} + (1 - f) \mathbf{e}_i

where the e_i are the canonical basis vectors. Once it’s written in this vectorial form, it’s obvious how to generalize this adaptation rule to a larger alphabet: just use a larger probability vector! I’ll go over the implications of this in a second, but first, let me do a brief interlude for those who recognize the shape of that update expression.

Aside: the DSP connection

Interpreted as a discrete-time system

\mathbf{p}_{k+1} = f \mathbf{p}_k + (1-f) \mathbf{e}_{s_k}

where the s_k denote the input symbol stream, note that we’re effectively just running a multi-channel IIR filter here (one channel per symbol in the alphabet). Each “channel” simply linearly interpolates between its previous state and the new input value – it’s a discrete-time leaky integrator, a 1st-order low-pass filter.

Is it possible to use other low-pass filters? You bet! In particular, one popular variant on the fixed-point update rule has two accumulators with different update rates and averages them together. The result is equivalent to a 2nd-order (two-pole) low-pass filter. Its impulse response, corresponding to the weight of symbols over time, initially decays faster but has a longer tail than the 1st-order filter.
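
Here is a minimal sketch of that two-accumulator idea (my own illustration with arbitrary rates, not code from any shipping codec):

int p1_fast = 2048;  // fast-adapting estimate, .12 fixed point (4096 = 1.0)
int p1_slow = 2048;  // slow-adapting estimate

void adapt2(int bit)
{
    if (bit == 0) {
        p1_fast -= p1_fast >> 4;            // fast adaptation rate
        p1_slow -= p1_slow >> 7;            // slow adaptation rate
    } else {
        p1_fast += (4096 - p1_fast) >> 4;
        p1_slow += (4096 - p1_slow) >> 7;
    }
}

// probability actually used for coding/decoding: average of the two estimates
int prob_for_1_mixed(void)
{
    return (p1_fast + p1_slow) >> 1;
}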

And of course you can use FIR low-pass filters too: for example, the incremental box filters described in my post Fast blurs 1 correspond to a “sliding window” model. This readily generalizes to more general piecewise-polynomial weighting kernels using the technique described in Heckbert’s “Filtering by repeated integration”. I doubt that these are actually useful for compression, but it’s always nice to know you have the option.

Can we use arbitrary low-pass filters? Alas, we can’t: we absolutely need linear filters with unit gain (so that the sum of all probabilities stays 1 exactly), and furthermore our filters need to have non-negative impulse responses, since we’re dealing with probabilities here, and probabilities need to stay non-negative. Still, we have several degrees of freedom here, and at least in compression, this space is definitely under-explored right now. I’m sure some of it will come in handy.

Practical non-binary adaptive models

As noted before, the update expression we just derived

\mathbf{p}' = f \mathbf{p} + (1 - f) \mathbf{e}_i

quite naturally generalizes to non-binary alphabets: just stack more than two probabilities into the vector p. But it also generalizes in a different direction: note that we’re just linearly interpolating between the old model p and the “model” for the newly observed symbol, as given by e_i. We can change the model we’re mixing in if we want different update characteristics. For example, instead of using a single spike at symbol i (as represented by e_i), we might decide to slightly boost the probabilities for adjacent values as well.
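
In floating point, the generalized rule is tiny; here’s a sketch of mine (not from the post) where mixin[i] holds whatever mix-in distribution we chose for symbol i, be it the pure spike e_i or something smoother:

enum { NSYMS = 16 };            // alphabet size, arbitrary for this sketch

double p[NSYMS];                // current model; entries sum to 1
double mixin[NSYMS][NSYMS];     // mixin[i] = mix-in distribution for symbol i
const double f = 1.0 - 1.0 / 32.0;

// p' = f*p + (1 - f)*mixin[sym]
void adapt_vector(int sym)
{
    for (int j = 0; j < NSYMS; ++j)
        p[j] = f * p[j] + (1.0 - f) * mixin[sym][j];
}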

Another option is to blend between the “spike” and a uniform distribution with a given weight u:

\mathbf{p}' = f \mathbf{p} + (1 - f) ((1-u) \mathbf{e}_i + u\mathbf{1})

where 1 is the vector consisting of all-1s. This is a way to give us the clamping of probabilities at some minimum level, like we get in the fixed-point binary adaptive models, but in a much more controlled fashion; no implicit dependency on the scaling factor in this formulation! (Although of course an integer implementation will still have some rounding effects.) Note that, for a fixed value of u, the entire right-hand side of that expression only depends on i and does not care about the current value of p at all.

Okay, so we can design a bunch of models using this on paper, but is this actually practical? Note that, in the end, we will still want to get fixed-point integer probabilities out of this, and we would greatly prefer them to all sum to a power of 2, because then we can use rANS. This is where my previous post “Mixing discrete probability distributions” comes in. And this is where I’ll stop torturing you and skip straight to the punchline: a working adaptation step, with pre-computed mix-in models (depending on i only), can be written like this:

int rate = 5;       // use whatever you want!
int CDF[nsyms + 1]; // CDF[nsyms] is pow2
int mixin_CDFs[nsyms][nsyms + 1];

void adapt(int sym)
{
    // no need to touch CDF[0] and CDF[nsyms]; they always stay
    // the same.
    int *mixin = mixin_CDFs[sym];
    for (int i = 1; i < nsyms; ++i)
         CDF[i] += (mixin[i] - CDF[i]) >> rate;
}

which is a pretty sweet generalization of the binary model update rule. As per the “Mixing discrete probability distributions” post, this has several non-obvious nice properties. In particular, if the initial CDF has nonzero probability for every symbol, and all the mixin_CDFs do as well, then symbol probabilities will never drop down to zero as a result of this mixing, and we always maintain a constant total throughout. What this variant doesn’t handle very well is round-off; there are better approaches (although you do want to make sure that something like the spacing lemma from the mixing article still holds), but this variant is decent, and it’s certainly the most satisfying because it’s so breathtakingly simple.

Note that the computations for each i are completely independent, so this is data-parallel and can be written using SIMD instructions, which is pretty important to make larger models practical. This makes the update step fast; you do still need to be careful in implementing the other half of an arithmetic coding model, the symbol lookup step.
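
To make the mix-in tables a bit more concrete, here is a hedged sketch of mine (not from LZNA) that fills mixin_CDFs as the fixed-point analogue of the (1-u)e_i + u·1 blend above: a small uniform floor per symbol, with the rest of the probability mass on the observed symbol. It reuses the CDF and mixin_CDFs arrays from the snippet above; nsyms and the constants are picked arbitrarily.

enum { nsyms = 16, prob_scale = 1 << 15 }; // CDF[nsyms]; a power of 2 for rANS

void init_models(void)
{
    // start from a uniform model
    for (int j = 0; j <= nsyms; ++j)
        CDF[j] = (j * prob_scale) / nsyms;

    // mix-in CDF for symbol "sym": a floor of "base" per symbol (the
    // uniform part), with the remaining mass on "sym" itself (the spike).
    const int base = 2;
    for (int sym = 0; sym < nsyms; ++sym) {
        int accum = 0;
        mixin_CDFs[sym][0] = 0;
        for (int j = 0; j < nsyms; ++j) {
            accum += base + (j == sym ? prob_scale - nsyms * base : 0);
            mixin_CDFs[sym][j + 1] = accum;
        }
    }
}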

Building from here

This is a pretty neat idea, and a very cool new building block to play with, but it’s not a working compressor yet. So what can we do with this, and where is it interesting?

Well. For a long time, in arithmetic coding, there were basically two choices. You could use larger models with very high per-symbol overhead: one or more divisions per symbol decoded, a binary search to locate the right symbol… pretty expensive stuff. Expensive enough that you want to make sure you code very few symbols this way. Adaptive models, using something like Fenwick trees, were even worse, and limited in the set of update rules they could support. In practice, truly adaptive large-alphabet models were exceedingly rare; if you needed some adaptation, you were more likely to use a deferred summation model (keep histogramming data and periodically rebuild your CDF from the last histogram) than true per-symbol adaptation, because it was so damn expensive.

At the other extreme, you had binary models. Binary arithmetic coders have much lower per-symbol cost than their non-binary cousins, and good binary models (like the fixed-point Howard/Vitter model we looked at) are likewise quite cheap to update. But the problem is that with a binary coder, you end up sending many more symbols. For example, in LZMA, a single 256-symbol alphabet model gets replaced with 8 binary coding steps. While the individual binary coding steps are much faster, they’re typically not so much faster that you can do 8× as many and still come out much faster overall. There are ways to reduce the number of symbols processed. For example, instead of a balanced binary-tree binarization, you can build a Huffman tree and use that instead… yes, using a Huffman tree to do arithmetic coding, I know. This absolutely works, and it is faster, but it’s also a bit of a mess and fundamentally very unsatisfying.

But now we have options in the middle: with rANS giving us larger-alphabet arithmetic decoders that are merely slightly slower than binary arithmetic coding, we can do something better. The models described above are not directly useful on a 256-symbol alphabet. No amount of SIMD will make updating 256 probability estimates per input symbol fast. But they are useful on medium-sized alphabets. The LZNA in “Oodle LZNA” stands for “LZ-nibbled-ANS”, because the literals are split into nibbles: instead of having a single adaptive 256-symbol model, or a binary tree of adaptive binary models, LZNA has much flatter trees of medium-sized models. Instead of having one glacially slow decode step per symbol, or eight faster steps per symbol, we can decode an 8-bit symbol in two still-quite-fast steps. The right decomposition depends on the data, and of course on the relative speeds of the decoding steps. But hey, now we have a whole continuum to play with, rather than just two equally unsavory points at the extreme ends!

This is not a “just paste it into your code” solution; there’s still quite a bit of art in constructing useful models out of this, several fun tricks in actually writing a fast decoder for this, some subtle variations on how to do the updates. And we haven’t even begun to properly play with different mix-in models and “integrator” types yet.

Charles’ LZNA is the first compressor we’ve shipped that uses these ideas. It’s not gonna be the last.

UPDATE: As pointed out by commenter “derf_” on Hacker News, the non-binary context modeling in the current Daala draft apparently describes the exact same type of model. Details here (section 2.2, “Non-binary context modeling”, variant 3). This apparently was proposed in 2012. We (RAD) weren’t aware of this earlier (pity, that would’ve saved some time); very cool, and thanks derf for the link!

Triangular numbers mod 2^n

The triangular numbers

\displaystyle T_k := \sum_{j=1}^k j = \binom{k+1}{2} = \frac{k(k+1)}{2}

are a standard sequence used in quadratic probing of open hash tables. For example, they’re used in Google’s sparse_hash and dense_hash, generally considered to be very competitive hash table implementations.

You can find lots of places on the web mentioning that the resulting probe sequence will visit every element of a power-of-2 sized hash table exactly once; more precisely, the function f(k) = T_k \bmod 2^m is a permutation on \{ 0, 1, \dots, 2^m-1 \}. But it’s pretty hard to find a proof; everybody seems to refer back to Knuth, and in The Art of Computer Programming, Volume 3, Chapter 6.4, the proof appears as an exercise (number 20 in the Second Edition).
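
If you just want to convince yourself numerically first, a quick brute-force check does the job; the following is a throwaway sketch of mine, not anything from Knuth:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Check whether k -> T_k mod 2^m hits every value in [0, 2^m) exactly once.
static bool triangular_is_permutation(int m)
{
    enum { MAX_N = 1 << 16 };
    static bool seen[MAX_N];
    uint32_t n = 1u << m;
    memset(seen, 0, sizeof(seen));
    for (uint32_t k = 0; k < n; ++k) {
        uint32_t t = (uint32_t)(((uint64_t)k * (k + 1) / 2) & (n - 1));
        if (seen[t])
            return false;
        seen[t] = true;
    }
    return true;
}

int main(void)
{
    for (int m = 0; m <= 16; ++m)
        printf("m=%2d: %s\n", m, triangular_is_permutation(m) ? "permutation" : "not a permutation");
    return 0;
}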

If you want to do this exercise yourself, please stop reading now; spoilers ahead!

Anyway, turns out I arrived at the solution very differently from Knuth. His proof is much shorter and slicker, but pretty “magic” and unmotivated, so let’s take the scenic route! The first step is to look at a bunch of small values and see if we can spot any patterns.

k         |   0   1   2   3   4   5   6   7   8   9  10  11
T_k       |   0   1   3   6  10  15  21  28  36  45  55  66
T_k mod 8 |   0   1   3   6   2   7   5   4   4   5   7   2
T_k mod 4 |   0   1   3   2   2   3   1   0   0   1   3   2
T_k mod 2 |   0   1   1   0   0   1   1   0   0   1   1   0

And indeed, there are several patterns that might be useful: looking at the row for “mod 2”, we see that it seems to be just the values 0, 1, 1, 0 repeating, and that sequence itself is just 0, 1 followed by its reverse. The row for “mod 4” likewise looks like it’s just alternating between normal and reverse copies of 0, 1, 3, 2, and the “mod 8” row certainly looks like it might be following a similar pattern. Can we prove this?

First, the mirroring suggests that it might be worthwhile to look at the differences

\displaystyle T_{2n-1-k} - T_k = \binom{2n-k}{2} - \binom{k+1}{2}
\displaystyle = (2n-k)(2n - (k+1))/2 - (k+1)k/2
\displaystyle = 2n^2 - n(2k+1) + k(k+1)/2 - k(k+1)/2
\displaystyle = 2n^2 - n(2k+1)

Both terms are multiples of n, so we have T_{2n-1-k} - T_k \equiv 0 \pmod{n}, which proves the mirroring (incidentally, for any n, not just powers of 2). Furthermore, the first term is a multiple of 2n too, and the second term almost is: we have T_{2n-1-k} - T_k \equiv -n \pmod{2n}. This will come in handy soon.

To prove that T_k \bmod n is 2n-periodic, first note the standard identity for triangular numbers

\displaystyle T_{a+b} = \sum_{j=1}^{a+b} j = \sum_{j=1}^a j + \sum_{j=a+1}^{a+b} j = T_a + \sum_{j=1}^{b} (j + a)
\displaystyle = T_a + \sum_{j=1}^b j + ab = T_a + T_b + ab

in particular, we have

\displaystyle T_{k + 2n} = T_k + T_{2n} + 2kn = T_k + n(2n+1) + 2kn \equiv T_k \pmod{n}

Again, this is for arbitrary n ≥ 1. So far, we’ve proven that for any positive integer n, the sequence T_k \bmod n is 2n-periodic, with the second half being a mirrored copy of the first half. It turns out that we can wrangle this one fact into a complete proof that T_k \bmod 2^m, 0 ≤ k < 2^m, is a permutation fairly easily by using induction:

Basis (m=0): T_k \bmod 1=0, and the function f_0(0) := 0 is a permutation on { 0 }.
Inductive step: Suppose f_m(k) := T_k \bmod 2^m is a permutation on \{ 0, \dots, 2^m-1 \}. Then the values of f_{m+1}(k) = T_k \bmod 2^{m+1} must surely be pairwise distinct for 0 ≤ k < 2^m, since they’re already distinct mod 2^m. That means we only have to deal with the second half. But above (taking n=2^m), we proved that

f_{m+1}(2^{m+1}-1-k) = T_{2^{m+1}-1-k} \equiv T_k - 2^m \pmod{2^{m+1}}

and letting k run from 0 to 2^m-1, we notice that mod 2^m, these values are congruent to the values in the first half, but mod 2^{m+1}, they differ by an additional term of -n. This implies that the values in the second half are pairwise distinct both from the first half and from each other. This proves that f_{m+1} is injective, and because it’s mapping the 2^{m+1}-element set \{ 0, \dots, 2^{m+1} - 1 \} onto itself, it must be a permutation.

Which, by mathematical induction, proves our claim.