I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there’s a bunch of basics I need to cover first. In this post, I’ll try just that.

Caches

This is a whirlwhind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there’s generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to a L2 cache – which in turns talks to memory. Or maybe to a L3 cache. You get the idea.

Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read. And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times. Things gets a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always. Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which points their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different. Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines). Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient. Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2for example. Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here. Coherency protocols As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches? Well, the answer is quite simple: nothing happens. And that’s bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system. Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1, and all processors have to share it. Each cycle, the L1$picks one lucky core that gets to do a memory operation this cycle, and runs it. This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based systems, has higher latency but scales better to systems with lots of cores).

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.

With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there are write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol.

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

• Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
• Shared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can’t be written to. Multiple caches are allowed to have a copy of the same memory location in “shared” state at the same time, hence the name.
• Exclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence “exclusive”. That is, the same line must be in the I state in the caches of all other cores.
• Modified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions.
Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other core’s caches..

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and wait for it to complete before you send the next one. And of course, in practice modern CPUs normally do none of these things:

• Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called “invalidation queue”, where it sits for a while until the cache has time to process it.
• Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
• In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking“. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post. The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!

I wanted to comment on Timothy Lottes post on Bindless and descriptors (read that first!) and the comment field was too small, so here goes.

Some questions.

General: You’re implicitly assuming there’s lots of different types of resource tables. Why?

UPDATE: It seems like Timothy is talking about D3D12-esque Descriptor Tables. Well, I’m not :), so yeah, kinda talking apart here, but here’s my response anyway.

Essentially, GL bindless just means “one big resource table that contains everything”. And you could certainly implement it as such: one large resource table (multi-megabyte, potentially) that contains descriptor for absolutely everything currently live, with the handles being 32-bit offsets from the base address; nor do I see an intrinsic need for resource tables to be homogeneous in terms of either resource type or update frequency (it might be advantageous in certain cases, but I don’t see why it would be *required*).

NV GL bindless: what’s the handle values in “TEX handle“? Presumably, not a 7-bit register index. I’m assuming that it boils down to a “[bindlessTable + handle]” addressing mode, either explicit (i.e. this is a memory load) or implicit (HW is aware of handles and has a dedicated fetch path for them)? I couldn’t see any details in the documents you linked. Anyway, presumably there’s still resource descriptors somewhere. Where is that “somewhere”? How does do the HW know where to get them from?

NV with resource tables: why are you billing the handle table address LDC per texture sample? Unless there’s very high register pressure, you would only load the resource table pointers once per table, not once per resource access. And given my previous comment about “I don’t see why you need tons of these”, that’s really a constant per-shader cost of a few regs and a few loads; it’s not per texture sample at all.

All AMD: you can’t load the texture/sampler descriptors directly, you still need to know where to load them from (unless you use little enough resources that you can squeeze all descriptors into the 16 scalars). There’s at least one more scalar load to get a base pointer to the resource descriptors for the active draw call. Like the resource table pointer loads I just mentioned, this is amortized (just do that load once and leave the scalar reg around).

For GL bindless mode, you would presumably use a single global descriptor table that all your handles point into, and would preload the base address to that table using the 16 scalars you get to set per draw. This makes the basic model:

AMD DX “bindful”:

// sBatchDesc is one of the 16 pre-init
S_LOAD_DWORD_X4 sTexDesc, sBatchDesc(16) // random offset
SAMPLE ...


AMD GL “bindful”:

S_LOAD_DWORD_X8 sTexAndSmp, sBatchDesc(32)
SAMPLE ...


AMD GL bindless:

// sBindlessTable also one of the 16 pre-init
SAMPLE ...


(explicit handle load here, “bindful” is cheaper on the shader side but pays that cost elsewhere by having to set up the per-batch descriptors)

and

AMD resource tables, separate tex/smp.

// let's say we have 2, sResourceTableImmutable
// and sResourceTableDyn, both pre-init
SAMPLE ...


or we could have a combined texture/sampler handle if we wanted, etc.

The big difference here is that the batch descriptors (sBatchDesc) here only hold texture handles (=offsets into their respective resource tables) not descriptors themselves. In the “bindful” case, there’s a conceptual global table too (the bind points), but it keeps changing between batches, which makes things tricky, and forces you to deal with multiple live versions at the same time and/or copy descriptors around. The resource table model (which bindless is a special case of!) has them stay constant and immutable over the lifetime of a command buffer which gets rid of that cost. GL bindless is similar.

Another note on the AMD resource table version: I’m loading texture/sampler handlers here separately for clarity, but that’s not at all required. A shader that references multiple contigous slots from its sBatchDesc can (and probably should) fuse them into larger pow2-sized S_LOAD_DWORDs. In my example, you could just as well do:

AMD resource tables + “fusing”:

// sTexHandle = sHandles, sSmpHandle = sHandles + 1 (register IDs)
SAMPLE ...


So really, not all that much (cost) difference between the different “bindless-esque” approaches here that I can see.

I’ve been meaning to write another proper post on this for a while, but the last few months have been very busy and I didn’t feel like writing in my spare time.

This is not that post, sadly. This is about something else: namely, me only citing Jarek Duda’s work on ANS and not Yann Collet‘s work on FSE. Apparently, there were multiple versions of Jarek’s ANS paper, and the second version (which contains rANS, the topic I’ve been writing about) was significantly influenced by Yann’s experiences with integrating ideas from the first version into FSE.

Anyway, I did not mention Yann’s work in my rANS posts at all. I just want to make clear that this was because I was writing about rANS not tANS (the family that FSE is a member of), and I simply wasn’t aware that Yann’s work and input significantly influenced the second version of the ANS paper. My apologies; this was a simple oversight, not a deliberate attempt to talk down Yann’s contribution!

On a separate but related note: Mid-february, I wrote a short paper on how entropy coders (with a focus on ANS) can be interleaved on the encode side to allow the decoder side to exploit instruction-level parallelism and/or SIMD instructions. I originally meant to do a separate post about it here, but on trying to write it discovered that I didn’t have much to say on the topic that wasn’t in the paper. Hence, no separate blog post. But I figured I should at least link to it once from here.

Anyway, more regular blog updates should start again soon. Until then!

In the previous post, I talked a bit about how essentially anything we would call a computer these days is in fact, for all practical purposes, a heterogeneous cluster made up of various specialized smaller computers, all connected using various networks that go by different names and are specified in different standards, yet are all suspiciously similar at the architecture level; a fractal of switched, packet-based networks of heterogeneous nodes that make up what we call a single “computer”. Which we then promptly connect with other computers to form home networks, work networks, and ultimately the internet.

So far, that is just an observation. But does it mean anything? Well, sort of. For one thing, it means that all the network security problems that plague inter-computer networking also exist within computers themselves. Just as you can own a machine from outside using a network-based attack (ultimately delivered via Ethernet frames), you can own it using exploits delivered via USB packets, or eSATA, or Thunderbolt. This is not a particularly bad security nightmare – all these attacks require physical access, and once you have physical access to a machine you have all kinds of ways of working around security mechanisms anyway – but it does mean that people trying to sell you hardware that you aren’t allowed to run your own code on have a pretty large attack surface to worry about.

That said, as far as implications of “everything is a computer running a network stack” go, “there are going to be implementation issues” is not exactly the most exciting place to end up at. Things get substantially more interesting once we approach the same underlying truth from another angle, though. To do so, let’s pick a concrete example. Being an affluent male geek, I socialize with a bunch of other affluent male geeks, a lot of whom store enough data (or enough computers) at home to require a NAS (Network Attached Storage) device to house terabytes of (doubtlessly 100% legally acquired!) data and serve it to the multitude of desktop computers, tables, phones and media devices they keep in their home. Now, these boxes are essentially marketed as a bunch of drive enclosures with a network plug and a bunch of added value features; RAID, video streaming (and sometimes live transcoding as well), that kind of thing. Since they’re essentially storage appliances, most people would view them as somewhere between an external hard drive enclosure and a proper computer. Of course, there’s no such thing as “half a computer”, and these devices are typically just small computers with a NIC and a few hundred megabytes to a few gigabytes of RAM, usually running some variant of Linux. They could be used for all kinds of computing; they just happen to be specialized in a way that makes them more suitable for some tasks than others.

That’s not why I bring this up, though; I bring it up because from an end user’s point of view, a NAS is essentially just a disk drive you can plug into multiple machines simultaneously, using an Ethernet plug instead of a SATA plug. Now from the computer’s (and OSes) point of view, a local disk versus a NAS are very different things; at that level, they are speaking very different languages (namely, a block device with a local file system on it, versus a remote network filesystem such as CIFS or NFS). But to most applications running on top of the OS (and by extension most users that use them), the distinction between a local drive and a mounted network share only really becomes apparent when the abstraction breaks.

Interestingly, we see the same pattern today even when we zoom in to the local storage connected to a computer. If you were to get a 2004 SATA hard drive, it would very likely be exactly that – still a “dumb” hard drive, possibly with a bit of cache, a microcontroller to talk SATA to the host and really address the disk, plus a fair amount of (possibly programmable) DSP circuitry to actually handle the encoding of bits into and decoding of bits from magnetic flux patterns, because doing this is way more involved – and way less discrete – than you probably think. So even in 2004, we already have a considerable amount of essentially invisible compute power being thrown at a problem that the higher-up abstraction layers gloss over; but at least all this processing still essentially respects the underlying abstraction of a block device. For a 2014 USB thumb drive, that’s not necessarily true anymore either: they do things like wear leveling, remapping block addresses to distribute write loads evenly over the device (and hence prevent some blocks aging much faster than others). These processes often involve parsing the file system itself (at least if it’s FAT or a variant). Think about this for a second: a storage device that, by itself, understands the file system that is stored on it, and automatically rearranges data to maximize its expected lifetime. This level of autonomy is less than a NAS transcoding videos on the fly, but it’s still a far cry from the truly “dumb” disk and tape drives our reigning block device abstractions were designed for. And we’re not talking about some expensive high-availability low-volume server technology here; this is a storage technology that is both cheap and ubiquitous.

That brings up another point: Flash memory technology is a vastly different beast from magnetic recording technology. And yet, all of those differences get hidden and papered over at a real low level; we could have taught OSes what Flash devices are, how they work, and declared them as a separate device family, with separate file systems that handle the wear leveling on the software side. Yet this is not what happened; evidently, the path of least resistance was to add dedicated hardware and software to every single Flash storage device that makes it look like a hard drive, even though it works very differently, with very different trade-offs.

This is not an isolated incident: if you go over the list of peripherals from last time, you’ll find something similar for every single one of them. Ethernet technology has essentially been completely redesigned between 1989 and 2014, yet the only part of the Ethernet standard that’s both device-independent and directly visible to software – namely, Ethernet frames – has remained essentially unchanged (except for the addition of jumbo frames, which are still not widely used). SAS and SATA may use a very different encoding and topology than their parallel counterparts, but from a software perspective the changes are evolutionary not revolutionary. Regular PCI and PCIe are essentially the same from the perspective of the OS. And so forth. Probably the most impressive example of all is the x86 architecture; a modern x86 processor can (and will) go to substantial lengths to pretend that it’s really just a fancy 8086, a 286, a 386, or a slightly extended AMD Athlon 64, depending on the operating mode; the architecture and implementation of current x86 processors is nothing like the original processors it’s emulating, but while the implementations may change substantially over time, the interfaces – protocols, to stay within our networking terminology – stay mostly constant over large time scales, warts and all.

Note how much this runs counter to the common attitude that software is easy to change, and hardware difficult. This might be true for individuals; but across computing as a whole, it’s missing the point entirely. The more relevant distinction is that implementations are easy to change, and protocols difficult. And once a protocol (or API in software terms) is firmly established and widely supported, it’s damn near impossible to get rid of. If that means that USB thumb drives have to pretend they are hard disks, and need the capability to understand the file system that’s stored on them to do a good job; if that means that Ethernet devices have to pretend that it’s 1989 and all nodes on the network are connected to one wire and identified by their MAC address, rather than connected by point-to-point links via network switches that understand IP just fine; if that means that “x86″ processors have to pretend they’re not actually out-of-order cores with an internal RISC-ish instruction set, hundreds of registers and sophisticated “decoding” (really, translation) engines; then so be it. That is the price we pay to keep things more or less in working order – except for that occasional instance when all our sophisticated abstractions break, the underlying reality shines through, and that mounted network share is revealed to not actually be on the local machine at all, by producing errors that no app knows how to handle.

Just keep hitting refresh. It’ll be back in a minute.

I first started playing around with PCs about 25 years ago, so around 1989/1990 (and began coding maybe a year later). At the time, there was a large variety of different connectors, plugs and buses in use: keyboards used DIN or PS/2 plugs, a point-to-point link with a custom protocol. Mice and modems (if you had them!) were typically connected using RS-232, a point-to-point serial link. Your monitor was connected using a VGA connector, a point-to-point link carrying analog component video (RGBHV). Printers were connected using the parallel port, a point-to-point parallel link (initially 8 bits host->device unidirectional with some unidirectional and bidirectional status bits; later, there were also modes with a bidirectional link on the 8 data bits, used to connect devices like scanners). The PC had an extension bus, now called ISA, that you could use to plug in peripherals such as video cards, floppy controllers, hard disk controllers, SCSI controllers, sound cards and even memory expansions. On the storage side, floppy drives were on a parallel link (not sure if it qualifies as a bus or not) that could multiplex two floppy drives onto one wide ribbon cable. Consumer hard drives typically spoke what is now called PATA (parallel ATA) but was then just called ATA or IDE; essentially just a thinly wrapped ISA bus protocol (these drives already had their controllers on-board). As such, ATA/IDE was a parallel bus (more ribbon cables). There was also SCSI; on PCs, it was considered the “high-end” storage option, on other platforms it was the default. Either way, SCSI was another parallel bus. Finally, if you had an Ethernet card (and in 1990 you probably didn’t in your home machine, because there was no point – it was quite likely the only computer in your household, or in a half-mile radius for that matter), then at the time that meant one long coaxial cable (“bus topology”), T-plugs to plug in machines, and terminators at both ends of the cable to prevent reflections. Ethernet at the time already sent packets (not the IP kind, Ethernet packets, properly called “frames”) over a high-speed (10Mbit/s!) serial link, relying on collision detection in lieu of the more complicated arbitration techniques its competitors used at the time.

Okay, so much for the history lesson. Now let’s fast-forward 25 years. How does the situation look today?

We are still using “Ethernet”, but 2014 Ethernet has very little in common with its 1989 counterpart other than the name. It still uses serial links, but as of 10BASE-T it uses a star topology (using switches and hubs) rather than a bus, and with the introduction of Gigabit Ethernet, the whole “collision detection” business fell by the wayside too; Gigabit Ethernet is a bunch of point-to-point links connected by switches (active devices, instead of hubs which are mostly passive), and hence there’s no physically shared medium anymore; there’s explicit switching. So: serial point-to-point links, packet based, switching. The same tech (albeit at different scaling levels) is used in your home, at your workplace, at your ISP, and in major internet exchanges/peering spots.

Extension cards used to be on the ISA bus; in 2014, it’s all PCI Express, and while it’s still called a “bus”, it hasn’t been one ever since the “Express” part got added. PCIe uses serial not parallel links, the connectivity is point-to-point instead of a shared medium, with arbitration by the “root complex” (which logically sits at the center of the bus, making this a star instead of a bus topology), oh and communication is packet-based. So PCIe uses serial point-to-point links is packet-based, and uses switching.

On the storage front, PATA was succeeded by SATA. SATA is, as the name says, serial instead of parallel. The cables, which used to form a bus (with each cable typically allowing up to two devices to be connected to the controller) now are used for single point-to-point links between devices and the controller. Oh, and there is a more explicit framing protocol based around packets. In other words, SATA has a star topology of serial point-to-point links and is packet-based.

Of course, SCSI didn’t disappear off the face of the earth either; it turned into Serial Attached SCSI (SAS). SAS keeps the SCSI command set, but as the name suggests, links are now serial. Oh, and – interrupt me if this sounds familiar – they’re now point-to-point, with everything connected either directly to the “initiator”, or indirectly through one or more “expanders” (switches), meaning SAS has a tiered-star topology. I’m not sure whether or not SAS is internally packet-based though, but I have my suspicions :) [can anyone confirm this?]. Oh, and SAS can also tunnel SATA streams, because why not.

Keyboard, mice, and printers are now all on USB. USB stands for “Universal Serial Bus”; the first two words of that are accurate (it is pretty much universally used at this point, and it is serial) but the latter is a blatant lie. USB may pretend to be a bus, but it really is a packet-based network with a tiered-star topology. Serial point-to-point links, possibly with hubs in between, and all packet-based. And if you happen to attach a USB storage device such as a thumb drive or external hard disk, well, USB storage really just wraps SCSI commands in USB packets – and these SCSI commands may in turn wrap ATA commands.

And what about displays? Well, if you use DVI or HDMI, you still have a dedicated point-to-point link, much as in the VGA days, although it’s now all digital. But if you happen to be using the newer DisplayPort, well, that’s a – can you guess? – serial, packet-based point-to-point link protocol. Which can also carry USB across the link, by the way. Oh, and there’s also Thunderbolt, which can multiplex both DisplayPort and PCIe packets over its serial point-to-point links. Among other things, this has fun consequences such as allowing your monitor or TV (or anything else plugged into your display connector) full freaking bus master DMA access – a killer feature if I’ve ever seen one! – and, just as fun, finally gives you the opportunity to have a storage device that talks ATA through SCSI by way of USB over DisplayPort inside Thunderbolt, because we can.

Okay, so pretty much every single pluggable link interface that used to be in 1989 PCs has been replaced by what are, effectively, all very similar spins on the same underlying network technology (and they really are all networks, albeit some with funky protocols). Is that the end of it?

Well, no, not quite. Because we now have multiple CPUs, too. And guess how they talk to each other? Well, in the case of PCs, there’s two primary options: AMD’s HyperTransport and Intel’s QuickPath Interconnect (QPI). Needless to say, both of these are packet-based, serial, point-to-point links, typically used to communicate between CPU cores (or CPU sockets anyway).

So what’s my point here? Well, logically, these CPUs appear to share the same memory, but physically it’s more common at this point to have each CPU (or at least each socket) own a certain fraction of physical memory that it’s connected to directly; the rest, it can only access by asking the other socket (=NUMA). And then your hard drive (or SSD), which speaks SCSI or ATA or both, needs a controller to understand these protocols – often programmable. And, in the case of hard drivers, usually also a DSP to decode/encode the stuff that’s on disk. And your GPU also has its own memory, and a wide array of vector processors that is (at this point) easily Turing-complete. And usually several other fairly general-purpose blocks you don’t even know about. In fact, for each general-purpose processor (with its own memory) in your system that you know about, there are probably 2 or 3 that you don’t. There are tiny ARM cores absolutely everywhere, sometimes where you’d least expect them, and Intel likes to use embedded 486s for the same kind of thing (they’re so small by now, you literally need a microscope to even see them).

We don’t really tend to think of them as such, but not only are all kinds of peripherals on network links these days, they also have general-purpose CPUs, often with substantially more processing power (and memory!) than the 286 @ 10MHz I learned programming on in 1990. And this is all still just talking about PCs – I haven’t even started on things like cars or planes.

So what’s my point? Well, just to say that what we think of as “computers” today are, in practice, already fairly large, heterogeneous clusters of different specialized smaller computers, and while only a small part of these computers is visible to the user, there’s a bunch of them lurking just below the surface. This goes not just for PCs, but also phones and tablets. And of course, absolutely everything you plug into a computer these days is – by itself – another computer. Even stuff you think of as “passive” components, like batteries and thumb drives.

And they’re all talking to each other over networks, most of which are starting to look pretty damn similar at this point – and with greatly varying degrees of implementation quality, eliciting both awe that anything works at all and dread at how easy it is all to exploit. So let me conclude by formulating an analogue to Zawinski’s Law of Software Envelopment:

Every data link standard converges to serial point-to-point links connected in a tiered-star topology and transporting packets. Those link standards which cannot so converge are replaced by ones which can.

In the previous post, I wrote about rANS in general. The ANS family is, in essence, just a different design approach for arithmetic coders, with somewhat different trade-offs, strengths and weaknesses than existing coders. In this post, I am going to talk specifically about using rANS as a drop-in replacement for (static) Huffman coding: that is, we are encoding data with a known, static probability distribution for symbols. I am also going to assume a compress-once decode-often scenario: slowing down the encoder (within reason) is acceptable if doing so makes the decoder faster. It turns out that rANS is very useful in this kind of setting.

Review

Last time, we defined the rANS encoding and decoding functions, assuming a finite alphabet $\mathcal{A} = \{ 0, \dots, n - 1 \}$ of n symbols numbered 0 to n-1.

$C(s,x) := M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)$
$D(x) := (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s)$ where $s = s(x \bmod M)$.

where $F_s$ is the frequency of symbol s, $B_s = \sum_{i=0}^{s-1} F_i$ is the sum of the frequencies of all symbols before s, and $M = \sum_{i=0}^{n-1} F_i$ is the sum of all symbol frequencies. Then a given symbol s has (assumed) probability $p_s = F_s / M$.

Furthermore, as noted in the previous post, M can’t be chosen arbitrarily; it must divide L (the lower bound of our normalized interval) for the encoding/decoding algorithms we saw to work.

Given these constraints and the form of C and D, it’s very convenient to have M be a power of 2; this replaces the divisions and modulo operations in the decoder with bit masking and bit shifts. We also choose L as a power of 2 (which needs to be at least as large as M, since otherwise M can’t divide L).

This means that, starting from a reference probability distribution, we need to approximate the probabilities as fractions with common denominator M. My colleague Charles Bloom just wrote a blog post on that very topic, so I’m going to refer you there for details on how to do this optimally.

Getting rid of per-symbol divisions in the encoder

Making M a power of two removes the division/modulo operations in the decoder, but the encoder still has to perform them. However, note that we’re only ever dividing by the symbol frequencies $F_s$, which are known at the start of the encoding operation (in our “static probability distribution” setting). The question is, does that help?

You bet it does. A little known fact (amongst most programmers who aren’t compiler writers or bit hacking aficionados anyway) is that division of a $p$-bit unsigned integer by a constant can always be performed as fixed-point multiplication with a reciprocal, using $2p+1$ bits (or less) of intermediate precision. This is exact – no round-off error involved. Compilers like to use this technique on integer divisions by constants, since multiplication (even long multiplication) is typically much faster than division.

There are several papers on how to choose the “magic constants” (with proofs); however, most of them are designed to be used in the code generator of a compiler. As such, they generally have several possible code sequences for division by constants, and try to use the cheapest one that works for the given divisor. This makes sense in a compiler, but not in our case, where the exact frequencies are not known at compile time and doing run-time branching between different possible instruction sequences would cost more than it saves. Therefore, I would suggest sticking with Alverson’s original paper “Integer division using reciprocals”.

The example code I linked to implements this approach, replacing the division/modulo pair with a pair of integer multiplications; when using this approach, it makes sense to limit the state variable to 31 bits (or 63 bits on 64-bit architectures): as said before, the reciprocal method requires $2p+1$ bits working precision for worst-case divisors, and reducing the range by 1 bit enables a faster (and simpler) implementation than would be required for a full-range variant (especially in C/C++ code, where multi-precision arithmetic is not easy to express). Note that handling the case $F_s=1$ requires some extra work; details are explained in the code.

Symbol lookup

There’s one important step in the decoder that I haven’t talked about yet: mapping from the “slot index” $x \bmod M$ to the corresponding symbol index. In normal rANS, each symbol covers a contiguous range of the “slot index” space (by contrast to say tANS, where the slots for any given symbol are spread relatively uniformly across the slot index space). That means that, if all else fails, we can figure out the symbol ID using a binary search in $\lceil\log_2 n\rceil$ steps (remember that n is the size of our alphabet) from the cumulative frequency table (the $B_s$, which take O(n) space) – independent of the size of M. That’s comforting to know, but doing a binary search per symbol is, in practice, quite expensive compared to the rest of the decoding work we do.

At the other extreme, we can just prepare a look-up table mapping from the cumulative frequency to the corresponding symbol ID. This is very simple (and the technique used in the example code) and theoretically constant-time per symbol, but it requires a table with M entries – and if the table ends up being too large to fit comfortably in a core’s L1 cache, real-world performance (although still technically bounded by a constant per symbol) can get quite bad. Moving in the other direction, if M is small enough, it can make sense to store the per-symbol information in the M-indexed table too, and avoid the extra indirection; I would not recommend this far beyond M=212 though.

Anyway, that gives us two design points: we can use $O(n)$ space, at a cost of $O(\log n)$ per symbol lookup; or we can use $O(M)$ space, with $O(1)$ symbol lookup cost. Now what we’d really like is to get $O(1)$ symbol lookup in $O(n)$ space, but sadly that option’s not on the menu.

Or is it?

The alias method

To make a long story short, I’m not aware of any way to meet our performance goals with the original unmodified rANS algorithm; however, we can do much better if we’re willing to relax our requirements a bit. Notably, there’s no deep reason for us to require that the slots assigned to a given symbol s be contiguous; we already know that e.g. tANS works in a much more relaxed setting. So let’s assume, for the time being, that we can rearrange our slot to symbol mapping arbitrarily (we’ll have to check if this is actually true later, and also work through what it means for our encoder). What does that buy us?

It buys us all we need to meet our performance goals, it turns out (props to my colleague Sean Barrett, who was the first one to figure this out, in our internal email exchanges anyway). As the section title says, the key turns out to be a stochastic sampling technique called the “alias method”. I’m not gonna explain the details here and instead refer you to this short introduction (written by a computational evolutionary geneticist, on randomly picking base pairs) and “Darts, Dice and Coins”, a much longer article that covers multiple ways to sample from a nonuniform distribution (by the way, note that the warnings about numerical instability that often accompany descriptions of the alias method need not worry us; we’re dealing with integer frequencies here so there’s no round-off error).

At this point, you might be wondering what the alias method, a technique for sampling from a non-uniform discrete probability distribution, has anything to do with entropy (de)coding. The answer is that the symbol look-up problem is essentially the same thing: we have a “random” value $x \bmod M$ from the interval [0,M-1], and a matching non-uniform probability distribution (our symbol frequencies). Drawing symbols according to that distribution defines a map from [0,M-1] to our symbol alphabet, which is precisely what we need for our decoding function.

So what does the alias method do? Well, if you followed the link to the article I mentioned earlier, you get the general picture: it partitions the probabilities for our n-symbol alphabet into n “buckets”, such that each bucket i references at most 2 symbols (one of which is symbol i), and the probabilities within each bucket sum to the same value (namely, 1/n). This is always possible, and there is an algorithm (due to Vose) which determines such a partition in O(n) time. More generally, we can do so for any N≥n, by just adding some dummy symbols with frequency 0 at the end. In practice, it’s convenient to have N be a power of two, so for arbitrary n we would just pick N to be the smallest power of 2 that is ≥n.

Translating the sampling terminology to our rANS setting: we can subdivide our interval [0,M-1] into N sub-intervals (“buckets”) of equal size k=M/N, such that each bucket i references at most 2 distinct symbols, one of which is symbol i. We also need to know what the other symbol referenced in this bucket is – alias[i], the “alias” that gives the methods its name – and the position divider[i] of the “dividing line” between the two symbols.

With these two tables, determining the symbol ID from x is quick and easy:

  uint xM = x % M; // bit masking (power-of-2 M)
uint bucket_id = xM / K; // shift (power-of-2 M/N!)
uint symbol = bucket_id;
if (xM >= divider[bucket_id]) // primary symbol or alias?
symbol = alias[bucket_id];


This is O(1) time and O(N) = O(n) space (for the “divider” and “alias” arrays), as promised. However, this isn’t quite enough for rANS: remember that for our decoding function D, we need to know not just the symbol ID, but also which of the (potentially many) slots assigned to that symbol we ended up in; with regular rANS, this was simple since all slots assigned to a symbol are sequential, starting at slot $B_s$:
$D(x) = (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s)$ where $s = s(x \bmod M)$.
Here, the $(x \bmod M) - B_s$ part is the number we need. Now with the alias method, the slot IDs assigned to a symbol aren’t necessarily contiguous anymore. However, within each bucket, the slot IDs assigned to a symbol are sequential – which means that instead of the cumulative frequencies $B_s$, we now have two separate per bucket. This allows us to define the complete “alias rANS” decoding function:

  // s, x = D(x) with "alias rANS"
uint xM = x % M;
uint bucket_id = xM / K;
uint symbol, bias;
if (xM < divider[bucket_id]) { // primary symbol or alias?
symbol = bucket_id;
bias = primary_start[bucket_id];
} else {
symbol = alias[bucket_id];
bias = alias_start[bucket_id];
}
x = (x / M) * freq[symbol] + xM - bias;


And although this code is written with branches for clarity, it is in fact fairly easy to do branch-free. We gained another two tables indexed with the bucket ID; generating them is another straightforward linear pass over the buckets: we just need to keep track of how many slots we’ve assigned to each symbol so far. And that’s it – this is all we need for a complete “alias rANS” decoder.

However, there’s one more minor tweak we can make: note that the only part of the computation that actually depends on symbol is the evaluation of freq[symbol]; if we store the frequencies for both symbols in each alias table bucket, we can get rid of the dependent look-ups. This can be a performance win in practice; on the other hand, it does waste a bit of extra memory on the alias table, so you might want to skip on it.

Either way, this alias method allow us to perform quite fast (though not as fast as a fully-unrolled table for small M) symbol look-ups, for large M, with memory overhead (and preparation time) proportional to n. That’s quite cool, and might be particularly interesting in cases where you either have relatively small alphabets (say on the order of 15-25 symbols), need lots of different tables, or frequently switch between tables.

Encoding

However, we haven’t covered encoding yet. With regular rANS, encoding is easy, since – again – the slot ranges for each symbol are contiguous; the encoder just does
$C(s,x) = M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)$
where $B_s + (x \bmod F_s)$ is the slot id corresponding to the $(x \bmod F_s)$‘th appearance of symbol s.

With alias rANS, each symbol may have its slots distributed across multiple, disjoint intervals – up to N of them. And the encoder now needs to map $(x \bmod F_s)$ to a corresponding slot index that will decode correctly. One way to do this is to just keep track of the mapping as we build the alias table; this takes O(M) space and is O(1) cost per symbol. Another is to keep a sorted list of subintervals (and their cumulative sizes) assigned to each symbol; this takes only O(N) space, but adds a $O(\log_2 N)$ (worst-case) lookup per symbol in the encoder. Sound familiar?

In short, using the alias method doesn’t really solve the symbol lookup problem for large M; or, more precisely, it solves the lookup problem on the decoder side, but at the cost of adding an equivalent problem on the encoder side. What this means is that we have to pick our poison: faster encoding (at the some extra cost in the decoder), or faster decoding (at some extra cost in the encoder). This is fine, though; it means we get to make a trade-off, depending on which of the two steps is more important to us overall. And as long as we are in a compress-once decompress-often scenario (which is fairly typical), making the decoder faster at some reasonable extra cost in the encoder is definitely useful.

Conclusion

We can exploit static, known probabilities in several ways for rANS and related coders: for encoding, we can precompute the right “magic values” to avoid divisions in the hot encoding loop; and if we want to support large M, the alias method enables fast decoding without generating a giant table with M entries – with an O(n) preprocessing step (where n is the number of symbols), we can still support O(1) symbol decoding, albeit with a (slightly) higher constant factor.

I’m aware that this post is somewhat hand-wavey; the problem is that while Vose’s algorithm and the associated post-processing are actually quite straightforward, there’s a lot of index manipulation, and I find the relevant steps to be quite hard to express in prose without the “explanation” ending up harder to read than the actual code. So instead, my intent was to convey the “big picture”; a sample implementation of alias table rANS, with all the details, can be found – as usual – on Github.

And that’s it for today – take care!

We’ve been spending some time at RAD looking at Jarek Duda’s ANS/ABS coders (paper). This is essentially a new way of doing arithmetic coding with some different trade-offs from the standard methods. In particular, they have a few sweet spots for situations that were (previously) hard to handle efficiently with regular arithmetic coding.

The paper covers numerous algorithms. Of those given, I think what Jarek calls “rANS” and “tANS” are the most interesting ones; there’s plenty of binary arithmetic coders already and “uABS”, “rABS” or “tABS” do not, at first sight, offer any compelling reasons to switch (that said, there’s some non-obvious advantages that I’ll get to in a later post).

Charles has already posted a good introduction; I recommend you start there. Charles has also spent a lot of time dealing with the table-based coders, and the table-based construction allows some extra degrees of freedom that make it hard to see what’s actually going on. In this post, I’m going to mostly talk about rANS, the range coder equivalent. Charles already describes it up to the point where you try to make it “streaming” (i.e. finite precision); I’m going to continue from there.

Streaming rANS

In Charles’ notation (which I’m going to use because the paper uses both indexed upper-case I’s and indexed lower-case l’s, which would make reading this with the default font a disaster), we have two functions now: The coding function C and the decoding function D defined as

$C(s,x) := M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)$
$D(x) := (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s)$ where $s = s(x \bmod M)$.

D determines the encoded symbol s by looking at (x mod M) and looking up the corresponding value in our cumulative frequency table (s is the unique s such that $B_s \le x \bmod M < B_s + F_s$).

Okay. If you're working with infinite-precision integers, that's all you need, but that's not exactly suitable for fast (de)compression. We want a way to perform this using finite-precision integers, which means we want to limit the size of x somehow. So what we do is define a "normalized" interval

$I := \{ L, L+1, \dots, bL - 1 \} = [L:bL)$

The [L:bL) thing on the right is the notation I'll be using for a half-open interval of integers). b is the radix of our encoder; b=2 means we're emitting bits at a time, b=256 means we're emitting bytes, and so forth. Okay. We now want to keep x in this range. Too large x don't fit in finite-precision integers, but why not allow small x? The (hand-wavey) intuition is that too small x don't contain enough information; as Charles mentioned, x essentially contains about $\log_2(x)$ bits of information. If x < 4, it can only be one of four distinct values, and as you saw above the value of x (mod M) directly determines which symbol we're coding. So we want x just right: not too large, not too small. Okay. Now let's look at how a corresponding decoder would look:

  while (!done) {
// Loop invariant: x is normalized.
assert(L <= x && x < b*L);

// Decode a symbol
s, x = D(x);

// Renormalization: While x is too small,
// keep reading more bits (nibbles, bytes, ...)
while (x < L)
}


Turns out that’s actually our decoding algorithm, period. What we need to do now is figure out how the corresponding encoder looks. As long as we’re only using C and D with big integers, it’s simple; the two are just inverses of each other. Likewise, we want our encoder to be the inverse of our decoder – exactly. That means we have to do the inverse of the operations the decoder does, in the opposite order. Which means our encoder has to look something like this:

  while (!done) {
// Inverse renormalization: emit bits/bytes etc.
while (???) {
writeToStream(x % b);
x /= b;
}

// Encode a symbol
x = C(s, x);

// Loop invariant: x is normalized
assert(L <= x && x < b*L);
}


So far, this is purely mechanical. The only question is what happens in the “???” – when exactly do we emit bits? Well, for the encoder and decoder to be inverses of each other, the answer has to be “exactly when the decoder would read them”. Well, the decoder reads bits whenever the normalization variant is violated after decoding a symbol, to make sure it holds for the next iteration of the loop. The encoder, again, needs to do the opposite – we need to proactively emit bits before coding s to make sure that, after we’ve applied C, x will be normalized.

In fact, that’s all we need for a first sketch of renormalization:

  while (!done) {
// Keep trying until we succeed
for (;;) {
x_try = C(s, x);
if (L <= x_try && x_try < b*L) { // ok?
x = x_try;
break;
} else {
// Shrink x a bit and try again
writeToStream(x % b);
x /= b;
}
}

x = x_try;
}


Does this work? Well, it might. It depends. We certainly can’t guarantee it from where we’re standing, though. And even if it does, it’s kind of ugly. Can’t we do better? What about the normalization – I’ve just written down the normalization loops, but just because both decoder and encoder maintain the same invariants doesn’t necessarily mean they are in sync. What if at some point of the loop, there are more than two possible normalized configurations – can this happen? Plus there’s some hidden assumptions in here: the encoder, by only ever shrinking x before C, assumes that C always causes x to increase (or at least never causes x to decrease); similarly, the decoder assumes that applying D won’t increase x.

And I’m afraid this is where the proofs begin.

Streaming rANS: the proofs (part 1)

Let’s start with the last question first: does C always increase x? It certainly looks like it might, but there’s floors involved – what if there’s some corner case lurking? Time to check:

$C(s,x)$
$= M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)$
$= F_s \lfloor x/F_s \rfloor + (x \bmod F_s) + (M - F_s) \lfloor x/F_s \rfloor + B_s$
$= x + (M - F_s) \lfloor x/F_s \rfloor + B_s$
$\ge x$

since $x/F_s$, $B_s$, and $M - F_s$ are all non-negative (the latter because $F_s < M$ – each symbol's frequency is less than the sum of all symbol frequencies). So C indeed always increases x, or more precisely, never decreases it.

Next up: normalization. Let's tackle the decoder first. First off, is normalization in the decoder unique? That is, if $x \in I$, is that the only normalized x it could be at that stage in the decoding process? Yes, it is: $x \in I = [L:bL)$, so $x \ge L$, so

$bx + d \ge bx \ge bL$ where $d \in [0:b)$ arbitrary

That is, no matter what bit / byte / whatever the decoder would read next (that's 'd'), running through the normalization loop once more would cause x to become too large; there's only one way for x to be normalized. But do we always end up with a normalized x? Again, yes. Suppose that $x < L$, then (because we're working in integers) $x \le L - 1$, and hence

$bx + d \le bL - b + d \le bL - 1$ (again $d \in [0:b)$ arbitrary)

The same kind of argument works for the encoder, which floor-divides by b instead of multiplying b it. The key here is that our normalization interval I has a property the ANS paper calls “b-uniqueness”: it’s of the form $I=[k:bk)$ for some positive integer k (its upper bound is b times its lower bound). Any process that grows (or shrinks) x in multiples of b can’t “skip over” I (as in, transition from being larger than the largest value in I to smaller than the smallest value in I or vice versa) in a single step. Simultaneously, there’s also never two valid states the encoder/decoder could be in (which could make them go out of sync: both encoder and decoder think they’re normalized, but they’re at different state values).

To elaborate, suppose we have b=2 and some interval where the ratio between lower and upper bound is a tiny bit less than 2: say $I' = [k:2k-1) = \{ k, k+1, \dots, 2k-2 \}.$ There’s just one value missing. Now suppose the decoder is in state $x=k-1$ and reads a bit, which turns out to be 1. Then the new state is $x' = 2x+1 = 2(k-1) + 1 = 2k - 1 \not\in I'$ – we overshot! We were below the lower bound of I’, and yet with a single bit read, we’re now past its upper bound. I’ is “too small”.

Now let’s try the other direction; again b=2, and this time we make the ratio between upper and lower bound a bit too high: set $I' = [k:2k+1) = \{ k, k+1, \dots, 2k \}.$ There’s no risk of not reaching a state in that interval now, but now there is ambiguity. Where’s the problem? Suppose the encoder is in state $x=2k$. Coding any symbol will require renormalization to “shift out” one bit. The encoder writes that bit (a zero), goes to state $x=k$, and moves on. But there’s a problem: the decoder, after decoding that symbol, will be in state $x=k$ too. And it doesn’t know that the encoder got there from state $x=2k$ by shifting a bit; all the decoder knows is that it’s in state $x=k \in I'$, which is normalized and doesn’t require reading any more bits. So the encoder has written a bit that the decoder doesn’t read, and now the two go out of sync.

Long story short: if the state intervals involved aren’t b-unique, bad things happen. And on the subject of bad things happening, our encoder tries to find an x such that C(s,x) is inside I by shrinking x – but what if that process doesn't find a suitable x? We need to know a bit more about which values of x lead to C(s,x) being inside I, which leads us to the sets

$I_s := \{ x | C(s,x) \in I \}$

i.e. the set of all x such that encoding 's' puts us into I again. If all these sets turn out to be b-unique intervals, we're good. If not, we're in trouble. Time for a brief example.

Intermission: why b-uniqueness is necessary

Let’s pick L=5, b=2 and M=3. We have just two symbols: ‘a’ has probability $P_a=2/3$, and ‘b’ has probability $P_b=1/3$, which we turn into $F_a = 2$, $B_a = 0$, $F_b = 1$, $B_b = 2$. Our normalization interval is $I = [5:2\cdot5) = \{5, 6, 7, 8, 9\}$. By computing C(s,x) for the relevant values of s and x, we find out that

$I_a = \{ 4,5,6 \} = [4:7)$
$I_b = \{ 1,2 \} = [1:3)$

Uh-oh. Neither of these two intervals is b-unique. $I_a$ is too small, and $I_b$ is too big. So what goes wrong?

Well, suppose that we’re in state x=7 and want to encode an ‘a’. 7 is not in $I_a$ (too large). So the encoder emits the LSB of x, divides by 2 and… now x=3. Well, that’s not in $I_a$ either (too small), and shrinking it even further won’t help that. So at this point, the encoder is stuck; there’s no x it can reach that works.

Proofs (part 2): a sufficient condition for b-uniqueness

So we just saw that in certain scenarios, rANS can just get stuck. Is there anything we can do to avoid it? Yes: the paper points out that the embarrassing situation we just ran into can’t happen when M (the sum of all symbol frequencies, the denominator in our probability distribution) divides L, our normalization interval lower bound. That is, $L=kM$ for some positive integer k. It doesn’t give details, though; so, knowing this, can we prove anything about the $I_s$ that would help us? Well, let’s just look at the elements of $I_s$ and see what we can do:

$I_s = \{ x | C(s,x) \in I \}$

let’s work on that condition:

$C(s,x) \in I$
$\Leftrightarrow L \le C(s,x) < bL$
$\Leftrightarrow L \le M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s) < bL$

at this point, we can use that $L=kM$ and divide by M:

$\Leftrightarrow kM \le M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s) < bkM$
$\Leftrightarrow k \le \lfloor x/F_s \rfloor + (B_s + (x \bmod F_s))/M < bk$

Now, for arbitrary real numbers r and natural numbers n we have that

$n \le r \Leftrightarrow n \le \lfloor r \rfloor \quad \textrm{and} \quad r < n \Leftrightarrow \lfloor r \rfloor < n$

Using this, we get:

$\Leftrightarrow k \le \lfloor \lfloor x/F_s \rfloor + (B_s + (x \bmod F_s))/M \rfloor < bk$

note the term in the outer floor bracket is the sum of an integer and a real value inside $[0,1)$, since $0 \le B_s + (x \bmod F_s) < M$, so we can simplify drastically

$\Leftrightarrow k \le \lfloor x/F_s \rfloor < bk$
$\Leftrightarrow k \le x/F_s < bk$
$\Leftrightarrow kF_s \le x < bkF_s$
$\Leftrightarrow x \in [kF_s:bkF_s)$

where we applied the floor identities above again and then just multiplied by $F_s$. Note that the result is an interval of integers with its (exclusive) upper bound being equal to b times its (inclusive) lower bound, just like we need – in other words, assuming that $L=kM$, all the $I_s$ are b-unique and we're golden (this is mentioned in the paper in section 3.3, but not proven, at least not in the Jan 6 2014 version).

Note that this also gives us a much nicer expression to check for our encoder. In fact, we only need the upper bound (due to b-uniqueness, we know there's no risk of us falling through the lower bound), and we end up with the encoding function

  while (!done) {
// Loop invariant: x is normalized
assert(L <= x && x < b*L);

// Renormalize
x_max = (b * (L / M)) * freq[s]; // all but freq[s] constant
while (x >= x_max) {
writeToStream(x % b);
x /= b;
}

// Encode a symbol
// x = C(s, x);
x = freq[s] * (x / M) + (x % M) + base[s];
}


No “???”s left – we have a “streaming” (finite-precision) version of rANS, which is almost like the arithmetic coders you know and love (and in fact quite closely related) except for the bit where you need to encode your data in reverse (and reverse the resulting byte stream).

I put an actual implementation on Github for the curious.

Some conclusions

This is an arithmetic coder, just a weird one. The reverse encoding seems like a total pain at first, and it kind of is, but it comes with a bunch of really non-obvious but amazing advantages that I’ll cover in a later post (or just read the comments in the code). The fact that M (the sum of all frequencies) has to be a multiple of L is a serious limitation, but I don’t (yet?) see any way to work around that while preserving b-uniqueness. So the compromise is to pick M and L to be both powers of 2. This makes the decoder’s division/mod with M fast. The power-of-2 limitation makes rANS really bad for adaptive coding (where you’re constantly updating your stats, and resampling to a power-of-2-sized distribution is expensive), but hey, so is Huffman. As a Huffman replacement, it’s really quite good.

In particular, it supports a divide-free decoder (and actually no per-symbol division in the encoder either, if you have a static table; see my code on Github, RansEncPutSymbol in particular). This is something you can’t (easily) do with existing multi-symbol arithmetic coders, and is a really cool property to have, because it really puts it a lot closer to being a viable Huffman replacement in a lot of places that do care about the speed.

If you look at the decoder, you’ll notice that its entire behavior for a decoding step only depends on the value of x at the beginning: figure out the symbol from the low-order bits of x, go to a new state, read some bits until we’re normalized again. This is where the table-based versions (tANS etc.) come into play: you can just tabulate their behavior! To make this work, you want to keep b and L relatively small. Then you just make a table of what happens in every possible state.

Interestingly, because these tables really do tabulate the behavior of a “proper” arithmetic coder, they’re compatible: if you have two table-baked distributions with use that same values of b and L (i.e. the same interval I), you can switch between them freely; the states mean the same in both of them. It’s not at all obvious that it’s even possible for a table-based encoder to have this property, so it’s even cooler that it comes with no onerous requirements on the distribution!

That said, as interesting as the table-based schemes are, I think the non-table-based variant (rANS) is actually more widely useful. Having small tables severely limits your probability resolution (and range precision), and big tables are somewhat dubious: adds, integer multiplies and bit operations are fine. We can do these quickly. More compute power is a safe thing to bet on right now (memory access is not). (I do have some data points on what you can do on current HW, but I’ll get there in a later post.)

As said, rANS has a bunch of really cool, unusual properties, some of which I’d never have expected to see in any practical entropy coder, with cool consequences. I’ll put that in a separate post, though – this one is long (and technical) enough already. Until then!