
Atomic operations and contention

Last time, we covered the basics of how cache coherency works. Today, let’s talk about some of the primitives necessary to build useful systems on top of a coherent cache, and how they work.

Atomicity and atomic operations

A crucial building block for all of this is atomic operations. This has nothing to do with nuclear physics and everything to do with the root of the word atom, the Ancient Greek “ἄτομος” (atomos, “indivisible”). An atomic operation is one that cannot be divided into any smaller parts, or at least appears that way for the purposes of other cores in the system. To see why this matters, consider what happens when two cores both try to increment a counter at almost the same time, running the equivalent of the C statement counter++;:

Cycle #   Core 1                   Core 2
   0      reg = load(&counter);
   1      reg = reg + 1;           reg = load(&counter);
   2      store(&counter, reg);    reg = reg + 1;
   3                               store(&counter, reg);

In compiled code, this single statement turns into a load operation, a register increment, and finally a store operation (here written in C-esque pseudocode). These three steps are distinct and will execute in sequence (note that on the micro-architectural level, this is true on x86 as well, even though the instruction set architecture actually features a read-modify-write add [memory], value instruction). And because of this splitting into multiple cycles, it’s possible for Core 2 to read counter after Core 1 has read it (and started incrementing it), but before it has written the result back. The end result is that, even though both cores ran code to increment the counter, the final value of the counter is only incremented by 1; one of the two increment operations was “lost”.

This problem is exactly what atomic operations are there to prevent; if we use an atomic increment (or more generally, atomic add) instead of a regular increment, the active core will make sure that the three steps above (load, add, store) appear to happen as one operation, hence atomic; no other core is allowed to “peek” while the increment is going on.
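To make this concrete (a minimal sketch in C++11, not taken from the post’s original example), here’s the difference between the lossy plain increment and an atomic add:

#include <atomic>
#include <cstdio>
#include <thread>

int plain_counter = 0;               // plain increment: load/add/store, increments can get lost
std::atomic<int> atomic_counter{0};  // atomic add: one indivisible read-modify-write

void worker() {
    for (int i = 0; i < 1000000; ++i) {
        plain_counter++;  // technically a data race (undefined behavior), shown only to illustrate lost updates
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    // atomic_counter always ends up at 2000000; plain_counter usually doesn't.
    std::printf("plain=%d atomic=%d\n", plain_counter, atomic_counter.load());
}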

How atomics are implemented

Now the question is, how is this done? Conceptually, the easiest way to do this is using a locking mechanism: only one core is allowed to execute an atomic operation at any point in time. The core enters the lock before it starts the operation, and leaves it once the operation is complete. This is what the x86 LOCK prefix originally used to mean (approximately; I’m simplifying here). Here, the lock enter operation consists of a message on the bus that says “okay, everyone, back off the bus for a bit” (for our purposes, this means “stop doing memory operations”). Then the core that sent that message needs to wait for all other cores to finish memory operations they’re in the middle of doing, after which they will acknowledge the lock. Only after every other core has acknowledged can the core attempting the locked operation proceed. Finally, once the lock is released, it again needs to send a message to everyone on the bus saying “all clear, you can resume issuing requests on the bus now”.

This works. It is also incredibly slow. x86 CPUs still support this (or an equivalent), but only as an absolute emergency, when-all-else-fails fallback path; they need a fallback because the x86 ISA permits very dubious constructs like unaligned atomic operations that cross multiple cache lines, for backwards compatibility. Other architectures generally just don’t allow atomic operations on values that aren’t naturally aligned, nor on values that are “too big”. These constraints guarantee that a single atomic operation always takes place within a single cache line. And once we have that, we’re in good shape: as we saw last time when discussing cache coherency protocols, inter-core communication synchronizes memory at cache line granularity anyway, so in principle we can do complex modifications to a single cache line and then publish all changes at once by pushing the new cache line. Moreover, the MESI state machine features two states (M and E, “modified” and “exclusive”) that guarantee exclusive ownership of a cache line by one core – while a cache line is exclusively owned, no other core can “peek”. We can use that as substitute for our locking protocol.

So here’s the punchline: in a MESI (or derived) system, all we need to do to make sure an operation touching a single cache line is atomic is to a) make sure we issue the right memory barriers so memory operations are correctly ordered with reference to the surrounding code (see previous post), b) acquire exclusive ownership of the cache line before we read anything, c) only write back the results if we had exclusive ownership for the entire duration of the atomic operation. This guarantees that no other core saw any half-finished data. There’s multiple ways to accomplish c). For example, we can build hardware to make a limited set of atomic operations complete in a single bus clock cycle; if we have exclusive ownership of our cache line by the start of a cycle, we can have our modified data by the end of it. Since a cache line can’t possibly “change hands” within a cycle, this is fast enough. Depending on the bus protocol, we might also start playing games where we respond to coherency messages immediately, but might send the data a bit late if we’re in the middle of an atomic operation. Finally, we can just decide not to play games with timing at all; instead we implement steps b) and c) directly: any cache line that’s being used for an atomic operation is “monitored”, and if some other core looks at our cache line before the atomic operation is complete, we need to start over. This is what leads to the load-link/store-conditional (LL/SC) operations present in most RISC CPUs.
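To make the “start over” idea concrete, here’s a hedged sketch of an atomic add built from a compare-and-swap retry loop in C++ (my example, not code from this post); on LL/SC architectures, compare_exchange_weak typically compiles down to a load-link/store-conditional pair, and a spurious failure corresponds exactly to “another core touched our cache line, retry”:

#include <atomic>

// Sketch: atomic add as a retry loop. compare_exchange_weak reloads 'old' with the
// current value on failure, so the loop simply tries again with fresh data.
int atomic_add(std::atomic<int>& x, int delta) {
    int old = x.load(std::memory_order_relaxed);
    while (!x.compare_exchange_weak(old, old + delta,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // lost exclusive ownership mid-operation (or spurious failure): start over
    }
    return old;
}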

And by the way, on the bus side (and hence to other cores), properly aligned atomic operations don’t look any different than normal memory updates. Again, all of the processing is internal to the core that does it; other cores neither know nor care whether memory was updated from an atomic compare-and-swap operation bracketed by memory barriers or a relaxed store.

This all sounds nice and simple, and conceptually it is, but the devil’s in the details. The bad news is that if you’re a CPU architect, every single detail of this process is of crucial importance; your internal implementation of memory operations needs to avoid starvation (every core that wants to gain exclusive access to a cache line should be able to do so, eventually, no matter what the other cores are doing), and make sure that it’s possible to implement certain fundamental atomic operations without deadlocks or livelocks. It sounds obvious, but these guarantees are not automatic for low-level primitives like atomic operations or LL/SC. This stuff is hard to get right, and CPUs need to have an implementation that’s not just correct, it also needs to be fast for “typical cases” (whatever they are). The good news is that if you’re not working at a company designing CPUs, none of this is your problem, and you can rest assured that somebody else has thought this through, backed up by an army of validation engineers trying very hard to find test cases that break it.

The cost of memory operations

Back on the SW side, let’s assume we’re on a typical CPU architecture and are running code on multiple cores. What’s the cost of a memory operation in that environment? It breaks down into several components:

Execution. Executing a memory operation isn’t free. Suppose for now that only one core is active and running single-threaded code; even then, memory access is still complicated. Programs deal with virtual addresses, but coherent caches and memory buses normally deal exclusively in physical memory addresses. So every memory operation first starts with a virtual to physical address conversion (these translations are themselves cached in what’s commonly called the Translation Lookaside Buffer, or TLB). If you’re unlucky, that virtual address isn’t currently mapped to physical memory and needs to be brought in from storage; whenever this happens, the OS is going to schedule another thread on the active core for a while, since IO takes a long time in processor terms. But let’s assume that doesn’t happen here.

With the physical address known, the operation starts to go through the memory hierarchy. For example, to complete a memory load, the relevant data normally needs to be brought into the L1 cache first. If it’s not there yet, this can be a multi-step process that – in the worst case – involves a real memory access and then populating all the intermediate cache levels for the relevant cache line. With poor (i.e. not nicely localized) memory access patterns, waiting for cache levels to get populated is one of the main ways a CPU core spends its time. But for now, let’s assume that doesn’t happen (too often) either.

So how fast can we run memory operations if everything goes well? Pretty fast, it turns out. Usually at least one operation (load or store) completes per clock cycle, often more than that. Reasonably cache-friendly code will complete billions of memory operations per second on a single 3GHz core.

Memory barriers and atomic read-modify-write operations. For the next step, let’s suppose we’re running code that’s intended for multi-threaded operation, but we’re still doing so on only a single core. As a result, we will now see memory barriers and atomic operations, but no actual interference from another core; let’s just suppose that all relevant cache lines are already exclusively held by our own core. In that situation, how expensive is, say, updating a reference count using an atomic integer addition?

Well, that really depends on the CPU core. In general, micro-architectures with aggressive reordering of memory operations have more expensive memory barriers and atomic operations than those with only slight reordering or in-order execution of memory operations. For example, incrementing a reference count using LOCK INC [mem] on an Intel Atom core (an in-order design) has essentially the same cost as a regular INC [mem] instruction, and somewhat more complicated atomic operations like exchange or exchange-add end up costing about 2x to 3x as much as a “vanilla” memory read-modify-write instruction. By contrast, on Intel and AMD’s out-of-order x86 desktop parts, an atomic increment has about 10x-25x the cost of the non-atomic version; that’s the cost of ensuring proper memory ordering. And again, to reiterate: this is still on code that is executing single-threaded. There’s no actual cross-core communication going on yet; this extra cost is incurred purely within a single core, to make the code safe for multi-core execution.

Bus traffic and cache coherency. Some percentage of memory accesses actually misses the cache and goes straight to memory; and once cache lines that haven’t been used in a while get evicted, we start getting write-backs. All these events cause bus traffic (and memory traffic). Bus and memory bandwidth is a limited resource; as we start saturating their capacities, things start to get slower.

Moreover, once we switch to running multiple threads of our program on multiple cores, we actually start getting cache coherency traffic, as the cores continually synchronize their respective views of memory. If every thread works on its own independent memory region, this doesn’t really do much; if a given region of memory is only used by one core, then there’s no sharing, and getting exclusive ownerships of one of the corresponding cache lines is easy and doesn’t cause any invalidations elsewhere.

By contrast, if two or more cores frequently access the same cache lines, then these cache lines need to be kept synchronized. Updates to one of these cache lines require exclusive ownership, which means all other cores need to invalidate their copies of that cache line first. As a result, the next time that cache line is accessed by another core, its contents need to be sent across the bus. So we get both extra cache misses (on the remote core) and extra bus traffic. This phenomenon of multiple cores hitting a cache line that is being updated regularly is called “cache (line) contention”, and it is probably the easiest way to make parallel code in shared-memory environments crawl.

Cache line contention

To get cache line contention, we need multiple cores frequently accessing the same cache line, with at least some of these regular accesses being writes. Private data (cache lines only accessed by one thread) is never a problem; neither is immutable data (written once and then not modified until the end of its lifetime). What’s tricky is data that is both shared and mutable: doing so requires a lot of communication to maintain a consistent (as per the constraints of the memory model) view of memory between cores, and communication is expensive – and only keeps getting more so as more parties get involved.

How much more expensive are we talking? I wrote a test (for x86/Windows) a few weeks ago. This test is by no means user-friendly or easy to read, and it assumes a quad-core CPU with simultaneous multi-threading, or a similar topology, with at least 8 logical processors. Here’s the gist of it: as explained above, replacing a read-modify-write add of a value in memory with an atomic “add” operation generally makes it about 10x-25x as expensive (how much exactly depends on the micro-architecture). If you need a rule of thumb, just assume about 10x (good enough for Fermi estimation).

Once there is a second core reading that cache line, though, the cost explodes. If there’s another core generating lots of read traffic on the cache line in a tight loop, the atomic add gets more expensive – much more expensive: typically, 4x to 6x more (that’s on top of the ~10x hit we take from using an atomic add to begin with!). And this cost only goes up if there are more readers, and even more so if there are other writers too. Now, please don’t take these values as gospel; this is a completely synthetic benchmark that doesn’t do anything useful. The actual cost of coherency traffic very much depends on context. All I want to do here is give you a very rough feel for the cost of coherency traffic and the communication it does (namely: it’s not negligible).
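If you want a rough feel for these numbers on your own machine, a measurement along these lines works (a hedged, portable C++ sketch, not the x86/Windows test mentioned above): time a tight loop of atomic adds, once alone and once while a second thread generates read traffic on the same cache line.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<long> counter{0};
std::atomic<bool> stop{false};

void reader() {  // generates read traffic on the contended cache line
    while (!stop.load(std::memory_order_relaxed))
        (void)counter.load(std::memory_order_relaxed);
}

int main() {
    const long kIters = 100000000;
    std::thread t(reader);  // remove this thread (and its join) to get the uncontended baseline
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < kIters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    std::chrono::duration<double, std::nano> elapsed = std::chrono::steady_clock::now() - start;
    stop.store(true);
    t.join();
    std::printf("%.2f ns per atomic add\n", elapsed.count() / kIters);
}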

Some of this communication is not actually necessary. For example, because cache coherency is tracked at cache line granularity, it’s possible to get lots of bogus coherency traffic simply because the different types of data – immutable, private and shared – are intermingled within the same cache line (or similarly, because one cache line contains private data for multiple threads). This is called “false sharing”. Luckily, this kind of problem is fairly easy to find with a profiler, and also relatively straightforward to fix by either reordering the data in memory (possibly while adding padding to make sure two different kinds of data don’t end up in the same cache line) or just removing some of the offending data entirely. My older post “Cores don’t like to share” gives an example.
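The usual fix looks something like this hedged sketch (assuming 64-byte cache lines; C++17 also offers std::hardware_destructive_interference_size for this): pad per-thread data so each thread’s hot counter lives in its own cache line.

#include <atomic>
#include <thread>
#include <vector>

// Without the alignas, the four counters below would share one or two cache lines,
// and every increment would invalidate the other threads' copies (false sharing).
struct PaddedCounter {
    alignas(64) std::atomic<long> value{0};  // 64 = assumed cache line size
};
// alignas on the member makes sizeof(PaddedCounter) a multiple of 64, so each
// array element sits in its own cache line.

PaddedCounter per_thread_counters[4];

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([t] {
            for (int i = 0; i < 1000000; ++i)
                per_thread_counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
}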

What’s left over after this process is “real” contention – contended access to shared data. This includes both actual shared mutable data structures and certain kinds of metadata such as locks and other synchronization objects. Exactly how well this works depends on the precise layout of data in memory, as well as the operations used to access it.

In general, the way to get scalable multi-processor code is to avoid contention as much as possible, and to make whatever contention remains pass quickly – in that order. And to do a decent job at this, it’s important to know how cache coherency works (in broad strokes, anyway), what kind of messages cores exchange to maintain memory coherency, and when that coherency traffic happens. Now that we have those basics covered, we can look at somewhat higher levels of the stack. This post is long enough already, but in the next post, I’m planning to have a look at locks and lock-free data structures, and discuss some of the trade-offs. Until then, take care!

Cache coherency primer

I’m planning to write a bit about data organization for multi-core scenarios. I started writing a first post but quickly realized that there’s a bunch of basics I need to cover first. In this post, I’ll try just that.

Caches

This is a whirlwind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)

In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.

The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there’s generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to an L2 cache – which in turn talks to memory. Or maybe to an L3 cache. You get the idea.

Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.
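To pin down what “suitably aligned group of bytes” means, here’s a tiny sketch (assuming 64-byte lines, just one of the common sizes from the list above): the line an address belongs to is found by clearing the low address bits.

#include <cstdint>
#include <cstdio>

constexpr std::uintptr_t kLineSize = 64;  // assumed cache line size in bytes (must be a power of two)

// Start address of the cache line containing addr: clear the low log2(kLineSize) bits.
std::uintptr_t cache_line_base(std::uintptr_t addr) {
    return addr & ~(kLineSize - 1);
}

int main() {
    int x = 0;
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(&x);
    std::printf("address %#llx lies in the 64-byte line starting at %#llx\n",
                (unsigned long long)a, (unsigned long long)cache_line_base(a));
}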

When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.

And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the

Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.

Things get a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.

Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which point their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.

Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.

In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).

Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.
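As a toy model (just to pin down the two store policies, in no way how real hardware is organized), the difference boils down to where a store goes and when:

#include <cstdint>
#include <unordered_map>

struct Line { std::uint64_t data[8] = {}; bool dirty = false; };

struct Cache {
    std::unordered_map<std::uint64_t, Line> lines;  // keyed by line address
    bool write_through;

    void store(std::uint64_t line_addr, const Line& new_contents) {
        if (write_through) {
            write_to_next_level(line_addr, new_contents);  // pass the store on immediately
            if (lines.count(line_addr))
                lines[line_addr] = new_contents;           // keep our copy in sync (or just drop it)
        } else {
            Line& l = lines[line_addr];  // write-back (with write-allocate): modify only the local copy
            l = new_contents;
            l.dirty = true;              // remember that memory is now stale
        }
    }

    void evict(std::uint64_t line_addr) {
        auto it = lines.find(line_addr);
        if (it == lines.end()) return;
        if (it->second.dirty)
            write_to_next_level(line_addr, it->second);  // dirty lines must be written back first
        lines.erase(it);
    }

    void write_to_next_level(std::uint64_t, const Line&) { /* next cache level or memory */ }
};

int main() {
    Cache wb{{}, /*write_through=*/false};
    wb.store(0x1000, Line{});  // flags the line dirty, no traffic yet
    wb.evict(0x1000);          // the write-back happens here
}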

Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.

Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.

Coherency protocols

As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?

Well, the answer is quite simple: nothing happens. And that’s bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system.

Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.

This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.

There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based systems, has higher latency but scales better to systems with lots of cores).

The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.

With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there are write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol.

MESI and friends

This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:

  • Invalid lines are cache lines that are either not present in the cache, or whose contents are known to be stale. For the purposes of caching, these are ignored. Once a cache line is invalidated, it’s as if it wasn’t in the cache in the first place.
  • Shared lines are clean copies of the contents of main memory. Cache lines in the shared state can be used to serve reads but they can’t be written to. Multiple caches are allowed to have a copy of the same memory location in “shared” state at the same time, hence the name.
  • Exclusive lines are also clean copies of the contents of main memory, just like the S state. The difference is that when one core holds a line in E state, no other core may hold it at the same time, hence “exclusive”. That is, the same line must be in the I state in the caches of all other cores.
  • Modified lines are dirty; they have been locally modified. If a line is in the M state, it must be in the I state for all other cores, same as E. In addition, modified cache lines need to be written back to memory when they get evicted or invalidated – same as the regular dirty state in a write-back cache.

If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.
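As a rough sketch (a toy model of the local decision only, not how the hardware is actually structured), the write path just described looks like this:

enum class LineState { Invalid, Shared, Exclusive, Modified };

// Toy model: state of our copy of a cache line before and after a local write.
LineState write_line(LineState state) {
    if (state == LineState::Shared || state == LineState::Invalid) {
        // Not exclusively owned: broadcast an "I want exclusive access" request.
        // Every other core invalidates its copy; starting from Invalid we also
        // need the current contents sent to us. Only then may the write proceed.
        state = LineState::Exclusive;
    }
    // Holding the line in E or M state means no other core has a copy,
    // so the write can't conflict; afterwards the line is dirty.
    return LineState::Modified;
}

int main() {
    LineState s = LineState::Shared;
    s = write_line(s);  // S: needs a round trip on the bus first
    s = write_line(s);  // M: already exclusively owned, no bus traffic
    (void)s;
}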

Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.

The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:

Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions.
Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).

And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the

MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other cores’ caches.

Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.

Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.

I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?

Memory models

Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.

I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.

So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and waits for it to complete before sending the next one. And of course, in practice modern CPUs normally do none of these things:

  • Caches do not respond to bus events immediately. If a bus message triggering a cache line invalidation arrives while the cache is busy doing other things (sending data to the core for example), it might not get processed that cycle. Instead, it will enter a so-called “invalidation queue”, where it sits for a while until the cache has time to process it.
  • Cores do not, in general, send memory operations to the cache in strict program order; this is certainly the case for cores with Out-of-Order execution, but even otherwise in-order cores may have somewhat weaker ordering guarantees for memory operations (for example, to ensure that a single cache miss doesn’t immediately make the entire core grind to a halt).
  • In particular, stores are special, because they’re a two-phase operation: we first need to acquire exclusive ownership of a cache line before a store can go through. And if we don’t already have exclusive ownership, we need to talk to the other cores, which takes a while. Again, having the core idle and twiddling thumbs while this is happening is not a good use of execution resources. Instead, what happens is that stores start the process of getting exclusive ownership, then get entered into a queue of so-called “store buffers” (some refer to the entire queue as “store buffer”, but I’m going to use the term to refer to the entries). They stay around in this queue for a while until the cache is ready to actually perform the store operation, at which point the corresponding store buffer is “drained” and can be recycled to hold a new pending store.

The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:
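The classic store-buffering litmus test shows this directly. In the hedged C++ sketch below (relaxed atomics standing in for “no barriers”; this is the textbook example, not code from this post), both threads can end up with r1 == 0 and r2 == 0 – an outcome that’s impossible under sequential consistency, but allowed once each store sits in a store buffer while the following load already completes:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // may linger in the store buffer...
    r1 = y.load(std::memory_order_relaxed);  // ...while this load already completes
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // r1 == 0 && r2 == 0 is a permitted outcome here; making the operations
    // memory_order_seq_cst (or adding full barriers between store and load) rules it out.
    std::printf("r1=%d r2=%d\n", r1, r2);
}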

Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.

Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking“. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post. The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.

So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.

Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!

Bindless chain letter

I wanted to comment on Timothy Lottes’ post on Bindless and descriptors (read that first!) and the comment field was too small, so here goes.

Some questions.

General: You’re implicitly assuming there’s lots of different types of resource tables. Why?

UPDATE: It seems like Timothy is talking about D3D12-esque Descriptor Tables. Well, I’m not :), so yeah, we’re kinda talking past each other here, but here’s my response anyway.

Essentially, GL bindless just means “one big resource table that contains everything”. And you could certainly implement it as such: one large resource table (multi-megabyte, potentially) that contains descriptors for absolutely everything currently live, with the handles being 32-bit offsets from the base address; nor do I see an intrinsic need for resource tables to be homogeneous in terms of either resource type or update frequency (it might be advantageous in certain cases, but I don’t see why it would be *required*).

NV GL bindless: what are the handle values in “TEX handle”? Presumably, not a 7-bit register index. I’m assuming that it boils down to a “[bindlessTable + handle]” addressing mode, either explicit (i.e. this is a memory load) or implicit (HW is aware of handles and has a dedicated fetch path for them)? I couldn’t see any details in the documents you linked. Anyway, presumably there’s still resource descriptors somewhere. Where is that “somewhere”? How does the HW know where to get them from?

NV with resource tables: why are you billing the handle table address LDC per texture sample? Unless there’s very high register pressure, you would only load the resource table pointers once per table, not once per resource access. And given my previous comment about “I don’t see why you need tons of these”, that’s really a constant per-shader cost of a few regs and a few loads; it’s not per texture sample at all.

All AMD: you can’t load the texture/sampler descriptors directly, you still need to know where to load them from (unless you use few enough resources that you can squeeze all descriptors into the 16 scalars). There’s at least one more scalar load to get a base pointer to the resource descriptors for the active draw call. Like the resource table pointer loads I just mentioned, this is amortized (just do that load once and leave the scalar reg around).

For GL bindless mode, you would presumably use a single global descriptor table that all your handles point into, and would preload the base address to that table using the 16 scalars you get to set per draw. This makes the basic model:

AMD DX “bindful”:

// sBatchDesc is one of the 16 pre-init 
S_LOAD_DWORD_X4 sTexDesc, sBatchDesc(16) // random offset
S_LOAD_DWORD_X4 sSmpDesc, sBatchDesc(48) // ...
SAMPLE ...

AMD GL “bindful”:

S_LOAD_DWORD_X8 sTexAndSmp, sBatchDesc(32)
SAMPLE ...

AMD GL bindless:

// sBindlessTable also one of the 16 pre-init
S_BUFFER_LOAD_DWORD sTexHandle, sConstantBuf(12)
S_LOAD_DWORD_X8 sTexAndSmp, sBindlessTable(sTexHandle)
SAMPLE ...

(explicit handle load here, “bindful” is cheaper on the shader side but pays that cost elsewhere by having to set up the per-batch descriptors)

and

AMD resource tables, separate tex/smp.

// let's say we have 2, sResourceTableImmutable
// and sResourceTableDyn, both pre-init
S_LOAD_DWORD sTexHandle, sBatchDesc(24)
S_LOAD_DWORD_X4 sTexDesc, sResourceTableDyn(sTexHandle)
S_LOAD_DWORD sSmpHandle, sBatchDesc(28)
S_LOAD_DWORD_X4 sSmpDesc, sResourceTableImmutable(sSmpHandle)
SAMPLE ...

or we could have a combined texture/sampler handle if we wanted, etc.

The big difference here is that the batch descriptors (sBatchDesc) only hold texture handles (=offsets into their respective resource tables), not descriptors themselves. In the “bindful” case, there’s a conceptual global table too (the bind points), but it keeps changing between batches, which makes things tricky, and forces you to deal with multiple live versions at the same time and/or copy descriptors around. The resource table model (which bindless is a special case of!) has them stay constant and immutable over the lifetime of a command buffer, which gets rid of that cost. GL bindless is similar.

Another note on the AMD resource table version: I’m loading texture/sampler handles here separately for clarity, but that’s not at all required. A shader that references multiple contiguous slots from its sBatchDesc can (and probably should) fuse them into larger pow2-sized S_LOAD_DWORDs. In my example, you could just as well do:

AMD resource tables + “fusing”:

// sTexHandle = sHandles, sSmpHandle = sHandles + 1 (register IDs)
S_LOAD_DWORD_X2 sHandles, sBatchDesc(24)
S_LOAD_DWORD_X4 sTexDesc, sResourceTableDyn(sTexHandle)
S_LOAD_DWORD_X4 sSmpDesc, sResourceTableImmutable(sSmpHandle)
SAMPLE ...

So really, not all that much (cost) difference between the different “bindless-esque” approaches here that I can see.

FSE/ANS history correction

I’ve been meaning to write another proper post on this for a while, but the last few months have been very busy and I didn’t feel like writing in my spare time.

This is not that post, sadly. This is about something else: namely, me only citing Jarek Duda’s work on ANS and not Yann Collet’s work on FSE. Apparently, there were multiple versions of Jarek’s ANS paper, and the second version (which contains rANS, the topic I’ve been writing about) was significantly influenced by Yann’s experiences with integrating ideas from the first version into FSE.

Anyway, I did not mention Yann’s work in my rANS posts at all. I just want to make clear that this was because I was writing about rANS not tANS (the family that FSE is a member of), and I simply wasn’t aware that Yann’s work and input significantly influenced the second version of the ANS paper. My apologies; this was a simple oversight, not a deliberate attempt to talk down Yann’s contribution!

On a separate but related note: Mid-February, I wrote a short paper on how entropy coders (with a focus on ANS) can be interleaved on the encode side to allow the decoder side to exploit instruction-level parallelism and/or SIMD instructions. I originally meant to do a separate post about it here, but on trying to write it, I discovered that I didn’t have much to say on the topic that wasn’t in the paper. Hence, no separate blog post. But I figured I should at least link to it once from here.

Anyway, more regular blog updates should start again soon. Until then!

Networks all the way down, part 2.

In the previous post, I talked a bit about how essentially anything we would call a computer these days is in fact, for all practical purposes, a heterogeneous cluster made up of various specialized smaller computers, all connected using various networks that go by different names and are specified in different standards, yet are all suspiciously similar at the architecture level; a fractal of switched, packet-based networks of heterogeneous nodes that make up what we call a single “computer”. Which we then promptly connect with other computers to form home networks, work networks, and ultimately the internet.

So far, that is just an observation. But does it mean anything? Well, sort of. For one thing, it means that all the network security problems that plague inter-computer networking also exist within computers themselves. Just as you can own a machine from outside using a network-based attack (ultimately delivered via Ethernet frames), you can own it using exploits delivered via USB packets, or eSATA, or Thunderbolt. This is not a particularly bad security nightmare – all these attacks require physical access, and once you have physical access to a machine you have all kinds of ways of working around security mechanisms anyway – but it does mean that people trying to sell you hardware that you aren’t allowed to run your own code on have a pretty large attack surface to worry about.

That said, as far as implications of “everything is a computer running a network stack” go, “there are going to be implementation issues” is not exactly the most exciting place to end up at. Things get substantially more interesting once we approach the same underlying truth from another angle, though. To do so, let’s pick a concrete example. Being an affluent male geek, I socialize with a bunch of other affluent male geeks, a lot of whom store enough data (or enough computers) at home to require a NAS (Network Attached Storage) device to house terabytes of (doubtlessly 100% legally acquired!) data and serve it to the multitude of desktop computers, tablets, phones and media devices they keep in their home. Now, these boxes are essentially marketed as a bunch of drive enclosures with a network plug and a bunch of added value features; RAID, video streaming (and sometimes live transcoding as well), that kind of thing. Since they’re essentially storage appliances, most people would view them as somewhere between an external hard drive enclosure and a proper computer. Of course, there’s no such thing as “half a computer”, and these devices are typically just small computers with a NIC and a few hundred megabytes to a few gigabytes of RAM, usually running some variant of Linux. They could be used for all kinds of computing; they just happen to be specialized in a way that makes them more suitable for some tasks than others.

That’s not why I bring this up, though; I bring it up because from an end user’s point of view, a NAS is essentially just a disk drive you can plug into multiple machines simultaneously, using an Ethernet plug instead of a SATA plug. Now from the computer’s (and OSes) point of view, a local disk and a NAS are very different things; at that level, they are speaking very different languages (namely, a block device with a local file system on it, versus a remote network filesystem such as CIFS or NFS). But to most applications running on top of the OS (and by extension most users that use them), the distinction between a local drive and a mounted network share only really becomes apparent when the abstraction breaks.

Interestingly, we see the same pattern today even when we zoom in to the local storage connected to a computer. If you were to get a 2004 SATA hard drive, it would very likely be exactly that – still a “dumb” hard drive, possibly with a bit of cache, a microcontroller to talk SATA to the host and really address the disk, plus a fair amount of (possibly programmable) DSP circuitry to actually handle the encoding of bits into and decoding of bits from magnetic flux patterns, because doing this is way more involved – and way less discrete – than you probably think. So even in 2004, we already have a considerable amount of essentially invisible compute power being thrown at a problem that the higher-up abstraction layers gloss over; but at least all this processing still essentially respects the underlying abstraction of a block device. For a 2014 USB thumb drive, that’s not necessarily true anymore either: they do things like wear leveling, remapping block addresses to distribute write loads evenly over the device (and hence prevent some blocks aging much faster than others). These processes often involve parsing the file system itself (at least if it’s FAT or a variant). Think about this for a second: a storage device that, by itself, understands the file system that is stored on it, and automatically rearranges data to maximize its expected lifetime. This level of autonomy is less than a NAS transcoding videos on the fly, but it’s still a far cry from the truly “dumb” disk and tape drives our reigning block device abstractions were designed for. And we’re not talking about some expensive high-availability low-volume server technology here; this is a storage technology that is both cheap and ubiquitous.

That brings up another point: Flash memory technology is a vastly different beast from magnetic recording technology. And yet, all of those differences get hidden and papered over at a real low level; we could have taught OSes what Flash devices are, how they work, and declared them as a separate device family, with separate file systems that handle the wear leveling on the software side. Yet this is not what happened; evidently, the path of least resistance was to add dedicated hardware and software to every single Flash storage device that makes it look like a hard drive, even though it works very differently, with very different trade-offs.

This is not an isolated incident: if you go over the list of peripherals from last time, you’ll find something similar for every single one of them. Ethernet technology has essentially been completely redesigned between 1989 and 2014, yet the only part of the Ethernet standard that’s both device-independent and directly visible to software – namely, Ethernet frames – has remained essentially unchanged (except for the addition of jumbo frames, which are still not widely used). SAS and SATA may use a very different encoding and topology than their parallel counterparts, but from a software perspective the changes are evolutionary not revolutionary. Regular PCI and PCIe are essentially the same from the perspective of the OS. And so forth. Probably the most impressive example of all is the x86 architecture; a modern x86 processor can (and will) go to substantial lengths to pretend that it’s really just a fancy 8086, a 286, a 386, or a slightly extended AMD Athlon 64, depending on the operating mode; the architecture and implementation of current x86 processors is nothing like the original processors it’s emulating, but while the implementations may change substantially over time, the interfaces – protocols, to stay within our networking terminology – stay mostly constant over large time scales, warts and all.

Note how much this runs counter to the common attitude that software is easy to change, and hardware difficult. This might be true for individuals; but across computing as a whole, it’s missing the point entirely. The more relevant distinction is that implementations are easy to change, and protocols difficult. And once a protocol (or API in software terms) is firmly established and widely supported, it’s damn near impossible to get rid of. If that means that USB thumb drives have to pretend they are hard disks, and need the capability to understand the file system that’s stored on them to do a good job; if that means that Ethernet devices have to pretend that it’s 1989 and all nodes on the network are connected to one wire and identified by their MAC address, rather than connected by point-to-point links via network switches that understand IP just fine; if that means that “x86” processors have to pretend they’re not actually out-of-order cores with an internal RISC-ish instruction set, hundreds of registers and sophisticated “decoding” (really, translation) engines; then so be it. That is the price we pay to keep things more or less in working order – except for that occasional instance when all our sophisticated abstractions break, the underlying reality shines through, and that mounted network share is revealed to not actually be on the local machine at all, by producing errors that no app knows how to handle.

Just keep hitting refresh. It’ll be back in a minute.

Networks all the way down.

I first started playing around with PCs about 25 years ago, so around 1989/1990 (and began coding maybe a year later). At the time, there was a large variety of different connectors, plugs and buses in use: keyboards used DIN or PS/2 plugs, a point-to-point link with a custom protocol. Mice and modems (if you had them!) were typically connected using RS-232, a point-to-point serial link. Your monitor was connected using a VGA connector, a point-to-point link carrying analog component video (RGBHV). Printers were connected using the parallel port, a point-to-point parallel link (initially 8 bits host->device unidirectional with some unidirectional and bidirectional status bits; later, there were also modes with a bidirectional link on the 8 data bits, used to connect devices like scanners). The PC had an extension bus, now called ISA, that you could use to plug in peripherals such as video cards, floppy controllers, hard disk controllers, SCSI controllers, sound cards and even memory expansions. On the storage side, floppy drives were on a parallel link (not sure if it qualifies as a bus or not) that could multiplex two floppy drives onto one wide ribbon cable. Consumer hard drives typically spoke what is now called PATA (parallel ATA) but was then just called ATA or IDE; essentially just a thinly wrapped ISA bus protocol (these drives already had their controllers on-board). As such, ATA/IDE was a parallel bus (more ribbon cables). There was also SCSI; on PCs, it was considered the “high-end” storage option, on other platforms it was the default. Either way, SCSI was another parallel bus. Finally, if you had an Ethernet card (and in 1990 you probably didn’t in your home machine, because there was no point – it was quite likely the only computer in your household, or in a half-mile radius for that matter), then at the time that meant one long coaxial cable (“bus topology”), T-plugs to plug in machines, and terminators at both ends of the cable to prevent reflections. Ethernet at the time already sent packets (not the IP kind, Ethernet packets, properly called “frames”) over a high-speed (10Mbit/s!) serial link, relying on collision detection in lieu of the more complicated arbitration techniques its competitors used at the time.

Okay, so much for the history lesson. Now let’s fast-forward 25 years. How does the situation look today?

We are still using “Ethernet”, but 2014 Ethernet has very little in common with its 1989 counterpart other than the name. It still uses serial links, but as of 10BASE-T it uses a star topology (using switches and hubs) rather than a bus, and with the introduction of Gigabit Ethernet, the whole “collision detection” business fell by the wayside too; Gigabit Ethernet is a bunch of point-to-point links connected by switches (active devices, instead of hubs which are mostly passive), and hence there’s no physically shared medium anymore; there’s explicit switching. So: serial point-to-point links, packet based, switching. The same tech (albeit at different scaling levels) is used in your home, at your workplace, at your ISP, and in major internet exchanges/peering spots.

Extension cards used to be on the ISA bus; in 2014, it’s all PCI Express, and while it’s still called a “bus”, it hasn’t been one ever since the “Express” part got added. PCIe uses serial not parallel links, the connectivity is point-to-point instead of a shared medium, with arbitration by the “root complex” (which logically sits at the center of the bus, making this a star instead of a bus topology), oh and communication is packet-based. So PCIe uses serial point-to-point links, is packet-based, and uses switching.

On the storage front, PATA was succeeded by SATA. SATA is, as the name says, serial instead of parallel. The cables, which used to form a bus (with each cable typically allowing up to two devices to be connected to the controller) now are used for single point-to-point links between devices and the controller. Oh, and there is a more explicit framing protocol based around packets. In other words, SATA has a star topology of serial point-to-point links and is packet-based.

Of course, SCSI didn’t disappear off the face of the earth either; it turned into Serial Attached SCSI (SAS). SAS keeps the SCSI command set, but as the name suggests, links are now serial. Oh, and – interrupt me if this sounds familiar – they’re now point-to-point, with everything connected either directly to the “initiator”, or indirectly through one or more “expanders” (switches), meaning SAS has a tiered-star topology. I’m not sure whether or not SAS is internally packet-based though, but I have my suspicions :) [can anyone confirm this?]. Oh, and SAS can also tunnel SATA streams, because why not.

Keyboards, mice, and printers are now all on USB. USB stands for “Universal Serial Bus”; the first two words of that are accurate (it is pretty much universally used at this point, and it is serial), but the last one is a blatant lie. USB may pretend to be a bus, but it really is a packet-based network with a tiered-star topology. Serial point-to-point links, possibly with hubs in between, and all packet-based. And if you happen to attach a USB storage device such as a thumb drive or external hard disk, well, USB storage really just wraps SCSI commands in USB packets – and these SCSI commands may in turn wrap ATA commands.

And what about displays? Well, if you use DVI or HDMI, you still have a dedicated point-to-point link, much as in the VGA days, although it’s now all digital. But if you happen to be using the newer DisplayPort, well, that’s a – can you guess? – serial, packet-based point-to-point link protocol. Which can also carry USB across the link, by the way. Oh, and there’s also Thunderbolt, which can multiplex both DisplayPort and PCIe packets over its serial point-to-point links. Among other things, this has fun consequences such as allowing your monitor or TV (or anything else plugged into your display connector) full freaking bus master DMA access – a killer feature if I’ve ever seen one! – and, just as fun, finally gives you the opportunity to have a storage device that talks ATA through SCSI by way of USB over DisplayPort inside Thunderbolt, because we can.

Okay, so pretty much every single pluggable link interface that used to be in 1989 PCs has been replaced by what are, effectively, all very similar spins on the same underlying network technology (and they really are all networks, albeit some with funky protocols). Is that the end of it?

Well, no, not quite. Because we now have multiple CPUs, too. And guess how they talk to each other? Well, in the case of PCs, there’s two primary options: AMD’s HyperTransport and Intel’s QuickPath Interconnect (QPI). Needless to say, both of these are packet-based, serial, point-to-point links, typically used to communicate between CPU cores (or CPU sockets anyway).

So what’s my point here? Well, logically, these CPUs appear to share the same memory, but physically it’s more common at this point to have each CPU (or at least each socket) own a certain fraction of physical memory that it’s connected to directly; the rest, it can only access by asking the other socket (=NUMA). And then your hard drive (or SSD), which speaks SCSI or ATA or both, needs a controller to understand these protocols – often programmable. And, in the case of hard drives, usually also a DSP to decode/encode the stuff that’s on disk. And your GPU also has its own memory, and a wide array of vector processors that is (at this point) easily Turing-complete. And usually several other fairly general-purpose blocks you don’t even know about. In fact, for each general-purpose processor (with its own memory) in your system that you know about, there are probably 2 or 3 that you don’t. There are tiny ARM cores absolutely everywhere, sometimes where you’d least expect them, and Intel likes to use embedded 486s for the same kind of thing (they’re so small by now, you literally need a microscope to even see them).

We don’t really tend to think of them as such, but not only are all kinds of peripherals on network links these days, they also have general-purpose CPUs, often with substantially more processing power (and memory!) than the 286 @ 10MHz I learned programming on in 1990. And this is all still just talking about PCs – I haven’t even started on things like cars or planes.

So what’s my point? Well, just to say that what we think of as “computers” today are, in practice, already fairly large, heterogeneous clusters of different specialized smaller computers, and while only a small part of these computers is visible to the user, there’s a bunch of them lurking just below the surface. This goes not just for PCs, but also phones and tablets. And of course, absolutely everything you plug into a computer these days is – by itself – another computer. Even stuff you think of as “passive” components, like batteries and thumb drives.

And they’re all talking to each other over networks, most of which are starting to look pretty damn similar at this point – and with greatly varying degrees of implementation quality, eliciting both awe that anything works at all and dread at how easy it all is to exploit. So let me conclude by formulating an analogue to Zawinski’s Law of Software Envelopment:

Every data link standard converges to serial point-to-point links connected in a tiered-star topology and transporting packets. Those link standards which cannot so converge are replaced by ones which can.

rANS with static probability distributions

In the previous post, I wrote about rANS in general. The ANS family is, in essence, just a different design approach for arithmetic coders, with somewhat different trade-offs, strengths and weaknesses than existing coders. In this post, I am going to talk specifically about using rANS as a drop-in replacement for (static) Huffman coding: that is, we are encoding data with a known, static probability distribution for symbols. I am also going to assume a compress-once decode-often scenario: slowing down the encoder (within reason) is acceptable if doing so makes the decoder faster. It turns out that rANS is very useful in this kind of setting.

Review

Last time, we defined the rANS encoding and decoding functions, assuming a finite alphabet \mathcal{A} = \{ 0, \dots, n - 1 \} of n symbols numbered 0 to n-1.

C(s,x) := M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
D(x) := (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).

where F_s is the frequency of symbol s, B_s = \sum_{i=0}^{s-1} F_i is the sum of the frequencies of all symbols before s, and M = \sum_{i=0}^{n-1} F_i is the sum of all symbol frequencies. Then a given symbol s has (assumed) probability p_s = F_s / M.

Furthermore, as noted in the previous post, M can’t be chosen arbitrarily; it must divide L (the lower bound of our normalized interval) for the encoding/decoding algorithms we saw to work.

Given these constraints and the form of C and D, it’s very convenient to have M be a power of 2; this replaces the divisions and modulo operations in the decoder with bit masking and bit shifts. We also choose L as a power of 2 (which needs to be at least as large as M, since otherwise M can’t divide L).
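
To make that concrete, here’s a rough sketch of what a single decode step looks like once M = 1 << M_BITS. The names M_BITS, freq[], cum_freq[] and sym_from_slot() are mine, standing in for M, the F_s and B_s tables, and the slot-to-symbol mapping s(x \bmod M) (which we’ll get to below):

  uint32_t xM = x & (M - 1);                        // x mod M -> bit mask
  uint32_t s  = sym_from_slot(xM);                  // s(x mod M), see "Symbol lookup" below
  x = freq[s] * (x >> M_BITS) + xM - cum_freq[s];   // F_s*floor(x/M) + (x mod M) - B_s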

This means that, starting from a reference probability distribution, we need to approximate the probabilities as fractions with common denominator M. My colleague Charles Bloom just wrote a blog post on that very topic, so I’m going to refer you there for details on how to do this optimally.

Getting rid of per-symbol divisions in the encoder

Making M a power of two removes the division/modulo operations in the decoder, but the encoder still has to perform them. However, note that we’re only ever dividing by the symbol frequencies F_s, which are known at the start of the encoding operation (in our “static probability distribution” setting). The question is, does that help?

You bet it does. A little-known fact (amongst programmers who aren’t compiler writers or bit-hacking aficionados, anyway) is that division of a p-bit unsigned integer by a constant can always be performed as fixed-point multiplication with a reciprocal, using 2p+1 bits (or less) of intermediate precision. This is exact – no round-off error involved. Compilers like to use this technique on integer divisions by constants, since multiplication (even long multiplication) is typically much faster than division.

There are several papers on how to choose the “magic constants” (with proofs); however, most of them are designed to be used in the code generator of a compiler. As such, they generally have several possible code sequences for division by constants, and try to use the cheapest one that works for the given divisor. This makes sense in a compiler, but not in our case, where the exact frequencies are not known at compile time and doing run-time branching between different possible instruction sequences would cost more than it saves. Therefore, I would suggest sticking with Alverson’s original paper “Integer division using reciprocals”.

The example code I linked to implements this approach, replacing the division/modulo pair with a pair of integer multiplications; when using this approach, it makes sense to limit the state variable to 31 bits (or 63 bits on 64-bit architectures): as said before, the reciprocal method requires 2p+1 bits working precision for worst-case divisors, and reducing the range by 1 bit enables a faster (and simpler) implementation than would be required for a full-range variant (especially in C/C++ code, where multi-precision arithmetic is not easy to express). Note that handling the case F_s=1 requires some extra work; details are explained in the code.
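
If you just want the flavor of it, here’s a minimal sketch of one standard “round-up” reciprocal scheme under that 31-bit restriction. To be clear, this is not the linked sample code, the names are made up, and freq=1 still needs the separate handling mentioned above:

  #include <stdint.h>

  typedef struct { uint32_t rcp, shift; } RcpDiv;

  // Precompute the reciprocal for a known divisor freq, 2 <= freq < 2^31.
  static RcpDiv rcp_init(uint32_t freq)
  {
    uint32_t shift = 0;
    while ((1u << shift) < freq)   // shift = ceil(log2(freq))
      shift++;
    RcpDiv r;
    // rcp = ceil(2^(31+shift) / freq); this fits in 32 bits for freq >= 2.
    r.rcp = (uint32_t)(((1ull << (31 + shift)) + freq - 1) / freq);
    r.shift = shift;
    return r;
  }

  // floor(x / freq) for any x < 2^31; the 64-bit product stays below 2^63.
  static uint32_t rcp_div(uint32_t x, RcpDiv r)
  {
    return (uint32_t)(((uint64_t)x * r.rcp) >> (31 + r.shift));
  }

The encoder then computes q = rcp_div(x, r) and the remainder as x - q*freq, so the division/modulo pair turns into two multiplications (plus a subtraction) with no actual division.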

Symbol lookup

There’s one important step in the decoder that I haven’t talked about yet: mapping from the “slot index” x \bmod M to the corresponding symbol index. In normal rANS, each symbol covers a contiguous range of the “slot index” space (in contrast to, say, tANS, where the slots for any given symbol are spread relatively uniformly across the slot index space). That means that, if all else fails, we can figure out the symbol ID using a binary search in \lceil\log_2 n\rceil steps (remember that n is the size of our alphabet) from the cumulative frequency table (the B_s, which take O(n) space) – independent of the size of M. That’s comforting to know, but doing a binary search per symbol is, in practice, quite expensive compared to the rest of the decoding work we do.
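
For reference, that binary search is short (a sketch with assumed names; cum_freq holds the B_s plus a sentinel entry cum_freq[n] = M):

  // Find s such that cum_freq[s] <= xM < cum_freq[s+1].
  static uint32_t sym_from_slot_bsearch(const uint32_t *cum_freq, uint32_t n, uint32_t xM)
  {
    uint32_t lo = 0, hi = n;   // invariant: cum_freq[lo] <= xM < cum_freq[hi]
    while (hi - lo > 1) {
      uint32_t mid = (lo + hi) >> 1;
      if (cum_freq[mid] <= xM) lo = mid; else hi = mid;
    }
    return lo;
  }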

At the other extreme, we can just prepare a look-up table mapping from the slot index to the corresponding symbol ID. This is very simple (and the technique used in the example code) and theoretically constant-time per symbol, but it requires a table with M entries – and if the table ends up being too large to fit comfortably in a core’s L1 cache, real-world performance (although still technically bounded by a constant per symbol) can get quite bad. Moving in the other direction, if M is small enough, it can make sense to store the per-symbol information in the M-indexed table too, and avoid the extra indirection; I would not recommend this far beyond M=2^{12} though.
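
Building such a table is a trivial linear pass (again a sketch with assumed names, not the actual example code):

  // slot_to_sym must have M = sum of all freq[s] entries.
  static void build_slot_to_sym(const uint32_t *freq, uint32_t n, uint8_t *slot_to_sym)
  {
    uint32_t slot = 0;
    for (uint32_t s = 0; s < n; s++)
      for (uint32_t j = 0; j < freq[s]; j++)
        slot_to_sym[slot++] = (uint8_t)s;
  }
  // Decode-time lookup is then a single load: symbol = slot_to_sym[x & (M - 1)];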

Anyway, that gives us two design points: we can use O(n) space, at a cost of O(\log n) per symbol lookup; or we can use O(M) space, with O(1) symbol lookup cost. Now what we’d really like is to get O(1) symbol lookup in O(n) space, but sadly that option’s not on the menu.

Or is it?

The alias method

To make a long story short, I’m not aware of any way to meet our performance goals with the original unmodified rANS algorithm; however, we can do much better if we’re willing to relax our requirements a bit. Notably, there’s no deep reason for us to require that the slots assigned to a given symbol s be contiguous; we already know that e.g. tANS works in a much more relaxed setting. So let’s assume, for the time being, that we can rearrange our slot to symbol mapping arbitrarily (we’ll have to check if this is actually true later, and also work through what it means for our encoder). What does that buy us?

It buys us all we need to meet our performance goals, it turns out (props to my colleague Sean Barrett, who was the first one to figure this out, in our internal email exchanges anyway). As the section title says, the key turns out to be a stochastic sampling technique called the “alias method”. I’m not gonna explain the details here and instead refer you to this short introduction (written by a computational evolutionary geneticist, on randomly picking base pairs) and “Darts, Dice and Coins”, a much longer article that covers multiple ways to sample from a nonuniform distribution (by the way, note that the warnings about numerical instability that often accompany descriptions of the alias method need not worry us; we’re dealing with integer frequencies here so there’s no round-off error).

At this point, you might be wondering what the alias method, a technique for sampling from a non-uniform discrete probability distribution, has to do with entropy (de)coding. The answer is that the symbol look-up problem is essentially the same thing: we have a “random” value x \bmod M from the interval [0,M-1], and a matching non-uniform probability distribution (our symbol frequencies). Drawing symbols according to that distribution defines a map from [0,M-1] to our symbol alphabet, which is precisely what we need for our decoding function.

So what does the alias method do? Well, if you followed the link to the article I mentioned earlier, you get the general picture: it partitions the probabilities for our n-symbol alphabet into n “buckets”, such that each bucket i references at most 2 symbols (one of which is symbol i), and the probabilities within each bucket sum to the same value (namely, 1/n). This is always possible, and there is an algorithm (due to Vose) which determines such a partition in O(n) time. More generally, we can do so for any N≥n, by just adding some dummy symbols with frequency 0 at the end. In practice, it’s convenient to have N be a power of two, so for arbitrary n we would just pick N to be the smallest power of 2 that is ≥n.

Translating the sampling terminology to our rANS setting: we can subdivide our interval [0,M-1] into N sub-intervals (“buckets”) of equal size k=M/N, such that each bucket i references at most 2 distinct symbols, one of which is symbol i. We also need to know what the other symbol referenced in this bucket is – alias[i], the “alias” that gives the method its name – and the position divider[i] of the “dividing line” between the two symbols.
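
To give a flavor of what the construction looks like, here’s a minimal sketch of Vose-style alias table building with integer frequencies (hypothetical names, not the actual sample code; MAX_SYMS is an assumed alphabet bound). It fills exactly the divider[] and alias[] arrays used below:

  #include <stdint.h>

  enum { MAX_SYMS = 256 };  // assumed upper bound on N for this sketch

  // freq[0..N-1] must sum to M = N*k. Bucket i covers slots [i*k, (i+1)*k);
  // slots below divider[i] belong to symbol i, the rest belong to alias[i].
  static void build_alias_table(const uint32_t *freq, uint32_t N, uint32_t k,
                                uint32_t *divider, uint32_t *alias)
  {
    uint32_t remaining[MAX_SYMS];              // slots of each symbol not yet placed
    uint32_t small[MAX_SYMS], large[MAX_SYMS];
    uint32_t ns = 0, nl = 0;

    for (uint32_t i = 0; i < N; i++) {
      remaining[i] = freq[i];
      if (freq[i] < k) small[ns++] = i; else large[nl++] = i;
    }

    while (ns > 0 && nl > 0) {
      uint32_t i = small[--ns];                // under-full bucket: needs an alias
      uint32_t j = large[--nl];                // over-full symbol: donates slots
      divider[i] = i * k + remaining[i];       // first remaining[i] slots -> symbol i
      alias[i] = j;                            // rest of bucket i -> symbol j
      remaining[j] -= k - remaining[i];
      if (remaining[j] < k) small[ns++] = j; else large[nl++] = j;
    }

    // Frequencies are exact integers, so whatever is left over has exactly k
    // slots remaining: those buckets are fully owned by their own symbol.
    while (nl > 0) {
      uint32_t j = large[--nl];
      divider[j] = (j + 1) * k;
      alias[j] = j;
    }
  }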

With these two tables, determining the symbol ID from x is quick and easy:

  uint xM = x % M; // bit masking (power-of-2 M)
  uint bucket_id = xM / K; // shift (power-of-2 M/N!)
  uint symbol = bucket_id;
  if (xM >= divider[bucket_id]) // primary symbol or alias?
    symbol = alias[bucket_id];

This is O(1) time and O(N) = O(n) space (for the “divider” and “alias” arrays), as promised. However, this isn’t quite enough for rANS: remember that for our decoding function D, we need to know not just the symbol ID, but also which of the (potentially many) slots assigned to that symbol we ended up in; with regular rANS, this was simple since all slots assigned to a symbol are sequential, starting at slot B_s:
D(x) = (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).
Here, the (x \bmod M) - B_s part is the number we need. Now with the alias method, the slot IDs assigned to a symbol aren’t necessarily contiguous anymore. However, within each bucket, the slot IDs assigned to a symbol are sequential – which means that instead of the single cumulative frequency B_s, we now have two separate start offsets per bucket: one for the bucket’s primary symbol and one for its alias. This allows us to define the complete “alias rANS” decoding function:

  // s, x = D(x) with "alias rANS"
  uint xM = x % M;
  uint bucket_id = xM / K;
  uint symbol, bias;
  if (xM < divider[bucket_id]) { // primary symbol or alias?
    symbol = bucket_id;
    bias = primary_start[bucket_id];
  } else {
    symbol = alias[bucket_id];
    bias = alias_start[bucket_id];
  }
  x = (x / M) * freq[symbol] + xM - bias;

And although this code is written with branches for clarity, it is in fact fairly easy to do branch-free. We gained another two tables indexed with the bucket ID; generating them is another straightforward linear pass over the buckets: we just need to keep track of how many slots we’ve assigned to each symbol so far. And that’s it – this is all we need for a complete “alias rANS” decoder.
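
In case you want to see the index bookkeeping spelled out, that pass might look something like this (a sketch with the same assumed names as before; assigned[] starts out all zero):

  // Walk buckets in increasing order, primary part before alias part;
  // assigned[s] counts how many slots of symbol s have been handed out so far.
  for (uint32_t i = 0; i < N; i++) {
    uint32_t nprim = divider[i] - i * k;      // primary slots in bucket i
    primary_start[i] = i * k - assigned[i];   // so that xM - bias == slot-within-symbol
    assigned[i] += nprim;
    uint32_t a = alias[i];
    alias_start[i] = divider[i] - assigned[a];
    assigned[a] += k - nprim;                 // alias slots in bucket i
  }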

However, there’s one more minor tweak we can make: note that the only part of the computation that actually depends on symbol is the evaluation of freq[symbol]; if we store the frequencies for both symbols in each alias table bucket, we can get rid of the dependent look-ups. This can be a performance win in practice; on the other hand, it does waste a bit of extra memory on the alias table, so you might want to skip it.

Either way, the alias method allows us to perform quite fast symbol look-ups for large M (though not as fast as a fully-unrolled table for small M), with memory overhead (and preparation time) proportional to n. That’s quite cool, and might be particularly interesting in cases where you either have relatively small alphabets (say on the order of 15-25 symbols), need lots of different tables, or frequently switch between tables.

Encoding

However, we haven’t covered encoding yet. With regular rANS, encoding is easy, since – again – the slot ranges for each symbol are contiguous; the encoder just does
C(s,x) = M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
where B_s + (x \bmod F_s) is the slot id corresponding to the (x \bmod F_s)‘th appearance of symbol s.

With alias rANS, each symbol may have its slots distributed across multiple, disjoint intervals – up to N of them. And the encoder now needs to map (x \bmod F_s) to a corresponding slot index that will decode correctly. One way to do this is to just keep track of the mapping as we build the alias table; this takes O(M) space and is O(1) cost per symbol. Another is to keep a sorted list of subintervals (and their cumulative sizes) assigned to each symbol; this takes only O(N) space, but adds an O(\log N) (worst-case) lookup per symbol in the encoder. Sound familiar?
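
To make the O(M)-space option concrete, here’s one possible sketch (assumed names again): enc_slot[B_s + j] stores the absolute slot index of the j-th occurrence of symbol s, filled by walking the buckets in the same order as the decoder-side pass above so that both sides agree on the mapping:

  // cum_freq[s] = B_s; enc_slot has M entries; assigned[] has N entries, all zero.
  static void build_encoder_slots(const uint32_t *divider, const uint32_t *alias,
                                  const uint32_t *cum_freq, uint32_t N, uint32_t k,
                                  uint32_t *enc_slot, uint32_t *assigned)
  {
    for (uint32_t i = 0; i < N; i++) {
      for (uint32_t slot = i * k; slot < divider[i]; slot++)         // primary part
        enc_slot[cum_freq[i] + assigned[i]++] = slot;
      uint32_t a = alias[i];
      for (uint32_t slot = divider[i]; slot < (i + 1) * k; slot++)   // alias part
        enc_slot[cum_freq[a] + assigned[a]++] = slot;
    }
  }

The encoding step C(s,x) then becomes x = M * (x / freq[s]) + enc_slot[cum_freq[s] + (x % freq[s])], with the division and modulo handled via the reciprocal trick from earlier.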

In short, using the alias method doesn’t really solve the symbol lookup problem for large M; or, more precisely, it solves the lookup problem on the decoder side, but at the cost of adding an equivalent problem on the encoder side. What this means is that we have to pick our poison: faster encoding (at some extra cost in the decoder), or faster decoding (at some extra cost in the encoder). This is fine, though; it means we get to make a trade-off, depending on which of the two steps is more important to us overall. And as long as we are in a compress-once decompress-often scenario (which is fairly typical), making the decoder faster at some reasonable extra cost in the encoder is definitely useful.

Conclusion

We can exploit static, known probabilities in several ways for rANS and related coders: for encoding, we can precompute the right “magic values” to avoid divisions in the hot encoding loop; and if we want to support large M, the alias method enables fast decoding without generating a giant table with M entries – with an O(n) preprocessing step (where n is the number of symbols), we can still support O(1) symbol decoding, albeit with a (slightly) higher constant factor.

I’m aware that this post is somewhat hand-wavey; the problem is that while Vose’s algorithm and the associated post-processing are actually quite straightforward, there’s a lot of index manipulation, and I find the relevant steps to be quite hard to express in prose without the “explanation” ending up harder to read than the actual code. So instead, my intent was to convey the “big picture”; a sample implementation of alias table rANS, with all the details, can be found – as usual – on GitHub.

And that’s it for today – take care!
