are a standard sequence used in quadratic probing of open hash tables. For example, they’re used in Google’s sparse_hash
and dense_hash
, generally considered to be very competitive hash table implementations.
You can find lots of places on the web mentioning that the resulting probe sequence will visit every element of a power-of-2 sized hash table exactly once; more precisely, the function is a permutation on . But it’s pretty hard to find a proof; everybody seems to refer back to Knuth, and in The Art of Compute Programming, Volume 3, Chapter 6.4, the proof appears as an exercise (number 20 in the Second Edition).
If you want to do this exercise yourself, please stop reading now; spoilers ahead!
Anyway, turns out I arrived at the solution very differently from Knuth. His proof is much shorter and slicker, but pretty “magic” and unmotivated, so let’s take the scenic route! The first step is to look at a bunch of small values and see if we can spot any patterns.
k | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
T_{k} | 0 | 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 | 45 | 55 | 66 |
T_{k} mod 8 | 0 | 1 | 3 | 6 | 2 | 7 | 5 | 4 | 4 | 5 | 7 | 2 |
T_{k} mod 4 | 0 | 1 | 3 | 2 | 2 | 3 | 1 | 0 | 0 | 1 | 3 | 2 |
T_{k} mod 2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
And indeed, there are several patterns that might be useful: looking at the row for “mod 2″, we see that it seems to be just the values 0, 1, 1, 0 repeating, and that sequence itself is just 0, 1 followed by its reverse. The row for “mod 4″ likewise looks like it’s just alternating between normal and reverse copies of 0, 1, 3, 2, and the “mod 8″ row certainly looks like it might be following a similar pattern. Can we prove this?
First, the mirroring suggests that it might be worthwhile to look at the differences
Both terms are multiples of n, so we have , which proves the mirroring (incidentally, for any n, not just powers of 2). Furthermore, the first term is a multiple of 2n too, and the second term almost is: we have . This will come in handy soon.
To prove that is 2n-periodic, first note the standard identity for triangular numbers
in particular, we have
Again, this is for arbitrary n ≥ 1. So far, we’ve proven that for arbitrary positive integer n, the sequence is 2n-periodic, with the second half being a mirrored copy of the first half. It turns out that we can wrangle this one fact into a complete proof of the , 0 ≤ k < 2^{m} being a permutation fairly easily by using induction:
Basis (m=0): , and the function is a permutation on { 0 }.
Inductive step: Suppose is a permutation on . Then the values of must surely be pairwise distinct for 0 ≤ k < 2^{m}, since they’re already distinct mod 2^{m}. That means we only have to deal with the second half. But above (taking n=2^{m}), we proved that
and letting k run from 0 to 2^{m}-1, we notice that mod 2^{m}, these values are congruent to the values in the first half, but mod 2^{m+1}, they differ by an additional term of -n. This implies that the values in the second half are pairwise distinct both from the first half and from each other. This proves that f_{m+1} is injective, and because it’s mapping the 2^{m+1}-element set onto itself, it must be a permutation.
Which, by mathematical induction, proves our claim.
Distributions over a binary alphabet are easy to mix (in the “context mixing” sense): they can be represented by a single scalar (I’ll use the probability of the symbol being a ‘1’, probability of ‘0’ works too) which are easily combined with other scalars. In practice, these probabilities are represented as fixed-size integers in a specific range. A typical choice is . Often, p=0 (“it’s certainly a 0″) and p=1 (certain 1) are excluded, because they only allow us to “encode” (using zero bits!) one of the two symbols.
Multi-symbol alphabets are trickier. A binary alphabet can be described using one probability p; the probability of the other symbol must be 1-p. Larger alphabets involve multiple probabilities, and that makes things trickier. Say we have a n-symbol alphabet. A probability distribution for such an alphabet is, conventionally, a vector such that for all and furthermore . In practice, we will again represent them using integers. Rather than dealing with the hassle of having fractions everywhere, let’s just define our “finite-precision distributions” as integer vectors , again for all k, and where T is the total. Same as with the binary case, we would like T to be a constant power of 2. This lets us use cheaper variants of conventional arithmetic coders, as well as rANS. Unlike the binary case, maintaining this invariant takes explicit work, and it’s not cheap.
In practice, the common choice is to either drop the requirement that T be constant (most adaptive multi-symbol models for arithmetic coding take that route), or switch to a semi-adaptive model such as deferred summation that performs model updates in batches, which can either pick batches such that the constant sum is automatically maintained, or at least amortize the work spent enforcing it. A last option is using a “sliding window” style model that just uses a history of character counts over the last N input symbols. This makes it easy to maintain the constant sum, but it’s pretty limited.
A second problem is mixing – both as an explicit operation for things like context mixing, and as a building block for model updates more sophisticated than simply incrementing a counter. With binary models, this is easy – just a weighted sum of probabilities. Multi-symbol models with a constant sum are harder: for example, say we want to average the two 3-symbol models and while maintaining a total of T=8. Computing illustrates the problem: the two non-integer counts have to get rounded back to integers somehow. But if we round both up, we end up at a total of 9; and if we round both down, we end up with a total of 7. To achieve our desired total of 8, we need to round one up, and one down – but it’s not exactly obvious how we should choose on any given symbol! In short, maintaining a constant total while mixing in this form proves to be tricky.
However, I recently realized that these problems all disappear when we work with the cumulative distribution function (CDF) instead of the raw symbol counts.
For a discrete probability distribution p with total T, define the corresponding cumulative probabilities:
P_{0} is an empty sum, so P_{0}=0. On the other end, P_{n} = T, since that’s just the sum over all values in p, and we know the total is T. For the elements in between, p_{k} ≥ 0 implies that P_{k} ≥ P_{k-1}, i.e. P is monotonically non-decreasing. Conversely, given any P with these three properties, we can compute the differences between adjacent elements and determine the underlying distribution p.
And it turns out that while mixing quantized symbol probabilities is problematic, mixing cumulative distribution functions pretty much just works. To wit: suppose we are given two CDFs P and Q with the same total T, and a blending factor , then define:
Note that because summation is linear, we have
so this is indeed the CDF corresponding to a blended model between p and q. However, we’re still dealing with real values here; we need integers, and the easiest way to get there is to just truncate, possibly with some rounding bias :
It turns out that this just works, but this requires proof, and to get there it helps to prove a little lemma first.
Spacing lemma: Suppose that for some j, and where m is an arbitrary integer. Then we also have .
Proof: We start by noting that
and since m is an integer, we have
which was our claim.
Using the lemma, it’s now easy to show that R is a valid CDF with total T: we need to establish that R_{0}=0, R_{n}=T, and show that R is monotonic. The first two are easy, since the P and Q we’re mixing both agree at these points and 0≤b<1.
As for monotonicity, note that is the same as (and the same for P and Q). Therefore, we can apply the spacing lemma with m=0 for 1≤j≤n: monotonicity of P and Q implies that R will be monotonic too. And that’s it: this proves that R is valid CDF for a distribution with total T. If we want the individual symbol frequencies, we can recover them as .
Note that the spacing lemma is a fair bit stronger than just establishing monotonicity. Most importantly, if p_{k}≥m and q_{k}≥m, the spacing lemma tells us that r_{k}≥m – these properties carry over to the blended distribution, as one would expect! I make note of this because this kind of invariant is often violated by approximate techniques that first round the probabilities, then fudge them to make the total come out right.
This gives us a way to blend between two probability distributions while maintaining a constant total, without having to deal with dodgy ad-hoc rounding decisions. This requires working in CDF form, but that’s pretty common for arithmetic coding models anyway. As long as the mixing computation is done exactly (which is straightforward when working in integers), the total will remain constant.
I only described linear blending, but the construction generalizes in the obvious way to arbitrary convex combinations. It is thus directly applicable to mixing more than two distributions while only rounding once. Furthermore, given the CDFs of the input models, the corresponding interval for a single symbol can be found using just two mixing operations to find the two interval end points; there’s no need to compute the entire CDF for the mixed model. This is in contrast to direct mixing of the symbol probabilities, which in general needs to look at all symbols to either determine the total (if a non-constant-T approach is used) or perform the adjustments to make the total come out to T.
Furthermore, the construction shows that probability distributions with sum T are closed under “rounded convex combinations” (as used in the proof). The spacing lemma implies that the same is true for a multitude of more restricted distributions: for example, the set of probability distributions where each symbol has a nonzero probability (corresponding to distributions with monotonically increasing, instead of merely non-decreasing, CDFs) is also closed under convex combinations in this sense. This is a non-obvious result, to me anyway.
One application for this (as frequently noted) is context mixing of multi-symbol distributions. Another is as a building block in adaptive model updates that’s a good deal more versatile than the obvious “steal the count from one symbol, add it to another” update step.
I have no idea whether this is new or not (probably not); I certainly hadn’t seen it before, and neither had anyone else at RAD. Nor do I know whether this will be useful to anyone else, but it seemed worth writing up!
The standard approach to computing Mv using SIMD instructions boils down to taking a linear combination of the four column vectors a, b, c and d, using standard SIMD componentwise addition, multiplication and broadcast shuffles.
// Given M as its four constituent column vectors a, b, c, d, // compute r=M*v. r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d;
This computes the vector-matrix product using four shuffles, four (SIMD) multiplies, and three additions. This is all bog-standard. And if the ISA we’re working on has free broadcast swizzles (ARM NEON for example), we’re done. But if not, can we do better? Certainly if we know things about M or v: if M has a special structure, or some components of v are known to be always 0, 1 or -1, chances are good we can save a bit of work (whether it makes a difference is another matter). But what if M and v are completely general, and all we know is that we want to transform a lot of vectors with a single M? If v is either given as or returned in SoA form (structure-of-arrays), we can reduce the number of per-vector shuffles greatly if we’re willing to preprocess M a bit and have enough registers available. But let’s say we’re not doing that either: our input v is in packed form, and we want the results packed too. Is there anything we can do?
There’s no way to reduce the number of multiplies or additions in general, but we can get rid of exactly one shuffle per vector, if we’re willing to rearrange M a bit. The trick is to realize that we’re using each of v.x, v.y, v.z, and v.w exactly four times, and that the computations we’re doing (a bunch of component-wise multiplies and additions) are commutative and associative, so we can reorder them, in exact arithmetic anyway. (This type of computation is usually done in floating point, where we don’t actually have associativity, but I’m going to gloss over this.)
Let’s look at the our first set of products, v.xxxx * a
. We’re just walking down a column of M, multiplying each element we see by v_{x}. What if we walk in a different direction? Going along horizontals turns out to be boring (it’s essentially the same, just transposed), but diagonals of M are interesting, the main diagonal in particular.
So here’s the punch line: we form four new vectors by walking along diagonals (with wrap-around) as follows:
Phrasing the matrix multiply in terms of these four vectors, we get:
r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h;
Same number of multiplies and adds, but one shuffle per vector less (because the swizzle pattern for v in the first term is xyzw
, which is the natural ordering of v). Also note that forming e, f, g, and h given M in column vector form is also relatively cheap: it’s a matrix transposition with a few post-swizzles to implement the cyclic rotations. If you have M as row vectors (for example because it’s stored in row-major order), it’s even cheaper.
So: multiplying a packed 4-vector with a constant 4×4-matrix takes one shuffle less than the standard approach, if we’re willing to do some preprocessing on M (or store our matrices in a weird layout to begin with). Does this matter? It depends. On current desktop x86 cores, it’s pretty marginal, because SIMD shuffles can execute in parallel (during the same cycle) with additions and multiplications. On older cores with less execution resources, on in-order SIMD CPUs, and on low-power parts, it can definitely help though.
For what it’s worth: if your 4D vectors come from graphics or physics workloads and are actually homogeneous 3-vectors with a constant w=1 and no projective transforms anywhere in sight, you can exploit that structure explicitly for higher gains than this. But I ran into this with a DSP workload (with v just being a vector of 4 arbitrary samples), and in that case it’s definitely useful to know, especially since anything convolution-related tends to have highly diagonal (Toeplitz, to be precise) structure to begin with.
Date: Wed, 05 Feb 2014 16:43:36 -0800
From: Fabian Giesen
Subject: Alias Huffman coding.
Huffman <= ANS (strict subset)
(namely, power-of-2 frequencies)
We can take any discrete probability distribution of N events and use the Alias method to construct a O(N)-entry table that allows us to sample from that distribution in O(1) time.
We can apply that same technique to e.g. rANS coding to map from (x mod M) to “what symbol is x”. We already have that.
Ergo, we can construct a Huffman-esque coder that can decode symbols using a single table lookup, where the table size only depends on N_sym and not the code lengths. (And the time to build said table given the code lengths is linear in N_sym too).
Unlike regular/canonical Huffman codes, these can have multiple unconnected ranges for the same symbol, so you still need to deal with the range remapping (the “slot_adjust” thing) you have in Alias table ANS; basically, the only difference ends up being that you have a shift instead of a multiply by the frequency.
But there’s still some advantages in that a few things simplify; for example, there’s no need (or advantage) to using a L that’s larger than M. An obvious candidate is choosing L=M=B so that your Huffman codes are length-limited to half your word size and you never do IO in smaller chunks than that.
Okay. So where does that get us? Well, something like the MSB alias rANS decoder, with a shift instead of a multiply, really:
// decoder state // suppose max_code_len = 16 U32 x; U16 const * input_ptr; U32 const m = (1 << max_code_len) - 1; U32 const bucket_shift = max_code_len - log2_nbuckets; // decode: U32 xm = x & m; U32 xm_shifted = xm >> bucket_shift; U32 bucket = xm_shifted * 2; if (xm < hufftab_divider[xm_shifted]) bucket++; x = (x & ~m) >> hufftab_shift[bucket]; x += xm - hufftab_adjust[bucket]; if (x < (1<<16)) x = (x << 16) | *input_ptr++; return hufftab_symbol[bucket];
So with a hypothetical compiler that can figure out the adc-for-bucket
thing, we’d get something like
; x in eax, input_ptr in esi movzx edx, ax ; x & m (for bucket id) shr edx, 8 ; edx = xm_shifted movzx ebx, ax ; ebx = xm cmp ax, [hufftab_divider + edx*2] adc edx, edx ; edx = bucket xor eax, ebx ; eax = x & ~m mov cl, [hufftab_shift + edx] shr eax, cl movzx ecx, word [hufftab_adjust + edx*2] add eax, ebx ; x += xm movzx edx, byte [hufftab_symbol + edx] ; symbol sub eax, ecx ; x -= adjust[bucket] cmp eax, (1<<16) jae done shl eax, 16 movzx ecx, word [esi] add esi, 2 or eax, ecx done: ; new x in eax, new input_ptr in esi ; symbol in edx
which is actually pretty damn nice considering that’s both Huffman decode and bit buffer rolled into one. Especially so since it handles all cases – there’s no extra conditions and no cases (rare though they might be) where you have to grab more bits and look into another table. Bonus points because it has an obvious variant that’s completely branch-free:
; same as before up until... sub eax, ecx ; x -= adjust[bucket] movzx ecx, word [esi] mov ebx, eax shl ebx, 16 or ebx, ecx lea edi, [esi+2] cmp eax, (1<<16) cmovb eax, ebx cmovb esi, edi
Okay, all that’s nice and everything, but for x86 it’s nothing we haven’t seen before. I have a punch line though: the same thing works on PPC – the adc thing and “sbb reg, reg” both have equivalents, so you can do branch-free computation based on some carry flag easily.
BUT, couple subtle points:
(x & foo) >> bar
(left-shift or right-shift) kind of things, which map really really well to PPC because there’s rlwinm / rlwimi.The in-order PPCs hate variable shifts (something like 12+ cycles microcoded). Well, guess what, everything we multiply with is a small per-symbol constant, so we can just store (1 << len) per symbol and use mullw
. That’s 9 cycles non-pipelined (and causes a stall after issue), but still, better than the microcode. But… wait a second.
If this ends up faster than your usual Huffman, and there’s a decent chance that it might (branch-free and all), the fastest “Huffman” decoder on in-order PPC would, in fact, be a full-blown arithmetic decoder. Which amuses me no end.
# NOTE: LSB of "bucket" complemented compared to x86 # r3 = x, r4 = input ptr # r20 = &tab_divider[0] # r21 = &tab_symbol[0] # r22 = &tab_mult[0] # r23 = &tab_adjust[0] rlwinm r5, r3, 24, 23, 30 # r5 = (xm >> bucket_shift) * 2 rlwinm r6, r3, 0, 16, 31 # r6 = xm lhzx r7, r20, r5 # r7 = tab_divider[xm_shifted] srwi r8, r3, 16 # r8 = x >> log2(m) subfc r9, r7, r6 # (r9 ignored but sets carry) lhz r10, 0(r4) # *input_ptr addze r5, r5 # r5 = bucket lbzx r9, r21, r5 # r9 = symbol add r5, r5, r5 # r5 = bucket word offs lhzx r7, r22, r5 # r7 = mult li r6, 0x10000 # r6 = op for sub later lhzx r5, r23, r5 # r5 = adjust mullw r7, r7, r8 # r7 = mult * (x >> m) subf r5, r5, r6 # r5 = xm - tab_adjust[bucket] add r5, r5, r7 # r5 = new x subfc r6, r6, r5 # sets carry iff (x >= (1<<16)) rlwimi r10, r5, 16, 0, 16 # r10 = (x << 16) | *input_ptr subfe r6, r6, r6 # ~0 if (x < (1<<16)), 0 otherwise slwi r7, r6, 1 # -2 if (x < (1<<16)), 0 otherwise and r10, r10, r6 andc r5, r5, r6 subf r4, r7, r4 # input_ptr++ if (x < (1<<16)) or r5, r5, r10 # new x
That should be a complete alias rANS decoder assuming M=L=b=2^{16}.
-Fabian
Conceptually, a Huffman decoder starts from the root, then reads one bit at a time, descending into the sub-tree denoted by that bit. If that sub-tree is a leaf node, return the corresponding symbol. Otherwise, keep reading more bits and descending into smaller and smaller sub-trees until you do hit a leaf node. That’s all there is to it.
Except, of course, no serious implementation of Huffman decoding works that way. Processing the input one bit at a time is just a lot of overhead for very little useful work done. Instead, normal implementations effectively look ahead by a bunch of bits and table-drive the whole thing. Peek ahead by k bits, say k=10. You also prepare a table with 2^{k} entries that encodes what the one-bit-at-a-time Huffman decoder would do when faced with those k input bits:
struct TableEntry { int num_bits; // Number of bits consumed int symbol; // Index of decoded symbol };
If it reaches a leaf node, you record the ID of the symbol it arrived at, and how many input bits were actually consumed to get there (which can be less than k). If not, the next symbol takes more than k bits, and you need a back-up plan. Set num_bits
to 0 (or some other value that’s not a valid code length) and use a different strategy to decode the next symbol: typically, you either chain to another (secondary) table or fall back to a slower (one-bit-at-a-time or similar) Huffman decoder with no length limit. Since Huffman coding only assigns long codes to rare symbols – that is, after all, the whole point – it doesn’t tend to matter much; with well-chosen k (typically, slightly larger than the log2 of the size of your symbol alphabet), the “long symbol” case is pretty rare.
So you get an overall decoder that looks like this:
while (!done) { // Read next k bits without advancing the cursor int bits = peekBits(k); // Decode using our table int nbits = table[bits].num_bits; if (nbits != 0) { // Symbol *out++ = table[bits].symbol; consumeBits(nbits); } else { // Fall-back path for long symbols here! } }
This ends up particularly nice combined with canonical Huffman codes, and some variant of it is used in most widely deployed Huffman decoders. All of this is classic and much has been written about it elsewhere. If any of this is news to you, I recommend Moffat and Turpin’s 1997 paper “On the implementation of minimum redundancy prefix codes”. I’m gonna assume it’s not and move on.
For the next step, suppose we fix k to be the length of our longest codeword. Anything smaller and we need to deal with the special cases just discussed; anything larger is pointless. A table like the one above then tells us what to do for every possible combination of k input bits, and when we turn the k-bit lookahead into explicit state, we get a finite-state machine that decodes Huffman codes:
state = getBits(k); // read initial k bits while (!done) { // Current state determines output symbol *out++ = table[state].symbol; // Update state (assuming MSB-first bit packing) int nbits = table[state].num_bits; state = (state << nbits) & ((1 << k) - 1); state |= getBits(nbits); }
state
is essentially a k-bit shift register that contains our lookahead bits, and we need to update it in a way that matches our bit packing rule. Note that this is precisely the type of Huffman decoder Charles talks about here while explaining ANS. Alternatively, with LSB-first bit packing:
state = getBits(k); while (!done) { // Current state determines output symbol *out++ = table[state].symbol; // Update state (assuming LSB-first bit packing) int nbits = table[state].num_bits; state >>= nbits; state |= getBits(nbits) << (k - nbits); }
This is still the exact same table as before, but because we’ve sized the table so that each symbol is decoded in one step, we don’t need a fallback path. But so far this is completely equivalent to what we did before; we’re just explicitly keeping track of our lookahead bits in state
.
But this process still involves, essentially, two separate state machines: one explicit for our Huffman decoder, and one implicit in the implementation of our bitwise IO functions, which ultimately read data from the input stream at least one byte at a time.
For our next trick, let’s look at the bitwise IO we need and turn that into an explicit state machine as well. I’m assuming you’ve implemented bitwise IO before; if not, I suggest you stop here and try to figure out how to do it before reading on.
Anyway, how exactly the bit IO works depends on the bit packing convention used, the little/big endian of the compression world. Both have their advantages and their disadvantages; in this post, my primary version is going to be LSB-first, since it has a clearer correspondence to rANS which we’ll get to later. Anyway, whether LSB-first or MSB-first, a typical bit IO implementation uses two variables, one for the “bit buffer” and one that counts how many bits are currently in it. A typical implementation looks like this:
uint32_t buffer; // The bits themselves uint32_t num_bits; // Number of bits in the buffer right now uint32_t getBits(uint32_t count) { // Return low "count" bits from buffer uint32_t ret = buffer & ((1 << count) - 1); // Consume them buffer >>= count; num_bits -= count; // Refill the bit buffer by reading more bytes // (kMinBits is a constant here) while (num_bits < kMinBits) { buffer |= *in++ << num_bits; num_bits += 8; } return ret; }
Okay. That’s fine, but we’d like for there to be only one state variable in our state machine, and preferably not just on a technicality such as declaring our one state variable to be a pair of two values. Luckily, there’s a nice trick to encode both the data and the number of bits in the bit buffer in a single value: we just keep an extra 1 bit in the state
, always just past the last “real” data bit. Say we have a 8-bit state
, then we assign the following codes (in binary):
in_binary(state) | num_bits |
0 0 0 0 0 0 0 1 |
0 |
0 0 0 0 0 0 1 * |
1 |
0 0 0 0 0 1 * * |
2 |
0 0 0 0 1 * * * |
3 |
0 0 0 1 * * * * |
4 |
0 0 1 * * * * * |
5 |
0 1 * * * * * * |
6 |
1 * * * * * * * |
7 |
The locations denoted *
store the actual data bits. Note that we’re fitting 1 + 2 + … + 128 = 255 different states into a 8-bit byte, as we should. The only value we’re not using is “0”. Also note that we have num_bits = floor(log2(state))
precisely, and that we can determine num_bits
using bit scanning instructions when we need to. Let’s look at how the code comes out:
uint32_t state; // As described above uint32_t getBits(uint32_t count) { // Return low "count" bits from state uint32_t ret = state & ((1 << count) - 1); // Consume them state >>= count; // Refill the bit buffer by reading more bytes // (kMinBits is a constant here) // Note num_bits is a local variable! uint32_t num_bits = find_highest_set_bit(state); while (num_bits < kMinBits) { // Need to clear 1-bit at position "num_bits" // and add a 1-bit at bit "num_bits + 8", hence the // "+ (256 - 1)". state += (*in++ + (256 - 1)) << num_bits; num_bits += 8; } return ret; }
Okay. This is written to be as similar as possible to the implementation we had before. You can phrase the while
condition in terms of state
and only compute num_bits
inside the refill loop, which makes the non-refill case slightly faster, but I wrote it the way I did to emphasize the similarities.
Consuming bits is slightly cheaper than the regular bit buffer, refilling is a bit more expensive, but we’re down to one state variable instead of two. Let’s call that a win for our purposes (and it certainly can be when low on registers). Note I only covered LSB-first bit packing here, but we can do a similar trick for MSB bit buffers by using the least-significant set bit as a sentinel instead. It works out very similar.
So what happens when we plug this into the finite-state Huffman decoder from before?
Note that our state machine decoder above still just kept the k lookahead bits in state
, and that they’re not exactly hard to recover from our bit buffer state
. In fact, they’re pretty much the same. So we can just fuse them together to get a state machine-based Huffman decoder that only uses byte-wise IO:
state = 1; // No bits in buffer refill(); // Run "refill" step from the loop once while (!done) { // Current state determines output symbol index = state & ((1 << k) - 1); *out++ = table[index].symbol; // Update state (consume bits) state >>= table[index].num_bits; // Refill bit buffer (make sure at least k bits in it) // This reads bytes at a time, but could just as well // read 16 or 32 bits if "state" is large enough. num_bits = find_highest_set_bit(state); while (num_bits < k) { state += (*in++ + (256 - 1)) << num_bits; num_bits += 8; } }
The slightly weird refill()
call at the start is just to keep the structure as similar as possible to what we had before. And there we have it, a simple Huffman decoder with one state variable and a table. Of course you can combine this type of bit IO with other Huffman approaches, such as multi-table decoding, too. You could also go even further and bake most of the bit IO into tables like Charles describes here, effectively using a table on the actual state
and not just its low bits, but that leads to enormous tables and is really not a good idea in practice; not only are the tables too large to fit in the cache, general-purpose compressors will also usually spend more time building these tables than they ever spend using them (since it’s rare to use a single Huffman table for more than a few dozen kilobytes at a time).
Okay. So far, there’s nothing in here that’s not at least 20 years old.
The decoder above still reads the exact same bit stream as the original LSB-first decoder. But if we’re willing to prescribe the exact form of the decoder, we can use a different refilling strategy that’s more convenient (or cheaper). In particular, we can do this:
state = read_3_bytes() | (1 << 24); // might as well! while (!done) { // Current state determines output symbol index = state & ((1 << k) - 1); *out++ = table[index].symbol; // Update state (consume bits) state >>= table[index].num_bits; // Refill while (state < (1 << k)) state = (state << 8) | *in++; }
This is still workable a Huffman decoder, and it’s cheaper than the one we saw before, because refilling got cheaper. But it also got a bit, well, strange. Note we’re reading 8 bits and putting them into the low bits of state
; since we’re processing bits LSB-first, that means we added them at the “front” of our bit queue, rather than appending them as we used to! In principle, this is fine. Bits are bits. But processing bits out-of-sequence in that way is certainly atypical, and means extra work for the encoder, which now needs to do extra work to figure out exactly how to permute the bits so the decoder reads them in the right order. In fact, it’s not exactly obvious that you can encode this format efficiently to begin with.
But you definitely can, by encoding backwards. Because, drum roll: this isn’t a regular table-driven Huffman decoder anymore. What this actually is is a rANS decoder for symbols with power-of-2 probabilities. The state >>= table[index].num_bits;
is what the decoding state transition function for rANS reduces to in that case.
In other words, this is where we start to see new stuff. It might be possible that someone did a decoder like this before last year, but if they did, I certainly never encountered it before. And trust me, it is weird; the byte stream the corresponding encoder emits is uniquely decodable and has the same length as the bit stream generated for the corresponding Huffman or canonical Huffman code, but the bit-shuffling means it’s not even a regular prefix code stream.
But there’s one more, which is a direct corollary of the existence of alias rANS: we can use the alias method to build a fast decoding table with size proportional to the number of symbols in the alphabet, completely independent of the code lengths!
Note the alias method allows you to construct a table with an arbitrary number of entries, as long as it’s larger than the number of symbols. For efficiency, you’ll typically want to round up to the next power of 2. I’m not going to describe the exact encoder details here, simply because it’s just rANS with power-of-2 probabilities, and the ryg_rans
encoder/decoder can handle that part just fine. So you already have example code. But that means you can build a fast “Huffman” decoder like this:
kMaxCodeLen = 24; // max code len in bits kCodeMask = (1 << kMaxCodeLen) - 1; kBucketShift = kMaxCodeLen - SymbolStats::LOG2NSYMS; state = read_3_bytes() | (1 << 24); // might as well! while (!done) { // Figure out bucket in alias table; same data structures as in // ryg_rans, except syms->slot_nbits (number of bits in Huffman // code for symbol) instead of syms->slot_nfreqs is given. uint32_t index = state & kCodeMask; uint32_t bucket_id = index >> kBucketShift; uint32_t bucket2 = bucket_id * 2; if (index < syms->divider[bucket_id]) ++bucket2; // bucket determines output symbol *out++ = syms->sym_id[bucket2]; // Update state (just D(x) for pow2 probabilities) state = (state & ~kCodeMask) >> syms->slot_nbits[bucket2]; state += index - syms->slot_adjust[bucket2]; // Refill (make sure at least kMaxCodeLen bits in buffer) while (state <= kCodeMask) state = (state << 8) | *in++; }
I find this remarkable because essentially all other fast (~constant time per symbol) Huffman decoding tricks have some dependence on the distribution of code lengths. This one does not; the alias table size is determined strictly by the number of symbols. The only fundamental data-dependency is how often the “refill” code is run (it runs, necessarily, once per input byte, so it will run less often – relatively speaking – on highly compressible data than it will on high-entropy data). (I’m not counting the computation of bucket2
here because it’s just a conditional add, and is in fact written the way it is precisely so that it can be mapped to a compare-then-add-with-carry sequence.)
Note that this one really is a lot weirder still than the previous variant, which at least kept the “space” assigned to individual codes connected. This one will, through the alias table construction, end up allocating small parts of the code range for large symbols all over the place. It’s still exactly equivalent to a Huffman coder in terms compression ratio and code “lengths”, but the underlying construction really doesn’t have much to do with Huffman at all at this point, and we’re not even emitting particular bit strings for code words anymore.
All that said, I don’t think this final variant is actually interesting in practice; if I did, I would have written about it earlier. If you’re bothering to implement rANS and build an alias table, it really doesn’t make sense to skimp out on the one extra multiply that turns this algorithm into a full arithmetic decoder (as opposed to quasi-Huffman), unless your multiplier is really slow that is.
But I do find it to be an interesting construction from a theoretical standpoint, if nothing else. And if you don’t agree, well, maybe you at least learned something about certain types of Huffman decoders and their relation to table-based ANS decoders. :)
In the meantime, here’s a quick post on something else: little-endian (LE) vs. big-endian (BE) and some of the trade-offs involved. The whole debate comes up periodically by proponents of LE or BE with missionary zeal, and then I get annoyed, because usually what makes either endianness superior for some applications makes it inferior for others. So for what it’s worth, here’s the trade-offs I’m aware of:
LE stores bytes in the order you do most math operations on them (if you were to do it byte by byte and not in larger chunks, that is). Additions and subtractions proceed from least-significant bit (LSB) to most-significant bit (MSB), always, because that’s the order carries (and borrows) are generated. Multiplications form partial products from smaller terms (at the limit, individual bits, though for hardware you’re more likely to use radix-4 booth recoding or similar) and add them, and the final addition likewise is LSB to MSB. Long division is the exception and works its way downwards from the most significant bits, but divisions are generally much less frequent than additions, subtractions and multiplication.
Arbitrary-precision arithmetic (“bignum arithmetic”) thus typically chops up numbers into segments (“legs”) matching the word size of the underlying machine, and stores these words in memory ordered from least significant to most significant – on both LE and BE architectures.
All 8-bit ISAs I’m personally familiar with (Intel 8080, Zilog Z80, MOS 6502) use LE, presumably for that reason; it’s the more natural byte order for 16-bit numbers if you only have an 8-bit ALU. (That said, Motorola’s 8-bit 6800 apparently used BE). And consequently, if you’re designing a new architecture with the explicit goal of being source-code compatible with the 8080 (yes, x86 was already constrained by backwards-compatibility considerations even for the original 8086!) it’s going to be little-endian.
So if you want to do a lexicographic sort, memcmp
does the right thing on BE but not on LE: encoding numbers in BE is an order-preserving bijection (if the ordering predicate is lexicographic comparison). This is a very useful property if you’re in the business of selling your customers machines that spend a large chunk of their time sorting, searching and retrieving records, and likely one of the reasons why IBM’s architectures dating back as far as the mainframe era are big-endian. (It doesn’t make sense to speak of “endianness” before the IBM 7030, since that was the machine that introduced byte-addressable memory in the first place; machines before that point had word-based memory). It’s still common to encode numbers in BE for databases and key-value stores, even on LE architectures.
All LE architectures I’m aware of have the LSB as bit 0 and number both bits and bytes in order of increasing significance. Thus, bit and byte order agree: byte 0 of a number on a LE machine stores bits 0-7, byte 1 stores bits 8-15 and so forth. (Assuming 8-bit bytes, that is.)
BE has two schools. First, there’s “Motorola style”, which is bits numbered with LSB=0 and from then on in increasing order of significance; but at the same time, byte 0 is the most significant byte, and following bytes decrease in significance. So by these conventions, a 16-bit number would store bits 8-15 in byte 0, and bits 0-7 in byte 1. As you can see, there’s a mismatch between byte order and bit order.
Finally, there’s BE “IBM style”, which instead labels the MSB as bit 0. As the bit number increases, they decrease in significance. In this scheme, same as in LE, byte 0 stores bits 0-7 of a number, byte 1 stores bits 8-15, and so forth; these bytes are exactly reversed compared to the LE variant, but bit and byte ordering are in agreement again.
That said, referring to the MSB as bit 0 is confusing in other ways; people normally expect bit 0 to have mask 1 << 0
, and with MSB-first bit numbering that’s not the case.
For LE, the 8/16/32-bit prefixes of a 64-bit number all start at the same address as the number itself. This can be viewed as either an advantage (“it’s convenient!”) or a disadvantage (“it hides bugs!”).
A LE load of 8/16/32/64 bits will always put all source bits at the same position in the destination register; as you make the load wider, it will just zero-clear (or sign-extend) less of them. Flow of data through the load/store circuitry is thus essentially the same regardless of operand size; different AND masks corresponding to the load size, but that’s it.
For BE, prefixes start at different offsets. Again, can be viewed as either an advantage (“it prevents bugs!”) or a disadvantage (“I can’t transparently widen fields after the fact!”).
A BE load of 8/16/32/64 bits puts the source bits in different locations in the destination register; instead of a width-dependent mask, we get a width-dependent shift. In a circuit, this is a Mux of differently-shifted versions of the source operand, which is (very slightly) more complicated than the masking for LE. (Not that I actually think anyone cares about HW complexity at that level today, or has in over a decade for that matter.)
That said, the difference can be slightly interesting if you don’t have a full complement of differently-sized loads; the Cell SPUs are an example. If you don’t have narrow loads, BE is hit a bit more than LE is. A synthesized LE narrow load is wide_unaligned_load(addr) & mask
(where the wide unaligned load might itself consist of multiple steps, like it does on the SPUs); synthesized BE narrow load is wide_unaligned_load(addr + offs) & mask
. Note the extra add of a non-zero offset, which means one more instruction. You can get rid of it in principle by just having all addresses for e.g. byte-aligned data be pre-incremented by offs
, but that’s obnoxious too.
And that’s it for now, off the top of my head.
A crucial building block for all of this are atomic operations. This has nothing to do with nuclear physics and everything to do with the root of the word atom, the Ancient Greek “ἄτομος” (atomos, “indivisible”). An atomic operation is one that cannot by divided into any smaller parts, or appears to be for the purposes of other cores in the system. To see why this matters, consider what happens when two cores both try to increment a counter at almost the same type, running the equivalent of the C statement counter++;
:
Cycle # | Core 1 | Core 2 |
---|---|---|
0 | reg = load(&counter); | |
1 | reg = reg + 1; | reg = load(&counter); |
2 | store(&counter, reg); | reg = reg + 1; |
3 | store(&counter, reg); |
In compiled code, this single turns into a load operation, a register increment, and finally a store operation (here written in C-esque pseudocode). These three steps are distinct and will execute in sequence (note that on the micro-architectural level, this is true on x86 as well, even though the instruction set architecture actually features a read-modify-write add [memory], value
instruction). And because of this splitting into multiple cycles, it’s possible for Core 2 to read counter
after Core 1 has read it (and started incrementing it), but before it has written the result back. The end result is that, even though both cores ran code to increment the counter, the final value of the counter is only incremented by 1; one of the two increment operations was “lost”.
This problem is exactly what atomic operations are there to prevent; if we use an atomic increment (or more generally, atomic add) instead of a regular increment, the active core will make sure that the three steps above (load, add, store) appear to happen as one operation, hence atomic; no other core is allowed to “peek” while the increment is going on.
Now the question is, how is this done? Conceptually, the easiest way to do this is using a locking mechanism: only one core is allowed to execute an atomic operation at any point in time. The core enters the lock before it starts the operation, and leaves it once the operation is complete. This is what the x86 LOCK
prefix originally used to mean (approximately; I’m simplifying here). Here, the lock enter operation consists of a message on the bus that says “okay, everyone, back off the bus for a bit” (for our purposes, this means “stop doing memory operations”). Then the core that sent that message needs to wait for all other cores to finish memory operations they’re in the middle of doing, after which they will acknowledge the lock. Only after every other core has acknowledged, the core attempting the locked operation can proceed. Finally, once the lock is released, it again needs to send a message to everyone on the bus saying that “all clear, you can resume issuing requests on the bus now”.
This works. It is also incredibly slow. x86 CPUs still support this (or an equivalent), but only as an absolute emergency, when-all-else-fails fallback path; they need a fallback because the x86 ISA permits very dubious constructs like unaligned atomic operations that cross multiple cache lines, for backwards compatibility. Other architectures generally just don’t allow atomic operations on values that aren’t naturally aligned, nor on values that are “too big”. These constraints guarantee that a single atomic operation always takes place within a single cache line. And once we have that, we’re in good shape: as we saw last time when discussing cache coherency protocols, inter-core communication synchronizes memory at cache line granularity anyway, so in principle we can do complex modifications to a single cache line and then publish all changes at once by pushing the new cache line. Moreover, the MESI state machine features two states (M and E, “modified” and “exclusive”) that guarantee exclusive ownership of a cache line by one core – while a cache line is exclusively owned, no other core can “peek”. We can use that as substitute for our locking protocol.
So here’s the punchline: in a MESI (or derived) system, all we need to do to make sure an operation touching a single cache line is atomic is to a) make sure we issue the right memory barriers so memory operations are correctly ordered with reference to the surrounding code (see previous post), b) acquire exclusive ownership of the cache line before we read anything, c) only write back the results if we had exclusive ownership for the entire duration of the atomic operation. This guarantees that no other core saw any half-finished data. There’s multiple ways to accomplish c). For example, we can build hardware to make a limited set of atomic operations complete in a single bus clock cycle; if we have exclusive ownership of our cache line by the start of a cycle, we can have our modified data by the end of it. Since a cache line can’t possibly “change hands” within a cycle, this is fast enough. Depending on the bus protocol, we might also start playing games where we respond to coherency messages immediately, but might send the data a bit late if we’re in the middle of an atomic operation. Finally, we can just decide not to play games with timing at all; instead we implement steps b) and c) directly: any cache line that’s being used for an atomic operation is “monitored”, and if some other core looks at our cache line before the atomic operation is complete, we need to start over. This is what leads to the load-link/store-conditional (LL/SC) operations present in most RISC CPUs.
And by the way, on the bus side (and hence to other cores), properly aligned atomic operations don’t look any different than normal memory updates. Again, all of the processing is internal to the core that does it; other cores neither know nor care whether memory was updated from an atomic compare-and-swap operation bracketed by memory barriers or a relaxed store.
This all sounds nice and simple, and conceptually it is, but the devil’s in the details. The bad news is that if you’re a CPU architect, every single detail of this process is of crucial importance; your internal implementation of memory operations needs to avoid starvation (every core that wants to gain exclusive access to a cache line should be able to do so, eventually, no matter what the other cores are doing), and make sure that it’s possible to implement certain fundamental atomic operations without deadlocks or livelocks. It sounds obvious, but these guarantees are not automatic for low-level primitives like atomic operations or LL/SC. This stuff is hard to get right, and CPUs need to have an implementation that’s not just correct, it also needs to be fast for “typical cases” (whatever they are). The good news is that if you’re not working at a company designing CPUs, none of this is your problem, and you can rest assured that somebody else has thought this through, backed up by an army of validation engineers trying very hard to find test cases that break it.
Back on the SW side, let’s assume we’re on a typical CPU architecture and are running code on multiple cores. What’s the cost of a memory operation in that environment? It breaks down into several components:
Execution. Executing a memory operation isn’t free. Suppose for now that only one core is active and running single-threaded code; even then, memory access is still complicated. Programs deal with virtual addresses, but coherent caches and memory buses normally deal exclusively in physical memory addresses. So every memory operation first starts with a virtual to physical address conversion (these are itself cached in the what’s commonly called Translation Lookaside Buffer or TLB). If you’re unlucky, that virtual address isn’t currently mapped to physical memory and needs to be brought in from storage; whenever this happens, the OS is going to schedule another thread on the active core for a while, since IO takes a long time in processor terms. But let’s assume that doesn’t happen here.
With the physical address known, the operation starts to go through the memory hierarchy. For example, to complete a memory load, the relevant data normally needs to be brought into the L1 cache first. If it’s not there yet, this can be a multi-step process that – in the worst case – involves a real memory access and then populating all the intermediate cache levels for the relevant cache line. With poor (i.e. not nicely localized) memory access patterns, waiting for cache levels to get populated is one of the main ways a CPU core spends its time. But for now, let’s assume that doesn’t happen (too often) either.
So how fast can we run memory operations if everything goes well? Pretty fast, it turns out. Usually at least one operation (loads/stores) completed per clock cycle, often more than that. Reasonably cache-friendly code will complete billions of memory operations per second on a single 3GHz core.
Memory barriers and atomic read-modify-write operations. For the next step, let’s suppose we’re running code that’s intended for multi-threaded operation, but we’re still doing so on only a single core. As a result, we will now see memory barriers and atomic operations, but no actual interference from another core; let’s just suppose that all relevant cache lines are already exclusively held by our own core. In that situation, how expensive is, say, updating a reference count using an atomic integer addition?
Well, that really depends on the CPU core. In general, micro-architectures with aggressive reordering of memory operations have more expensive memory barriers and atomic operations than those with only slight reordering or in-order execution of memory operations. For example, incrementing a reference count using LOCK INC [mem]
on an Intel Atom core (an in-order design) has essentially the same cost as a regular INC [mem]
instruction, and somewhat more complicated atomic operations like exchange or exchange-add end costing about 2x to 3x as much as a “vanilla” memory read-modify-write instruction. By contrast, on Intel and AMD’s out-of-order x86 desktop parts, an atomic increment has about 10x-25x the cost of the non-atomic version; that’s the cost of ensuring proper memory ordering. And again, to reiterate: this is still on code that is executing single-threaded. There’s no actual cross-core communication going on yet; this extra cost is incurred purely within a single core, to make the code safe for multi-core execution.
Bus traffic and cache coherency. Some percentage of memory accesses actually misses the cache and goes straight to memory; and once cache line that haven’t been used in a while get evicted, we start getting write-backs. All these events cause bus traffic (and memory traffic). Bus and memory bandwidth is a limited resource; as we start saturating their capacities, things start to get slower.
Moreover, once we switch to running multiple threads of our program on multiple cores, we actually start getting cache coherency traffic, as the cores continually synchronize their respective views of memory. If every thread works on its own independent memory region, this doesn’t really do much; if a given region of memory is only used by one core, then there’s no sharing, and getting exclusive ownerships of one of the corresponding cache lines is easy and doesn’t cause any invalidations elsewhere.
By contrast, if two or more cores frequently access the same cache lines, then these cache lines need to be kept synchronized. Updates to one of these cache lines require exclusive ownership, which means all other cores need to invalidate their copies of that cache line first. As a result, the next time that cache line is accessed by another core, its contents need to be sent across the bus. So we get both extra cache misses (on the remote core) and extra bus traffic. This phenomenon of multiple cores hitting a cache line that is being updated regularly is called “cache (line) contention”, and it is probably the easiest way to make parallel code in shared-memory environments crawl.
To get cache line contention, we need multiple cores frequently accessing the same cache line, with at least some of these regular accesses being writes. Private data (cache lines only accessed by one thread) is never a problem; neither is immutable data (written once and then not modified until the end of its lifetime). What’s tricky is data that is both shared and mutable: doing so requires a lot of communication to maintain a consistent (as per the constraints of the memory model) view of memory between cores, and communication is expensive – and only keeps getting more so as more parties get involved.
How much more expensive are we talking? I wrote a test (for x86/Windows) a few weeks ago. This test is by no means user-friendly or easy to read, and it assumes a quad-core CPU with simultaneous multi-threading, or a similar topology, with at least 8 logical processors. Here’s the gist of it: as explained above, replacing a read-modify-write add of a value in memory with an atomic “add” operation generally makes it about 10x-25x as expensive (how much exactly depends on the micro-architecture). If you need a rule of thumb, just assume about 10x (good enough for Fermi estimation).
Once there is a second core reading that cache line, though, the cost explodes. If there’s another core generating lots of read traffic on the cache line in a tight loop, the atomic add gets more expensive – much more expensive: typically, 4x to 6x more (that’s on top of the ~10x hit we take from using an atomic add to begin with!). And this cost only goes up if there are more readers, and even more so if there are other writers too. Now, please don’t take these values as gospel; this is a completely synthetic benchmark that doesn’t do anything useful. The actual cost of coherency traffic very much depends on context. All I want to do here is give you a very rough feel for the cost of coherency traffic and the communication it does (namely: it’s not negligible).
Some of this communication is not actually necessary. For example, because cache coherency is tracked at cache line granularity, it’s possible to get lots of bogus coherency traffic simple because the different types of data – immutable, private and shared – are intermingled within the same cache line (or similarly, because one cache line contains private data for multiple threads). This is called “false sharing“. Luckily, this kind of problem is fairly easy to find with a profiler, and also relatively straightforward to fix by either reordering the data in memory (possibly while adding padding to make sure two different kinds of data don’t end up in the same cache line) or just removing some of the offending data entirely. My older post “Cores don’t like to share” gives an example.
What’s left over after this process is “real” contention – contended access to shared data. This includes both actual shared mutable data structures and certain kinds of metadata such as locks and other synchronization objects. Exactly how well this works depends on the precise layout of data in memory, as well as the operations used to access it.
In general, the way to get scalable multi-processor code is to avoid contention as much as possible, and to make whatever contention remains pass quickly – in that order. And to do a decent job at this, it’s important to know how cache coherency works (in broad strokes, anyway), what kind of messages cores exchange to maintain memory coherency, and when that coherency traffic happens. Now that we have those basics covered, we can look at somewhat higher levels of the stack. This post is long enough already, but in the next post, I’m planning to have a look at locks and lock-free data structures, and discuss some of the trade-offs. Until then, take care!
This is a whirlwhind primer on CPU caches. I’m assuming you know the basic concept, but you might not be familiar with some of the details. (If you are, feel free to skip this section.)
In modern CPUs (almost) all memory accesses go through the cache hierarchy; there’s some exceptions for memory-mapped IO and write-combined memory that bypass at least parts of this process, but both of these are corner cases (in the sense that the vast majority of user-mode code will never see either), so I’ll ignore them in this post.
The CPU core’s load/store (and instruction fetch) units normally can’t even access memory directly – it’s physically impossible; the necessary wires don’t exist! Instead, they talk to their L1 caches which are supposed to handle it. And about 20 years ago, the L1 caches would indeed talk to memory directly. At this point, there’s generally more cache levels involved; this means the L1 cache doesn’t talk to memory directly anymore, it talks to a L2 cache – which in turns talks to memory. Or maybe to a L3 cache. You get the idea.
Caches are organized into “lines”, corresponding to aligned blocks of either 32 (older ARMs, 90s/early 2000s x86s/PowerPCs), 64 (newer ARMs and x86s) or 128 (newer Power ISA machines) bytes of memory. Each cache line knows what physical memory address range it corresponds to, and in this article I’m not going to differentiate between the physical cache line and the memory it represents – this is sloppy, but conventional usage, so better get used to it. In particular, I’m going to say “cache line” to mean a suitably aligned group of bytes in memory, no matter whether these bytes are currently cached (i.e. present in any of the cache levels) or not.
When the CPU core sees a memory load instruction, it passes the address to the L1 data cache (or “L1D$”, playing on the “cache” being pronounced the same way as “cash”). The L1D$ checks whether it contains the corresponding cache line. If not, the whole cache line is brought in from memory (or the next-deeper cache level, if present) – yes, the whole cache line; the assumption being that memory accesses are localized, so if we’re looking at some byte in memory we’re likely to access its neighbors soon. Once the cache line is present in the L1D$, the load instruction can go ahead and perform its memory read.
And as long as we’re dealing with read-only access, it’s all really simple, since all cache levels obey what I’ll call the
Basic invariant: the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses, at all times.
Things gets a bit more complicated once we allow stores, i.e. memory writes. There’s two basic approaches here: write-through and write-back. Write-through is the easier one: we just pass stores through to the next-level cache (or memory). If we have the corresponding line cached, we update our copy (or maybe even just discard it), but that’s it. This preserves the same invariant as before: if a cache line is present in the cache, its contents match memory, always.
Write-back is a bit trickier. The cache doesn’t pass writes on immediately. Instead, such modifications are applied locally to the cached data, and the corresponding cache lines are flagged “dirty”. Dirty cache lines can trigger a write-back, at which points their contents are written back to memory or the next cache level. After a write-back, dirty cache lines are “clean” again. When a dirty cache line is evicted (usually to make space for something else in the cache), it always needs to perform a write-back first. The invariant for write-back caches is slightly different.
Write-back invariant: after writing back all dirty cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses.
In other words, in write-back caches we lose the “at all times” qualifier and replace it with a weaker condition: either the cache contents match memory (this is true for all clean cache lines), or they contain values that eventually need to get written back to memory (for dirty cache lines).
Write-through caches are simpler, but write-back has some advantages: it can filter repeated writes to the same location, and if most of the cache line changes on a write-back, it can issue one large memory transaction instead of several small ones, which is more efficient.
Some (mostly older) CPUs use write-through caches everywhere; some use write-back caches everywhere; some have a simpler write-through L1$ backed by a write-back L2$. This may generate redundant traffic between L1$ and L2$ but gets the write-back benefits for transfers to lower cache levels or memory. My point being that there’s a whole set of trade-offs here, and different designs use different solutions. Nor is there a requirement that cache line sizes be the same at all levels – it’s not unheard-of for CPUs to have 32-byte lines in L1$ but 128-byte lines in L2$ for example.
Omitted for simplicity in this section: cache associativity/sets; write-allocate or not (I described write-through without write-allocate and write-back with, which is the most common usage); unaligned accesses; virtually-addressed caches. These are all things you can look up if you’re interested, but I’m not going to go that deep here.
As long as that single CPU core is alone in the system, this all works just fine. Add more cores, each with their own caches, and we have a problem: what happens if some other core modifies data that’s in one of our caches?
Well, the answer is quite simple: nothing happens. And that’s bad, because we want something to happen when someone else modifies memory that we have a cached copy of. Once we have multiple caches, we really need to keep them synchronized, or we don’t really have a “shared memory” system, more like a “shared general idea of what’s in memory” system.
Note that the problem really is that we have multiple caches, not that we have multiple cores. We could solve the entire problem by sharing all caches between all cores: there’s only one L1$, and all processors have to share it. Each cycle, the L1$ picks one lucky core that gets to do a memory operation this cycle, and runs it.
This works just fine. The only problem is that it’s also slow, because cores now spend most of their time waiting in line for their next turn at a L1$ request (and processors do a lot of those, at least one for every load/store instruction). I’m pointing this out because it shows that the problem really isn’t so much a multi-core problem as it is a multi-cache problem. We know that one set of caches works, but when that’s too slow, the next best thing is to have multiple caches and then make them behave as if there was only one cache. This is what cache coherency protocols are for: as the name suggests, they ensure that the contents of multiple caches stay coherent.
There are multiple types of coherency protocols, but most computing devices you deal with daily fall into the category of “snooping” protocols, and that’s what I’ll cover here. (The primary alternative, directory-based systems, has higher latency but scales better to systems with lots of cores).
The basic idea behind snooping is that all memory transactions take place on a shared bus that’s visible to all cores: the caches themselves are independent, but memory itself is a shared resource, and memory access needs to be arbitrated: only one cache gets to read data from, or write back to, memory in any given cycle. Now the idea in a snooping protocol is that the caches don’t just interact with the bus when they want to do a memory transaction themselves; instead, each cache continuously snoops on bus traffic to keep track of what the other caches are doing. So if one cache wants to read from or write to memory on behalf of its core, all the other cores notice, and that allows them to keep their caches synchronized. As soon as one core writes to a memory location, the other cores know that their copies of the corresponding cache line are now stale and hence invalid.
With write-through caches, this is fairly straightforward, since writes get “published” as soon as they happen. But if there are write-back caches in the mix, this doesn’t work, since the physical write-back to memory can happen a long time after the core executed the corresponding store – and for the intervening time, the other cores and their caches are none the wiser, and might themselves try to write to the same location, causing a conflict. So with a write-back model, it’s not enough to broadcast just the writes to memory when they happen; if we want to avoid conflicts, we need to tell other cores about our intention to write before we start changing anything in our local copy. Working out the details, the easiest solution that fits the bill and works for write-back caches is what’s commonly called the MESI protocol.
This section is called “MESI and friends” because MESI spawned a whole host of closely related coherency protocols. Let’s start with the original though: MESI are the initials for the four states a cache line can be in for any of the multiple cores in a multi-core system. I’m gonna cover them in reverse order, because that’s the better order to explain them in:
If you compare this to the presentation of write-back caches in the single-core case above, you’ll see that the I, S and M states already had their equivalents: invalid/not present, clean, and dirty cache lines, respectively. So what’s new is the E state denoting exclusive access. This state solves the “we need to tell other cores before we start modifying memory” problem: each core may only write to cache lines if their caches hold them in the E or M states, i.e. they’re exclusively owned. If a core does not have exclusive access to a cache line when it wants to write, it first needs to send an “I want exclusive access” request to the bus. This tells all other cores to invalidate their copies of that cache line, if they have any. Only once that exclusive access is granted may the core start modifying data – and at that point, the core knows that the only copies of that cache line are in its own caches, so there can’t be any conflicts.
Conversely, once some other core wants to read from that cache line (which we learn immediately because we’re snooping the bus), exclusive and modified cache lines have to revert back to the “shared” (S) state. In the case of modified cache lines, this also involves writing their data back to memory first.
The MESI protocol is a proper state machine that responds both to requests coming from the local core, and to messages on the bus. I’m not going to go into detail about the full state diagram and what the different transition types are; you can find more in-depth information in books on hardware architecture if you care, but for our purposes this is overkill. As a software developer, you’ll get pretty far knowing only two things:
Firstly, in a multi-core system, getting read access to a cache line involves talking to the other cores, and might cause them to perform memory transactions.
Writing to a cache line is a multi-step process: before you can write anything, you first need to acquire both exclusive ownership of the cache line and a copy of its existing contents (a so-called “Read For Ownership” request).
And secondly, while we have to do some extra gymnastics, the end result actually does provide some pretty strong guarantees. Namely, it obeys what I’ll call the
MESI invariant: after writing back all dirty (M-state) cache lines, the contents of all cache lines present in any of the cache levels are identical to the values in memory at the corresponding addresses. In addition, at all times, when a memory location is exclusively cached (in E or M state) by one core, it is not present in any of the other core’s caches..
Note that this is really just the write-back invariant we already saw with the additional exclusivity rule thrown in. My point being that the presence of MESI or multiple cores does not necessarily weaken our memory model at all.
Okay, so that (very roughly) covers vanilla MESI (and hence also CPUs that use it, ARMs for example). Other processors use extended variants. Popular extensions include an “O” (Owned) state similar to “E” that allows sharing of dirty cache lines without having to write them back to memory first (“dirty sharing”), yielding MOESI, and MERSI/MESIF, which are different names for the same idea, namely making one core the designated responder for read requests to a given cache line. When multiple cores hold a cache line in Shared state, only the designated responder (which holds the cache line in “R” or “F” state) replies to read requests, rather than everyone who holds the cache line in S state. This reduces bus traffic. And of course you can add both the R/F states and the O state, or get even fancier. All these are optimizations, but none of them change the basic invariants provided or guarantees made by the protocol.
I’m no expert on the topic, and it’s quite possible that there are other protocols in use that only provide substantially weaker guarantees, but if so I’m not aware of them, or any popular CPU core that uses them. So for our purposes, we really can assume that coherency protocols keep caches coherent, period. Not mostly-coherent, not “coherent except for a short window after a change” – properly coherent. At that level, barring hardware malfunction, there is always agreement on what the current state of memory should be. In technical terms, MESI and all its variants can, in principle anyway, provide full sequential consistency, the strongest memory ordering guarantee specified in the C++11 memory model. Which begs the question, why do we have weaker memory models, and “where do they happen”?
Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model.
I won’t go into the details of memory models here; it quickly gets really technical, and is outside the scope of this article. But I do want to talk a bit about “how they happen” – that is, where the weakened guarantees (compared to the full sequential consistency we can get from MESI etc.) come from, and why. And as usual, it all boils down to performance.
So here’s the deal: you will indeed get full sequential consistency if a) the cache immediately responds to bus events on the very cycle it receives them, and b) the core dutifully sends each memory operation to the cache, in program order, and wait for it to complete before you send the next one. And of course, in practice modern CPUs normally do none of these things:
The implication of all these things is that, by default, loads can fetch stale data (if a corresponding invalidation request was sitting in the invalidation queue), stores actually finish later than their position in the code would suggest, and everything gets even more vague when Out of Order execution is involved. So going back to memory models, there are essentially two camps:
Architectures with a weak memory model do the minimum amount of work necessary in the core that allows software developers to write correct code. Instruction reordering and the various buffering stages are officially permitted; there are no guarantees. If you need guarantees, you need to insert the appropriate memory barriers – which will prevent reordering and drain queues of pending operations where required.
Architectures with stronger memory models do a lot more bookkeeping on the inside. For example, x86 processors keep track of all pending memory operations that are not fully finished (“retired”) yet, in a chip-internal data structure that’s called the MOB (“memory ordering buffer”). As part of the Out of Order infrastructure, x86 cores can roll back non-retired operations if there’s a problem – say an exception like a page fault, or a branch mispredict. I covered some of the details, as well as some of the interactions with the memory subsystem, in my earlier article “Speculatively speaking“. The gist of it is that x86 processors actively watch out for external events (such as cache invalidations) that would retroactively invalidate the results of some of the operations that have already executed, but not been retired yet. That is, x86 processors know what their memory model is, and when an event happens that’s inconsistent within that model, the machine state is rolled back to the last time when it was still consistent with the rules of the memory model. This is the “memory ordering machine clear” I covered in yet another earlier post. The end result is that x86 processors provide very strong guarantees for all memory operations – not quite sequential consistency, though.
So, weaker memory models make for simpler (and potentially lower-power) cores. Stronger memory models make the design of cores (and their memory subsystems) more complex, but are easier to write code for. In theory, the weaker models allow for more scheduling freedom and can be potentially faster; in practice, x86s seem to be doing fine on the performance of memory operations, for the time being at least. So it’s hard for me to call a definite winner so far. Certainly, as a software developer I’m happy to take the stronger x86 memory model when I can get it.
Anyway. That’s plenty for one post. And now that I have all this written up on my blog, the idea is that future posts can just reference it. We’ll see how that goes. Thanks for reading!
Some questions.
General: You’re implicitly assuming there’s lots of different types of resource tables. Why?
UPDATE: It seems like Timothy is talking about D3D12-esque Descriptor Tables. Well, I’m not :), so yeah, kinda talking apart here, but here’s my response anyway.
Essentially, GL bindless just means “one big resource table that contains everything”. And you could certainly implement it as such: one large resource table (multi-megabyte, potentially) that contains descriptor for absolutely everything currently live, with the handles being 32-bit offsets from the base address; nor do I see an intrinsic need for resource tables to be homogeneous in terms of either resource type or update frequency (it might be advantageous in certain cases, but I don’t see why it would be *required*).
NV GL bindless: what’s the handle values in “TEX handle
“? Presumably, not a 7-bit register index. I’m assuming that it boils down to a “[bindlessTable + handle]
” addressing mode, either explicit (i.e. this is a memory load) or implicit (HW is aware of handles and has a dedicated fetch path for them)? I couldn’t see any details in the documents you linked. Anyway, presumably there’s still resource descriptors somewhere. Where is that “somewhere”? How does do the HW know where to get them from?
NV with resource tables: why are you billing the handle table address LDC per texture sample? Unless there’s very high register pressure, you would only load the resource table pointers once per table, not once per resource access. And given my previous comment about “I don’t see why you need tons of these”, that’s really a constant per-shader cost of a few regs and a few loads; it’s not per texture sample at all.
All AMD: you can’t load the texture/sampler descriptors directly, you still need to know where to load them from (unless you use little enough resources that you can squeeze all descriptors into the 16 scalars). There’s at least one more scalar load to get a base pointer to the resource descriptors for the active draw call. Like the resource table pointer loads I just mentioned, this is amortized (just do that load once and leave the scalar reg around).
For GL bindless mode, you would presumably use a single global descriptor table that all your handles point into, and would preload the base address to that table using the 16 scalars you get to set per draw. This makes the basic model:
AMD DX “bindful”:
// sBatchDesc is one of the 16 pre-init S_LOAD_DWORD_X4 sTexDesc, sBatchDesc(16) // random offset S_LOAD_DWORD_X4 sSmpDesc, sBatchDesc(48) // ... SAMPLE ...
AMD GL “bindful”:
S_LOAD_DWORD_X8 sTexAndSmp, sBatchDesc(32) SAMPLE ...
AMD GL bindless:
// sBindlessTable also one of the 16 pre-init S_BUFFER_LOAD_DWORD sTexHandle, sConstantBuf(12) S_LOAD_DWORD_X8 sTexAndSmp, sBindlessTable(sTexHandle) SAMPLE ...
(explicit handle load here, “bindful” is cheaper on the shader side but pays that cost elsewhere by having to set up the per-batch descriptors)
and
AMD resource tables, separate tex/smp.
// let's say we have 2, sResourceTableImmutable // and sResourceTableDyn, both pre-init S_LOAD_DWORD sTexHandle, sBatchDesc(24) S_LOAD_DWORD_X4 sTexDesc, sResourceTableDyn(sTexHandle) S_LOAD_DWORD sSmpHandle, sBatchDesc(28) S_LOAD_DWORD_X4 sSmpDesc, sResourceTableImmutable(sSmpHandle) SAMPLE ...
or we could have a combined texture/sampler handle if we wanted, etc.
The big difference here is that the batch descriptors (sBatchDesc) here only hold texture handles (=offsets into their respective resource tables) not descriptors themselves. In the “bindful” case, there’s a conceptual global table too (the bind points), but it keeps changing between batches, which makes things tricky, and forces you to deal with multiple live versions at the same time and/or copy descriptors around. The resource table model (which bindless is a special case of!) has them stay constant and immutable over the lifetime of a command buffer which gets rid of that cost. GL bindless is similar.
Another note on the AMD resource table version: I’m loading texture/sampler handlers here separately for clarity, but that’s not at all required. A shader that references multiple contigous slots from its sBatchDesc can (and probably should) fuse them into larger pow2-sized S_LOAD_DWORD
s. In my example, you could just as well do:
AMD resource tables + “fusing”:
// sTexHandle = sHandles, sSmpHandle = sHandles + 1 (register IDs) S_LOAD_DWORD_X2 sHandles, sBatchDesc(24) S_LOAD_DWORD_X4 sTexDesc, sResourceTableDyn(sTexHandle) S_LOAD_DWORD_X4 sSmpDesc, sResourceTableImmutable(sSmpHandle) SAMPLE ...
So really, not all that much (cost) difference between the different “bindless-esque” approaches here that I can see.
This is not that post, sadly. This is about something else: namely, me only citing Jarek Duda’s work on ANS and not Yann Collet‘s work on FSE. Apparently, there were multiple versions of Jarek’s ANS paper, and the second version (which contains rANS, the topic I’ve been writing about) was significantly influenced by Yann’s experiences with integrating ideas from the first version into FSE.
Anyway, I did not mention Yann’s work in my rANS posts at all. I just want to make clear that this was because I was writing about rANS not tANS (the family that FSE is a member of), and I simply wasn’t aware that Yann’s work and input significantly influenced the second version of the ANS paper. My apologies; this was a simple oversight, not a deliberate attempt to talk down Yann’s contribution!
On a separate but related note: Mid-february, I wrote a short paper on how entropy coders (with a focus on ANS) can be interleaved on the encode side to allow the decoder side to exploit instruction-level parallelism and/or SIMD instructions. I originally meant to do a separate post about it here, but on trying to write it discovered that I didn’t have much to say on the topic that wasn’t in the paper. Hence, no separate blog post. But I figured I should at least link to it once from here.
Anyway, more regular blog updates should start again soon. Until then!