Much has been written about all the myriad ways to go wrong when writing software. Poor management; scope creep; too little structure, not modular enough, and it’s a “big ball of mud”. Too much (or too rigid) and it’s a “software crystal”, impossible to alter. And so on.

Suppose you get all that right and actually ship a useful system to users: it solves their problems well enough, and the code is reasonably clean, with a sound design and a modular structure whose interfaces, while not perfect, work okay. That’s about as good as it gets! Well done.

Alas, you’re not out of the woods. Success has its own failure modes, and I want to talk about one in particular that affects modular designs.

The arguments for modularity are well known: separating concerns breaks large systems down into smaller constituent parts that can be understood individually, with clearly-defined interfaces between them. Ideally, modules are designed so they can be developed and tested in isolation, and if an individual module is found wanting (say it’s unreliable, faulty or there are simply better solutions available), it can be replaced with another module provided it has the same interface.

And there really are systems like that, where the interfaces are rigid and well-specified, components come only in a handful of “shapes”, and everything cleanly fits together, like Lego bricks. But more commonly, shipping systems look more like a wall built from irregular natural rocks (prepare for an extended metaphor):

The modules have irregular shapes and irregular sizes. Some are big, some are quite small. Some closely align with their neighbors; others have big gaps between them. They add up to a coherent whole, but it’s clear that for most of the development time, none of these components really had to have any particular shape. Occasionally you need a small piece with a specific shape to fill a gap, but for the most part, you just work with the materials you have.

The result is still “modular”; it’s built out of smaller pieces, each with their own clearly defined boundaries. But it’s not very regular, and outright weird in some places. That chipped corner on one of the bottom pieces was just an early mistake, but it made for a good place to stick that one flat rock on and somehow that ended up being one of the primary supports for the whole thing. And while building that wall, “I need a rock, about this big” was the only constraint you really had, and you just sort of piled it on. But when repairing it after one of the pieces has been damaged, working out the right shape, finding a replacement that meets that description and getting it in place is really tricky, fiddly work. (End of extended metaphor.)

Know any systems like that? I certainly do. And the end result is what I hereby dub a “modulith” (I am sure this has been observed and named before, but I haven’t seen it elsewhere yet). Made out of small, distinct, cleanly separable pieces, but still, everything but the topmost layer is actually kind of hard to disentangle from the rest, due to a myriad of small interactions with everything surrounding it. Because once you use a module as a building block for something else, there’s a disturbing tendency for all of its remaining quirks and bugs to effectively become part of the spec, as other modules (implicitly or explicitly) start to rely on them.

This is related to, but distinct from, other concepts such as software entropy and technical debt, which primarily deal with effects within a single codebase over time. Here we are dealing with something slightly different: as a particular component is successfully used or re-used (in unmodified form!), the users of said code tend to end up relying (often inadvertently) on various unspecified or underspecified behaviors, implicitly assuming a stronger contract than the component is actually supposed to provide. At that point, your choices are to either make those assumed behaviors actually contractual (not breaking existing code at the cost of severely constraining future evolution of said component), or to fix all users that make stronger assumptions than what is guaranteed (easier said than done if the component in question is popular; often causes ripple effects that break yet more code).

Either way, I don’t have any good solutions, but I’m feeling whimsical and haven’t seen this exact problem described before, so I’m naming it. In the extremely likely case that this has already been described and named by someone else, I’d appreciate a reference!

My colleague Charles Bloom recently announced Oodle LZNA (a new compressor for RAD’s Oodle compression library), which usually beats LZMA on compression while being *much* faster to decode. The major innovation in LZNA is switching the coding back-end from bit-wise adaptive modeling to models with larger alphabets. With traditional arithmetic coders, you take a pretty big speed hit going from binary to larger alphabets, and you end up leaning heavily on integer division (usually one divide per symbol), which is the neglected bastard stepchild of every integer instruction set. But as of last year, we have the ANS family of coders, which alters the cost landscape considerably. Most interesting for use with adaptive models is rANS, which is significantly cheaper than conventional large-alphabet arithmetic coding (and without divisions in the decoder), but can still code directly from a cumulative probability distribution, just like other arithmetic coders. In short, rANS speeds up non-binary alphabets enough that it makes sense to revisit adaptive modeling for larger alphabets in practical coder design, something that’s been pretty much on hold since the late 90s.

In this post, I will try to explain the high-level design behind the adaptive models that underlie LZNA, and why they are interesting. But let’s start with the models they are derived from.

### Adaptive binary models

The canonical adaptive binary model simply keeps count of how many 0s and 1s have been seen:

```c
int counts[2] = { 1, 1 }; // or different init
double prob_for_1 = 0.5;  // used for coding/decoding

void adapt(int bit)
{
    counts[bit]++;
    // divide in floating point; integer division would truncate to 0
    prob_for_1 = (double)counts[1] / (counts[0] + counts[1]);
}
```

This is the obvious construction. The problem with this is that probabilities move quickly after initialization, but very soon start to calcify; after a few thousand symbols, any individual observation barely changes the probabilities at all. This sucks for heterogeneous data, or just data where the statistics change over time: we spend a *long* time coding with a model that’s a poor fit, trying to “unlearn” the previous stats.

A better approach for data with statistics that change over time is to gradually “forget” old stats. The resulting models are much quicker to respond to changed input characteristics, which is usually a decent win in practice. The canonical “leaky” adaptive binary model is, essentially, an exponential moving average:

```c
double prob_for_1 = 0.5;
double f = 1.0 - 1.0 / 32.0; // adaptation rate, usually 1/pow2

void adapt(int bit)
{
    if (bit == 0)
        prob_for_1 *= f;
    else
        prob_for_1 = prob_for_1 * f + (1.0 - f);
}
```

This type of model goes back to Howard and Vitter in “Practical implementations of arithmetic coding” (section 3.4). It is absolutely everywhere, albeit usually implemented in fixed point arithmetic and using shifts instead of multiplications. That means you will normally see the logic above implemented like this:

```c
enum { scale_factor = 4096 }; // .12 fixed point
int prob_for_1_scaled = scale_factor / 2;

void adapt(int bit)
{
    if (bit == 0)
        prob_for_1_scaled -= prob_for_1_scaled >> 5;
    else
        prob_for_1_scaled += (scale_factor - prob_for_1_scaled) >> 5;
}
```

This *looks* like it’s a straightforward fixed-point version of the code above, but there’s a subtle but important difference: the floating-point version, when evaluated in real-number arithmetic, can have `prob_for_1` get arbitrarily close to 0 or 1. The fixed-point version cannot; when `prob_for_1_scaled` ≤ 31, it cannot shrink further, and likewise, it cannot grow past `scale_factor` – 31. So the fixed-point version (with a scale factor of 4096 and a right-shift of 5) will keep scaled probabilities in the range [31,4065], corresponding to about [0.0076, 0.9924].

Note that with a shift of 5, we always stay 31 steps away from the top and bottom end of the interval – *independent of what our scale factor is*. That means that, somewhat counter-intuitively, both the scale factor *and* the adaptation rate determine what the minimum and maximum representable probabilities are. In compression terms, the clamped minimum and maximum probabilities are equivalent to mixing a uniform model into a “real” (infinite-precision) Howard/Vitter-style adaptive model, with low weight. Accounting for this fact is important when trying to analyze (and understand) the behavior of the fixed-point models.
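The clamping range is easy to check empirically. Here is a minimal sketch that feeds a long run of identical bits into the fixed-point model and reports where the probability gets stuck. The helper names `adapt_fixed` and `saturate` are made up for this illustration, and the scale factor and shift are passed as parameters so both can be varied.

```c
// One update step of the fixed-point leaky model, with the scale factor
// and shift as parameters. Helper names are illustrative only.
static int adapt_fixed(int prob_scaled, int bit, int scale, int shift)
{
    if (bit == 0)
        prob_scaled -= prob_scaled >> shift;
    else
        prob_scaled += (scale - prob_scaled) >> shift;
    return prob_scaled;
}

// Start at probability 1/2, feed `count` copies of `bit`, and return the
// scaled probability the model ends up at.
static int saturate(int bit, int count, int scale, int shift)
{
    int p = scale / 2;
    for (int i = 0; i < count; ++i)
        p = adapt_fixed(p, bit, scale, shift);
    return p;
}
```

With a shift of 5, `saturate(0, 10000, 4096, 5)` stops at 31 and `saturate(1, 10000, 4096, 5)` at 4065. Switching to a 16-bit scale factor of 65536 leaves the distance from either end at 31 steps, so the smallest representable probability drops to 31/65536 without the adaptation rate changing at all.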

### A symmetric formulation

Implementations of binary adaptive models usually only track one probability; since there are two symbols and the probabilities must sum to 1, this is sufficient. But since we’re interested in models for larger than binary alphabets, let’s aim for a more symmetric formulation in the hopes that it will generalize nicely. To do this, we take the probabilities p_{0} and p_{1} for 0 and 1, respectively, and stack them into a column vector **p**:

$$\mathbf{p} = \begin{pmatrix} p_0 \\ p_1 \end{pmatrix}$$

Now, let’s look at what the Howard/Vitter update rule produces for the updated vector **p**′ when we see a 0 bit (this is just plugging in the update formula from the floating-point leaky model above and using p_{1} = 1 – p_{0}):

$$\mathbf{p}' = \begin{pmatrix} 1 - f p_1 \\ f p_1 \end{pmatrix} = \begin{pmatrix} f p_0 + (1 - f) \\ f p_1 \end{pmatrix} = f \mathbf{p} + (1 - f) \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$

And for the “bit is 1” case, we get:

$$\mathbf{p}' = \begin{pmatrix} 1 - (f p_1 + (1 - f)) \\ f p_1 + (1 - f) \end{pmatrix} = \begin{pmatrix} f p_0 \\ f p_1 + (1 - f) \end{pmatrix} = f \mathbf{p} + (1 - f) \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

And just like that, we have a nice symmetric formulation of our update rule: when the input bit is *i*, the updated probability vector is

$$\mathbf{p}' = f \mathbf{p} + (1 - f) \mathbf{e}_i$$

where the **e**_{i} are the canonical basis vectors. Once it’s written in this vectorial form, it’s obvious how to generalize this adaptation rule to a larger alphabet: just use a larger probability vector! I’ll go over the implications of this in a second, but first, let me do a brief interlude for those who recognize the shape of that update expression.

### Aside: the DSP connection

Interpreted as a discrete-time system

$$\mathbf{p}_{k+1} = f \mathbf{p}_k + (1 - f) \mathbf{e}_{s_k}$$

where the s_{k} denote the input symbol stream, note that we’re effectively just running a multi-channel IIR filter here (one channel per symbol in the alphabet). Each “channel” is simply linearly interpolating between its previous state and the new input value – it’s a discrete-time leaky integrator, a 1st-order low-pass filter.

Is it possible to use other low-pass filters? You bet! In particular, one popular variant on the fixed-point update rule has two accumulators with different update rates and averages them together. The result is equivalent to a 2nd-order (two-pole) low-pass filter. Its impulse response, corresponding to the weight of symbols over time, initially decays faster but has a longer tail than the 1st-order filter.
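For reference, here is a rough sketch of what such a two-accumulator model could look like in fixed point. The shifts (4 and 7), the .12 scale and all names are assumptions picked for illustration, not values taken from any particular coder.

```c
// Two fixed-point leaky accumulators with different adaptation rates.
// Scale and shifts are illustrative assumptions.
enum { TWO_RATE_SCALE = 4096 };

typedef struct {
    int fast; // adapts quickly, forgets quickly
    int slow; // adapts slowly, remembers longer
} two_rate_model;

static void two_rate_init(two_rate_model *m)
{
    m->fast = m->slow = TWO_RATE_SCALE / 2;
}

// Same update as the single-accumulator fixed-point model.
static int leaky_step(int p, int bit, int shift)
{
    if (bit == 0)
        return p - (p >> shift);
    return p + ((TWO_RATE_SCALE - p) >> shift);
}

static void two_rate_adapt(two_rate_model *m, int bit)
{
    m->fast = leaky_step(m->fast, bit, 4);
    m->slow = leaky_step(m->slow, bit, 7);
}

// Scaled probability of a 1 used for coding: the average of both.
static int two_rate_prob(const two_rate_model *m)
{
    return (m->fast + m->slow) >> 1;
}
```

The fast accumulator dominates the response to recent symbols, while the slow one supplies the long tail.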

And of course you can use FIR low-pass filters too: for example, the incremental box filters described in my post Fast blurs 1 correspond to a “sliding window” model. This approach readily generalizes to more general piecewise-polynomial weighting kernels using the approach described in Heckbert’s “Filtering by repeated integration”. I doubt that these are actually useful for compression, but it’s always nice to know you have the option.

Can we use arbitrary low-pass filters? Alas, we can’t: we absolutely need linear filters with unit gain (so that the sum of all probabilities stays 1 exactly), and furthermore our filters need to have non-negative impulse responses, since we’re dealing with probabilities here, and probabilities need to stay non-negative. Still, we have several degrees of freedom here, and at least in compression, this space is definitely under-explored right now. I’m sure some of it will come in handy.

### Practical non-binary adaptive models

As noted before, the update expression we just derived

$$\mathbf{p}' = f \mathbf{p} + (1 - f) \mathbf{e}_i$$

quite naturally generalizes to non-binary alphabets: just stack more than two probabilities into the vector **p**. But it also generalizes in a different direction: note that we’re just linearly interpolating between the old model **p** and the “model” for the newly observed symbol, as given by **e**_{i}. We can change the model we’re mixing in if we want different update characteristics. For example, instead of using a single spike at symbol i (as represented by **e**_{i}), we might decide to slightly boost the probabilities for adjacent values as well.

Another option is to blend between the “spike” and a uniform distribution with a given weight *u*:

$$\mathbf{p}' = f \mathbf{p} + (1 - f) \left( (1 - u)\, \mathbf{e}_i + \frac{u}{n} \mathbf{1} \right)$$

where **1** is the vector consisting of all-1s and *n* is the number of symbols in the alphabet (so **1**/*n* is the uniform distribution). This is a way to give us the clamping of probabilities at some minimum level, like we get in the fixed-point binary adaptive models, but in a much more controlled fashion; no implicit dependency on the scaling factor in this formulation! (Although of course an integer implementation will still have some rounding effects.) Note that, for a fixed value of *u*, the entire right-hand side of that expression only depends on *i* and does not care about the current value of **p** at all.
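To make the vector form concrete, here is a small floating-point sketch of this update for an n-symbol alphabet. It assumes the uniform distribution is the all-1s vector divided by n; the function name `adapt_vector` and its parameter layout are made up for illustration. Setting `u` to 0 gives the plain “spike” update.

```c
#include <stddef.h>

// p' = f*p + (1-f) * ((1-u)*e_sym + (u/n)*1), in floating point.
// p holds n probabilities summing to 1; sym is the observed symbol.
static void adapt_vector(double *p, size_t n, size_t sym, double f, double u)
{
    double uniform_part = (1.0 - f) * u / (double)n;
    for (size_t i = 0; i < n; ++i)
        p[i] = f * p[i] + uniform_part;  // decay, plus the uniform share
    p[sym] += (1.0 - f) * (1.0 - u);     // the spike for the observed symbol
}
```

Starting from a uniform 4-symbol model and using f = 0.5, seeing symbol 2 with u = 0 gives (0.125, 0.125, 0.625, 0.125), while u = 0.5 gives (0.1875, 0.1875, 0.4375, 0.1875). Both still sum to 1.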

Okay, so we can design a bunch of models using this on paper, but is this actually practical? Note that, in the end, we will still want to get fixed-point integer probabilities out of this, and we would greatly prefer them to all sum to a power of 2, because then we can use rANS. This is where my previous post “Mixing discrete probability distributions” comes in. And this is where I’ll stop torturing you and skip straight to the punchline: a working adaptation step, with pre-computed mix-in models (depending on *i* only), can be written like this:

```c
int rate = 5; // use whatever you want!
int CDF[nsyms + 1]; // CDF[nsyms] is pow2
int mixin_CDFs[nsyms][nsyms + 1];

void adapt(int sym)
{
    // no need to touch CDF[0] and CDF[nsyms]; they always stay
    // the same.
    int *mixin = mixin_CDFs[sym];
    for (int i = 1; i < nsyms; ++i)
        CDF[i] += (mixin[i] - CDF[i]) >> rate;
}
```

which is a pretty sweet generalization of the binary model update rule. As per “Mixing discrete probability distributions”, this has several non-obvious nice properties. In particular, if the initial CDF has nonzero probability for every symbol, and all the `mixin_CDFs` do as well, then symbol probabilities will never drop down to zero as a result of this mixing, and we always maintain a constant total throughout. What this variant doesn’t handle very well is round-off; there are better approaches (although you do want to make sure that something like the spacing lemma from the mixing article still holds), but this variant is decent, and it’s certainly the most satisfying because it’s so breathtakingly simple.
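To see those properties in action, here is a self-contained sketch with one possible choice of mix-in models. The alphabet size, total, and the “every symbol gets a count of 1, the observed symbol gets the rest” mix-in are assumptions for illustration. Like the snippet above, it relies on `>>` being an arithmetic (flooring) shift for negative values, which is what mainstream compilers do but is implementation-defined in C.

```c
enum { NSYMS = 16, RATE = 5, TOTAL = 1 << 12 };

static int cdf[NSYMS + 1];
static int mixin_cdfs[NSYMS][NSYMS + 1];

static void init_models(void)
{
    // Start from a uniform distribution.
    for (int i = 0; i <= NSYMS; ++i)
        cdf[i] = i * (TOTAL / NSYMS);

    // Mix-in model for symbol s: every symbol gets a count of 1, and the
    // remaining TOTAL - NSYMS counts all go to s.
    for (int s = 0; s < NSYMS; ++s)
        for (int i = 0; i <= NSYMS; ++i)
            mixin_cdfs[s][i] = i + (i > s ? TOTAL - NSYMS : 0);
}

static void adapt(int sym)
{
    const int *mixin = mixin_cdfs[sym];
    for (int i = 1; i < NSYMS; ++i)
        cdf[i] += (mixin[i] - cdf[i]) >> RATE; // assumes arithmetic shift
}

// Smallest symbol count in the current model.
static int min_count(void)
{
    int m = TOTAL;
    for (int i = 0; i < NSYMS; ++i)
        if (cdf[i + 1] - cdf[i] < m)
            m = cdf[i + 1] - cdf[i];
    return m;
}
```

No matter what symbol sequence is fed to `adapt`, `cdf[0]` stays 0, `cdf[NSYMS]` stays `TOTAL`, and `min_count()` never drops below 1.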

Note that the computations for each *i* are completely independent, so this is data-parallel and can be written using SIMD instructions, which is pretty important to make larger models practical. This makes the update step fast; you do still need to be careful in implementing the other half of an arithmetic coding model, the symbol lookup step.
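As a scalar stand-in for the lookup step, one simple formulation counts how many interior CDF entries are less than or equal to the coded value. This is only a sketch with a made-up function name; it is written this way because every comparison is independent, which is the shape a SIMD version would presumably want.

```c
// Find sym with cdf[sym] <= x < cdf[sym + 1], given 0 <= x < cdf[nsyms]
// and a monotonically non-decreasing cdf with nsyms + 1 entries.
static int lookup_symbol(const int *cdf, int nsyms, int x)
{
    int sym = 0;
    for (int i = 1; i < nsyms; ++i)
        sym += (cdf[i] <= x); // each comparison is independent of the others
    return sym;
}
```

For a 4-symbol model with CDF {0, 4, 5, 12, 16}, values 0 through 3 map to symbol 0, 4 maps to symbol 1, 5 through 11 to symbol 2, and 12 through 15 to symbol 3.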

### Building from here

This is a pretty neat idea, and a very cool new building block to play with, but it’s not a working compressor yet. So what can we do with this, and where is it interesting?

Well. For a long time, in arithmetic coding, there were basically two choices. You could use larger models with very high per-symbol overhead: one or more divisions per symbol decoded, a binary search to locate the right symbol… pretty expensive stuff. Expensive enough that you want to make sure you code very few symbols this way. Adaptive models, using something like Fenwick trees, were even worse, and limited in the set of update rules they could support. In practice, truly adaptive large-alphabet models were exceedingly rare; if you needed some adaptation, you were more likely to use a deferred summation model (keep histogramming data and periodically rebuild your CDF from the last histogram) than true per-symbol adaptation, because it was so damn expensive.

At the other extreme, you had binary models. Binary arithmetic coders have *much* lower per-symbol cost than their non-binary cousins, and good binary models (like the fixed-point Howard/Vitter model we looked at) are likewise quite cheap to update. But the problem is that with a binary coder, you end up sending *many* more symbols. For example, in LZMA, a single 256-symbol alphabet model gets replaced with 8 binary coding steps. While the individual binary coding steps are much faster, they’re typically not *so* much faster that you can do 8× as many and still come out much faster. There are ways to reduce the number of symbols processed. For example, instead of a balanced binary tree binarization, you can build a Huffman tree and use that instead… yes, using a Huffman tree to do arithmetic coding, I know. This absolutely works, and it is faster, but it’s also a bit of a mess and fundamentally very unsatisfying.

But now we have options in the middle: with rANS giving us larger-alphabet arithmetic decoders that are merely slightly slower than binary arithmetic coding, we can do something better. The models described above are *not* directly useful on a 256-symbol alphabet. No amount of SIMD will make updating 256 probability estimates per input symbol fast. But they *are* useful on medium-sized alphabets. The LZNA in “Oodle LZNA” stands for “LZ-nibbled-ANS”, because the literals are split into nibbles: instead of having a single adaptive 256-symbol model, or a binary tree of adaptive binary models, LZNA has much flatter trees of medium-sized models. Instead of having one glacially slow decode step per symbol, or eight faster steps per symbol, we can decode an 8-bit symbol in two still-quite-fast steps. The right decomposition depends on the data, and of course on the relative speeds of the decoding steps. But hey, now we have a whole continuum to play with, rather than just two equally unsavory points at the extreme ends!
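To illustrate what coding a byte in two nibble-sized steps can look like, here is a toy sketch of the bookkeeping. The layout (one 16-symbol model for the high nibble, plus 16 models for the low nibble selected by the high nibble) is an assumption chosen because it mirrors a flattened binary tree; it is not a description of how LZNA actually arranges its models, and all names are hypothetical.

```c
#include <stdint.h>

// One model for the high nibble, 16 for the low nibble (illustrative layout).
enum { NIBBLE_MODELS = 1 + 16 };

static int high_nibble(uint8_t byte) { return byte >> 4; }
static int low_nibble(uint8_t byte)  { return byte & 15; }

// Which 16-symbol model each of the two coding steps would use.
static int high_model_index(void)  { return 0; }
static int low_model_index(int hi) { return 1 + hi; }

// Decoder side: reassemble the byte from the two decoded nibbles.
static uint8_t join_nibbles(int hi, int lo)
{
    return (uint8_t)((hi << 4) | lo);
}
```

An encoder would code `high_nibble(b)` with model 0 and then `low_nibble(b)` with model `low_model_index(high_nibble(b))`; the decoder performs the same two steps and calls `join_nibbles`.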

This is not a “just paste it into your code” solution; there’s still quite a bit of art in constructing useful models out of this, several fun tricks in actually writing a fast decoder for this, some subtle variations on how to do the updates. And we haven’t even begun to properly play with different mix-in models and “integrator” types yet.

Charles’ LZNA is the first compressor we’ve shipped that uses these ideas. It’s not gonna be the last.

**UPDATE**: As pointed out by commenter “derf_” on Hacker News, the non-binary context modeling in the current Daala draft apparently describes the exact same type of model. Details here (section 2.2, “Non-binary context modeling”, variant 3). This apparently was proposed in 2012. We (RAD) weren’t aware of this earlier (pity, that would’ve saved some time); very cool, and thanks derf for the link!

The triangular numbers

$$T_k = \frac{k(k+1)}{2}$$

are a standard sequence used in quadratic probing of open hash tables. For example, they’re used in Google’s `sparse_hash` and `dense_hash`, generally considered to be very competitive hash table implementations.

You can find lots of places on the web mentioning that the resulting probe sequence will visit every element of a power-of-2 sized hash table exactly once; more precisely, the function f_{m}(k) = T_{k} mod 2^{m} is a permutation on { 0, 1, …, 2^{m}-1 }. But it’s pretty hard to find a proof; everybody seems to refer back to Knuth, and in The Art of Computer Programming, Volume 3, Section 6.4, the proof appears as an exercise (number 20 in the Second Edition).

If you want to do this exercise yourself, please stop reading now; spoilers ahead!

Anyway, turns out I arrived at the solution very differently from Knuth. His proof is much shorter and slicker, but pretty “magic” and unmotivated, so let’s take the scenic route! The first step is to look at a bunch of small values and see if we can spot any patterns.

| k | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T_{k} | 0 | 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 | 45 | 55 | 66 |
| T_{k} mod 8 | 0 | 1 | 3 | 6 | 2 | 7 | 5 | 4 | 4 | 5 | 7 | 2 |
| T_{k} mod 4 | 0 | 1 | 3 | 2 | 2 | 3 | 1 | 0 | 0 | 1 | 3 | 2 |
| T_{k} mod 2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |

And indeed, there are several patterns that might be useful: looking at the row for “mod 2”, we see that it seems to be just the values 0, 1, 1, 0 repeating, and that sequence itself is just 0, 1 followed by its reverse. The row for “mod 4” likewise looks like it’s just alternating between normal and reverse copies of 0, 1, 3, 2, and the “mod 8” row certainly looks like it might be following a similar pattern. Can we prove this?

First, the mirroring suggests that it might be worthwhile to look at the differences

$$T_{2n-1-k} - T_k = \frac{(2n-1-k)(2n-k)}{2} - \frac{k(k+1)}{2} = 2n(n-k) - n$$

Both terms are multiples of n, so we have T_{2n-1-k} ≡ T_{k} (mod n), which proves the mirroring (incidentally, for any n, not just powers of 2). Furthermore, the first term is a multiple of 2n too, and the second term *almost* is: we have T_{2n-1-k} ≡ T_{k} - n (mod 2n). This will come in handy soon.

To prove that T_{k} mod n is 2n-periodic, first note the standard identity for triangular numbers

$$T_{a+b} = T_a + T_b + ab$$

in particular, we have

$$T_{k+2n} = T_k + T_{2n} + 2kn = T_k + n(2n+1) + 2kn \equiv T_k \pmod{n}$$

Again, this is for arbitrary n ≥ 1. So far, we’ve proven that for arbitrary positive integer n, the sequence T_{k} mod n is 2n-periodic, with the second half being a mirrored copy of the first half. It turns out that we can wrangle this one fact into a complete proof of the f_{m}(k) = T_{k} mod 2^{m}, 0 ≤ k < 2^{m}, being a permutation fairly easily by using induction:

**Basis** (m=0): f_{0}(0) = T_{0} mod 1 = 0, and the function is a permutation on { 0 }.

**Inductive step**: Suppose f_{m} is a permutation on { 0, …, 2^{m}-1 }. Then the values of f_{m+1}(k) = T_{k} mod 2^{m+1} must surely be pairwise distinct for 0 ≤ k < 2^{m}, since they’re already distinct mod 2^{m}. That means we only have to deal with the second half. But above (taking n=2^{m}), we proved that

$$T_{2n-1-k} \equiv T_k - n \pmod{2n}$$

and letting k run from 0 to 2^{m}-1, we notice that mod 2^{m}, these values are congruent to the values in the first half, but mod 2^{m+1}, they differ by an additional term of -n. This implies that the values in the second half are pairwise distinct both from the first half and from each other. This proves that f_{m+1} is injective, and because it’s mapping the 2^{m+1}-element set onto itself, it must be a permutation.

Which, by mathematical induction, proves our claim.
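If you’d rather see it than prove it, the claim is easy to check by brute force. The sketch below (the function name is made up) walks the probe sequence incrementally, using T_{k+1} = T_{k} + (k+1), and reports whether the first `size` probes hit distinct slots.

```c
#include <stdlib.h>

// Returns 1 if probing slots T_k mod size for k = 0..size-1 visits every
// slot of a table with `size` entries exactly once, and 0 otherwise.
static int probe_visits_all(unsigned size)
{
    unsigned char *seen = calloc(size, 1);
    if (!seen)
        return 0;

    unsigned idx = 0; // T_0 mod size
    int ok = 1;
    for (unsigned k = 0; k < size; ++k) {
        if (seen[idx]) { ok = 0; break; } // slot visited twice
        seen[idx] = 1;
        idx = (idx + k + 1) % size;       // T_{k+1} = T_k + (k + 1)
    }
    free(seen);
    return ok;
}
```

This succeeds for power-of-2 sizes such as 8 or 1024, and fails for a size like 6, where T_{3} = 6 already wraps back to slot 0. In an actual hash table the sequence starts at the hash value rather than at 0, which just shifts the permutation.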

Let’s talk a bit about probability distributions in data compression; more specifically, about a problem in dealing with multi-symbol alphabets during adaptation.

Distributions over a binary alphabet are easy to mix (in the “context mixing” sense): they can be represented by a single scalar (I’ll use the probability of the symbol being a ‘1’; the probability of ‘0’ works too), which is easily combined with other scalars. In practice, these probabilities are represented as fixed-size integers in a specific range, typically one whose size is a power of 2. Often, p=0 (“it’s certainly a 0”) and p=1 (certain 1) are excluded, because they only allow us to “encode” (using zero bits!) one of the two symbols.

Multi-symbol alphabets are trickier. A binary alphabet can be described using one probability p; the probability of the other symbol must be 1-p. Larger alphabets involve multiple probabilities that have to be kept consistent with each other. Say we have an n-symbol alphabet. A probability distribution for such an alphabet is, conventionally, a vector p ∈ ℝ^{n} such that p_{k} ≥ 0 for all 1 ≤ k ≤ n and furthermore Σ_{k} p_{k} = 1. In practice, we will again represent them using integers. Rather than dealing with the hassle of having fractions everywhere, let’s just define our “finite-precision distributions” as integer vectors p ∈ ℤ^{n}, again with p_{k} ≥ 0 for all k, and Σ_{k} p_{k} = T where T is the total. Same as with the binary case, we would like T to be a constant power of 2. This lets us use cheaper variants of conventional arithmetic coders, as well as rANS. Unlike the binary case, maintaining this invariant takes explicit work, and it’s not cheap.

In practice, the common choice is to either drop the requirement that T be constant (most adaptive multi-symbol models for arithmetic coding take that route), or switch to a semi-adaptive model such as deferred summation that performs model updates in batches, which can either pick batches such that the constant sum is automatically maintained, or at least amortize the work spent enforcing it. A last option is using a “sliding window” style model that just uses a history of character counts over the last N input symbols. This makes it easy to maintain the constant sum, but it’s pretty limited.

A second problem is mixing – both as an explicit operation for things like context mixing, and as a building block for model updates more sophisticated than simply incrementing a counter. With binary models, this is easy – just a weighted sum of probabilities. Multi-symbol models with a constant sum are harder: for example, say we want to average the two 3-symbol models p = (3, 3, 2) and q = (1, 6, 1) while maintaining a total of T=8. Computing (p+q)/2 = (2, 4.5, 1.5) illustrates the problem: the two non-integer counts have to get rounded back to integers *somehow*. But if we round both up, we end up at a total of 9; and if we round both down, we end up with a total of 7. To achieve our desired total of 8, we need to round one up, and one down – but it’s not exactly obvious how we should choose on any given symbol! In short, maintaining a constant total while mixing in this form proves to be tricky.

However, I recently realized that these problems all disappear when we work with the cumulative distribution function (CDF) instead of the raw symbol counts.

### Working with cumulative counts

For a discrete probability distribution p with total T, define the corresponding cumulative probabilities:

$$P_k = \sum_{j=1}^{k} p_j, \qquad 0 \le k \le n$$

P_{0} is an empty sum, so P_{0}=0. On the other end, P_{n} = T, since that’s just the sum over all values in p, and we know the total is T. For the elements in between, p_{k} ≥ 0 implies that P_{k} ≥ P_{k-1}, i.e. P is monotonically non-decreasing. Conversely, given any P with these three properties, we can compute the differences between adjacent elements and determine the underlying distribution p.
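In code, going back and forth between the two representations is just a prefix sum and a first difference. This is a minimal sketch with made-up function names; note that the count array is 0-based here, so `p[k]` in the code corresponds to p_{k+1} in the text, while `P` has n+1 entries indexed exactly as above.

```c
// P[0] = 0 and P[k] = p[0] + ... + p[k-1]; P has n + 1 entries.
static void counts_to_cdf(const int *p, int n, int *P)
{
    P[0] = 0;
    for (int k = 0; k < n; ++k)
        P[k + 1] = P[k] + p[k];
}

// Recover the counts as differences between adjacent CDF entries.
static void cdf_to_counts(const int *P, int n, int *p)
{
    for (int k = 0; k < n; ++k)
        p[k] = P[k + 1] - P[k];
}
```

For example, the counts (3, 3, 2) with total 8 turn into the CDF (0, 3, 6, 8), and differencing that CDF gives the counts back.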

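And it turns out that while mixing quantized symbol probabilities is problematic, mixing cumulative distribution functions pretty much just works. To wit: suppose we are given two CDFs P and Q with the same total T, and a blending factor λ ∈ [0,1], then define:

$$\tilde{R}_k = (1-\lambda) P_k + \lambda Q_k$$

Note that because summation is linear, we have

$$\tilde{R}_k = (1-\lambda) \sum_{j=1}^{k} p_j + \lambda \sum_{j=1}^{k} q_j = \sum_{j=1}^{k} \left( (1-\lambda) p_j + \lambda q_j \right)$$

so this is indeed the CDF corresponding to a blended model between p and q. However, we’re still dealing with real values here; we need integers, and the easiest way to get there is to just truncate, possibly with some rounding bias 0 ≤ b < 1:

$$R_k = \lfloor \tilde{R}_k + b \rfloor = \lfloor (1-\lambda) P_k + \lambda Q_k + b \rfloor$$

It turns out that this just works, but this requires proof, and to get there it helps to prove a little lemma first.

**Spacing lemma**: Suppose that for some j, P_{j} - P_{j-1} ≥ m and Q_{j} - Q_{j-1} ≥ m, where m is an arbitrary integer. Then we also have R_{j} - R_{j-1} ≥ m.

**Proof**: We start by noting that

$$\tilde{R}_j + b = (1-\lambda) P_j + \lambda Q_j + b \ge (1-\lambda)(P_{j-1} + m) + \lambda (Q_{j-1} + m) + b = \tilde{R}_{j-1} + b + m$$

and since m is an integer, we have

$$R_j = \lfloor \tilde{R}_j + b \rfloor \ge \lfloor \tilde{R}_{j-1} + b + m \rfloor = \lfloor \tilde{R}_{j-1} + b \rfloor + m = R_{j-1} + m$$

which was our claim.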

Using the lemma, it’s now easy to show that R is a valid CDF with total T: we need to establish that R_{0}=0, R_{n}=T, and show that R is monotonic. The first two are easy, since the P and Q we’re mixing both agree at these points and 0≤b<1.

As for monotonicity, note that R_{j} ≥ R_{j-1} is the same as R_{j} - R_{j-1} ≥ 0 (and the same for P and Q). Therefore, we can apply the spacing lemma with m=0 for 1≤j≤n: monotonicity of P and Q implies that R will be monotonic too. And that’s it: this proves that R is a valid CDF for a distribution with total T. If we want the individual symbol frequencies, we can recover them as r_{k} = R_{k} - R_{k-1}.

Note that the spacing lemma is a fair bit stronger than just establishing monotonicity. Most importantly, if p_{k}≥m and q_{k}≥m, the spacing lemma tells us that r_{k}≥m – these properties carry over to the blended distribution, as one would expect! I make note of this because this kind of invariant is often violated by approximate techniques that first round the probabilities, then fudge them to make the total come out right.
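Here is a small integer sketch of the blending step, with the blending factor given as a fraction `num/den` and a rounding bias of 0. The function name and the fraction-based interface are assumptions for illustration; the code also assumes non-negative CDF entries that are small enough for the products to fit in a `long long`.

```c
// R[k] = floor(((den - num) * P[k] + num * Q[k]) / den) for k = 0..n,
// i.e. a blend with factor num/den (0 <= num <= den) and rounding bias 0.
// All operands are non-negative, so integer division is a floor here.
static void blend_cdfs(const int *P, const int *Q, int n,
                       int num, int den, int *R)
{
    for (int k = 0; k <= n; ++k) {
        long long mixed = (long long)(den - num) * P[k]
                        + (long long)num * Q[k];
        R[k] = (int)(mixed / den);
    }
}
```

As a worked example, averaging the counts (3, 3, 2) and (1, 6, 1), i.e. the CDFs (0, 3, 6, 8) and (0, 1, 7, 8) with `num/den` = 1/2, yields the CDF (0, 2, 6, 8). That corresponds to counts (2, 4, 2): the total is still 8, and every symbol keeps a count of at least 1 because both inputs had one.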

### Conclusion

This gives us a way to blend between two probability distributions while maintaining a constant total, without having to deal with dodgy ad-hoc rounding decisions. This requires working in CDF form, but that’s pretty common for arithmetic coding models anyway. As long as the mixing computation is done exactly (which is straightforward when working in integers), the total will remain constant.

I only described linear blending, but the construction generalizes in the obvious way to arbitrary convex combinations. It is thus directly applicable to mixing more than two distributions while only rounding once. Furthermore, given the CDFs of the input models, the corresponding interval for a single symbol can be found using just two mixing operations to find the two interval end points; there’s no need to compute the entire CDF for the mixed model. This is in contrast to direct mixing of the symbol probabilities, which in general needs to look at all symbols to either determine the total (if a non-constant-T approach is used) or perform the adjustments to make the total come out to T.
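The “two mixing operations per symbol” point can be sketched as follows. The names and the `num/den` interface are illustrative assumptions, and the same caveats about non-negative inputs and overflow apply as for any integer blend of this kind.

```c
typedef struct { int lo, hi; } interval; // symbol occupies [lo, hi)

// Blend two CDF entries with factor num/den, rounding down.
static int mix_point(int a, int b, int num, int den)
{
    return (int)(((long long)(den - num) * a + (long long)num * b) / den);
}

// Coding interval of symbol `sym` (0-based) in the blended model, computed
// from only two entries of each input CDF.
static interval mixed_interval(const int *P, const int *Q, int sym,
                               int num, int den)
{
    interval iv;
    iv.lo = mix_point(P[sym], Q[sym], num, den);
    iv.hi = mix_point(P[sym + 1], Q[sym + 1], num, den);
    return iv;
}
```

With the CDFs (0, 3, 6, 8) and (0, 1, 7, 8) and a blending factor of 1/2, the middle symbol gets the interval [2, 6) without ever touching the entries for the other symbols.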

Furthermore, the construction shows that probability distributions with sum T are closed under “rounded convex combinations” (as used in the proof). The spacing lemma implies that the same is true for a multitude of more restricted distributions: for example, the set of probability distributions where each symbol has a nonzero probability (corresponding to distributions with monotonically increasing, instead of merely non-decreasing, CDFs) is also closed under convex combinations in this sense. This is a non-obvious result, to me anyway.

One application for this (as frequently noted) is context mixing of multi-symbol distributions. Another is as a building block in adaptive model updates that’s a good deal more versatile than the obvious “steal the count from one symbol, add it to another” update step.

I have no idea whether this is new or not (probably not); I certainly hadn’t seen it before, and neither had anyone else at RAD. Nor do I know whether this will be useful to anyone else, but it seemed worth writing up!

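Suppose we want to calculate a product between a 4×4 matrix M and a 4-element vector v:

$$r = Mv = \begin{pmatrix} a_x & b_x & c_x & d_x \\ a_y & b_y & c_y & d_y \\ a_z & b_z & c_z & d_z \\ a_w & b_w & c_w & d_w \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ v_z \\ v_w \end{pmatrix}$$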

The standard approach to computing Mv using SIMD instructions boils down to taking a linear combination of the four column vectors a, b, c and d, using standard SIMD componentwise addition, multiplication and broadcast shuffles.

```
// Given M as its four constituent column vectors a, b, c, d,
// compute r=M*v.
r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d;
```

This computes the vector-matrix product using four shuffles, four (SIMD) multiplies, and three additions. This is all bog-standard. And if the ISA we’re working on has free broadcast swizzles (ARM NEON for example), we’re done. But if not, can we do better? Certainly if we know things about M or v: if M has a special structure, or some components of v are known to be always 0, 1 or -1, chances are good we can save a bit of work (whether it makes a difference is another matter). But what if M and v are completely general, and all we know is that we want to transform a lot of vectors with a single M? If v is either given as or returned in SoA form (structure-of-arrays), we can reduce the number of per-vector shuffles greatly if we’re willing to preprocess M a bit and have enough registers available. But let’s say we’re not doing that either: our input v is in packed form, and we want the results packed too. Is there anything we can do?

There’s no way to reduce the number of multiplies or additions in general, but we can get rid of exactly one shuffle per vector, if we’re willing to rearrange M a bit. The trick is to realize that we’re using each of v.x, v.y, v.z, and v.w exactly four times, and that the computations we’re doing (a bunch of component-wise multiplies and additions) are commutative and associative, so we can reorder them, in exact arithmetic anyway. (This type of computation is usually done in floating point, where we don’t actually have associativity, but I’m going to gloss over this.)

Let’s look at our first set of products, `v.xxxx * a`. We’re just walking down a column of M, multiplying each element we see by v_{x}. What if we walk in a different direction? Going along horizontals turns out to be boring (it’s essentially the same, just transposed), but diagonals of M are interesting, the main diagonal in particular.

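So here’s the punch line: we form four new vectors by walking along diagonals (with wrap-around) as follows:

    e = (a.x, b.y, c.z, d.w)
    f = (b.x, c.y, d.z, a.w)
    g = (c.x, d.y, a.z, b.w)
    h = (d.x, a.y, b.z, c.w)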

Phrasing the matrix multiply in terms of these four vectors, we get:

r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h;

Same number of multiplies and adds, but one shuffle per vector less (because the swizzle pattern for v in the first term is `xyzw`, which is the natural ordering of v). Also note that forming e, f, g, and h given M in column vector form is also relatively cheap: it’s a matrix transposition with a few post-swizzles to implement the cyclic rotations. If you have M as row vectors (for example because it’s stored in row-major order), it’s even cheaper.
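To make the bookkeeping concrete, here is a minimal scalar C sketch (no actual SIMD) that computes the product both ways so the two can be compared. The `Vec4` type and the function names are made up for this illustration. The layout of e, f, g and h is read off the formula above: component i of the k-th diagonal vector is component i of column (i + k) mod 4.

```c
/* A 4-vector stored as {x, y, z, w}. */
typedef struct { float c[4]; } Vec4;

/* Reference: r = v.xxxx*a + v.yyyy*b + v.zzzz*c + v.wwww*d,
   with col[0..3] = a, b, c, d. */
static Vec4 mul_columns(const Vec4 col[4], Vec4 v)
{
    Vec4 r = {{0, 0, 0, 0}};
    for (int j = 0; j < 4; j++)       /* column index */
        for (int i = 0; i < 4; i++)   /* component index */
            r.c[i] += v.c[j] * col[j].c[i];
    return r;
}

/* Preprocessing step: diag[0..3] = e, f, g, h, where
   diag[k].c[i] = col[(i + k) % 4].c[i] (diagonals with wrap-around). */
static void build_diagonals(const Vec4 col[4], Vec4 diag[4])
{
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 4; i++)
            diag[k].c[i] = col[(i + k) % 4].c[i];
}

/* r = v*e + v.yzwx*f + v.zwxy*g + v.wxyz*h.
   v.c[(i + k) % 4] is v rotated by k components; k == 0 needs no shuffle. */
static Vec4 mul_diagonals(const Vec4 diag[4], Vec4 v)
{
    Vec4 r = {{0, 0, 0, 0}};
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 4; i++)
            r.c[i] += v.c[(i + k) % 4] * diag[k].c[i];
    return r;
}
```

In a real SIMD version each inner loop over `i` would be one vector multiply-add, and `build_diagonals` would run once per matrix rather than once per vector. With general floating-point inputs the two functions may differ in the last bits, since the terms are summed in a different order.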

So: multiplying a packed 4-vector with a constant 4×4-matrix takes one shuffle less than the standard approach, if we’re willing to do some preprocessing on M (or store our matrices in a weird layout to begin with). Does this matter? It depends. On current desktop x86 cores, it’s pretty marginal, because SIMD shuffles can execute in parallel (during the same cycle) with additions and multiplications. On older cores with fewer execution resources, on in-order SIMD CPUs, and on low-power parts, it can definitely help though.

For what it’s worth: if your 4D vectors come from graphics or physics workloads and are actually homogeneous 3-vectors with a constant w=1 and no projective transforms anywhere in sight, you can exploit that structure explicitly for higher gains than this. But I ran into this with a DSP workload (with v just being a vector of 4 arbitrary samples), and in that case it’s definitely useful to know, especially since anything convolution-related tends to have highly diagonal (Toeplitz, to be precise) structure to begin with.

This is precisely what the last post was about. So nothing new. This is just my original mail on the topic with some more details that might be interesting and/or amusing to a few people. :)

Date: Wed, 05 Feb 2014 16:43:36 -0800

From: Fabian Giesen

Subject: Alias Huffman coding.

Huffman <= ANS (strict subset)

(namely, power-of-2 frequencies)

We can take any discrete probability distribution of N events and use the Alias method to construct a O(N)-entry table that allows us to sample from that distribution in O(1) time.

We can apply that same technique to e.g. rANS coding to map from (x mod M) to “what symbol is x”. We already have that.

Ergo, we can construct a Huffman-esque coder that can decode symbols using a single table lookup, where the table size only depends on N_sym and not the code lengths. (And the time to build said table given the code lengths is linear in N_sym too).

Unlike regular/canonical Huffman codes, these can have multiple unconnected ranges for the same symbol, so you still need to deal with the range remapping (the “slot_adjust” thing) you have in Alias table ANS; basically, the only difference ends up being that you have a shift instead of a multiply by the frequency.

But there’s still some advantages in that a few things simplify; for example, there’s no need (or advantage) to using a L that’s larger than M. An obvious candidate is choosing L=M=B so that your Huffman codes are length-limited to half your word size and you never do IO in smaller chunks than that.

Okay. So where does that get us? Well, something like the MSB alias rANS decoder, with a shift instead of a multiply, really:

    // decoder state
    // suppose max_code_len = 16
    U32 x;
    U16 const * input_ptr;

    U32 const m = (1 << max_code_len) - 1;
    U32 const bucket_shift = max_code_len - log2_nbuckets;

    // decode:
    U32 xm = x & m;
    U32 xm_shifted = xm >> bucket_shift;
    U32 bucket = xm_shifted * 2;
    if (xm < hufftab_divider[xm_shifted])
        bucket++;

    x = (x & ~m) >> hufftab_shift[bucket];
    x += xm - hufftab_adjust[bucket];

    if (x < (1<<16))
        x = (x << 16) | *input_ptr++;

    return hufftab_symbol[bucket];

So with a hypothetical compiler that can figure out the adc-for-bucket thing, we’d get something like

    ; x in eax, input_ptr in esi
    movzx    edx, ax     ; x & m (for bucket id)
    shr      edx, 8      ; edx = xm_shifted
    movzx    ebx, ax     ; ebx = xm
    cmp      ax, [hufftab_divider + edx*2]
    adc      edx, edx    ; edx = bucket
    xor      eax, ebx    ; eax = x & ~m
    mov      cl, [hufftab_shift + edx]
    shr      eax, cl
    movzx    ecx, word [hufftab_adjust + edx*2]
    add      eax, ebx    ; x += xm
    movzx    edx, byte [hufftab_symbol + edx] ; symbol
    sub      eax, ecx    ; x -= adjust[bucket]
    cmp      eax, (1<<16)
    jae      done
    shl      eax, 16
    movzx    ecx, word [esi]
    add      esi, 2
    or       eax, ecx
    done:
    ; new x in eax, new input_ptr in esi
    ; symbol in edx

which is actually pretty damn nice considering that’s both Huffman decode and bit buffer rolled into one. Especially so since it handles all cases – there’s no extra conditions and no cases (rare though they might be) where you have to grab more bits and look into another table. Bonus points because it has an obvious variant that’s completely branch-free:

    ; same as before up until...
    sub      eax, ecx    ; x -= adjust[bucket]
    movzx    ecx, word [esi]
    mov      ebx, eax
    shl      ebx, 16
    or       ebx, ecx
    lea      edi, [esi+2]
    cmp      eax, (1<<16)
    cmovb    eax, ebx
    cmovb    esi, edi

Okay, all that’s nice and everything, but for x86 it’s nothing we haven’t seen before. I have a punch line though: the same thing works on PPC – the adc thing and “sbb reg, reg” both have equivalents, so you can do branch-free computation based on some carry flag easily.

BUT, couple subtle points:

- this thing has a bunch of `(x & foo) >> bar` (left-shift or right-shift) kind of things, which map *really really* well to PPC because there’s rlwinm / rlwimi.
- The in-order PPCs hate variable shifts (something like 12+ cycles microcoded). Well, guess what, everything we multiply with is a small per-symbol constant, so we can just store (1 << len) per symbol and use `mullw`. That’s 9 cycles non-pipelined (and causes a stall after issue), but still, better than the microcode.

But… wait a second. If this ends up faster than your usual Huffman, and there’s a decent chance that it might (branch-free and all), the fastest “Huffman” decoder on in-order PPC would, in fact, be a full-blown arithmetic decoder. Which amuses me no end.

    # NOTE: LSB of "bucket" complemented compared to x86
    # r3 = x, r4 = input ptr
    # r20 = &tab_divider[0]
    # r21 = &tab_symbol[0]
    # r22 = &tab_mult[0]
    # r23 = &tab_adjust[0]
    rlwinm   r5, r3, 24, 23, 30    # r5 = (xm >> bucket_shift) * 2
    rlwinm   r6, r3, 0, 16, 31     # r6 = xm
    lhzx     r7, r20, r5           # r7 = tab_divider[xm_shifted]
    srwi     r8, r3, 16            # r8 = x >> log2(m)
    subfc    r9, r7, r6            # (r9 ignored but sets carry)
    lhz      r10, 0(r4)            # *input_ptr
    addze    r5, r5                # r5 = bucket
    lbzx     r9, r21, r5           # r9 = symbol
    add      r5, r5, r5            # r5 = bucket word offs
    lhzx     r7, r22, r5           # r7 = mult
    li       r6, 0x10000           # r6 = op for sub later
    lhzx     r5, r23, r5           # r5 = adjust
    mullw    r7, r7, r8            # r7 = mult * (x >> m)
    subf     r5, r5, r6            # r5 = xm - tab_adjust[bucket]
    add      r5, r5, r7            # r5 = new x
    subfc    r6, r6, r5            # sets carry iff (x >= (1<<16))
    rlwimi   r10, r5, 16, 0, 16    # r10 = (x << 16) | *input_ptr
    subfe    r6, r6, r6            # ~0 if (x < (1<<16)), 0 otherwise
    slwi     r7, r6, 1             # -2 if (x < (1<<16)), 0 otherwise
    and      r10, r10, r6
    andc     r5, r5, r6
    subf     r4, r7, r4            # input_ptr++ if (x < (1<<16))
    or       r5, r5, r10           # new x

That should be a complete alias rANS decoder assuming M=L=b=2^{16}.

-Fabian

Applying the rANS-with-alias-table construction from “rANS with static probability distributions” to Huffman codes has some interesting results. In a sense, there’s nothing new here once you have these two ingredients. I remember mentioning this idea in a mail when I wrote ryg_rans, but it didn’t seem worth writing an article about. I’ve changed my mind on that: while the restriction to Huffman-like code lengths is strictly weaker than “proper” arithmetic coding, we do get a pretty interesting variant on table/state machine-style “Huffman” decoders out of the deal. So let’s start with a description of how they usually operate and work our way to the alias rANS variant.

### Table-based Huffman decoders

Conceptually, a Huffman decoder starts from the root, then reads one bit at a time, descending into the sub-tree denoted by that bit. If that sub-tree is a leaf node, return the corresponding symbol. Otherwise, keep reading more bits and descending into smaller and smaller sub-trees until you do hit a leaf node. That’s all there is to it.

Except, of course, no serious implementation of Huffman decoding works that way. Processing the input one bit at a time is just a lot of overhead for very little useful work done. Instead, normal implementations effectively look ahead by a bunch of bits and table-drive the whole thing. Peek ahead by *k* bits, say *k=10*. You also prepare a table with 2^{k} entries that encodes what the one-bit-at-a-time Huffman decoder would do when faced with those k input bits:

    struct TableEntry {
        int num_bits; // Number of bits consumed
        int symbol;   // Index of decoded symbol
    };

If it reaches a leaf node, you record the ID of the symbol it arrived at, and how many input bits were actually consumed to get there (which can be less than *k*). If not, the next symbol takes more than *k* bits, and you need a back-up plan. Set `num_bits` to 0 (or some other value that’s not a valid code length) and use a different strategy to decode the next symbol: typically, you either chain to another (secondary) table or fall back to a slower (one-bit-at-a-time or similar) Huffman decoder with no length limit. Since Huffman coding only assigns long codes to rare symbols – that is, after all, the whole point – it doesn’t tend to matter much; with well-chosen *k* (typically, slightly larger than the log2 of the size of your symbol alphabet), the “long symbol” case is pretty rare.

So you get an overall decoder that looks like this:

    while (!done) {
        // Read next k bits without advancing the cursor
        int bits = peekBits(k);

        // Decode using our table
        int nbits = table[bits].num_bits;
        if (nbits != 0) { // Symbol
            *out++ = table[bits].symbol;
            consumeBits(nbits);
        } else {
            // Fall-back path for long symbols here!
        }
    }

This ends up particularly nice combined with canonical Huffman codes, and some variant of it is used in most widely deployed Huffman decoders. All of this is classic and much has been written about it elsewhere. If any of this is news to you, I recommend Moffat and Turpin’s 1997 paper “On the implementation of minimum redundancy prefix codes”. I’m gonna assume it’s not and move on.
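For reference, filling such a table can be sketched as follows. This is a minimal illustration that assumes MSB-first codewords handed in as explicit `code`/`len` arrays (both names, and `build_table` itself, are made up for this example); a real decoder would typically derive the codewords from canonical code lengths instead.

```c
#include <stdint.h>

typedef struct {
    int num_bits;  /* Number of bits consumed; 0 = code longer than k bits */
    int symbol;    /* Index of decoded symbol */
} TableEntry;

/* Fill a 2^k-entry lookahead table for an MSB-first prefix code.
   code[s] is the len[s]-bit codeword of symbol s; len[s] == 0 marks an
   unused symbol. */
static void build_table(TableEntry *table, int k,
                        const uint32_t *code, const int *len, int num_syms)
{
    /* Start with every entry marked "needs the fallback path". */
    for (uint32_t i = 0; i < (1u << k); i++) {
        table[i].num_bits = 0;
        table[i].symbol = -1;
    }

    for (int s = 0; s < num_syms; s++) {
        if (len[s] == 0 || len[s] > k)
            continue;  /* long codes keep num_bits == 0 */

        /* Every k-bit window that starts with this codeword decodes to s,
           no matter what the remaining k - len[s] bits are. */
        uint32_t first = code[s] << (k - len[s]);
        uint32_t count = 1u << (k - len[s]);
        for (uint32_t j = 0; j < count; j++) {
            table[first + j].num_bits = len[s];
            table[first + j].symbol = s;
        }
    }
}
```

With the code 0, 10, 110, 111 and k = 3, symbol 0 ends up owning half of the table (entries 0 to 3), which is the usual trade: short codes are cheap to decode but expensive in table space.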

### State machines

For the next step, suppose we fix *k* to be the length of our longest codeword. Anything smaller and we need to deal with the special cases just discussed; anything larger is pointless. A table like the one above then tells us what to do for every possible combination of *k* input bits, and when we turn the *k*-bit lookahead into explicit state, we get a finite-state machine that decodes Huffman codes:

    state = getBits(k); // read initial k bits
    while (!done) {
        // Current state determines output symbol
        *out++ = table[state].symbol;

        // Update state (assuming MSB-first bit packing)
        int nbits = table[state].num_bits;
        state = (state << nbits) & ((1 << k) - 1);
        state |= getBits(nbits);
    }

`state` is essentially a *k*-bit shift register that contains our lookahead bits, and we need to update it in a way that matches our bit packing rule. Note that this is precisely the type of Huffman decoder Charles talks about here while explaining ANS. Alternatively, with LSB-first bit packing:

    state = getBits(k);
    while (!done) {
        // Current state determines output symbol
        *out++ = table[state].symbol;

        // Update state (assuming LSB-first bit packing)
        int nbits = table[state].num_bits;
        state >>= nbits;
        state |= getBits(nbits) << (k - nbits);
    }

This is still the exact same table as before, but because we’ve sized the table so that each symbol is decoded in one step, we don’t need a fallback path. But so far this is completely equivalent to what we did before; we’re just explicitly keeping track of our lookahead bits in `state`.

But this process still involves, essentially, two separate state machines: one explicit for our Huffman decoder, and one implicit in the implementation of our bitwise IO functions, which ultimately read data from the input stream at least one byte at a time.

### A bit buffer state machine

For our next trick, let’s look at the bitwise IO we need and turn that into an explicit state machine as well. I’m assuming you’ve implemented bitwise IO before; if not, I suggest you stop here and try to figure out how to do it before reading on.

Anyway, how exactly the bit IO works depends on the bit packing convention used, the little/big endian of the compression world. Both have their advantages and their disadvantages; in this post, my primary version is going to be LSB-first, since it has a clearer correspondence to rANS which we’ll get to later. Anyway, whether LSB-first or MSB-first, a typical bit IO implementation uses two variables, one for the “bit buffer” and one that counts how many bits are currently in it. A typical implementation looks like this:

    uint32_t buffer;   // The bits themselves
    uint32_t num_bits; // Number of bits in the buffer right now

    uint32_t getBits(uint32_t count)
    {
        // Return low "count" bits from buffer
        uint32_t ret = buffer & ((1 << count) - 1);

        // Consume them
        buffer >>= count;
        num_bits -= count;

        // Refill the bit buffer by reading more bytes
        // (kMinBits is a constant here)
        while (num_bits < kMinBits) {
            buffer |= *in++ << num_bits;
            num_bits += 8;
        }

        return ret;
    }

Okay. That’s fine, but we’d like for there to be only one state variable in our state machine, and preferably not just on a technicality such as declaring our one state variable to be a pair of two values. Luckily, there’s a nice trick to encode both the data and the number of bits in the bit buffer in a single value: we just keep an extra 1 bit in the `state`, always just past the last “real” data bit. Say we have an 8-bit `state`, then we assign the following codes (in binary):

| in_binary(state) | num_bits |
|---|---|
| `0 0 0 0 0 0 0 1` | 0 |
| `0 0 0 0 0 0 1 *` | 1 |
| `0 0 0 0 0 1 * *` | 2 |
| `0 0 0 0 1 * * *` | 3 |
| `0 0 0 1 * * * *` | 4 |
| `0 0 1 * * * * *` | 5 |
| `0 1 * * * * * *` | 6 |
| `1 * * * * * * *` | 7 |

The locations denoted `*` store the actual data bits. Note that we’re fitting 1 + 2 + … + 128 = 255 different states into an 8-bit byte, as we should. The only value we’re not using is “0”. Also note that we have `num_bits = floor(log2(state))` precisely, and that we can determine `num_bits` using bit scanning instructions when we need to. Let’s look at how the code comes out:

    uint32_t state; // As described above

    uint32_t getBits(uint32_t count)
    {
        // Return low "count" bits from state
        uint32_t ret = state & ((1 << count) - 1);

        // Consume them
        state >>= count;

        // Refill the bit buffer by reading more bytes
        // (kMinBits is a constant here)
        // Note num_bits is a local variable!
        uint32_t num_bits = find_highest_set_bit(state);
        while (num_bits < kMinBits) {
            // Need to clear 1-bit at position "num_bits"
            // and add a 1-bit at bit "num_bits + 8", hence the
            // "+ (256 - 1)".
            state += (*in++ + (256 - 1)) << num_bits;
            num_bits += 8;
        }

        return ret;
    }

Okay. This is written to be as similar as possible to the implementation we had before. You can phrase the `while` condition in terms of `state` and only compute `num_bits` inside the refill loop, which makes the non-refill case slightly faster, but I wrote it the way I did to emphasize the similarities.
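A sketch of that variant, for the curious. It relies on `num_bits < kMinBits` being equivalent to `state < (1 << kMinBits)`, since the sentinel is the highest set bit. The `BitReader` struct, the function names, the loop-based `find_highest_set_bit` and the choice of `kMinBits = 16` are all assumptions made to keep the example self-contained; like the code above, it reads ahead, so the input needs a few bytes of padding.

```c
#include <stdint.h>

#define kMinBits 16u  /* assumed refill threshold for this sketch */

typedef struct {
    uint32_t state;      /* data bits, with the sentinel 1 bit just above them */
    const uint8_t *in;   /* next input byte (buffer must be padded) */
} BitReader;

/* Index of the highest set bit; x must be nonzero. A real implementation
   would use a bit scan instruction or compiler intrinsic here. */
static uint32_t find_highest_set_bit(uint32_t x)
{
    uint32_t n = 0;
    while ((x >>= 1) != 0)
        n++;
    return n;
}

static void refill(BitReader *br)
{
    /* Cheap test on state; num_bits is only computed when we refill. */
    if (br->state < (1u << kMinBits)) {
        uint32_t num_bits = find_highest_set_bit(br->state);
        do {
            /* Clear the old sentinel, add the byte and the new sentinel. */
            br->state += ((uint32_t)*br->in++ + (256 - 1)) << num_bits;
            num_bits += 8;
        } while (num_bits < kMinBits);
    }
}

static void bits_init(BitReader *br, const uint8_t *in)
{
    br->state = 1;  /* just the sentinel: no bits in buffer */
    br->in = in;
    refill(br);
}

/* Return the next "count" bits (count <= kMinBits), LSB-first. */
static uint32_t get_bits(BitReader *br, uint32_t count)
{
    uint32_t ret = br->state & ((1u << count) - 1);
    br->state >>= count;
    refill(br);
    return ret;
}
```

For the input bytes `A5 3C FF 00`, reading 4, 8 and then 12 bits yields `0x5`, `0xCA` and `0xFF3`, the same values a regular LSB-first bit buffer would return.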

Consuming bits is slightly cheaper than the regular bit buffer, refilling is a bit more expensive, but we’re down to one state variable instead of two. Let’s call that a win for our purposes (and it certainly can be when low on registers). Note I only covered LSB-first bit packing here, but we can do a similar trick for MSB bit buffers by using the least-significant set bit as a sentinel instead. It works out very similar.
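One way the MSB-first variant might look is sketched below. This is my reading of "least-significant set bit as a sentinel", not code from any particular implementation: data bits sit at the top of the word, followed by a single 1 bit and then zeros. `MsbBitReader`, the `msb_*` functions and `kMinBits = 16` are assumptions for this example, and the sketch requires `kMinBits <= 24` so that there is always room for a full byte below the sentinel.

```c
#include <stdint.h>

#define kMinBits 16u  /* assumed refill threshold; must be <= 24 here */

typedef struct {
    uint32_t state;      /* data bits at the top, then a 1 bit, then zeros */
    const uint8_t *in;   /* next input byte (buffer must be padded) */
} MsbBitReader;

/* Index of the lowest set bit; x must be nonzero. */
static uint32_t find_lowest_set_bit(uint32_t x)
{
    uint32_t n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

static void msb_refill(MsbBitReader *br)
{
    /* The buffer holds 31 - sentinel data bits. */
    uint32_t sentinel = find_lowest_set_bit(br->state);
    while (31u - sentinel < kMinBits) {
        uint32_t byte = *br->in++;
        /* Replace the sentinel by 8 data bits plus a new sentinel below. */
        br->state ^= 1u << sentinel;
        br->state |= (2u * byte + 1u) << (sentinel - 8u);
        sentinel -= 8u;
    }
}

static void msb_init(MsbBitReader *br, const uint8_t *in)
{
    br->state = 0x80000000u;  /* just the sentinel: no bits in buffer */
    br->in = in;
    msb_refill(br);
}

/* Return the next "count" bits (1 <= count <= kMinBits), MSB-first. */
static uint32_t msb_get_bits(MsbBitReader *br, uint32_t count)
{
    uint32_t ret = br->state >> (32u - count);
    br->state <<= count;
    msb_refill(br);
    return ret;
}
```

Consuming bits is a plain left shift here, because the sentinel moves along with the data.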

So what happens when we plug this into the finite-state Huffman decoder from before?

### State machine Huffman decoder with built-in bit IO

Note that our state machine decoder above still just kept the *k* lookahead bits in `state`, and that they’re not exactly hard to recover from our bit buffer `state`. In fact, they’re pretty much the same. So we can just fuse them together to get a state machine-based Huffman decoder that only uses byte-wise IO:

    state = 1; // No bits in buffer
    refill();  // Run "refill" step from the loop once

    while (!done) {
        // Current state determines output symbol
        index = state & ((1 << k) - 1);
        *out++ = table[index].symbol;

        // Update state (consume bits)
        state >>= table[index].num_bits;

        // Refill bit buffer (make sure at least k bits in it)
        // This reads bytes at a time, but could just as well
        // read 16 or 32 bits if "state" is large enough.
        num_bits = find_highest_set_bit(state);
        while (num_bits < k) {
            state += (*in++ + (256 - 1)) << num_bits;
            num_bits += 8;
        }
    }

The slightly weird `refill()` call at the start is just to keep the structure as similar as possible to what we had before. And there we have it, a simple Huffman decoder with one state variable and a table. Of course you can combine this type of bit IO with other Huffman approaches, such as multi-table decoding, too. You could also go even further and bake most of the bit IO into tables like Charles describes here, effectively using a table on the actual `state` and not just its low bits, but that leads to enormous tables and is really not a good idea in practice; not only are the tables too large to fit in the cache, general-purpose compressors will also usually spend more time building these tables than they ever spend using them (since it’s rare to use a single Huffman table for more than a few dozen kilobytes at a time).

Okay. So far, there’s nothing in here that’s not at least 20 years old.

### Let’s get weird, stage 1

The decoder above still reads the exact same bit stream as the original LSB-first decoder. But if we’re willing to prescribe the exact form of the decoder, we can use a different refilling strategy that’s more convenient (or cheaper). In particular, we can do this:

    state = read_3_bytes() | (1 << 24); // might as well!

    while (!done) {
        // Current state determines output symbol
        index = state & ((1 << k) - 1);
        *out++ = table[index].symbol;

        // Update state (consume bits)
        state >>= table[index].num_bits;

        // Refill
        while (state < (1 << k))
            state = (state << 8) | *in++;
    }

This is still a workable Huffman decoder, and it’s cheaper than the one we saw before, because refilling got cheaper. But it also got a bit, well, *strange*. Note we’re reading 8 bits and putting them into the *low* bits of `state`; since we’re processing bits LSB-first, that means we added them at the “front” of our bit queue, rather than appending them as we used to! In principle, this is fine. Bits are bits. But processing bits out-of-sequence in that way is certainly atypical, and means extra work for the encoder, which now needs to figure out exactly how to permute the bits so the decoder reads them in the right order. In fact, it’s not exactly obvious that you can encode this format efficiently to begin with.

But you definitely can, by encoding backwards. Because, drum roll: this isn’t a regular table-driven Huffman decoder anymore. What this actually is is a rANS decoder for symbols with power-of-2 probabilities. The `state >>= table[index].num_bits;` is what the decoding state transition function for rANS reduces to in that case.

In other words, this is where we start to see new stuff. It might be possible that someone did a decoder like this before last year, but if they did, I certainly never encountered it before. And trust me, it *is* weird; the byte stream the corresponding encoder emits is uniquely decodable and has the same length as the bit stream generated for the corresponding Huffman or canonical Huffman code, but the bit-shuffling means it’s not even a regular prefix code stream.
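To show what "encoding backwards" can look like, here is a small round-trip sketch. Several things in it are assumptions made for the example rather than part of the description above: the toy 4-symbol code, all function names, and in particular the start-up convention. Instead of `read_3_bytes() | (1 << 24)`, the encoder's final state is simply flushed as four little-endian bytes and the decoder reads it back the same way, which keeps the encoder straightforward. The renormalization itself follows the usual byte-wise rANS scheme with states kept in [2^K, 2^(K+8)).

```c
#include <stddef.h>
#include <stdint.h>

enum { K = 3 };  /* longest code length of the toy code below */

/* Toy LSB-first prefix code (assumed for this example): symbol 0 = "0",
   1 = "10", 2 = "110", 3 = "111", with the first bit stored in bit 0. */
static const uint8_t code_len[4]  = {1, 2, 3, 3};
static const uint8_t code_bits[4] = {0x0, 0x1, 0x3, 0x7};

typedef struct { uint8_t num_bits, symbol; } TableEntry;

/* Table on the low K bits of the state; the code is complete, so every
   entry gets filled. */
static void build_table(TableEntry table[1 << K])
{
    for (uint32_t s = 0; s < 4; s++)
        for (uint32_t hi = 0; hi < (1u << (K - code_len[s])); hi++) {
            uint32_t index = code_bits[s] | (hi << code_len[s]);
            table[index].num_bits = code_len[s];
            table[index].symbol = (uint8_t)s;
        }
}

/* Encode syms[0..n-1] in reverse order, writing bytes backwards so the
   stream ends at buf_end. Returns a pointer to the first stream byte. */
static uint8_t *encode(const uint8_t *syms, size_t n, uint8_t *buf_end)
{
    uint8_t *p = buf_end;
    uint32_t x = 1u << K;  /* any start state in [2^K, 2^(K+8)) would do */
    for (size_t i = n; i-- > 0; ) {
        uint32_t len = code_len[syms[i]];
        /* Shrink x so that the new state stays below 2^(K+8). */
        while (x >= (1u << (K + 8 - len))) {
            *--p = (uint8_t)(x & 0xff);
            x >>= 8;
        }
        /* Inverse of the decoder's "state >>= num_bits". */
        x = (x << len) | code_bits[syms[i]];
    }
    /* Flush the final state as 4 bytes, little-endian. */
    for (int i = 0; i < 4; i++) {
        *--p = (uint8_t)(x >> 24);
        x <<= 8;
    }
    return p;
}

static void decode(const uint8_t *in, const TableEntry *table,
                   uint8_t *out, size_t n)
{
    uint32_t x = 0;
    for (int i = 3; i >= 0; i--)
        x = (x << 8) | in[i];
    in += 4;
    for (size_t i = 0; i < n; i++) {
        uint32_t index = x & ((1u << K) - 1);
        out[i] = table[index].symbol;
        x >>= table[index].num_bits;   /* consume bits */
        while (x < (1u << K))          /* refill */
            x = (x << 8) | *in++;
    }
}
```

The encoder has to see the symbols last-to-first because each step wraps the previous state inside the new one; the decoder then peels the layers off in forward order.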

### Let’s get weird, stage 2: binary alias coding

But there’s one more, which is a direct corollary of the existence of alias rANS: we can use the alias method to build a fast decoding table with size proportional to the number of symbols in the alphabet, completely independent of the code lengths!

Note the alias method allows you to construct a table with an arbitrary number of entries, as long as it’s larger than the number of symbols. For efficiency, you’ll typically want to round up to the next power of 2. I’m not going to describe the exact encoder details here, simply because it’s just rANS with power-of-2 probabilities, and the `ryg_rans` encoder/decoder can handle that part just fine. So you already have example code. But that means you can build a fast “Huffman” decoder like this:

    kMaxCodeLen = 24; // max code len in bits
    kCodeMask = (1 << kMaxCodeLen) - 1;
    kBucketShift = kMaxCodeLen - SymbolStats::LOG2NSYMS;

    state = read_3_bytes() | (1 << 24); // might as well!

    while (!done) {
        // Figure out bucket in alias table; same data structures as in
        // ryg_rans, except syms->slot_nbits (number of bits in Huffman
        // code for symbol) instead of syms->slot_nfreqs is given.
        uint32_t index = state & kCodeMask;
        uint32_t bucket_id = index >> kBucketShift;
        uint32_t bucket2 = bucket_id * 2;
        if (index < syms->divider[bucket_id])
            ++bucket2;

        // bucket determines output symbol
        *out++ = syms->sym_id[bucket2];

        // Update state (just D(x) for pow2 probabilities)
        state = (state & ~kCodeMask) >> syms->slot_nbits[bucket2];
        state += index - syms->slot_adjust[bucket2];

        // Refill (make sure at least kMaxCodeLen bits in buffer)
        while (state <= kCodeMask)
            state = (state << 8) | *in++;
    }

I find this remarkable because essentially all other fast (~constant time per symbol) Huffman decoding tricks have some dependence on the distribution of code lengths. This one does not; the alias table size is determined strictly by the number of symbols. The only fundamental data-dependency is how often the “refill” code is run (it runs, necessarily, once per input byte, so it will run less often – relatively speaking – on highly compressible data than it will on high-entropy data). (I’m not counting the computation of `bucket2` here because it’s just a conditional add, and is in fact written the way it is precisely so that it can be mapped to a compare-then-add-with-carry sequence.)

Note that this one really is a lot weirder still than the previous variant, which at least kept the “space” assigned to individual codes connected. This one will, through the alias table construction, end up allocating small parts of the code range for large symbols all over the place. It’s still exactly equivalent to a Huffman coder in terms of compression ratio and code “lengths”, but the underlying construction really doesn’t have much to do with Huffman at all at this point, and we’re not even emitting particular bit strings for code words anymore.
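For completeness, here is a toy end-to-end sketch of the whole construction. It is emphatically not the `ryg_rans` code: the table is built with a plain small/large pairing in the style of the alias method, the names `claim`, `cum` and `remap` are invented, the encoder-side `remap` table has one entry per slot (fine for K = 4, hopeless for 24-bit codes), and the final state is flushed as four bytes rather than using `read_3_bytes() | (1 << 24)`. Only the decoder loop mirrors the one above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define K      4u                /* max code length; M slots in total */
#define LOG2_N 2u                /* log2(number of buckets) */
#define N      (1u << LOG2_N)
#define M      (1u << K)
#define T      (M / N)           /* slots per bucket */

typedef struct {
    uint32_t divider[N];         /* first slot of bucket b owned by its alias */
    uint8_t  sym_id[2 * N];      /* [2b+1]: slots below divider, [2b]: rest */
    uint8_t  slot_nbits[2 * N];  /* code length of sym_id[i] */
    int32_t  slot_adjust[2 * N]; /* slot index minus rank within the symbol */
    uint8_t  nbits[N];           /* per-symbol code length (encoder) */
    uint32_t cum[N];             /* per-symbol start in remap (encoder) */
    uint32_t remap[M];           /* cum[s] + rank -> slot index (encoder) */
} AliasTable;

/* Give "count" slots starting at "first" to symbol "sym" via "entry". */
static void claim(AliasTable *t, uint32_t *assigned, uint32_t sym,
                  uint32_t entry, uint32_t first, uint32_t count)
{
    t->sym_id[entry] = (uint8_t)sym;
    t->slot_nbits[entry] = t->nbits[sym];
    t->slot_adjust[entry] = (int32_t)first - (int32_t)assigned[sym];
    for (uint32_t j = 0; j < count; j++)
        t->remap[t->cum[sym] + assigned[sym]++] = first + j;
}

/* len[s] in 1..K for used symbols, 0 if unused; needs sum of 2^-len == 1. */
static void build_alias(AliasTable *t, const uint8_t len[N])
{
    uint32_t remaining[N], assigned[N] = {0}, small[N], large[N];
    uint32_t ns = 0, nl = 0, sum = 0;
    for (uint32_t s = 0; s < N; s++) {
        remaining[s] = len[s] ? 1u << (K - len[s]) : 0;
        t->nbits[s] = len[s];
        t->cum[s] = sum;
        sum += remaining[s];
        if (remaining[s] < T) small[ns++] = s; else large[nl++] = s;
    }
    assert(sum == M);
    while (ns > 0) {             /* pair a small symbol with a large one */
        uint32_t s = small[--ns], l = large[--nl], own = remaining[s];
        claim(t, assigned, s, 2 * s + 1, s * T, own);
        claim(t, assigned, l, 2 * s, s * T + own, T - own);
        t->divider[s] = s * T + own;
        remaining[l] -= T - own;
        if (remaining[l] < T) small[ns++] = l; else large[nl++] = l;
    }
    while (nl > 0) {             /* leftovers fill their bucket exactly */
        uint32_t s = large[--nl];
        claim(t, assigned, s, 2 * s + 1, s * T, T);
        claim(t, assigned, s, 2 * s, s * T + T, 0);
        t->divider[s] = s * T + T;
    }
}

/* Encode backwards; returns the first byte of a stream ending at buf_end. */
static uint8_t *encode(const AliasTable *t, const uint8_t *syms, size_t n,
                       uint8_t *buf_end)
{
    uint8_t *p = buf_end;
    uint32_t x = M;
    for (size_t i = n; i-- > 0; ) {
        uint32_t s = syms[i], shift = K - t->nbits[s];  /* log2(frequency) */
        while (x >= (256u << shift)) { *--p = (uint8_t)x; x >>= 8; }
        x = ((x >> shift) << K) + t->remap[t->cum[s] + (x & ((1u << shift) - 1u))];
    }
    for (int i = 0; i < 4; i++) { *--p = (uint8_t)(x >> 24); x <<= 8; }
    return p;
}

static void decode(const AliasTable *t, const uint8_t *in, uint8_t *out, size_t n)
{
    uint32_t x = 0;
    for (int i = 3; i >= 0; i--) x = (x << 8) | in[i];
    in += 4;
    for (size_t i = 0; i < n; i++) {
        uint32_t index = x & (M - 1u);
        uint32_t bucket2 = (index >> (K - LOG2_N)) * 2u;
        if (index < t->divider[bucket2 / 2u]) ++bucket2;
        out[i] = t->sym_id[bucket2];
        x = (x & ~(M - 1u)) >> t->slot_nbits[bucket2];
        x += index - (uint32_t)t->slot_adjust[bucket2];
        while (x < M) x = (x << 8) | *in++;
    }
}
```

With code lengths 1, 2, 3, 3 this gives symbol 0 the slots 0 to 3, 6 to 7 and 10 to 11, which is exactly the "all over the place" allocation mentioned above; `slot_adjust` is what maps those scattered slots back to a contiguous rank.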

All that said, I don’t think this final variant is actually interesting *in practice*; if I did, I would have written about it earlier. If you’re bothering to implement rANS and build an alias table, it really doesn’t make sense to skimp out on the one extra multiply that turns this algorithm into a full arithmetic decoder (as opposed to quasi-Huffman), unless your multiplier is really slow that is.

But I do find it to be an interesting construction from a theoretical standpoint, if nothing else. And if you don’t agree, well, maybe you at least learned something about certain types of Huffman decoders and their relation to table-based ANS decoders. :)