In the previous post, I looked at repeated application of FIR filters on the same signal and laid out why their frequency response is the thing to look at if we’re wondering about their long-term stability under repeated application, because the Discrete-Time Fourier Transform (DTFT) of the filter coefficients corresponds to the eigenvalues of the linear map that convolves a signal with the filter once.

Now, let’s go back to the original problem that Casey asked about: half-pixel (or half-pel) interpolation filters for motion compensation. We know from Casey’s post which of the filters he tried “blow up” under repeated application, and which ones don’t, and we now know some of the theory. The question is, does it track? Let’s find out!

### A small filter zoo: 6 taps

We’ll be looking at the frequency responses of various filters here. As a reminder, what we’re looking at are the values of the function

$\displaystyle \hat{f}(\omega) = \sum_{k=0}^{m-1} f_k \exp(-i \omega k)$

where the $f_k$ are our filter coefficients, which are real numbers. It’s easy to prove from this form that the function $\hat{f}$ is continuous, 2π-periodic, and has $\hat{f}(-\omega) = \overline{\hat{f}(\omega)}$ (where the bar denotes complex conjugation, as usual). Therefore the “interesting” part of the frequency response (for our purposes anyway) is contained in the values of $\hat{f}(\omega)$ for ω in [0,π]. Furthermore, we only care about the “magnitude response”, the absolute values of the (complex) frequency response, which is conventionally plotted in decibels (i.e. effectively on a logarithmic scale); the angle denotes phase response, which is not relevant to our question (and also happens to be relatively boring in this case, because all filters under consideration are symmetric and thus linear-phase FIR filters). Finally I’ll crop the plots I’m showing to only show frequencies from 0 to 0.8π (up to 80% Nyquist frequency) because all filters under consideration sharply decay past that point (if not much earlier), so the portion from 0.8π to π contributes very little information and makes the y axis scaling awkward.

With all that out of the way, let’s start with our first filter: the 2-tap linear interpolation filter [0.5, 0.5] (it’s “bilinear” if you do it in both the X and Y directions, but all filters under consideration are separable and we’re only looking at 1D versions of them, so it’s just straight-up linear interpolation).

The dotted line at 0dB is the line we’re not allowed to cross if we want repeated application to be stable. It’s also the line we want to be exactly on for an ideal interpolation filter. As we can see from the diagram, linear interpolation is good at the first part, not so good at the second part: it’s unit gain at a frequency of 0 but immediately rolls off, which is what causes it to over-blur. Not great.

The next filter that Casey looks at jumps from 2 taps all the way to 6: it’s the half-pixel interpolation filter from H.264/AVC. Here’s its magnitude response:

This one is quite different. We can immediately see that it has above-unit gain for a significant fraction of its spectrum, peaking around 0.5π. That tells us it should blow up in Casey’s repeated-filtering test, and indeed it does. On the plus side, it has a much wider passband, and is in general much closer to an ideal interpolation filter than basic linear interpolation is.

The image for the 6-tap Lanczos filter is not that different:

In fact, this one’s qualitatively so similar that it seems fair to assume that the H.264 filter was designed as an approximation to Lanczos6 with reduced-precision coefficients. (It would certainly be plausible, but I haven’t checked.)

Next up, let’s look at Casey’s filter with quantized coefficients (numerator of 32):

Casey’s filter stays at or below unit gain; it has noticeably below-unit gain for much of its passband which is not ideal, but at least its passband is significantly wider than basic linear interpolation, and looks decent out to about 0.5π before it really starts to cut off.

And while we’re at it, here’s Casey’s other 6-tap filter from the second part of his series:

This one very slightly pokes above unit gain around 0.5π, but evidently little enough not to blow up when the output signal is quantized to 8 bits after every step, as Casey’s test does. Other than that, we have some passband ripple, but are really quite close to unit gain until at least half-Nyquist, after which our performance starts to deteriorate, as it does for all 6-tap filters.

Our final 6-tap filter in this round is a 6-tap Lagrange interpolator. This one’s not in Casey’s series; its coefficients are [ 0.011719, -0.097656, 0.585938, 0.585938, -0.097656, 0.011719 ]. Lagrange interpolators are a classic family of interpolating filters (closely related to Lagrange polynomials) and have extremely flat passbands:

On the good side, yes, the passband really is flat. The trade-off is that the filter response visibly starts to dip around 0.4π: the price we pay for that flat passband is losing more of the high frequencies than we would with other filters. On the other hand, this type of filter is not going to explode with repeated application. (Really, this is the second Lagrange-type filter I’ve shown, since the initial linear interpolation filter coincides with a 2-tap Lagrange interpolator.)

### 8-tap filters

I’m putting these in their own category: the two extra taps make a significant difference in the attainable filter quality, so they’re not on even footing with the 6-tap contenders. Let’s start with the filter from H.265/HEVC:

Some ripple in the passband and an above-unit peak around 0.6π. Looking at the frequency response, this filter should blow up in Casey’s tests, and indeed it does. However it also manages to pass through frequencies all the way out to nearly 0.7π, not something we’ve seen so far.

Here’s 8-tap Lanczos:

Again qualitatively similar to what we see from the HEVC filter, although I overall like the “real” HEVC filter a bit better than this one. Another filter that we would expect to blow up (and Casey’s testing confirms it does).

At the other extreme, 8-tap Lagrange:

Passband straight as a ruler, but not much use past 0.5π. The good news is that, once again, it’s perfectly stable.

Here’s Casey’s contender:

Again some passband ripple and it’s poking slightly above unity around 0.55π, but it manages good results out to about 0.6π before it starts to cut.

### Conclusions

That’s a couple of different filter types, and at this point we’ve seen enough to start drawing some conclusions, namely:

First, adding more taps gives us frequency responses closer to our ideal, which is not exactly surprising. The trade-offs are that filters with more taps are more expensive to evaluate, and that codecs also care about things other than frequency response. Among other things, we generally want reduced-precision fixed-point approximations to the filter coefficients both for efficiency (particularly in hardware decoders) and to ensure different codec implementations agree (which, for various reasons, gets significantly nastier once floating-point is in the mix).

Second, among the filters I showed, there’s a fairly clear spectrum: at one end, we have the Lagrange interpolators with strictly unit-or-below gain. They are completely stable under repeated application but also tend to lose high frequencies fairly quickly. In the other direction, we have filters like the Lanczos-derived ones or those used in H.264 or HEVC that have wider pass bands but also clearly above-unit gain for at least some frequencies, meaning that frequency content in those regions will grow and ultimately explode under repeated application. Finally, Casey’s filters are somewhere in between; they have wider pass bands than pure Lagrange interpolators but keep their maximum gain close enough to 1 to avoid blow-ups when applied to data that is quantized to 8 bit fixed point.

This last part is something I originally intended to do a more in-depth analysis of, which is the reason this post is so delayed. But I just never felt inspired to actually write this part and didn’t make any headway the 3 times I tried to just sit down and write it without inspiration either, so, well, sorry. No proper analysis here. I figured it’s better to at least publish the rest of the article without it.

The gist of it is this: if your samples are quantized to say 8-bit fixed point after every step, it seems plausible that you should be able to get away with slightly above unit gain at some frequencies. Essentially, all the inputs (being quantized) are integers, which means they need to change by at least 0.5 steps in either direction to actually round to a different value. If the gain at a given frequency isn’t high enough to accomplish this change, nothing will happen, and even a filter with above-unit gain for some frequencies that should in theory blow up eventually, won’t. Experimentally this seems to be true and I meant to do a proper proof but as said, you’ll have to live without it for the time being. (I might do another post in the future if I do come up with a nice proof.)

Finally: how much does it matter? Casey’s original posts framed not diverging over repeated application as an essential requirement for a motion interpolation filter in a video codec, but it’s really not that simple.

There’s no question that stability under iteration is desirable, the same way that a perfectly flat response over all frequencies is desirable: ideally we’d like interpolation filters to behave like a perfect all-pass filter. But we care about computational cost and memory bandwidth too, so we don’t get to use infinitely wide filters, which means we can’t get a perfect frequency response. This is somewhat hidden by the cropping I did, but all filters shown decay very sharply above around 80% Nyquist. The stability issue is similar: if we care about stability above all else, there is a known answer, which is to use Lagrange interpolators, which give us perfect stability but have subpar response at the high frequencies.

There is another thing to consider here: in a video codec, motion compensation is part of the coding loop. The encoder knows exactly what the decoder will do with a given frame and set of motion vectors, and will use that result to code further updates against. In short, the interpolation filter is not left to its own devices in a feedback loop. The encoder is continuously monitoring what the current state of the frame in a compliant decoder will be. If there is a build-up of high-frequency energy over time, all that really happens is that at some point, the error will become high enough for the encoder to decide to do something about it, which means sending not just a pure motion vector, but instead a motion vector with a residual (a correction applied to the image data after motion compensation). Said residual is also coded in frequency space (essentially), most commonly using a DCT variant. In short, the end result of using a filter that has resonances in the high frequencies is that a few times per second, the encoder will have to send residuals to cancel out these resonances. This costs extra bits, and correcting errors in the high frequencies tends to be more expensive than for the lower frequencies (image and video codecs can generally code lower frequencies more cheaply than high frequencies).

But suppose we didn’t do that, and instead used say a perfectly stable Lagrange interpolator. It doesn’t have any resonances that would cause the image to blow up, but it does act like a bit of a low-pass filter. In short, instead of steadily gaining energy in the high frequencies, we end up steadily losing energy in the high frequencies. And this too ends up with the encoder having to send high-frequency heavy residuals periodically, this time to add in the missing high frequencies (instead of removing excess ones).

Neither of these is obviously preferable to, or cheaper than, the other. Sending high-frequency heavy residuals is relatively expensive, but we end up having to do it periodically regardless of which type we choose.

That’s not to say that it doesn’t matter at all; it’s just to point out that the actual decision of which type of filter to use in any real codec is not made in a vacuum and depends on other factors. For example, H.264/AVC and HEVC at low bit rates rely aggressively on their deblocking filters, which are essentially low-pass filters applied over block edges. In that context, a somewhat sharpening motion interpolation filter can help mitigate some of the damage, whereas a Lagrange interpolator would make things even more one-sided.

In short, there is no such thing as a single “best” interpolation filter. The decision is made in context and depends on what the rest of the codec does.

Casey Muratori posted on his blog about half-pixel interpolation filters this week, where he ends up focusing on a particular criterion: whether the filter in question is stable under repeated application or not.

There are many things about filters that are more an art than a science, especially where perceptual factors are concerned, but this particular question is both free of tricky perceptual evaluations and firmly in the realm of things we have excellent theory for, albeit one that will require me to start with a linear algebra infodump. So let’s get into it!

### Analyzing iterated linear maps

Take any vector space V over some field $\mathbb{F}$ and any linear map $T : V \rightarrow V$ from that space to itself. An eigenvector v of T is a nonzero element of V such that $T(v) = Tv = \lambda v$ for some $\lambda \in \mathbb{F}$ – that is, the result of applying the map T to v is a scaled version of v itself. The scale factor λ is the corresponding eigenvalue.

Now when we’re iterating the map T multiple times, eigenvectors of T behave in a very simple way under the iterated map: we know that applying T to v gives back a scaled version of v, and then linearity of T allows us to conclude that:
$\displaystyle T^2(v) = T(T(v)) = T(Tv) = T(\lambda v) = \lambda T(v) = \lambda^2 v$
and more generally $T^k(v) = \lambda^k v$ for any $k \in \mathbb{N}$.

The best possible case is that we find lots of eigenvectors – enough to fully characterize the map purely by what it does on its eigenvectors. For example, if V is a finite-dimensional vector space with $\mathrm{dim}(V)=n$, then if we can find n linearly independent eigenvectors, we’re golden: we can select a basis entirely made of eigenvectors, and then written in that basis, T will have a very simple form: we will have $T = Q \Lambda Q^{-1}$ where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ for some Q (whose columns contain n linearly independent eigenvectors of T).

That is, in the right basis (made of eigenvectors), T is just a diagonal matrix, which is to say, a (non-uniform) scale. This makes analysis of repeated applications of T easy, since:
$\displaystyle T^2 = Q \Lambda Q^{-1} Q \Lambda Q^{-1} = Q \Lambda^2 Q^{-1}$
and in general
$T^k = Q \Lambda^k Q^{-1}$ and $\Lambda^k = \mathrm{diag}(\lambda_1^k, \dots, \lambda_n^k)$: viewed in the basis made of eigenvectors, repeated application of T is just repeated scaling, and behaviour over lots of iterations ultimately just hinges on what the eigenvalues are.

Not every matrix can be written that way; ones that can are called diagonalizable. But there is a very important class of transforms (and now we allow infinite-dimensional spaces again) that is guaranteed to be diagonalizable: so called self-adjoint transforms. In the finite-dimensional real case, these correspond to symmetric matrices (matrices A such that $A = A^T$). Such transforms are guaranteed to be diagonalizable, and even better, their eigenvectors are guaranteed to be pairwise orthogonal to each other, meaning the transform Q is an orthogonal matrix (a rotation or reflection), which among other things makes the whole process numerically quite well-behaved.

As a small aside, if you’ve ever wondered why iterative solvers for linear systems usually require symmetric (or, in the complex case, Hermitian) matrices: this is why. If a matrix is symmetric, it it diagonalizable, which allows us to build an iterative process to solve linear equations that we can analyze easily and know will converge (if we do it right). It’s not that we can’t possibly do anything iterative on non-symmetric linear systems; it just becomes a lot trickier to make any guarantees, especially if we allow arbitrary matrices. (Which could be quite pathological.)

Anyway, that’s a bit of background on eigendecompositions of linear maps. But what does any of this have to do with filtering?

### Enter convolution

Convolution itself is a bilinear map, meaning it’s linear in both arguments. That means that if we fix either of the arguments, we get a linear map. Suppose we have a FIR filter f given by its coefficients $(f_0, f_1, \dots, f_{m-1})$. Then we can define an associated linear map $T_f$ on a suitable space, say something like $T_f : \ell^\infty(\mathbb{C}) \rightarrow \ell^\infty(\mathbb{C})$ (writing $\ell^\infty(\mathbb{C})$ for the set of bounded sequences of complex numbers) by setting
$\displaystyle T_f(x) = T_f x := f * x$.

If this is all a bit dense on notation for you, all I’m doing here is holding one of the two arguments to the convolution operator constant, and trying to at least specify what set our map is working on (in this case, bounded sequences of real numbers).

And now we’re just about ready for the punchline: we now have a linear map from a set to itself, although in this case we’re dealing with infinite sequences, not finite ones. Luckily the notions of eigenvectors (eigensequences in this case) and eigenvalues generalize just fine. What’s even better is that for all discrete convolutions, we get a full complement of eigensequences, and we know exactly what they are. Namely, define the family of sequences $e_\omega$ by:

$\displaystyle e_\omega[n] = \exp(i \omega n) = \cos(\omega n) + i \sin(\omega n)$

That’s a cosine wave with frequency ω in the real part and the corresponding sine wave in the imaginary part, if you are so inclined, although I much prefer to stick with the complex exponentials, especially when doing algebra (it makes things easier). Anyway, if we apply our FIR filter f to that signal, we get (this is just expanding out the definition of discrete convolution for our filter and input signal, using the convention that unqualified summation is over all values of k where the sum is well-defined)

$\displaystyle (T_f e_\omega)[n] = \sum_k f_k \exp(i \omega (n-k))$

$\displaystyle = \exp(i \omega n) \underbrace{\sum_k f_k \exp(-i \omega k)}_{=:\hat{f}(\omega)}$
$\displaystyle = \hat{f}(\omega) \exp(i \omega n)$

There’s very little that happens here. The first line is just expanding the definition; then in the second line we use the properties of the exponential function (and the linearity of sums) to pull out the constant factor of $\exp(i \omega n)$. And it turns out the entire rest of the formula doesn’t depend on n at all, so it turns into a constant factor for the whole sequence. It does depend on f and ω, so we label it $\hat{f}(\omega)$. The final line states exactly what we wanted, namely that the result of applying $T_f$ to $e_\omega$ is just a scaled copy of $e_\omega$ itself—we have an eigensequence (with eigenvalue $\hat{f}(\omega)$).

Also note that the formula for the eigenvalue isn’t particularly scary either in our case, since we’re dealing with a FIR filter f, meaning it’s a regular finite sum:

$\displaystyle \hat{f}(\omega) = \sum_{k=0}^{m-1} f_k \exp(-i \omega k)$

Oh, and there’s one more minor detail I’ve neglected to mention so far: that’s just the discrete-time Fourier transform (DTFT, not to be confused with the DFT, although they’re related) of f. Yup, we started out with a digital FIR filter, asked what happens when we iterate it a bunch, did a brief detour into linear algebra, and ended up in Fourier theory.

Long story short, if you want to know whether a linear digital filter is stable under repeated application, you want to look at its eigenvalues, which in turn are just given by its frequency response. In particular, for any given frequency ω, we have exactly three options:

• $|\hat{f}(\omega)| = 1$. In this case, the amplitude at that frequency is preserved exactly under repeated application.
• $|\hat{f}(\omega)| < 1$. If the filter dampens a given frequency, no matter how little, then the amplitude of the signal at that frequency will eventually be driven to zero. This is stable but causes the signal to degrade. Typical interpolation filters tend to do this for the higher frequencies, which is why signals tend to lose such frequencies (in visual terms, get blurrier) over time.
• $|\hat{f}(\omega)| > 1$. If a filter amplifies any frequency by more than 1, even by just a tiny bit, then any signal containing a nonzero amount of that frequency will eventually blow up.

The proof for all three cases is simply observing that k-fold application of the filter f to the signal $e_\omega$ will result in the signal $(\hat{f}(\omega))^k e_\omega$. To generalize this to a wider class of signals (not just complex exponentials) we would need to represent said signals as sum of complex exponentials, which is exactly what Fourier series are all about; so it can be done, but I won’t bother with the details here, since they’re out of the scope of this post.

Therefore, all you need to know about the stability of a given filter under repeated application is contained in its Fourier transform. I’ll try to do another post soon that shows the Fourier transforms of the filters Casey mentioned (or their magnitude response anyway, which is what we care about) and touches on other aspects such as the effect of rounding and quantization, but we’re at a good stopping point right now, so I’ll end this post here.

Hash tables are the most popular probabilistic data structure, by quite a margin. You compute some hash code then compute an array index from that hash code. If you’re using open addressing-class schemes, you then have a rule to find other array indices that a value with the given hash code might be found at; with separate chaining, your array entries point to the head of a linked list, or the root of a binary tree, or whatever other data structure you prefer for the given use case.

No matter what exactly you do, this entire class of schemes always gives you a Las Vegas algorithm, meaning this type of hash table will always retain values for all the keys you inserted, but you’re gambling on how long a lookup takes.

An alternative is to enforce a strict limit P≥1 on the number of probes performed (for open addressing) or the size of any of the secondary data structures (for separate chaining). The result is a “forgetful dictionary”: keys are allowed to disappear, or at least become unretrievable. That’s quite a limitation. In return, the worst-case lookup cost becomes bounded. In short, we now have a Monte Carlo algorithm: we’re now allowed to fail lookups (keys can disappear over time), but runtime variability is significantly reduced.

Let’s call this data structure a “cache table”, both because it’s useful for caching the results of queries and because it matches the most common organization of caches in hardware. We’ve been using that name at RAD for a while now, and I’ve also seen it used elsewhere, so it’s likely someone independently came up with the same name for the same thing, which I take to be a good sign.

In conventional hash tables, we have different probing schemes, or separate chaining with different data structures. For a cache table, we have our strict bound P on the number of entries we allow in the same bucket, and practical choices of P are relatively small, in the single or low double digits. That makes the most natural representation a simple 2D array: N rows of hash buckets with P columns, each with space for one entry, forming a N×P grid. Having storage for all entries in the same row right next to each other leads to favorable memory access patterns, better than most probe sequences used in open addressing or the pointer-heavy data structures typically used in separate chaining.

To look up a key, we calculate the row index from the hash code, using something like row = hash % N. Then we check whether there is a matching entry in that row, by looping over all P columns. That’s also why you don’t want P to get too large. What I’ve just described matches the operation of a P-way set associative cache in hardware exactly, and we will sometimes call P the number of “ways” in the cache table.

P=1 is the simplest case and corresponds to a direct-mapped cache in hardware. Each entry has exactly one location in the cache array where it can go; it’s either there or it’s not present at all. Inserting an entry means replacing the previous entry at that location.

For P≠1, there are multiple choices of which item to replace on insertion; which one is picked is determined by the replacement policy. There are many candidates to choose from, with different implementation trade-offs; a decent option that doesn’t require any extra metadata stored alongside each row is to use random replacement, i.e. just picking a (pseudo-)random column within the given row to evict on every insertion. “Hang on”, you might say, “aren’t hash codes pseudo-random already?”. Yes, they are, but you want to use a random number generator independent of your hash function here, or else you’re really just building a direct-mapped cache with N×P rows. One of the benefits of keeping P entries per row is that it allows us to have two different keys with identical hash values (i.e. a hash collision) in the same table; as per the birthday problem, even with a well-distributed hash, you are likely to see collisions.

So what is this type of data structure useful for? I’ve come across two classes of use cases so far:

• Caching queries, as noted above. It’s used extensively to build (memory) caches in hardware, but it’s useful in software too. If you’re trying to cache results of operations in software, there is often a desire to keep the size of the cache bounded and have lookups take a predictable time; cache tables deliver both and are generally simpler than trying to manually limit the number of live entries in a regular hash table.
• Approximate searching tasks. For example, they’re quite useful in LZ77 match finding – and have been used that way since at least the late 80s (a 1986 patent covering the case P=1, now expired, was the subject of a rather famous lawsuit) and early 90s (P≠1), respectively.

This is a rehash of something I wrote in a forum post something like 10 years ago. It turns out that forum prohibited archiving in its robots.txt so it’s not on the Internet Archive. The original operator of said forum hosted a copy (with broken formatting) of the original posts for a while, but at a different URL, breaking all links. And now that copy’s gone too, again with archiving disabled apparently. Sigh.

I got asked about this yesterday; I don’t have a copy of my original derivation anymore either, but here’s my best attempt at reconstructing what I probably wrote back then, and hopefully it won’t get lost again this time.

Suppose you’re given a unit quaternion $q$ and a vector $v$. Quaternions are a common rotation representation in several fields (including computer graphics and numerical rigid-body dynamics) for reasons beyond the scope of this post. To apply a rotation to a vector, one computes the quaternion product $q v q^*$, where $v$ is implicitly identified with the quaternion with real (scalar) part 0 and $v$ as its imaginary part, and $q^*$ denotes the conjugate of $q$. Such quaternions with a real part of 0 are also referred to as “pure imaginary” quaternions. Anyway, the result of the above product is another pure imaginary quaternion, corresponding to the rotated vector.

This is all explained and motivated elsewhere, and I won’t bother doing so here. Now generally, you often want to apply the same transformation to many vectors. In that case, you’re better off turning the quaternion into the equivalent 3×3 rotation matrix first. You can look up the formula elsewhere or work it out from the expression above with some algebra. That’s not the topic of this post either.

But sometimes you really only want to transform a single vector with a quaternion, or have other reasons for not wanting to expand to an explicit 3×3 (or larger) matrix first, like for example trying to minimize live register count in a shader program. So let’s look at ways of doing that.

### The direct way

First, I’m going to split quaternions in their real (scalar) and imaginary (vector) parts, writing $q=(q_r, q_{xyz})$ where $q_r$ is the real part and $q_{xyz}$ imaginary. The conjugate of $q$ is given by $q^*=(q_r, -q_{xyz})$.

The product of two quaternions $a$ and $b$ is given by

$ab = (a_r b_r - a_{xyz} \cdot b_{xyz}, a_r b_{xyz} + b_r a_{xyz} + a_{xyz} \times b_{xyz})$

where $\cdot$ denotes the usual dot product and $\times$ the cross product. If you’re not used to seeing this in vector notation, I recommend using this (or something more abstract like geometric algebra) for derivations; writing everything in terms of scalars and the $i, j, k$ basis elements makes you miss the forest for the trees.

Anyway, armed with this formula, we can compute our final product without too much trouble. Let’s start with the $qv$ part:

$qv = (-q_{xyz} \cdot v, q_r v + q_{xyz} \times v)$

Not so bad. Now we have to multiply it from the right by $q^*$, which I’ll do in multiple steps. First, let’s take care of the real part, by multiplying our just-computed values for $qv$ with $q^*$ using the general multiplication formula above:

$(qvq^*)_r = -q_r (q_{xyz} \cdot v) - ((q_r v + q_{xyz} \times v) \cdot (-q_{xyz}))$
$= -q_r (q_{xyz} \cdot v) + q_r (v \cdot q_{xyz}) + ((q_{xyz} \times v) \cdot q_{xyz})$
$= q_{xyz} \cdot (q_{xyz} \times v) = 0$

because the first two dot products are identical and the cross product of $q_{xyz}$ and $v$ is perpendicular to $q_{xyz}$. This proves that $qvq^*$ is indeed a pure imaginary quaternion again, just like the $v$ we started out with.

Nice to know, but of course we’re actually here for the vector part:

$(qvq^*)_{xyz} = (-q_{xyz} \cdot v) (-q_{xyz}) + q_r (q_r v + q_{xyz} \times v)$
$+ (q_r v + q_{xyz} \times v) \times (-q_{xyz})$
$= (q_{xyz} \cdot v) q_{xyz} + q_r^2 v + q_r (q_{xyz} \times v) + q_{xyz} \times (q_r v + q_{xyz} \times v)$
$= (q_{xyz} \cdot v) q_{xyz} + q_r^2 v + 2 q_r (q_{xyz} \times v) + q_{xyz} \times (q_{xyz} \times v)$

which so far has used nothing fancier than the antisymmetry of the cross product $a \times b = -b \times a$.

If we pull out and name the shared cross product, we get

$u = q_{xyz} \times v$
$(qvq^*)_{xyz} = (q_{xyz} \cdot v) q_{xyz} + q_r^2 v + 2 q_r u + q_{xyz} \times u$

which is the direct expression for the transformed vector from the formula. This is what you get if you just plug everything into the formulas and apply basic algebraic simplifications until you run out of obvious things to try (which is exactly what we did).

In terms of scalar operation count, this boils down to two cross products at 6 multiplies and 3 additions (well, subtractions) each; one 3D dot product at 3 multiplies and 2 additions; 3 vector-by-scalar multiplications at 3 multiplies each; two scalar multiplies (to form $q_r^2$ and $2q_r$, although the latter can also be computed via addition if you prefer); and finally 3 vector additions at 3 adds each. The total operation count is thus 26 scalar multiplies and 17 additions, unless I miscounted. For GPUs, a multiply-add “a*b+c” generally counts as a single operation, and in that case the scalar operation count is 9 scalar multiplies and 17 scalar multiply-adds.

### Stand back, I’m going to try algebra!

We can do better than that. So far, we haven’t used that $q$ is a unit quaternion, meaning that $q_r^2 + q_{xyz} \cdot q_{xyz} = 1$. We can thus plug in $(1 - q_{xyz} \cdot q_{xyz})$ for $q_r^2$, which yields:

$(qvq^*)_{xyz} = (q_{xyz} \cdot v) q_{xyz} + (1 - q_{xyz} \cdot q_{xyz}) v + 2 q_r u + q_{xyz} \times u$
$= v + (q_{xyz} \cdot v) q_{xyz} - (q_{xyz} \cdot q_{xyz}) v + 2 q_r u + q_{xyz} \times u$

This might look worse, but it’s progress, because we can now apply the vector triple product identity

$a \times (b \times c) = (a \cdot c)b - (a \cdot b)c$

with $a=q_{xyz}$, $b=q_{xyz}$, $c=v$, leaving us with:

$(qvq^*)_{xyz} = v + q_{xyz} \times (q_{xyz} \times v) + 2 q_r u + q_{xyz} \times u$
$= v + q_{xyz} \times u + 2 q_r u + q_{xyz} \times u$
$= v + 2 q_r u + 2 q_{xyz} \times u$

The two remaining terms involving $u$ both multiply by two, so we use a slightly different shared temporary for the final version of our formula:

$t = 2 q_{xyz} \times v$
$(qvq^*)_{xyz} = v + q_r t + q_{xyz} \times t$

Final scalar operation count: without multiply-adds, the two cross products sum to 12 multiplies and 6 adds, scaling t by two takes 3 multiplies (or adds, your choice), and the final vector-by-scalar multiply and summation take 3 multiplies and 6 adds, respectively. That’s 18 multiplies and 12 adds total, or 15 multiplies and 15 adds if you did the doubling using addition.

Either way, that’s a significant reduction on both counts.

With multiply-adds, I count a total of 3 multiplies and 9 multiply-adds for the cross products (the first cross product has nothing to sum to, but the second does); either 3 multiplies or 3 adds (your choice again) for the doubling; and another 3 multiply-adds for the $v + q_r t$ portion. That’s either 6 multiplies and 12 multiply-adds or 3 multiplies, 3 adds and 12 multiply-adds. Furthermore some GPUs can fold a doubling straight into the operand without counting as another operation at all; in that case we get 3 multiplies and 12 multiply-adds.

For comparison, multiplying a vector by a 3×3 rotation matrix takes 9 multiplies and 6 additions (or 3 multiplies plus 6 multiply-adds). So even though this is much better than we started out with, it’s generally still worthwhile to form that rotation matrix explicitly if you plan on transforming lots of vectors by the same quaternion, and aren’t worried about GPU vector register counts or similar.

“Rate-distortion optimization” is a term in lossy compression that sounds way more complicated and advanced than it actually is. There’s an excellent chance that by the end of this post you’ll go “wait, that’s it?”. If so, great! My goal here is just to explain the basic idea, show how it plays out in practice, and maybe get you thinking about other applications. So let’s get started!

### What does “rate-distortion optimization” actually mean?

The mission statement for lossless data compression is pretty clear. A lossless compressor transforms one blob of data into another one, ideally (but not always) smaller, in a way that is reversible: there’s another transform (the decoder) that, when applied to the second blob, will return an exact copy of the original data.

Lossy compression is another matter. The output of the decoder is usually at least slightly different from the original data, and sometimes very different; and generally, it’s almost always possible to take an existing lossily-compressed piece of data, say a short video or MP3 file, make a slightly smaller copy by literally deleting a hundred bytes from the middle of the file with a hex editor, and get another version of the original video (or audio) file that is still clearly recognizable yet degraded. Seriously, if you’ve never done this before, try it, especially with MP3s: it’s quite hard to mangle a MP3 file in a way that will make it not play back anymore, because MP3s have no required file-level headers, tolerate arbitrary junk in the middle of the bitstream, and are self-synchronizing.

With lossless compression, it makes sense to ask “what is the smallest I can make this file?”. With lossy compression, less so; you can generally get files far smaller than you would ever want to use, because the data is degraded so much it’s barely recognizable (if that). Minimizing file size alone isn’t interesting; we want to minimize size while simultaneously maximizing the quality of the result. The way we do this is by coming up with some error metric that measures the distance between the original image and the result the decoder will actually produce given a candidate bitstream. Now our bitstream has two associated values: its length in bits, the (bit) rate, usually called R in formulas, and a measure of how much error was introduced as a result of the lossy coding process, the distortion, or D in formulas. R is almost always measured in bits or bytes; D can be in one of many units, depending on what type of error metric is used.

Rate-distortion optimization (RDO for short) then means that an encoder considers several possible candidate bit streams, evaluates their rate and distortion, and tries to make an optimal choice; if possible, we’d like it to be globally optimal (i.e. returning a true best-possible solution), but at the very least optimal among the candidates that were considered. Sounds great, but what does “better” mean here? Given a pair $(R_1,D_1)$ and another pair $(R_2,D_2)$, how do we tell which one is better?

### The pareto frontier

Suppose what we have 8 possible candidate solutions we are considering, each with their own rate and distortion scores. If we prepare a scatter plot of rate on the x-axis versus distortion on the y-axis, we might get something like this:

The thin, light-gray candidates aren’t very interesting, because what they have in common is that there is at least one other candidate solution that is strictly better than them in every way. That is, some other candidate point has the same (or lower) rate, and also the same (or lower) distortion as the grayed-out points. There’s just no reason to ever pick any of those points, based on the criteria we’re considering, anyway. In the scatterplot, this means that there is at least one other point that is both to the left (lower rate) and below (lower distortion).

For any of the points in the diagram, imagine a horizontal and a vertical line going through it, partitioning the plane into four quadrants. Any point that ends up in the lower-left quadrant (lower rate and lower distortion) is clearly superior. Likewise, any point in the upper-right quadrant (higher rate and higher distortion) is clearly inferior. The situation with the other two quadrants isn’t as clear.

Which brings us to the other four points: the three fat black points, and the red point. These are all points that have no other points to the left and below them, meaning they are pareto efficient. This means that, unlike the situation with the light gray points, we can’t pick another candidate that improves one of the metrics without making the other worse. The set of points that are pareto efficient constitutes the pareto frontier.

These points are not all the same, though. The three fat black points are not just pareto efficient, but are also on the convex hull of the point set (the lower left contour of the convex hull here is drawn using blue lines), whereas the red point is not. The points that are both pareto and on the convex hull of the pareto frontier are particularly important, and can be characterized in a different way.

Namely, imagine taking a straightedge, angling it so that it’s either horizontal, “descending” (with reference to our graph) from left to right, or vertical, and then sliding it slowly “upwards” from the origin without changing its orientation until it hits one of the points in our set. It will look something like this:

The shading here is meant to suggest the motion of the green line; we keep sliding it up until it eventually catches on the middle of our three fat black points. If we change the angle of our line to something else, we can manage to hit the other two black points, but not the red one. This has nothing to do with this particular problem and is a general property of convex sets: any vertices of the set must be extremal in some direction.

Getting a bit more precise here, the various copies of the green line I’ve drawn correspond to lines

$w_1 R + w_2 D = \mathrm{const.}$

and the constraints I gave on the orientation of the line boil down to $w_1, w_2 \ge 0$ (with at least one being nonzero). Sliding our straightedge until we hit a point corresponds to the minimization problem

$\min_i w_1 R_i + w_2 D_i$

for a given choice of w1 and w2, and the three black points are the three possible solutions we might get for our set of candidate points. So we’ve switched from purely minimizing rate or purely minimizing distortion (both of which, in general, tend to give somewhat pathological results) towards minimizing some linear combination of the two with non-negative weights; and doing so with various choices of the weights w1 and w2 will allow us to sweep out the lower left convex hull of the pareto frontier, which is often the “interesting” part (although, as the red point in our example illustrates, this process excludes some of the points on the pareto frontier).

This does not seem particularly impressive so far: we don’t want to purely minimize one quantity or the other, so instead we’re minimizing a linear combination of the two. That seems it would’ve been the obvious next thing to try. But looking at the characterization above does at least give us some idea on what looking at these linear combinations ends up doing, and exactly what we end up giving up (namely, the pareto points not on the convex hull). And there’s another massively important aspect to consider here.

### The Lagrangian connection

If we take our linear combination above and divide through by w1 or w2 (assuming they are non-zero; dividing our objective by a scalar constant does not change the results of the optimization problem), respectively, we get:

$L_1 = R + \frac{w_2}{w_1} D =: R + \lambda_1 D$
$L_2 = D + \frac{w_1}{w_2} R =: D + \lambda_2 R$

which are essentially the Lagrangians we would get for continuous optimization problems of the form “minimize R subject to D=const.” (L1) and “minimize D subject to R=const.” (L2); that is, if we were in a continuously differentiable setting (for data compression we usually aren’t), trying to solve the problems of either minimizing bit rate while hitting a set distortion target or minimizing distortion within a given bit rate limit woud lead us to study the same type of expression. Generalizing one step further, allowing not just equality but also inequality constraints (i.e. rate or distortion at most a certain value, rather then requiring exact match) still leads to essentially the same functions, this time by way of the KKT conditions.

In this post, I intentionally went “backwards” by starting with the Lagrangian-esque expressions and then mentioning the connection to continuous optimization problems because I want to emphasize that this type of linear combination of the different metrics arises naturally, even in a fully discrete setting; starting out with Lagrange or KKT multipliers would get us into technical difficulties with discrete decisions that do not necessary admit continuously differentiable objectives or constraint functions. Since the whole process makes sense even without explicitly mentioning Lagrange multipliers at all, that seemed like the most natural way to handle it.

### What this means in practice

I hopefully now have you convinced that looking at the minima of the linear combination

$w_1 R + w_2 D$

is a sensible thing to do, and both our direct derivation and the two Lagrange multiplier formulations for continuous problems I mentioned lead us towards it. Neither our direct derivation nor the Lagrange multiplier version tells us what to set our mystery weight parameters to, however. In fact, the Lagrange multiplier method flat-out tells you that for every instance of your optimization problem, there exist the right values for the Lagrange multipliers that correspond to an optimum of the problem you’re interested in, but it doesn’t tell you how to get them.

However, what’s nice is that the same thing also works in reverse, as we saw earlier with our line-sweeping procedure: picking the angle of the line we’re sliding around corresponds to picking a Lagrange multiplier. No matter which one we pick, we are going to end up finding an optimal point for some trade-off, namely one that is pareto; it just might not be the exact trade-off we wanted.

For example, suppose we decide that a single-variable parameterization like in the Lagrange multiplier scenario is convenient, and we pick something like L1, namely w1 fixed at 1, w2 = λ. We were assuming from the beginning that we have some method of generating candidate solutions; all that’s left to do is to rank them and pick a winner. Let’s start with λ=1, which leads to a solution with some (R,D) pair that minimizes $R + D$. Note it’s often useful to think of these quantities with units attached; we typically measure R in bits and [D] is whatever it is, so the unit of λ must be [λ] = bits/[D], i.e. λ is effectively an exchange rate that tells us how many units of distortion are worth as much as a single bit. We can then look at the R and D values of the solution we got back. If say we’re trying to hit a certain bit rate, then if R is close to that target, we might be happy and just stop. If R is way above the target bit rate, we overshot, and clearly need to penalize high distortions less; we might try another round with λ=0.1 next. Conversely, if say R is a few percent below the target rate, we might try another round with slightly higher lambda, say λ=1.02, trying to penalize distortion a bit more so we spend more bits to reduce it, and see if that gets us even closer.

With this kind of process, even knowing nothing else about the problem, you can systematically explore the options along the pareto frontier and try to find a better fit. What’s nice is that finding the minimum for a given choice of parameters (λ in our case) doesn’t require us to store all candidate options considered and rank them later; storing all candidates is not a big deal in our original example, where we were only trying to decide between a handful of options, but in practice you often end up with huge search spaces (exponential in the problem size is not unheard of), and being able to bake it down to a single linear function is convenient in other ways; for example, it tends to work well with efficient dynamic programming solutions to problems that would be infeasible to handle with brute force.

### Wait, that’s it?

Yes, pretty much. Instead of trying to purely minimize bit rate or distortion, you use a combination $R + \lambda D$ and vary λ to taste to hit your targets. Often, you know enough about your problem space to have a pretty good idea of what values λ should have; for example, in video compression, it’s pretty common to tie the λ used when coding residuals to quantizer step size, based on the (known) behavior of the quantizer and the (expected) characteristics of real-world signals. But even when you don’t know anything about your data, you can always use a search process for λ as outlined above (which is, of course, slower).

Now the one thing to note in practice is that you rarely use just a single distortion metric; for example, in video coding, it’s pretty common to use one distortion metric when quantizing residuals, a different one for motion search, and a third for overall block mode decision. In general, the more frequently something is done (or the bigger the search space is), the more willing codecs are to make compromises with their distortion measure to enable more efficient searches. In general, good results require both decent metrics and doing a good job exploring the search space, and accepting some defects in the metrics in exchange for a significant increase in search space covered in the same amount of time is often a good deal.

But the basic process is just this: measure bit rate and distortion (in your unit of choice) for every option you’re considering, and then rank your options based on their combined $R + \lambda D$ (or $D + \lambda' R$, which is a different but equivalent parameterization) scores. This gives you points on the lower left convex hull of the pareto frontier, which is what you want.

This applies in other settings as well. For example, the various lossless compressors in Oodle are, well, lossless, but they still have a bunch of decisions to make, some of which take more time in the decoder than others. For a lossless codec, measuring “distortion” doesn’t make any sense (it’s always 0), but measuring decode time does; so the Oodle encoders optimize for a trade-off between compressed size and decode time.

Of course, you can have more parameters too; for example, you might want to combine these two ideas and do joint optimization over bit rate, distortion, and decode time, leading to an expression of the type $R + \lambda D + \mu T$ with two Lagrange multipliers, with λ as before, and a second multiplier μ that encodes the exchange rate from time units into bits.

Either way, the details of this quickly get complicated, but the basic idea is really quite simple. I hope this post de-mystifies it a bit.

(Continued from part 2. “A whirlwind introduction to dataflow graphs” is required reading.)

Last time, we saw a whole bunch of different bit reader implementations. This time, I’ll continue with a few more variants, implementation considerations on pipelined superscalar CPUs, and some ways to use the various degrees of freedom to our advantage.

### Dependency structure of bit readers

To get a better feel for where the bottlenecks in bit decoding are, let me restate some of the bit reading approaches we’ve covered in the previous parts again in our pseudo-assembly language, and then we can have a look at the corresponding dependency graphs.

Let’s start with variant 3 from last time, but I’ll do a LSB-first version this time:

refill3_lsb:
rBytesConsumed = lsr(rBitPos, 3);
rBitPtr = rBitPtr + rBytesConsumed;
rBitPos = rBitPos & 7;

peekbits3_lsb(count):
rBits = lsr(rBitBuf, rBitPos);
rBits = rBits & rBitMask; // result

consume3_lsb(count):
rBitPos = rBitPos + count;


Note that if count is a compile-time constant, the computation for rBitMask can be entirely constant-folded. Peeking ahead by a constant, fixed number of bits then working out from the result how many bits to actually consume is quite common in practice, so that’s what we’ll do. If we do a refill followed by two peek/consume cycles with the consume count being determined from the read bits “somehow”, followed by another refill (for the next loop iteration), the resulting pseudo-asm is like this:

    // Initial refill
rBytesConsumed = lsr(rBitPos, 3);   // Consumed 0
rBitPtr = rBitPtr + rBytesConsumed; // Advance 0
rBitPos = rBitPos & 7;              // LeftoverBits 0

// First decode (peek count==19)
rBits = lsr(rBitBuf, rBitPos);      // BitsRemaining 0
rBits = rBits & 0x7ffff;            // BitsMasked 0
rCount = determineCount(rBits);     // DetermineCount 0
rBitPos = rBitPos + rCount;         // PosInc 0

// Second decode
rBits = lsr(rBitBuf, rBitPos);      // BitsRemaining 1
rBits = rBits & 0x7ffff;            // BitsMasked 1
rCount = determineCount(rBits);     // DetermineCount 1
rBitPos = rBitPos + rCount;         // PosInc 1

// Second refill
rBytesConsumed = lsr(rBitPos, 3);   // Consumed 1
rBitPtr = rBitPtr + rBytesConsumed; // Advance 1
rBitPos = rBitPos & 7;              // LeftoverBits 1


And the dependency graph looks dishearteningly long and skinny:

Ouch. That’s averaging less than one instruction per cycle, and it’s all in one big, serial dependency chain. Not depicted in this graph but also worth noting is that the 4-cycle latency edge from “Load” to “BitsRemaining” is a recurring delay that will occur on every refill, because the computation of the updated rBitPtr depends on the decode prior to the refill having been completed. Now this is not a full decoder, since I’m showing only the parts to do with the bitstream IO (presumably a real decoder also contains code to, you know, actually decode the bits and store the result somewhere), but it’s still somewhat disappointing. Note that the DetermineCount step is a placeholder: if the count is known in advance, for example because we’re reading a fixed-length field, you can ignore it completely. The single cycle depicted in the graph is OK for very simple cases; more complicated cases will often need multiple cycles here, for example because they perform a table lookup to figure out the count. Either way, even with our optimistic single-cycle single-operation DetermineCount step, the critical path through this graph is pretty long, and there’s very little latent parallelism in it.

Does variant 4 fare any better? The primitives look like this in pseudo-ASM:

refill4_lsb:
rNextSh = lsl(rNext, rBitCount);
rBitBuf = rBitBuf | rNextSh;

// Most instruction sets don't have a subtract-from-immediate
// but do have xor-by-immediate, so this is an advantageous
// way to write 63 - rBitCount. (This works since we know that
// rBitCount is in [0,63]).

rBitCount = rBitCount | 56;

peekbits4_lsb(count):
rBits = rBitBuf & rBitMask; // result

consume4_lsb(count):
rBitBuf = lsr(rBitBuf, count);
rBitCount = rBitCount - count;


the pseudo-code for our “refill, do two peek/consume cycles, then refill again” scenario looks like this:

    // Initial refill
rNextSh = lsl(rNext, rBitCount);    // NextShift 0
rBitBuf = rBitBuf | rNextSh;        // BitInsert 0
rBitCount = rBitCount | 56;         // RefillCount 0

// First decode (peek count==19)
rBits = rBitBuf & 0x7ffff;          // BitsMasked 0
rCount = determineCount(rBits);     // DetermineCount 0
rBitBuf = lsr(rBitBuf, rCount);     // ConsumeShift 0
rBitCount = rBitCount - rCount;     // ConsumeSub 0

// Second decode
rBits = rBitBuf & 0x7ffff;          // BitsMasked 1
rCount = determineCount(rBits);     // DetermineCount 1
rBitBuf = lsr(rBitBuf, rCount);     // ConsumeShift 1
rBitCount = rBitCount - rCount;     // ConsumeSub 1

// Second refill
rNextSh = lsl(rNext, rBitCount);    // NextShift 1
rBitBuf = rBitBuf | rNextSh;        // BitInsert 1
rBitCount = rBitCount | 56;         // RefillCount 1


with this dependency graph:

That’s a bunch of differences, and you might want to look at variant 3 and 4 in different windows side-by-side. The variant 4 refill does take 3 extra instructions, but we can immediately see that we get more latent instruction-level parallelism (ILP) in return:

1. The variant 4 refill splits into three dependency chains, not two.
2. The LoadNext for the second refill can start immediately after the AdvancePtr for the first refill, moving the load off the critical path for the second and subsequent iterations. Variant 3 has a 6-cycle latency from the determination of the final rBitPos in the first iteration to a refilled rBitBuf; in variant 4, that latency shrinks to 2 cycles (one shift and an OR). In other words, while the refill takes more instructions, most of them are off the critical path.
3. The consume step in variant 4 has two parallel computations; in variant 3, the rBitPos update is critical and feeds into the shift in the next “peek” operation. Variant 4 has a single shift (to consume bits) on the critical path to the next peek; as a result, the latency between two subsequent decodes is one cycle less in variant 4: 3 cycles instead of 4.

In short, this version trades a slight increase in refill complexity for a noticeable latency reduction of several key steps, provided it’s running on a superscalar CPU. That’s definitely nice. On the other hand, the key decode steps are still very linear. We’re limited by the latency of a long chain of serial computations, which is a bad place to be: if possible, it’s generally preferable to be limited by throughput (how many instructions we can execute), not latency (how fast we can complete them). Especially so if most of the latency in question comes from integer instructions that already have a single cycle of latency. Over the past 30 years, the number of executions units and instructions per cycle in mainstream CPU parts have steadily, if slowly, increased. But if we want to see any benefit from this, we need to write code that has a use for these extra execution resources.

### Multiple streams

As is often the case, the best solution to this problem is the straightforward one: if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once? And indeed, this is much better; there’s not much point to showing a graph here, since it’s literally just two copies of a single-stream graph next to each other. Even with a very serial decoder like variant 3 above, you can come a lot closer to filling up a wide out-of-order machine as long as you use enough streams. To a first-order approximation, using N streams will also give you N times the latent ILP—and given how serial a lot of the direct decoders are, this will translate into a substantial (usually not quite N-times, but still very noticeable) speed-up in the decoder on wide-enough processors. So what’s the catch? There are several:

1. Using multiple streams is a change to the bitstream format, not just an implementation detail. In particular, in any long-term storage format, any change in the number of bitstreams is effectively a change in the protocol or file format.
2. You need to define how to turn the multiple streams into a single output bytestream. This can be simple concatenation along with a header, it can be some form of interleaving or a sophisticated framing format, but no matter what it ends up being, it’s an increase in complexity (and usually also in storage overhead) relative to producing a single bitstream that contains everything in the order it’s read.
3. For anything with short packets and low latency requirements (e.g. game packets or voice chat), you either have to interleave streams fairly finely-grained (increasing size overhead), or suffer latency increases.
4. Decoding from N streams in parallel increases the amount of internal state in the decoder. In the decoder variants shown above, a N-wide variant needs N copies of rBitBuf, rBitPos/rBitCount and rBitPtr, at the very least, plus several temporary registers. For N=2 this is usually not a big deal, but for large counts you will start to run out of registers at least on some targets. There’s relatively little work being done on any given individual data item; if values get spilled from registers, the resulting loads and stores tend to have a very noticeable cost and will easily negate the benefit from using more streams.

In short, it’s not a panacea, but one of the usual engineering trade-offs. So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine. Having two streams adds a relatively small amount of overhead to the bitstream format (to signal the start of the data for stream 2 in every coding unit, or something equivalent), needs a modest amount of extra state for the second bit decoder, and tends to result in sizeable wins on pretty much any current CPU.

Using more than 2 streams can be a significant win in tight loops that do nothing but bitstream decoding, but is overkill in most other cases. Before you commit to a specific (high) number, you ideally want to try implementations on at least a few different target devices; a good number on one device may be past a big performance cliff on another, and having that kind of thing enshrined in a protocol or file format is unfortunate.

### Aside: SIMD? GPU?

If you use many streams, can you use SIMD instructions, or offload work to a GPU? Yes, you can, but the trade-offs get a bit icky here.

Vectorizing the simple decoders outlined above directly is, generally speaking, not great. There’s not a lot of computation going on per iteration, and operations such as refills end up using gathers, which tend to have a high associated overhead. To hide this overhead, and the associated latencies, you generally still need to be running multiple instances of your SIMD decoder in parallel, so your total number of streams ends up being the number of SIMD lanes times two (or more, if you need more instances). Having a high number of streams may be OK if all your targets have good wide SIMD support, but can be a real pain if you need to decode on at least one that doesn’t.

The same thing goes for GPUs, but even more so. With single warps/wavefronts of usually 16-64 invocations, we’re talking many streams just to not be running a kernel at quarter utilization, and we generally need to dispatch multiple warps worth of work to hide memory access latency. Between these two factors, it’s easy to end up needing well over 100 parallel streams just to not be stalled most of the time. At that scale, the extra overhead for signaling individual stream boundaries is definitely not negligible anymore, and the magic numbers are different between different GPU vendors; striking a useful compromise between the needs of different GPUs while also retaining the ability to decode on a CPU if no suitable GPU is available starts to get quite tricky.

There are techniques to at least make the memory access patterns and interleaving overhead somewhat more palatable (I wrote about this elsewhere), but this is an area of ongoing research, and so far there’s no silver bullet I could point at and just say “do this”. This is definitely a challenge going forward.

### Tricks with multiple streams

If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams, with a header telling you how long the individual pieces are, here pictured for 3 streams:

Also pictured are the initial stream bit pointers before reading anything (pointers in a C-like or assembly-like setting; if you’re using something higher-level, probably indices into a byte slice). The beginning of stream 0 is implicit—right after the end of the header—and the end of the final stream is often supplied by an outer framing layer, but the initial positions of bitptr1 and bitptr2 need to be signaled in the bytestream somehow, usually by encoding the length of streams 0 and 1 in the header.

One thing I haven’t mentioned so far are bounds checks. Compressed data is normally untrusted since all the channels you might get that data from tend to be prone to either accidental (error during storage or in transit) or intentional (malicious attacker trying to craft harmful data) corruption, so careful input validation is not optional. What this usually boils down to in practice is that every load from the bitstream needs to be guarded by a range check that guarantees it’s in bounds. The overhead of this can be reduced in various ways. For example, one popular method is to unroll loops a few times and check at the top that there are enough bytes left for worst-case number of bytes consumed in the number of unrolled iterations, then only dropping to a careful loop that checks every single byte access at the very end of the stream. I’ve written about another useful technique before.

But why am I mentioning this here? Because it turns out that with multiple streams laid out sequentially, the overhead of bounds checking can be reduced. A direct range check for 3 streams that checks whether there are at least K bytes left would look like this:

// This needs to happen before we do any loads:
// If any of the streams are close to exhausted
// (fewer than K bytes left), drop to careful loop
if (bitend0 - bitptr0 < K ||
bitend1 - bitptr1 < K ||
bitend2 - bitptr2 < K)
break;


But when the three streams are sequential, we can use a simpler expression. First, we don’t actually need to worry about reading past the end of stream 0 or stream 1 as long as we still stay within the overall containing byte slice. And second, we can relax the check in the inner loop to use a much weaker test:

// Only check the last stream against the end; for
// other streams, simply test whether an the read
// pointer for an earlier stream is overtaking the
// read ponter for a later stream (which is never
// valid)
if (bitptr0 > bitptr1 ||
bitptr1 > bitptr2 ||
bitend2 - bitptr2 < K)
break;


The idea is that bitptr1 starts out pointing at bitend0, and only keeps increasing from there. Therefore, if we ever have bitptr0 > bitptr1, we know for sure that something went wrong and we read past the end of stream 0. That will give us garbage data (which we need to handle anyway), but not read out of bounds, since the checks maintain the invariant that bitptr0bitptr1bitptr2bitend2 - K. A later careful loop should use more precise checking, but this variant of the test is simpler and doesn’t require most of the bitend values to be reloaded in every iteration of our decoding loop.

Another interesting option is to reverse the order of some of the streams (which flips endianness as a side effect), and then glue pairs of forward and backward streams together, like shown here for streams 1 and 2:

I admit this sounds odd, but this has a few interesting properties. One of them is that it shrinks the amount of header space somewhat: in the image, the initial stream pointer for stream 2 is the same as the end of the buffer, and if there were 4 streams, the initial read pointers for stream 2 and 3 would start out in the same location (but going opposite directions). In general, we only need to denote the boundaries between stream pairs instead of individual streams. Then we let the decoder run as before, checking that the read cursors for the forward/backward pair don’t cross. If everything went right, once we’ve consumed the entire input stream, the final read cursors in a forward/backward pair should end up right next to each other. It’s a bit strange in that we don’t know the size of either stream in advance, just their sum, but it works fine.

Another consequence is that there’s no need to keep track of an explicit end pointer in the inner decoder loop if the final stream is a backwards stream; the pointer-crossing check takes care of it. In our running example, we’re now down to

// Check for pointer crossing; if done right, we get end-of-buffer
if (bitptr0 > bitptr1 ||
bitptr1 > bitptr2)
break;


In this version, bitptr0 and bitptr1 point at the next byte to be read in the forwards stream, whereas bitptr2 is offset by -K to ensure we don’t overrun the buffer; this is just a constant offset however, which folds into the memory access on regular load instructions. It’s all a bit tricky, but it saves a couple instructions, makes the bitstream slightly smaller and reduces the number of live variables in a hot loop, with the savings usually being larger the cost of a single extra endian swap. With a two-stream layout, generating the second bitstream in reverse also happens to be convenient on the encoder side, because we can reserve memory for the expected (or budgeted) size of the combined bitstream without having to guess how many bytes end up in either half; it’s just a regular double-ended stack. Once encoding is done, the two parts can be compacted in-place by moving the second half downwards.

None of these properties are a big deal in and of themselves, but they make for a nice package, and a two-stream setup with a forwards/backwards pair is now our default layout for most parts in most parts of the Oodle bitstream (Oodle is a lossless data compression library I work on).

Between the various tricks outlined so far, the size overhead and the extra CPU cost for wrangling multiple streams can be squeezed down quite far. But we still have to deal with the increased number of live variables that multiple streams imply. It turns out that if we’re willing to tolerate a moderate increase in critical path latency, we can reduce the amount of state variables per bit reader, in some cases while simultaneously (slightly) reducing the number of instructions executed. The advantage here is that we can fit more streams into a given number of working registers than we could otherwise; if we can use enough streams that we’re primarily limited by execution throughput and not critical path latency, increasing said latency is OK, and reducing the overall number of instructions helps us increase the throughput even more. So how does that work?

### Bit reader variant 5: minimal state, throughput-optimized

The bit reader variants I’ve shown so far generally split the bit buffer state across two variables: one containing the actual bits and another keeping track of how many bits are left in the buffer (or, equivalently, keeping track of the current read position within the buffer). But there’s a simple trick that allows us to reduce this to a single state variable: the bit shifts we use always shift in zeros. If we turn the MSB (for a LSB-first bit buffer) or the LSB (for a MSB-first bit buffer) into a marker bit that’s always set, we can use that marker to track how many bits we’ve consumed in total come the next refill. That allows us to get rid of the bit count and the instructions that manipulate it. That means one less variable in need of a register, and depending on which variant we’re comparing to, also fewer instructions executed per “consume”.

I’ll present this variant in the LSB-first version, and this time there’s an actual reason to (slightly) prefer LSB-first over MSB-first.

const uint8_t *bitptr; // Pointer to current byte
uint64_t bitbuf = 1ull << 63; // Init to marker in MSB

void refill5_lsb() {
assert(bitbuf != 0);

// Count how many bits we consumed using a "leading zero
// count" instruction. See notes below.

bitptr += bits_consumed >> 3;

// Refill and put the marker in the MSB
bitbuf = read64LE(bitptr) | (1ull << 63);

// Consume the bits in this byte that we've already used.
bitbuf >>= bits_consumed & 7;
}

uint64_t peekbits5_lsb(int count) {
assert(count >= 1 && count <= 56);
// Just need to mask the low bits.
return bitbuf & ((1ull << count) - 1);
}

void consume5_lsb(int count) {
bitbuf >>= count;
}


This “count leading zeros” operation might seem strange and weird if you haven’t seen it before, but it happens to be something that’s useful in other contexts as well, and most current CPU architectures have fast instructions that do this! Other than the strangeness going on in the refill, where we first have to figure out the number of bits consumed from the old marker bit, then insert a new marker bit and do a final shift to consume the partial bits from the first byte, this is like a hybrid between variants 3 and 4 from last time.

The pseudo-assembly for our running “refill, two decodes, then another refill” scenario goes like this: (not writing out the marker constant explicitly here)

    // Initial refill
rBitsConsumed = clz64(rBitBuf);     // CountLZ 0
rMarked = rNext | MARKER;           // OrMarker 0
rLeftover = rBitsConsumed & 7;      // LeftoverBits 0
rBitBuf = lsr(rMarked, rLeftover);  // ConsumeLeftover 0

// First decode (peek count==19)
rBits = rBitBuf & 0x7ffff;          // BitsMasked 0
rCount = determineCount(rBits);     // DetermineCount 0
rBitBuf = lsr(rBitBuf, rCount);     // Consume 0

// Second decode
rBits = rBitBuf & 0x7ffff;          // BitsMasked 1
rCount = determineCount(rBits);     // DetermineCount 1
rBitBuf = lsr(rBitBuf, rCount);     // Consume 1

// Second refill
rBitsConsumed = clz64(rBitBuf);     // CountLZ 1
rMarked = rNext | MARKER;           // OrMarker 1
rLeftover = rBitsConsumed & 7;      // LeftoverBits 1
rBitBuf = lsr(rMarked, rLeftover);  // ConsumeLeftover 1


The refill has 7 integer operations, the same as variant 4 (“looakhead”) above, and 3 more than variant 3 (“bit extract”), while the decode step takes 3 operations (including the determineCount step), one fewer than variants 3 (“bit extract”) and 4 (“lookahead”). The latter means that we equalize with the regular bit extract form in terms of instruction count when we perform at least 3 decodes per refill, and start to pull ahead if we manage more than 3. For completeness, here’s the dependency graph:

Easily the longest critical path of the variants we’ve seen so far, and very serial indeed. It doesn’t help that not only do we not know the load address early, we also have several more steps in the refill compared to the basic variant 3. But having the entire “hot” bit buffer state concentrated in a single register (rBitBuf) during the decodes means that we can afford many streams at once, and with enough streams that extra latency can be hidden.

This one definitely needs to be deployed carefully, but it’s a powerful tool when used in the right place. Several of the fastest (and hottest) decoder loops in Oodle use it.

Note that with this variation, there’s a reason to stick with the LSB-first version: the equivalent MSB-first version needs a way to count the number of trailing zero bits, which is a much less common instruction, although it can be synthesized from a leading zero count and standard arithmetic/logical operations at acceptable extra cost. Which brings me to my final topic for this post.

### MSB-first vs. LSB-first: the final showdown

Throughout this 3-parter series, I’ve been continually emphasizing that there’s no major reason to prefer MSB-first or LSB-first for bit IO. Both are broadly equivalent and have efficient algorithms. But having now belabored that point sufficiently, if we can make both of them work, which one should we choose?

There are definitely differences that push you into one direction or another, depending on your intended use case. Here are some you might want to consider, in no particular order:

• As we saw in part 2, the natural MSB-first peekbits and getbits implementations run into trouble (of the undefined-behavior and hardware-actually-behaving-in-surprising-ways kind) when count == 0, whereas with the natural LSB-first implementation, this case is unproblematic. If you need to support counts of 0 (usefol for e.g. variable-length codes), LSB-first tends to be slightly more convenient. Alternatives for MSB-first are a rotate-based implementation (which has no problems with 0 count) or using an extra shift, turning x >> (64 - count) into (x >> 1) >> (63 - count).
• MSB-first coding tends to have a big edge for universal variable-length codes. Unary codes can be decoded quickly via the aforementioned “count leading zero” instructions; gamma codes and the closely related Exp-Golomb codes also admit direct decoding in a fairly slick way; and the same goes for Golomb-Rice codes and a few others. If you’re considering universal codes, MSB-first is definitely handier.
• At the other extreme, LSB-first coding often ends up slightly cheaper for the table-based decoders commonly used when a code isn’t fixed as part of the format; Huffman decoders for example.
• MSB-first meshes somewhat more naturally with big-endian byte order, and LSB-first with little-endian. If you’re deeply committed to either side in this particular holy war, this might drive you one way or the other.

Charles and me both tend to default to MSB-first but will switch to LSB-first where it’s a win on multiple target architectures (or on a single important target).

### Conclusion

That’s it for both this post and this mini-series; apologies for the long delay, caused by first a surprise deadline that got dropped in my lap right as I was writing the series originally, and then exacerbated by a combination of technical difficulties (alas, still ongoing) and me having gotten “out of the groove” in the intervening time.

This post ended up longer than my usual, and skips around topics a bit more than I’d like, but I really didn’t want to make this series a four-parter; I still have a few notes here and there, but I don’t want to drag this topic out much longer, not on this general level anyway. Instead, my plan is to write about some more down-to-earth case studies soon, so I can be less hand-wavy, and maybe even do some actual assembly-level analysis for an actual real-world CPU instead of an abstract idealized machine. We’ll see.

Until then!

While in the middle of writing “Reading bits in far too many ways, part 3”, I realized that I had written a lot of background material that had absolutely nothing to do with bit I/O and really was worth putting in its own post. This is that post.

The problem I’m concerned with is fairly easy to state: say we have some piece of C++ code that we’re trying to understand (and perhaps improve) the performance of. A good first step is to profile it, which will give us some hints which parts are slow, but not necessarily why. On a fundamental level, any kind of profiling (or other measurement) is descriptive, not predictive: it can tell you how an existing system is behaving, but if you’re designing something that’s more than a few afternoons worth of work, you probably don’t have the time or resources to implement 5 or 6 completely different design alternatives, pick whichever one happens to work best, and throw the rest away. You should be able to make informed decisions up front from an algorithm sketch without having to actually write a fleshed-out implementation.

One thing I want to emphasize particularly here is that experiments coupled with before/after measurements are no adequate substitute for a useful performance model. These kinds of measurements can tell you how much you’ve improved, but not if you are where you should be: if I tell you that by tweaking some config files, I managed to double the number of requests served per second by the web server, that sounds great. It sounds less good if I give you the additional piece of information that with this fix deployed, we’re now at a whopping 1.5 requests per second; having an absolute scale of reference matters!

This goes especially for microbenchmarks. With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.

### Machine models

So, if we want to go deeper than just squinting at C/C++ code and doing some hand-waving, we need to start looking at a somewhat lower abstraction level and define a machine model that is more sophisticated than “statements execute one by one”. If you’re only interested in a single specific processor, one option is to use whatever documentation and tools you can find for the chip in question and analyze your code in detail for that specific machine. And if you’re willing to go all-out on microarchitectural tweaking, that’s indeed the way to go, but it’s a giant step from looking at C++ code, and complete overkill in most cases.

Instead, what I’m going to do is use a simplified machine model that allows us to make quantitative predictions about the behavior of straightforward compute-bound loops, which is simple to describe but still gives us a lot of useful groundwork for more complex scenarios. Here’s what I’ll use:

• We have an unlimited set of 64-bit integer general-purpose registers, which I’ll refer to by names like rSomething. Any “identifiers” that aren’t prefixed with a lowercase r are either symbolic constants or things like labels.
• We have the usual 64-bit integer arithmetic and logic operations. All operations can either be performed between two registers or a register and an immediate constant, and the result is placed in another register. All arithmetic uses two’s complement. For simplicity, all 64-bit values are permitted as immediate constants.
• There’s a flat, byte-granular 64-bit address space, and pointers are just represented as integers.
• All memory accesses require explicit load and store operations. Memory accesses are either 8, 16, 32, or 64 bits in size and can use (for my convenience) both little-endian or big-endian byte ordering, when requested. One of these is the default, but both are the same cost. Narrow stores store the least significant bits of the register in question; narrow loads zero-extend to 64 bits. Loads and stores have a few common addressing modes (that I’ll introduce as I use them). Unaligned loads and stores are supported.
• There’s unconditional branches, which just jump to a given location, and conditional branches, which compare a register to either another register or an immediate constant, and branch to a given destination if the condition is true.

Code will be written in a pseudo-C form, at most one instruction per line. Here’s a brief example showing what kind of thing I have in mind:

loop:                            // label
rFoo = rBar | 1;               // bitwise logical OR
rFoo = lsl(rFoo, 3);           // logical shift left
rBar = asr(rBar, rBaz);        // arithmetic shift right
store16BE(rDest + 3, rMem);    // big-endian store
rCount = rCount - 1;           // basic arithmetic
if rCount != 0 goto loop;      // branch


Shifts use explicit mnemonics because there’s different types of right shifts and at this level of abstraction, registers are generally treated as untyped bags of bits. I’ll introduce other operations and addressing modes as we get to them. What we’ve seen so far is quite close to classic RISC instruction sets, although I’ll allow a larger set of addressing modes than some of the more minimalist designs, and require support for unaligned access on all loads and stores. It’s also close in spirit to an IR (Intermediate Representation) you’d expect to see early in the backend of a modern compiler: somewhat lower-level than LLVM IR, and comparable to early-stage LLVM Machine IR or GCC RTL.

This model requires us to make the distinction between values kept in registers and memory accesses explicit, and flattens down control flow to basic blocks connected by branches. But it’s still relatively easy to look at a small snippet of C++ and e.g. figure out how many arithmetic instructions it boils down to: just count the number of operations.

As a next step, we could now specify a virtual processor to go with our instruction set, but I don’t want to really get into that level of detail; instead of specifying the actual processor, I’ll work the same way actual architectures do: we require that the end result (eventual register and memory contents in our model) of running a program must be as if we had executed the instructions sequentially one by one (as-if rule). Beyond that, an aggressive implementation is free to cut corners as much as it wants provided it doesn’t get caught. We’ll assume we’re in an environment—the combination of compilers/tools and the processor itself—that uses pipelining and tries to extract instruction-level parallelism to achieve higher performance, in particular:

• Instructions can launch independent from each other, and take some number of clock cycles to complete. For an instruction to start executing, all the operands it depends on need to have been computed. As long as the dependencies are respected, all reorderings are valid.
• There is some limit W (“width”) on how many new instructions we can start per clock cycle. In-flight instructions don’t interfere with each other; as long as we have enough independent work, we can start W new instructions every cycle. We’re going to treat W as variable.
• Memory operations have a latency of 4 cycles, meaning that the result of a load is available 4 cycles after the load issued, and a load reading the bytes written by a prior store can issue 4 cycles after the store. That’s a fairly typical latency for a load that hits in the L1 cache, in case you were wondering.
• Branches (conditional or not) count as a single instruction, but their latency is variable. Unconditional branches or easily predicted branches such as the loop counter in along-running loop have an effective latency of 0 cycles, meaning the instructions being branched to can issue at the same time as the branch itself. Unpredictable branches have a nonzero cost that depends on how unpredictable they are—I won’t even try to be more precise here.
• Every other instruction has a latency of 1 clock cycle, meaning the result is available in the next cycle.

This model can be understood as approximating either a dataflow architecture, an out-of-order machine with a very large issue window (and infinitely fast front-end), or a statically scheduled in-order machine running code compiled with a Sufficiently Smart Scheduler. (The kind that actually exists; e.g. a compiler implementing software pipelining).

Furthermore, I’m assuming that while there is explicit control flow (unlike a pure dataflow machine), there is a branch prediction mechanism in place that allows the machine to guess the control flow path taken arbitrarily far in advance. When these guesses are correct, the branches are effectively free other than still taking an instruction slot, during which time the machine checks whether its prediction was correct. When the guess was incorrect, the machine reverts all computations that were down the incorrectly guessed path, and takes some number of clock cycles to recover. If this idea of branch prediction is new to you, I’ll refer you to Dan Luu’s excellent article on the subject, which explains both how and why computers would be doing this.

The end result of these model assumptions is that while control flow exists, it’s on the sidelines: its only observable effect is that it sometimes causes us to throw away a bunch of work and take a brief pause to recover when we guessed wrong. Dataflow, on the other hand—the dependencies between instructions, and how long it takes for these dependencies to be satisfied—is front and center.

### Dataflow graphs

Why this emphasis? Because dataflow and data dependencies is because they can be viewed as the fundamental expression of the structure of a particular computation, whether it’s done on a small sequential machine, a larger superscalar out-of-order CPU, a GPU, or in hardware (be it a hand-soldered digital circuit, a FPGA, or an ASIC). Dataflow and keeping track of the shape of data dependencies is an organizing principle of both the machines themselves and the compilers that target them.

And these dependencies are naturally expressed in graph form, with individual operations being the nodes and data dependencies denoted by directed edges. In this post, I’ll have dependent operations point towards the operations they depend on, with the directed edges labeled with their latency. To reduce clutter, I’ll only write latency numbers when they’re not 1.

With all that covered, and to see what the point of this all is, let’s start with a simple, short toy program that just sums the 64-bit integers in some array delineated by two pointers stored in rCurPtr (which starts pointing to the first element) and rEndPtr (which points to one past the last element), idiomatic C++ iterator-style.

loop:
rSum = rSum + rCurInt;            // Sum
rCurPtr = rCurPtr + 8;            // Advance
if rCurPtr != rEndPtr goto loop;  // Done?


We load a 64-bit integer from the current pointer, add it to our current running total in register rSum, increment the pointer by 8 bytes (since we grabbed a 64-bit integer), and then loop until we’re done. Now let’s say we run this program for a short 6 iterations and draw the corresponding dataflow graph (click to see full-size version):

Note I group nodes into ranks by which cycle they can execute in, at the earliest, assuming we can issue as many instructions in parallel as we want, purely constrained by the data dependencies. The “Load” and “Advance” from the first iteration can execute immediately; the “Done?” check from the first iteration looks at the updated rCurPtr, which is only known one cycle later; and “Sum” from the first iteration needs to wait for the load to finish, which means it can only start a full 4 cycles later.

As we can see, during the first four cycles, all we do is keep issuing more loads and advancing the pointer. It takes until cycle 4 for the results of the first load to become available, so we can actually do some summing. After that, one more load completes every cycle, allowing us to add one more integer to the running sum in turn. If we let this process continue for longer, all the middle iterations would look the way cycles 4 and 5 do: in our state state, we’re issuing a copy of all four instructions in the loop every cycle, but from different iterations.

There’s a few conclusions we can draw from this: first, we can see that this four-instruction loop achieves a steady-state throughput of one integer added to the sum in every clock cycle. We take a few cycles to get into the steady state, and then a few more cycles at the end to drain out the pipeline, but if we start in cycle 0 and keep running N iterations, then the final sum will be completed by cycle N+4. Second, even though I said that our model has infinite lookahead and is free to issue as many instructions per cycle as it wants, we “only” end up using at most 4 instructions per cycle. The limiter here ends up being the address increment (“Advance”); we increment the pointer after every load, per our cost model this increment takes a cycle of latency, and therefore the load in the next iteration of the loop (which wants to use the updated pointer) can start in the next cycle at the earliest.

This is a crucial point: the longest-latency instruction in this loop is definitely the load, at 4 cycles. But that’s not a limiting factor; we can schedule around the load and do the summing later. The actual problem here is with the pointer advance; every single instruction that comes after it in program order depends on it either directly or indirectly, and therefore, its 1 cycle of latency determines when the next loop iteration can start. We say it’s on the critical path. In loops specifically, we generally distinguish between intra-iteration dependencies (between instructions within the same iteration, say “Sum 0” depending on “Load 0”) and inter-iteration or loop-carried dependencies (say “Sum 1” depending on “Sum 0”, or “Load 1” depending on “Advance 0”). Intra-iteration dependencies may end up delaying instructions within that iteration quite a lot, but it’s inter-iteration dependencies that determine how soon we can start working on the next iteration of the loop, which is usually more important because it tends to open up more independent instructions to work on.

The good news is that W=4 is actually a fairly typical number of instructions decoded/retired per cycle in current (as of this writing in early 2018) out-of-order designs, and the instruction mixture here (1 load, 1 branch, 2 arithmetic instructions) is also one that is quite likely to be able to issue in parallel on a realistic 4-wide decode/retire design. While many machines can issue a lot more instructions than that in short bursts, a steady state of 4 instructions per cycle is definitely good. So even though we’re not making much of the infinite parallel computing power of our theoretical machine, in practical terms, we’re doing OK, although on real machines we might want to apply some more transforms to the loop; see below.

Because these real-world machines can’t start an arbitrary number of instructions at the same time, we have another concern: throughput. Say we’re running the same loop on a processor that has W=2, i.e. only two instructions can start every cycle. Because our loop has 4 instructions, that means that we can’t possibly start a new loop iteration more often than once every two clock cycles, and the limiter aren’t the data dependencies, but the number of instructions our imaginary processor can execute in a clock cycle; we’re throughput-bound. We would also be throughput-bound on a machine with W=3, with a steady state of 3 new instructions issued per clock cycle, where we can start working on a new iteration every 4/3≈1.33 cycles.

### A different example

For the next example, we’re going to look at what’s turned into everyone’s favorite punching-bag of a data structure, the linked list. Let’s do the exact same task as before, only this time, the integers are stored in a singly-linked list instead of laid out as an array. We store first a 64-bit integer and then a 64-bit pointer to the next element, with the end of the list denoted by a special value stored in rEndPtr as before. We also assume the list has at least 1 element. The corresponding program looks like this:

loop:
rSum = rSum + rCurInt;            // Sum
if rCurPtr != rEndPtr goto loop;  // Done?


Very similar to before, only this time, instead of incrementing the pointer, we do another load to grab the “next” pointer. And here’s what happens to the dataflow graph if we make this one-line change:

Switching from a contiguous array to a linked list means that we have to wait for the load to finish before we can start the next iteration. Because loads have a latency of 4 cycles in our model, that means we can’t start a new iteration any more often than once every 4 cycles. With our 4-instruction loop, we don’t even need any instruction-level parallelism to reach that target; we might as well just execute one instruction per cycle and still hit the same overall throughput.

Now, this example, with its short 4-instruction loop, is fairly extreme; if our loop had say a total of 12 instructions that worked out nicely, the same figure might well end up averaging 3 instructions per clock cycle, and that’s not so bad. But the underlying problem here is a nasty one: because our longest-latency instruction is on the critical path between iterations, it ends up determining the overall loop throughput.

In our model, we’re still primarily focused on compute-bound code, and memory access is very simple: there’s no memory hierarchy with different cache levels, all memory accesses take the same time. If we instead had a more realistic model, we would also have to deal with the fact that some memory accesses take a whole lot longer than 4 cycles to complete. For example, suppose we have three cache levels and, at the bottom, DRAM. Sticking with the powers-of-4 theme, let’s say that a L1 cache hit takes 4 cycles (i.e. our current memory access latency), a L2 hit takes 16 cycles, a L3 hit takes 64 cycles, and an actual memory access takes 256 cycles—for what it’s worth, all these numbers are roughly in the right ballpark for high-frequency desktop CPUs under medium memory subsystem load as of this writing.

Finding work to keep the machine otherwise occupied for the next 4 cycles (L1 hit) is usually not that big a deal, unless we have a very short loop with unfavorable dependency structure, as in the above example. Fully covering the 16 cycles for a L1 miss but L2 hit is a bit trickier and requires a larger out-of-order window, but current out-of-order CPUs have those, and as long as there’s enough other independent work and not too many hard-to-predict branches along the way, things will work out okay. With a L3 cache hit, we’ll generally be hard-pressed to find enough independent work to keep the core usefully busy during the wait for the result, and if we actually miss all the way to DRAM, then in our current model, the machine is all but guaranteed to stall; that is, to have many cycles with no instructions executed at all, just like the gaps in the diagram above.

Because linked lists have this nasty habit of putting memory access latencies on the critical path, they have a reputation of being slow “because they’re bad for the cache”. Now while it’s definitely true that most CPUs with a cache would much rather have you iterate sequentially over an array, we have to be careful how we think about it. To elaborate, suppose we have yet another sum kernel, this time processing an array of pointers to integers, to compute the sum of the pointed-to values.

loop:
rSum = rSum + rCurInt;             // Sum
rCurPtr = rCurPtr + 8;             // Advance
if rCurPtr != rEndPtr goto loop;   // Done?


And this time, I’ll prune the dataflow graph to show only the current iteration and its direct dependency relationships with earlier and later iterations, because otherwise these more complicated graphs will get cluttered and unreadable quickly:

A quick look over that graph shows us that copies of the same instruction from different iterations are all spaced 1 cycle apart; this means that in the steady state, we will again execute one iteration of the loop per clock cycle, this time issuing 5 instructions instead of 4 (because there are 5 instructions in the loop). Just like in the linked list case, the pointer indirection here allows us to jump all over memory (potentially incurring cache misses along the way) if we want to, but there’s a crucial difference: in this setup, we can keep setting up future iterations of the loop and get more loads started while we’re waiting for the first memory access to complete.

To explain what I mean, let’s pretend that every single of the “LoadInt”s misses the L1 cache, but hits in the L2 cache, so its actual latency is 16 cycles, not 4. But a latency of 16 cycles just means that it takes 16 cycles between issuing the load and getting the result; we can keep issuing other loads for the entire time. So the only thing that ends up happening is that the “Sum k” in the graph above happens 12 cycles later. We still start two new loads every clock cycle in the steady state; some of them end up taking longer, but that does not keep us from starting work on a new iteration of the loop in every cycle.

Both the linked list and the indirect-sum examples have the opportunity to skip all over memory if they want to; but in the linked-list case, we need to wait for the result of the previous load until we can get started on the next one, whereas in the indirect-sum case, we get to overlap the wait times from the different iterations nicely. As a result, in the indirect-sum case, the extra latency towards reaching the final sum is essentially determined by the worst single iteration we had, whereas in the linked-list case, every single cache miss makes our final result later (and costs us throughput).

The fundamental issue isn’t that the linked-list traversal might end up missing the cache a lot; while this isn’t ideal (and might cost us in other ways), the far more serious issue is that any such cache miss prevents us from making progress elsewhere. Having a lot of cache misses isn’t necessarily a problem if we get to overlap them; having long stretches of time were we can’t do anything else, because everything else we could do depends on that one cache-missing load, is.

In fact, when we hit this kind of problem, our best bet is to just switch to doing something else entirely. This is what CPUs with simultaneous multithreading/hardware threads (“hyperthreads”) and essentially all GPUs do: build the machine so that it can process instructions from multiple instruction streams (threads), and then if one of the threads isn’t really making progress right now because it’s waiting for something, just work on something else for a while. If we have enough threads, then we can hopefully fill those gaps and always have something useful to work on. This trade-off is worthwhile if we have many threads and aren’t really worried about the extra latency caused by time-slicing, which is why this approach is especially popular in throughput-centric architectures that don’t worry about slight latency increases.

### Unrolling

But let’s get back to our original integer sum code for a second:

loop:
rSum = rSum + rCurInt;            // Sum
rCurPtr = rCurPtr + 8;            // Advance
if rCurPtr != rEndPtr goto loop;  // Done?


We have a kernel with four instructions here. Out of these four, two (“Load” and “Sum”) do the actual work we want done, whereas “Advance” and “Done?” just implement the loop itself and are essentially overhead. This type of loop is a prime target for unrolling, where we collapse two or more iterations of the loop into one to decrease the overhead fraction. Let’s not worry about the setup or what to do when the number of elements in the array is odd right now, and only focus on the “meat” of the loop. Then a 2× unrolled version might look like this:

loop:
rSum = rSum + rCurInt;            // SumEven
rSum = rSum + rCurInt;            // SumOdd
rCurPtr = rCurPtr + 16;           // Advance
if rCurPtr != rEndPtr goto loop;  // Done?


which has this dataflow graph:

Note that even though I’m writing to rCurInt twice in an iteration, which constitutes a write-after-write (WAW) or “output dependency”, there’s no actual dataflow between the loads and sums for the first and second version of rCurInt, so the loads can issue in parallel just fine.

This isn’t bad: we now have two loads every iteration and spend 6N instructions to sum 2N integers, meaning we take 3 instructions per integer summed, whereas our original kernel took 4. That’s an improvement, and (among other things) means that while our original integer-summing loop needed a machine that sustained 4 instructions per clock cycle to hit full throughput, we can now hit the same throuhgput on a smaller machine that only does 3 instructions per clock. This is definitely progress.

However, there’s a problem: if we look at the diagram, we can see that we can indeed start a new pair of loads every clock cycle, but there’s a problem with the summing: we have two dependent adds in our loop, and as we can see from the relationship between “SumEven k” and “SumEven k+1”, the actual summing part of the computation still takes 2 cycles per iteration. On our idealized dataflow machine with infinite lookahead, that just means that all the loads will get front-loaded, and then the adds computing the final sum proceed at their own pace; the result will eventually be available, but it will still take a bit more than 2N cycles, no faster than we were in the original version of the loop. On a more realistic machine (which can only look ahead by a limited number of instructions), we would eventually stop being able to start new loop iterations until some of the old loop iterations have completed. No matter how we slice it, we’ve gone from adding one integer to the sum per cycle to adding two integers to the sum every two cycles. We might take fewer instructions to do so, which is a nice consolation prize, but this is not what we wanted!

What’s happened is that unrolling shifted the critical path. Before, the critical path between iterations went through the pointer advance (or, to be more precise, there were two critical paths, one through the pointer advance and one through the sum, and they were both the same length). Now that we do half the number of advances per item, that isn’t a problem anymore; but the fact that we’re summing these integers sequentially is now the limiter.

A working solution is to change the algorithm slightly: instead of keeping a single sum of all integers, we keep two separate sums. One for the integers at even-numbered array positions, and one for the integers at odd-numberd positions. Then we need to sum those two values at the end. This is the algorithm:

loop:
rSumEven = rSumEven + rCurInt;    // SumEven
rSumOdd = rSumOdd + rCurInt;      // SumOdd
rCurPtr = rCurPtr + 16;           // Advance
if rCurPtr != rEndPtr goto loop;  // Done?

rSum = rSumEven + rSumOdd;        // FinalSum


And the dataflow graph for the loop kernel looks as follows:

Where before all the summing was in what’s called the same dependency chain (the name should be self-explanatory by now, I hope), we have now split the summation into two dependency chains. And this is enough to make a sufficiently-wide machine that can sustain 6 instructions per cycle complete our integer-summing task in just slightly more than half a cycle per integer being summed. Progress!

On a somewhat narrower 4-wide design, we are now throughput-bound, and take around 6/4=1.5 cycles per two integers summed, or 0.75 cycles per integer. That’s still a good improvement from the 1 cycle per integer we would have gotten on the same machine from the non-unrolled version; this gain is purely from reduction the loop overhead fraction, and further unrolling could reduce it even further. (That said, unless your loop really is as tiny as our example, you don’t generally want to go overboard with unrolling.)

### Tying it all together

In the introduction, I talked about the need for a model detailed enough to make quantitative, not just qualitative, predictions; and at least for very simple compute-bound loops, that is exactly what we have now. At this point, you should know enough to look at the dependency structure of simple loops, and have some idea for how much (or how little) latent parallelism there is, and be able to compute a coarse upper bound on their “speed of light” on various machines with different peak instructions/cycle rates.

Of course, there are many simplifications here, most of which have been already noted in the text; we’re mostly ignoring the effects of the memory hierarchy, we’re not worrying at all about where the decoded instructions come from and how fast they can possibly be delivered, we’ve been flat-out assuming that our branch prediction oracle is perfect, and we’ve been pretending that while there may be a limit on the total number of instructions we can issue per cycle, it doesn’t matter what these instructions are. None of these are true. And even if we’re still compute-bound, we need to worry at least about that latter constraint: sometimes it can make a noticeable difference to tweak the “instruction mix” so it matches better what the hardware can actually do in a given clock cycle.

But all these caveats aside, the basic concepts introduced here are very general, and even just sketching out the dependency graph of a loop like this and seeing it in front of you should give you useful ideas about what potential problems are and how you might address them. If you’re interested in performance optimization, it is definitely worth your time practicing this so you can look at loops and get a “feel” for how they execute, and how the shape of your algorithm (or your data structures, in the linked list case) aids or constrains the compiler and processor.

UPDATE: Some additional clarifications in answer to some questions: paraphrasing one, “if you have to first write C code, translate it to some pseudo-assembly, and then look at the graph, how can this possibly be a better process than just measuring the code in the first place?” Well, the trick here is that to measure anything, you actually need a working program. You don’t to draw a dataflow graph. For example, a common scenario is that there are many ways you could structure some task, and they all want their data structured differently. Actually implementing and testing multiple variants like this requires you to write a lot of plumbing to massage data from one format into another (all of which can be buggy). Drawing a graph can be done from a brief description of the inner loop alone, and you can leave out the parts that you don’t currently care about, or “dummy them out” by replacing them with a coarse approximation (“random work here, maybe 10 cycles latency?”). You only need to make these things precise when they become close to the critical path (or you’re throughput-bound).