Skip to content

Rate-distortion optimization

“Rate-distortion optimization” is a term in lossy compression that sounds way more complicated and advanced than it actually is. There’s an excellent chance that by the end of this post you’ll go “wait, that’s it?”. If so, great! My goal here is just to explain the basic idea, show how it plays out in practice, and maybe get you thinking about other applications. So let’s get started!

What does “rate-distortion optimization” actually mean?

The mission statement for lossless data compression is pretty clear. A lossless compressor transforms one blob of data into another one, ideally (but not always) smaller, in a way that is reversible: there’s another transform (the decoder) that, when applied to the second blob, will return an exact copy of the original data.

Lossy compression is another matter. The output of the decoder is usually at least slightly different from the original data, and sometimes very different; and generally, it’s almost always possible to take an existing lossily-compressed piece of data, say a short video or MP3 file, make a slightly smaller copy by literally deleting a hundred bytes from the middle of the file with a hex editor, and get another version of the original video (or audio) file that is still clearly recognizable yet degraded. Seriously, if you’ve never done this before, try it, especially with MP3s: it’s quite hard to mangle a MP3 file in a way that will make it not play back anymore, because MP3s have no required file-level headers, tolerate arbitrary junk in the middle of the bitstream, and are self-synchronizing.

With lossless compression, it makes sense to ask “what is the smallest I can make this file?”. With lossy compression, less so; you can generally get files far smaller than you would ever want to use, because the data is degraded so much it’s barely recognizable (if that). Minimizing file size alone isn’t interesting; we want to minimize size while simultaneously maximizing the quality of the result. The way we do this is by coming up with some error metric that measures the distance between the original image and the result the decoder will actually produce given a candidate bitstream. Now our bitstream has two associated values: its length in bits, the (bit) rate, usually called R in formulas, and a measure of how much error was introduced as a result of the lossy coding process, the distortion, or D in formulas. R is almost always measured in bits or bytes; D can be in one of many units, depending on what type of error metric is used.

Rate-distortion optimization (RDO for short) then means that an encoder considers several possible candidate bit streams, evaluates their rate and distortion, and tries to make an optimal choice; if possible, we’d like it to be globally optimal (i.e. returning a true best-possible solution), but at the very least optimal among the candidates that were considered. Sounds great, but what does “better” mean here? Given a pair (R_1,D_1) and another pair (R_2,D_2), how do we tell which one is better?

The pareto frontier

Suppose what we have 8 possible candidate solutions we are considering, each with their own rate and distortion scores. If we prepare a scatter plot of rate on the x-axis versus distortion on the y-axis, we might get something like this:

Rate vs. Distortion scatterplot

The thin, light-gray candidates aren’t very interesting, because what they have in common is that there is at least one other candidate solution that is strictly better than them in every way. That is, some other candidate point has the same (or lower) rate, and also the same (or lower) distortion as the grayed-out points. There’s just no reason to ever pick any of those points, based on the criteria we’re considering, anyway. In the scatterplot, this means that there is at least one other point that is both to the left (lower rate) and below (lower distortion).

For any of the points in the diagram, imagine a horizontal and a vertical line going through it, partitioning the plane into four quadrants. Any point that ends up in the lower-left quadrant (lower rate and lower distortion) is clearly superior. Likewise, any point in the upper-right quadrant (higher rate and higher distortion) is clearly inferior. The situation with the other two quadrants isn’t as clear.

Which brings us to the other four points: the three fat black points, and the red point. These are all points that have no other points to the left and below them, meaning they are pareto efficient. This means that, unlike the situation with the light gray points, we can’t pick another candidate that improves one of the metrics without making the other worse. The set of points that are pareto efficient constitutes the pareto frontier.

These points are not all the same, though. The three fat black points are not just pareto efficient, but are also on the convex hull of the point set (the lower left contour of the convex hull here is drawn using blue lines), whereas the red point is not. The points that are both pareto and on the convex hull of the pareto frontier are particularly important, and can be characterized in a different way.

Namely, imagine taking a straightedge, angling it so that it’s either horizontal, “descending” (with reference to our graph) from left to right, or vertical, and then sliding it slowly “upwards” from the origin without changing its orientation until it hits one of the points in our set. It will look something like this:

Rate vs. Distortion scatterplot with lines

The shading here is meant to suggest the motion of the green line; we keep sliding it up until it eventually catches on the middle of our three fat black points. If we change the angle of our line to something else, we can manage to hit the other two black points, but not the red one. This has nothing to do with this particular problem and is a general property of convex sets: any vertices of the set must be extremal in some direction.

Getting a bit more precise here, the various copies of the green line I’ve drawn correspond to lines

w_1 R + w_2 D = \mathrm{const.}

and the constraints I gave on the orientation of the line boil down to w_1, w_2 \ge 0 (with at least one being nonzero). Sliding our straightedge until we hit a point corresponds to the minimization problem

\min_i w_1 R_i + w_2 D_i

for a given choice of w1 and w2, and the three black points are the three possible solutions we might get for our set of candidate points. So we’ve switched from purely minimizing rate or purely minimizing distortion (both of which, in general, tend to give somewhat pathological results) towards minimizing some linear combination of the two with non-negative weights; and doing so with various choices of the weights w1 and w2 will allow us to sweep out the lower left convex hull of the pareto frontier, which is often the “interesting” part (although, as the red point in our example illustrates, this process excludes some of the points on the pareto frontier).

This does not seem particularly impressive so far: we don’t want to purely minimize one quantity or the other, so instead we’re minimizing a linear combination of the two. That seems it would’ve been the obvious next thing to try. But looking at the characterization above does at least give us some idea on what looking at these linear combinations ends up doing, and exactly what we end up giving up (namely, the pareto points not on the convex hull). And there’s another massively important aspect to consider here.

The Lagrangian connection

If we take our linear combination above and divide through by w1 or w2 (assuming they are non-zero; dividing our objective by a scalar constant does not change the results of the optimization problem), respectively, we get:

L_1 = R + \frac{w_2}{w_1} D =: R + \lambda_1 D
L_2 = D + \frac{w_1}{w_2} R =: D + \lambda_2 R

which are essentially the Lagrangians we would get for continuous optimization problems of the form “minimize R subject to D=const.” (L1) and “minimize D subject to R=const.” (L2); that is, if we were in a continuously differentiable setting (for data compression we usually aren’t), trying to solve the problems of either minimizing bit rate while hitting a set distortion target or minimizing distortion within a given bit rate limit woud lead us to study the same type of expression. Generalizing one step further, allowing not just equality but also inequality constraints (i.e. rate or distortion at most a certain value, rather then requiring exact match) still leads to essentially the same functions, this time by way of the KKT conditions.

In this post, I intentionally went “backwards” by starting with the Lagrangian-esque expressions and then mentioning the connection to continuous optimization problems because I want to emphasize that this type of linear combination of the different metrics arises naturally, even in a fully discrete setting; starting out with Lagrange or KKT multipliers would get us into technical difficulties with discrete decisions that do not necessary admit continuously differentiable objectives or constraint functions. Since the whole process makes sense even without explicitly mentioning Lagrange multipliers at all, that seemed like the most natural way to handle it.

What this means in practice

I hopefully now have you convinced that looking at the minima of the linear combination

w_1 R + w_2 D

is a sensible thing to do, and both our direct derivation and the two Lagrange multiplier formulations for continuous problems I mentioned lead us towards it. Neither our direct derivation nor the Lagrange multiplier version tells us what to set our mystery weight parameters to, however. In fact, the Lagrange multiplier method flat-out tells you that for every instance of your optimization problem, there exist the right values for the Lagrange multipliers that correspond to an optimum of the problem you’re interested in, but it doesn’t tell you how to get them.

However, what’s nice is that the same thing also works in reverse, as we saw earlier with our line-sweeping procedure: picking the angle of the line we’re sliding around corresponds to picking a Lagrange multiplier. No matter which one we pick, we are going to end up finding an optimal point for some trade-off, namely one that is pareto; it just might not be the exact trade-off we wanted.

For example, suppose we decide that a single-variable parameterization like in the Lagrange multiplier scenario is convenient, and we pick something like L1, namely w1 fixed at 1, w2 = λ. We were assuming from the beginning that we have some method of generating candidate solutions; all that’s left to do is to rank them and pick a winner. Let’s start with λ=1, which leads to a solution with some (R,D) pair that minimizes R + D. Note it’s often useful to think of these quantities with units attached; we typically measure R in bits and [D] is whatever it is, so the unit of λ must be [λ] = bits/[D], i.e. λ is effectively an exchange rate that tells us how many units of distortion are worth as much as a single bit. We can then look at the R and D values of the solution we got back. If say we’re trying to hit a certain bit rate, then if R is close to that target, we might be happy and just stop. If R is way above the target bit rate, we overshot, and clearly need to penalize high distortions less; we might try another round with λ=0.1 next. Conversely, if say R is a few percent below the target rate, we might try another round with slightly higher lambda, say λ=1.02, trying to penalize distortion a bit more so we spend more bits to reduce it, and see if that gets us even closer.

With this kind of process, even knowing nothing else about the problem, you can systematically explore the options along the pareto frontier and try to find a better fit. What’s nice is that finding the minimum for a given choice of parameters (λ in our case) doesn’t require us to store all candidate options considered and rank them later; storing all candidates is not a big deal in our original example, where we were only trying to decide between a handful of options, but in practice you often end up with huge search spaces (exponential in the problem size is not unheard of), and being able to bake it down to a single linear function is convenient in other ways; for example, it tends to work well with efficient dynamic programming solutions to problems that would be infeasible to handle with brute force.

Wait, that’s it?

Yes, pretty much. Instead of trying to purely minimize bit rate or distortion, you use a combination R + \lambda D and vary λ to taste to hit your targets. Often, you know enough about your problem space to have a pretty good idea of what values λ should have; for example, in video compression, it’s pretty common to tie the λ used when coding residuals to quantizer step size, based on the (known) behavior of the quantizer and the (expected) characteristics of real-world signals. But even when you don’t know anything about your data, you can always use a search process for λ as outlined above (which is, of course, slower).

Now the one thing to note in practice is that you rarely use just a single distortion metric; for example, in video coding, it’s pretty common to use one distortion metric when quantizing residuals, a different one for motion search, and a third for overall block mode decision. In general, the more frequently something is done (or the bigger the search space is), the more willing codecs are to make compromises with their distortion measure to enable more efficient searches. In general, good results require both decent metrics and doing a good job exploring the search space, and accepting some defects in the metrics in exchange for a significant increase in search space covered in the same amount of time is often a good deal.

But the basic process is just this: measure bit rate and distortion (in your unit of choice) for every option you’re considering, and then rank your options based on their combined R + \lambda D (or D + \lambda' R, which is a different but equivalent parameterization) scores. This gives you points on the lower left convex hull of the pareto frontier, which is what you want.

This applies in other settings as well. For example, the various lossless compressors in Oodle are, well, lossless, but they still have a bunch of decisions to make, some of which take more time in the decoder than others. For a lossless codec, measuring “distortion” doesn’t make any sense (it’s always 0), but measuring decode time does; so the Oodle encoders optimize for a trade-off between compressed size and decode time.

Of course, you can have more parameters too; for example, you might want to combine these two ideas and do joint optimization over bit rate, distortion, and decode time, leading to an expression of the type R + \lambda D + \mu T with two Lagrange multipliers, with λ as before, and a second multiplier μ that encodes the exchange rate from time units into bits.

Either way, the details of this quickly get complicated, but the basic idea is really quite simple. I hope this post de-mystifies it a bit.

Reading bits in far too many ways (part 3)

(Continued from part 2. “A whirlwind introduction to dataflow graphs” is required reading.)

Last time, we saw a whole bunch of different bit reader implementations. This time, I’ll continue with a few more variants, implementation considerations on pipelined superscalar CPUs, and some ways to use the various degrees of freedom to our advantage.

Dependency structure of bit readers

To get a better feel for where the bottlenecks in bit decoding are, let me restate some of the bit reading approaches we’ve covered in the previous parts again in our pseudo-assembly language, and then we can have a look at the corresponding dependency graphs.

Let’s start with variant 3 from last time, but I’ll do a LSB-first version this time:

refill3_lsb:
    rBytesConsumed = lsr(rBitPos, 3);
    rBitPtr = rBitPtr + rBytesConsumed;
    rBitBuf = load64LE(rBitPtr);
    rBitPos = rBitPos & 7;

peekbits3_lsb(count):
    rBits = lsr(rBitBuf, rBitPos);
    rBitMask = lsl(1, count);
    rBitMask = rBitMask - 1;
    rBits = rBits & rBitMask; // result

consume3_lsb(count):
    rBitPos = rBitPos + count;

Note that if count is a compile-time constant, the computation for rBitMask can be entirely constant-folded. Peeking ahead by a constant, fixed number of bits then working out from the result how many bits to actually consume is quite common in practice, so that’s what we’ll do. If we do a refill followed by two peek/consume cycles with the consume count being determined from the read bits “somehow”, followed by another refill (for the next loop iteration), the resulting pseudo-asm is like this:

    // Initial refill
    rBytesConsumed = lsr(rBitPos, 3);   // Consumed 0
    rBitPtr = rBitPtr + rBytesConsumed; // Advance 0
    rBitBuf = load64LE(rBitPtr);        // Load 0
    rBitPos = rBitPos & 7;              // LeftoverBits 0

    // First decode (peek count==19)
    rBits = lsr(rBitBuf, rBitPos);      // BitsRemaining 0
    rBits = rBits & 0x7ffff;            // BitsMasked 0
    rCount = determineCount(rBits);     // DetermineCount 0
    rBitPos = rBitPos + rCount;         // PosInc 0
    
    // Second decode
    rBits = lsr(rBitBuf, rBitPos);      // BitsRemaining 1
    rBits = rBits & 0x7ffff;            // BitsMasked 1
    rCount = determineCount(rBits);     // DetermineCount 1
    rBitPos = rBitPos + rCount;         // PosInc 1

    // Second refill
    rBytesConsumed = lsr(rBitPos, 3);   // Consumed 1
    rBitPtr = rBitPtr + rBytesConsumed; // Advance 1
    rBitBuf = load64LE(rBitPtr);        // Load 1
    rBitPos = rBitPos & 7;              // LeftoverBits 1

And the dependency graph looks dishearteningly long and skinny:

Dependency graph for bit reading, variant 3

Ouch. That’s averaging less than one instruction per cycle, and it’s all in one big, serial dependency chain. Not depicted in this graph but also worth noting is that the 4-cycle latency edge from “Load” to “BitsRemaining” is a recurring delay that will occur on every refill, because the computation of the updated rBitPtr depends on the decode prior to the refill having been completed. Now this is not a full decoder, since I’m showing only the parts to do with the bitstream IO (presumably a real decoder also contains code to, you know, actually decode the bits and store the result somewhere), but it’s still somewhat disappointing. Note that the DetermineCount step is a placeholder: if the count is known in advance, for example because we’re reading a fixed-length field, you can ignore it completely. The single cycle depicted in the graph is OK for very simple cases; more complicated cases will often need multiple cycles here, for example because they perform a table lookup to figure out the count. Either way, even with our optimistic single-cycle single-operation DetermineCount step, the critical path through this graph is pretty long, and there’s very little latent parallelism in it.

Does variant 4 fare any better? The primitives look like this in pseudo-ASM:

refill4_lsb:
    rNext = load64LE(rBitPtr);
    rNextSh = lsl(rNext, rBitCount);
    rBitBuf = rBitBuf | rNextSh;

    // Most instruction sets don't have a subtract-from-immediate
    // but do have xor-by-immediate, so this is an advantageous
    // way to write 63 - rBitCount. (This works since we know that
    // rBitCount is in [0,63]).
    rBitsAdvance = rBitCount ^ 63;
    rBytesAdvance = lsr(rBitsAdvance, 3);
    rBitPtr = rBitPtr + rBytesAdvance;

    rBitCount = rBitCount | 56;

peekbits4_lsb(count):
    rBitMask = lsl(1, count);
    rBitMask = rBitMask - 1;
    rBits = rBitBuf & rBitMask; // result

consume4_lsb(count):
    rBitBuf = lsr(rBitBuf, count);
    rBitCount = rBitCount - count;

the pseudo-code for our “refill, do two peek/consume cycles, then refill again” scenario looks like this:

    // Initial refill
    rNext = load64LE(rBitPtr);          // LoadNext 0
    rNextSh = lsl(rNext, rBitCount);    // NextShift 0
    rBitBuf = rBitBuf | rNextSh;        // BitInsert 0
    rBitsAdv = rBitCount ^ 63;          // AdvanceBits 0
    rBytesAdv = lsr(rBitsAdv, 3);       // AdvanceBytes 0
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 0
    rBitCount = rBitCount | 56;         // RefillCount 0

    // First decode (peek count==19)
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 0
    rCount = determineCount(rBits);     // DetermineCount 0
    rBitBuf = lsr(rBitBuf, rCount);     // ConsumeShift 0
    rBitCount = rBitCount - rCount;     // ConsumeSub 0

    // Second decode
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 1
    rCount = determineCount(rBits);     // DetermineCount 1
    rBitBuf = lsr(rBitBuf, rCount);     // ConsumeShift 1
    rBitCount = rBitCount - rCount;     // ConsumeSub 1

    // Second refill
    rNext = load64LE(rBitPtr);          // LoadNext 1
    rNextSh = lsl(rNext, rBitCount);    // NextShift 1
    rBitBuf = rBitBuf | rNextSh;        // BitInsert 1
    rBitsAdv = rBitCount ^ 63;          // AdvanceBits 1
    rBytesAdv = lsr(rBitsAdv, 3);       // AdvanceBytes 1
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 1
    rBitCount = rBitCount | 56;         // RefillCount 1

with this dependency graph:

Dependency graph for bit reading, variant 4

That’s a bunch of differences, and you might want to look at variant 3 and 4 in different windows side-by-side. The variant 4 refill does take 3 extra instructions, but we can immediately see that we get more latent instruction-level parallelism (ILP) in return:

  1. The variant 4 refill splits into three dependency chains, not two.
  2. The LoadNext for the second refill can start immediately after the AdvancePtr for the first refill, moving the load off the critical path for the second and subsequent iterations. Variant 3 has a 6-cycle latency from the determination of the final rBitPos in the first iteration to a refilled rBitBuf; in variant 4, that latency shrinks to 2 cycles (one shift and an OR). In other words, while the refill takes more instructions, most of them are off the critical path.
  3. The consume step in variant 4 has two parallel computations; in variant 3, the rBitPos update is critical and feeds into the shift in the next “peek” operation. Variant 4 has a single shift (to consume bits) on the critical path to the next peek; as a result, the latency between two subsequent decodes is one cycle less in variant 4: 3 cycles instead of 4.

In short, this version trades a slight increase in refill complexity for a noticeable latency reduction of several key steps, provided it’s running on a superscalar CPU. That’s definitely nice. On the other hand, the key decode steps are still very linear. We’re limited by the latency of a long chain of serial computations, which is a bad place to be: if possible, it’s generally preferable to be limited by throughput (how many instructions we can execute), not latency (how fast we can complete them). Especially so if most of the latency in question comes from integer instructions that already have a single cycle of latency. Over the past 30 years, the number of executions units and instructions per cycle in mainstream CPU parts have steadily, if slowly, increased. But if we want to see any benefit from this, we need to write code that has a use for these extra execution resources.

Multiple streams

As is often the case, the best solution to this problem is the straightforward one: if decoding from a single bitstream is too serial, then why not decode from multiple bitstreams at once? And indeed, this is much better; there’s not much point to showing a graph here, since it’s literally just two copies of a single-stream graph next to each other. Even with a very serial decoder like variant 3 above, you can come a lot closer to filling up a wide out-of-order machine as long as you use enough streams. To a first-order approximation, using N streams will also give you N times the latent ILP—and given how serial a lot of the direct decoders are, this will translate into a substantial (usually not quite N-times, but still very noticeable) speed-up in the decoder on wide-enough processors. So what’s the catch? There are several:

  1. Using multiple streams is a change to the bitstream format, not just an implementation detail. In particular, in any long-term storage format, any change in the number of bitstreams is effectively a change in the protocol or file format.
  2. You need to define how to turn the multiple streams into a single output bytestream. This can be simple concatenation along with a header, it can be some form of interleaving or a sophisticated framing format, but no matter what it ends up being, it’s an increase in complexity (and usually also in storage overhead) relative to producing a single bitstream that contains everything in the order it’s read.
  3. For anything with short packets and low latency requirements (e.g. game packets or voice chat), you either have to interleave streams fairly finely-grained (increasing size overhead), or suffer latency increases.
  4. Decoding from N streams in parallel increases the amount of internal state in the decoder. In the decoder variants shown above, a N-wide variant needs N copies of rBitBuf, rBitPos/rBitCount and rBitPtr, at the very least, plus several temporary registers. For N=2 this is usually not a big deal, but for large counts you will start to run out of registers at least on some targets. There’s relatively little work being done on any given individual data item; if values get spilled from registers, the resulting loads and stores tend to have a very noticeable cost and will easily negate the benefit from using more streams.

In short, it’s not a panacea, but one of the usual engineering trade-offs. So how many streams should you use? It depends. At this point, for anything that is even remotely performance-sensitive, I would recommend trying at least N=2 streams. Even if your decoder has a lot of other stuff going on (computations with the decoded values etc.), bitstream decoding tends to be serial enough that there’s many wasted cycles otherwise, even on something relatively narrow like a dual-issue in-order machine. Having two streams adds a relatively small amount of overhead to the bitstream format (to signal the start of the data for stream 2 in every coding unit, or something equivalent), needs a modest amount of extra state for the second bit decoder, and tends to result in sizeable wins on pretty much any current CPU.

Using more than 2 streams can be a significant win in tight loops that do nothing but bitstream decoding, but is overkill in most other cases. Before you commit to a specific (high) number, you ideally want to try implementations on at least a few different target devices; a good number on one device may be past a big performance cliff on another, and having that kind of thing enshrined in a protocol or file format is unfortunate.

Aside: SIMD? GPU?

If you use many streams, can you use SIMD instructions, or offload work to a GPU? Yes, you can, but the trade-offs get a bit icky here.

Vectorizing the simple decoders outlined above directly is, generally speaking, not great. There’s not a lot of computation going on per iteration, and operations such as refills end up using gathers, which tend to have a high associated overhead. To hide this overhead, and the associated latencies, you generally still need to be running multiple instances of your SIMD decoder in parallel, so your total number of streams ends up being the number of SIMD lanes times two (or more, if you need more instances). Having a high number of streams may be OK if all your targets have good wide SIMD support, but can be a real pain if you need to decode on at least one that doesn’t.

The same thing goes for GPUs, but even more so. With single warps/wavefronts of usually 16-64 invocations, we’re talking many streams just to not be running a kernel at quarter utilization, and we generally need to dispatch multiple warps worth of work to hide memory access latency. Between these two factors, it’s easy to end up needing well over 100 parallel streams just to not be stalled most of the time. At that scale, the extra overhead for signaling individual stream boundaries is definitely not negligible anymore, and the magic numbers are different between different GPU vendors; striking a useful compromise between the needs of different GPUs while also retaining the ability to decode on a CPU if no suitable GPU is available starts to get quite tricky.

There are techniques to at least make the memory access patterns and interleaving overhead somewhat more palatable (I wrote about this elsewhere), but this is an area of ongoing research, and so far there’s no silver bullet I could point at and just say “do this”. This is definitely a challenge going forward.

Tricks with multiple streams

If you’re using multiple streams, you need to decide how these multiple streams get assembled into the final output bitstream. If you don’t have any particular reason to go with a fine-grained interleaving, the easiest and most straightforward option is to concatenate the sub-streams, with a header telling you how long the individual pieces are, here pictured for 3 streams:

Three sub-streams, linear layout

Also pictured are the initial stream bit pointers before reading anything (pointers in a C-like or assembly-like setting; if you’re using something higher-level, probably indices into a byte slice). The beginning of stream 0 is implicit—right after the end of the header—and the end of the final stream is often supplied by an outer framing layer, but the initial positions of bitptr1 and bitptr2 need to be signaled in the bytestream somehow, usually by encoding the length of streams 0 and 1 in the header.

One thing I haven’t mentioned so far are bounds checks. Compressed data is normally untrusted since all the channels you might get that data from tend to be prone to either accidental (error during storage or in transit) or intentional (malicious attacker trying to craft harmful data) corruption, so careful input validation is not optional. What this usually boils down to in practice is that every load from the bitstream needs to be guarded by a range check that guarantees it’s in bounds. The overhead of this can be reduced in various ways. For example, one popular method is to unroll loops a few times and check at the top that there are enough bytes left for worst-case number of bytes consumed in the number of unrolled iterations, then only dropping to a careful loop that checks every single byte access at the very end of the stream. I’ve written about another useful technique before.

But why am I mentioning this here? Because it turns out that with multiple streams laid out sequentially, the overhead of bounds checking can be reduced. A direct range check for 3 streams that checks whether there are at least K bytes left would look like this:

// This needs to happen before we do any loads:
// If any of the streams are close to exhausted
// (fewer than K bytes left), drop to careful loop
if (bitend0 - bitptr0 < K ||
    bitend1 - bitptr1 < K ||
    bitend2 - bitptr2 < K)
    break;

But when the three streams are sequential, we can use a simpler expression. First, we don’t actually need to worry about reading past the end of stream 0 or stream 1 as long as we still stay within the overall containing byte slice. And second, we can relax the check in the inner loop to use a much weaker test:

// Only check the last stream against the end; for
// other streams, simply test whether an the read
// pointer for an earlier stream is overtaking the
// read ponter for a later stream (which is never
// valid)
if (bitptr0 > bitptr1 ||
    bitptr1 > bitptr2 ||
    bitend2 - bitptr2 < K)
    break;

The idea is that bitptr1 starts out pointing at bitend0, and only keeps increasing from there. Therefore, if we ever have bitptr0 > bitptr1, we know for sure that something went wrong and we read past the end of stream 0. That will give us garbage data (which we need to handle anyway), but not read out of bounds, since the checks maintain the invariant that bitptr0bitptr1bitptr2bitend2 - K. A later careful loop should use more precise checking, but this variant of the test is simpler and doesn’t require most of the bitend values to be reloaded in every iteration of our decoding loop.

Another interesting option is to reverse the order of some of the streams (which flips endianness as a side effect), and then glue pairs of forward and backward streams together, like shown here for streams 1 and 2:

Three sub-streams with forward/backward pair

I admit this sounds odd, but this has a few interesting properties. One of them is that it shrinks the amount of header space somewhat: in the image, the initial stream pointer for stream 2 is the same as the end of the buffer, and if there were 4 streams, the initial read pointers for stream 2 and 3 would start out in the same location (but going opposite directions). In general, we only need to denote the boundaries between stream pairs instead of individual streams. Then we let the decoder run as before, checking that the read cursors for the forward/backward pair don’t cross. If everything went right, once we’ve consumed the entire input stream, the final read cursors in a forward/backward pair should end up right next to each other. It’s a bit strange in that we don’t know the size of either stream in advance, just their sum, but it works fine.

Another consequence is that there’s no need to keep track of an explicit end pointer in the inner decoder loop if the final stream is a backwards stream; the pointer-crossing check takes care of it. In our running example, we’re now down to

// Check for pointer crossing; if done right, we get end-of-buffer
// checks for free.
if (bitptr0 > bitptr1 ||
    bitptr1 > bitptr2)
    break;

In this version, bitptr0 and bitptr1 point at the next byte to be read in the forwards stream, whereas bitptr2 is offset by -K to ensure we don’t overrun the buffer; this is just a constant offset however, which folds into the memory access on regular load instructions. It’s all a bit tricky, but it saves a couple instructions, makes the bitstream slightly smaller and reduces the number of live variables in a hot loop, with the savings usually being larger the cost of a single extra endian swap. With a two-stream layout, generating the second bitstream in reverse also happens to be convenient on the encoder side, because we can reserve memory for the expected (or budgeted) size of the combined bitstream without having to guess how many bytes end up in either half; it’s just a regular double-ended stack. Once encoding is done, the two parts can be compacted in-place by moving the second half downwards.

None of these properties are a big deal in and of themselves, but they make for a nice package, and a two-stream setup with a forwards/backwards pair is now our default layout for most parts in most parts of the Oodle bitstream (Oodle is a lossless data compression library I work on).

Between the various tricks outlined so far, the size overhead and the extra CPU cost for wrangling multiple streams can be squeezed down quite far. But we still have to deal with the increased number of live variables that multiple streams imply. It turns out that if we’re willing to tolerate a moderate increase in critical path latency, we can reduce the amount of state variables per bit reader, in some cases while simultaneously (slightly) reducing the number of instructions executed. The advantage here is that we can fit more streams into a given number of working registers than we could otherwise; if we can use enough streams that we’re primarily limited by execution throughput and not critical path latency, increasing said latency is OK, and reducing the overall number of instructions helps us increase the throughput even more. So how does that work?

Bit reader variant 5: minimal state, throughput-optimized

The bit reader variants I’ve shown so far generally split the bit buffer state across two variables: one containing the actual bits and another keeping track of how many bits are left in the buffer (or, equivalently, keeping track of the current read position within the buffer). But there’s a simple trick that allows us to reduce this to a single state variable: the bit shifts we use always shift in zeros. If we turn the MSB (for a LSB-first bit buffer) or the LSB (for a MSB-first bit buffer) into a marker bit that’s always set, we can use that marker to track how many bits we’ve consumed in total come the next refill. That allows us to get rid of the bit count and the instructions that manipulate it. That means one less variable in need of a register, and depending on which variant we’re comparing to, also fewer instructions executed per “consume”.

I’ll present this variant in the LSB-first version, and this time there’s an actual reason to (slightly) prefer LSB-first over MSB-first.

const uint8_t *bitptr; // Pointer to current byte
uint64_t bitbuf = 1ull << 63; // Init to marker in MSB

void refill5_lsb() {
    assert(bitbuf != 0);

    // Count how many bits we consumed using a "leading zero
    // count" instruction. See notes below.
    int bits_consumed = CountLeadingZeros64(bitbuf);

    // Advance the pointer
    bitptr += bits_consumed >> 3;

    // Refill and put the marker in the MSB
    bitbuf = read64LE(bitptr) | (1ull << 63);

    // Consume the bits in this byte that we've already used.
    bitbuf >>= bits_consumed & 7;
}

uint64_t peekbits5_lsb(int count) {
    assert(count >= 1 && count <= 56);
    // Just need to mask the low bits.
    return bitbuf & ((1ull << count) - 1);
}

void consume5_lsb(int count) {
    bitbuf >>= count;
}

This “count leading zeros” operation might seem strange and weird if you haven’t seen it before, but it happens to be something that’s useful in other contexts as well, and most current CPU architectures have fast instructions that do this! Other than the strangeness going on in the refill, where we first have to figure out the number of bits consumed from the old marker bit, then insert a new marker bit and do a final shift to consume the partial bits from the first byte, this is like a hybrid between variants 3 and 4 from last time.

The pseudo-assembly for our running “refill, two decodes, then another refill” scenario goes like this: (not writing out the marker constant explicitly here)

    // Initial refill
    rBitsConsumed = clz64(rBitBuf);     // CountLZ 0
    rBytesAdv = lsr(rBitsConsumed, 3);  // AdvanceBytes 0
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 0
    rNext = load64LE(rBitPtr);          // LoadNext 0
    rMarked = rNext | MARKER;           // OrMarker 0
    rLeftover = rBitsConsumed & 7;      // LeftoverBits 0
    rBitBuf = lsr(rMarked, rLeftover);  // ConsumeLeftover 0

    // First decode (peek count==19)
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 0
    rCount = determineCount(rBits);     // DetermineCount 0
    rBitBuf = lsr(rBitBuf, rCount);     // Consume 0

    // Second decode
    rBits = rBitBuf & 0x7ffff;          // BitsMasked 1
    rCount = determineCount(rBits);     // DetermineCount 1
    rBitBuf = lsr(rBitBuf, rCount);     // Consume 1

    // Second refill
    rBitsConsumed = clz64(rBitBuf);     // CountLZ 1
    rBytesAdv = lsr(rBitsConsumed, 3);  // AdvanceBytes 1
    rBitPtr = rBitPtr + rBytesAdv;      // AdvancePtr 1
    rNext = load64LE(rBitPtr);          // LoadNext 1
    rMarked = rNext | MARKER;           // OrMarker 1
    rLeftover = rBitsConsumed & 7;      // LeftoverBits 1
    rBitBuf = lsr(rMarked, rLeftover);  // ConsumeLeftover 1

The refill has 7 integer operations, the same as variant 4 (“looakhead”) above, and 3 more than variant 3 (“bit extract”), while the decode step takes 3 operations (including the determineCount step), one fewer than variants 3 (“bit extract”) and 4 (“lookahead”). The latter means that we equalize with the regular bit extract form in terms of instruction count when we perform at least 3 decodes per refill, and start to pull ahead if we manage more than 3. For completeness, here’s the dependency graph:

Dependency graph for bit reading, variant 5

Easily the longest critical path of the variants we’ve seen so far, and very serial indeed. It doesn’t help that not only do we not know the load address early, we also have several more steps in the refill compared to the basic variant 3. But having the entire “hot” bit buffer state concentrated in a single register (rBitBuf) during the decodes means that we can afford many streams at once, and with enough streams that extra latency can be hidden.

This one definitely needs to be deployed carefully, but it’s a powerful tool when used in the right place. Several of the fastest (and hottest) decoder loops in Oodle use it.

Note that with this variation, there’s a reason to stick with the LSB-first version: the equivalent MSB-first version needs a way to count the number of trailing zero bits, which is a much less common instruction, although it can be synthesized from a leading zero count and standard arithmetic/logical operations at acceptable extra cost. Which brings me to my final topic for this post.

MSB-first vs. LSB-first: the final showdown

Throughout this 3-parter series, I’ve been continually emphasizing that there’s no major reason to prefer MSB-first or LSB-first for bit IO. Both are broadly equivalent and have efficient algorithms. But having now belabored that point sufficiently, if we can make both of them work, which one should we choose?

There are definitely differences that push you into one direction or another, depending on your intended use case. Here are some you might want to consider, in no particular order:

  • As we saw in part 2, the natural MSB-first peekbits and getbits implementations run into trouble (of the undefined-behavior and hardware-actually-behaving-in-surprising-ways kind) when count == 0, whereas with the natural LSB-first implementation, this case is unproblematic. If you need to support counts of 0 (usefol for e.g. variable-length codes), LSB-first tends to be slightly more convenient. Alternatives for MSB-first are a rotate-based implementation (which has no problems with 0 count) or using an extra shift, turning x >> (64 - count) into (x >> 1) >> (63 - count).
  • MSB-first coding tends to have a big edge for universal variable-length codes. Unary codes can be decoded quickly via the aforementioned “count leading zero” instructions; gamma codes and the closely related Exp-Golomb codes also admit direct decoding in a fairly slick way; and the same goes for Golomb-Rice codes and a few others. If you’re considering universal codes, MSB-first is definitely handier.
  • At the other extreme, LSB-first coding often ends up slightly cheaper for the table-based decoders commonly used when a code isn’t fixed as part of the format; Huffman decoders for example.
  • MSB-first meshes somewhat more naturally with big-endian byte order, and LSB-first with little-endian. If you’re deeply committed to either side in this particular holy war, this might drive you one way or the other.

Charles and me both tend to default to MSB-first but will switch to LSB-first where it’s a win on multiple target architectures (or on a single important target).

Conclusion

That’s it for both this post and this mini-series; apologies for the long delay, caused by first a surprise deadline that got dropped in my lap right as I was writing the series originally, and then exacerbated by a combination of technical difficulties (alas, still ongoing) and me having gotten “out of the groove” in the intervening time.

This post ended up longer than my usual, and skips around topics a bit more than I’d like, but I really didn’t want to make this series a four-parter; I still have a few notes here and there, but I don’t want to drag this topic out much longer, not on this general level anyway. Instead, my plan is to write about some more down-to-earth case studies soon, so I can be less hand-wavy, and maybe even do some actual assembly-level analysis for an actual real-world CPU instead of an abstract idealized machine. We’ll see.

Until then!

A whirlwind introduction to dataflow graphs

While in the middle of writing “Reading bits in far too many ways, part 3”, I realized that I had written a lot of background material that had absolutely nothing to do with bit I/O and really was worth putting in its own post. This is that post.

The problem I’m concerned with is fairly easy to state: say we have some piece of C++ code that we’re trying to understand (and perhaps improve) the performance of. A good first step is to profile it, which will give us some hints which parts are slow, but not necessarily why. On a fundamental level, any kind of profiling (or other measurement) is descriptive, not predictive: it can tell you how an existing system is behaving, but if you’re designing something that’s more than a few afternoons worth of work, you probably don’t have the time or resources to implement 5 or 6 completely different design alternatives, pick whichever one happens to work best, and throw the rest away. You should be able to make informed decisions up front from an algorithm sketch without having to actually write a fleshed-out implementation.

One thing I want to emphasize particularly here is that experiments coupled with before/after measurements are no adequate substitute for a useful performance model. These kinds of measurements can tell you how much you’ve improved, but not if you are where you should be: if I tell you that by tweaking some config files, I managed to double the number of requests served per second by the web server, that sounds great. It sounds less good if I give you the additional piece of information that with this fix deployed, we’re now at a whopping 1.5 requests per second; having an absolute scale of reference matters!

This goes especially for microbenchmarks. With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.

Machine models

So, if we want to go deeper than just squinting at C/C++ code and doing some hand-waving, we need to start looking at a somewhat lower abstraction level and define a machine model that is more sophisticated than “statements execute one by one”. If you’re only interested in a single specific processor, one option is to use whatever documentation and tools you can find for the chip in question and analyze your code in detail for that specific machine. And if you’re willing to go all-out on microarchitectural tweaking, that’s indeed the way to go, but it’s a giant step from looking at C++ code, and complete overkill in most cases.

Instead, what I’m going to do is use a simplified machine model that allows us to make quantitative predictions about the behavior of straightforward compute-bound loops, which is simple to describe but still gives us a lot of useful groundwork for more complex scenarios. Here’s what I’ll use:

  • We have an unlimited set of 64-bit integer general-purpose registers, which I’ll refer to by names like rSomething. Any “identifiers” that aren’t prefixed with a lowercase r are either symbolic constants or things like labels.
  • We have the usual 64-bit integer arithmetic and logic operations. All operations can either be performed between two registers or a register and an immediate constant, and the result is placed in another register. All arithmetic uses two’s complement. For simplicity, all 64-bit values are permitted as immediate constants.
  • There’s a flat, byte-granular 64-bit address space, and pointers are just represented as integers.
  • All memory accesses require explicit load and store operations. Memory accesses are either 8, 16, 32, or 64 bits in size and can use (for my convenience) both little-endian or big-endian byte ordering, when requested. One of these is the default, but both are the same cost. Narrow stores store the least significant bits of the register in question; narrow loads zero-extend to 64 bits. Loads and stores have a few common addressing modes (that I’ll introduce as I use them). Unaligned loads and stores are supported.
  • There’s unconditional branches, which just jump to a given location, and conditional branches, which compare a register to either another register or an immediate constant, and branch to a given destination if the condition is true.

Code will be written in a pseudo-C form, at most one instruction per line. Here’s a brief example showing what kind of thing I have in mind:

loop:                            // label
  rFoo = rBar | 1;               // bitwise logical OR
  rFoo = lsl(rFoo, 3);           // logical shift left
  rBar = asr(rBar, rBaz);        // arithmetic shift right
  rMem = load64LE(rBase + rFoo); // little-endian load
  store16BE(rDest + 3, rMem);    // big-endian store
  rCount = rCount - 1;           // basic arithmetic
  if rCount != 0 goto loop;      // branch

Shifts use explicit mnemonics because there’s different types of right shifts and at this level of abstraction, registers are generally treated as untyped bags of bits. I’ll introduce other operations and addressing modes as we get to them. What we’ve seen so far is quite close to classic RISC instruction sets, although I’ll allow a larger set of addressing modes than some of the more minimalist designs, and require support for unaligned access on all loads and stores. It’s also close in spirit to an IR (Intermediate Representation) you’d expect to see early in the backend of a modern compiler: somewhat lower-level than LLVM IR, and comparable to early-stage LLVM Machine IR or GCC RTL.

This model requires us to make the distinction between values kept in registers and memory accesses explicit, and flattens down control flow to basic blocks connected by branches. But it’s still relatively easy to look at a small snippet of C++ and e.g. figure out how many arithmetic instructions it boils down to: just count the number of operations.

As a next step, we could now specify a virtual processor to go with our instruction set, but I don’t want to really get into that level of detail; instead of specifying the actual processor, I’ll work the same way actual architectures do: we require that the end result (eventual register and memory contents in our model) of running a program must be as if we had executed the instructions sequentially one by one (as-if rule). Beyond that, an aggressive implementation is free to cut corners as much as it wants provided it doesn’t get caught. We’ll assume we’re in an environment—the combination of compilers/tools and the processor itself—that uses pipelining and tries to extract instruction-level parallelism to achieve higher performance, in particular:

  • Instructions can launch independent from each other, and take some number of clock cycles to complete. For an instruction to start executing, all the operands it depends on need to have been computed. As long as the dependencies are respected, all reorderings are valid.
  • There is some limit W (“width”) on how many new instructions we can start per clock cycle. In-flight instructions don’t interfere with each other; as long as we have enough independent work, we can start W new instructions every cycle. We’re going to treat W as variable.
  • Memory operations have a latency of 4 cycles, meaning that the result of a load is available 4 cycles after the load issued, and a load reading the bytes written by a prior store can issue 4 cycles after the store. That’s a fairly typical latency for a load that hits in the L1 cache, in case you were wondering.
  • Branches (conditional or not) count as a single instruction, but their latency is variable. Unconditional branches or easily predicted branches such as the loop counter in along-running loop have an effective latency of 0 cycles, meaning the instructions being branched to can issue at the same time as the branch itself. Unpredictable branches have a nonzero cost that depends on how unpredictable they are—I won’t even try to be more precise here.
  • Every other instruction has a latency of 1 clock cycle, meaning the result is available in the next cycle.

This model can be understood as approximating either a dataflow architecture, an out-of-order machine with a very large issue window (and infinitely fast front-end), or a statically scheduled in-order machine running code compiled with a Sufficiently Smart Scheduler. (The kind that actually exists; e.g. a compiler implementing software pipelining).

Furthermore, I’m assuming that while there is explicit control flow (unlike a pure dataflow machine), there is a branch prediction mechanism in place that allows the machine to guess the control flow path taken arbitrarily far in advance. When these guesses are correct, the branches are effectively free other than still taking an instruction slot, during which time the machine checks whether its prediction was correct. When the guess was incorrect, the machine reverts all computations that were down the incorrectly guessed path, and takes some number of clock cycles to recover. If this idea of branch prediction is new to you, I’ll refer you to Dan Luu’s excellent article on the subject, which explains both how and why computers would be doing this.

The end result of these model assumptions is that while control flow exists, it’s on the sidelines: its only observable effect is that it sometimes causes us to throw away a bunch of work and take a brief pause to recover when we guessed wrong. Dataflow, on the other hand—the dependencies between instructions, and how long it takes for these dependencies to be satisfied—is front and center.

Dataflow graphs

Why this emphasis? Because dataflow and data dependencies is because they can be viewed as the fundamental expression of the structure of a particular computation, whether it’s done on a small sequential machine, a larger superscalar out-of-order CPU, a GPU, or in hardware (be it a hand-soldered digital circuit, a FPGA, or an ASIC). Dataflow and keeping track of the shape of data dependencies is an organizing principle of both the machines themselves and the compilers that target them.

And these dependencies are naturally expressed in graph form, with individual operations being the nodes and data dependencies denoted by directed edges. In this post, I’ll have dependent operations point towards the operations they depend on, with the directed edges labeled with their latency. To reduce clutter, I’ll only write latency numbers when they’re not 1.

With all that covered, and to see what the point of this all is, let’s start with a simple, short toy program that just sums the 64-bit integers in some array delineated by two pointers stored in rCurPtr (which starts pointing to the first element) and rEndPtr (which points to one past the last element), idiomatic C++ iterator-style.

loop:
  rCurInt = load64(rCurPtr);        // Load
  rSum = rSum + rCurInt;            // Sum
  rCurPtr = rCurPtr + 8;            // Advance
  if rCurPtr != rEndPtr goto loop;  // Done?

We load a 64-bit integer from the current pointer, add it to our current running total in register rSum, increment the pointer by 8 bytes (since we grabbed a 64-bit integer), and then loop until we’re done. Now let’s say we run this program for a short 6 iterations and draw the corresponding dataflow graph (click to see full-size version):

Dataflow graph for basic array sum

Note I group nodes into ranks by which cycle they can execute in, at the earliest, assuming we can issue as many instructions in parallel as we want, purely constrained by the data dependencies. The “Load” and “Advance” from the first iteration can execute immediately; the “Done?” check from the first iteration looks at the updated rCurPtr, which is only known one cycle later; and “Sum” from the first iteration needs to wait for the load to finish, which means it can only start a full 4 cycles later.

As we can see, during the first four cycles, all we do is keep issuing more loads and advancing the pointer. It takes until cycle 4 for the results of the first load to become available, so we can actually do some summing. After that, one more load completes every cycle, allowing us to add one more integer to the running sum in turn. If we let this process continue for longer, all the middle iterations would look the way cycles 4 and 5 do: in our state state, we’re issuing a copy of all four instructions in the loop every cycle, but from different iterations.

There’s a few conclusions we can draw from this: first, we can see that this four-instruction loop achieves a steady-state throughput of one integer added to the sum in every clock cycle. We take a few cycles to get into the steady state, and then a few more cycles at the end to drain out the pipeline, but if we start in cycle 0 and keep running N iterations, then the final sum will be completed by cycle N+4. Second, even though I said that our model has infinite lookahead and is free to issue as many instructions per cycle as it wants, we “only” end up using at most 4 instructions per cycle. The limiter here ends up being the address increment (“Advance”); we increment the pointer after every load, per our cost model this increment takes a cycle of latency, and therefore the load in the next iteration of the loop (which wants to use the updated pointer) can start in the next cycle at the earliest.

This is a crucial point: the longest-latency instruction in this loop is definitely the load, at 4 cycles. But that’s not a limiting factor; we can schedule around the load and do the summing later. The actual problem here is with the pointer advance; every single instruction that comes after it in program order depends on it either directly or indirectly, and therefore, its 1 cycle of latency determines when the next loop iteration can start. We say it’s on the critical path. In loops specifically, we generally distinguish between intra-iteration dependencies (between instructions within the same iteration, say “Sum 0” depending on “Load 0”) and inter-iteration or loop-carried dependencies (say “Sum 1” depending on “Sum 0”, or “Load 1” depending on “Advance 0”). Intra-iteration dependencies may end up delaying instructions within that iteration quite a lot, but it’s inter-iteration dependencies that determine how soon we can start working on the next iteration of the loop, which is usually more important because it tends to open up more independent instructions to work on.

The good news is that W=4 is actually a fairly typical number of instructions decoded/retired per cycle in current (as of this writing in early 2018) out-of-order designs, and the instruction mixture here (1 load, 1 branch, 2 arithmetic instructions) is also one that is quite likely to be able to issue in parallel on a realistic 4-wide decode/retire design. While many machines can issue a lot more instructions than that in short bursts, a steady state of 4 instructions per cycle is definitely good. So even though we’re not making much of the infinite parallel computing power of our theoretical machine, in practical terms, we’re doing OK, although on real machines we might want to apply some more transforms to the loop; see below.

Because these real-world machines can’t start an arbitrary number of instructions at the same time, we have another concern: throughput. Say we’re running the same loop on a processor that has W=2, i.e. only two instructions can start every cycle. Because our loop has 4 instructions, that means that we can’t possibly start a new loop iteration more often than once every two clock cycles, and the limiter aren’t the data dependencies, but the number of instructions our imaginary processor can execute in a clock cycle; we’re throughput-bound. We would also be throughput-bound on a machine with W=3, with a steady state of 3 new instructions issued per clock cycle, where we can start working on a new iteration every 4/3≈1.33 cycles.

A different example

For the next example, we’re going to look at what’s turned into everyone’s favorite punching-bag of a data structure, the linked list. Let’s do the exact same task as before, only this time, the integers are stored in a singly-linked list instead of laid out as an array. We store first a 64-bit integer and then a 64-bit pointer to the next element, with the end of the list denoted by a special value stored in rEndPtr as before. We also assume the list has at least 1 element. The corresponding program looks like this:

loop:
  rCurInt = load64(rCurPtr);        // LoadInt
  rSum = rSum + rCurInt;            // Sum
  rCurPtr = load64(rCurPtr + 8);    // LoadNext
  if rCurPtr != rEndPtr goto loop;  // Done?

Very similar to before, only this time, instead of incrementing the pointer, we do another load to grab the “next” pointer. And here’s what happens to the dataflow graph if we make this one-line change:

Dataflow graph for linked list sum

Switching from a contiguous array to a linked list means that we have to wait for the load to finish before we can start the next iteration. Because loads have a latency of 4 cycles in our model, that means we can’t start a new iteration any more often than once every 4 cycles. With our 4-instruction loop, we don’t even need any instruction-level parallelism to reach that target; we might as well just execute one instruction per cycle and still hit the same overall throughput.

Now, this example, with its short 4-instruction loop, is fairly extreme; if our loop had say a total of 12 instructions that worked out nicely, the same figure might well end up averaging 3 instructions per clock cycle, and that’s not so bad. But the underlying problem here is a nasty one: because our longest-latency instruction is on the critical path between iterations, it ends up determining the overall loop throughput.

In our model, we’re still primarily focused on compute-bound code, and memory access is very simple: there’s no memory hierarchy with different cache levels, all memory accesses take the same time. If we instead had a more realistic model, we would also have to deal with the fact that some memory accesses take a whole lot longer than 4 cycles to complete. For example, suppose we have three cache levels and, at the bottom, DRAM. Sticking with the powers-of-4 theme, let’s say that a L1 cache hit takes 4 cycles (i.e. our current memory access latency), a L2 hit takes 16 cycles, a L3 hit takes 64 cycles, and an actual memory access takes 256 cycles—for what it’s worth, all these numbers are roughly in the right ballpark for high-frequency desktop CPUs under medium memory subsystem load as of this writing.

Finding work to keep the machine otherwise occupied for the next 4 cycles (L1 hit) is usually not that big a deal, unless we have a very short loop with unfavorable dependency structure, as in the above example. Fully covering the 16 cycles for a L1 miss but L2 hit is a bit trickier and requires a larger out-of-order window, but current out-of-order CPUs have those, and as long as there’s enough other independent work and not too many hard-to-predict branches along the way, things will work out okay. With a L3 cache hit, we’ll generally be hard-pressed to find enough independent work to keep the core usefully busy during the wait for the result, and if we actually miss all the way to DRAM, then in our current model, the machine is all but guaranteed to stall; that is, to have many cycles with no instructions executed at all, just like the gaps in the diagram above.

Because linked lists have this nasty habit of putting memory access latencies on the critical path, they have a reputation of being slow “because they’re bad for the cache”. Now while it’s definitely true that most CPUs with a cache would much rather have you iterate sequentially over an array, we have to be careful how we think about it. To elaborate, suppose we have yet another sum kernel, this time processing an array of pointers to integers, to compute the sum of the pointed-to values.

loop:
  rCurIntPtr = load64(rCurPtr);      // LoadPtr
  rCurInt = load64(rCurIntPtr);      // LoadInt
  rSum = rSum + rCurInt;             // Sum
  rCurPtr = rCurPtr + 8;             // Advance
  if rCurPtr != rEndPtr goto loop;   // Done?

And this time, I’ll prune the dataflow graph to show only the current iteration and its direct dependency relationships with earlier and later iterations, because otherwise these more complicated graphs will get cluttered and unreadable quickly:

Dataflow graph for indirect array sum

A quick look over that graph shows us that copies of the same instruction from different iterations are all spaced 1 cycle apart; this means that in the steady state, we will again execute one iteration of the loop per clock cycle, this time issuing 5 instructions instead of 4 (because there are 5 instructions in the loop). Just like in the linked list case, the pointer indirection here allows us to jump all over memory (potentially incurring cache misses along the way) if we want to, but there’s a crucial difference: in this setup, we can keep setting up future iterations of the loop and get more loads started while we’re waiting for the first memory access to complete.

To explain what I mean, let’s pretend that every single of the “LoadInt”s misses the L1 cache, but hits in the L2 cache, so its actual latency is 16 cycles, not 4. But a latency of 16 cycles just means that it takes 16 cycles between issuing the load and getting the result; we can keep issuing other loads for the entire time. So the only thing that ends up happening is that the “Sum k” in the graph above happens 12 cycles later. We still start two new loads every clock cycle in the steady state; some of them end up taking longer, but that does not keep us from starting work on a new iteration of the loop in every cycle.

Both the linked list and the indirect-sum examples have the opportunity to skip all over memory if they want to; but in the linked-list case, we need to wait for the result of the previous load until we can get started on the next one, whereas in the indirect-sum case, we get to overlap the wait times from the different iterations nicely. As a result, in the indirect-sum case, the extra latency towards reaching the final sum is essentially determined by the worst single iteration we had, whereas in the linked-list case, every single cache miss makes our final result later (and costs us throughput).

The fundamental issue isn’t that the linked-list traversal might end up missing the cache a lot; while this isn’t ideal (and might cost us in other ways), the far more serious issue is that any such cache miss prevents us from making progress elsewhere. Having a lot of cache misses isn’t necessarily a problem if we get to overlap them; having long stretches of time were we can’t do anything else, because everything else we could do depends on that one cache-missing load, is.

In fact, when we hit this kind of problem, our best bet is to just switch to doing something else entirely. This is what CPUs with simultaneous multithreading/hardware threads (“hyperthreads”) and essentially all GPUs do: build the machine so that it can process instructions from multiple instruction streams (threads), and then if one of the threads isn’t really making progress right now because it’s waiting for something, just work on something else for a while. If we have enough threads, then we can hopefully fill those gaps and always have something useful to work on. This trade-off is worthwhile if we have many threads and aren’t really worried about the extra latency caused by time-slicing, which is why this approach is especially popular in throughput-centric architectures that don’t worry about slight latency increases.

Unrolling

But let’s get back to our original integer sum code for a second:

loop:
  rCurInt = load64(rCurPtr);        // Load
  rSum = rSum + rCurInt;            // Sum
  rCurPtr = rCurPtr + 8;            // Advance
  if rCurPtr != rEndPtr goto loop;  // Done?

We have a kernel with four instructions here. Out of these four, two (“Load” and “Sum”) do the actual work we want done, whereas “Advance” and “Done?” just implement the loop itself and are essentially overhead. This type of loop is a prime target for unrolling, where we collapse two or more iterations of the loop into one to decrease the overhead fraction. Let’s not worry about the setup or what to do when the number of elements in the array is odd right now, and only focus on the “meat” of the loop. Then a 2× unrolled version might look like this:

loop:
  rCurInt = load64(rCurPtr);        // LoadEven
  rSum = rSum + rCurInt;            // SumEven
  rCurInt = load64(rCurPtr + 8);    // LoadOdd
  rSum = rSum + rCurInt;            // SumOdd
  rCurPtr = rCurPtr + 16;           // Advance
  if rCurPtr != rEndPtr goto loop;  // Done?

which has this dataflow graph:

Dataflow graph for unrolled array sum, first attempt

Note that even though I’m writing to rCurInt twice in an iteration, which constitutes a write-after-write (WAW) or “output dependency”, there’s no actual dataflow between the loads and sums for the first and second version of rCurInt, so the loads can issue in parallel just fine.

This isn’t bad: we now have two loads every iteration and spend 6N instructions to sum 2N integers, meaning we take 3 instructions per integer summed, whereas our original kernel took 4. That’s an improvement, and (among other things) means that while our original integer-summing loop needed a machine that sustained 4 instructions per clock cycle to hit full throughput, we can now hit the same throuhgput on a smaller machine that only does 3 instructions per clock. This is definitely progress.

However, there’s a problem: if we look at the diagram, we can see that we can indeed start a new pair of loads every clock cycle, but there’s a problem with the summing: we have two dependent adds in our loop, and as we can see from the relationship between “SumEven k” and “SumEven k+1”, the actual summing part of the computation still takes 2 cycles per iteration. On our idealized dataflow machine with infinite lookahead, that just means that all the loads will get front-loaded, and then the adds computing the final sum proceed at their own pace; the result will eventually be available, but it will still take a bit more than 2N cycles, no faster than we were in the original version of the loop. On a more realistic machine (which can only look ahead by a limited number of instructions), we would eventually stop being able to start new loop iterations until some of the old loop iterations have completed. No matter how we slice it, we’ve gone from adding one integer to the sum per cycle to adding two integers to the sum every two cycles. We might take fewer instructions to do so, which is a nice consolation prize, but this is not what we wanted!

What’s happened is that unrolling shifted the critical path. Before, the critical path between iterations went through the pointer advance (or, to be more precise, there were two critical paths, one through the pointer advance and one through the sum, and they were both the same length). Now that we do half the number of advances per item, that isn’t a problem anymore; but the fact that we’re summing these integers sequentially is now the limiter.

A working solution is to change the algorithm slightly: instead of keeping a single sum of all integers, we keep two separate sums. One for the integers at even-numbered array positions, and one for the integers at odd-numberd positions. Then we need to sum those two values at the end. This is the algorithm:

loop:
  rCurInt = load64(rCurPtr);        // LoadEven
  rSumEven = rSumEven + rCurInt;    // SumEven
  rCurInt = load64(rCurPtr + 8);    // LoadOdd
  rSumOdd = rSumOdd + rCurInt;      // SumOdd
  rCurPtr = rCurPtr + 16;           // Advance
  if rCurPtr != rEndPtr goto loop;  // Done?

  rSum = rSumEven + rSumOdd;        // FinalSum

And the dataflow graph for the loop kernel looks as follows:

Dataflow graph for unrolled array sum, second attempt

Where before all the summing was in what’s called the same dependency chain (the name should be self-explanatory by now, I hope), we have now split the summation into two dependency chains. And this is enough to make a sufficiently-wide machine that can sustain 6 instructions per cycle complete our integer-summing task in just slightly more than half a cycle per integer being summed. Progress!

On a somewhat narrower 4-wide design, we are now throughput-bound, and take around 6/4=1.5 cycles per two integers summed, or 0.75 cycles per integer. That’s still a good improvement from the 1 cycle per integer we would have gotten on the same machine from the non-unrolled version; this gain is purely from reduction the loop overhead fraction, and further unrolling could reduce it even further. (That said, unless your loop really is as tiny as our example, you don’t generally want to go overboard with unrolling.)

Tying it all together

In the introduction, I talked about the need for a model detailed enough to make quantitative, not just qualitative, predictions; and at least for very simple compute-bound loops, that is exactly what we have now. At this point, you should know enough to look at the dependency structure of simple loops, and have some idea for how much (or how little) latent parallelism there is, and be able to compute a coarse upper bound on their “speed of light” on various machines with different peak instructions/cycle rates.

Of course, there are many simplifications here, most of which have been already noted in the text; we’re mostly ignoring the effects of the memory hierarchy, we’re not worrying at all about where the decoded instructions come from and how fast they can possibly be delivered, we’ve been flat-out assuming that our branch prediction oracle is perfect, and we’ve been pretending that while there may be a limit on the total number of instructions we can issue per cycle, it doesn’t matter what these instructions are. None of these are true. And even if we’re still compute-bound, we need to worry at least about that latter constraint: sometimes it can make a noticeable difference to tweak the “instruction mix” so it matches better what the hardware can actually do in a given clock cycle.

But all these caveats aside, the basic concepts introduced here are very general, and even just sketching out the dependency graph of a loop like this and seeing it in front of you should give you useful ideas about what potential problems are and how you might address them. If you’re interested in performance optimization, it is definitely worth your time practicing this so you can look at loops and get a “feel” for how they execute, and how the shape of your algorithm (or your data structures, in the linked list case) aids or constrains the compiler and processor.

UPDATE: Some additional clarifications in answer to some questions: paraphrasing one, “if you have to first write C code, translate it to some pseudo-assembly, and then look at the graph, how can this possibly be a better process than just measuring the code in the first place?” Well, the trick here is that to measure anything, you actually need a working program. You don’t to draw a dataflow graph. For example, a common scenario is that there are many ways you could structure some task, and they all want their data structured differently. Actually implementing and testing multiple variants like this requires you to write a lot of plumbing to massage data from one format into another (all of which can be buggy). Drawing a graph can be done from a brief description of the inner loop alone, and you can leave out the parts that you don’t currently care about, or “dummy them out” by replacing them with a coarse approximation (“random work here, maybe 10 cycles latency?”). You only need to make these things precise when they become close to the critical path (or you’re throughput-bound).

The other thing I’ll say is that even though I’ve been talking about adding cycle estimates for compute-bound loops here, this technique works and is useful at pretty much any scale. It’s applicable in any system where work is started and then processed asynchronously, with the results arriving some time later. If you’re analyzing a tight, compute-bound loop, cycle-level granularity is the way to go. But you can zoom out and use the same technique to figure out how your decomposition of an algorithm into tasklets processed by a thread pool works out: do you actually have some meaningful overlap, or is there still one long serial dependency chain that dominates everything, and all you’re doing by splitting it into tasklets like that is adding overhead? Zooming out even further, it works to analyze RPCs you’re sending to a different machine, or queries to some database. Say you have a 30ms target response time, and each RPC takes about 2ms to return its results. In a system that takes 50 RPCs to produce a result, can you meet that deadline? The answer depends on how the dataflow between them looks. If they’re all in series, almost certainly not. If they’re in 5 “layers” that each fan out to 10 different machines then collect the results, you probably can. It certainly applies in project scheduling, and is one of the big reasons the “man-month” isn’t a very useful metric: adding manpower increases your available resources but does nothing to relax your dependencies. In fact, it often adds more of them, to bring new people up to speed. If the extra manpower ends up resulting in more work on the critical path towards finishing your project (for example to train new hires), then adding these extra people to the project made it finish later. And so forth. The point being, this is not just limited to cycle-by-cycle analysis, even though that’s the context I’ve been introducing it in. It’s far more general than that.

And I think that’s enough material for today. Next up, I’ll continue my “Reading bits in far too many ways” series with the third part, where I’ll be using these techniques to get some insight into what kind of difference the algorithm variants make. Until then!

Reading bits in far too many ways (part 2)

(Continued from part 1.)

Last time, I established the basic problem and went through various ways of doing shifting and masking, and the surprising difficulties inherent therein. The “bit extract” style I picked is based on a stateless primitive, which made it convenient to start with because there’s no loop invariants involved.

This time, we’re going to switch to the stateful style employed by most bit readers you’re likely to encounter (because it ends up cheaper). We’re also going to switch from a monolithic getbits function to something a bit more fine-grained. But let’s start at the beginning.

Variant 1: reading the input one byte at a time

Our “extract”-style reader assumed the entire bitstream was available in memory ahead of time. This is not always possible or desirable; so let’s investigate the other extreme, a bit reader that requests additional bytes one at a time, and only when they’re needed.

In general, our stateful variants will dribble in input a few bytes at a time, and have partially processed bytes lying around. We need to store that data in a variable that I will call the “bit buffer”:

// Again, this is understood to be per-bit-reader state or local
// variables, not globals.
uint64_t bitbuf = 0;   // value of bits in buffer
int      bitcount = 0; // number of bits in buffer

While processing input, we will always be seesawing between putting more bits into that buffer when we’re running low, and then consuming bits from that buffer while we can.

Without further ado, let’s do our first stateful getbits implementation, reading one byte at a time, and starting with MSB-first this time:

// Invariant: there are "bitcount" bits in "bitbuf", stored from the
// MSB downwards; the remaining bits in "bitbuf" are 0.

uint64_t getbits1_msb(int count) {
    assert(count >= 1 && count <= 57);

    // Keep reading extra bytes until we have enough bits buffered
    // Big endian; our current bits are at the top of the word,
    // new bits get added at the bottom.
    while (bitcount < count) {
        bitbuf |= (uint64_t)getbyte() << (56 - bitcount);
        bitcount += 8;
    }

    // We now have enough bits present; the most significant
    // "count" bits in "bitbuf" are our result.
    uint64_t result = bitbuf >> (64 - count);

    // Now remove these bits from the buffer
    bitbuf <<= count;
    bitcount -= count;

    return result;
}

As before, we can get rid of the count≥1 requirement by changing the way we grab the result bits, as explained in the last part. This is the last time I’ll mention this; just keep in mind that whenever I show any algorithm variant here, the variations from last time automatically apply.

The idea here is quite simple: first, we check whether there’s enough bits in our buffer to satisfy the request immediately. If not, we dribble in extra bytes one at a time until we have enough. getbyte() here is understood to ideally use some buffered IO mechanism that just boils down to dereferencing and incrementing a pointer on the hot path; it should not be a function call or anything expensive if you can avoid it. Because we insert 8 bits at a time, the maximum number of bits we can read in a single call is 57 bits; that’s the largest number of bits we can refill the buffer to without risking anything dropping out.

After that, we grab the top count bits from our buffer, then shift them out. The invariant we maintain here is that the first unconsumed bit is kept at the MSB of the buffer.

The other thing I want you to notice is that this process breaks down neatly into three separate smaller operations, which I’m going to call “refill”, “peek” and “consume”, respectively. The “refill” phase ensures that a certain given minimum number of bits is present in the buffer; “peek” returns the next few bits in the buffer, without discarding them; and “consume” removes bits without looking at them. These all turn out to be individually useful operations; to show how things shake out, here’s the equivalent LSB-first algorithm, factored into smaller parts:

// Invariant: there are "bitcount" bits in "bitbuf", stored from the
// LSB upwards; the remaining bits in "bitbuf" are 0.
void refill1_lsb(int count) {
    assert(count >= 0 && count <= 57);
    // New bytes get inserted at the top end.
    while (bitcount < count) {
        bitbuf |= (uint64_t)getbyte() << bitcount;
        bitcount += 8;
    }
}

uint64_t peekbits1_lsb(int count) {
    assert(bit_count >= count);
    return bitbuf & ((1ull << count) - 1);
}

void consume1_lsb(int count) {
    assert(bit_count >= count);
    bitbuf >>= count;
    bitcount -= count;
}

uint64_t getbits1_lsb(int count) {
    refill1_lsb(count);
    uint64_t result = peekbits1_lsb(count);
    consume1_lsb(count);
    return result;
}

Writing getbits as the composition of these three smaller primitives is not always optimal. For example, if you use the rotate method for MSB-first bit buffers, you really want to have only rotate shared by the peekbits and consume phases; an optimal implementation shares that work between the two. However, breaking it down into these individual steps is still a useful thing to do, because once you conceptualize these three phases as distinct things, you can start putting them together differently.

Lookahead

The most important such transform is amortizing refills over multiple decode operations. Let’s start with a simple toy example: say we want to read our three example bit fields from the last part:

    a = getbits(4);
    b = getbits(3);
    c = getbits(5);

With getbits implemented as above, this will do the refill check (and potentially some actual refilling) up to 3 times. But this is silly; we know in advance that we’re going to be reading 4+3+5=12 bits in one go, so we might as well grab them all at once:

    refill(4+3+5);
    a = getbits_no_refill(4);
    b = getbits_no_refill(3);
    c = getbits_no_refill(5);

where getbits_no_refill is yet another getbits variant that does peekbits and consume, but, as the name suggests, no refilling. And once you get rid of the refill loop between the individual getbits invocations, you’re just left with straight-line integer code, which compilers are good at optimizing further. That said, the all-fixed-length case is a bit of a cheap shot; it gets far more interesting when we’re not sure exactly how many bits we’re actually going to consume, like in this example:

    temp = getbits(5);
    if (temp < 28)
        result = temp;
    else
        result = 28 + (temp - 28)*16 + getbits(4);

This is a simple variable-length code where values from 0 through 27 are sent in 5 bits, and values from 28 through 91 are sent in 9 bits. The point being, in this case, we don’t know in advance how many bits we’re eventually going to consume. We do know that it’s going to be no more than 9 bits, though, so we can still make sure we only refill once:

    refill(9);
    temp = getbits_no_refill(5);
    if (temp < 28)
        result = temp;
    else
        result = 28 + (temp - 28)*16 + getbits_no_refill(4);

In fact, if you want to, you can go wild and split operations even more, so that both execution paths only consume bits once. For example, assuming a MSB-first bit buffer, we could write this small decoder as follows:

    refill(9);
    temp = peekbits(5);
    if (temp < 28) {
        result = temp;
        consume(5);
    } else {
        // The "upper" and "lower" code are back-to-back,
        // and combine exactly the way we want! Synergy!
        result = getbits_no_refill(9) - 28*16 + 28;
    }

This kind of micro-tweaking is really not recommended outside very hot loops, but as I mentioned in the previous part, some of these decoder loops are quite hot indeed, and in that case saving a few instructions here and a few instructions there adds up. One particularly important technique for decoding variable-length codes (e.g. Huffman codes) is to peek ahead by some fixed number of bits and then do a table lookup based on the result. The table entry then lists what the decoded symbol should be, and how many bits to consume (i.e. how many of the bits we just peeked at really belonged to the symbol). This is significantly faster than reading the code a bit or two at a time and consulting a Huffman tree at every step (the method sadly taught in many textbooks and other introductory texts.)

There’s a problem, though. The technique of peeking ahead a bit (no pun intended) and then later deciding how many bits you actually want to consume is quite powerful, but we’ve just changed the rules: the getbits implementation above is careful to only read extra bytes when it’s strictly necessary. But our modified variable-length code reader example above always refills so the buffer contains at least 9 bits, even when we’re only going to consume 5 bits in the end. Depending on where that refill happens, it might even cause us to read past the end of the actual data stream.

In short, we’ve introduced lookahead. The modified code reader starts grabbing extra input bytes before it’s sure whether it needs them. This has many advantages, but the trade-off is that it may cause us to read past the logical end of a bit stream; it certainly implies that we have to make sure this case is handled correctly. It certainly should never crash or read out of bounds; but beyond that, it implies certain thing about the way input buffering or the framing layer have to work.

Namely, if we’re going to do any lookahead, we need to figure out a way to handle this. The primary options are as follows:

  • We can punt and make it someone else’s problem by just requiring that everyone hand us valid data with some extra padding bytes after the end. This makes our lives easier but is an inconvenience for everyone else.
  • We can arrange things so the outer framing layer that feeds bytes to our getbits routine knows when the real data stream is over (either due to some escape mechanism or because the size is sent explicitly); then we can either stop doing any lookahead and switch to a more careful decoder when we’re getting close to the end, or pad the stream with some dummy value after its end (zeroes being the most popular choice).
  • We can make sure that whatever bytes we might grab during lookahead while decoding a valid stream are still part of our overall byte stream that’s being processed by our code. For example, if you use a 64-bit bit buffer, we can finagle our way around the problem by requiring that some 8-byte footer (say a checksum or something) be present right after a valid bit stream. So while our bit buffer might overshoot, it’s still data that’s ultimately going to be consumed by the decoder, which simplifies the logistics considerably.
  • Barring that, whatever I/O buffering layer we’re using needs to allow us to return some extra bytes we didn’t actually consume into the buffer. Namely, whatever lookahead bytes we have left in our bit buffer after we’re done decoding need to be returned to the buffer for whoever is going to read it next. This is essentially what the C standard library function ungetc does, except you’re not allowed to call ungetc more than once, and we might need to. So going along this route essentially dooms you to taking charge of IO buffer management.

I won’t sugarcoat it, all of these options are a pain in the neck, some more so than others: hand-waving it away by putting something else at the end is easiest, handling it in some outer framing layer isn’t too bad, and taking over all IO buffering so you can un-read multiple bytes is positively hideous, but you don’t have great options when you don’t control your framing. In the past, I’ve written posts about handy techniques that might help you in this context; and in some implementations you can work around it, for example by setting bitcount to a huge value just after inserting the final byte from the stream. But in general, if you want lookahead, you’re going to have to put some amount of work into it. That said, the winnings tend to be fairly good, so it’s not all for nothing.

Variant 2: when you really want to read 64 bits at once

The methods I’ve discussed so far both have some “slop” from working in byte granularity. The extract-style bit reader started with a full 64-bit read but then had to shift by up to 7 positions to discard the part of the current byte that’s already consumed, and the getbits1 above inserts one byte at a time into the bit buffer; if there’s 57 bits already in the buffer, there’s no space for another byte (because that would make 65 bits, more than the width of our buffer), so that’s the maximum width the getbits1 method supports. Now 57 bits is a useful amount; but if you’re doing this on a 32-bit platform, the equivalent magic number is 25 bits (32-7), and that’s definitely on the low side, enough so to be inconvenient sometimes.

Luckily, if you want the full width, there’s a way to do it (like the rotate-and-mask technique for MSB-first bit buffers, I learned this at RAD). At this point, I think you get the correspondence between the MSB-first and LSB-first methods, so I’m only going to show one per variant. Let’s do LSB-first for this one:

// Invariant: "bitbuf" contains "bitcount" bits, starting from the
// LSB up; 1 <= bitcount <= 64
uint64_t bitbuf = 0;     // value of bits in buffer
int      bitcount = 0;   // number of bits in buffer
uint64_t lookahead = 0;  // next 64 bits
bool     have_lookahead = false;

// Must do this to prime the pump!
void initialize() {
    bitbuf = get_uint64LE();
    bitcount = 64;
    have_lookahead = false;
}

void ensure_lookahead() {
    // grabs the lookahead word, if we don't have
    // it already.
    if (!have_lookahead) {
        lookahead = get_uint64LE();
        have_lookahead = true;
    }
}

uint64_t peekbits2_lsb(int count) {
    assert(bitcount >= 1);
    assert(count >= 0 && count <= 64);

    if (count <= bitcount) { // enough bits in buf
        return bitbuf & width_to_mask_table[count];
    } else {
        ensure_lookahead();

        // combine current bitbuf with lookahead
        // (lookahead bits go above end of current buf)
        uint64_t next_bits = bitbuf;
        next_bits |= lookahead << bitcount;
        return next_bits & width_to_mask_table[count];
    }
}

void consume2_lsb(int count) {
    assert(bitcount >= 1);
    assert(count >= 0 && count <= 64);

    if (count < bitcount) { // still in current buf
        // just shift the bits out
        bitbuf >>= count;
        bitcount -= count;
    } else { // all of current buf consumed
        ensure_lookahead();
         
        // we advanced fully into the lookahead word
        int lookahead_consumed = count - bitcount;
        bitbuf = lookahead >> lookahead_consumed;
        bitcount = 64 - lookahead_consumed;
        have_lookahead = false;
    }

    assert(bitcount >= 1);
}

uint64_t getbits2_lsb(int count) {
    uint64_t result = peekbits2_lsb(count);
    consume2_lsb(count);
    return result;
}

This one is a bit more complicated than the ones we’ve seen before, and needs an explicit initialization step to make the invariants work out just right. It also involves several extra branches compared to the variants we’ve seen before, which makes it less than ideal for deeply pipelined machines, which includes desktop PCs, and note that I’m using the width_to_mask_table again, and not just for show: none of the arithmetic expressions we saw last time to compute the mask for a given width work for the full 0-64 range of allowed widths on any common 64-bit architecture that’s not POWER, and even that only if we ignore them invoking undefined behavior, which we really shouldn’t.

The underlying idea is fairly simple: instead of just one bit buffer, we keep track of two values. We have however many bits are left of the last 64-bit value we read, and when that’s not enough for a peekbits, we grab the next 64-bit value from the input stream (via some externally-implemented get_uint64LE()) to give us the bits we’re missing. Likewise, consume checks whether there will still be any bits left in the current input buffer after consuming width bits. If not, we switch over to the bits from the lookahead value (shifting out however many of them we consumed) and clear the have_lookahead flag to indicate that what used to be our lookahead value is now just the contents of our bit buffer.

There are some contortions in this code to ensure we don’t do out-of-range (undefined-behavior-inducing) shifts. For example, note how peekbits tests whether count <= bitcount to detect the bits-present-in-buffer case, whereas consume uses count < bitcount. This is not an accident: in peekbits, the next_bits calculation involves a right-shift by bitcount. Since it only happens in the path where bitcount < count ≤ 64, that means that bitcount < 64, and we’re safe. In consume, the situation is reversed: we shift by lookahead_consumed = count - bitcount. The condition around the block guarantees that lookahead_consumed ≥ 0; in the other direction, because count is at most 64 and bitcount is at least 1, we have lookahead_consumed ≤ 64 – 1 = 63. That said, to paraphrase Knuth: beware of bugs in the above code; I’ve only proved it correct, not tried it.

This technique has another thing going for it besides supporting bigger bit field widths: note how it always reads full 64-bit uints at a time. Variant 1 above only reads bytes at a time, but requires a refill loop; the various branchless variants we’ll see later implicitly rely on the target CPU supporting fast unaligned reads. This version alone has the distinction of doing reads of a single size and with consistent alignment, which makes it more attractive on targets that don’t support fast unaligned reads, such as many old-school RISC CPUs.

Finally, as usual, there’s several more variations here that I’m not showing. For example, if you happen to have the data you’re decoding fully in memory, there’s no reason to bother with the boolean have_lookahead flag; just keep a pointer to the current lookahead word, and bump that pointer up whenever the current lookahead is consumed.

Variant 3: bit extraction redux

The original bit extraction-based bit reader from the previous part was a bit on the expensive side. But as long as we’re OK with the requirement that the entire input stream be in memory at once, we can wrangle it into the refill/peek/consume pattern to get something useful. It still gives us a bit reader that looks ahead (and hence has the resulting difficulties), but such is life. For this one, let’s do MSB again:

const uint8_t *bitptr; // Pointer to current byte
uint64_t       bitbuf = 0; // last 64 bits we read
int            bitpos = 0; // how many of those bits we've consumed

void refill3_msb() {
    assert(bitpos <= 64);

    // Advance the pointer by however many full bytes we consumed
    bitptr += bitpos >> 3;

    // Refill
    bitbuf = read64BE(bitptr);

    // Number of bits in the current byte we've already consumed
    // (we took care of the full bytes; these are the leftover
    // bits that didn't make a full byte.)
    bitpos &= 7;
}

uint64_t peekbits3_msb(int count) {
    assert(bitpos + count <= 64);
    assert(count >= 1 && count <= 64 - 7);

    // Shift out the bits we've already consumed
    uint64_t remaining = bitbuf << bitpos;

    // Return the top "count" bits
    return remaining >> (64 - count);
}

void consume3_msb(int count) {
    bitpos += count;
}

This time, I’ve also left out the getbits built from refill / peek / consume calls, because that’s yet another pattern that should be pretty clear by now.

It’s a pretty sweet variant. Once we break the bit extraction logic into separate “refill” and “peek”/”consume” pieces, it becomes clear how all of the individual pieces are fairly small and clean. It’s also completely branchless! It does expect unaligned 64-bit big-endian reads to exist and be reasonably cheap (not a problem on mainstream x86s or ARMs), and of course a realistic implementation needs to include handling of the end-of-buffer cases; see the discussion in the “lookahead” section.

Variant 4: a different kind of lookahead

And now that we’re here, let’s do another branchless lookahead variant. This exact variant is, to the best of my knowledge, another RAD special – discovered by my colleague Charles Bloom and me while working on Kraken (UPDATE: as Yann points out in the comments, this basic idea was apparently used in Eric Biggers’ “Xpack” long before Kraken was launched; I wasn’t aware of this and I don’t think Charles was either, but that means we’re definitely not the first ones to come up with the idea. Our variant has an interesting wrinkle though – details in my reply). Now all branchless (well, branchless if you ignore end-of-buffer checking in the refill etc.) bit readers look very much alike, but this particular variant has a few interesting properties (some of which I’ll only discuss later because we’re lacking a bit of necessary background right now), and that I haven’t seen anywhere else in this combination; if someone else did it first, feel free to inform me in the comments, and I’ll add the proper attribution! Here goes; back to LSB-first again, because I’m committed to hammering home just how similar and interchangeable LSB-first/MSB-first are at this level, holy wars notwithstanding.

const uint8_t *bitptr;   // Pointer to next byte to insert into buf
uint64_t bitbuf = 0;     // value of bits in buffer
int      bitcount = 0;   // number of bits in buffer

void refill4_lsb() {
    // Grab the next few bytes and insert them right above
    // the current top.
    bitbuf |= read64LE(bitptr) << bitcount;

    // Advance the read pointer for next iteration
    bitptr += (63 - bitcount) >> 3;

    // Update the available bit count
    bitcount |= 56; // now bitcount is in [56,63]
}

uint64_t peekbits4_lsb(int count) {
    assert(count >= 0 && count <= 56);
    assert(count <= bitcount);
    
    return bitbuf & ((1ull << count) - 1);
}

void consume4_lsb(int count) {
    assert(count <= bitcount);

    bitbuf >>= count;
    bitcount -= count;
}

The peek and consume phases are nothing we haven’t seen before, although this time the maximum permissible bit width seems to have shrunk by one more bit down to 56 bits for some reason.

That reason is in the refill phase, which works slightly differently from the ones we’ve seen so far. Reading 64 little-endian bits and shifting them up to align with the top of our current bit buffer should be straightforward at this point. But the bitptr / bitcount manipulation needs some explanation.

It’s actually easier to start with the bitcount part. The variants we’ve seen so far generally have between 57 and 64 bits in the buffer after refill. This version instead targets having between 56 and 63 bits in the buffer (which is also why the limit on count went down by one). But why? Well, inserting some integer number of bytes means bitcount is going to be incremented by some multiple of 8 during the refill; that means that bitcount & 7 (the low 3 bits) won’t change. And if we refill to a target of [56,63] bits in the buffer, we can compute the updated bit count with a single binary OR operation.

Which brings me to the question of how many bytes we should advance the pointer by. Well, let’s just look at the values of the original bitcount:

  • If 56 ≤ bitcount ≤ 63, we were already in our target range and don’t want to advance by another byte.
  • If 48 ≤ bitcount ≤ 55, we’re adding exactly 1 byte (and so want to advance bit_ptr by that much).
  • If 40 ≤ bitcount ≤ 47, we’re adding exactly 2 bytes.

and so forth. This works out to the (63 - bitcount) >> 3 bytes we’re adding to bitptr.

Now, the bits in bitbuf above bitcount can end up getting ORed over multiple times. However, when that happens, we OR over the same value every time, so it doesn’t change the result. Therefore, once they later travel downwards (from the right-shift in the consume function), they’re fine; no need to worry about garbage.

Okay. So what’s interesting, but what’s so special about this particular variant? When would you choose this over, say, variant 3 above?

One simple reason: in this variant, the address the refill is loading from does not have a dependency on the current value of bitcount. In fact, the next load address is known as soon as the previous refill is complete. This is a subtle distinction that turns out to be a fairly major advantage on an out-of-order CPU. Among integer operations, even when hitting the L1 cache, loads are on the high latency side (typically somewhere between 3 and 5 cycles, whereas most integer operations take a single cycle), and the exact value of bitcount at the end of some loop iteration is often only known late (consider the simple variable-length code example I gave above).

Having the load address not depend on bitcount means the load can potentially issue as soon as the previous refill is complete; then we have plenty of time to complete the load, potentially byte-swap the value if the endianness of our load doesn’t match the target ISA (say because we’re using a MSB-first bit buffer on a little-endian CPU), and then the only thing that depends on the previous value of bitcount is the shift, which is a regular ALU operation and generally takes a single cycle.

In short, this somewhat obscure form of refill looks weird, but provides a tangible increase in available instruction-level parallelism. It was good for about a 10% throughput improvement on desktop PCs (vs. the earlier branchless refill it replaced) in the then-current version of the Kraken Huffman decoder when I tested it in early 2016.

Consider this a teaser for the next (and hopefully last) part of this series, in which I won’t introduce many more variants (maybe one more), and will instead talk a lot more about the performance of bit stream decoders and what kinds of things to watch out for.

Until then!

Reading bits in far too many ways (part 1)

Resurrecting an old tradition of this blog, let’s take a simple problem, go over a way too long list of alternative solutions, and evaluate their merits.

Our simple problem is this: we want to read data encoded using some variable-bit-length code from a byte stream. Reading individual bytes, machine words etc. is directly supported by most CPUs and many programming languages, but for bit-granularity IO, you generally need to implement it yourself.

This sounds simple enough, and in some sense it is. The first source of problems is that this operation tends to be a hot spot in codecs—-and yes, compute bound, not memory or I/O bound. So we’d like not just an implementation that works; we’d like it to be efficient as well. And along the way we’ll run into many other complications: interactions with IO buffering, end-of-buffer handling, corner cases in the way bit shifts are specified both in C/C++ and in various processor architectures, as well as other bit shift peculiarities.

I’ll mainly focus on many different ways to handle the reader side in this post; essentially all techniques covered in here apply equally to the writer side, but I don’t want to double the number of algorithm variations I’m presenting. There will be plenty as it is.

Degrees of freedom

“Read a variable number of bits” is not a sufficient problem specification. There are lots of plausible ways to pack bits into bytes, and all have their strengths and weaknesses that I’ll go into later. For now, let’s just cover the differences.

The first major decision to make is whether fields are packed MSB-first or LSB-first (“most significant bit” and “least significant bit”, respectively). That is, if we call our to-be-implemented function getbits and run the code sequence

a = getbits(4);
b = getbits(3);

on some bitstream that we just opened, we expect both values to come from the same byte, but how are they arranged in that byte? If bits are packed MSB-first, then “a” occupies 4 bits starting at the MSB, and “b” is below “a”, leading to this arrangement:

MSB-first bit packing

I’m numbering bits with the LSB being bit 0 and increasing towards the MSB, which is the conventional ordering in most contexts. LSB-first packing is the opposite, where the first field occupies bit 0 and later fields proceed upwards:

LSB-first bit packing

Both conventions are used in everyday file formats. For example, JPEG uses MSB-first bit packing for its bitstream, and DEFLATE (zip) uses LSB-first.

The next question we have to settle is what’s supposed to happen when a value ends up spanning multiple bytes. Say we have another value, “c”, that we want to encode in 5 bits. What do we want to end up with? We can postpone the issue slightly be declaring that we’re packing values into 32-bit words or 64-bit words and not bytes, but ultimately we’ll have to settle on something. This is where we suddenly get a whole lot of different variants, and I’m only going to cover the main contenders.

We can think of MSB-first bit packing as iterating over our bitfield “c” from its MSB towards the LSB, inserting one bit at a time. And once we fill up one byte, we start with the next byte. Following these rules for our new bitfield c, this is where the bits end up in our stream:

MSB-first bit packing across multiple bytes

Note that by following these rules, we end up with the same two bytes we would have gotten from MSB-first bit packing into a big integer and then stored it in big-endian byte order. If we had instead decided to split c such that its LSB goes into the first byte and the four higher-order bits into the second byte, this wouldn’t have worked. I’m going to call bit-packing rules that are self-consistent like that “natural” rules.

LSB-first bit-packing of course also has its corresponding natural rule, which is to insert the new value bit by bit starting from the LSB upwards, and if we do that here, we end up with this bit stream:

LSB-first bit packing across multiple bytes

LSB-first natural packing gives us the same bytes as LSB-first packing into a big integer then storing it in little-endian byte order. Also, we’re clearly running into some awkwardness with the drawing here: the logically contiguous packing of c into multiple bytes looks discontinuous when drawing it like this, whereas the drawing for the MSB-first packing looked the way you’d expect. But there’s trouble brewing there too: in the MSB-first drawing, we’re numbering the bits in increasing order from right to left (as is common) and the bytes in increasing order from left to right (as is also common).

Here’s what happens with the LSB-first bit diagram if we draw bit 0 (the LSB) in each byte at the left and bit 7 (the MSB) at the right:

LSB-first bitpacking across multiple bytes, LSB-&gt;MSB from left to right

If you draw it this way, it looks the way you’d expect. Putting the MSB on the right feels weird when thinking of a byte as a number and much less weird when you just think about it as an array of 8 bits (which is, effectively, what we’re treating it as when we’re doing bit-wise IO).

Incidentally, some big-endian architectures (e.g. IBM POWER) do number bits this way – bit 0 is the MSB, bit 31 (or 63) is the LSB. Diagramming MSB-first bit packing on such a machine with bit 0=MSB, and also numbering our own bit fields so that their bit 0 corresponds to the MSB, we’d get the exact same diagram (but it would mean something slightly different). This convention makes bit and byte ordering consistent (nice) but breaks the handy convention of bit k corresponding to a value of 2k (less nice).

And if the idea of renumbering bits so bit 0 is the MSB breaks your head, you can also leave the bit numbering as it is but be a rebel and draw byte addresses increasing to the left. Or alternatively keep drawing increasing addresses to the right but be a slightly different kind of rebel and write your byte stream backwards. If you do either of those, you end up with this layout:

LSB-first bitpacking across multiple bytes, reverse byte order

I realize this is getting confusing, and I’ll stop now, but I’m trying to make an actual point here: you should not get too attached to the way this stuff is drawn. It’s easy to fool yourself into thinking that one variant is better than another because it looks prettier, but the conventions of drawing bytes left to right and bits within them right to left are completely arbitrary, and no matter which one you pick, they always end up having that One Weird Thing about them.

It turns out that MSB-first and LSB-first packing conventions both have advantages and disadvantages, and it’s much more useful to think of them as tools with different areas of application than it is to designate one as the “right way” and the other as the “wrong way”. As for byte order, and whether to pack values into bytes, words, or something larger, I highly recommend that you use whatever the natural order for your bit-packing convention is: MSB-first corresponds naturally to big-endian style byte ordering, and LSB-first corresponds naturally to little-endian byte ordering. Unless you’re writing byte streams in reverse—believe it or not, there’s good reasons to do that too—in which case we have reverse-order MSB-first corresponding to little-endian and reverse-order LSB-first corresponding to big-endian.

The reason to prefer “natural” orders is that they tend to give you more freedom for different implementations. A natural-order stream admits a variety of different decoders (and encoders), all with different trade-offs (and good on different targets). “Unnatural” orders are usually designed with exactly one implementation in mind and tend to very awkward to decode any other way.

Our first getbits (bit extract)

Now that we’ve specified the problem sufficiently, we can implement a solution. A particularly simple version is possible if we assume the entire bit stream is sequential in memory (as a byte array) and also completely ignore, for the time being, such pesky issues as running into the end of the array. Let’s just pretend it’s infinite! (Or large enough and zero-padded, anyway.)

In that case, we can base everything on a purely functional “bit extract” function, which I am then going to illustrate a number of issues that come up in all kind of bit-reading functions. Let’s start with LSB-first bit packing:

// Treating buf[] as a giant little-endian integer, grab "width"
// bits starting at bit number "pos" (LSB=bit 0).
uint64_t bit_extract_lsb(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 0 && width <= 64 - 7);

    // Read a 64-bit little-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64LE(&buf[pos / 8]);

    // Shift out the bits inside the first byte that we've
    // already consumed.
    // After this, the LSB of our bit field is in the LSB of bits.
    bits >>= pos % 8;

    // Return the low "width" bits, zeroing the rest via bit mask.
    return bits & ((1ull << width) - 1);
}

// State variable, assumed to be local variables, or factored
// into some object; hopefully not actual globals.
const uint8_t *bitstream; // The input bistream
size_t bit_pos; // Current position in the stream, in bits.

uint64_t getbits_extract_lsb(int width) {
    // Read the bits
    uint64_t result = bit_extract_lsb(bitstream, bit_pos, width);
    // Advance the cursor
    bit_pos += width;
    return result;
}

We’re just using the fact I noted earlier that a LSB-first bit stream is just a big little-endian number. We first grab 64 contiguous byte-aligned bits starting from the first byte containing any bits we care about, do a right shift to get rid of the remaining 0-7 extra bits below the first bit we care about, and then return the result masked to the desired width.

Depending on the value of pos, that right shift can cost us an extra 7 bits. So even though we read a full 64 bits, the maximum number of bits we can read in one go with the code above is 64-7=57 bits.

With bitextract in hand, getbits is straightforward; we just keep track of the current position in the bitstream (in bits), and increment it after reading. Easy.

The corresponding MSB-first variant works out quite similarly, except for one annoying issue I’ll explain after showing the code:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_msb(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 1 && width <= 64 - 7);

    // Read a 64-bit big-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64BE(&buf[pos / 8]);

    // Shift out the bits we've already consumed.
    // After this, the MSB of our bit field is in the MSB of bits.
    bits <<= pos % 8;

    // Return the top "width" bits.
    return bits >> (64 - width);
}

uint64_t getbits_extract_msb(int width) {
    // Read the bits
    uint64_t result = bit_extract_msb(bitstream, bit_pos, width);
    // Advance the cursor
    bit_pos += width;
    return result;
}

This works very similar to the one before: read 64 contiguous byte-aligned bits (big-endian this time), do a left shift to align the top of the bit field we want with the MSB of bits (where before, we did a right shift to align the bottom of our bit field with the LSB of bits), and then do a right shift to place the top width bits at the bottom to return them, because when somebody calls getbits(3), they generally expect to see a value between 0 and 7 inclusive.

Shift boundary cases

So what’s the problem? Well, this version doesn’t allow width to be zero. The issue is that if we allow width == 0, then the final shift will try to right-shift a 64-bit value by 64 bits, and that’s undefined behavior in C/C++! In this case, we may only shift by 0 through 63.

Now, in some cases, C/C++ leave details unspecified for backwards-compatibility with machines essentially nobody cares about at this point. Not requiring that the representation of signed numbers use two’s complement being a famous example; while non-two’s complement architectures exist, they’re all museum pieces at this point.

Sadly, this is not one of these cases. Here’s what happens on different widespread CPU architectures when the shift amount is out of range:

  • For 32-bit x86 and x86-64, shift amounts are interpreted mod 32 for operand widths of 32 bits and lower, and mod 64 for 64-bit operands. So right-shifting a 64-bit value by 64 will yield the same as shifting by 0, i.e. a no-op.
  • In 32-bit ARM (A32/T32 instruction sets), shift amounts are taken mod 256. Right-shifting a 32-bit value by 32 (or 64) will hence yield 0, as will right-shifting it by 255, but right-shifting it by 256 will leave the value untouched.
  • In 64-bit ARM (A64 ISA), shift amounts are taken mod 32 for 32-bit shifts and mod 64 for 64-bit shifts (essentially the same as x86-64).
  • RISC-V also follows the same rule: 32-bit shifts distances are mod 32, 64-bit shift distances are mod 64.
  • For POWER/PowerPC, 32-bit shifts take the shift amount mod 64 and 64-bit shifts take the shift amount mod 128.

To make matters even more confusing, in most of these instruction sets with SIMD extensions, the SIMD integer instructions have different out-of-range shift behavior
than the non-SIMD instructions. In short, this is one of those cases where there’s actual architectural differences between mainstream platforms; POWER and RISC-V might be a bit obscure for most of you, but say 32-bit vs. 64-bit ARM both have hundreds of millions of devices sold at this point.

Therefore, even if all the compiler does with that right-shift is to emit the corresponding right-shift instruction for the target architecture (which is generally what happens), you will still see different behavior on different architectures, and on both ARM A64 and x86-64, the result of a shift by 64 will be effectively a no-op, and getbits(0) will therefore (generally) return a non-0 value, where one would expect that it’s always zero.

Why does it matter in the first place? Having a hard-coded getbits(0) is indeed not an interesting use case; but sometimes you might want to perform a getbits(x) for some variable x, where x can be zero in some cases, and it’s nice for that case to just work and not require some special-case testing.

If you really need that case to work, one option is to explicitly test for width == 0 and handle that specially; another is to use an branch-less expression that works for zero widths, for example this variant used in Yann Collet‘s FSE:

    // Return the top "width" bits, avoiding width==0 edge cases.
    return (bits >> 1) >> (63 - width);

This particular case is easier to handle with LSB-first bit streams. And while I’m mentioning them, let’s talk about the masking operation I used to isolate the lowest width bits:

    // Return the low "width" bits, zeroing the rest via bit mask.
    return bits & ((1ull << width) - 1);

There’s an equivalent form that is slightly cheaper on architectures with a three-operand and-with-complement (AND-NOT) instruction. This includes many RISC CPUs as well as x86s with BMI1. Namely, we can take a mask of all-1 bits, left-shift it to introduce width zeroes at the bottom, then complement the whole thing:

    return bits & ~(~0ull << width);

If you’re on an x86 with not just BMI1, but also BMI2 support, you can also use the BZHI instruction that is tailor-made for this use case, if you can figure out how to make your compiler emit it (or use assembly). Yet another option that is advantageous in some cases is to simply prepare a small look-up table of bit masks, which then simplifies the code into

    return bits & width_to_mask_table[width];

It may feel ridiculous to prepare a lookup table storing the result of two integer operations, especially since computing the address of the table element to load usually involves both a shift and an addition—the exact two operations we’d be performing if we didn’t have the table!—but there’s a method to this madness: the required address calculation can be done as part of the memory access in a single load instruction on e.g. x86 and ARM machines, and so this shift and add get executed on an Address Generation Unit (AGU) as part of the load pipeline of the CPU, not with integer arithmetic and shift instructions. So counter-intuitive as it may sound, replacing two integer ALU instruction with one integer load instruction like this can result in a noticeable speed-up in bit-IO-intensive code, because it balances the workload better between available execution units.

Another interesting property is that the LSB-first version using the mask table lookup only performs one shift by a variable amount (to shift out the already-consumed bits). This matters because, for a variety of (often banal) reasons, integer shifts by a variable amount are more expensive than shifts by a compile-time constant amount on many micro-architectures. For example, the Cell PPU/Xbox 360 Xenon CPU were infamous for having variable-distance shifts that stalled the CPU core for a whopping 12 cycles, whereas regular shifts were pipelined and would complete in two cycles. Less drastically, on many Intel x86 microarchitectures, “traditional” variable x86 shifts (SHR reg, cl and friends) are about three times as expensive as shifts by a compile-time constant.

This seems like another way in which the MSB-first variant is behind: the variants we’ve seen so far perform either two or three shifts per extract operation, two of which are variable-distance. But there’s a trick to get down to a single shift-type operation, namely by using a bit rotate (cyclical shift) instead of a regular shift:

// Treating buf[] as a giant big-endian integer, grab "width"
// bits starting at bit number "pos" (MSB=bit 0).
uint64_t bit_extract_rot(const uint8_t *buf, size_t pos, int width) {
    assert(width >= 0 && width <= 64 - 7);

    // Read a 64-bit big-endian number starting from the byte
    // containing bit number "pos" (relative to "buf").
    uint64_t bits = read64BE(&buf[pos >> 3]);

    // Rotate left to align the bottom of our bit field with the LSB.
    bits = rotate_left64(bits, (pos & 7) + width);

    // Return the bottom "width" bits.
    // (Using a table here, but the other ways of masking discussed
    // for LSB-first bit IO also work.)
    return bits & width_to_mask_table[width];
}

Here, the rotate-left (which you need to figure out how to best access in your C compiler yourself; I want to avoid further digressions) first takes up the work of the original left-shift, and then rotates by an extra “width” bits to make our bit field wrap around from the most-significant bits of the value down to the least-significant, where we can then mask it the same way as in the LSB-first variant.

Oh, and I’ve also started writing the division by 8 as shift-right by 3, and the mod-8 of our unsigned position as a binary AND with 7. These are equivalent substitutions, and you’ll see both forms in real-world code, so I figured I’d mix it up a little.

This rotate-style masking operation was how we (being RAD Game Tools) used to perform bit reading on the Cell PPU and Xbox 360, purely because the variable-distance shifts were so dismal on that particular core. And I’ll note that this version also has no problems whatsoever with width == 0; the only problem is its dependence on rotate instructions, which are available (and fast) in most architectures, but tend to be somewhat inconvenient to access from C code.

At this point, I’ve talked a lot about many, many different ways to do a bit shifting and masking, and I’ll be referring back to these later. What I haven’t shown you yet is any actual practical bit IO logic—the bit extraction form is a useful starting point, and convenient to explain these concepts in without having to worry about state, but you’re unlikely to ever use it as presented in any production code, not least because it assumes that the entire bit stream is in memory and will gleefully read past the end of the buffer during normal operation!

However, this post is already long enough that WordPress editing is starting to get sluggish on me. So what I’ll do is split this up into a multi-part series; you get to take a breather, and we’ll meet back in the next part. Until then!

Network latencies and speed of light

In this short post I’m going to attempt to convince you that current network (Internet) latencies are here to stay, because they are already within a fairly small factor of what is possible under known physics, and getting much closer to that limit – say, another 2x gain – requires heroics of civil and network engineering as well as massive capital expenditures that are very unlikely to be used for general internet links in the foreseeable future.

This is a conversation I’ve had a few times in my life, usually with surprised conversation partners; last year I happened to write a mail about it that I’m now going to recycle and turn into this blog post so that in the future, I can just link to this if it ever comes up again. :)

When I originally wrote said mail in October 2017, I started by pinging csail.mit.edu, a machine that as far as I know (and Geo-IP services are with me on this one) is actually at MIT, so in Cambridge, MA. I did this sitting at my Windows work machine in Seattle, WA, and got this result:

Pinging csail.mit.edu [128.30.2.121] with 32 bytes of data:
Reply from 128.30.2.121: bytes=32 time=71ms TTL=48
Reply from 128.30.2.121: bytes=32 time=71ms TTL=48
Reply from 128.30.2.121: bytes=32 time=70ms TTL=48
Reply from 128.30.2.121: bytes=32 time=71ms TTL=48

Ping times are round-trip times and include the time it takes for the network packet to go from my machine to the MIT server, the time it takes for the server to prepare a reply, and the time for said reply to make it back to my machine and get delivered to the running ping process. The best guess for a single-way trip is to just divide the RTT by 2, giving me an estimate of about 35ms for my ping packet to make it from Seattle to Boston.

Google Maps tells me the great circle distance from my office to Cambridge MA is about 4000km (2500 miles, if a kilometer means nothing to you). Any network packets I’m sending these days over normal network infrastructure are likely to use either optical fiber (especially for long-distance links) or copper cable as a transmission medium. The rule of thumb I learned in university that effective signal transmission speed over both is about 2/3rds of the speed of light in vacuum. This is called the velocity factor; that article has some actual numbers, which work out to 0.65c for Cat-6A twisted pair cable (used for 10Gbit Ethernet), 0.64c for Cat-5e (1Gbit Ethernet), and 0.67c for optical fiber, all of which are close enough to each other and to the aforementioned 2/3c rule of thumb that I won’t bother differentiating between different types of cables in the rest of this post.

Divide our distance of 2500mi by 2/3c, we get about 4×106m / (2×108 m/s) = 2 × 10-2 s = 20ms. That is the transmission delay we would have for a hypothetical optical fiber strung along the great circle between Seattle and Cambridge, the shortest distance between two points on a sphere; I’m neglecting height differences and Earth being not quite spherical here, but I’m only doing a back-of-the-envelope estimate. Note that the actual measured one-way latency I quoted above is already well below twice that. Hence my earlier comment about even a factor-of-2 improvement being unlikely.

Now, if your goal is actually building a network (and not just a single point-to-point link), you don’t want to have long stretches of cable in the middle of nowhere. Again as per Google Maps, the distance from Seattle to Cambridge actually driving along major roads is about 4800km (3000mi). So in this case, we get about 20% extra “overhead” distance from building a network that follows the lay of the land, goes through major population centers, and doesn’t try to tunnel below the Great Lakes or similar. That’s a decent trade-off when your plan is to have an actual Internet and not just one very fast link. So this extra 20% overhead puts our corrected estimate of transmission delay along a more realistic network layout at about 24ms.

That means that of our approximately 35ms one-way trip, 24ms is just from physics (speed of light, index of refraction of optical fiber) and logistics (not having a full mesh of minimum-distance point-to-point links between any two points in the network). The remaining 11ms of the transit time are, presumably, spent with the packets briefly enqueued in some routers, in signal boosters and network stacks. These are the parts that could be improved with advances in network technology. But even reducing that part of the cost to zero still wouldn’t give us even a full 1.5× speed-up. And that’s why I think mainstream network tech isn’t going to get that much faster anytime soon.

What if we are willing to pay up and lay a private dedicated fiber link for that distance (or use some dark fiber going the right direction that’s already there)? That’s still using mainstream tech, just spending money to reduce the fraction of time spent on sub-optimal routing (physical route that is) and in the network, and it seems likely that you could get the RTT to somewhere between 45ms and 50ms using regular fiber if you were willing to spend the money to set it up (and maintain it).

But that’s still assuming using something as pedestrian as fiber (or copper cable), and actually sticking to the surface of the Earth. Going even further along the “money is no object” scale, and edging a teensy bit into Bond Villain territory, we can upgrade our signal velocity to full light speed and also get rid of pesky detours like the curvature of the Earth by digging a perfectly straight tunnel between the two points, keeping a vacuum inside (not strictly necessary, but might as well while we’re at it) and then establishing a really high-powered microwave link; it would have to be to make it along a distance of a few thousand kilometers, given beam divergence. Keeping in theme, such a link would then also be attractive because the transceivers at either end because a sufficiently high-powered microwave transmitter should make for a serviceable death ray, in a pinch. More seriously, high-powered point-to-point microwave links are a thing, and are used in very latency-sensitive applications such as high-frequency trading. However, as far as I know (which might be wrong – feel free to correct me!), individual segments typically span distances of a few miles, not tens of miles, and definitely not hundreds. Longer links are then built by chaining multiple segments together (effectively a sequence of repeaters), adding a small delay on every split, so the effective transmission speed is definitely lower than full speed of light, though I don’t know by how much. And of course, such systems require uninterrupted lines of sight, which (under non-Bond-villain conditions) means they tend to not work, or only work in a diminished capacity, under bad weather conditions with poor visibility.

Anyway, for that level of investment, you should be able to get one-way trip times down to about 12ms or so, and round-trips of around 24ms. That is about 3× faster than current mainstream network tech, and gets us fairly close to the limits of known physics, but it’s also quite expensive to set up and operate, not as reliable, and has various other problems.

So, summarizing:

  • If you have a good consumer/business-level Internet connection, ping someone who does likewise, and there are no major issues on the network in between, the RTTs you will get right now are within about a factor of 3 of the best they can possibly be as per currently known physics: unless we figure out a way around Special Relativity, it’s not getting better than straight-line line-of-sight light-speed communication.
  • If you’re willing to spend a bunch of money to buy existing dark fiber
    capacity that goes in the right direction and lay down some of your own
    where there isn’t any, and you build it as a dedicated link, you should be able to get within 2× of the speed of light limit using fairly mainstream tech. (So about a 1.5× improvement over just using existing networks.)
  • Getting substantially better than that with current tech requires
    major civil engineering as well as a death ray. Upside: a death ray is its own reward.

Papers I like (part 7)

Continued from part 6.

27. Qureshi, Jaleel et al. – “Adaptive Insertion Policies for High Performance Caching” (2007; computer architecture/caches)

I have written about caches before, so I’ll skip explaining what exactly a cache is, why you would want them and how they interact with the rest of the system and go straight to (some of) the details.

Caches in hardware are superficially similar to hash tables/maps/dictionaries on the software side: they’re associative data structures, meaning that instead of being accessed using something like an array index that (more or less) directly corresponds to a memory location, they’re indexed using a key, and then there’s an extra step where these keys are converted into locations of the corresponding memory cells. But unlike most software hash tables, caches in HW usually have a fixed size and hence “forget” old data as new data is being inserted.

There are a few ways this can work. On one extreme, you have what’s called “direct-mapped” caches. These compute a “hash function” of the key (the key often being a physical or virtual memory address, and the “hash function” most often being simply “take some of the address bits”, hence the quotation marks) and then
have a 1:1 mapping from the hash to a corresponding memory location in some dedicated local memory (mostly SRAM). The important point being that each key (address) has exactly one location it can go in the cache. Either the corresponding value is in that location, or it’s not in the cache at all. You need to store some extra metadata (the “tags”) so you know what keys are stored in what location, so you can verify whether the key in that location is the key you wanted, or just another key that happens to have the same hash code. Direct-mapped caches are very fast and have low power consumption, but they perform poorly when access patterns lead to lots of collisions (which often happens by accident).

At the other extreme, there’s “fully associative” caches. These allow any key to be stored in any location in the cache. Finding the corresponding location for a key means comparing that key against all the tags – that is, basically a linear search, except it’s usually done all in parallel. This is slower and a lot more power-hungry than direct-mapped caches; it also really doesn’t scale to large sizes. The primary advantage is that fully associative caches don’t have pathological cases caused by key hash collisions. When they’re used (mostly for things like TLBs), are rarely bigger than 40 entries or so.

In the middle between the two, there’s “set associative” caches (usually denoted N-way associative, with some number N). In a N-way associative cache, each key can be stored in one of N locations, which collectively make up the “set”. The set index can be determined directly from the key hash, and then there’s a (parallel) linear search over the remaining N items using the tags to determine which item in the set, if any, matches the key. This is the way most caches you’re likely to encounter work. Note that a direct-mapped cache is effectively a 1-way associative cache, and a fully associative cache is a cache where the degree of associativity is the same as the number of entries.

Because caches have a fixed size, inserting a new entry means that (in general) an older entry needs be evicted first (or replaced, anyway). In a direct-mapped cache, we have no choice: there is a single location where each key can go, so whatever item was in that slot gets evicted. When we have a higher number of ways, though, there’s multiple possible locations where that key could go, and we get to decide what to evict to free up space.

This decision is the “insertion policy” (also “replacement policy” or “eviction policy” in the literature – these are not quite the same, but very closely related) in the paper title. The problem to be solved here is essentially the same problem solved by operating systems that are trying to decide which portions of a larger virtual memory pool to keep resident in physical memory, and therefore classic page replacement algorithms apply.

This is a classic CS problem with a known, provably optimal solution: Bélády’s OPT. The only problem with OPT is that it heavily relies on time travel, which is permissible for research purposes but illegal to use in production worldwide, as per the 1953 Lucerne Convention on Causality and Timeline Integrity. Therefore, mass-market hardware relies on other heuristics, most commonly LRU (Least-Recently Used), meaning that the item that gets evicted is whatever was used least recently. (In practice, essentially nobody does true LRU for N>2 either, but rather one of various LRU approximations that are cheaper to implement in HW.)

LRU works quite well in practice, but has one big weakness: it’s not “scan resistant”, meaning it performs poorly on workloads that iterate over a working set larger than the size of the cache – the problem being that doing so will tend to wipe out most of the prior contents of the cache, some of which is still useful, with data that is quite unlikely to be referenced again before it gets evicted (“thrashing”).

One option to avoid this problem is to not use LRU (or an approximation) at all. For example, several of the ARM Cortex CPU cores use (pseudo-)random replacement policies in their caches, which is not as good as well-working LRU, but at the same time better than a thrashing LRU.

This paper instead proposes a sequence of modifications to standard LRU that require very little in the way of extra hardware but substantially improve the performance for LRU-thrashing workloads.

The first step is what they call LIP (“LRU Insertion Policy”): when thrashing is occurring, it’s not useful to keep cache lines around unless we think they’re going to be reused. The idea is simple: replace the least-recently-used cache line as before, but then don’t mark the newly-inserted cache line as recently used, meaning that cache line stays flagged as least-recently used. If there is a second access to this cache line before the next insertion, it will get flagged as recently used; but if there isn’t, that cache line will get replaced immediately. In a N-way associative cache, this means that at most 1/N’th of the cache will get wiped on a scan-like workload that iterates over a bunch of data that is not referenced again in the near future. (It is worth noting here that updating the recently-used state as described makes sense for L2 and deeper cache levels, where the primary type of transactions involves full-size cache lines, but is less useful in L1 caches, since most processors don’t have load instructions that are as wide as a full cache line, so even a pure sequential scan will often show up as multiple accesses to the same cache line at the L1 level.)

LIP is scan-resistant, but also very aggressive at discarding data that isn’t quickly re-referenced, and hence doesn’t deal well with scenarios where the working set changes (which happens frequently). The next step is what the paper calls BIP (“bimodal insertion policy”): this is like LIP, except that the just-inserted cache line is flagged as recently used some of the time, with low probability. The paper just uses a small counter (5 bits) and performs a recency update on insertion whenever the counter is zero.

BIP is a good policy for workloads that would thrash with pure LRU, but not as good as LRU on workloads that have high locality of reference (which many do). What we’d really like to do is have the best of both worlds: use LRU when it’s working well, and only switch to BIP when we’re running into thrashing, which is what the authors call DIP (“Dynamic Insertion Policy”), the “adaptive insertion policy” from the paper title. The way they end up recommending (DIP via “Set Dueling”) is really simple and cool: a small fraction of the sets in the cache always performs updates via pure LRU, and another small fraction (of the same size) always performs updates via pure BIP. These two fractions make up the “dedicated sets”. The cache then keeps track of how many misses occur in the pure-LRU and pure-BIP sets, using a single saturating 10-bit counter, “PSEL”: when there’s a miss in one of the pure-LRU sets, PSEL is incremented. When one of the pure-BIP sets misses, PSEL is decremented.

In effect, the algorithm is continuously running experiments in a small fraction of the cache. When PSEL is small (below 512), LRU is “winning”: there are fewer misses in the LRU sets than in the BIP sets. When PSEL is larger, BIP is in the lead, producing fewer misses than LRU. The rest of the sets (which makes up the majority of them) are “follower sets”, and use whatever insertion policy is currently winning. If PSEL is below 512, the follower sets use LRU, else they use BIP.

And… that’s it. The implementation of this in hardware is a really simple modification to an existing cache module; it requires the two counters (15 bits of storage total), a tiny bit of glue logic, and the only extra decision that needs to happen is whether the LRU status should be updated on newly-inserted cache lines (entries) or not. In particular, the timing-critical regular cache access path (read with no new insertion) stays exactly the same. Most of the relevant bits of the design fits into a tiny block diagram in figure 16 of the paper.

What’s amazing is that despite this relative simplicity, DIP results in a significant improvement on many SPEC CPU2000 benchmarks, bridging two thirds of the gap between regular LRU and the unimplementable OPT, and performing significantly better than all other evaluated competing schemes (including significantly more complex algorithms such as using ARC), despite only having a fraction of the hardware overhead.

Even if you’re not designing hardware (and most of use are not), the ideas in here are pretty useful in designing software-managed caches of various types.

28. Thompson – “Reflections on Trusting Trust” (1983 Turing Award lecture; security)

Ken Thompson and Dennis Ritchie got the 1983 ACM Turing Award for their work on Unix. Ken Thompson delivered this acceptance speech, which turned into a classic paper on its own right.

In it, he describes what is now known as the “Thompson hack”. The idea is simple: suppose you put a really blatant backdoor in some important system program like the “login” command. This is easy to do given the source, but also really blatant – why would anyone run your modified login program of their own volition?

So for the next stage, suppose you don’t actually put the backdoor in the login program itself. Instead, you hijack some program that is used to produce such system programs, like the C compiler. Say, instead of inserting your backdoor directly in “login”, you insert some evil code in the system C compiler that checks whether you’re compiling the login program, and if so, inserts your backdoor code itself. Now, a security review of the login source code will not turn up anything suspicious, but the compiler will still insert the backdoor on every compilation. However, you now put some (even more highly suspicious) code into the C compiler!

For the third and final stage, we can make this disappear as well: the important thing to note is that all major C compilers are, themselves, written in C (“bootstrapping”). Therefore, we can now add some more logic to the backdoor we added to the C compiler in the previous step: we also add some code to the C compiler that adds in our backdoors (including modification of the login program, as well as the changes we’re making to the C compiler itself) whenever the C compiler itself is compiled.

Having done this successfully, we can remove all traces of the backdoor from the C compiler source code: we now have a binary of a modified. evil C compiler that backdoors “login”, and also detects when we’re trying to compile a clean C compiler, producing another evil C compiler with the same backdoors instead. This kind of trickery is impossible to detect via source code review; and if you’re thinking that you would spot it in the binary, well, you can imagine more involved versions of the same idea that also hijack your operating system, file system drivers etc. to make it impossible to see the surreptitious modifications to the original binary.

That is not just idle musing; 33 years after Ken Thompson’s acceptance speech, this problem is more current than ever. Persistent malware that burrows into trusted components of the machine (like firmware) and is undetectable by the rest of the firmware, the OS and everything that runs on the machine is a reality, and the recent security vulnerabilities in the Intel Management Engine present on most processors Intel has shipped in the past 10 years are a reality.

Over the past few years, there have been efforts towards reproducible builds of open-source OSes and trustable (trustworthy?) end-to-end verifiable hardware design, but this is a real problem as more and more of our lives and data moves to computers we have no reason to trust (and sometimes every reason not to). We’re not getting out of this any time soon.

29. Dijkstra, Lamport et. al-“On-the-fly Garbage Collection: an Exercise in Cooperation” (1978; garbage collection)

This is the paper that introduced concurrent Garbage Collection via the tri-color marking invariant. It forms the basis for most non-concurrent incremental collectors as well. No matter where you stand on garbage collection, I think it’s useful (and interesting!) to know how collectors work, and this is one paper you’ll be hard-pressed to avoid when delving into the matter; despite the claim in the paper’s introduction that “it has hardly been our purpose to contribute specifically to the art of garbage collection, and consequently no practical significance is claimed for our solution”, this is definitely one of the most important and influential papers on GC ever written.

I won’t go into the details here; but as with anything Lamport has written or co-authored, his notes on the paper are worth checking out.

30. Lamport – “Multiple Byte Processing with Full-Word Instructions” (1975; bit twiddling)

In my original list on Twitter, I summarized this by saying that “I thought I wouldn’t need to do this anymore when we got byte-wise SIMD instructions on CPUs, and then CUDA came along and it was suddenly profitable on GPUs.”

This style of technique (often called “SWAR” for SIMD-within-a-register) is not new (nor was it in 1975, as far as I can tell), but Lamport’s exposition is better than most, since most sources just give a few examples and expect you to pick up on the underlying ideas themselves. Fundamentally, it’s about synthesizing at least some basic integer SIMD operations from non-SIMD integer arithmetic.

I used to consider this kind of thing a cute hack back when I was first doing graphics with software rendering back in the mid-90s. As my earlier quote illustrates, I thought these tricks were mostly obsolete when most CPUs started getting some form of packed SIMD support.

However, I used this technique in my DXT1 compressor back in 2007 (primarily because I only need a few narrow bit fields and SIMD intrinsic support in most compilers was in a relatively sorry state back then), published the source code, and was soon told by Ignacio Castaño (then at NVidia) that this technique was good a for about a 15% increase (if I remember correctly) in throughput for the NVidia Texture Tools CUDA encoder.

Since then, I have used SWAR techniques in GPU kernels myself, and I’ve also used them to paper over missing instructions in some SIMD instruction sets for multi-platform code – some of this quite recently (within the past year).

Within the last few years, PC GPUs have started adding support for narrower-than-32-bit data types and packed arithmetic, and mobile GPUs have always had them. And no matter what kind of computing device you’re writing code for, sometimes you just need to execute a huge number of very simple operations on tons of data, and a 2x-4x (or even more) density increase over standard data types is worth the extra hassle.

Separately, I just keep bumping into ways to map certain kinds of parallel-scan type operations to integer adder carry logic. It’s never as good as dedicated hardware would be, but unlike specialized hardware, integer adders are available everywhere right now and are not going anywhere.

In short, I’ve changed my mind on this one. It’s probably always going to be a niche technique, but it’s worth checking out if you’re interested in low-level bit hacks, and the wins can be fairly substantial sometimes. You might never get to use it (and if you do, please please write proper comments with references!), but I’ve come back to it often enough by now to consider it a useful part of my tool chest.

The end

And that’s it! It took me a while to get through the full list of my recommendations and write the extended summaries. That said, spacing the posts out like this is probably better to read than having a giant dump of 30+ papers recommended all at once.

Either way, with this list done, regular blogging will resume in the near future. At least that’s the plan.