Skip to content

Why do CPUs have multiple cache levels?

This is a reader question from “jlforrest” that seems worth answering in more detail than just a single sentence:

I understand the need for a cache but I don’t understand why there are multiple levels of cache instead of having just one larger level. In other words, let’s say the L1 cache is 32K, the L2 cache is 256K, and the L3 cache is 2M, why not have a single 32K + 256K + 2M L1 cache?

The short version is that the various cache levels have very large variations in how they are designed; they are subject to different constraints and fulfill different purposes. As a rule of thumb, as you climb up the levels of the cache hierarchy, the caches get larger, slower, higher density (more bits stored per unit area), consume less power per bit stored, and gain additional tasks.

In the hopes of building some intuition here, let’s start with an elaborate and somewhat quaint analogy. That’s right, it is…

Cache story time!

Suppose you’re a white-collar office worker in some unnamed sprawling 1960s bureaucracy, with no computers in sight, and your job involves a lot of looking at and cross-referencing case files (here being folders containing sheets of paper).

You have a desk (the L1 data cache). On your desk are the files (cache lines) you’re working on right now, and some other files you recently pulled and are either done with or expect to be looking at again. Working with a file generally means looking at the individual pages it contains (corresponding to bytes in a cache line). But unless they’re on your desk, files are just treated as a unit. You always grab whole files at a time, even if there’s only one page in them that you actually care about right now.

Also in the office is a filing cabinet (L2 cache). That cabinet contains files you’ve handled recently, but aren’t using right now. When you’re done with what’s on your desk, most of these files will go back into that filing cabinet. Grabbing something from the cabinet isn’t immediate – you need to walk up there, open the right drawer and thumb through a few index cards to find the right file – but it’s still pretty quick.

Sometimes other people need to look at a file that’s in your cabinet. There’s a guy with a cart, Buster (representing a sort of ring bus) who just keeps doing his rounds through all the offices. When an office worker needs a file they don’t have in their private cabinet, they just write a small request slip and hand it to Buster. For simplicity’s sake, let’s just say Buster knows where everything is. So the next time he comes by your office, Buster will check if someone requested any files that are in your cabinet, and if so will just silently pull these files out of the cabinet and put them on his cart. The next time he comes by the office of whoever requested the file, he’ll drop the file in their cabinet, and leave a receipt on the desk.

Every once in a while, Buster notices a requested file isn’t in the cabinet, but on the desk instead. In that case, he can’t just silently grab it; he needs to ask the worker at the desk whether they’re done with it, and if no, that worker and the one who put in the request need to agree on what to do. There are tediously elaborate corporate protocols on what to do in that situation (meetings will be called for sure!).

The filing cabinets are usually full. That means Buster can’t just put a new file in; he has to make space first, which he does by grabbing another file, preferably one that hasn’t been used in a while. Those files, Buster takes to the archive in the basement (L3 cache). In the basement, groups of files are kept densely packed in cardboard boxes on industrial shelving units. The regular office workers don’t get to come down here at all; it’s well out of their way and they’re not familiar with the details of the filing system. We leave that to the archival clerks.

Whenever Buster gets down here, he drops all the old files he grabbed for archival in the “in” tray at the front desk. He also drops off all the request slips for files that aren’t in any of the filing cabinets upstairs, but he doesn’t wait around until the clerks bring the files back, since it takes a while. They’re just going to take the request slips, pick up the file in question, and drop it off in the “out” tray whenever they’re done. So every time Buster comes around, he grabs whatever is in the “out” tray, puts it on his cart and takes it to the recipient when he next comes by.

Now, the problem is, there’s a lot of these files, and even with the efficient packing, they’re not even close to fitting in the basement. Most of the files are kept off-site; this is an office building in a nice part of town, and rents over here are way too high to spend that much space on storage. Instead, the company rents warehouse space 30 minutes out of town, where most of the old files are kept (this corresponds to DRAM). At the front desk of the basement sits Megan, the Head Archivist. Megan keeps track of which files are kept in the basement, and which are warehoused. So when Buster drops his request slips in the “in” tray, she checks which of them correspond to files in the basement (to be handled by the archival clerks) and which aren’t on-site. The latter just get added to a big pile of requests. Maybe once or twice a day, they send a van to the warehouse to grab the requested files, along with a corresponding number of old files to be mothballed (as said, the basement is full; they need to make space before they can store any more files from the warehouse).

Buster doesn’t know or care about the details of the whole warehousing operation; that’s Megan’s job. All he knows is that usually, those request slips he hands to the archive are handled pretty quickly, but sometimes they take hours.

Back to the original question

So, what’s the point of this whole elaborate exercise? Briefly, to establish a more concrete model than an opaque “magic cache” that allows us to think more clearly about the logistics involved. The logistics are just as important in designing a chip as they are in running an efficient office.

The original question was “why don’t we build a single large cache, instead of several small ones”. So if you have say a quad-core machine with 4 cores that each have 32KB L1 data cache, 256KB L2 cache, plus 2MB of shared L3 cache, why don’t we just have a ~3MB shared cache to begin with?

In our analogy: for pretty much the same reason we have individual office desks that are maybe 1.50m wide, instead of seating four different people at a single enormous desk that is 150m wide.

The point of having something on the desk is that it’s within easy reach. If we make the desk too big, we defeat that purpose: if we need to walk 50m to get the file we want, the fact that it’s technically “right here on our desk” isn’t really helping. The same goes for L1 caches! Making them bigger makes them (physically) larger. Among other things, this makes accessing them slower and consume more energy (for various reasons). L1 caches are sized so they’re large enough to be useful, but small enough so they’re still fast to access.

A second point is that L1 caches deal with different types of accesses than other levels in the cache hierarchy. First off, there’s several of them: there’s the L1 data cache, but there’s also a L1 instruction cache, and e.g. Intel Core CPUs also have another instruction cache, the uOp cache, which is (depending on your point of view) either a parallel L1 instruction cache or a “L0 instruction cache”.

L1 data caches gets asked to read and write individual items that are most commonly between 1 and 8 bytes in size, somewhat more rarely larger (for SIMD instructions). Cache levels higher up in the hierarchy don’t generally bother with individual bytes. In our office analogy, everything that’s not on a desk somewhere is just handled at the granularity of individual files (or larger), corresponding to cache lines. The same is true in memory subsystems. When a core is performing a memory access, you deal in individual bytes; the higher-level caches generally handle data wholesale, one cache line at a time.

L1 instruction caches have quite different access patterns than data caches do, and unlike the L1 data cache, they’re read-only as far as the core is concerned. (Writing into the instruction cache generally happens indirectly, by putting the data in one of the higher-level unified caches, and then having the instruction cache reload their data from there). For these (and other) reasons, instruction caches are generally built quite differently from data caches; using a single unified L1 cache means that the resulting design needs to meet several conflicting design criteria, forcing compromises that make it worse at either purpose. A unified L1 cache also needs to handle both the instruction and data traffic, which is quite the load!

As an aside: as a programmer, it’s easy to ignore how much cache bandwidth is needed to fetch instructions, but it’s quite a lot. For example, when not running code from the uOp cache, all Intel Core i7 CPU cores can fetch 16 bytes worth of instructions from the L1 instruction cache, every cycle, and will in fact keep doing so as long as instruction execution is keeping up with decoding. At 3GHz, we’re talking on the order of 50GB/s per core here, just for instruction fetches – though, granted, only if the instruction fetch unit is busy all the time, instead of being stalled for some reason or other. In practice, the L2 cache usually only sees a small fraction of this, because L1 instruction caches work quite well. But if you’re designing a unified L1 cache, you need to anticipate at least bursts of both high instruction and high data traffic (think something like a fast memcpy of a few kilobytes of data with both source and destination in the L1 data caches).

This is a general point, by the way. CPU cores can handle many memory accesses per cycle, as long as they all hit within the L1 caches. For a “Haswell” or later Core i7 at 3GHz, we’re talking aggregate code+data L1 bandwidths well over 300GB/s per core if you have just the right instruction mix; very unlikely in practice, but you still get bursts of very hot activity sometimes.

L1 caches are designed to be as fast as possible to handle these bursts of activity when they occur. Only what misses L1 needs to be passed on to higher cache levels, which don’t need to be nearly as fast, nor have as much bandwidth. They can worry more about power efficiency and density instead.

Third point: sharing. A big part of the point of having individual desks in the office analogy, or per-core L1 caches, is that they’re private. If it’s on your private desk, you don’t need to ask anyone; you can just grab it.

This is crucial. In the office analogy, if you were sharing a giant desk with 4 people, you can’t just grab a file. That’s not just your desk, and one of your other 3 co-workers might need that file right now (maybe they’re trying to cross-reference it with another file they just picked up at the other end of the desk!). Every single time you want to pick up something, you need to yell out “everyone OK if I grab this?”, and if someone else wanted it first, you have to wait. Or you can have some scheme where everyone needs to grab a ticket and wait in line until it’s their turn if there’s a conflict. Or something else; the details don’t matter much here, but anything you do requires coordination with others.

The same thing applies with multiple cores sharing a cache. You can’t just start stomping over data unannounced; anything you do in a shared cache needs to be coordinated with all the others you’re sharing it with.

That’s why we have private L1 caches. The L1 cache is your “desk”. While you’re sitting there, you can just go ahead and work. The L2 cache (“filing cabinet”) handles most of the coordination with others. Most of the time, the worker (the CPU core) is sitting at the desk. Buster can just come by, pick up a list of new requests, and put previously requested files into the filing cabinet without interrupting the worker at all.

It’s only when the worker and Buster want to access the filing cabinet at the same time, or when someone else has requested a file that’s lying on the worker’s desk, that they need to stop and talk to each other.

In short, the L1 cache’s job is to serve its CPU core first and foremost. Because it’s private, it requires very little coordination. The L2 cache is still private to the CPU core, but along with just caching, it has an extra responsibility: deal with most bus traffic and communication instead of interrupting the core (which has better things to do).

The L3 cache is a shared resource, and access to it does need to be coordinated globally. In the office analogy, this worked by the workers only having a single way to access it: namely, by going through Buster (the bus). The bus is a choke point; the hope is that the preceding two cache levels have winnowed down the number of memory accesses far enough that this doesn’t end up being a performance bottleneck.


This article covers one particular cache topology that is matches current desktop (and notebook) x86 CPUs: per-core split L1I/L1D caches, per-core unified L2 cache, shared unified L3 cache, with the cores connected via a ring bus.

Not every system looks like this. Some (primarily older) systems don’t have split instruction and data caches; some have full Harvard architectures that treat instruction and data memory as completely separate all the way through. Often L2s are shared between multiple cores (think one office with one filing cabinet and multiple desks). In this type of configuration, the L2 caches effectively act as part of the bus between cores. Many systems don’t have L3 caches, and some have both L3 and L4 caches! I also didn’t talk about systems with multiple CPU sockets etc.

I stuck with a ring bus because it fits in nicely with the analogy. Ring buses are reasonably common. Sometimes (especially when only two or three blocks need to be connected) it’s a full mesh; sometimes it is multiple ring buses connected with a crossbar (which maps reasonably to the office analogy: a multi-story office building with one “Buster” making his rounds per floor, and an elevator connecting the floors).

As a software developer, there’s a natural tendency to assume that you can magically connect module A to module B and the data just teleports from one end to the other. The way memory actually works is incredibly complicated by now, but the abstraction presented to the programmer is just one of a large, flat array of bytes.

Hardware doesn’t work like that. Pieces aren’t magically connected through an invisible ether. Modules A and B aren’t abstract concepts; they’re physical devices, actual tiny machines, that take up actual physical area on a silicon die. Chips have a “floor plan”, and that’s not a vague nod or an in-joke; it’s an actual 2D map of what goes where. If you want to connect A to B, you need to run an actual, physical wire between them. Wires take up space, and driving them takes power (more the longer they are). Running a bunch of wires between A and B means it’s physically blocking area that could be used to connect other things (yes, chips use wires on multiple layers, but it’s still a serious problem; google “routing congestion” if you’re interested). Moving data around on a chip is an actual logistical problem, and a fiendishly complicated one at that.

So while the office thing is tongue-in-cheek, “who needs to talk to whom” and “how does the geometry of this system look – does it admit a sensible layout?” are very relevant questions for both hardware and overall system design that can have a huge impact. Spatial metaphors are a useful way of conceptualizing the underlying reality.

SSE: mind the gap!

SSE and SSE2 are available in every single x86-family CPU with 64-bit support. You too can play around with SIMD, which is great fun! Unfortunately, SSE2 level in particular also happens to be what is probably the most maddeningly non-orthogonal SIMD instruction set in the world, where operations are either available or not available for particular data types with little rhyme or reason, especially where integers are involved. Later revisions (especially starting around SSE4.1) fill in some of the more annoying gaps, but plenty of us are stuck with supporting the older CPUs for at least a few more years, and besides – not to mess with the authentic SSE experience – even on AVX2-supporting CPUs, there’s still a few of the classic gaps remaining.

So, here’s a list of tricks to get you around some of the more common, eh, “idiosyncrasies” of SSE and its descendants. This happens to be mostly focused on the integer side; the floating-point side is generally less, well, weird. I’ll keep the individual descriptions relatively brief since the whole point of this post is to collect lots of tricks. The assumption here is that you’re already somewhat familiar with the instructions, so I’ll not explain the basics (maybe another time). I’ll use the official Intel intrinsics (as exposed in C/C++) since that’s probably the most common way people interact with these instructions intentionally (awkward glance in the direction of auto-vectorization here. No making eye contact. Moving on.)

Branchless “select” (cond ? a : b)

The natural mode of operation in SIMD computations is to do things branchlessly. If some part of a computation is conditional, rather than doing the equivalent of an if, it’s more typical to do both the computation for the “if” and “else” forks, and then merge the results based on the condition. The “select” I mean is the operation which takes the condition and both results and performs the rough equivalent of C’s ternary operator cond ? a : b. You first evaluate both sides, giving a and b. You then evaluate the condition using a SIMD compare, which returns a vector containing a bit mask that is has all bits set for lanes that meet cond, and all bits clear for lanes that don’t.

This select operation can always be done using a few bitwise operations (which is well known), but starting in SSE 4.1 we get slightly more efficient variants too (less well known, and the reason I mention this):

  • Integer (all vers): _mm_or_si128(_mm_and_si128(a, cond), _mm_andnot_si128(cond, b)).
  • 32-bit float (all vers): _mm_or_ps(_mm_and_ps(a, cond), _mm_andnot_ps(cond, b)).
  • 64-bit float (all vers): _mm_or_pd(_mm_and_pd(a, cond), _mm_andnot_pd(cond, b)).
  • Integer (SSE4.1+): _mm_blendv_epi8(a, b, cond).
  • 32-bit float (SSE4.1+): _mm_blendv_ps(a, b, cond).
  • 64-bit float (SSE4.1+): _mm_blendv_pd(a, b, cond).

The andnot operations don’t come in handy very often, but they’re the best choice here (pre-SSE4.1).

If you don’t want to use cond but its logical negation, just switch the positions of a and b, since (!cond) ? a : b is the same as cond ? b : a.

Unsigned integer compares

SSE, in all incarnations, offers precisely two types of integer comparisons: test for equality (PCMPEQt, _mm_cmpeq_T, where t and T stand for various type suffixes) and test for signed greater-than (PCMPGTt, _mm_cmpgt_T). Most other comparison types can be produced using nothing but logical negation and standard identities:

  • a == b is supported directly.
  • a != b is !(a == b).
  • a > b (signed) is supported directly.
  • a < b (signed) is the same as b > a (swap a and b).
  • a >= b (signed) is !(a < b) (which in turn is !(b > a)).
  • a <= b (signed) is !(a > b).

See previous note on selection operations on how to get rid of the NOT in the most common use case. Conspicuously absent from that list is any type of unsigned ordered comparison. However, a trick that works is to bias both integers so that signed comparison does the right thing:

  • a > b (unsigned, 8-bit) is the same as (a - 0x80) > (b - 0x80) (signed, 8-bit).
  • a > b (unsigned, 16-bit) is the same as (a - 0x8000) > (b - 0x8000) (signed, 16-bit).
  • a > b (unsigned, 32-bit) is the same as (a - 0x80000000) > (b - 0x80000000) (signed, 32-bit).

The same argument-swapping and NOT-ing tricks as above still apply to give you the other compare types. In general, the trick is to add (or subtract, or XOR – they all do the same thing in this particular case) the INT_MIN for the respective type from both operands before doing the compare. This turns the smallest possible unsigned integer, 0, into the smallest possible signed integer for the given type; after that, the ordering works out. In particular, when comparing against a constant, this addition (or subtraction, or XOR) can be baked into the constant operand, so the unsigned compare “only” ends up doing one more operation than a signed compare (instead of two).

A completely different approach is to use the unsigned integer min/max instructions (more about those in a second) to build less-or-equal or greater-or-equal comparisons:

  • a <= b if and only if max(a, b) == b.
  • a >= b if and only if min(a, b) == b.

The good news is that this reduces unsigned comparisons to either an unsigned min or a max, followed by an equality comparison, which is only 2 instead of 3 operations. The bad news is that the requisite unsigned min/max operations only exist for uint8s in SSE2. The uint16/uint32 variants were finally added with SSE4.1; if your minimum target is earlier, you’re stuck with the bias-then-compare variants above.

Integer min and max

SSE4.1 has the full set of integer min/max for 8-, 16- and 32-bit types, both signed and unsigned. So if you’re targeting SSE4.1 or later, good for you!

If you’re stuck with anything earlier, you’re decidedly more limited. In SSE2, you get integer min/max for uint8 and int16. If you need min/max for int8, uint16, or anything 32-bit, you’re on your own.

Luckily, we can just combine some of the techniques above to derive a solution. The general patterns here are:

  max(a, b) == (a > b) ? a : b;
  min(a, b) == (a > b) ? b : a;

So this is just a combination of a compare and a “select” operation. When the compare is signed (the int8 and int32 cases), the comparison maps to a single SSE intrinsic. The unsigned compares (uint16 and uint32) can be solved using the bias-then-signed-compare trick which in turn gives us an unsigned min/max.

32-bit and 64-bit loads/stores

This one has nothing to do with the actual instruction set and everything to do with the intrinsics: yes, SSE2 has 32-bit (MOVD) and 64-bit (MOVQ) loads and stores, the standard intrinsics just do their best to confuse you about it:

  • 64-bit loads are _mm_loadl_epi64. This intrinsic takes a __m128i * as an argument. Don’t take that seriously. The actual load is 64-bit sized, not 128-bit sized, and there is no alignment requirement.
  • 64-bit stores are _mm_storel_epi64. Again, the __m128i * is confusing and does not mean that the actual store is 128-bit or that there are alignment requirements. It isn’t and there are not.
  • 32-bit loads are even more hidden! Namely, you write _mm_cvtsi32_si128(*x) where x is a pointer to a 32-bit integer. No direct load intrinsic, but compilers will turn this into a MOVD with memory operand where applicable.
  • 32-bit stores, likewise: *x = _mm_cvtsi128_si32(value). Now you know.


There’s lots of different ways to provide multiplies in a SIMD instruction set, and by now SSE has tried most of them in one form or another.

Let’s start with the (historically) first variant: multiplying 16-bit numbers. The relevant instructions originated in the Pentium MMX and compute the low and high halves (bottom and top 16 bits) of a signed 16-bit×16-bit product. MMX only has signed multiplies, but SSE also added a “high half of unsigned 16-bit times 16-bit product” instruction (the low halves of signed and unsigned products are identical), so we’re not gonna have to worry about that particular problem, not yet anyway.

These instructions are fine if you want the low or high halves of the product. What if you want the full 32-bit product of vectors of 16-bit values? You compute the low and high halves and then merge them using the “unpack” instructions. This is the standard approach, but not very obvious if you haven’t deal with this kind of thing before. So for a full 16×16→32-bit product (note this produces two vectors worth of results), we get:

  // EITHER: a*b (16-bit lanes), signed
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epi16(a, b);

  // OR: a*b (16-bit lanes), unsigned
  __m128i lo16 = _mm_mullo_epi16(a, b);
  __m128i hi16 = _mm_mulhi_epu16(a, b);

  // THEN: merge results
  __m128i res0 = _mm_unpacklo_epi16(lo16, hi16); // result lanes 0..3
  __m128i res1 = _mm_unpackhi_epi16(lo16, hi16); // result lanes 4..7

But what if you’re working with 32-bit values? There is a 32×32→32-bit product (PMULLD / _mm_mullo_epi32), but it was only added with SSE4.1, and it’s significantly slower than the other SSE2 multiplies in many implementations. So you might either not want to set your minimum target that high, or you might be looking for something quicker.

There’s full 32×32→64-bit products, which are available from SSE2 on as
PMULUDQ/_mm_mul_epu32 (unsigned). SSE4.1 adds the signed equivalent PMULDQ/_mm_mul_epi32 (UPDATE: An older version of this post incorrectly stated that PMULDQ was SSE2. Thanks Exophase for pointing it out!). These ones only compute two products (between the even lanes of the two sources) and place them in a 128-bit result. The odd 32-bit lanes are ignored completely, so if you want four 32×32→32-bit products, you need at least two of these multiplies and a lot of wrangling:

  // res = _mm_mullo_epi32(a, b) equivalent using SSE2, via PMULUDQ.

  // even and odd lane products
  __m128i evnp = _mm_mul_epu32(a, b);
  __m128i odda = _mm_srli_epi64(a, 32);
  __m128i oddb = _mm_srli_epi64(b, 32);
  __m128i oddp = _mm_mul_epu32(odda, oddb);

  // merge results
  __m128i evn_mask = _mm_setr_epi32(-1, 0, -1, 0);
  __m128i evn_result = _mm_and_si128(evnp, evn_mask);
  __m128i odd_result = _mm_slli_epi64(oddp, 32);
  __m128i res = _mm_or_si128(evn_result, odd_result);

It works, but it’s a mouthful.

But what if you’re using 32-bit vector lanes, but happen to know that the numbers we’re trying to multiply are in fact in the range [-32768,32767] (i.e. representable as signed 16-bit integers)? We could try narrowing the 32-bit lanes into 16 bits then using the 16×16→32 sequences above, but is that really the best we can do?

It is not: PMADDWD (_mm_madd_epi16), MMX/SSE2’s amazing and strange (but mostly amazing) dot product operation, has our back, for we can do this:

   // a and b have 32-bit lanes with values that fit in int16s.
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // clears high 16 bits of every 32-bit lane
   __m128i bm = _mm_and_si128(b, _mm_set1_epi32(0xffff));

   // after this, madd_epi16 does what we want!
   __m128i res = _mm_madd_epi16(a, bm);

   // can swap role of a and b above too, when convenient.

That’s a lot shorter than narrowing to 16-bit first would be! Alas, it only works for int16 (signed). What if we’re working in 32-bit lanes with values that fit inside a uint16 (unsigned)? It’s not quite as slick, but still, better than narrowing to 16-bit first or dealing with the logistics when synthesizing 32×32→32-bit muls from PMULDQ/PMULUDQ:

   // a and b have 32-bit lanes with values that fit in uint16s,
   // i.e. a[i] == (uint16)a[i] and same for b[i].
   // produces the 32-bit result
   //   res[i] = a[i] * b[i]

   // compute low and high 16-bit products
   __m128i lop = _mm_mullo_epi16(a, b);
   __m128i hip = _mm_mulhi_epu16(a, b);

   // merge results
   __m128i res = _mm_or_si128(lop, _mm_slli_epi32(hip, 16));

Horizontal adds, dot products etc. (float)

SSE3 adds horizontal adds HADDPS (_mm_hadd_ps) and HADDPD (_mm_hadd_pd) and SSE4.1 throws in the dot-product instructions DPPS (_mm_dp_ps) and DPPD (_mm_dp_pd).

Generally, don’t expect these operations to be magic. They exist in the instruction set but are fast precisely nowhere; in all x86 implementations I’m familiar with, they just turn into a canned sequence of more basic (SSE2-level) operations. So more often that not, you will end up requiring a higher minimum CPU target for little to no speed gain. Caveat: these instructions are a smaller than their replacement instruction sequence, so using them can reduce code size slightly. But still, don’t expect this to be fast.

If you want good SIMD performance, don’t lean on horizontal and dot-product style operations; process data in batches (not just one vec4 at a time) and transpose on input, or use a SoA layout to begin with.

The other kind of horizontal adds, dot products etc. (integer)

SSE does have a bunch of horizontal add and dot product-style operations that don’t suck, but they’re on the integer pipe, and not what you’d expect.

Nope, not PHADDW/PHADDD (_mm_hadd_epi16/_mm_hadd_epi32). These are SSSE3 and later only and OK but not particularly great (similar disclaimer as for the floating-point horizontal adds applies).

No, I’m talking about PMADDWD (_mm_madd_epi16, SSE2 with its ancestor around since the original MMX), PSADBW (_mm_sad_epu8, SSE2) and PMADDUBSW (_mm_maddubs_epi16, SSSE3). The official manuals describe what these instructions do, so I won’t bother going into too much detail, but here’s the basic idea: PMADDWD and PMADDUBSW are 2-element dot-product instructions between pairs of adjacent SIMD lanes. PMADDWD computes two int16 by int16 products for each pair of 16-bit lanes and sums the 32-bit integer products to yield the 32-bit result lanes. PMADDUBSW computes two uint8 by int8 products for each pair of 8-bit lanes and sums the 16-bit integer products to yield the 16-bit result lanes. These can be used to compute dot products of this problem; but they also have “degenerate” configurations that are very useful:

  • _mm_madd_epi16(x, _mm_set1_epi16(1)) sums the 16-bit even and odd lanes of x in pairs to yield 32-bit results.
  • _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), _mm_setr_epi8(1, -1, 1, -1, ..., 1, -1)) happens to be the fastest way to compute the 16-bit signed differences between 8-bit unsigned vectors a and b on processors that support SSSE3.
  • The 16-bit multiply example above shows another special configuration.

Long story short, these dot product instructions are surprisingly versatile in decidedly non-obvious ways.

Finally, PSADBW (_mm_sad_epu8, SSE2). This one is intended for motion estimation in video codecs, but it also happens to be the one actually really fast horizontal add you get on x86. In particular, _mm_sad_epu8(x, _mm_setzero_si128()) computes two 16-bit horizontal sums of groups of 8 uint8 lanes in a single, and quite fast, operation. We can do the same trick we did for compares in reverse to compute the sum of 8 int8s instead: add (or subtract, or XOR) _mm_set1_epi8(-128) to x (before the PSADBW), the subtract 128×8 from the resulting 16-bit sums.

To be continued!

There’s a lot more of these, but this feels like enough to chew on for a single blog post. So there will be a sequel covering, at least, integer widening/narrowing and variable shifts. Until then!

Repeated match offsets in BitKnit

I spent the majority of last year working on LZ77-style codecs. I’ve written about some results before. But there were also several smaller (in scope) but still quite neat discoveries along the way.

One of them has to do with repeated match offsets. BitKnit was originally designed for Granny files, which usually contain 3D meshes, animations, sometimes textures, and can also store other user-defined data. As far as a compressor is concerned, Granny files are highly structured, mostly consisting of a few large, homogeneous arrays of fixed-size records.

Repeated match offsets

Often, there is significant correlation between adjacent records, for various reasons. What this means in a LZ77-style dictionary compressor is that there will usually be a lot of matches with a match distance (or match offset) that is a small integer multiple of the record size, and matches with the same offset tend to clump together.

The way LZ77 compressors typically exploit this fact is by reserving special codes for “reuse a recent match distance”. To my knowledge, this technique first appeared in LZX, which keeps a 3-element cache of recent match offsets with a LRU eviction policy. The basic idea seems to have spread from there. Many compressors (too many to list) reserve at least a single special, cheaper code to send another match with the same offset as the previous one (corresponding to a 1-element “cache”). This, among other things, gives a cheaper way to code “gap matches” (a match that resumes after being interrupted by a few mismatching bytes) and appears to be beneficial on most types of data.

On text and data that skews towards variable-size records, having extra codes for more repeated match offsets doesn’t help much, if at all (at least, they don’t seem to hurt, either). However, on data heavy on fixed-size records, it is often a big win. LZX, as mentioned before, has 3 “repeat offset slots”. LZMA uses 4. Several experiments early in the design of BitKnit indicated that at least for the highly structured Granny files it was designed for, there was a good case to be made for having even more repeat offset slots. We re-evaluated this several times, but a repeat offset count of 8 made it into the final codec; essentially, having a larger number of offset slots allows us to “get away” with an overall less sophisticated offset coder (reducing compression, but improving decoder speed), and is a very solid win on the highly structured data that was the target use case.

Experiments with lots of repeat match offsets

Two interesting problems arise from making the “repeat offset cache” this large. First, 8 entries is large enough that it’s worth thinking about different algorithmic variants. Second, at that size, it makes sense to investigate different eviction policies as well as other strategies such as maybe “pinning” a few match distances that we expect to be useful (for example, multiples of the record size in front of homogeneous sections).

Second part first: the effect of either “preloading” or pinning useful match distances was either “in the noise” (almost any change to a LZ encoder using adaptive models will make some files larger and others smaller simply due to getting a slightly different parse) or strictly worse in all our tests. Considering how much interface complications this implies (the pre-loaded offsets for different sections need to get to the compressor somehow, and they either need to be known in the decoder from other sources or stored in the compression stream, reducing the gains even further) that makes the idea fairly uninteresting. Empirically, at least in our tests, LZ compressors find useful match distances quickly, and once they’re in the cache, they tend to stick around. Since such a cache is naturally adaptive to the data (whereas static pinned offsets are not), keeping them fully dynamic seems like a good idea.

The next test was to decouple the eviction policy from offset modeling. LZX, LZMA etc. always keep their list of recent match offsets in MRU order: slot 0 is the most recently used offset, slot 1 is the second-most recent, and so forth. One experiment we tried with a very early version of BitKnit was a variant I dubbed “stable index MRU”: offsets are still evicted on a LRU basis, but instead of shuffling the indices around on every match so that the new most recent match gets offset 0, new offsets would get inserted into the least recently referenced slot without moving the slot IDs around.

This affects the modeling; before, you have a very skewed distribution: slot 0 (most recent) is much more important than slot 1, which is more important than slot 2, and so forth. After, they are more spread out; but the idea was that in highly structured files where the same few offsets stick around for a fairly long time, you might capture more useful correlations by keeping these offsets in a single spot (which the entropy coder then tried to capitalize on).

Here were the results on a few granny files, listing the compressed sizes in bytes, with “anchor” being a x86 executable file that doesn’t have a significant amount of record-structured data in it, as a baseline. “MTF” refers to move-to-front index update policy, “stable” is the stable-index variant just described.

Configuration granny1 granny2 granny3 anchor
4 offsets, MTF 18561994 22127519 15156085 1300427
4 offsets, stable 18533728 22261691 15177584 1300980
8 offsets, MTF 17825495 21746140 14800616 1300534
8 offsets, stable 17580201 21777192 14819439 1304406
12 offsets, MTF 17619775 21640749 14677706 1301007
12 offsets, stable 17415560 21448142 14681142 1306434
16 offsets, MTF 17554769 21523058 14600474 1300984
16 offsets, stable 17341154 21462241 14669793 1308508

First off, as you can see from these experiments, going from 4 to 8 repeat match offsets really does help significantly on these files; an extra 1.5%-4% reduction in file size may not sound like much, but it’s a fairly big deal in compression terms. The experiments with even more repeat offsets were mainly to get a feel for when we start to hit diminishing returns; also, as you can see, the compressed size for the “anchor” file (which is not record-structured) doesn’t seem to care much about the difference between 4 and 8 repeat offset slots, and gets worse after.

As for stable-index coding, well, it’s a mixed bag. It does help on some files, and on the files that do seem to improve from it, it’s a bigger win when using more offset slots, but on e.g. “granny3” and “anchor” it was a net negative. Interesting experiment, but it didn’t go in.

Insertion policy

Another experiment we ran was on insertion policy. Specifically, our hypothesis was that at least on highly structured data (where the repeat offsets really help), we really want to make sure the important offsets stay “in front”. But occasionally, you will still get other matches that “don’t fit the pattern”. The problem is that this puts some other random offset in front that will then slowly “slide down” and meanwhile cause our actually important offsets to be more expensive to code.

This is more of a problem with greedy LZ parsers (which make their decisions locally); optimizing parsers (which usually try to look ahead by a few kilobytes or so) are better at correctly estimating the cost of “disrupting” the set of offsets. Either way, it’s annoying.

We tried a couple different approaches with this; the best overall approach we found in our tests was to stick with a basic MTF coding scheme and LRU eviction policy (bog-standard in other words), but distinguish between updates caused by repeat matches and those caused by inserting a new offset not currently in the repeat offset set. The former (repeat matches) just do a full move-to-front step, as usual. The latter don’t; instead of inserting a new offset all the way in front, we insert it further back. If it then gets reused a second time, it really will be moved all the way to the front of the list; but if it doesn’t get referenced again, it will drop out more quickly and with less disruption of the remaining repeat offsets.

Here’s the batch of test results, from the same compressor version. “kSlotNew” is the slot where new offsets are inserted; 0 corresponds to inserting in front (regular MTF), 1 is the second position, and so forth.

Configuration granny1 granny2 granny3 anchor
4 offsets, kSlotNew=0 18561994 22127519 15156085 1300427
4 offsets, kSlotNew=1 18450961 22153774 15154609 1304707
4 offsets, kSlotNew=2 18118014 22000266 15181128 1305755
4 offsets, kSlotNew=3 17736663 22002942 15209073 1307550
8 offsets, kSlotNew=0 17825495 21746140 14800616 1300534
8 offsets, kSlotNew=4 17327247 21546289 14771634 1305128
8 offsets, kSlotNew=6 17197347 21425116 14713121 1305588
16 offsets, kSlotNew=0 17554769 21523058 14600474 1300984
16 offsets, kSlotNew=14 17122510 21177337 14578492 1305432

We can see that the anchor prefers pure MTF, but the Granny files definitely see a win from not moving new offsets all the way to the front the first time they’re seen. There were a few more tests than the one shown, but in general, inserting new offsets in the second-to-last slot seemed like a good rule of thumb for the Granny files.

This one is definitely more contextual. As you can see, different types of files really prefer different settings here. BitKnit went with 8 offsets and insertion at the second-to-last slot (corresponding to the “8 offsets, kSlotNew=6” row above), because it produced the overall best results on the data it was designed for. (As evaluated on a larger test set not shown here.)

So, this is fairly neat, and a comparatively major win over the baseline 4 offsets and insert-in-front variant (a la LZMA) for the data in question. Now how to implement this efficiently?

Implementation notes

The basic implementation of the offset maintenance logic in the decoder is dead simple. You just keep an array of recent offsets and shuffle it around with something like this:

if (is_repeat_match) {
    // move slot "rep_idx" to front.
    // this involves grabbing the offset at the corresponding
    // location and then sliding everything before that position
    // down by one slot.
    tmp = offsets[rep_idx];
    for (uint i = rep_idx; i > 0; --i)
        offsets[i] = offsets[i - 1];
    offsets[0] = tmp;
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements.
    offsets[kNumReps - 1] = offsets[kNumReps - 2];
    offsets[kNumReps - 2] = newOffset;

This works just fine, but it has a lot of data-dependent branches in the repeat match case, which is a performance trap in decompressors; generally speaking, you want to avoid branching on data you just read out of a bitstream, because it tends to be relatively high entropy and thus cause a lot of branch mispredictions, which are expensive.

One way to fix this is to add several entries worth of padding in front of the actual used part of offsets, and always copy the same number of entries in the “sliding down” phase. This gets rid of the data-dependent branches and makes it easy to unroll the loop fully (since the trip count is now constant) or express it using a few unaligned SIMD loads/stores (where supported).

However, BitKnit uses a different approach derived from our earlier experiments with “stable index MRU” that doesn’t need anything beyond regular integer arithmetic. The basic idea is to leave the offsets array alone; instead, we keep a secondary “data structure” that tells us which logical “repeat offset” list position corresponds to which index in the offsets array.

I write “data structure” in quotes because that information is actually stored in (drum roll)… a single 32-bit unsigned integer! Here’s the idea: we have a uint32_t mtf_state that represents the current offset permutation. It does this by storing the offset array index for the i’th logical repeat offset slot in the i’th nibble (numbered starting from the LSB upwards). At initialization time, we set mtf_state = 0x76543210, the identity mapping: the logical and actual offset indices coincide.

Why does this help? Because the fundamental operation for move-to-front processing is moving a bunch of offsets “one slot down” in their array position. If they’re separate integers, that means either a lot of copying, or less copying but using much wider (e.g. SIMD) instructions. Our array of 4-bit indices is compact enough that 8 indices fit inside a single 32-bit uint; we can slide them all “down” or “up” using nothing but a single bit shift. Now, our code above doesn’t actually move all elements, just the ones at position ≥rep_idx; but that turns out to be easily remedied with some bit masking operations.

So the alternative variant is this:

if (is_repeat_match) {
    // move slot "rep_idx" to front by permuting mtf_state. first,
    // determine the offset slot ID at that position in the list
    uint32_t rep_idx4 = rep_idx*4;
    uint32_t slot_id = (mtf_state >> rep_idx4) & 0xf;
    match_offs = offsets[slot_id]; // decoder needs this later!

    // moved_mtf: slide down everything by one slot, then put
    // "slot_id" in front.
    uint32_t moved_mtf = (mtf_state << 4) + slot_id;
    uint32_t keep_mask = ~0xf << rep_idx4; // bits that don't move
    mtf_state = (mtf_state & keep_mask) | (moved_mtf & ~keep_mask);
} else {
    // implement the "insert in second-to-last position"
    // rule, which touches exactly two elements. this is easier
    // to do by just modifying the offsets directly.
    uint32_t last = (mtf_state >> ((kNumReps - 1)*4)) & 0xf;
    uint32_t before_last = (mtf_state >> ((kNumReps - 2)*4)) & 0xf;

    offsets[last] = offsets[before_last];
    offsets[before_last] = newOffset;

It’s a bit of integer arithmetic, but not a lot, and there’s no dependence on vector instructions, fast unaligned memory access, or in fact anything outside of standard C/C++. BitKnit uses a 32-bit mtf_state to implement an 8-entry LRU cache. Using 64-bit values (and still using nibbles to store array indices), the exact same approach (with essentially no modifications to the source save for type names) can manage a 16-entry LRU.

An 8-entry LRU actually only needs 24 bits (when storing array indices in groups of three bits instead of nibbles), but that’s not a very useful size. A 4-entry LRU state fits in 4*log2(4) = 8 bits, which is nice and compact, although for 4 entries, this way is generally not a win (at least in our tests).

And now this is time to come clean: I kinda like this approach, and it’s the real reason I wrote this whole thing up. I probably would’ve still written it up even if it hadn’t turned out to be useful in practice, but it did, which is always a nice bonus.

Finally, over the years, I’ve found a few instances like this where packing a small “data structure” (using the term loosely) inside a single register-width integer produces interesting results. There’s a good chance I’ll write about more in the future! Until then.


I dislike the way many (most?) people seem to conceptualize “smartness” or intelligence in others, because I feel it misses the mark in two separate, important ways.

1. Many of the things most people consider “intelligence” are in fact acquired (or at least acquirable) skills

You think someone being “smart” means they automatically can do things you can’t, and will never be able to learn, so there’s no point in even trying? Maybe, but it’s generally unlikely.

She has a phenomenal memory for facts and can just rattle them off? Must be eidetic memory, right? Actually, probably not. You too can improve your memory for abstract facts greatly by learning mnemonic techniques, if you want to. More so than you probably think.

He is great at mathematical problem-solving? Some of that requires genuine insight, sure. A lot of it is just pattern matching (which takes mainly familiarity and practice), some fairly general problem-solving heuristics that help you if you’re stuck (if you don’t know that book and want to become better at math, just buy it or lend it at a library!), and enough patience and stamina to keep going.

And so forth. Now I don’t mean to suggest that all that stands between you and a Nobel prize is three self-help books, a week of work and some autosuggestion! Anyone who claims that is a crank trying to sell you something (probably self-help books). But many people “don’t understand science” or “are just not smart” or “just don’t get math” in the same way that I am terrible at pole vaulting: not only do I not possess the skill, I also have never once seriously tried it or made an effort to become better at it in my life!

Which brings me to my second and more important point.

2. A lot of “being smart” actually consists of getting comfortable with feeling stupid

I knew a few people back in my early teens who were Mensa members and made sure everyone knew. They didn’t really do so well in the medium and long term. The problem was that they were brilliant, they knew it, and so they never really learned how to work for something; when they ran into a problem they didn’t immediately see how to handle, they would quickly give up in frustration.

Guess what; many of the problems you will actually face, both professionally and personally, cannot be solved using brilliance. They just take effort and stamina. And those that can benefit from brilliance…. well, usually we don’t really know how to solve them yet.

Most schools teach you well-known solutions to well-known, well-specified problems. And standardized IQ tests likewise ask clear questions with known “right” answers. Being good at that is a particular (and somewhat peculiar) skill. Real-world problem solving is mostly about heuristic solutions to messy, unclear, unfamiliar problems, usually subject to random external constraints, frequently not all satisfiable at once. And it tends to make you feel stupid.

This is normal, and it has been said better elsewhere, for example in “The importance of stupidity in scientific research”. I’ll quote a bit from it, but really you should just read the whole essay, it’s pretty short.

I recently saw an old friend for the first time in many years. We had been Ph.D. students at the same time, both studying science, although in different areas. She later dropped out of graduate school, went to Harvard Law School and is now a senior lawyer for a major environmental organization. At some point, the conversation turned to why she had left graduate school. To my utter astonishment, she said it was because it made her feel stupid. After a couple of years of feeling stupid every day, she was ready to do something else.

I had thought of her as one of the brightest people I knew and her subsequent career supports that view. What she said bothered me. I kept thinking about it; sometime the next day, it hit me. Science makes me feel stupid too. It’s just that I’ve gotten used to it. [..] But high-school and college science means taking courses, and doing well in courses means getting the right answers on tests. If you know those answers, you do well and get to feel smart.

A Ph.D., in which you have to do a research project, is a whole different thing. For me, it was a daunting task. How could I possibly frame the questions that would lead to significant discoveries; design and interpret an experiment so that the conclusions were absolutely convincing; foresee difficulties and see ways around them, or, failing that, solve them when they occurred? My Ph.D. project was somewhat interdisciplinary and, for a while, whenever I ran into a problem, I pestered the faculty in my department who were experts in the various disciplines that I needed. I remember the day when Henry Taube (who won the Nobel Prize two years later) told me he didn’t know how to solve the problem I was having in his area. I was a third-year graduate student and I figured that Taube knew about 1000 times more than I did (conservative estimate). If he didn’t have the answer, nobody did.

That’s when it hit me: nobody did. That’s why it was a research problem. And being my research problem, it was up to me to solve.

This quote is talking about academic research, but the same thing applies elsewhere. I’ve done programming, I’ve done research, and I’ve done art (in the form of PC demos). What all three have in common is that most of the people I know and respect in those disciplines spend the majority of their time feeling like idiots and talentless hacks. Impostor syndrome is the norm.

Being “smart” is not actually about knowing all the answers. One of the biggest parts is being aware of the limits of your knowledge and not running around like a headless chicken when you don’t know what to do. And it’s about being wrong a lot of the time, realizing the fact, and taking steps to be slightly less wrong next time round.

Thoughts on The Witness

I have been playing The Witness since it came out this Tuesday. The Witness is what is probably best described as a “puzzle game”, and if you haven’t heard about it yet and that sounds at all interesting, I encourage you to look at the release date trailer (and some reviews if you have to), buy it and stop reading this. Just come back after you’ve played a few hours.

I will not spoil anything in here; what I will say is that I’ve been playing it for something like 30-40 hours so far (I haven’t been keeping track exactly), and that it resonates incredibly strongly with me (no doubt in part because its preoccupations match my own). It’s hard to compare to any other game I’ve played because it really is in a category (or “genre” if you want to frame it in marketing terms) of its own; I do not mean this as some kind of hyperbole, but in a literal sense: it does not really make sense to me to directly compare The Witness to most puzzle games, because it’s fundamentally trying for something different. But let’s back up a bit.

Puzzle games

I will not talk about the puzzles in The Witness, but to explain how they are different from other games, let’s talk about a few different examples in that class. One example would be various types of matching games; say the immense number of match-three games (Bejeweled, Candy Crush, you name it), but also games such like Tetris and Dr. Mario. These games have simple rules and emphasize speed; with sufficient practice, the game experience is one of continuous flow, a detached state where you just intuitively keep going without thinking about individual moves until eventually you either win, aren’t fast enough or the random number generator just screws you over. Let’s call this type of game “flow-based” for the purposes of this article.

I like these games, a lot. I still play 10-15 minutes of Tetris essentially every day (have been for years), and just discovered to my horror that my Steam play time for Bejeweled 3 is 257 hours (holy crap, that’s a lot of hours!) — although to my defense, I mainly tend to play that kind of game to have something to do with my hands while I’m listening to podcasts or similar and my attention is elsewhere (but still, man, 257 hours).

A second type is what’s commonly called “logic puzzles”. A well-known example would be Sudoku. These have a set of constraints (in Sudoku, “all 9 digits must appear in every row, column, and delineated 3×3 sub-square”) and a goal (“fill out all the cells”) as part of their rule set. A Sudoku puzzle is then a particular starting configuration (only some cells filled), and you use deductive reasoning to proceed from there to a full solution, initially very much step-by-step. Over time, as you gain proficiency, you start to observe certain recurring patterns and turn them into general inference rules; as you do so, your gameplay experience shifts into spurts of “flow mode” (where you just apply general rules you learned) interrupted by deductive reasoning at “choke points”.

This kind of game, I also like a lot. One particular (probably not that well-known but whatever) example would be “Everyday Genius: SquareLogic”, essentially a modified version of Sudoku with a lot more rules (and different puzzle types) that is specifically designed so that solving a puzzle never needs trial-and-error or backtracking. I mention here because it’s the second place on my Steam all-time playtime stats at 148 hours (narrowly beating out Civilization IV at 141 hours; it’s not all puzzle games!).

Then there’s puzzle games that actually include a manifestation of the player character in some way, and involve actually controlling that character directly in a more typical video game fashion, focusing on the interaction between the player and the world. Let’s call them “motion-based” for the purposes of this post. One classic example that’s still turn-based and essentially a logic puzzle is Sokoban. The more typical example is games which require both puzzle-solving to figure out what to do and skilled execution; e.g. puzzle platformers (like Jon Blow’s previous game, “Braid”) or games such as Valve’s Portal series. At the extreme, you have games that are still a puzzle (or at least “puzzling”), but are relatively easy to figure out, with all the difficulty being in the execution, for example Kaizo Mario World.

That latter example is not for me, but again, puzzle platformers and spatial puzzle games, I like. (Did I mention I like puzzle games?)

Final example: Rubik’s Cube. This one’s kind of interesting, because it ships in solved form. In that configuration, it can’t really be called a puzzle; it’s more of a toy. But it turns into a puzzle the instant you apply a sequence of moves and forget what exactly you did, so you can’t undo them; a Rubik’s Cube is a great visual aid if you ever needed to convince somebody that cleaning up a mess can be orders of magnitude harder than making one. (Alas, this does not seem like a particularly difficult argument to make even without such showmanship).

The fascinating thing about Rubik’s Cube is that when it first shipped in 1977, nobody knew how to solve it, and is somewhat notorious for vastly understating its difficulty on the packaging. Initially marketed as having “over 3 billion combinations but only one solution”, the actual number of states a cube can reach from the starting configuration is in fact about 4.325*1019 – 43.25 quintillions, which is 43.25 billion billions. And by any objective measure, solving a Rubik’s Cube from scratch is hard. The first published general algorithmic solution I’m aware of was David Singmaster’s in 1981, after the Cube was already being sold for 4 years! Believe it or not, part of the maths underlying Rubik’s Cube is still an active research subject. For example, the question of how many moves were required to solve the Cube in the face-turn and quarter-turn metrics (20 and 26, respectively) was open until very recently (2010 and 2014, respectively).

Today, most Cubes ship with a folded flyer or instruction booklet that states a solution algorithm, and there are speed-cubing competitions; the fastest speed-cubers, as of this writing, can solve an “average” (randomly scrambled) cube in about 6.5 seconds. Speed-cubers use relatively complicated algorithms that mostly rely on visual pattern-matching to figure out which one of a long list of lengthy memorized move sequence to perform. And thus, after nearly 40 years, Rubik’s Cube has come all the way from a literally unsolved problem that took serious research, to being yet another flow-based puzzle that people play as a competitive game. What people today do when they solve the Rubik’s Cube with known algorithms in “flow mode” bears little resemblance to the experience puzzle-buyers in 1980 would have had when trying to grapple with the cube in “discovery mode”.

Which, at long last, brings me back to The Witness.


Discovery, in more than one sense, is what The Witness is all about (hence everyone’s insistence to please avoid spoilers). You find yourself on an island and have to figure out what to do. The game does not tell you what to do. It cannot tell you what to do without defeating its own purpose. For example, logic puzzle games will tell you the rules and leave you to figure out how to apply them successfully, or efficiently.

That is, fundamentally, not what The Witness is interested in. What The Witness instead tries to do is, essentially, recreate that moment in the late 70s and early 80s when the Rubik’s Cube was out, but nobody really knew what to do with it yet. It was this intriguing object with certain mechanics that let it move in some ways but not others; some of the configurations are nice and symmetric and satisfying, others are a mess, or at least appear that way (though they may be just a few moves away from being solved, if you know the right thing to do!). But it’s not clear how to solve it at all. (In fact the only reason original buyers of the cube had to believe that it was even solvable was that it shipped in solved state, and every move they make is obviously reversible).

The Witness is difficult, no doubt, but to put it into perspective, none of the puzzles in The Witness are anywhere near as hard as the Rubik’s Cube; the hardest I’ve encountered so far (and I’m now in a state where I could start the endgame if I wanted to, which so far I don’t) are maybe as hard as a “hard” Sudoku puzzle you would find on a website or in a magazine (though it’s hard to compare, obviously), provided that you know the rules.

The primary source of actual difficulty in The Witness is exactly this – for most of the game, you’re not quite sure about the rules. You discover them as you play the game, and with alarming regularity, you run into a challenge that seems to make no sense (or be impossible to solve) given what you know so far, forcing you to re-examine your assumptions about the game world and what you think the rules are. Where most conventional puzzle games give you some knowledge front-loaded and leave you to figure out the implications, The Witness is far more interested in how knowledge is formed than how it is applied.

To some players, this evidently feels like the game is intentionally messing with them, being deliberately vague and then getting annoyed with them when they get it wrong. This is unfortunate; but really the game generally goes out of its way to avoid bottlenecking you on a single puzzle that eludes you, and if you’re stuck on a particularly difficult problem, there’s usually another, simpler puzzle elsewhere that lets you figure out things more gradually. There are plenty of things to do at any given time, and while any individual idea might not be obvious from the puzzle you’re looking at, rest assured that for every concept in the game, there are plenty of puzzles allowing you to discover and understand its meaning.

But why do this in the first place? Simply said, because the joy and satisfaction of figuring out the rules, of realizing the thing that you’ve been missing even though it’s been in front of you the whole time, is far greater than the more mechanical pleasure of becoming good at solving any particular kind of puzzle well that is the bread and butter of most puzzle games (though no worries, The Witness does give you enough of that satisfaction as well). The Witness is a game about discovery, careful observation and, most of all, epiphany—that sudden feeling of clarity as you realize something and suddenly everything clicks into place. It may seem distant and withholding at first, but it only does what it needs to do to truly let you feel the exhilaration of actually discovering something about the world. Where other games all too often tell you exactly what to do and then pat you on the back as soon as you accomplish some trivial task. The Witness respects you enough to simply trust that you are smart enough to figure it out, and never talks down to you.

One complaint I’ve heard from a few players boils down to the game being very stingy with any kind of tangible rewards. The aforementioned pats on the back are, indeed, conspicuously absent; usually, your reward (if any) for solving puzzles is just… more puzzles! All I can say on the subject is this: as I’ve been trying to explain, The Witness is a game trying to evoke the joy of discovery, which is in itself rewarding. If it’s not working for you (fair enough!) and you need external motivators to string you along, if the game feels just like a chore to be completed that needs some carrot along with the stick, then it’s evidently not working for you, and you should spend your time doing something else; provided you’ve played at least for two hours or so, your experience is broadly representative. If that’s not doing it for you, then by all means, stop.


I have to admit that I was a bit worried about this going in. I really liked Jon Blow’s previous game, “Braid”, and found its mechanics and gameplay very satisfying, but the story elements, though interesting, never clicked for me.

I need not have worried. The Witness does not have a story as such (at least not as far as I’ve played it!), but it does have themes, and the audio logs and other narrative elements scatter over the island reinforce the very same themes already present in the game play: The Witness is mechanically a game about clarity and persistence, about discovery, false alleys and doubt, about perspective shifts and epiphanies. These are exactly the themes of “story elements” (if you want to call them that) as well: they are, primarily, quote from scientists and philosophers on those very themes. And it does let them make their point in full; rather than feeding you a quick sound bite, these recordings try to give you enough context to truly stay faithful to the source material. Not exactly light fare, but all of it is very much coherent with the rest of the game. (Though if you don’t care, fair enough; just skip them, they are not important to your progression at all.)

The Witness is not being coy with you when it hands you these recordings; these are not hidden messages, nor ciphers to be decoded. In showing you what it does, the game is simply wearing its heart on its sleeve. You may care for it or not, but calling it “pretentious” as some do seems to miss the point entirely to me. When the game is quoting philosophers and scientists, it is not trying to put on airs, nor trying to bask in reflected glory; there is no pretense here. The themes present in the narration are precisely the themes explored by the game mechanics themselves.

On that note, one further observation: as mentioned before, Braid’s story didn’t really do it for me, but it’s very interesting to contrast with The Witness. Braid’s story revolves around themes of narcissism and obsession; about the protagonist, Tim, literally warping the world (via the time-manipulation that is Braid’s core mechanic) to get what he wants. The Witness is the polar opposite; the titular player character is never even given a name (or a face), and the game is emphatically not concerned with changing the world in one’s image, but rather learning to see the world as it really is; there is no conflict to speak of, and all major developments are internal (purely in the player’s head) rather than external. What little changes there are in the world at all never alter it substantially; on the scale of opening a door or flicking a light switch. It’s an interesting choice.


Here’s the thing that’s most baffling to me, as a former game developer and current developer of tools for game developers (I guess that makes me a metagame developer?): as a game developer, you know there’s a certain semi-industrial process to making games. Things get made somewhat separately and independently, by different parts of the team, and then at some point they get joined together. And if you’ll allow me an extended metaphor, it’s a bit like injection-molded bits of plastic. You cast these two or more separate shells to form your shape, and they all come out slightly warped because that’s the nature of the process. And then you apply a bit of force and glue it all together along the fault lines, and there’s some rough edges that need to be filed off, and once it’s smooth enough you say “good enough” and ship it. So as a developer, I’m used to that injection-molded plastic process, and I know what works well with that and what doesn’t, and you just try to stay away from that. And you think you know things work because you know everything worth knowing about injection-molded plastic. And some people do these tiny jade figurines that are hand-carved but everyone knows you can’t do anything big like that!

Cue The Witness, apparently carved out of a single massive slab of marble, a lot more solid than what you’re used to, with no visible fault lines, no glue residue, no filed-off corners. And you look at it and it’s there and it makes no frickin’ sense to you whatsoever; this is just not How Things Are Done. How did this happen? How can anyone make something this big without slicing it into parts and gluing it together?

That’s where I am right now. I have never seen another game as cohesive as this. Not even close. Without getting into specifics or spoiler territory, I have never seen a game with such manifest intention behind every single detail, nor one where all the details cohere this well into a whole. This goes for the design and gameplay itself, but also across traditional hard boundaries. Game design and art play off each other, and both of these are tightly coupled with some of the engine tech in the game. It’s amazing. There is no single detail in the game that would be hard to accomplish by itself, but the whole is much more than the sum of its parts, with no signs of the endless compromise that are normally a reality in every project.

In one sense, seven years of work seems like a lot of time to spend on developing a game. In another, I don’t see most teams being able to deliver a game this whole, for lack of a better word, in any time frame.

If you had told me a couple of years ago that anyone was making a puzzle game about epistemology and epiphany, I’d have laughed you out of the room. That the game ever got made feels like a small miracle. How well it succeeds at doing what it’s trying to do is another. One of the recurring frustrations of my life is that I’ve never been able to explain to anyone really close to me just how much the sense of discovery and true understanding I get sometimes when programming matters to me. The reason this blog exists is because of my desire to share my discoveries, such as they are, with others that might appreciate them. The Witness means that I can now share this feeling itself with others that are receptive to it. This means more to me than I can put into words.

Final notes

It should be clear at this point just how much I like this game.

One dissenting opinion from Tom Chick at Quarter to Three, who evidently isn’t feeling it (warning, review contains mild spoilers!):

When I was done, the feeling wasn’t elation or even satisfaction. It was that feeling you get when you finally pass part of a game you never want to have to play again. I couldn’t shake a vague resentment that I’d squandered dozens of hours to no effect beyond now knowing the made-up language of The Witness’ puzzles. Not that I’m above squandering dozens of hours in a videogame. It’s just that I prefer squandering them because I’m building something, or leveling up a character, or beating a time or score, or resolving Trevor’s storyline, or collecting more pointless stuff in virtual Gotham, or figuring out how to use banelings, or rescuing the princess from whatever damn castle she’s finally in.

As I’ve said before, by all means, if you aren’t enjoying the game, just stop. Really. As I already said, the game very much wears its heart on its sleeve. If you’re grinding through hours of the game hoping for the Big Twist, stop. It is what it is.

But do let me point out that the levels of our character, or a score, or “pointless stuff in virtual Gotham” are just as made-up, arbitrary and ultimately ethereal as anything in The Witness. And is the level really what you care about? In most non-online RPGs, you can get your character to level 80 and complete all the quests with nothing but a minute of “work” in an ordinary hex editor, and save yourself hours of play to get there! Would you want to? Probably not.

The levels, the scores, the collectibles aren’t the meaningful part at all; in the end, they’re nothing but proof that you’ve spent a certain amount of time (and exhibited a certain amount of skill) in the game world. Whether that time spent was meaningful or not is entirely up to you. On which subject I am going to close with another quote; feel free to call me pretentious too if you want!

In music, one doesn’t make the end of the composition the point of the composition.

If that were so, the best conductors would be those who played fastest. And there would be composers who wrote only finales. People would go to concerts just to hear one crashing chord—and that’s the end!

But we don’t see that as something brought by our education into our everyday conduct. We’ve got a system of schooling that gives a completely different impression. It’s all graded, and what we do is we put the child into this corridor of this grade system, with a kind of, “come on kitty kitty kitty”, and now you go to kindergarten, you know, and that’s a great thing because when you finish that you get into first grade, and then come on, first grade leads to second grade and so on. And then you get out of grade school, you got high school, and it’s revving up—the thing is coming—and then you’ve got to go to college. And, by Jove, then you get into graduate school. And when you’re through with graduate school, you go out and join the world.

And then you get into some racket where you’re selling insurance, and they’ve got that quota to make. And you’re going to make that. And all the time the thing is coming! It’s coming, it’s coming, that great thing, the success you’re working for. Then when you wake up one day about 40 years old, you say “my god, I’ve arrived!” “I’m there!” And you don’t feel very different from what you always felt. And there’s a slight letdown because you feel there’s a hoax. And there was a hoax. A dreadful hoax. They made you miss everything!

We thought of life by analogy with a journey, with a pilgrimage, which had a serious purpose at the end. And the thing was to get to that end: success or whatever it is, or maybe heaven after you’re dead. But, we missed the point the whole way along. It was a musical thing and you were supposed to sing or to dance while the music was being played.

—Alan Watts

End-of-buffer checks in decompressors

This post is about general techniques for handling end-of-buffer checks in code that processes an input stream a byte at a time, or a few bytes at a time at the most. Concretely, I’ll be talking about decompression code, but many of these ideas are also applicable to related sequential input processing tasks like lexical analysis.

A basic decoder

To show how the problem crops up, let’s look at a simple decompressor and at what happens when we try to make an efficient implementation. Here’s our simple decoder for a toy LZ77 variant:

while (!done) { // main loop
    if (get_bits(1) != 0) { // match
        int offset = 1 + get_bits(13);
        int len = 3 + get_bits(5);

        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
       *dest++ = get_bits(8);

This particular coding scheme is just arbitrarily chosen to have a simple example, by the way. It’s not one I would actually use.

How does get_bits look like? The design space of bit IO is a big topic on its own, and I won’t be spending any time on the trade-offs here; let’s just use a basic variant with MSB-first (big endian-like) bit packing, reading the input stream from a memory buffer, one byte at a time:

const uint8_t *input_cursor;    // current input cursor
const uint8_t *input_end;       // end of input buffer

uint8_t read_byte()
    // If we reached the end of the input buffer, return 0!
    if (input_cursor >= input_end)
        return 0;

    return *input_cursor++;

uint32_t bitcount; // number of bits in bitbuf
uint32_t bitbuf;   // values of bits (bitcount bits from MSB down)

uint32_t get_bits(uint32_t nbits)
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= read_byte() << (24 - bitcount);
        bitcount += 8;

    // The requested bits are the top nbits bits of bitbuf.
    uint32_t ret = bitbuf >> (32 - nbits);

    // Shift them out.
    bitbuf <<= nbits;
    bitcount -= nbits;
    return ret;

Note we do an explicit end-of-buffer check in read_byte and return a defined value (0 in this case) past the end of the input stream. This kind of check is generally required to avoid crashes (or buffer overrun bugs!) if there is any chance the input stream might be invalid or corrupted – be it as the result of a deliberate attack, or just a transmission error. Returning 0 past the end of buffer is an arbitrary choice, but a convention I tend to stick with in my code.

Reducing overhead

As for get_bits, the implementation is a fairly typical one. However, as should be obvious, reading a few bits like this is still a relatively involved process, because every call to get_bits involves the refill check and an update of the bit buffer state. A key trick in many decompressors is to reduce this overhead by separating looking at bits from consuming them, which allows us to grab lots of bits at once (speculatively), and then later decide how far to move the input cursor. This basically boils down to splitting get_bits into two parts:

uint32_t peek_bits(uint32_t nbits)
    assert(0 < nbits && nbits <= 24);

    // Refill: read extra bytes until we have enough bits
    // in buffer. Insert new bits below the ones we already
    // have.
    while (bitcount < nbits) {
        bitbuf |= readbyte() << (24 - bitcount);
        bitcount += 8;

    // Return requested bits, starting from the MSB in bitbuf.
    return bitbuf >> (32 - nbits);

void consume_bits(uint32_t nbits)
    assert(bitcount <= nbits);
    bitbuf <<= nbits; // shift them out
    bitcount -= nbits;

Using this new interface, we can modify our decoder to reduce bit IO overhead, by doing a single peek_bits call early and then manually extracting the different sub-fields from it:

while (!done) { // main loop
    // We read up to 19 bits; grab them all at once!
    uint32_t bits = peek_bits(19);
    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        consume_bits(19); // 1b flag + 13b offs + 5b len
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        consume_bits(9); // 1b flag + 8b value

This trick of peeking ahead and deciding later how many bits were actually consumed is very important in practice. The example given here is a simple one; a very important use case is decoding Huffman codes (or other variable-length codes) aided by a look-up table.

Note, however, that we changed the input behavior: before, we really only called read_byte when we knew it was necessary to complete reading the current code. Now, we peek ahead more aggressively, and will actually peek past the end of the input bitstream whenever the last token is a literal. It’s possible to avoid this type of problem by being more restrained in the usage of peek_bits: only ever peek ahead by the minimum amount of bits that we know is going to get consumed no matter what. However, doing so forces us to do a bit more work at runtime than the code fragment shown above entails.

However, the variant shown above is still completely correct: our implementation of read_byte checks for the end of the input stream, and returns zeroes once we’ve passed it. However, this is no longer an exceptional condition: rather than being a “contingency plan” in case of corrupted input data, we can now expect to hit this path when decoding many valid bit streams.

In short, we’re taking a check we need for correctness (the end-of-buffer check) and making it serve double duty to simplify the rest of our decoder. So far, all the code we’ve seen is very standard and not remarkable at all. The resulting bit-IO implementation is fairly typical, more so once we stop trying to only call read_byte when strictly necessary and simplify the buffer refill logic slightly by always refilling to have >24 bits in the buffer no matter what the peek amount is.

Even beyond such details, though, this underlying idea is actually quite interesting: the end-of-buffer check is not one we can easily get rid of without losing correctness (or at least robustness in the face of invalid data). But we can leverage it to simplify other parts of the decoder, reducing the “sting”.

How far can we push this? If we take as granted that reading past the end of the buffer is never acceptable, what is the least amount of work we can do to enforce that invariant?

Relaxed requirements

In fact, let’s first go one further and just allow reading past the end-of-buffer too. You only live once, right? Let’s pull out all the stops and worry about correctness later!

It turns out that if we’re allowed to read a few bytes past the end of the buffer, we can use a nifty branch-free refill technique. At this point, I’m going to manually inline the bit IO so we can see more clearly what’s going on:

while (!done) { // main loop
    // how many bytes to read into bit buffer?
    uint32_t refill_bytes = (32 - bitcount) / 8;
    // refill!
    bitbuf |= read_be32_unaligned(input_cursor) >> bitcount;
    bitcount += refill_bytes * 8;
    input_cursor += refill_bytes;

    assert(bitcount > 24);

    // peek at next 19 bits
    uint32_t bits = bitbuf >> (32 - 19);

    if (bits & (1u << 18)) { // match bit set?
        int offset = 1 + ((bits >> 5) & 0x1fff);
        int len = 3 + (bits & 0x1f);

        // consume_bits(19);
        bitbuf <<= 19;
        bitcount -= 19;
        copy_match(dest, dest - offset, len);
    } else { // uncompressed 8-bit literal
        *dest++ = (uint8_t) (bits >> 10);
        // consume_bits(9);
        bitbuf <<= 9;
        bitcount -= 9;

This style of branchless bit IO is used in e.g. Yann Collet’s FSE and works great when the target machine supports reading unaligned 32-bit big endian values quickly — the read_be32_unaligned function referenced above. This is the case on x86 (MOV and BSWAP or just MOVBE where supported), ARMv6 and later (LDR provided unaligned accesses are allowed, plus REV when in little-endian mode) and POWER/PPC; not sure about other architectures. And for what it’s worth, I’m only showing 32-bit IO here, but this technique really comes into its own on 64-bit architectures, since having at least 56 bits in the buffer means we can usually go for a long while without further refill checks.

That’s a pretty nice decoder! The only problem being that we have no insurance against corrupted bit streams at all, and even valid streams will read past the end of the buffer as part of regular operation. This is, ahem, hardly ideal.

But all is not lost. We know exactly how this code behaves: every iteration, it will try reading 4 bytes starting at input_cursor. We just need to make sure that we don’t execute this load if we know it’s going to be trouble.

Let’s say we work out the location of the spot where we need to start being careful:

// Before the decoder runs:
const uint8_t *input_mark;

if (input_end - input_cursor >= 4)
    input_mark = input_end - 4;
    input_mark = input_cursor;

The simplest thing we can do with that information is to just switch over to a slower (but safe) decoder once we’re past that spot:

while (!done && input_cursor <= input_mark) {
    // fast decoder here: we know that reading 4 bytes
    // starting at input_cursor is safe, so we can use
    // branchless bit IO

while (!done) {
    // finish using safe decoder that refills one byte at
    // a time with careful checks!

This works just fine, and is the technique chosen in e.g. the zlib inflate implementation: one fast decoder that runs when the buffer pointers are well away from the boundaries, and a slower decoder that does precise checking.

Note that the input_cursor < input_mark check is the only addition to our fast decoder that was necessary to make the overall process safe. We have some more prep work, and it turns out we ended up with an entire extra copy of the decoder for the cold “near the end of the buffer” path, but the path we expect to be much more common — decoding while still being safely away from the end of the input stream — really only does that one extra compare (and branch) more than the “fast but unshippable” decoder does!

And now that I’ve done my due diligence and told you about the boring way that involves code duplication, let’s do something much more fun instead!

One decoder should be enough for anyone!

The problem we’re running into is that our buffer is running out of bytes to read. The “safe decoder” solution just tries to be really careful in that scenario. But if we’re not feeling very careful today, well, there’s always the ham-fisted alternative: just switch to a different input buffer that’s not as close to being exhausted yet!

Our input buffers are just arrays of bytes. If we start getting too close to the end of our “real” input buffer, we can just copy the remaining bytes over to a small temp buffer that ends with a few padding bytes:

uint8_t temp_buf[16]; // any size >=4 bytes will do.

while (!done) {
    if (input_cursor >= input_mark) {
        assert(input_cursor < input_end);

        // copy remaining bytes to temp_buf
        size_t bytes_left = (size_t) (input_end - input_cursor);
        assert(bytes_left < sizeof(temp_buf));
        memmove(temp_buf, input_cursor, bytes_left);

        // fill rest of temp_buf with zeros
        memset(temp_buf + bytes_left, 0, sizeof(temp_buf) - bytes_left);

        // and update our buffer pointers!
        input_cursor = temp_buf;
        input_end = temp_buf + sizeof(temp_buf);
        input_mark = input_end - 4;

    assert(input_cursor <= input_mark);
    // rest of fast decoder using branchless bit IO

And with that little bit of extra logic, we can use our fast decoder for everything: note that we never read past the bounds of the original buffer. Also note that the logic given above can generate an arbitrary amount of trailing zero bytes: if after swapping buffers around, our input cursor hits the mark again, we just hit the refill path again to generate more zeroes. (This is why the copying code uses memmove).

This is nifty already, but we can push this idea much further still.

Switching input buffers

So far, we’re effectively switching from our regular input buffer to the conceptual equivalent of /dev/zero. But there’s no need for that restriction: we can use the same technique to switch over to a different input buffer altogether.

We again use a temporary transition buffer that we switch to when we reach the end of the current input buffer, but this time, we copy over the first few bytes from the next input buffer after the end of the current buffer, instead of filling the rest with zeroes. We still do this using our small temp buffer.

We place our input mark at the position in the temp buffer where data from the new input buffer starts. Once our input cursor is past that mark, we can change pointers again to resume reading from the new input buffer directly, instead of copying data to the temp buffer.

Note that handling cases like really short input buffers (shorter than our 4-byte “looakhead window”) requires some care here, whereas it’s not a big deal when we do the bounds checking on every consumed input byte. We’re not getting something for nothing here: our “sloppy” end-of-input window simplifies the core loop at the expense of adding some complexity in the boundary case handling.

Once we reach the actual end of the input stream, we start zero-filling, just as before. This all dovetails nicely into my old post “Buffer-centric IO” which combines very well with this technique. Together, we get almost-zero-copy IO, except for the copies into the transition buffer near buffer boundaries, which only touch a small fraction of all bytes and are there to make our lives easier.

A final few generalizations

The example I’ve been using was based on a single get_bits (or later peek_bits) call. But this is really not substantial at all. The crucial property we’re exploiting in the decoder above is that we have a known bound for the number of bytes that can be consumed by a single iteration of the loop. As long as we can establish such a bound, we can do a single check per iteration, and in general, we need to check our input cursor at least once inside every loop that consume a potentially unbounded (or at least large) number of input bytes — which in this example is only the main loop.

For the final generalization, note that a lot of compressors use a stream interface similar to zlib. In essence, this is a buffer interface similar to the one described in “Buffer-centric IO” for both the input and output buffers; the decompressor then gets called and processes data until the input or output buffers are exhausted, the end of stream is reached, or an error occurs, whichever happens first. This type of interface makes the (de)compressor somewhat harder to write but is much more convenient for the client code, which can just hand in whatever.

A typical way to implement this type of interface is described in Simon Tatham’s old article “Coroutines in C” — the key property being that the called function needs to be able to save its state at any point where I/O happens, in case it runs out of buffer space; and furthermore it needs to be able to later resume at exactly that point.

The solution is to effectively turn the (de)compressor into a state machine, and Tatham’s article describes a way to do so using a variant of Duff’s Device, quite probably the most infamous coding trick in the C language. Most (de)compressors with a zlib-like interface end up using this technique (or an equivalent) so they can jump into the middle of the decoder and resume where they left off.

So why do I mention all this? Well, the technique I’ve outlined in this article is applicable here as well: Tatham’s description assumes byte-level granularity IO, which means there’s generally lots of points inside the decoder main loop where we might need to save our state and resume later. If the decoder instead ensures there’s enough bytes left in the buffer to make it through one full iteration of the main loop no matter what, that means we have many fewer points where we need to save our state and later resume, often only in a single location.

What’s particularly interesting about combining the relaxed-refill technique with a coroutine-style decoder is that all of the refill and transition buffer logic can be pulled outside of the decoder proper. In library code, that means it can be shared between multiple decoders; so the logic that deals with the transition buffers and short input buffers only needs to be implemented and debugged once.


The key simplification in this scheme is relaxing the strict “check for end of buffer on every byte consumed” check. Instead, we establish an upper bound N on the number of input bytes that can be consumed in a single iteration through our decoder main loop, and make sure that our current input buffer always has at least N bytes left — by switching to a different temporary input buffer if necessary.

This allows us to reduce the number of end-of-buffer checks we need to execute substantially. More importantly, it greatly increases the applicability of branch-less refill techniques in bit IO and arithmetic coding, without having to keep a separate “safe” decoder around.

The net effect is one of concentrating a little complexity from several places in hot code paths (end-of-buffer checks on every byte consumed) into somewhat increased complexity in a single cold code path (buffer switching). This is often desirable.

The biggest single caveat with this technique is that as a result of the decoder requiring N bytes in the input buffer at all times, the decoder effectively “lags behind” by that many bytes – or, depending on your point of view, it “looks ahead” by N bytes, reading from the input stream sooner than strictly necessary.

This can be a problem when, for example, several compressed streams are concatenated into a single file: the decoder may only get to decoding the “end of stream” symbol for stream A after N bytes from stream B have already been submitted to the decoder. The decoder would then need to “un-read” (in the sense of ungetc) the last few bytes or seek backwards. No matter how you dice it, this is annoying and awkward.

As a result, this technique is not all that useful when this is a required feature (e.g. as part of a DEFLATE decoder obeying the zlib interface).
However, there are ways to sidestep this problem: if the bitstream specifies the compressed size for either the entire stream or individual blocks, or if the framing format ends in N or more trailing “footer” bytes (a checksum or something similar), we can use this approach just fine.

UPDATE: As commenter derf_ notes on Hacker News, there’s a nice trick to produce implicit trailing zero bits in a bit reader like the one described above by just setting bitcount to a high value once the last byte’s been read into bitbuf. However, this only works with a decoder exactly like the one shown above. The nice part about switching to an explicit zero-padding buffer is that it works not just with all bit IO implementations I’m aware of, but also with byte-normalized (or larger) arithmetic coders like typical range coders or rANS.

rANS in practice

This year, we (RAD) shipped two new lossless codecs, both using rANS. One of the two is Oodle LZNA (released in May), which Charles has already written about. The other is called BitKnit, which first shipped in July as part of Granny, and is slated for inclusion into more RAD products.

So, with two production-quality versions written and successfully shipped, this seems like a good time to write up some of the things we’ve learned, especially in terms of implementation concerns. Let’s get cracking! (I’m assuming you’re familiar with ANS. If not, I wrote a paper that has a brief explanation, and various older blog posts that give more details on the proofs. I’m also assuming you’re familiar with “conventional” arithmetic coding; if not, you’re not gonna get much out of this.)

One small note before we start…

I’ll be referring to the ANS family as a class of arithmetic coders, because that’s what they are (and so are “range coders“, by the way). So here’s a small historical note before we get cracking: the “bottom-up” construction of ANS and the LIFO encoding seem quite foreign once you’re used to most “modern” arithmetic coders, but what’s interesting is that some of the earliest arithmetic coders actually looked very similar.

In particular, I’m talking about Rissanen’s 1976 paper “Generalized Kraft Inequality and Arithmetic Coding” (which coined the term!). Note the encoding and decoding functions C and D on the second page, very reminiscent to the “bottom-up” construction of ANS (with the code being represented by a number that keeps growing), and the decoder returning symbols in the opposite order they were encoded!

Rissanen’s coder (with its rather cumbersome manual truncated floating point arithmetic subject to careful rounding considerations) never saw any widespread application, as far as I can tell. The coders that actually got traction use the now familiar top-down interval subdivision approach, and a different strategy to adapt to fixed-precision operation. But reading Rissanen’s paper from today’s perspective is really interesting; it feels like a very natural precursor to ANS, and much closer in spirit to ANS than to most of the other algorithms that descended from it.

Why rANS (and not FSE/tANS)?

On my blog especially, I’ve been talking almost exclusively about rANS, and not so much FSE/tANS, the members of the ANS family that have probably been getting the most attention. Why is that?

Briefly, because they’re good at different things. FSE/tANS are (nearly) drop-in replacements for Huffman coding, and have similar strengths and weaknesses. They have very low (and similar) per-symbol decode overhead, but the fast decoders are table-driven, where the table depends on the symbol probabilities. Building Huffman decoding tables is somewhat faster; FSE/tANS offer better compression. Both Huffman and FSE/tANS can in theory support adaptive probabilities, but there’s little point in doing anything but periodic rebuilds; true incremental updates are too slow to be worthwhile. At that point you might as well use a coder which is more suited to incremental adaptation.

Which brings us to rANS. rANS is (again, nearly) a drop-in replacement for multi-symbol Arithmetic coders (such as range coders). It uses fewer registers than most arithmetic coders, has good precision, and the decoder is division-free without introducing any approximations that hurt coding efficiency. Especially with the various tweaks I’ll describe throughout this post, rANS has what are easily the fastest practical multi-symbol alphabet arithmetic decoders I know. rANS coders are also quite simple in implementation, with none of the tricky overflow and underflow concerns that plague most arithmetic coders.

So rANS is a pretty sweet deal if you want a fast arithmetic coder that deals well with relatively fast-changing probabilities. Great! How do we make it work?

Reverse encoding

As mentioned above, ANS coders are LIFO: whatever order you encode symbols in, the decoder will produce them in the opposite order. All my ANS coders (including the public ryg_rans) use the convention that the encoder processes the data in reverse (working from the end towards the beginning), whereas the decoder works forwards (beginning towards end).

With a static model, this is odd, but not especially problematic. With an adaptive model, decoder and model have to process data in the same direction, since the decoder updates the model as it goes along and needs current model probabilities to even know what to read next. So the decoder and the model want to process data in the same direction (forward being the natural choice), and the rANS encoder needs to be processing symbols in the opposite order.

This is where it comes in handy that rANS has an interface very much like a regular arithmetic coder. For example, in my sample implementation, the symbol is described by two values, start and freq, which are equivalent to the symbol interval lower bound and size in a conventional arithmetic coder, respectively.

Most arithmetic coders perform the encoding operation right there and then. In rANS, we need to do the actual encoding backwards, which means we need to buffer the symbols first: (the first description I’ve seen of this idea was in Matt Mahoney’s fpaqa)

// Say our probabilities use 16-bit fixed point.
struct RansSymbol {
  uint16_t start; // start of symbol interval
  uint16_t range; // size of symbol interval

class BufferedRansEncoder {
  std::vector<RansSymbol> syms; // or similar

  void encode(uint16_t start, uint16_t range)
    assert(range >= 1);
    assert(start + range <= 0x10000); // no wrap-around

    RansSymbol sym = { start, range };

  void flush_to(RansEncoder &coder);

With this, we can use rANS exactly like we would any other arithmetic coder. However, it will not be generating the bitstream incrementally during calls to encode; instead, it simply buffers up operations to be performed later. Once we’re done we can then pop off the symbols one by one, in reverse order, and generate the output bitstream. Easy:

void BufferedRansEncoder::flush_to(RansEncoder &coder)
  // Replays the buffered symbols in reverse order to
  // the actual encoder.
  while (!syms.empty()) {
    RansSymbol sym = syms.back();
    coder.encode(sym.start, sym.range);

Once you have this small piece of scaffolding, you really can use rANS as a drop-in replacement for a conventional arithmetic coder. There’s two problems with this, though: if you use this to encode an entire large file, your symbol buffer can get pretty huge, and you won’t get a single bit of output until you’ve processed the entire input stream.

The solution is simple (and can also be found in the aforementioned fpaqa): instead of accumulating all symbols emitted over the entire input stream and doing one big flush at the end, you just process the input data in chunks and flush the coder periodically, resetting the rANS state every time. That reduces compression slightly but means the encoder memory usage remains bounded, which is an important practical consideration. It also guarantees that output is not delayed until the end of stream; finally, lots of compressors are already chunk-based anyway. (For the opportunity to send incompressible data uncompressed rather than wasting time on the decoder end, among other things.)

Basic interleaving

One thing that rANS makes easy is interleaving the output from several encoders into a single bitstream without needing any extra signaling. I’m not going into detail why that works here; I wrote a paper on the subject if you’re interested in details. But the upshot is that you can use multiple rANS encoders and decoders simultaneously, writing to the same output bitstream, rather than having a single one.

Why do we care? Because this is what decoding a single symbol via rANS looks like (adapted from my public ryg_rans code):

static const uint32_t kProbBits = 16;
static const uint32_t kProbMask = (1 << kScaleBits) - 1;

class RansDecoder {
  uint32_t state; // current rANS state
  // (IO stuff omitted)

  uint32_t renormalize_state(uint32_t x)
    // Byte-wise for simplicity; can use other ways.
    while (x < RANS_L) 
      x = (x << 8) | read_byte();

    return x;

  uint32_t decode_symbol()
    uint32_t x = state; // Current state value

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // Save updated state and return symbol
    state = x;

Note how literally every single line depends on the results of the previous one. This translates to machine code that has a single, very long, dependency chain with relatively low potential for instruction-level parallelism (ILP). This is fairly typical for all entropy coder inner loops, by the way, not just rANS. And because superscalar processors depend on ILP to deliver high performance, this is bad news; we’re not making good use of the machine.

Hence interleaving. The idea is that we have two RansDecoder instances, each with their own state, but implicitly sharing the same bitstream read pointer (referenced by read_byte). Now, when we have code like this:

  RansDecoder dec0, dec1;
  // ...
  uint32_t sym0 = dec0.decode_symbol():
  uint32_t sym1 = dec1.decode_symbol();

the processor’s out-of-order execution logic can overlap execution of both decodes, effectively running them at the same time. The renormalize step for dec0 needs to happen before the renormalize of dec1, but other than that, there’s no dependencies between the two. For what it’s worth, this does not actually require out-of-order execution; a compiler for an in-order architecture can also work with this, provided it has enough dataflow information to know that dec0 calling read_byte() does not influence anything that dec1 does before its renormalize step. So what interleaving does is convert a very serial task into one that’s much more amenable to superscalar execution.

What it boils down to is this: a regular rANS decoder is a fast, precise, divide-less arithmetic decoder (which is nice to begin with). Interleave two streams using what is otherwise the exact same code, and you get a pretty good boost in performance; (very roughly) around 1.4× faster, on both the decoder and encoder. But this is by now an old hat; this was all in the initial release of ryg_rans.

Some of the early experiments leading up to what later became BitKnit uses this directly, pretty much the same as in the ryg_rans example code, but it turns out it was a bit of a pain in the neck to work with: because the two streams need to interleave properly, the BufferedRansEncoder needs to keep track of which symbol goes to which stream, and both the encoder and decoder code needs to (somewhat arbitrarily) assign symbols to either stream 0 or stream 1. You’d prefer the streams to keep alternating along any given control-flow path, but that’s not always possible, since sometimes you have conditionals where there’s an even number of symbols send on one path, and an odd number sent on the other! So having two explicit streams: not so great. But we found a better way.

Implicit interleaving to the rescue

What we actually ended up doing was interleaving with a twist – literally. We give the underlying rANS encoders (and decoders) two separate state values, and simply swap the two stream states after every encoding and decoding operation (that’s where the “BitKnit” name comes from – it keeps two active states on the same “needle” and alternates between them). The modifications from the decoder shown above are pretty small:

class RansDecoder {
  uint32_t state1; // state for "thread 1"
  uint32_t state2; // state for "thread 2"

  // ...

  uint32_t decode_symbol()
    uint32_t x = state1; // Pick up thread 1

    // ---- BEGIN of code that's identical to the above

    uint32_t xm = x & kProbMask; // low bits determine symbol
    Symbol sym = lookup_symbol(xm); // (various ways to do this)

    // rANS state advance
    x = sym.range * (x >> kProbBits) + xm - sym.start;
    x = renormalize_state(x);

    // ---- END of code that's identical to the above

    // Save updated state, switch the threads, and return symbol
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2


The changes to the encoder are analogous and just as simple. It turns out that this really is enough to get all the performance benefits of 2× interleaving, with none of the extra interface complexity. It just looks like a regular arithmetic decoder (or encoder). And assuming you write your implementation carefully, compilers are able to eliminate the one extra register-register move instruction we get from swapping the threads on most paths. It’s all win, basically.

Bypass coding

Borrowing a term from CABAC here; the “bypass coding mode” refers to a mode in the arithmetic coder that just sends raw bits, which you use for data that’s known a priori to be essentially random/incompressible, or at least not worth modeling further. With conventional arithmetic coders, you really need special support for this, since interleaving an arithmetic code stream with a raw bitstream is not trivial.

With rANS, that’s much less of a problem: you can just use a separate bitbuffer and mix it into the target bitstream with no trouble. However, you may not want to: rANS has essentially all of the machinery you need to act as a bit buffer. Can you do it?

Well, first of, you can just use the arithmetic coder with a uniform distribution to send a set number of bits (up to the probability resolution). This works with any arithmetic coder, rANS included, and is fairly trivial:

  // write value "bits" using "numbits"
  coder.encode(bits << (kProbBits - numbits), 
                  1 << (kProbBits - numbits));

and the equivalent on the decoder side. However, this is not particularly fast. Fortunately, it’s actually really easy to throw raw bit IO into a rANS coder: we just add the bits at the bottom of our state variable (or remove them from there in the decoder). That’s it! The only thing we need to do is work out the renormalization condition in the encoder. Using the conventions from the bytewise ryg_rans, an example implementation of the encoder is:

static inline void RansEncPutBits(RansState* r, uint8_t** pptr,
  uint32_t val, uint32_t nbits)
  assert(nbits <= 16);
  assert(val < (1u << nbits));

  // nbits <= 16!
  RansState x = RansEncRenorm(*r, pptr, 1 << (16 - nbits), 16);

  // x = C(s,x)
  *r = (x << nbits) | val;

and the corresponding getbits in our ongoing example decoder looks like this:

class RansDecoder {
  // ...

  uint32_t get_bits(uint32_t nbits)
    uint32_t x = state1; // Pick up thread 1

    // Get value from low bits then shift them out and
    // renormalize
    uint32_t val = x & ((1u << nbits) - 1);
    x = renormalize_state(x >> nbits);

    // Save updated state, switch the threads, and return value
    state1 = state2; // state2 becomes new state1
    state2 = x;      // updated state goes to state2

    return val;

note that except for the funky state swap (which we carry through for consistency), this is essentially just a regular bit IO implementation. So our dual-state rANS admits a “bypass mode” that is quite cheap; usually cheaper than having a separate bit buffer would be (which would occupy yet another CPU register in the decoder), at least in my tests.

Note that if you combine this with the buffering encoder described above, you need a way to flag whether you want to emit a regular symbol or a burst of raw bits, so our RansSymbol structure (and the code doing the actual encoding) gets slightly more complicated since we now have two separate types of “opcodes”.

The implementation above has a limit of 16 bits you can write in a single call to RansEncPutBits. How many bits you can send at once depends on the details of your renormalization logic, and how many bits of rANS state you keep. If you need to send more than 16, you need to split it into multiple operations.

Tying the knot

I got one more: a rANS encoder needs to write its final state to the bitstream, so that the decoder knows where to start. You can just send this state raw; it works just fine. That’s what the ryg_rans example code does.

However, rANS states are not equally likely. In fact, state x occurs with a probability proportional to 1/x. That means that an ideal code should spend approximately \log_2(x) bits to encode a final state of x. Charles has already written about this. Fortunately, the ideal coder for this distribution is easy: we simply send the index of the highest set bit in the state (using a uniform code), followed by the remaining bits.

One options is to do this using regular bit I/O. But now you still need a separate bit IO implementation!

Fortunately, we just covered how do send raw bits through a rANS encoder. So one thing we can do is encode the final state value of stream 2 using the “stream 1” rANS as the output bit buffer, using the putbits functionality just described (albeit without the thread-switching this time). Then we send the final state of the “stream 1” rANS raw (or using a byte-aligned encoding).

This approach is interesting because it takes a pair of two rANS encoder threads and “ties them together” – making a knot, so to speak. In the decoder, undoing the knot is serial (and uses a single rANS decoder), but immediately after initialization, you have a working dual-stream coder. This saves a few bytes compared to the sloppier flushing and is just plain cute.

This technique really comes into its own for the wide-interleave SIMD rANS coders described in my paper, because it can be done in parallel on lots of simultaneous rANS coders in a reduction tree: group lanes into pairs, have each odd-indexed lane flush into its even-indexed neighbor. Now look at groups of 4 lanes; two have already been flushed, and we can flush the rightmost “live” lane into the leftmost lane coder. And so forth. This allows flushing a N× interleaved SIMD rANS coder in O(\log(N)) coding operations, and still has some parallelism while doing so. This is not very exciting for a 2× or 4× interleaved coder, but for GPU applications N is typically on the order of 32 or 64, and at that level it’s definitely interesting.

Conclusion and final notes

Using the techniques described in this post, you can write rANS encoders and decoders that have about the same amount of code as a conventional arithmetic coder with comparable feature set, have a similar interface (aside from the requirement to flush the encoder regularly), are significantly faster to decode (due to the combination of the already-fast rANS decoder with implicit interleaving), and have very cheap “bypass coding” modes.

This is a really sweet package, and we’re quite happy with it. Anyone interested in (de)compression using adaptive models should at least take a look. (For static models, FSE/tANS are stronger contenders.)

What example code there is in this article uses byte-wise renormalization. That’s probably the simplest way, but not the fastest. Oodle LZNA uses a 63-bit rANS state with 32-bits-at-a-time renormalization, just like rans64.h in ryg_rans. That’s a good choice if you’re primarily targeting 64-bit platforms and can afford a 64-bit divide in the encoder (which is quite a bit more expensive than a 32-bit divide on common CPUs). BitKnit uses a 32-bit rANS state with 16-bits-at-a-time renormalization, similar to the coder in ryg_rans rans_word_sse41.h. This is friendlier to 32-bit targets and admits a branch-free renormalize option, but also means the coder has somewhat lower precision. Using a probability range of 16 bits would not be wise in this case; BitKnit uses 14 bits.


Get every new post delivered to your Inbox.

Join 289 other followers