
Networks all the way down, part 2.

In the previous post, I talked a bit about how essentially anything we would call a computer these days is in fact, for all practical purposes, a heterogeneous cluster made up of various specialized smaller computers, all connected using various networks that go by different names and are specified in different standards, yet are all suspiciously similar at the architecture level; a fractal of switched, packet-based networks of heterogeneous nodes that make up what we call a single “computer”. Which we then promptly connect with other computers to form home networks, work networks, and ultimately the internet.

So far, that is just an observation. But does it mean anything? Well, sort of. For one thing, it means that all the network security problems that plague inter-computer networking also exist within computers themselves. Just as you can own a machine from outside using a network-based attack (ultimately delivered via Ethernet frames), you can own it using exploits delivered via USB packets, or eSATA, or Thunderbolt. This is not a particularly bad security nightmare – all these attacks require physical access, and once you have physical access to a machine you have all kinds of ways of working around security mechanisms anyway – but it does mean that people trying to sell you hardware that you aren’t allowed to run your own code on have a pretty large attack surface to worry about.

That said, as far as implications of “everything is a computer running a network stack” go, “there are going to be implementation issues” is not exactly the most exciting place to end up at. Things get substantially more interesting once we approach the same underlying truth from another angle, though. To do so, let’s pick a concrete example. Being an affluent male geek, I socialize with a bunch of other affluent male geeks, a lot of whom store enough data (or enough computers) at home to require a NAS (Network Attached Storage) device to house terabytes of (doubtlessly 100% legally acquired!) data and serve it to the multitude of desktop computers, tablets, phones and media devices they keep in their home. Now, these boxes are essentially marketed as a bunch of drive enclosures with a network plug and a bunch of added value features; RAID, video streaming (and sometimes live transcoding as well), that kind of thing. Since they’re essentially storage appliances, most people would view them as somewhere between an external hard drive enclosure and a proper computer. Of course, there’s no such thing as “half a computer”, and these devices are typically just small computers with a NIC and a few hundred megabytes to a few gigabytes of RAM, usually running some variant of Linux. They could be used for all kinds of computing; they just happen to be specialized in a way that makes them more suitable for some tasks than others.

That’s not why I bring this up, though; I bring it up because from an end user’s point of view, a NAS is essentially just a disk drive you can plug into multiple machines simultaneously, using an Ethernet plug instead of a SATA plug. Now from the computer’s (and OS’s) point of view, a local disk and a NAS are very different things; at that level, they are speaking very different languages (namely, a block device with a local file system on it, versus a remote network filesystem such as CIFS or NFS). But to most applications running on top of the OS (and by extension most users that use them), the distinction between a local drive and a mounted network share only really becomes apparent when the abstraction breaks.

Interestingly, we see the same pattern today even when we zoom in to the local storage connected to a computer. If you were to get a 2004 SATA hard drive, it would very likely be exactly that – still a “dumb” hard drive, possibly with a bit of cache, a microcontroller to talk SATA to the host and really address the disk, plus a fair amount of (possibly programmable) DSP circuitry to actually handle the encoding of bits into and decoding of bits from magnetic flux patterns, because doing this is way more involved – and way less discrete – than you probably think. So even in 2004, we already have a considerable amount of essentially invisible compute power being thrown at a problem that the higher-up abstraction layers gloss over; but at least all this processing still essentially respects the underlying abstraction of a block device. For a 2014 USB thumb drive, that’s not necessarily true anymore either: they do things like wear leveling, remapping block addresses to distribute write loads evenly over the device (and hence prevent some blocks aging much faster than others). These processes often involve parsing the file system itself (at least if it’s FAT or a variant). Think about this for a second: a storage device that, by itself, understands the file system that is stored on it, and automatically rearranges data to maximize its expected lifetime. This level of autonomy is less than a NAS transcoding videos on the fly, but it’s still a far cry from the truly “dumb” disk and tape drives our reigning block device abstractions were designed for. And we’re not talking about some expensive high-availability low-volume server technology here; this is a storage technology that is both cheap and ubiquitous.

That brings up another point: Flash memory technology is a vastly different beast from magnetic recording technology. And yet, all of those differences get hidden and papered over at a real low level; we could have taught OSes what Flash devices are, how they work, and declared them as a separate device family, with separate file systems that handle the wear leveling on the software side. Yet this is not what happened; evidently, the path of least resistance was to add dedicated hardware and software to every single Flash storage device that makes it look like a hard drive, even though it works very differently, with very different trade-offs.

This is not an isolated incident: if you go over the list of peripherals from last time, you’ll find something similar for every single one of them. Ethernet technology has essentially been completely redesigned between 1989 and 2014, yet the only part of the Ethernet standard that’s both device-independent and directly visible to software – namely, Ethernet frames – has remained essentially unchanged (except for the addition of jumbo frames, which are still not widely used). SAS and SATA may use a very different encoding and topology than their parallel counterparts, but from a software perspective the changes are evolutionary not revolutionary. Regular PCI and PCIe are essentially the same from the perspective of the OS. And so forth. Probably the most impressive example of all is the x86 architecture; a modern x86 processor can (and will) go to substantial lengths to pretend that it’s really just a fancy 8086, a 286, a 386, or a slightly extended AMD Athlon 64, depending on the operating mode; the architecture and implementation of current x86 processors is nothing like the original processors it’s emulating, but while the implementations may change substantially over time, the interfaces – protocols, to stay within our networking terminology – stay mostly constant over large time scales, warts and all.

Note how much this runs counter to the common attitude that software is easy to change, and hardware difficult. This might be true for individuals; but across computing as a whole, it’s missing the point entirely. The more relevant distinction is that implementations are easy to change, and protocols difficult. And once a protocol (or API in software terms) is firmly established and widely supported, it’s damn near impossible to get rid of. If that means that USB thumb drives have to pretend they are hard disks, and need the capability to understand the file system that’s stored on them to do a good job; if that means that Ethernet devices have to pretend that it’s 1989 and all nodes on the network are connected to one wire and identified by their MAC address, rather than connected by point-to-point links via network switches that understand IP just fine; if that means that “x86” processors have to pretend they’re not actually out-of-order cores with an internal RISC-ish instruction set, hundreds of registers and sophisticated “decoding” (really, translation) engines; then so be it. That is the price we pay to keep things more or less in working order – except for that occasional instance when all our sophisticated abstractions break, the underlying reality shines through, and that mounted network share is revealed to not actually be on the local machine at all, by producing errors that no app knows how to handle.

Just keep hitting refresh. It’ll be back in a minute.

Networks all the way down.

I first started playing around with PCs about 25 years ago, so around 1989/1990 (and began coding maybe a year later). At the time, there was a large variety of different connectors, plugs and buses in use: keyboards used DIN or PS/2 plugs, a point-to-point link with a custom protocol. Mice and modems (if you had them!) were typically connected using RS-232, a point-to-point serial link. Your monitor was connected using a VGA connector, a point-to-point link carrying analog component video (RGBHV). Printers were connected using the parallel port, a point-to-point parallel link (initially 8 bits host->device unidirectional with some unidirectional and bidirectional status bits; later, there were also modes with a bidirectional link on the 8 data bits, used to connect devices like scanners). The PC had an extension bus, now called ISA, that you could use to plug in peripherals such as video cards, floppy controllers, hard disk controllers, SCSI controllers, sound cards and even memory expansions. On the storage side, floppy drives were on a parallel link (not sure if it qualifies as a bus or not) that could multiplex two floppy drives onto one wide ribbon cable. Consumer hard drives typically spoke what is now called PATA (parallel ATA) but was then just called ATA or IDE; essentially just a thinly wrapped ISA bus protocol (these drives already had their controllers on-board). As such, ATA/IDE was a parallel bus (more ribbon cables). There was also SCSI; on PCs, it was considered the “high-end” storage option, on other platforms it was the default. Either way, SCSI was another parallel bus. Finally, if you had an Ethernet card (and in 1990 you probably didn’t in your home machine, because there was no point – it was quite likely the only computer in your household, or in a half-mile radius for that matter), then at the time that meant one long coaxial cable (“bus topology”), T-plugs to plug in machines, and terminators at both ends of the cable to prevent reflections. Ethernet at the time already sent packets (not the IP kind, Ethernet packets, properly called “frames”) over a high-speed (10Mbit/s!) serial link, relying on collision detection in lieu of the more complicated arbitration techniques its competitors used at the time.

Okay, so much for the history lesson. Now let’s fast-forward 25 years. How does the situation look today?

We are still using “Ethernet”, but 2014 Ethernet has very little in common with its 1989 counterpart other than the name. It still uses serial links, but as of 10BASE-T it uses a star topology (using switches and hubs) rather than a bus, and with the introduction of Gigabit Ethernet, the whole “collision detection” business fell by the wayside too; Gigabit Ethernet is a bunch of point-to-point links connected by switches (active devices, instead of hubs which are mostly passive), and hence there’s no physically shared medium anymore; there’s explicit switching. So: serial point-to-point links, packet based, switching. The same tech (albeit at different scaling levels) is used in your home, at your workplace, at your ISP, and in major internet exchanges/peering spots.

Extension cards used to be on the ISA bus; in 2014, it’s all PCI Express, and while it’s still called a “bus”, it hasn’t been one ever since the “Express” part got added. PCIe uses serial, not parallel, links, the connectivity is point-to-point instead of a shared medium, with arbitration by the “root complex” (which logically sits at the center of the bus, making this a star instead of a bus topology), oh and communication is packet-based. So PCIe uses serial point-to-point links, is packet-based, and uses switching.

On the storage front, PATA was succeeded by SATA. SATA is, as the name says, serial instead of parallel. The cables, which used to form a bus (with each cable typically allowing up to two devices to be connected to the controller) now are used for single point-to-point links between devices and the controller. Oh, and there is a more explicit framing protocol based around packets. In other words, SATA has a star topology of serial point-to-point links and is packet-based.

Of course, SCSI didn’t disappear off the face of the earth either; it turned into Serial Attached SCSI (SAS). SAS keeps the SCSI command set, but as the name suggests, links are now serial. Oh, and – interrupt me if this sounds familiar – they’re now point-to-point, with everything connected either directly to the “initiator”, or indirectly through one or more “expanders” (switches), meaning SAS has a tiered-star topology. I’m not sure whether or not SAS is internally packet-based though, but I have my suspicions :) [can anyone confirm this?]. Oh, and SAS can also tunnel SATA streams, because why not.

Keyboards, mice, and printers are now all on USB. USB stands for “Universal Serial Bus”; the first two words of that are accurate (it is pretty much universally used at this point, and it is serial) but the last one is a blatant lie. USB may pretend to be a bus, but it really is a packet-based network with a tiered-star topology. Serial point-to-point links, possibly with hubs in between, and all packet-based. And if you happen to attach a USB storage device such as a thumb drive or external hard disk, well, USB storage really just wraps SCSI commands in USB packets – and these SCSI commands may in turn wrap ATA commands.

And what about displays? Well, if you use DVI or HDMI, you still have a dedicated point-to-point link, much as in the VGA days, although it’s now all digital. But if you happen to be using the newer DisplayPort, well, that’s a – can you guess? – serial, packet-based point-to-point link protocol. Which can also carry USB across the link, by the way. Oh, and there’s also Thunderbolt, which can multiplex both DisplayPort and PCIe packets over its serial point-to-point links. Among other things, this has fun consequences such as allowing your monitor or TV (or anything else plugged into your display connector) full freaking bus master DMA access – a killer feature if I’ve ever seen one! – and, just as fun, finally gives you the opportunity to have a storage device that talks ATA through SCSI by way of USB over DisplayPort inside Thunderbolt, because we can.

Okay, so pretty much every single pluggable link interface that used to be in 1989 PCs has been replaced by what are, effectively, all very similar spins on the same underlying network technology (and they really are all networks, albeit some with funky protocols). Is that the end of it?

Well, no, not quite. Because we now have multiple CPUs, too. And guess how they talk to each other? Well, in the case of PCs, there’s two primary options: AMD’s HyperTransport and Intel’s QuickPath Interconnect (QPI). Needless to say, both of these are packet-based, serial, point-to-point links, typically used to communicate between CPU cores (or CPU sockets anyway).

So what’s my point here? Well, logically, these CPUs appear to share the same memory, but physically it’s more common at this point to have each CPU (or at least each socket) own a certain fraction of physical memory that it’s connected to directly; the rest, it can only access by asking the other socket (=NUMA). And then your hard drive (or SSD), which speaks SCSI or ATA or both, needs a controller to understand these protocols – often programmable. And, in the case of hard drives, usually also a DSP to decode/encode the stuff that’s on disk. And your GPU also has its own memory, and a wide array of vector processors that is (at this point) easily Turing-complete. And usually several other fairly general-purpose blocks you don’t even know about. In fact, for each general-purpose processor (with its own memory) in your system that you know about, there are probably 2 or 3 that you don’t. There are tiny ARM cores absolutely everywhere, sometimes where you’d least expect them, and Intel likes to use embedded 486s for the same kind of thing (they’re so small by now, you literally need a microscope to even see them).

We don’t really tend to think of them as such, but not only are all kinds of peripherals on network links these days, they also have general-purpose CPUs, often with substantially more processing power (and memory!) than the 286 @ 10MHz I learned programming on in 1990. And this is all still just talking about PCs – I haven’t even started on things like cars or planes.

So what’s my point? Well, just to say that what we think of as “computers” today are, in practice, already fairly large, heterogeneous clusters of different specialized smaller computers, and while only a small part of these computers is visible to the user, there’s a bunch of them lurking just below the surface. This goes not just for PCs, but also phones and tablets. And of course, absolutely everything you plug into a computer these days is – by itself – another computer. Even stuff you think of as “passive” components, like batteries and thumb drives.

And they’re all talking to each other over networks, most of which are starting to look pretty damn similar at this point – and with greatly varying degrees of implementation quality, eliciting both awe that anything works at all and dread at how easy it all is to exploit. So let me conclude by formulating an analogue to Zawinski’s Law of Software Envelopment:

Every data link standard converges to serial point-to-point links connected in a tiered-star topology and transporting packets. Those link standards which cannot so converge are replaced by ones which can.

rANS with static probability distributions

In the previous post, I wrote about rANS in general. The ANS family is, in essence, just a different design approach for arithmetic coders, with somewhat different trade-offs, strengths and weaknesses than existing coders. In this post, I am going to talk specifically about using rANS as a drop-in replacement for (static) Huffman coding: that is, we are encoding data with a known, static probability distribution for symbols. I am also going to assume a compress-once decode-often scenario: slowing down the encoder (within reason) is acceptable if doing so makes the decoder faster. It turns out that rANS is very useful in this kind of setting.

Review

Last time, we defined the rANS encoding and decoding functions, assuming a finite alphabet \mathcal{A} = \{ 0, \dots, n - 1 \} of n symbols numbered 0 to n-1.

C(s,x) := M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
D(x) := (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).

where F_s is the frequency of symbol s, B_s = \sum_{i=0}^{s-1} F_i is the sum of the frequencies of all symbols before s, and M = \sum_{i=0}^{n-1} F_i is the sum of all symbol frequencies. Then a given symbol s has (assumed) probability p_s = F_s / M.

Furthermore, as noted in the previous post, M can’t be chosen arbitrarily; it must divide L (the lower bound of our normalized interval) for the encoding/decoding algorithms we saw to work.

Given these constraints and the form of C and D, it’s very convenient to have M be a power of 2; this replaces the divisions and modulo operations in the decoder with bit masking and bit shifts. We also choose L as a power of 2 (which needs to be at least as large as M, since otherwise M can’t divide L).
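
For concreteness, here’s roughly what a single decode step looks like with a power-of-2 M. This is just a minimal sketch with made-up names (not code from any particular implementation); renormalization is left out, and the slot-to-symbol lookup is done with a plain table for now (alternatives are discussed in the “Symbol lookup” section below):

  #include <stdint.h>

  // One rANS decode step with M = 1 << scale_bits: the division and modulo
  // by M from D(x) turn into a shift and a mask. slot2sym[] maps a slot
  // (x mod M) to its symbol; freq[] and cum_freq[] hold the F_s and B_s.
  static uint32_t decode_step(uint32_t x, uint32_t scale_bits,
                              const uint8_t* slot2sym, const uint32_t* freq,
                              const uint32_t* cum_freq, uint8_t* out_sym)
  {
    uint32_t slot = x & ((1u << scale_bits) - 1); // x mod M
    uint8_t s = slot2sym[slot];                   // s(x mod M)
    *out_sym = s;
    // D(x): F_s * floor(x / M) + (x mod M) - B_s
    return freq[s] * (x >> scale_bits) + slot - cum_freq[s];
  }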

This means that, starting from a reference probability distribution, we need to approximate the probabilities as fractions with common denominator M. My colleague Charles Bloom just wrote a blog post on that very topic, so I’m going to refer you there for details on how to do this optimally.
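
If you just want something quick to play with, a crude way to do that quantization is to scale and round, keep every symbol that actually occurs at a frequency of at least 1, and dump the remaining rounding error on the most frequent symbol. This is only a naive sketch with made-up names (decidedly not the approach from Charles’ post); it assumes at least one nonzero count and can fail on pathological inputs, hence the assert:

  #include <assert.h>
  #include <stdint.h>

  // Naive quantization of raw counts to frequencies summing to M.
  static void quantize_freqs(const uint32_t* counts, uint32_t n, uint32_t M,
                             uint32_t* freq)
  {
    uint64_t total = 0;
    for (uint32_t i = 0; i < n; i++)
      total += counts[i];

    uint32_t most_frequent = 0;
    int64_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
      freq[i] = (uint32_t) (((uint64_t) counts[i] * M) / total);
      if (counts[i] > 0 && freq[i] == 0)
        freq[i] = 1; // don't round occurring symbols down to zero
      if (counts[i] > counts[most_frequent])
        most_frequent = i;
      sum += freq[i];
    }

    // Fix up the total so the frequencies sum to exactly M.
    int64_t corr = (int64_t) M - sum;
    assert((int64_t) freq[most_frequent] + corr > 0); // naive fixup can fail
    freq[most_frequent] = (uint32_t) (freq[most_frequent] + corr);
  }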

Getting rid of per-symbol divisions in the encoder

Making M a power of two removes the division/modulo operations in the decoder, but the encoder still has to perform them. However, note that we’re only ever dividing by the symbol frequencies F_s, which are known at the start of the encoding operation (in our “static probability distribution” setting). The question is, does that help?

You bet it does. A little known fact (amongst most programmers who aren’t compiler writers or bit hacking aficionados anyway) is that division of a p-bit unsigned integer by a constant can always be performed as fixed-point multiplication with a reciprocal, using 2p+1 bits (or less) of intermediate precision. This is exact – no round-off error involved. Compilers like to use this technique on integer divisions by constants, since multiplication (even long multiplication) is typically much faster than division.

There are several papers on how to choose the “magic constants” (with proofs); however, most of them are designed to be used in the code generator of a compiler. As such, they generally have several possible code sequences for division by constants, and try to use the cheapest one that works for the given divisor. This makes sense in a compiler, but not in our case, where the exact frequencies are not known at compile time and doing run-time branching between different possible instruction sequences would cost more than it saves. Therefore, I would suggest sticking with Alverson’s original paper “Integer division using reciprocals”.

The example code I linked to implements this approach, replacing the division/modulo pair with a pair of integer multiplications; when using this approach, it makes sense to limit the state variable to 31 bits (or 63 bits on 64-bit architectures): as said before, the reciprocal method requires 2p+1 bits working precision for worst-case divisors, and reducing the range by 1 bit enables a faster (and simpler) implementation than would be required for a full-range variant (especially in C/C++ code, where multi-precision arithmetic is not easy to express). Note that handling the case F_s=1 requires some extra work; details are explained in the code.
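
To make this concrete, here’s a minimal sketch of the approach (made-up names, not the exact code from the example implementation): it assumes a 31-bit state and 2 ≤ F_s < 2^31, with the F_s=1 special case omitted as mentioned above.

  #include <assert.h>
  #include <stdint.h>

  // Per-symbol data precomputed once at table-build time (this part may divide).
  typedef struct {
    uint32_t freq;      // F_s
    uint32_t start;     // B_s
    uint32_t rcp;       // fixed-point reciprocal of freq
    uint32_t rcp_shift; // shift that goes with rcp
  } EncSymbol;

  static void enc_symbol_init(EncSymbol* sym, uint32_t freq, uint32_t start)
  {
    uint32_t shift = 0;
    while (freq > (1u << shift)) // smallest shift with freq <= 2^shift
      shift++;

    sym->freq = freq;
    sym->start = start;
    // ceil(2^(shift+31) / freq); fits in 32 bits because freq > 2^(shift-1).
    sym->rcp = (uint32_t) (((1ull << (shift + 31)) + freq - 1) / freq);
    sym->rcp_shift = shift - 1;
  }

  // One encoding step x -> C(s,x), division-free (renormalization not shown).
  static uint32_t enc_step(uint32_t x, const EncSymbol* sym, uint32_t M)
  {
    assert(x < (1u << 31)); // 31-bit state, so the reciprocal trick is exact
    // q = floor(x / F_s), computed as floor(x * rcp / 2^(31 + shift)).
    uint32_t q = (uint32_t) (((uint64_t) x * sym->rcp) >> 32) >> sym->rcp_shift;
    // C(s,x) = M*q + B_s + (x - q*F_s) = x + B_s + q*(M - F_s): two multiplies.
    return x + sym->start + q * (M - sym->freq);
  }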

Symbol lookup

There’s one important step in the decoder that I haven’t talked about yet: mapping from the “slot index” x \bmod M to the corresponding symbol index. In normal rANS, each symbol covers a contiguous range of the “slot index” space (by contrast to say tANS, where the slots for any given symbol are spread relatively uniformly across the slot index space). That means that, if all else fails, we can figure out the symbol ID using a binary search in \lceil\log_2 n\rceil steps (remember that n is the size of our alphabet) from the cumulative frequency table (the B_s, which take O(n) space) – independent of the size of M. That’s comforting to know, but doing a binary search per symbol is, in practice, quite expensive compared to the rest of the decoding work we do.
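
For reference, that binary search is only a handful of lines; a sketch with made-up names, where cum_freq[] holds the B_s plus one extra entry cum_freq[n] = M at the end:

  #include <stdint.h>

  // Find s such that cum_freq[s] <= slot < cum_freq[s+1], in ceil(log2(n)) steps.
  static uint32_t find_symbol(const uint32_t* cum_freq, uint32_t n, uint32_t slot)
  {
    uint32_t lo = 0, hi = n; // invariant: cum_freq[lo] <= slot < cum_freq[hi]
    while (hi - lo > 1) {
      uint32_t mid = (lo + hi) / 2;
      if (cum_freq[mid] <= slot)
        lo = mid;
      else
        hi = mid;
    }
    return lo;
  }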

At the other extreme, we can just prepare a look-up table mapping from the cumulative frequency to the corresponding symbol ID. This is very simple (and the technique used in the example code) and theoretically constant-time per symbol, but it requires a table with M entries – and if the table ends up being too large to fit comfortably in a core’s L1 cache, real-world performance (although still technically bounded by a constant per symbol) can get quite bad. Moving in the other direction, if M is small enough, it can make sense to store the per-symbol information in the M-indexed table too, and avoid the extra indirection; I would not recommend this far beyond M=2^12 though.

Anyway, that gives us two design points: we can use O(n) space, at a cost of O(\log n) per symbol lookup; or we can use O(M) space, with O(1) symbol lookup cost. Now what we’d really like is to get O(1) symbol lookup in O(n) space, but sadly that option’s not on the menu.

Or is it?

The alias method

To make a long story short, I’m not aware of any way to meet our performance goals with the original unmodified rANS algorithm; however, we can do much better if we’re willing to relax our requirements a bit. Notably, there’s no deep reason for us to require that the slots assigned to a given symbol s be contiguous; we already know that e.g. tANS works in a much more relaxed setting. So let’s assume, for the time being, that we can rearrange our slot to symbol mapping arbitrarily (we’ll have to check if this is actually true later, and also work through what it means for our encoder). What does that buy us?

It buys us all we need to meet our performance goals, it turns out (props to my colleague Sean Barrett, who was the first one to figure this out, in our internal email exchanges anyway). As the section title says, the key turns out to be a stochastic sampling technique called the “alias method”. I’m not gonna explain the details here and instead refer you to this short introduction (written by a computational evolutionary geneticist, on randomly picking base pairs) and “Darts, Dice and Coins”, a much longer article that covers multiple ways to sample from a nonuniform distribution (by the way, note that the warnings about numerical instability that often accompany descriptions of the alias method need not worry us; we’re dealing with integer frequencies here so there’s no round-off error).

At this point, you might be wondering what the alias method, a technique for sampling from a non-uniform discrete probability distribution, has to do with entropy (de)coding. The answer is that the symbol look-up problem is essentially the same thing: we have a “random” value x \bmod M from the interval [0,M-1], and a matching non-uniform probability distribution (our symbol frequencies). Drawing symbols according to that distribution defines a map from [0,M-1] to our symbol alphabet, which is precisely what we need for our decoding function.

So what does the alias method do? Well, if you followed the link to the article I mentioned earlier, you get the general picture: it partitions the probabilities for our n-symbol alphabet into n “buckets”, such that each bucket i references at most 2 symbols (one of which is symbol i), and the probabilities within each bucket sum to the same value (namely, 1/n). This is always possible, and there is an algorithm (due to Vose) which determines such a partition in O(n) time. More generally, we can do so for any N≥n, by just adding some dummy symbols with frequency 0 at the end. In practice, it’s convenient to have N be a power of two, so for arbitrary n we would just pick N to be the smallest power of 2 that is ≥n.

Translating the sampling terminology to our rANS setting: we can subdivide our interval [0,M-1] into N sub-intervals (“buckets”) of equal size k=M/N, such that each bucket i references at most 2 distinct symbols, one of which is symbol i. We also need to know what the other symbol referenced in this bucket is – alias[i], the “alias” that gives the method its name – and the position divider[i] of the “dividing line” between the two symbols.
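
Here’s a sketch of that construction (Vose’s algorithm, adapted to our integer-frequency setting), producing divider[] and alias[] directly. The names are made up, and for brevity it assumes N ≤ 256, n ≤ N, and frequencies that sum to M with N dividing M:

  #include <stdint.h>

  // Build divider[] and alias[] for N buckets of k = M/N slots each, in O(N).
  static void build_alias_table(const uint32_t* freq, uint32_t n, uint32_t N,
                                uint32_t M, uint32_t* divider, uint32_t* alias)
  {
    uint32_t k = M / N;      // slots per bucket
    uint32_t remaining[256]; // slots each symbol still needs placed
    uint32_t small[256], large[256];
    uint32_t ns = 0, nl = 0;

    for (uint32_t i = 0; i < N; i++) {
      remaining[i] = (i < n) ? freq[i] : 0; // pad with frequency-0 symbols
      if (remaining[i] < k) small[ns++] = i; else large[nl++] = i;
    }

    // Repeatedly top off one under-full bucket with slots from a large symbol.
    while (ns > 0) {
      uint32_t s = small[--ns];
      uint32_t l = large[--nl];
      divider[s] = s * k + remaining[s]; // bucket s: remaining[s] slots of s,
      alias[s] = l;                      // the rest filled by symbol l
      remaining[l] -= k - remaining[s];
      if (remaining[l] < k) small[ns++] = l; else large[nl++] = l;
    }
    // Any buckets left over are filled entirely by their own symbol.
    while (nl > 0) {
      uint32_t l = large[--nl];
      divider[l] = (l + 1) * k; // no alias portion
      alias[l] = l;             // unused
    }
  }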

With these two tables, determining the symbol ID from x is quick and easy:

  uint xM = x % M; // bit masking (power-of-2 M)
  uint bucket_id = xM / K; // shift (power-of-2 M/N!)
  uint symbol = bucket_id;
  if (xM >= divider[bucket_id]) // primary symbol or alias?
    symbol = alias[bucket_id];

This is O(1) time and O(N) = O(n) space (for the “divider” and “alias” arrays), as promised. However, this isn’t quite enough for rANS: remember that for our decoding function D, we need to know not just the symbol ID, but also which of the (potentially many) slots assigned to that symbol we ended up in; with regular rANS, this was simple since all slots assigned to a symbol are sequential, starting at slot B_s:
D(x) = (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).
Here, the (x \bmod M) - B_s part is the number we need. Now with the alias method, the slot IDs assigned to a symbol aren’t necessarily contiguous anymore. However, within each bucket, the slot IDs assigned to a symbol are sequential – which means that instead of the cumulative frequencies B_s, we now have two separate start values per bucket: one for the bucket’s primary symbol and one for its alias. This allows us to define the complete “alias rANS” decoding function:

  // s, x = D(x) with "alias rANS"
  uint xM = x % M;
  uint bucket_id = xM / K;
  uint symbol, bias;
  if (xM < divider[bucket_id]) { // primary symbol or alias?
    symbol = bucket_id;
    bias = primary_start[bucket_id];
  } else {
    symbol = alias[bucket_id];
    bias = alias_start[bucket_id];
  }
  x = (x / M) * freq[symbol] + xM - bias;

And although this code is written with branches for clarity, it is in fact fairly easy to do branch-free. We gained another two tables indexed with the bucket ID; generating them is another straightforward linear pass over the buckets: we just need to keep track of how many slots we’ve assigned to each symbol so far. And that’s it – this is all we need for a complete “alias rANS” decoder.
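
In code, that pass might look as follows; again just a sketch with made-up names. It walks the buckets in index order, primary portion before alias portion, which is also the enumeration the encoder has to use later; slots_used[] needs N zero-initialized entries:

  #include <stdint.h>

  // For each bucket, record the bias that turns a slot index back into
  // "which occurrence of this symbol is it", as used in the decoder above.
  static void build_start_tables(const uint32_t* divider, const uint32_t* alias,
                                 uint32_t N, uint32_t k, uint32_t* slots_used,
                                 uint32_t* primary_start, uint32_t* alias_start)
  {
    for (uint32_t i = 0; i < N; i++) {
      uint32_t base = i * k;
      // Primary portion: slots [i*k, divider[i]) belong to symbol i.
      primary_start[i] = base - slots_used[i];
      slots_used[i] += divider[i] - base;
      // Alias portion: slots [divider[i], (i+1)*k) belong to alias[i].
      alias_start[i] = divider[i] - slots_used[alias[i]];
      slots_used[alias[i]] += base + k - divider[i];
    }
  }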

However, there’s one more minor tweak we can make: note that the only part of the computation that actually depends on symbol is the evaluation of freq[symbol]; if we store the frequencies for both symbols in each alias table bucket, we can get rid of the dependent look-ups. This can be a performance win in practice; on the other hand, it does waste a bit of extra memory on the alias table, so you might want to skip on it.

Either way, this alias method allows us to perform quite fast (though not as fast as a fully-unrolled table for small M) symbol look-ups, for large M, with memory overhead (and preparation time) proportional to n. That’s quite cool, and might be particularly interesting in cases where you either have relatively small alphabets (say on the order of 15-25 symbols), need lots of different tables, or frequently switch between tables.

Encoding

However, we haven’t covered encoding yet. With regular rANS, encoding is easy, since – again – the slot ranges for each symbol are contiguous; the encoder just does
C(s,x) = M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
where B_s + (x \bmod F_s) is the slot id corresponding to the (x \bmod F_s)‘th appearance of symbol s.

With alias rANS, each symbol may have its slots distributed across multiple, disjoint intervals – up to N of them. And the encoder now needs to map (x \bmod F_s) to a corresponding slot index that will decode correctly. One way to do this is to just keep track of the mapping as we build the alias table; this takes O(M) space and is O(1) cost per symbol. Another is to keep a sorted list of subintervals (and their cumulative sizes) assigned to each symbol; this takes only O(N) space, but adds an O(\log_2 N) (worst-case) lookup per symbol in the encoder. Sound familiar?
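
For illustration, here’s a sketch of the first option, the O(M) table, filled using the same bucket walk as the decoder-side tables above (made-up names once more; cum_freq[] holds the B_s, slot_of[] has M entries, and slots_used[] is a fresh array of zeroed per-symbol counters):

  #include <stdint.h>

  // slot_of[B_s + j] = slot index used for the j-th occurrence of symbol s.
  static void build_encoder_slots(const uint32_t* divider, const uint32_t* alias,
                                  uint32_t N, uint32_t k, const uint32_t* cum_freq,
                                  uint32_t* slots_used, uint32_t* slot_of)
  {
    for (uint32_t i = 0; i < N; i++) {
      for (uint32_t pos = i * k; pos < divider[i]; pos++)       // primary part
        slot_of[cum_freq[i] + slots_used[i]++] = pos;
      for (uint32_t pos = divider[i]; pos < (i + 1) * k; pos++) // alias part
        slot_of[cum_freq[alias[i]] + slots_used[alias[i]]++] = pos;
    }
  }

  // The encoder then computes C(s,x) = M * (x / F_s) + slot_of[B_s + (x % F_s)].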

In short, using the alias method doesn’t really solve the symbol lookup problem for large M; or, more precisely, it solves the lookup problem on the decoder side, but at the cost of adding an equivalent problem on the encoder side. What this means is that we have to pick our poison: faster encoding (at some extra cost in the decoder), or faster decoding (at some extra cost in the encoder). This is fine, though; it means we get to make a trade-off, depending on which of the two steps is more important to us overall. And as long as we are in a compress-once decompress-often scenario (which is fairly typical), making the decoder faster at some reasonable extra cost in the encoder is definitely useful.

Conclusion

We can exploit static, known probabilities in several ways for rANS and related coders: for encoding, we can precompute the right “magic values” to avoid divisions in the hot encoding loop; and if we want to support large M, the alias method enables fast decoding without generating a giant table with M entries – with an O(n) preprocessing step (where n is the number of symbols), we can still support O(1) symbol decoding, albeit with a (slightly) higher constant factor.

I’m aware that this post is somewhat hand-wavey; the problem is that while Vose’s algorithm and the associated post-processing are actually quite straightforward, there’s a lot of index manipulation, and I find the relevant steps to be quite hard to express in prose without the “explanation” ending up harder to read than the actual code. So instead, my intent was to convey the “big picture”; a sample implementation of alias table rANS, with all the details, can be found – as usual – on Github.

And that’s it for today – take care!

rANS notes

We’ve been spending some time at RAD looking at Jarek Duda’s ANS/ABS coders (paper). This is essentially a new way of doing arithmetic coding with some different trade-offs from the standard methods. In particular, they have a few sweet spots for situations that were (previously) hard to handle efficiently with regular arithmetic coding.

The paper covers numerous algorithms. Of those given, I think what Jarek calls “rANS” and “tANS” are the most interesting ones; there’s plenty of binary arithmetic coders already and “uABS”, “rABS” or “tABS” do not, at first sight, offer any compelling reasons to switch (that said, there’s some non-obvious advantages that I’ll get to in a later post).

Charles has already posted a good introduction; I recommend you start there. Charles has also spent a lot of time dealing with the table-based coders, and the table-based construction allows some extra degrees of freedom that make it hard to see what’s actually going on. In this post, I’m going to mostly talk about rANS, the range coder equivalent. Charles already describes it up to the point where you try to make it “streaming” (i.e. finite precision); I’m going to continue from there.

Streaming rANS

In Charles’ notation (which I’m going to use because the paper uses both indexed upper-case I’s and indexed lower-case l’s, which would make reading this with the default font a disaster), we have two functions now: The coding function C and the decoding function D defined as

C(s,x) := M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
D(x) := (s, F_s \lfloor x/M \rfloor + (x \bmod M) - B_s) where s = s(x \bmod M).

D determines the encoded symbol s by looking at (x mod M) and looking up the corresponding value in our cumulative frequency table (s is the unique s such that B_s \le x \bmod M < B_s + F_s).

Okay. If you're working with infinite-precision integers, that's all you need, but that's not exactly suitable for fast (de)compression. We want a way to perform this using finite-precision integers, which means we want to limit the size of x somehow. So what we do is define a "normalized" interval

I := \{ L, L+1, \dots, bL - 1 \} = [L:bL)

The [L:bL) thing on the right is the notation I'll be using for a half-open interval of integers. b is the radix of our encoder; b=2 means we're emitting bits at a time, b=256 means we're emitting bytes, and so forth. Okay. We now want to keep x in this range. Too large x don't fit in finite-precision integers, but why not allow small x? The (hand-wavey) intuition is that too small x don't contain enough information; as Charles mentioned, x essentially contains about \log_2(x) bits of information. If x < 4, it can only be one of four distinct values, and as you saw above, the value of x (mod M) directly determines which symbol we're coding. So we want x just right: not too large, not too small. Okay. Now let's look at how a corresponding decoder would look:

  while (!done) {
    // Loop invariant: x is normalized.
    assert(L <= x && x < b*L);

    // Decode a symbol
    s, x = D(x);

    // Renormalization: While x is too small,
    // keep reading more bits (nibbles, bytes, ...)
    while (x < L)
      x = x*b + readFromStream();
  }

Turns out that's actually our decoding algorithm, period. What we need to do now is figure out how the corresponding encoder looks. As long as we're only using C and D with big integers, it's simple; the two are just inverses of each other. Likewise, we want our encoder to be the inverse of our decoder - exactly. That means we have to do the inverse of the operations the decoder does, in the opposite order. Which means our encoder has to look something like this:

  while (!done) {
    // Inverse renormalization: emit bits/bytes etc.
    while (???) {
      writeToStream(x % b);
      x /= b;
    }

    // Encode a symbol
    x = C(s, x);
 
    // Loop invariant: x is normalized
    assert(L <= x && x < b*L);
  }

So far, this is purely mechanical. The only question is what happens in the "???" - when exactly do we emit bits? Well, for the encoder and decoder to be inverses of each other, the answer has to be "exactly when the decoder would read them". Well, the decoder reads bits whenever the normalization invariant is violated after decoding a symbol, to make sure it holds for the next iteration of the loop. The encoder, again, needs to do the opposite - we need to proactively emit bits before coding s to make sure that, after we've applied C, x will be normalized.

In fact, that's all we need for a first sketch of renormalization:

  while (!done) {
    // Keep trying until we succeed
    for (;;) {
      x_try = C(s, x);
      if (L <= x_try && x_try < b*L) { // ok?
        x = x_try;
        break;
      } else {
        // Shrink x a bit and try again
        writeToStream(x % b);
        x /= b;
      }
    }

    x = x_try;
  }

Does this work? Well, it might. It depends. We certainly can't guarantee it from where we're standing, though. And even if it does, it's kind of ugly. Can't we do better? What about the normalization - I've just written down the normalization loops, but just because both decoder and encoder maintain the same invariants doesn't necessarily mean they are in sync. What if at some point of the loop, there is more than one possible normalized configuration - can this happen? Plus there's some hidden assumptions in here: the encoder, by only ever shrinking x before C, assumes that C always causes x to increase (or at least never causes x to decrease); similarly, the decoder assumes that applying D won't increase x.

And I'm afraid this is where the proofs begin.

Streaming rANS: the proofs (part 1)

Let's start with the last question first: does C always increase x? It certainly looks like it might, but there's floors involved - what if there's some corner case lurking? Time to check:

C(s,x)
= M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s)
= F_s \lfloor x/F_s \rfloor + (x \bmod F_s) + (M - F_s) \lfloor x/F_s \rfloor + B_s
= x + (M - F_s) \lfloor x/F_s \rfloor + B_s
\ge x

since \lfloor x/F_s \rfloor, B_s, and M - F_s are all non-negative (the last because F_s < M - each symbol's frequency is less than the sum of all symbol frequencies). So C indeed always increases x, or more precisely, never decreases it.

Next up: normalization. Let's tackle the decoder first. First off, is normalization in the decoder unique? That is, if x \in I, is that the only normalized x it could be at that stage in the decoding process? Yes, it is: x \in I = [L:bL), so x \ge L, so

bx + d \ge bx \ge bL where d \in [0:b) arbitrary

That is, no matter what bit / byte / whatever the decoder would read next (that's 'd'), running through the normalization loop once more would cause x to become too large; there's only one way for x to be normalized. But do we always end up with a normalized x? Again, yes. Suppose that x < L, then (because we're working in integers) x \le L - 1, and hence

bx + d \le bL - b + d \le bL - 1 (again d \in [0:b) arbitrary)

The same kind of argument works for the encoder, which floor-divides by b instead of multiplying by b. The key here is that our normalization interval I has a property the ANS paper calls "b-uniqueness": it's of the form I=[k:bk) for some positive integer k (its upper bound is b times its lower bound). Any process that grows (or shrinks) x by factors of b can't "skip over" I (as in, transition from being larger than the largest value in I to smaller than the smallest value in I or vice versa) in a single step. Simultaneously, there's also never two valid states the encoder/decoder could be in (which could make them go out of sync: both encoder and decoder think they're normalized, but they're at different state values).

To elaborate, suppose we have b=2 and some interval where the ratio between lower and upper bound is a tiny bit less than 2: say I' = [k:2k-1) = \{ k, k+1, \dots, 2k-2 \}. There's just one value missing. Now suppose the decoder is in state x=k-1 and reads a bit, which turns out to be 1. Then the new state is x' = 2x+1 = 2(k-1) + 1 = 2k - 1 \not\in I' - we overshot! We were below the lower bound of I', and yet with a single bit read, we're now past its upper bound. I' is "too small".

Now let's try the other direction; again b=2, and this time we make the ratio between upper and lower bound a bit too high: set I' = [k:2k+1) = \{ k, k+1, \dots, 2k \}. There's no risk of not reaching a state in that interval now, but now there is ambiguity. Where's the problem? Suppose the encoder is in state x=2k. Coding any symbol will require renormalization to "shift out" one bit. The encoder writes that bit (a zero), goes to state x=k, and moves on. But there's a problem: the decoder, after decoding that symbol, will be in state x=k too. And it doesn't know that the encoder got there from state x=2k by shifting a bit; all the decoder knows is that it's in state x=k \in I', which is normalized and doesn't require reading any more bits. So the encoder has written a bit that the decoder doesn't read, and now the two go out of sync.

Long story short: if the state intervals involved aren't b-unique, bad things happen. And on the subject of bad things happening, our encoder tries to find an x such that C(s,x) is inside I by shrinking x - but what if that process doesn't find a suitable x? We need to know a bit more about which values of x lead to C(s,x) being inside I, which leads us to the sets

I_s := \{ x | C(s,x) \in I \}

i.e. the set of all x such that encoding 's' puts us into I again. If all these sets turn out to be b-unique intervals, we're good. If not, we're in trouble. Time for a brief example.

Intermission: why b-uniqueness is necessary

Let's pick L=5, b=2 and M=3. We have just two symbols: 'a' has probability P_a=2/3, and 'b' has probability P_b=1/3, which we turn into F_a = 2, B_a = 0, F_b = 1, B_b = 2. Our normalization interval is I = [5:2\cdot5) = \{5, 6, 7, 8, 9\}. By computing C(s,x) for the relevant values of s and x, we find out that

I_a = \{ 4,5,6 \} = [4:7)
I_b = \{ 1,2 \} = [1:3)

Uh-oh. Neither of these two intervals is b-unique. I_a is too small, and I_b is too big. So what goes wrong?

Well, suppose that we're in state x=7 and want to encode an 'a'. 7 is not in I_a (too large). So the encoder emits the LSB of x, divides by 2 and... now x=3. Well, that's not in I_a either (too small), and shrinking it even further won't help that. So at this point, the encoder is stuck; there's no x it can reach that works.

Proofs (part 2): a sufficient condition for b-uniqueness

So we just saw that in certain scenarios, rANS can just get stuck. Is there anything we can do to avoid it? Yes: the paper points out that the embarrassing situation we just ran into can't happen when M (the sum of all symbol frequencies, the denominator in our probability distribution) divides L, our normalization interval lower bound. That is, L=kM for some positive integer k. It doesn't give details, though; so, knowing this, can we prove anything about the I_s that would help us? Well, let's just look at the elements of I_s and see what we can do:

I_s = \{ x | C(s,x) \in I \}

let's work on that condition:

C(s,x) \in I
\Leftrightarrow L \le C(s,x) < bL
\Leftrightarrow L \le M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s) < bL

at this point, we can use that L=kM and divide by M:

\Leftrightarrow kM \le M \lfloor x/F_s \rfloor + B_s + (x \bmod F_s) < bkM
\Leftrightarrow k \le \lfloor x/F_s \rfloor + (B_s + (x \bmod F_s))/M < bk

Now, for arbitrary real numbers r and natural numbers n we have that

n \le r \Leftrightarrow n \le \lfloor r \rfloor \quad \textrm{and} \quad r < n \Leftrightarrow \lfloor r \rfloor < n

Using this, we get:

\Leftrightarrow k \le \lfloor \lfloor x/F_s \rfloor + (B_s + (x \bmod F_s))/M \rfloor < bk

note the term in the outer floor bracket is the sum of an integer and a real value inside [0,1), since 0 \le B_s + (x \bmod F_s) < M, so we can simplify drastically

\Leftrightarrow k \le \lfloor x/F_s \rfloor < bk
\Leftrightarrow k \le x/F_s < bk
\Leftrightarrow kF_s \le x < bkF_s
\Leftrightarrow x \in [kF_s:bkF_s)

where we applied the floor identities above again and then just multiplied by F_s. Note that the result is an interval of integers with its (exclusive) upper bound being equal to b times its (inclusive) lower bound, just like we need - in other words, assuming that L=kM, all the I_s are b-unique and we're golden (this is mentioned in the paper in section 3.3, but not proven, at least not in the Jan 6 2014 version).

Note that this also gives us a much nicer expression to check for our encoder. In fact, we only need the upper bound (due to b-uniqueness, we know there's no risk of us falling through the lower bound), and we end up with the encoding function

  while (!done) {
    // Loop invariant: x is normalized
    assert(L <= x && x < b*L);

    // Renormalize
    x_max = (b * (L / M)) * freq[s]; // all but freq[s] constant
    while (x >= x_max) {
      writeToStream(x % b);
      x /= b;
    }

    // Encode a symbol
    // x = C(s, x);
    x = M * (x / freq[s]) + (x % freq[s]) + base[s];
  }

No "???"s left - we have a "streaming" (finite-precision) version of rANS, which is almost like the arithmetic coders you know and love (and in fact quite closely related) except for the bit where you need to encode your data in reverse (and reverse the resulting byte stream).

I put an actual implementation on Github for the curious.

Some conclusions

This is an arithmetic coder, just a weird one. The reverse encoding seems like a total pain at first, and it kind of is, but it comes with a bunch of really non-obvious but amazing advantages that I'll cover in a later post (or just read the comments in the code). The fact that M (the sum of all frequencies) has to divide L is a serious limitation, but I don't (yet?) see any way to work around that while preserving b-uniqueness. So the compromise is to pick M and L to be both powers of 2. This makes the decoder's division/mod with M fast. The power-of-2 limitation makes rANS really bad for adaptive coding (where you're constantly updating your stats, and resampling to a power-of-2-sized distribution is expensive), but hey, so is Huffman. As a Huffman replacement, it's really quite good.

In particular, it supports a divide-free decoder (and actually no per-symbol division in the encoder either, if you have a static table; see my code on Github, RansEncPutSymbol in particular). This is something you can't (easily) do with existing multi-symbol arithmetic coders, and is a really cool property to have, because it really puts it a lot closer to being a viable Huffman replacement in a lot of places that do care about the speed.

If you look at the decoder, you'll notice that its entire behavior for a decoding step only depends on the value of x at the beginning: figure out the symbol from the low-order bits of x, go to a new state, read some bits until we're normalized again. This is where the table-based versions (tANS etc.) come into play: you can just tabulate their behavior! To make this work, you want to keep b and L relatively small. Then you just make a table of what happens in every possible state.

Interestingly, because these tables really do tabulate the behavior of a "proper" arithmetic coder, they're compatible: if you have two table-baked distributions that use the same values of b and L (i.e. the same interval I), you can switch between them freely; the states mean the same in both of them. It's not at all obvious that it's even possible for a table-based encoder to have this property, so it's even cooler that it comes with no onerous requirements on the distribution!

That said, as interesting as the table-based schemes are, I think the non-table-based variant (rANS) is actually more widely useful. Having small tables severely limits your probability resolution (and range precision), and big tables are somewhat dubious: adds, integer multiplies and bit operations are fine, since we can do those quickly, and more compute power is a safe thing to bet on right now (memory access is not). (I do have some data points on what you can do on current HW, but I'll get there in a later post.)

As said, rANS has a bunch of really cool, unusual properties, some of which I'd never have expected to see in any practical entropy coder, with cool consequences. I'll put that in a separate post, though - this one is long (and technical) enough already. Until then!

Index compression follow-up

I got a few mails about the previous post, including some pretty cool suggestions that I figured were worth posting.

Won Chun (who wrote a book chapter for OpenGL Insights on the topic, “WebGL Models: End-to-End”) writes: (this is pasted together from multiple mails; my apologies)

A vertex-cache optimized list is still way more compressible than random: clearly there is lots of coherence to exploit. In fact, that’s pretty much what the edge-sharing-quad business is (OK, some tools might do this because they are working naturally in quads, but you get lots of these cases with any vertex cache optimizer).

So you can also get a pretty big win by exploiting pre-transform vertex cache optimization. I call it “high water mark encoding.” Basically: for such a properly optimized index list, the next index you see is either (a) one you’ve seen before or (b) one higher than the current highest seen index. So, instead of encoding actual indices, you can instead encode them relative to this high water mark (the largest index yet to be seen, initialized to 0). You see “n” and that corresponds to an index of (high water mark – n). When you see 0, you also increment high watermark.

The benefit here is that the encoded indices are very small, and then you can do some kind of varint coding, then your encoded indices are a bit more than a byte on average. If you plan on zipping later, then make sure the varints are byte-aligned and LSB-first.

There are lots of variants on this, like:

  • keep a small LRU (~32 elements, or whatever transform cache size you’ve optimized for) of indices, and preferentially reference indices based on recent use (e.g. values 1-32), rather than actual value (deltas are offset by 32),
  • make the LRU reference edges rather than verts, so a LRU “hit” gives you two indices instead of one,
  • do two-level high water mark encoding, which makes it efficient to store in multi-indexed form (like .OBJ, which has separate indices per attribute) and decode into normal single-indexed form

And also, these approaches let you be smart about attribute compression as well, since it gives you useful hints; e.g. any edge match lets you do parallelogram prediction, multi-indexing can implicitly tell you how to predict normals from positions.

“High watermark encoding”

I really like this idea. Previously I’d been using simple delta encoding on the resulting index lists; that works, but the problem with delta coding is that a single outlier will produce two large steps – one to go from the current region to the outlier, then another one to get back. The high watermark scheme is almost as straightforward as straight delta coding and avoids this case completely.

Now, if you have an index list straight out of vertex cache optimization and vertex renumbering, the idea works as described. However, with the hybrid tri/paired-tri encoding I described last time, we have to be a bit more careful. While the original index list will indeed have each index be at most 1 larger than the highest index we’ve seen so far, our use of “A ≥ B” to encode whether the next set of indices describes a single triangle or a pair means that we might end up having to start from the second or third vertex of a triangle, and consequently see a larger jump than just 1. Luckily, the fix for this is simple – rather than keeping the high watermark always 1 higher than the largest vertex index we’ve seen so far, we keep it N higher where N is the largest possible “step” we can have in the index list. With that, the transform is really easy, so I’m just going to post my code in full:

static void watermark_transform(std::vector<int>& out_inds,
    const std::vector<int>& in_inds, int max_step)
{
    int hi = max_step - 1; // high watermark
    out_inds.clear();
    out_inds.reserve(in_inds.size());
    for (int v : in_inds)
    {
        assert(v <= hi);
        out_inds.push_back(hi - v);
        hi = std::max(hi, v + max_step);
    }
}

and the inverse is exactly the same, with the push_back in the middle replaced by the two lines

v = hi - v;
out_inds.push_back(v);

So what’s the value of N (aka max_step in the code), the largest step that a new index can be from the highest index we’ve seen so far? Well, for the encoding described last time, it turns out to be 3:

  • When encoding a single triangle, the worst case is a triangle with all-new verts. Suppose the highest index we’ve seen so far is k, and the next triangle has indices (k+1,k+2,k+3). Our encoding for single triangles requires that the first index be larger than the second one, so we would send this triangle as (k+3,k+1,k+2). That’s a step of 3.
  • For a pair of triangles, we get 4 new indices. So it might seem like we might get a worst-case step of 4. However, we know that the two triangles share an edge; and for that to be the case, the shared edge must have been present in the first triangle. Furthermore, we require that the smaller of the two indices be sent first (that’s what flags this as a paired tri). So the worst cases we can have for the first two indices are (k+2,k+3) and (k+1,k+3), both of which have a largest step size of 2. After the first two indices, we only have another two indices to send; worst-case, they are both new, and the third index is larger than the fourth. This corresponds to a step size of 2. All other configurations have step sizes ≤1.

And again, that’s it. Still very simple. So does it help?

Results

Let’s dig out the table again (new entries bold, all percentages relative to the “Vertex cache optimized” row):

Stage                    Size             .zip size        .7z size
Original mesh            6082k            3312k            2682k
Vertex cache optimized   6082k            2084k            1504k
Vcache opt, watermark    6082k            1808k (-13.2%)   1388k (-7.7%)
Postprocessed            4939k (-18.8%)   1830k (-12.2%)   1340k (-10.9%)
Postproc, watermark      4939k (-18.8%)   1563k (-25.0%)   1198k (-20.3%)

In short: the two techniques work together perfectly and very nicely complement each other. This is without any varint encoding by the way, still sending raw 32-bit indices. Variable-sized integer encoding would probably help, but I haven’t checked how much. The code on Github has been updated in case you want to play around with it.
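
For reference, the kind of byte-aligned, LSB-first variable-length integer coding Won suggests only takes a few lines; this sketch (not part of the code on Github) stores 7 payload bits per byte and uses the top bit as a “more bytes follow” flag:

#include <stddef.h>
#include <stdint.h>

static size_t varint_write(uint8_t* out, uint32_t v)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t) (v & 0x7f) | 0x80; // 7 payload bits, "more" flag set
        v >>= 7;
    }
    out[n++] = (uint8_t) v;                     // final byte, "more" flag clear
    return n;
}

static size_t varint_read(const uint8_t* in, uint32_t* v)
{
    uint32_t result = 0;
    int shift = 0;
    size_t n = 0;
    do {
        result |= (uint32_t) (in[n] & 0x7f) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *v = result;
    return n;
}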

Summary

I think this is at a fairly sweet spot in terms of compression ratio vs. decoder complexity. Won originally used this for WebGL, using UTF-8 as his variable-integer encoding: turn the relative indices into Unicode code points, encode the result as UTF-8. This is a bit hacky and limits you to indices that use slightly more than 20 bits (still more than a million, so probably fine for most meshes you’d use in WebGL), but it does mean that the Browser can handle the variable-sized decoding for you (instead of having to deal with byte packing in JS code).

Overall, this approach (postprocess + high watermark) gives a decoder that is maybe 30 lines or so of C++ code, and which does one linear pass over the data (the example program does two passes to keep things clearer, but they can be combined without problems) with no complicated logic whatsoever. It’s simple to get right, easy to drop in, and the results are quite good.

It is not, however, state of the art for mesh compression by any stretch of the imagination. This is a small domain-specific transform that can be applied to fully baked and optimized game assets to make them a bit smaller on disk. I also did not cover vertex data; this is not because vertex data is unimportant, but simply because, so far at least, I’ve not seen any mesh compressors that do something besides the obvious (that is, quantization and parallelogram prediction) for vertex data.

Finally, if what you actually want is state-of-the-art triangle mesh compression, you should look elsewhere. This is outside the scope of this post, but good papers to start with are “Triangle Mesh Compression” by Touma and Gotsman (Proceedings of Graphics Interface, Vancouver, June 1998) and “TFAN: A low complexity 3D mesh compression algorithm” by Mamou, Zaharia, and Prêteux (Computer Animation and Virtual Worlds 20.2‐3, 2009). I’ve played around with the former (the character mesh in Candytron was encoded using an algorithm descended from it) but not the latter; however, Touma-Gotsman and descendants are limited to 2-manifold meshes, whereas TFAN promises better compression and handles arbitrary topologies, so it looks good on paper at least.

Anyway; that’s it for today. Thanks to Won for his mail! And if you should use this somewhere or figure out a way to get more mileage out of it without making it much more complicated, I’d absolutely love to know!

Simple lossless(*) index buffer compression

(Almost) everyone uses indexed primitives. At this point, the primary choice is between indexed triangle lists, which are flexible but always take 3 indices per triangle, and indexed triangle strips; since your meshes are unlikely to be one big tri strip, that’s gonna involve primitive restarts of some kind. So for a strip with N triangles, you’re generally gonna spend 3+N indices – 2 indices to “prime the pump”, after which every new index will emit a new triangle, and finally a single primitive restart index at the end (supposing your HW target does have primitive restarts, that is).

Indexed triangle strips are nice because they’re smaller, but finding triangle strips is a bit of a pain, and more importantly, long triangle strips actually aren’t ideal because they tend to “wander away” from the origin triangle, which means they’re not getting very good mileage out of the vertex cache (see, for example, Tom Forsyth’s old article on vertex cache optimization).

So here’s the thing: we’d like our index buffers for indexed tri lists to be smaller, but we’d also like to do our other processing (like vertex cache optimization) and then not mess with the results – not too much anyway. Can we do that?

The plan

Yes we can (I had this idea a while back, but never bothered to work out the details). The key insight is that, in a normal mesh, almost all triangles share an edge with another triangle. And if you’ve vertex-cache optimized your index buffer, two triangles that are adjacent in the index buffer are also quite likely to be adjacent in the geometric sense, i.e. share an edge.

So, here’s the idea: loop over the index buffer. For each triangle, check if it shares an edge with its immediate successor. If so, the two triangles can (in theory anyway) be described with four indices, rather than the usual six (since two vertices appear in both triangles).

But how do we encode this? We could try to steal a bit somewhere, but in fact there’s no need – we can do better than that.

Suppose you have a triangle with vertex indices (A, B, C). The choice of which vertex is first is somewhat arbitrary: the other two even cycles (B, C, A) and (C, A, B) describe the same triangle with the same winding order (and the odd cycles describe the same triangle with opposite winding order, but let’s leave that alone). We can use this choice to encode a bit of information: say if A≥B, we are indeed coding a triangle – otherwise (A<B), what follows will be two triangles sharing a common edge (namely, the edge AB). We can always pick an even permutation of triangle indices such that A≥B, since for any integer A, B, C we have

0 = (A - A) + (B - B) + (C - C) = (A - B) + (B - C) + (C - A)

Because the sum is 0, not all three terms can be negative, which in turn means that at least one of A≥B, B≥C or C≥A must be true. Furthermore, if A, B, and C are all distinct (i.e. the triangle is non-degenerate), all three terms are nonzero, and hence we must have both negative and positive terms for the sum to come out as 0.
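
In code, picking that permutation is just a cyclic rotation that stops as soon as the first index is ≥ the second; by the argument above it needs at most two steps for a non-degenerate triangle. A minimal sketch (my naming, not code from the repository):

    #include <cstdint>

    // Cyclically rotate (A,B,C) -> (B,C,A) until the first index is >= the
    // second, which marks the triangle as a lone (un-paired) one.
    void rotate_lone_tri(uint32_t t[3])
    {
        while (t[0] < t[1]) {
            uint32_t tmp = t[0];
            t[0] = t[1];
            t[1] = t[2];
            t[2] = tmp;
        }
    }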

Paired triangles

Okay, so if the triangle wasn’t paired up, we can always cyclically permute the vertices such that A≥B. What do we do when we have two triangles sharing an edge, say AB?

[Figure: two triangles sharing edge AB.]

For this configuration, we need to send the 4 indices A, B, C, D, which encode the two triangles (A, B, C) and (A, D, B).

If A<B, we can just send the 4 indices directly, leading to this very simple decoding algorithm that unpacks our mixed triangle/double-triangle indexed buffer back to a regular triangle list:

  1. Read 3 indices A, B, C.
  2. Output triangle (A, B, C).
  3. If A<B, read another index D and output triangle (A, D, B).

Okay, so this works out really nicely if A<B. But what if it’s not? Well, there are just two cases left. If A=B, the shared edge is a degenerate edge and both triangles are degenerate triangles; not exactly common, so the pragmatic solution is to say “if either triangle is degenerate, you have to send them un-paired”. That leaves the case A>B; but that means B<A, and BA is also a shared edge! In fact, we can simply rotate the diagram by 180 degrees; this swaps the roles of A and B (and of C and D) but describes the same two triangles. With the algorithm above, (B, A, D, C) will decode as the two triangles (B, A, D) and (B, C, A) – the same two triangles as before, just in a different order. So we’re good.
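
Putting all three cases together, the whole unpacking loop stays tiny. Here’s a minimal sketch of it (my own helper, not the code from the Github repo):

    #include <cstdint>
    #include <vector>

    // Expand the mixed single/paired index stream back into a plain
    // indexed triangle list. A < B on a triangle's first two indices
    // signals that a fourth index D follows, encoding triangle (A, D, B).
    std::vector<uint32_t> unpack_index_buffer(const std::vector<uint32_t>& packed)
    {
        std::vector<uint32_t> tris;
        size_t pos = 0;
        while (pos + 3 <= packed.size()) {
            uint32_t a = packed[pos], b = packed[pos + 1], c = packed[pos + 2];
            pos += 3;
            tris.insert(tris.end(), { a, b, c });      // triangle (A, B, C)
            if (a < b) {
                uint32_t d = packed[pos++];            // paired: read D
                tris.insert(tris.end(), { a, d, b });  // triangle (A, D, B)
            }
        }
        return tris;
    }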

Why this is cool

What this means is that, under fairly mild assumptions (but see “limitations” section below), we can have a representation of index buffers that mixes triangles and pairs of adjacent triangles, with no need for any special flag bits (as recommended in Christer’s article) or other hackery to distinguish the two.

In most closed meshes, every triangle has at least one adjacent neighbor (usually several); isolated triangles are very rare. We can store such meshes using 4 indices for every pair of triangles, instead of 6, for about a 33% reduction. Furthermore, most meshes in fact contain a significant number of quadrilaterals (quads), and this representation supports quads directly (stored with 4 indices). A 33% reduction for index buffers isn’t a huge deal if you have “fat” vertex formats, but for relatively small vertices (as you have in collision detection, among other things), indices can actually end up being a significant part of your overall mesh data.

Finally, this is simple enough to decode that it would probably be viable in GPU hardware. I wouldn’t hold my breath for that one, just thought I might point it out. :)

Implementation

I wrote up a quick test for this and put it on Github, as usual. This code loads a mesh, vertex-cache optimizes the index buffer (using Tom’s algorithm), then checks for each triangle whether it shares an edge with its immediate successor and if so, sends them as a pair – otherwise, send the triangle alone. That’s it. No attempt is made to be more thorough than that; I just wanted to be as faithful to the original index buffer as possible.
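
For reference, here’s one way the pairing step can look for two consecutive triangles. This is a sketch of mine rather than the code from Github, but it follows the encoding rules above: find the shared edge (it runs A→B in the first triangle and B→A in the second), then emit (A,B,C,D), or the pair with the two triangles swapped if A>B:

    #include <cstdint>
    #include <vector>

    // Try to emit triangles t1 and t2 as one 4-index pair. Returns false if
    // they don't share a (non-degenerate) edge, in which case they have to
    // be sent separately.
    bool try_emit_pair(const uint32_t t1[3], const uint32_t t2[3],
                       std::vector<uint32_t>& out)
    {
        for (int i = 0; i < 3; ++i) {
            uint32_t a = t1[i], b = t1[(i + 1) % 3], c = t1[(i + 2) % 3];
            for (int j = 0; j < 3; ++j) {
                // shared edge appears as a->b in t1 and as b->a in t2
                if (t2[j] == b && t2[(j + 1) % 3] == a) {
                    uint32_t d = t2[(j + 2) % 3];
                    if (a == b)
                        return false;                       // degenerate edge
                    if (a < b)
                        out.insert(out.end(), { a, b, c, d });
                    else
                        out.insert(out.end(), { b, a, d, c }); // swap the pair
                    return true;
                }
            }
        }
        return false;
    }

If it returns false, the triangle gets rotated as in the earlier sketch and sent on its own.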

On the “Armadillo” mesh from the Stanford 3D scanning repository, the program outputs this: (UPDATE: I added some more features to the sample program and updated the results here accordingly)

172974 verts, 1037832 inds.
before:
ACMR: 2.617 (16-entry FIFO)
62558 paired tris, 283386 single
IB inds: list=1037832, fancy=975274 (-6.03%)
after:
ACMR: 0.814 (16-entry FIFO)
292458 paired tris, 53486 single
IB inds: list=1037832, fancy=745374 (-28.18%)
745374 inds packed
1037832 inds unpacked
index buffers match.
ACMR: 0.815 (16-entry FIFO)

“Before” is the average cache miss rate (vertex cache misses/triangle) assuming a 16-entry FIFO cache for the original Armadillo mesh (not optimized). As you can see, it’s pretty bad.

I then run the simple pairing algorithm (“fancy”) on that, which (surprisingly enough) manages to reduce the index list size by about 6%.

“After” is after vertex cache optimization. Note that Tom’s algorithm is cache size agnostic; it does not assume any particular vertex cache size, and the only reason I’m dumping stats for a 16-entry FIFO is because I had to pick a number and wanted to pick a relatively conservative estimate. As expected, ACMR is much better; and the index buffer packing algorithm reduces the IB size by about 28%. Considering that the best possible case is a reduction of 33%, this is quite good. Finally, I verify that packing and unpacking the index buffer gives back the expected results (it does), and then re-compute the ACMR on the unpacked index buffer (which has vertices and triangles in a slightly different order, after all).

Long story short: it works, even the basic “only look 1 triangle ahead” algorithm gives good results on vertex cache optimized meshes, and the slight reordering performed by the algorithm does not seem to harm vertex cache hit rate much (on this test mesh anyway). Apologies for only testing on one 3D-scanned mesh, but I don’t actually have any realistic art assets lying around at home, and even if I did, loading them would’ve probably taken me more time than writing the entire rest of this program did.

UPDATE: Some more results

The original program was missing one more step that is normally done after vertex cache optimization: reordering the vertex buffer so that vertices appear in the order they’re referenced from the index buffer. This usually improves the efficiency of the pre-transform cache (as opposed to the post-transform cache that the vertex cache optimization algorithm takes care of) because it gives better locality of reference, and has the added side effect of also making the index data more compressible for general purpose lossless compression algorithms like Deflate or LZMA.
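
The reordering itself is straightforward: hand out new vertex IDs in the order vertices are first referenced from the index buffer, copy the vertex data over in that order, and rewrite the indices. A minimal sketch for position-only vertices (my own code, not the sample program’s; it assumes every vertex is referenced at least once, since unreferenced vertices simply get dropped):

    #include <cstdint>
    #include <vector>

    // Renumber vertices in order of first use and remap the index buffer.
    // "verts" holds tightly packed x/y/z floats, 3 per vertex.
    void reorder_verts_by_first_use(std::vector<uint32_t>& inds,
                                    std::vector<float>& verts)
    {
        const uint32_t kUnmapped = ~0u;
        std::vector<uint32_t> remap(verts.size() / 3, kUnmapped);
        std::vector<float> new_verts;
        new_verts.reserve(verts.size());

        uint32_t next_id = 0;
        for (uint32_t& idx : inds) {
            if (remap[idx] == kUnmapped) {
                remap[idx] = next_id++;             // first reference: new ID
                new_verts.insert(new_verts.end(),   // copy the position over
                                 verts.begin() + idx * 3,
                                 verts.begin() + idx * 3 + 3);
            }
            idx = remap[idx];                       // rewrite index in place
        }
        verts.swap(new_verts);
    }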

Anyway, here are the results for taking the entire Armadillo mesh – both the vertices, which just store the X/Y/Z position as floats, and the 32-bit indices – and processing it with some standard general-purpose compressors at various stages of the optimization process (all sizes in binary kilobytes, KiB):

Stage                  | Size           | .zip size      | .7z size
Original mesh          | 6082k          | 3312k          | 2682k
Vertex cache optimized | 6082k          | 2084k          | 1504k
Postprocessed          | 4939k (-18.8%) | 1830k (-12.2%) | 1340k (-10.9%)

So the post-process yields a good 10% reduction in the compressed size for what would be the final packaged assets here. This value is to be taken with a grain of salt: “real” art assets have other per-vertex data besides just 12 bytes for the 3D position, and nothing I described here does anything about vertex data. In other words, this comparison is on data that favors the algorithm in the sense that roughly half of the mesh file is indices, so keep that in mind. Still, 10% reduction post-LZMA is quite good for such a simple algorithm, especially compared to the effort it takes to get the same level of reduction on, say, x86 code.

Also note that the vertex cache optimization by itself massively helps the compressors here; the index list for this mesh comes from a 3D reconstruction of range-scanned data and is pathologically bad (the vertex order is really quite random), but the data you get out of a regular 3D mesh export is quite crappy too. So if you’re not doing any optimization on your mesh data yet, you should really consider doing so – it will reduce both your frame timings and your asset sizes.

Limitations

This is where the (*) from the title comes in. While I think this is fairly nice, there are two cases where you can’t use this scheme, at least not always:

  1. When the order of vertices within a triangle matters. An example would be meshes using flat attribute interpolation, where the value chosen for a primitive depends on the “provoking vertex”. And I remember some fairly old graphics hardware where the Z interpolation depended on vertex specification order, so you could get Z-fighting between the passes in multi-pass rendering if they used different subsets of triangles.
  2. When the order of triangles within a mesh matters (remember that in the two-tris case, we might end up swapping the two triangles to make the encoding work). Having the triangles in a particular order in the index buffer can be very useful with alpha blending, for example. That said, the typical case for this application is that the index buffer partitions into several chunks that should be drawn in order, but with no particular ordering requirements within each chunk, which is easy to handle – just prohibit merging tris across chunk boundaries.

That said, it seems to me that it really should be useful in every case where you’d use a vertex cache optimizer (which messes with the order anyway). So it’s probably fine.

Anyway, that’s it. No idea whether it’s useful to anyone, but I think it’s fairly cute, and it definitely seemed worth writing up.

Symmetry

As a general rule, when trying to understand or prove something, symmetry is a big help. If something has obvious symmetries, or can be brought into a symmetric form, doing so is usually worth trying. One example of this is my old post on index determination for DXT5/BC3 alpha blocks. But a while ago I ran into something that makes for a really nice example: consider the expression

|a + b| + |a - b|

If these were regular parentheses, this would simply evaluate to 2a, but we’re taking absolute values here, so it’s a bit more complicated than that. However, the expression does look pretty symmetric, so let’s see if we can do something with that. Name and conquer, so let’s define:

f(a,b) := |a + b| + |a - b|

This expression looks fairly symmetric in a and b, so what happens if we switch the two?

f(b,a) = |b + a| + |b - a| = |a + b| + |-(a - b)| = |a + b| + |a - b| = f(a,b)

So it’s indeed symmetric in the two arguments. Another candidate to check is sign flips – we’re taking absolute values here, so we might find something interesting there as well:

f(a,-b) = |a + (-b)| + |a - (-b)| = |a - b| + |a + b| = f(a,b)

And because we already know that f is symmetric in its arguments, this gives us f(-a,b) = f(a,b) for free – bonus! Putting all these together, we now know that

f(a,b) = f(b,a) = f(|a|,|b|) = | |a| + |b| | + | |a| - |b| |

which isn’t earth-shattering but not obvious either. You could prove this directly from the original expression, but I like doing it this way (defining a function f and poking at it) because it’s much easier to see what’s going on.

But we’re not quite done yet: one more thing you can do with absolute values is check when the expression inside them is non-negative, because then the absolute value does nothing and can be dropped. Now, |a| + |b| is a sum of two non-negative values and hence itself non-negative, so in fact we have

f(a,b) = |a| + |b| + | |a| - |b| |.

Now suppose that |a| ≥ |b|. In that case, we know the sign of the expression on the right and can simplify further:

f(a,b) = |a| + |b| + (|a| - |b|) = 2 |a|

and similarly, when |b| ≥ |a|, we get f(a,b) = 2 |b|. Putting the two together, we arrive at the conclusion

|a + b| + |a - b| = f(a,b) = 2 max(|a|, |b|)

which I think is fairly cute. (This expression comes up in SATD computations and can be used to save some work in the final step.)
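
If you’d rather convince yourself numerically than algebraically, a brute-force check over a small grid of integers does the trick (a quick test of mine, not part of the derivation):

    #include <algorithm>
    #include <cassert>
    #include <cstdlib>

    int main()
    {
        // Exhaustively verify |a + b| + |a - b| == 2*max(|a|, |b|) on a grid.
        for (int a = -100; a <= 100; ++a)
            for (int b = -100; b <= 100; ++b)
                assert(std::abs(a + b) + std::abs(a - b)
                       == 2 * std::max(std::abs(a), std::abs(b)));
        return 0;
    }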
