So far, that is just an observation. But does it mean anything? Well, sort of. For one thing, it means that all the network security problems that plague inter-computer networking also exist within computers themselves. Just as you can own a machine from outside using a network-based attack (ultimately delivered via Ethernet frames), you can own it using exploits delivered via USB packets, or eSATA, or Thunderbolt. This is not a particularly bad security nightmare – all these attacks require physical access, and once you have physical access to a machine you have all kinds of ways of working around security mechanisms anyway – but it does mean that people trying to sell you hardware that you aren’t allowed to run your own code on have a pretty large attack surface to worry about.

That said, as far as implications of “everything is a computer running a network stack” go, “there are going to be implementation issues” is not exactly the most exciting place to end up at. Things get substantially more interesting once we approach the same underlying truth from another angle, though. To do so, let’s pick a concrete example. Being an affluent male geek, I socialize with a bunch of other affluent male geeks, a lot of whom store enough data (or enough computers) at home to require a NAS (Network Attached Storage) device to house terabytes of (doubtlessly 100% legally acquired!) data and serve it to the multitude of desktop computers, tables, phones and media devices they keep in their home. Now, these boxes are essentially marketed as a bunch of drive enclosures with a network plug and a bunch of added value features; RAID, video streaming (and sometimes live transcoding as well), that kind of thing. Since they’re essentially storage appliances, most people would view them as somewhere between an external hard drive enclosure and a proper computer. Of course, there’s no such thing as “half a computer”, and these devices are typically just small computers with a NIC and a few hundred megabytes to a few gigabytes of RAM, usually running some variant of Linux. They *could* be used for all kinds of computing; they just happen to be specialized in a way that makes them more suitable for some tasks than others.

That’s not why I bring this up, though; I bring it up because from an end user’s point of view, a NAS is essentially just a disk drive you can plug into multiple machines simultaneously, using an Ethernet plug instead of a SATA plug. Now from the computer’s (and OSes) point of view, a local disk versus a NAS are very different things; at that level, they are speaking very different languages (namely, a block device with a local file system on it, versus a remote network filesystem such as CIFS or NFS). But to most applications running on top of the OS (and by extension most users that use them), the distinction between a local drive and a mounted network share only really becomes apparent when the abstraction breaks.

Interestingly, we see the same pattern today even when we zoom in to the local storage connected to a computer. If you were to get a 2004 SATA hard drive, it would very likely be exactly that – still a “dumb” hard drive, possibly with a bit of cache, a microcontroller to talk SATA to the host and really address the disk, plus a fair amount of (possibly programmable) DSP circuitry to actually handle the encoding of bits into and decoding of bits from magnetic flux patterns, because doing this is way more involved – and way less discrete – than you probably think. So even in 2004, we already have a *considerable* amount of essentially invisible compute power being thrown at a problem that the higher-up abstraction layers gloss over; but at least all this processing still essentially respects the underlying abstraction of a block device. For a 2014 USB thumb drive, that’s not necessarily true anymore either: they do things like wear leveling, remapping block addresses to distribute write loads evenly over the device (and hence prevent some blocks aging much faster than others). These processes often involve *parsing the file system* itself (at least if it’s FAT or a variant). Think about this for a second: a storage device that, by itself, understands the file system that is stored on it, and automatically rearranges data to maximize its expected lifetime. This level of autonomy is less than a NAS transcoding videos on the fly, but it’s still a far cry from the truly “dumb” disk and tape drives our reigning block device abstractions were designed for. And we’re not talking about some expensive high-availability low-volume server technology here; this is a storage technology that is both cheap and ubiquitous.

That brings up another point: Flash memory technology is a *vastly* different beast from magnetic recording technology. And yet, all of those differences get hidden and papered over at a real low level; we could have taught OSes what Flash devices are, how they work, and declared them as a separate device family, with separate file systems that handle the wear leveling on the software side. Yet this is not what happened; evidently, the path of least resistance was to add dedicated hardware and software to *every single Flash storage device* that makes it look like a hard drive, even though it works very differently, with very different trade-offs.

This is not an isolated incident: if you go over the list of peripherals from last time, you’ll find something similar for every single one of them. Ethernet technology has essentially been completely redesigned between 1989 and 2014, yet the only part of the Ethernet standard that’s both device-independent and directly visible to software – namely, Ethernet frames – has remained essentially unchanged (except for the addition of jumbo frames, which are still not widely used). SAS and SATA may use a very different encoding and topology than their parallel counterparts, but from a software perspective the changes are evolutionary not revolutionary. Regular PCI and PCIe are essentially the same from the perspective of the OS. And so forth. Probably the most impressive example of all is the x86 architecture; a modern x86 processor can (and will) go to substantial lengths to pretend that it’s really just a fancy 8086, a 286, a 386, or a slightly extended AMD Athlon 64, depending on the operating mode; the architecture and implementation of current x86 processors is nothing like the original processors it’s emulating, but while the implementations may change substantially over time, the interfaces – protocols, to stay within our networking terminology – stay mostly constant over large time scales, warts and all.

Note how much this runs counter to the common attitude that software is easy to change, and hardware difficult. This might be true for individuals; but across computing as a whole, it’s missing the point entirely. The more relevant distinction is that implementations are easy to change, and protocols difficult. And once a protocol (or API in software terms) is firmly established and widely supported, it’s damn near impossible to get rid of. If that means that USB thumb drives have to pretend they are hard disks, and need the capability to understand the file system that’s stored on them to do a good job; if that means that Ethernet devices have to pretend that it’s 1989 and all nodes on the network are connected to one wire and identified by their MAC address, rather than connected by point-to-point links via network switches that understand IP just fine; if that means that “x86″ processors have to pretend they’re not actually out-of-order cores with an internal RISC-ish instruction set, hundreds of registers and sophisticated “decoding” (really, translation) engines; then so be it. That is the price we pay to keep things more or less in working order – except for that occasional instance when all our sophisticated abstractions break, the underlying reality shines through, and that mounted network share is revealed to not actually be on the local machine at all, by producing errors that no app knows how to handle.

Just keep hitting refresh. It’ll be back in a minute.

]]>

Okay, so much for the history lesson. Now let’s fast-forward 25 years. How does the situation look today?

We are still using “Ethernet”, but 2014 Ethernet has very little in common with its 1989 counterpart other than the name. It still uses serial links, but as of 10BASE-T it uses a star topology (using switches and hubs) rather than a bus, and with the introduction of Gigabit Ethernet, the whole “collision detection” business fell by the wayside too; Gigabit Ethernet is a bunch of point-to-point links connected by switches (active devices, instead of hubs which are mostly passive), and hence there’s no physically shared medium anymore; there’s explicit switching. So: serial point-to-point links, packet based, switching. The same tech (albeit at different scaling levels) is used in your home, at your workplace, at your ISP, and in major internet exchanges/peering spots.

Extension cards used to be on the ISA bus; in 2014, it’s all PCI Express, and while it’s still called a “bus”, it hasn’t been one ever since the “Express” part got added. PCIe uses serial not parallel links, the connectivity is point-to-point instead of a shared medium, with arbitration by the “root complex” (which logically sits at the center of the bus, making this a star instead of a bus topology), oh and communication is packet-based. So PCIe uses serial point-to-point links is packet-based, and uses switching.

On the storage front, PATA was succeeded by SATA. SATA is, as the name says, serial instead of parallel. The cables, which used to form a bus (with each cable typically allowing up to two devices to be connected to the controller) now are used for single point-to-point links between devices and the controller. Oh, and there is a more explicit framing protocol based around packets. In other words, SATA has a star topology of serial point-to-point links and is packet-based.

Of course, SCSI didn’t disappear off the face of the earth either; it turned into Serial Attached SCSI (SAS). SAS keeps the SCSI command set, but as the name suggests, links are now serial. Oh, and – interrupt me if this sounds familiar – they’re now point-to-point, with everything connected either directly to the “initiator”, or indirectly through one or more “expanders” (switches), meaning SAS has a tiered-star topology. I’m not sure whether or not SAS is internally packet-based though, but I have my suspicions :) [can anyone confirm this?]. Oh, and SAS can also tunnel SATA streams, because why not.

Keyboard, mice, and printers are now all on USB. USB stands for “Universal Serial Bus”; the first two words of that are accurate (it is pretty much universally used at this point, and it is serial) but the latter is a blatant lie. USB may pretend to be a bus, but it really is a packet-based network with a tiered-star topology. Serial point-to-point links, possibly with hubs in between, and all packet-based. And if you happen to attach a USB storage device such as a thumb drive or external hard disk, well, USB storage really just wraps SCSI commands in USB packets – and these SCSI commands may in turn wrap ATA commands.

And what about displays? Well, if you use DVI or HDMI, you still have a dedicated point-to-point link, much as in the VGA days, although it’s now all digital. But if you happen to be using the newer DisplayPort, well, that’s a – can you guess? – serial, packet-based point-to-point link protocol. Which can also carry USB across the link, by the way. Oh, and there’s also Thunderbolt, which can multiplex both DisplayPort and PCIe packets over its serial point-to-point links. Among other things, this has fun consequences such as allowing your *monitor* or TV (or anything else plugged into your display connector) full freaking bus master DMA access – a killer feature if I’ve ever seen one! – and, just as fun, finally gives you the opportunity to have a storage device that talks ATA through SCSI by way of USB over DisplayPort inside Thunderbolt, *because we can*.

Okay, so pretty much every single pluggable link interface that used to be in 1989 PCs has been replaced by what are, effectively, all very similar spins on the same underlying network technology (and they really are all networks, albeit some with funky protocols). Is that the end of it?

Well, no, not quite. Because we now have multiple CPUs, too. And guess how they talk to each other? Well, in the case of PCs, there’s two primary options: AMD’s HyperTransport and Intel’s QuickPath Interconnect (QPI). Needless to say, both of these are packet-based, serial, point-to-point links, typically used to communicate between CPU cores (or CPU sockets anyway).

So what’s my point here? Well, logically, these CPUs appear to share the same memory, but physically it’s more common at this point to have each CPU (or at least each socket) own a certain fraction of physical memory that it’s connected to directly; the rest, it can only access by asking the other socket (=NUMA). And then your hard drive (or SSD), which speaks SCSI or ATA or both, needs a controller to understand these protocols – often programmable. And, in the case of hard drivers, usually also a DSP to decode/encode the stuff that’s on disk. And your GPU also has its own memory, and a wide array of vector processors that is (at this point) easily Turing-complete. And usually several other fairly general-purpose blocks you don’t even know about. In fact, for each general-purpose processor (with its own memory) in your system that you know about, there are probably 2 or 3 that you don’t. There are tiny ARM cores absolutely everywhere, sometimes where you’d least expect them, and Intel likes to use embedded 486s for the same kind of thing (they’re so small by now, you literally need a microscope to even see them).

We don’t really tend to think of them as such, but not only are all kinds of peripherals on network links these days, they also have general-purpose CPUs, often with substantially more processing power (and memory!) than the 286 @ 10MHz I learned programming on in 1990. And this is all still just talking about PCs – I haven’t even started on things like cars or planes.

So what’s my point? Well, just to say that what we think of as “computers” today are, in practice, already fairly large, heterogeneous clusters of different specialized smaller computers, and while only a small part of these computers is visible to the user, there’s a bunch of them lurking just below the surface. This goes not just for PCs, but also phones and tablets. And of course, absolutely everything you plug into a computer these days is – by itself – another computer. Even stuff you think of as “passive” components, like batteries and thumb drives.

And they’re all talking to each other over networks, most of which are starting to look pretty damn similar at this point – and with greatly varying degrees of implementation quality, eliciting both awe that anything works at all and dread at how easy it is all to exploit. So let me conclude by formulating an analogue to Zawinski’s Law of Software Envelopment:

Every data link standard converges to serial point-to-point links connected in a tiered-star topology and transporting packets. Those link standards which cannot so converge are replaced by ones which can.

]]>

Last time, we defined the rANS encoding and decoding functions, assuming a finite alphabet of n symbols numbered 0 to n-1.

where .

where is the frequency of symbol s, is the sum of the frequencies of all symbols before s, and is the sum of all symbol frequencies. Then a given symbol s has (assumed) probability .

Furthermore, as noted in the previous post, M can’t be chosen arbitrarily; it must divide L (the lower bound of our normalized interval) for the encoding/decoding algorithms we saw to work.

Given these constraints and the form of C and D, it’s very convenient to have M be a power of 2; this replaces the divisions and modulo operations in the decoder with bit masking and bit shifts. We also choose L as a power of 2 (which needs to be at least as large as M, since otherwise M can’t divide L).

This means that, starting from a reference probability distribution, we need to approximate the probabilities as fractions with common denominator M. My colleague Charles Bloom just wrote a blog post on that very topic, so I’m going to refer you there for details on how to do this optimally.

Making M a power of two removes the division/modulo operations in the decoder, but the encoder still has to perform them. However, note that we’re only ever dividing by the symbol frequencies , which are known at the start of the encoding operation (in our “static probability distribution” setting). The question is, does that help?

You bet it does. A little known fact (amongst most programmers who aren’t compiler writers or bit hacking aficionados anyway) is that division of a -bit unsigned integer by a constant can always be performed as fixed-point multiplication with a reciprocal, using bits (or less) of intermediate precision. This is *exact* – no round-off error involved. Compilers like to use this technique on integer divisions by constants, since multiplication (even long multiplication) is typically much faster than division.

There are several papers on how to choose the “magic constants” (with proofs); however, most of them are designed to be used in the code generator of a compiler. As such, they generally have several possible code sequences for division by constants, and try to use the cheapest one that works for the given divisor. This makes sense in a compiler, but not in our case, where the exact frequencies are not known at compile time and doing run-time branching between different possible instruction sequences would cost more than it saves. Therefore, I would suggest sticking with Alverson’s original paper “Integer division using reciprocals”.

The example code I linked to implements this approach, replacing the division/modulo pair with a pair of integer multiplications; when using this approach, it makes sense to limit the state variable to 31 bits (or 63 bits on 64-bit architectures): as said before, the reciprocal method requires bits working precision for worst-case divisors, and reducing the range by 1 bit enables a faster (and simpler) implementation than would be required for a full-range variant (especially in C/C++ code, where multi-precision arithmetic is not easy to express). Note that handling the case requires some extra work; details are explained in the code.

There’s one important step in the decoder that I haven’t talked about yet: mapping from the “slot index” to the corresponding symbol index. In normal rANS, each symbol covers a contiguous range of the “slot index” space (by contrast to say tANS, where the slots for any given symbol are spread relatively uniformly across the slot index space). That means that, if all else fails, we can figure out the symbol ID using a binary search in steps (remember that n is the size of our alphabet) from the cumulative frequency table (the , which take O(n) space) – independent of the size of M. That’s comforting to know, but doing a binary search per symbol is, in practice, quite expensive compared to the rest of the decoding work we do.

At the other extreme, we can just prepare a look-up table mapping from the cumulative frequency to the corresponding symbol ID. This is very simple (and the technique used in the example code) and theoretically constant-time per symbol, but it requires a table with M entries – and if the table ends up being too large to fit comfortably in a core’s L1 cache, real-world performance (although still technically bounded by a constant per symbol) can get quite bad. Moving in the other direction, if M is small enough, it can make sense to store the per-symbol information in the M-indexed table too, and avoid the extra indirection; I would not recommend this far beyond M=2^{12} though.

Anyway, that gives us two design points: we can use space, at a cost of per symbol lookup; or we can use space, with symbol lookup cost. Now what we’d really like is to get symbol lookup in space, but sadly that option’s not on the menu.

Or is it?

To make a long story short, I’m not aware of any way to meet our performance goals with the original unmodified rANS algorithm; however, we can do much better if we’re willing to relax our requirements a bit. Notably, there’s no deep reason for us to require that the slots assigned to a given symbol s be contiguous; we already know that e.g. tANS works in a much more relaxed setting. So let’s assume, for the time being, that we can rearrange our slot to symbol mapping arbitrarily (we’ll have to check if this is actually true later, and also work through what it means for our encoder). What does that buy us?

It buys us all we need to meet our performance goals, it turns out (props to my colleague Sean Barrett, who was the first one to figure this out, in our internal email exchanges anyway). As the section title says, the key turns out to be a stochastic sampling technique called the “alias method”. I’m not gonna explain the details here and instead refer you to this short introduction (written by a computational evolutionary geneticist, on randomly picking base pairs) and “Darts, Dice and Coins”, a much longer article that covers multiple ways to sample from a nonuniform distribution (by the way, note that the warnings about numerical instability that often accompany descriptions of the alias method need not worry us; we’re dealing with integer frequencies here so there’s no round-off error).

At this point, you might be wondering what the alias method, a technique for sampling from a non-uniform discrete probability distribution, has anything to do with entropy (de)coding. The answer is that the symbol look-up problem is essentially the same thing: we have a “random” value from the interval [0,M-1], and a matching non-uniform probability distribution (our symbol frequencies). Drawing symbols according to that distribution defines a map from [0,M-1] to our symbol alphabet, which is precisely what we need for our decoding function.

So what does the alias method do? Well, if you followed the link to the article I mentioned earlier, you get the general picture: it partitions the probabilities for our n-symbol alphabet into n “buckets”, such that each bucket i references at most 2 symbols (one of which is symbol i), and the probabilities within each bucket sum to the same value (namely, 1/n). This is *always* possible, and there is an algorithm (due to Vose) which determines such a partition in O(n) time. More generally, we can do so for any N≥n, by just adding some dummy symbols with frequency 0 at the end. In practice, it’s convenient to have N be a power of two, so for arbitrary n we would just pick N to be the smallest power of 2 that is ≥n.

Translating the sampling terminology to our rANS setting: we can subdivide our interval [0,M-1] into N sub-intervals (“buckets”) of equal size k=M/N, such that each bucket i references at most 2 distinct symbols, one of which is symbol i. We also need to know what the other symbol referenced in this bucket is – `alias[i]`

, the “alias” that gives the methods its name – and the position `divider[i]`

of the “dividing line” between the two symbols.

With these two tables, determining the symbol ID from x is quick and easy:

uint xM = x % M; // bit masking (power-of-2 M) uint bucket_id = xM / K; // shift (power-of-2 M/N!) uint symbol = bucket_id; if (xM >= divider[bucket_id]) // primary symbol or alias? symbol = alias[bucket_id];

This is O(1) time and O(N) = O(n) space (for the “divider” and “alias” arrays), as promised. However, this isn’t quite enough for rANS: remember that for our decoding function D, we need to know not just the symbol ID, but also which of the (potentially many) slots assigned to that symbol we ended up in; with regular rANS, this was simple since all slots assigned to a symbol are sequential, starting at slot :

where .

Here, the part is the number we need. Now with the alias method, the slot IDs assigned to a symbol aren’t necessarily contiguous anymore. However, within each bucket, the slot IDs assigned to a symbol are sequential – which means that instead of the cumulative frequencies , we now have two separate per bucket. This allows us to define the complete “alias rANS” decoding function:

// s, x = D(x) with "alias rANS" uint xM = x % M; uint bucket_id = xM / K; uint symbol, bias; if (xM < divider[bucket_id]) { // primary symbol or alias? symbol = bucket_id; bias = primary_start[bucket_id]; } else { symbol = alias[bucket_id]; bias = alias_start[bucket_id]; } x = (x / M) * freq[symbol] + xM - bias;

And although this code is written with branches for clarity, it is in fact fairly easy to do branch-free. We gained another two tables indexed with the bucket ID; generating them is another straightforward linear pass over the buckets: we just need to keep track of how many slots we’ve assigned to each symbol so far. And that’s it – this is all we need for a complete “alias rANS” decoder.

However, there’s one more minor tweak we can make: note that the only part of the computation that actually depends on `symbol`

is the evaluation of `freq[symbol]`

; if we store the frequencies for both symbols in each alias table bucket, we can get rid of the dependent look-ups. This can be a performance win in practice; on the other hand, it does waste a bit of extra memory on the alias table, so you might want to skip on it.

Either way, this alias method allow us to perform quite fast (though not as fast as a fully-unrolled table for small M) symbol look-ups, for large M, with memory overhead (and preparation time) proportional to n. That’s quite cool, and might be particularly interesting in cases where you either have relatively small alphabets (say on the order of 15-25 symbols), need lots of different tables, or frequently switch between tables.

However, we haven’t covered encoding yet. With regular rANS, encoding is easy, since – again – the slot ranges for each symbol are contiguous; the encoder just does

where is the slot id corresponding to the ‘th appearance of symbol s.

With alias rANS, each symbol may have its slots distributed across multiple, disjoint intervals – up to N of them. And the encoder now needs to map to a corresponding slot index that will decode correctly. One way to do this is to just keep track of the mapping as we build the alias table; this takes O(M) space and is O(1) cost per symbol. Another is to keep a sorted list of subintervals (and their cumulative sizes) assigned to each symbol; this takes only O(N) space, but adds a $O(\log_2 N)$ (worst-case) lookup per symbol in the encoder. Sound familiar?

In short, using the alias method doesn’t *really* solve the symbol lookup problem for large M; or, more precisely, it solves the lookup problem on the decoder side, but at the cost of adding an equivalent problem on the encoder side. What this means is that we have to pick our poison: faster encoding (at the some extra cost in the decoder), or faster decoding (at some extra cost in the encoder). This is fine, though; it means we get to make a trade-off, depending on which of the two steps is more important to us overall. And as long as we are in a compress-once decompress-often scenario (which is fairly typical), making the decoder faster at some reasonable extra cost in the encoder is definitely useful.

We can exploit static, known probabilities in several ways for rANS and related coders: for encoding, we can precompute the right “magic values” to avoid divisions in the hot encoding loop; and if we want to support large M, the alias method enables fast decoding without generating a giant table with M entries – with an O(n) preprocessing step (where n is the number of symbols), we can still support O(1) symbol decoding, albeit with a (slightly) higher constant factor.

I’m aware that this post is somewhat hand-wavey; the problem is that while Vose’s algorithm and the associated post-processing are actually quite straightforward, there’s a lot of index manipulation, and I find the relevant steps to be quite hard to express in prose without the “explanation” ending up harder to read than the actual code. So instead, my intent was to convey the “big picture”; a sample implementation of alias table rANS, with all the details, can be found – as usual – on Github.

And that’s it for today – take care!

]]>

The paper covers numerous algorithms. Of those given, I think what Jarek calls “rANS” and “tANS” are the most interesting ones; there’s plenty of binary arithmetic coders already and “uABS”, “rABS” or “tABS” do not, at first sight, offer any compelling reasons to switch (that said, there’s some non-obvious advantages that I’ll get to in a later post).

Charles has already posted a good introduction; I recommend you start there. Charles has also spent a lot of time dealing with the table-based coders, and the table-based construction allows some extra degrees of freedom that make it hard to see what’s actually going on. In this post, I’m going to mostly talk about rANS, the range coder equivalent. Charles already describes it up to the point where you try to make it “streaming” (i.e. finite precision); I’m going to continue from there.

In Charles’ notation (which I’m going to use because the paper uses both indexed upper-case I’s and indexed lower-case l’s, which would make reading this with the default font a disaster), we have two functions now: The coding function C and the decoding function D defined as

where .

D determines the encoded symbol s by looking at (x mod M) and looking up the corresponding value in our cumulative frequency table (s is the unique s such that ).

Okay. If you're working with infinite-precision integers, that's all you need, but that's not exactly suitable for fast (de)compression. We want a way to perform this using finite-precision integers, which means we want to limit the size of x somehow. So what we do is define a "normalized" interval

The [L:bL) thing on the right is the notation I'll be using for a half-open interval of integers). b is the radix of our encoder; b=2 means we're emitting bits at a time, b=256 means we're emitting bytes, and so forth. Okay. We now want to keep x in this range. Too large x don't fit in finite-precision integers, but why not allow small x? The (hand-wavey) intuition is that too small x don't contain enough information; as Charles mentioned, x essentially contains about bits of information. If x < 4, it can only be one of four distinct values, and as you saw above the value of x (mod M) directly determines which symbol we're coding. So we want x just right: not too large, not too small. Okay. Now let's look at how a corresponding decoder would look:

while (!done) { // Loop invariant: x is normalized. assert(L <= x && x < b*L); // Decode a symbol s, x = D(x); // Renormalization: While x is too small, // keep reading more bits (nibbles, bytes, ...) while (x < L) x = x*b + readFromStream(); }

Turns out that's actually our decoding algorithm, period. What we need to do now is figure out how the corresponding encoder looks. As long as we're only using C and D with big integers, it's simple; the two are just inverses of each other. Likewise, we want our encoder to be the inverse of our decoder - exactly. That means we have to do the inverse of the operations the decoder does, in the opposite order. Which means our encoder has to look something like this:

while (!done) { // Inverse renormalization: emit bits/bytes etc. while (???) { writeToStream(x % b); x /= b; } // Encode a symbol x = C(s, x); // Loop invariant: x is normalized assert(L <= x && x < b*L); }

So far, this is purely mechanical. The only question is what happens in the "???" - when exactly *do* we emit bits? Well, for the encoder and decoder to be inverses of each other, the answer has to be "exactly when the decoder would read them". Well, the decoder reads bits whenever the normalization variant is violated after decoding a symbol, to make sure it holds for the next iteration of the loop. The encoder, again, needs to do the opposite - we need to proactively emit bits *before* coding s to make sure that, after we've applied C, x will be normalized.

In fact, that's all we need for a first sketch of renormalization:

while (!done) { // Keep trying until we succeed for (;;) { x_try = C(s, x); if (L <= x_try && x_try < b*L) { // ok? x = x_try; break; } else { // Shrink x a bit and try again writeToStream(x % b); x /= b; } } x = x_try; }

Does this work? Well, it might. It depends. We certainly can't guarantee it from where we're standing, though. And even if it does, it's kind of ugly. Can't we do better? What about the normalization - I've just written down the normalization loops, but just because both decoder and encoder maintain the same invariants doesn't necessarily mean they are in sync. What if at some point of the loop, there are more than two possible normalized configurations - can this happen? Plus there's some hidden assumptions in here: the encoder, by only ever shrinking x before C, assumes that C always causes x to increase (or at least never causes x to decrease); similarly, the decoder assumes that applying D won't increase x.

And I'm afraid this is where the proofs begin.

Let's start with the last question first: does C always increase x? It certainly looks like it might, but there's floors involved - what if there's some corner case lurking? Time to check:

since , , and are all non-negative (the latter because - each symbol's frequency is less than the sum of all symbol frequencies). So C indeed always increases x, or more precisely, never decreases it.

Next up: normalization. Let's tackle the decoder first. First off, is normalization in the decoder unique? That is, if , is that the only normalized x it could be at that stage in the decoding process? Yes, it is: , so , so

where arbitrary

That is, no matter what bit / byte / whatever the decoder would read next (that's 'd'), running through the normalization loop once more would cause x to become too large; there's only one way for x to be normalized. But do we always end up with a normalized x? Again, yes. Suppose that , then (because we're working in integers) , and hence

(again arbitrary)

The same kind of argument works for the encoder, which floor-divides by b instead of multiplying b it. The key here is that our normalization interval I has a property the ANS paper calls "b-uniqueness": it's of the form for some positive integer k (its upper bound is b times its lower bound). Any process that grows (or shrinks) x in multiples of b can't "skip over" I (as in, transition from being larger than the largest value in I to smaller than the smallest value in I or vice versa) in a single step. Simultaneously, there's also never two valid states the encoder/decoder could be in (which could make them go out of sync: both encoder and decoder think they're normalized, but they're at different state values).

To elaborate, suppose we have b=2 and some interval where the ratio between lower and upper bound is a tiny bit less than 2: say There's just one value missing. Now suppose the decoder is in state and reads a bit, which turns out to be 1. Then the new state is - we overshot! We were below the lower bound of I', and yet with a single bit read, we're now past its upper bound. I' is "too small".

Now let's try the other direction; again b=2, and this time we make the ratio between upper and lower bound a bit too high: set There's no risk of not reaching a state in that interval now, but now there is ambiguity. Where's the problem? Suppose the encoder is in state . Coding any symbol will require renormalization to "shift out" one bit. The encoder writes that bit (a zero), goes to state , and moves on. But there's a problem: the decoder, after decoding that symbol, will be in state too. And it doesn't know that the encoder got there from state by shifting a bit; all the decoder knows is that it's in state , which is normalized and doesn't require reading any more bits. So the encoder has written a bit that the decoder doesn't read, and now the two go out of sync.

Long story short: if the state intervals involved aren't b-unique, bad things happen. And on the subject of bad things happening, our encoder tries to find an x such that C(s,x) is inside I by shrinking x - but what if that process doesn't find a suitable x? We need to know a bit more about which values of x lead to C(s,x) being inside I, which leads us to the sets

i.e. the set of all x such that encoding 's' puts us into I again. If all these sets turn out to be b-unique intervals, we're good. If not, we're in trouble. Time for a brief example.

Let's pick L=5, b=2 and M=3. We have just two symbols: 'a' has probability , and 'b' has probability , which we turn into , , , . Our normalization interval is . By computing C(s,x) for the relevant values of s and x, we find out that

Uh-oh. Neither of these two intervals is b-unique. is too small, and is too big. So what goes wrong?

Well, suppose that we're in state x=7 and want to encode an 'a'. 7 is not in (too large). So the encoder emits the LSB of x, divides by 2 and... now x=3. Well, that's not in either (too small), and shrinking it even further won't help that. So at this point, the encoder is stuck; there's no x it can reach that works.

So we just saw that in certain scenarios, rANS can just get stuck. Is there anything we can do to avoid it? Yes: the paper points out that the embarrassing situation we just ran into can't happen when M (the sum of all symbol frequencies, the denominator in our probability distribution) divides L, our normalization interval lower bound. That is, for some positive integer k. It doesn't give details, though; so, knowing this, can we prove anything about the that would help us? Well, let's just look at the elements of and see what we can do:

let's work on that condition:

at this point, we can use that and divide by M:

Now, for arbitrary real numbers r and natural numbers n we have that

Using this, we get:

note the term in the outer floor bracket is the sum of an integer and a real value inside , since , so we can simplify drastically

where we applied the floor identities above again and then just multiplied by . Note that the result is an interval of integers with its (exclusive) upper bound being equal to b times its (inclusive) lower bound, just like we need - in other words, assuming that , all the are b-unique and we're golden (this is mentioned in the paper in section 3.3, but not proven, at least not in the Jan 6 2014 version).

Note that this also gives us a much nicer expression to check for our encoder. In fact, we only need the upper bound (due to b-uniqueness, we know there's no risk of us falling through the lower bound), and we end up with the encoding function

while (!done) { // Loop invariant: x is normalized assert(L <= x && x < b*L); // Renormalize x_max = (b * (L / M)) * freq[s]; // all but freq[s] constant while (x >= x_max) { writeToStream(x % b); x /= b; } // Encode a symbol // x = C(s, x); x = freq[s] * (x / M) + (x % M) + base[s]; }

No "???"s left - we have a "streaming" (finite-precision) version of rANS, which is almost like the arithmetic coders you know and love (and in fact quite closely related) except for the bit where you need to encode your data in reverse (and reverse the resulting byte stream).

I put an actual implementation on Github for the curious.

This is an arithmetic coder, just a weird one. The reverse encoding seems like a total pain at first, and it kind of is, but it comes with a bunch of really non-obvious but amazing advantages that I'll cover in a later post (or just read the comments in the code). The fact that M (the sum of all frequencies) has to be a multiple of L is a serious limitation, but I don't (yet?) see any way to work around that while preserving b-uniqueness. So the compromise is to pick M and L to be both powers of 2. This makes the decoder's division/mod with M fast. The power-of-2 limitation makes rANS really bad for adaptive coding (where you're constantly updating your stats, and resampling to a power-of-2-sized distribution is expensive), but hey, so is Huffman. As a Huffman replacement, it's really quite good.

In particular, it supports a divide-free decoder (and actually no per-symbol division in the encoder either, if you have a static table; see my code on Github, `RansEncPutSymbol`

in particular). This is something you can't (easily) do with existing multi-symbol arithmetic coders, and is a really cool property to have, because it really puts it a lot closer to being a viable Huffman replacement in a lot of places that do care about the speed.

If you look at the decoder, you'll notice that its entire behavior for a decoding step only depends on the value of `x`

at the beginning: figure out the symbol from the low-order bits of x, go to a new state, read some bits until we're normalized again. This is where the table-based versions (tANS etc.) come into play: you can just tabulate their behavior! To make this work, you want to keep b and L relatively small. Then you just make a table of what happens in every possible state.

Interestingly, because these tables really do tabulate the behavior of a "proper" arithmetic coder, they're *compatible*: if you have two table-baked distributions with use that same values of b and L (i.e. the same interval I), you can switch between them freely; the states mean the same in both of them. It's not at all obvious that it's even possible for a table-based encoder to have this property, so it's even cooler that it comes with no onerous requirements on the distribution!

That said, as interesting as the table-based schemes are, I think the non-table-based variant (rANS) is actually more widely useful. Having small tables severely limits your probability resolution (and range precision), and big tables are somewhat dubious: adds, integer multiplies and bit operations are fine. We can do these quickly. More compute power is a safe thing to bet on right now (memory access is not). (I do have some data points on what you can do on current HW, but I'll get there in a later post.)

As said, rANS has a bunch of really cool, unusual properties, some of which I'd never have expected to see in any practical entropy coder, with cool consequences. I'll put that in a separate post, though - this one is long (and technical) enough already. Until then!

]]>

Won Chun (who wrote a book chapter for OpenGL insights on the topic, “WebGL Models: End-to-End”) writes: (this is pasted together from multiple mails; my apologies)

A vertex-cache optimized list is still way more compressible than random: clearly there is lots of coherence to exploit. In fact, that’s pretty much with the edge-sharing-quad business is (OK, some tools might do this because they are working naturally in quads, but you get lots of these cases with any vertex cache optimizer).

So you can also get a pretty big win by exploiting pre-transform vertex cache optimization. I call it “high water mark encoding.” Basically: for such a properly optimized index list, the next index you see is either (a) one you’ve seen before or (b) one higher than the current highest seen index. So, instead of encoding actual indices, you can instead encode them relative to this high water mark (the largest index yet to be seen, initialized to 0). You see “n” and that corresponds to an index of (high water mark – n). When you see 0, you also increment high watermark.

The benefit here is that the encoded indices are very small, and then you can do some kind of varint coding, then your encoded indices are a bit more than a byte on average. If you plan on zipping later, then make sure the varints are byte-aligned and LSB-first.

There are lots of variants on this, like:

- keep a small LRU (~32 elements, or whatever transform cache size you’ve optimized for) of indices, and preferentially reference indices based on recent use (e.g. values 1-32), rather than actual value (deltas are offset by 32),
- make the LRU reference edges rather than verts, so a LRU “hit” gives you two indices instead of one,
- do two-level high water mark encoding, which makes it efficient to store in multi-indexed form (like .OBJ, which has separate indices per attribute) and decode into normal single-indexed form
And also, these approaches let you be smart about attribute compression as well, since it gives you useful hints; e.g. any edge match lets you do parallelogram prediction, multi-indexing can implicitly tell you how to predict normals from positions.

I really like this idea. Previously I’d been using simple delta encoding on the resulting index lists; that works, but the problem with delta coding is that a single outlier will produce two large steps – one to go from the current region to the outlier, then another one to get back. The high watermark scheme is almost as straightforward as straight delta coding and avoids this case completely.

Now, if you have an index list straight out of vertex cache optimization and vertex renumbering, the idea works as described. However, with the hybrid tri/paired-tri encoding I described last time, we have to be a bit more careful. While the original index list will indeed have each index be at most 1 larger than the highest index we’ve seen so far, our use of “A ≥ B” to encode whether the next set of indices describes a single triangle or a pair means that we might end up having to start from the second or third vertex of a triangle, and consequently see a larger jump than just 1. Luckily, the fix for this is simple – rather than keeping the high watermark always 1 higher than the largest vertex index we’ve seen so far, we keep it N higher where N is the largest possible “step” we can have in the index list. With that, the transform is really easy, so I’m just going to post my code in full:

static void watermark_transform(std::vector<int>& out_inds, const std::vector<int>& in_inds, int max_step) { int hi = max_step - 1; // high watermark out_inds.clear(); out_inds.reserve(in_inds.size()); for (int v : in_inds) { assert(v <= hi); out_inds.push_back(hi - v); hi = std::max(hi, v + max_step); } }

and the inverse is exactly the same, with the `push_back`

in the middle replaced by the two lines

v = hi - v; out_inds.push_back(v);

So what’s the value of N (aka `max_step`

in the code), the largest step that a new index can be from the highest index we’ve seen so far? Well, for the encoding described last time, it turns out to be 3:

- When encoding a single triangle, the worst case is a triangle with all-new verts. Suppose the highest index we’ve seen so far is k, and the next triangle has indices (k+1,k+2,k+3). Our encoding for single triangles requires that the first index be larger than the second one, so we would send this triangle as (k+3,k+1,k+2). That’s a step of 3.
- For a pair of triangles, we get 4 new indices. So it might seem like we might get a worst-case step of 4. However, we know that the two triangles share an edge; and for that to be the case, the shared edge must have been present in the first triangle. Furthermore, we require that the smaller of the two indices be sent first (that’s what flags this as a paired tri). So the worst cases we can have for the first two indices are (k+2,k+3) and (k+1,k+3), both of which have a largest step size of 2. After the first two indices, we only have another two indices to send; worst-case, they are both new, and the third index is larger than the fourth. This corresponds to a step size of 2. All other configurations have step sizes ≤1.

And again, that’s it. Still very simple. So does it help?

Let’s dig out the table again (new entries bold, all percentages relative to the “Vertex cache optimized” row):

Stage | Size | .zip size | .7z size |
---|---|---|---|

Original mesh | 6082k | 3312k | 2682k |

Vertex cache optimized | 6082k | 2084k | 1504k |

Vcache opt, watermark |
6082k | 1808k (-13.2%) | 1388k (-7.7%) |

Postprocessed | 4939k (-18.8%) | 1830k (-12.2%) | 1340k (-10.9%) |

Postproc, watermark |
4939k (-18.8%) | 1563k (-25.0%) | 1198k (-20.3%) |

In short: the two techniques work together perfectly and very nicely complement each other. This is without any varint encoding by the way, still sending raw 32-bit indices. Variable-sized integer encoding would probably help, but I haven’t checked how much. The code on Github has been updated in case you want to play around with it.

I think this is at a fairly sweet spot in terms of compression ratio vs. decoder complexity. Won originally used this for WebGL, using UTF-8 as his variable-integer encoding: turn the relative indices into Unicode code points, encode the result as UTF-8. This is a bit hacky and limits you to indices that use slightly more than 20 bits (still more than a million, so probably fine for most meshes you’d use in WebGL), but it does mean that the Browser can handle the variable-sized decoding for you (instead of having to deal with byte packing in JS code).

Overall, this approach (postprocess + high watermark) gives a decoder that is maybe 30 lines or so of C++ code, and which does one linear pass over the data (the example program does two passes to keep things clearer, but they can be combined without problems) with no complicated logic whatsoever. It’s simple to get right, easy to drop in, and the results are quite good.

It is not, however, state of the art for mesh compression by any stretch of the imagination. This is a small domain-specific transform that can be applied to fully baked and optimized game assets to make them a bit smaller on disk. I also did not cover vertex data; this is not because vertex data is unimportant, but simply because, so far at least, I’ve not seen *any* mesh compressors that do something besides the obvious (that is, quantization and parallelogram prediction) for vertex data.

Finally, if what you actually want is state-of-the art triangle mesh compression, you should look elsewhere. This is outside the scope of this post, but good papers to start with are “Triangle Mesh Compression” by Touma and Gotsman (Proceedings of Graphics Interface, Vancouver, June 1998) and “TFAN: A low complexity 3D mesh compression algorithm” by Khaled, Zaharia, and Prêteux (Computer Animation and Virtual Worlds 20.2‐3 (2009)). I’ve played around with the former (the character mesh in Candytron was encoded using an algorithm descended from it) but not the latter; however, Touma-Gotsman and descendants are limited to 2-manifold meshes, whereas TFAN promises better compression and handles arbitrarily topologies, so it looks good on paper at least.

Anyway; that’s it for today. Thanks to Won for his mail! And if you should use this somewhere or figure out a way to get more mileage out of it without making it much more complicated, I’d absolute love to know!

]]>

Indexed triangle strips are nice because they’re smaller, but finding triangle strips is a bit of a pain, and more importantly, long triangle strips actually aren’t ideal because they tend to “wander away” from the origin triangle, which means they’re not getting very good mileage out of the vertex cache (see, for example, Tom Forsyth’s old article on vertex cache optimization).

So here’s the thing: we’d like our index buffers for indexed tri lists to be smaller, but we’d also like to do our other processing (like vertex cache optimization) and then not mess with the results – not too much anyway. Can we do that?

Yes we can (I had this idea a while back, but never bothered to work out the details). The key insight is that, in a normal mesh, almost all triangles share an edge with another triangle. And if you’ve vertex-cache optimized your index buffer, two triangles that are adjacent in the index buffer are also quite likely to be adjacent in the geometric sense, i.e. share an edge.

So, here’s the idea: loop over the index buffer. For each triangle, check if it shares an edge with its immediate successor. If so, the two triangles can (in theory anyway) be described with four indices, rather than the usual six (since two vertices appear in both triangles).

But how do we encode this? We could try to steal a bit somewhere, but in fact there’s no need – we can do better than that.

Suppose you have a triangle with vertex indices (A, B, C). The choice of which vertex is first is somewhat arbitrary: the other two even cycles (B, C, A) and (C, A, B) describe the same triangle with the same winding order (and the odd cycles describe the same triangle with opposite winding order, but let’s leave that alone). We can use this choice to encode a bit of information: say if A≥B, we are indeed coding a triangle – otherwise (A<B), what follows will be two triangles sharing a common edge (namely, the edge AB). We can always pick an even permutation of triangle indices such that A≥B, since for *any* integer A, B, C we have

Because the sum is 0, not all three terms can be negative, which in turn means that at least one of A≥B, B≥C or C≥A must be true. Furthermore, if A, B, and C are all distinct (i.e. the triangle is non-degenerate), all three terms are nonzero, and hence we must have both negative and positive terms for the sum to come out as 0.

Okay, so if the triangle *wasn’t* paired up, we can always cyclically permute the vertices such that A≥B. What do we do when we have two triangles sharing an edge, say AB?

For this configuration, we need to send the 4 indices A, B, C, D, which encode the two triangles (A, B, C) and (A, D, B).

If A<B, we can just send the 4 indices directly, leading to this very simple decoding algorithm that unpacks our mixed triangle/double-triangle indexed buffer back to a regular triangle list:

- Read 3 indices A, B, C.
- Output triangle (A, B, C).
- If A<B, read another index D and output triangle (A, D, B).

Okay, so this works out really nicely if A<B. But what if it’s not? Well, there’s just two cases left. If A=B, the shared edge is a degenerate edge and both triangles are degenerate triangles; not exactly common, so the pragmatic solution is to say “if either triangle is degenerate, you have to send them un-paired”. That leaves the case A>B; but that means B<A, and BA is also a shared edge! In fact, we can simply rotate the diagram by 180 degrees; this swaps the position of (B,A) and (C,D) but corresponds to the same triangles. With the algorithm above, (B, A, D, C) will decode as the two triangles (B, A, D), (B, C, A) – same two triangles as before, just in a different order. So we’re good.

What this means is that, under fairly mild assumptions (but see “limitations” section below), we can have a representation of index buffers that mixes triangles and pairs of adjacent triangles, with no need for any special flag bits (as recommended in Christer’s article) or other hackery to distinguish the two.

In most closed meshes, every triangle has at least one adjacent neighbor (usually several); isolated triangles are very rare. We can store such meshes using 4 indices for every pair of triangles, instead of 6, for about a 33% reduction. Furthermore, most meshes in fact contain a significant number of quadriliterals (quads), and this representation supports quads directly (stored with 4 indices). 33% reduction for index buffers isn’t a huge deal if you have “fat” vertex formats, but for relatively small vertices (as you have in collision detection, among other things), indices can actually end up being a significant part of your overall mesh data.

Finally, this is simple enough to decode that it would probably be viable in GPU hardware. I wouldn’t hold my breath for that one, just thought I might point it out. :)

I wrote up a quick test for this and put it on Github, as usual. This code loads a mesh, vertex-cache optimizes the index buffer (using Tom’s algorithm), then checks for each triangle whether it shares an edge with its immediate successor and if so, sends them as a pair – otherwise, send the triangle alone. That’s it. No attempt is made to be more thorough than that; I just wanted to be as faithful to the original index buffer as possible.

On the “Armadillo” mesh from the Stanford 3D scanning repository, the program outputs this: (**UPDATE**: I added some more features to the sample program and updated the results here accordingly)

172974 verts, 1037832 inds. before: ACMR: 2.617 (16-entry FIFO) 62558 paired tris, 283386 single IB inds: list=1037832, fancy=975274 (-6.03%) after: ACMR: 0.814 (16-entry FIFO) 292458 paired tris, 53486 single IB inds: list=1037832, fancy=745374 (-28.18%) 745374 inds packed 1037832 inds unpacked index buffers match. ACMR: 0.815 (16-entry FIFO)

“Before” is the average cache miss rate (vertex cache misses/triangle) assuming a 16-entry FIFO cache for the original Armadillo mesh (not optimized). As you can see, it’s pretty bad.

I then run the simple pairing algorithm (“fancy”) on that, which (surprisingly enough) manages to reduce the index list size by about 6%.

“After” is after vertex cache optimization. Note that Tom’s algorithm is cache size agnostic; it does not assume any particular vertex cache size, and the only reason I’m dumping stats for a 16-entry FIFO is because I had to pick a number and wanted to pick a relatively conservative estimate. As expected, ACMR is much better; and the index buffer packing algorithm reduces the IB size by about 28%. Considering that the best possible case is a reduction of 33%, this is quite good. Finally, I verify that packing and unpacking the index buffer gives back the expected results (it does), and then re-compute the ACMR on the unpacked index buffer (which has vertices and triangles in a slightly different order, after all).

Long story short: it works, even the basic “only look 1 triangle ahead” algorithm gives good results on vertex cache optimized meshes, and the slight reordering performed by the algorithm does not seem to harm vertex cache hit rate much (on this test mesh anyway). Apologies for only testing on one 3D-scanned mesh, but I don’t actually have any realistic art assets lying around at home, and even if I did, loading them would’ve probably taken me more time than writing the entire rest of this program did.

The original program was missing one more step that is normally done after vertex cache optimization: reordering the vertex buffer so that vertices appear in the order they’re referenced from the index buffer. This usually improves the efficiency of the *pre*-transform cache (as opposed to the *post*-transform cache that the vertex cache optimization algorithm takes care of) because it gives better locality of reference, and has the added side effect of also making the index data more compressible for general purpose lossless compression algorithms like Deflate or LZMA.

Anyway, here’s the results for taking the entire Armadillo mesh – vertices, which just store X/Y/Z position as floats, and 32-bit indices both – and processing it with some standard general-purpose compressors at various stages of the optimization process: (all sizes in binary kilobytes, KiB)

Stage | Size | .zip size | .7z size |
---|---|---|---|

Original mesh | 6082k | 3312k | 2682k |

Vertex cache optimized | 6082k | 2084k | 1504k |

Postprocessed | 4939k (-18.8%) | 1830k (-12.2%) | 1340k (-10.9%) |

So the post-process yields a good 10% reduction in the compressed size for what would be the final packaged assets here. This value is to be taken with a grain of salt: “real” art assets have other per-vertex data besides just 12 bytes for the 3D position, and nothing I described here does anything about vertex data. In other words, this comparison is on data that favors the algorithm in the sense that roughly half of the mesh file is indices, so keep that in mind. Still, 10% reduction post-LZMA is quite good for such a simple algorithm, especially compared to the effort it takes to get the same level of reduction on, say, x86 code.

Also note that the vertex cache optimization by itself *massively* helps the compressors here; the index list for this mesh comes from a 3D reconstruction of range-scanned data and is pathologically bad (the vertex order is really quite random), but the data you get out of a regular 3D mesh export is quite crappy too. So if you’re not doing any optimization on your mesh data yet, you should really consider doing so – it will reduce both your frame timings and your asset sizes.

This is where the (*) from the title comes in. While I think this is fairly nice, there’s two cases where you can’t use this scheme, at least not always:

- When the order of vertices within a triangle matters. An example would be meshes using flat attribute interpolation, where the value chosen for a primitive depends on the “provoking vertex”. And I remember some fairly old graphics hardware where the Z interpolation depended on vertex specification order, so you could get Z-fighting between the passes in multi-pass rendering if they used different subsets of triangles.
- When the order of triangles within a mesh matters (remember that we in the two-tris case, we might end up swapping them to make the encoding work). Having the triangles in a particular order in the index buffer can be very useful with alpha blending, for example. That said, the typical case for this application is that the index buffer partitions into several chunks that should be drawn in order, but with no particular ordering requirements within that chunk, which is easy to do – just prohibit merging tris across chunk boundaries.

That said, it seems to me that it really should be useful in every case where you’d use a vertex cache optimizer (which messes with the order anyway). So it’s probably fine.

Anyway, that’s it. No idea whether it’s useful to anyone, but I think it’s fairly cute, and it definitely seemed worth writing up.

]]>

If these were regular parentheses, this would simply evaluate to , but we’re taking absolute values here, so it’s a bit more complicated than that. However, the expression does look pretty symmetric, so let’s see if we can do something with that. Name and conquer, so let’s define:

This expression looks fairly symmetric in a and b, so what happens if we switch the two?

So it’s indeed symmetric in the two arguments. Another candidate to check is sign flips – we’re taking absolute values here, so we might find something interesting there as well:

And because we already know that f is symmetric in its arguments, this gives us for free – bonus! Putting all these together, we now know that

which isn’t earth-shattering but not obvious either. You could prove this directly from the original expression, but I like doing it this way (defining a function f and poking at it) because it’s much easier to see what’s going on.

But we’re not quite done yet: One final thing you can do with absolute values is figure out what needs to happen for the expression to be non-negative and see if you can simplify it further. Now, both and are always non-negative, so in fact we have

.

Now suppose that . In that case, we know the sign of the expression on the right and can simplify further:

and similarly, when , we get . Putting the two together, we arrive at the conclusion

which I think is fairly cute. (This expression comes up in SATD computations and can be used to save some work in the final step.)

]]>

The DCT described last time (and implemented with the Matlab/Octave code above) is a scaled DCT, and I didn’t mention the scale factors last time. However, they’re easy enough to solve for. We have a forward transform matrix, . Since the rows are orthogonal (by construction), the inverse of (and hence our IDCT) is its transpose, up to scaling. We can write this as matrix equation where S ought to be a diagonal matrix. Solving for S yields . This gives the total scaling when multiplying by both and ; what we really want is a normalizing diagonal matrix such that is orthogonal, which means

since N is diagonal, so , and again because the two are diagonal this just means that the diagonal entries of N are the square roots of the diagonal entries of S. Then we multiply once by N just before quantization on the encoder side, and another time by N just after dequantization on the decoder side. In practice, this scaling can be folded directly into the quantization/dequantization process, but conceptually they’re distinct steps.

This normalization is the right one to use to compare against the “proper” orthogonal DCT-II. That said, with a non-lifting all-integer transform, normalizing to unit gain is a bad idea; we actually *want* some gain to reduce round-off error. A good compromise is to normalize everything to the gain of the DC coefficient, which is for the 1D transform. Using this normalization ensures that DC, which usually contains a significant amount of the overall energy, is not subject to scaling-induced round-off error and thus reproduced exactly.

In the 2D transform, each coefficient takes part in first a column DCT and then a row DCT (or vice versa). That means the gain of a separable 2D 8×8 DCT is the square of the gain for the 1D DCT, which in our case means 8. And since the IDCT is just the transpose of the DCT, it has the same gain, so following up the 2D DCT with the corresponding 2D IDCT introduces another 8x gain increase, bringing the total gain for the transform → normalize → inverse transform chain up to 64.

Well, the input to the DCT in a video codec is the difference between the actual pixel values and the prediction (usually motion compensation or some geometric/gradient predictor), and with 8-bit input signals that means the input is the difference between two values in [0, 255]. The worst cases are when the predicted value is 0 and the actual value is 255 or vice versa, leading to a worst-case input value range of [-255, 255] for our integer DCT. Multiply this by a gain factor of 64 and you get [-16320, 16320]. Considering that the plan is to have a 16-bit integer implementation, this should make it obvious what the problem is: it all fits, but only just, and we have to be *really* careful about overflow and value ranges.

This all falls under the umbrella term of the dynamic range of the transform. The smallest representable non-zero sample values we support are ±1. With 16-bit integer, the largest values we can support are ±32767; -32768 is out since there’s no corresponding positive value. The ratio between the two is the total dynamic range we have to work with. For transforms, it’s customary to define the dynamic range relative to the input value range, not the smallest representable non-zero signal; in our case, the maximum total scaling we can tolerate along any signal path is 32767/255, or about 128.498; anything larger and we might get overflows.

Now, this problem might seem like it’s self-made: as described above, both the forward and inverse transforms have a total gain of 8x, which seems wasteful. Why don’t we, say, leave the 8x gain on the forward transform but try to normalize the inverse transform such that it doesn’t have any extra gain?

Well, yes and no. We can, but it’s addressing the wrong part of the problem. The figures I’ve been quoting so far are for the gain as measured in the 2-norm (Euclidean norm). But for dynamic range analysis, what we need to know is the infinity norm (aka “uniform norm” or “maximum norm”): what is the largest absolute value we’re going to encounter in our vectors? And it turns out that with respect to the infinity norm, the picture looks a bit different.

To see why, let’s start with a toy example. Say we just have two samples, and our transform is a single butterfly:

As covered last time, B is a scaled reflection, its own inverse up to said scaling, and its matrix 2-norm is . However, B’s infinity-norm is 2. For example, the input vector (maximum value of 1) maps to (maximum value of 2) under B. In other words, if we add (or subtract) two numbers, we might get a number back that is up to twice as big as either of the inputs – this shouldn’t come as a surprise. Now let’s apply the transpose of B, which happens to be B itself:

This is just a scaled identity transform, and its 2-norm and infinity norm are 2 exactly. In other words, the 2-norm grew in the second application of B, but with respect to the infinity norm, things were already as bad as they were going to get after the forward transform. And it’s this infinity norm that we need to worry about if we want to avoid overflows.

So what’s our actual matrix norms along the signal path, with the “B2″ DCT variant from last time? This is why I published the Octave code on Github. Let’s work through the 1D DCT first:

2> M=bink_dct_B2(eye(8)); 3> S=diag(8 ./ diag(M*M')); 4> norm(M,2) ans = 3.4324 5> norm(M,'inf') ans = 8.7500 6> norm(S*M,2) ans = 3.2962 7> norm(S*M,'inf') ans = 8.4881 8> norm(M'*S*M,2) ans = 8.0000 9> norm(M'*S*M,'inf') ans = 8.0000

This gives both the 2-norms and infinity norms for first the forward DCT only (`M`

), then forward DCT followed by scaling (`S*M`

), and finally with the inverse transform as well (`M'*S*M`

), although at that point it’s really just a scaled identity matrix so the result isn’t particularly interesting. We can do the same thing with 2D transforms using `kron`

which computes the Kronecker product of two matrices (corresponding to the tensor product of linear maps, which is the mathematical background underlying separable transforms):

10> norm(kron(M,M),'inf') ans = 76.563 11> norm(kron(S,S) * kron(M,M),2) ans = 10.865 12> norm(kron(S,S) * kron(M,M),'inf') ans = 72.047 13> norm(kron(M',M') * kron(S,S) * kron(M,M), 2) ans = 64.000 14> norm(kron(M',M') * kron(S,S) * kron(M,M), 'inf') ans = 64.000 15> norm(kron(M',M') * kron(S,S) * kron(M,M) - 64*eye(64), 'inf') ans = 8.5487e-014 16> 64*eps ans = 1.4211e-014

As you can see, the situation is very similar to what we saw in the butterfly example: while the 2-norm keeps growing throughout the entire process, the infinity norm actually maxes out right after the forward transform and stays roughly at the same level afterwards. The last two lines are just there to show that the entire 2D transform chain really does come out to be a scaled identity transform, to within a bit more than 6 ulps.

More importantly however, we do see a gain factor of around 76.6 right after the forward transform, and the inverse transform starts out with a total gain around 72; remember that this value has to be less than 128 for us to be able to do the transform using 16-bit integer arithmetic. As said before, it can fit, but it’s a close shave; we have less than 1 bit of headroom, so we need to be really careful about overflow – there’s no margin for sloppiness. And furthermore, while this means that the *entire transform* fits inside 16 bits, what we really need to make sure is that the way we’re computing it is free of overflows; we need to drill down a bit further.

For this, it helps to have a direct implementation of the DCT and IDCT; a direct implementation (still in Octave) for the DCT is here, and I’ve also checked in a direct IDCT taking an extra parameter that determines how many stages of the IDCT to compute. A “stage” is a set of mutually independent butterflies or rotations that may depend on previous stages (corresponding to “clean breaks” in a butterfly diagram or the various matrix factors in a matrix decomposition).

I’m not going to quote the entire code (it’s not very long, but there’s no point), but here’s a small snippet to show what’s going on:

% odd stage 4 c4 = d4; c5 = d5 + d6; c7 = d5 - d6; c6 = d7; if stages > 1 % odd stage 3 b4 = c4 + c5; b5 = c4 - c5; b6 = c6 + c7; b7 = c6 - c7; % even stage 3 b0 = c0 + c1; b1 = c0 - c1; b2 = c2 + c2/4 + c3/2; % 5/4*c2 + 1/2*c3 b3 = c2/2 - c3 - c3/4; % 1/2*c3 - 5/4*c3 % ... end

The trick is that within each stage, there are only integer additions and right-shifts (here written as divisions because this is Octave code, but in integer code it should really be shifts and not divisions) of values computed in a previous stage. The latter can never overflow. The former may have intermediate overflows in 16-bit integer, but provided that computations are done in two’s complement arithmetic, the results will be equal to the results of the corresponding exact integer computation modulo 2^{16}. So as long as said exact integer computation has a result that fits into 16 bits, the computed results are going to be correct, even if there are intermediate overflows. Note that this is true when computed using SIMD intrinsics that have explicit two’s complement semantics, but is *not* guaranteed to be true when implemented using 16-bit integers in C/C++ code, since signed overflow in C triggers undefined behavior!

Also note that we can’t compute values such as `5/4*c2`

as `(5*c2)/4`

, since the intermediate value `5*c2`

can overflow – and knowing as we do already that we have less than 1 bit of headroom left, likely will for some inputs. One option is to do the shift first: `5*(c2/4)`

. This works but throws away some of the least-significant bits. The form `c2 + (c2/4)`

chosen here is a compromise; it has larger round-off error than the preferred `(5*c2)/4`

but will only overflow if the final result does. The findorth tool described last time automatically determines expressions in this form.

Okay, that’s all the ingredients we need. We have the transform decomposed into separate stages; we know the value range at the input to each stage, and we know that if both the inputs and the outputs fit inside 16 bits, the results computed in each stage will be correct. So now all we need to do is verify that last bit – *are* the outputs of the intermediate stages all 16 bits or less?

2> I=eye(8); 3> F=bink_dct_B2(I); 4> S=diag(8 ./ diag(F*F')); 5> Inv1=bink_idct_B2_partial(I,1); 6> Inv2=bink_idct_B2_partial(I,2); 7> Inv3=bink_idct_B2_partial(I,3); 8> Inv4=bink_idct_B2_partial(I,4); 9> Scaled=kron(S,S) * kron(F,F); 10> norm(kron(I,I) * Scaled, 'inf') ans = 72.047 11> norm(kron(I,Inv1) * Scaled, 'inf') ans = 72.047 12> norm(kron(I,Inv2) * Scaled, 'inf') ans = 77.811 13> norm(kron(I,Inv3) * Scaled, 'inf') ans = 77.811 14> norm(kron(I,Inv4) * Scaled, 'inf') ans = 67.905 15> norm(kron(Inv1,Inv4) * Scaled, 'inf') ans = 67.905 16> norm(kron(Inv2,Inv4) * Scaled, 'inf') ans = 73.337 17> norm(kron(Inv3,Inv4) * Scaled, 'inf') ans = 73.337 18> norm(kron(Inv4,Inv4) * Scaled, 'inf') ans = 64.000

Lines 2-4 compute the forward DCT and scaling matrices. 5 through 9 compute various partial IDCTs (`Inv4`

is the full IDCT) and the matrix representing the results after the 2D forward DCT and scaling. After that, we successively calculate the norms of more and more of the 2D IDCT: no IDCT, first stage of row IDCT, first and second stage of row IDCT, …, full row IDCT, full row IDCT and first stage of column IDCT, …, full 2D IDCT.

Long story short, the worst matrix infinity norm we have along the whole transform chain is 77.811, which is well below 128; hence the Bink B2 IDCT is overflow-free for 8-bit signals when implemented using 16-bit integers.

A similar process can be used for the forward DCT, but in our case, we just implement it in 32-bit arithmetic (and actually floating-point arithmetic at that, purely for convenience). The forward DCT is only used in the encoder, which spends *a lot* more time per block than the decoder ever does, and almost none of it in the DCT. Finally, there’s the scaling step itself; there’s ways to do this using only 16 bits on the decoder side as well (see for example “Low-complexity transform and quantization in H. 264/AVC”), but since the scaling only needs to be applied to non-zero coefficients and the post-quantization DCT coefficients in a video codec are *very* sparse, it’s not a big cost to do the scaling during entropy decoding using regular, scalar 32-bit fixed point arithmetic. And if the scaling is designed such that no intermediate values ever take more than 24 bits (which is much easier to satisfy than 16), this integer computation can be performed exactly using floating-point SIMD too, on platforms where 32-bit integer multiplies are slow.

And that’s it! Congratulations, you now know everything worth knowing about Bink 2.2′s integer DCTs. :)

]]>

There’s several different types of discrete cosine transforms, or DCTs for short. All of them are linear transforms on N-dimensional vectors (i.e. they can be written as a N×N matrix), and they all relate to larger real DFTs (discrete Fourier transforms) in some way; the different types correspond to different boundary conditions and different types of symmetries. In signal processing and data compression, the most important variants are what’s called DCT-II through DCT-IV. What’s commonly called “the” DCT is the DCT-II; this is the transform used in, among other things, JPEG, MPEG-1 and MPEG-2. The DCT-III is the inverse of the DCT-II (up to scaling, depending on how the two are normalized; there’s different conventions in use). Thus it’s often called “the” IDCT (inverse DCT). The DCT-IV is its own inverse, and forms the basis for the MDCT (modified DCT), a lapped transform that’s at the heart of most popular perceptual audio codecs (MP3, AAC, Vorbis, and Opus, among others). DCTs I and V-VIII also exist but off the top of my head I can’t name any signal processing or data compression applications that use them (which doesn’t mean there aren’t any).

Various theoretical justifications exist for the DCT’s use in data compression; they are quite close to the optimal choice of orthogonal basis for certain classes of signals, the relationship to the DFT means there are fast (FFT-related) algorithms available to compute them, and they are a useful building block for other types of transforms. Empirically, after 25 years of DCT-based codecs, there’s little argument that they work fairly well in practice too.

Image (and video) codecs operate on 2D data; like the DFT, there are 2D versions of all DCTs that are separable: a 2D DCT of a M×N block decomposes into N 1D M-element DCTs on the columns followed by M N-element DCTs on the rows (or vice versa). JPEG and the MPEGs up to MPEG-4 ASP use 2D DCTs on blocks of 8×8 pixels. H.264 (aka MPEG-4 AVC) initially switched to 4×4 blocks, then added the 8×8 blocks back in as part of the “High” profile later. H.265 / HEVC has both sizes and also added support for 16×16 and 32×32 transform blocks. Bink 2 sticks with 8×8 DCTs on the block transform level.

There are lots of different DCT algorithms, and I’m only going to mention a few of them. Like the DFT, DCTs have a recursive structure that allows them to be implemented in O(N log N) operations, instead of the O(N^{2}) operations a full matrix-vector multiply would take. Unlike the FFT, which decomposes into nothing but smaller FFTs except for some multiplications by scalars (“twiddle factors”) plus a permutation, the recursive factorizations of one DCT type will usually contain other trigonometric transforms – both smaller DCTs and smaller DST (discrete sine transforms) of varying types.

The first important dedicated DCT algorithm is presented in [Chen77] (see references below), which provides a DCT factorization for any power-of-2 N which is substantially faster than computing the DCT using a larger FFT. Over a decade later, [LLM89] describes a family of minimum-complexity (in terms of number of arithmetic operations) solutions for N=8 derived using graph transforms, a separate algorithm that has no more than one multiply along any path (this is the version used in the IJG JPEG library), and a then-new fast algorithm for N=16, all in a 4-page paper. [AAN88] introduces a scaled algorithm for N=8 which greatly reduces the number of multiplications. A “scaled” DCT is one where the output coefficients have non-uniform scaling factors; compression applications normally follow up the DCT with a quantization stage and precede the IDCT with a dequantization stage. The scale factors can be folded into the quantization / dequantization steps, which makes them effectively “free”. From today’s perspective, the quest for a minimal number of multiplies in DCT algorithms seems quaint; fast, pipelined multipliers are available almost everywhere these days, and today’s *mobile phone CPUs* achieve floating point throughputs higher than those of 1989′s fastest supercomputers (2012 Nexus 4: 2.8 GFLOPS measured with RgbenchMM; Cray-2: 1.9 GFLOPS peak – let that sink in for a minute). That said, the “scaled DCT” idea is important for other reasons.

There’s also direct 2D methods that can reduce number of arithmetic operations further relative to a separable 1D implementation; however, the overall reduction isn’t very large, the 2D algorithms require considerably more code (instead of 2 loops processing N elements each, there is a single “unrolled” computation processing N^{2} elements), and they have data flow patterns that are unfriendly to SIMD or hardware implementations (separable filters, on the other hand, are quite simple to implement in a SIMD fashion).

For 1D DCTs and N=8, the situation hasn’t substantially changed since the 1980s. Larger DCTs (16 and up) have seen some improvement on their arithmetic operation costs in recent years [Plonka02] [Johnson07], with algorithms derived symbolically from split-radix FFTs (though again, how much of a win the better algorithms are heavily depends on the environment).

Independent of the choice of DCT algorithm, they all break down into the following 3 basic components:

**Butterflies**. A butterfly is the transform (they’re called that way because of how they’re drawn in diagram form). A butterfly is also its own inverse, up to a scale factor of two, since and likewise .**Planar rotations**. Take a pair of values, interpret them as coordinates in the plane, and rotate them about the origin. I’ve written about them (even in the context of DCTs) before. The inverse of a rotation by θ radians is a rotation by -θ radians. There’s also planar reflections which are closely related to rotations and work out pretty much the same on the implementation side. Reflections are self-inverse, and in fact a butterfly is just a reflection scaled by .**Scalar multiplies**. Map for some nonzero constant c. The inverse is scaling by 1/c.

There are also a few properties of the DCT-II that simplify the situation further. Namely, the DCT-II (when properly normalized) is an *orthogonal* transform, which means it can be decomposed purely into planar rotations and potentially a few sign flips at the end (which can be represented as scalar multiplications). Butterflies, being as they are a scaled reflection, are used because they are cheaper than a “proper” rotation/reflection, but they introduce a scale factor of . So scalar multiplications are used to normalize the scaling across paths that go through a different number of butterflies; but the DCT algorithms we’re considering here only have scaling steps at the end, which can conveniently be dropped if a non-normalized (scaled) DCT is acceptable.

What this all boils down to is that a DCT implementation basically boils down to which planar rotations are performed when, and which of the various ways to perform rotations are employed. There are tons of tricky trade-offs here, and which variant is optimal really depends on the target machine. Older standards (JPEG, the earlier MPEGs) didn’t specify exactly which IDCT was to be used by decoders, leaving this up to the implementation. Every codec had its own DCT/IDCT implementations, which caused problems when a video was encoded using codec A and decoded using codec B – basically, over time, the decoded image could drift from what the encoder intended. Some codecs even had multiple DCT implementations with different algorithms (say one for MMX, and another for SSE2-capable CPUs) so this problem occurred even between different machines using the same codec. And of course there’s the general problem that the “real” DCT involves irrational numbers as coefficients, so there’s really no efficient way to compute it exactly for arbitrary inputs.

The solution to all these problems is to pick a specific IDCT approximation and specify it exactly – preferably using (narrow) integer operations only, since floating-point is expensive in hardware implementations and getting 100% bit-identical results for floating-point calculations across multiple platforms and architectures is surprisingly hard in software.

So what denotes a good integer DCT? It depends on the application. As the title says, this post is about the 8×8 integer IDCT (and matching DCT) design used for Bink 2.2. We had the following requirements:

- Bit-exact implementation strictly required (Bink 2 will often go hundreds or even thousands of frames without a key frame). Whatever algorithm we use, it must be
*exactly*the same on all targets. - Must be separable, i.e. the 2D DCT factors into 1D DCT steps.
- Must be a close approximation to the DCT. The goal was to be within 2% in terms of the matrix 2-norm, and same for the coding gain compared to a “proper” DCT. Bink 2 was initially designed with a full-precision floating-point (or high-precision fixed point) DCT in mind and we wanted a drop-in solution that wouldn’t affect the rest of the codec too much.
- Basis vectors must be fully orthogonal (or very close) to allow for trellis quantization.
- Must have an efficient SIMD implementation on all of the following:
- x86 with SSE2 or higher in both 32-bit and 64-bit modes.
- ARM with NEON support.
- PowerPC with VMX/AltiVec (e.g. PS3 Cell PPU).
- Cell SPUs (PS3 again).
- PowerPC with VMX128 (Xbox 360 CPU).

Yep, lots of game consoles – RAD’s a game middleware company.

- 16-bit integer preferred, to enable 8-wide operation on 128-bit SIMD instruction sets. To be more precise, for input values in the range [-255,255], we want intermediate values to fit inside 16 bits (signed) through all stages of the transform.

This turns out to limit our design space greatly. In particular, VMX128 removes *all* integer multiplication operations from AltiVec – integer multiplication by scalars, where desired, has to be synthesized from adds, subtractions and shifts. This constraint ended up driving the whole design.

There’s other integer DCTs designed along similar lines, including several used in other video codecs. However, while there’s several that only use 16-bit integer adds and shifts, they are generally much coarser approximations to the “true” DCT than what we were targeting. Note that this doesn’t mean they’re necessarily worse for compression purposes, or that the DCT described here is “better”; it just means that we wanted a drop-in replacement for the full-precision DCT, and the transform we ended up with is a closer to that goal than published variants with similar complexity.

Most of the integer DCT designs I’m aware of start with a DCT factorization; the Bink 2 DCT is no exception, and starts from the DCT-II factorizations given in [Plonka02]. For N=8, this is equivalent to the factorization described in [LLM89]: by reordering the rows of the butterfly matrices in Plonka’s factorization and flipping signs to turn rotation-reflections into pure rotations, the DCT-II factorization from example 2.8 can be expressed as the butterfly diagram (click to zoom)

which corresponds to the DCT variation on the left of the 2nd row in figure 4 of [LLM89], with the standard even part. However, [LLM89] covers only N=8, N=16 and (implicitly) N=4, whereas [Plonka02] provides factorizations for arbitrary power-of-2 N, making it easy to experiment with larger transform sizes with a single basic algorithm. **UPDATE**: The previous version of the diagram had a sign flip in the rightmost butterfly. Thanks to “terop” for pointing this out in the comments!

Now, we’re allowing a scaled DCT, so the final scale factors of can be omitted. As explained above, this leaves butterflies (which have simple and obvious integer implementations) and exactly three rotations, one in the even part (the top half, called “even” since it produces the even-numbered DCT coefficients) and two in the odd part:

As customary for DCT factorizations, and . Note that due to the corresponding trig identities, we have , , and , which means that the three rotations we see here (and indeed all rotations referenced in [LLM89]) can be expressed in terms of just , , , , and .

Now, all we really need to do to get an integer DCT is to pick integer approximations for these rotations (preferably using only adds and shifts, see the constraints above!). One way to do so is the BinDCT [Liang01], which uses the decomposition of rotation into shears I explained previously and then approximates the shears using simple dyadic fractions. This is a cool technique, but yields transforms with basis vectors that aren’t fully orthogonal; how big the non-orthogonality is depends on the quality of approximation, with cheaper approximations being less orthogonal. While not a show-stopper, imperfect orthogonality violates the assumptions inherent in trellis quantization and thus reduces the quality of the rate-distortion optimization performed on the encoder side.

So how do we approximate irrational rotations as integers (preferably small integers) with low error? The critical insight is that *any* matrix of the form

is a planar rotation about some angle θ such that . And more generally, any matrix of the form

is a *scaled* planar rotation. Thus, if we’re okay with a scaled DCT (and we are), we can just pick two arbitrary integers c, s that aren’t both zero and we get a scaled integer rotation (about some arbitrary angle). To approximate a *particular* rotation angle θ, we want to ensure that . Since we’d prefer small integer values s and c, we can find suitable approximations simply by enumerating all possibilities with a suitable ratio, checking how closely they approximate the target angle, and using the best one. This is easily done with a small program, and by itself sufficient to find a good solution for the rotation in the even part of the transform.

The odd part is a bit trickier, because it contains two rotations and butterflies that combine the results from different rotations. Therefore, if the two rotations were scaled differently, such a butterfly will result not in a scaled DCT, but a different transformation altogether. So simply approximating the rotations individually won’t work; however, we can approximate the two rotations simultaneously, and add the additional constraint that (somewhat reminiscent of Pythagorean triples), thus guaranteeing that the norm of the two rotations is the same. Initially I was a bit worried that this might over-constrain the problem, but it turns out that even with this extra complication, there are plenty of solutions to choose from. As before, this problem is amenable to an exhaustive search over all small integers that have the right ratios between each other.

Finally, we need implementations of the resulting rotations using only integer additions/subtractions and shifts. Since we’re already fully integer, all this means is that we need to express the integer multiplications using only adds/subtracts and shifts. This is an interesting and very tricky problem by itself, and I won’t cover it here; see books like Hacker’s Delight for a discussion of the problem. I implemented a simple algorithm that just identifies runs of 0- and 1-bits in the binary representation of values. This is optimal for small integers (below 45) but not necessarily for larger ones, but it’s usually fairly good. And since the program already tries out a bunch of variants, I made it compute costs for the different ways to factor planar rotations as well.

The result is findorth.cpp, a small C++ program that finds candidate small integer approximations to the DCT rotations and determines their approximate cost (in number of adds/subtractions and shifts) as well. Armed with this and the rest of the factorization above, it’s easy to come up with several families of integer DCT-like transforms at various quality and cost levels and compare them to existing published ones:

Variant | (c_{2},s_{2}) |
(c_{1},s_{1}) |
(c_{3},s_{3}) |
Cost | L2 err | Gain (dB) |
---|---|---|---|---|---|---|

A1 | (17,-7)/16 | (8,-1)/8 | (7,4)/8 | 32A 10S | 0.072 | 8.7971 |

B1 | (5,-2)/4 | (8,-1)/8 | (7,4)/8 | 30A 10S | 0.072 | 8.7968 |

A2 | (17,-7)/16 | (19,-4)/16 | (16,11)/16 | 38A 12S | 0.013 | 8.8253 |

B2 | (5,-2)/4 | (19,-4)/16 | (16,11)/16 | 36A 12S | 0.013 | 8.8250 |

A3 | (17,-7)/16 | (65,-13)/64 | (55,37)/64 | 42A 15S | 0.003 | 8.8258 |

B3 | (5,-2)/4 | (65,-13)/64 | (55,37)/64 | 40A 15S | 0.012 | 8.8255 |

Real DCT | - | - | - | - | 0.000 | 8.8259 |

BinDCT-L1 | - | - | - | 40A 22S | 0.013 | 8.8257 |

H.264 | - | - | - | 32A 10S | 0.078 | 8.7833 |

VC-1 | - | - | - | ? | 0.078 | 8.7978 |

The cost is measured in adds/subtracts (“A”) and shifts (“S”); “L2 err” is the matrix 2-norm of the difference between the approximated DCT matrix and the true DCT, and “gain” denotes the coding gain assuming a first-order Gauss-Markov process with autocorrelation coefficient ρ=0.95 (a standard measure for the quality of transforms in image coding).

These are by far not all solutions that `findorth`

finds, but they’re what I consider the most interesting ones, i.e. they’re at good points on the cost/approximation quality curve. The resulting transforms compare favorably to published transforms at similar cost (and approximation quality) levels; Bink 2.2 uses variant “B2″, which is the cheapest one that met our quality targets. The Octave code for everything in this post is available on Github.

The H.264 transform is patented, and I believe the VC-1 transform is as well. The transforms described in this post are not (to the best of my knowledge), and RAD has no intention of ever patenting them or keeping them secret. Part of the reason for this post is that we wanted to make an integer DCT that is of similar (or higher) quality and comparable cost than those in existing video standards freely available to everyone. And should one of the particular transforms derived in this post turn out to be patent-encumbered, `findorth`

provides everything necessary to derive a (possibly slightly worse, or slightly more expensive) alternative. (Big thanks to my boss at RAD, Jeff Roberts, for allowing and even encouraging me to write all this up in such detail!)

This post describes the basic structure of the transform; I’ll later post a follow-up with the details of our 16-bit implementation, and the derivation for why 16 bits are sufficient provided input data is in the range of [-255,255].

- [Chen77]
- Chen, Wen-Hsiung, C. H. Smith, and Sam Fralick. “A fast computational algorithm for the discrete cosine transform.” Communications, IEEE Transactions on 25.9 (1977): 1004-1009. PDF
- [AAN88]
- Yukihiro, Arai, Agui Takeshi, and Masayuki Nakajima. “A fast DCT-SQ scheme for images.” IEICE TRANSACTIONS (1976-1990) 71.11 (1988): 1095-1097.
- [LLM89]
- Loeffler, Christoph, Adriaan Ligtenberg, and George S. Moschytz. “Practical fast 1-D DCT algorithms with 11 multiplications.” Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on. IEEE, 1989. PDF
- [Liang01]
- Liang, Jie, and Trac D. Tran. “Fast multiplierless approximations of the DCT with the lifting scheme.” Signal Processing, IEEE Transactions on 49.12 (2001): 3032-3044. PDF
- [Plonka02]
- Plonka, Gerhard, and Manfred Tasche. “Split-radix algorithms for discrete trigonometric transforms.” (2002). PDF
- [Johnson07]
- Johnson, Steven G., and Matteo Frigo. “A modified split-radix FFT with fewer arithmetic operations.” Signal Processing, IEEE Transactions on 55.1 (2007): 111-119. PDF

]]>

Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they’re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).

And then there’s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it’s not always obvious how.

On x86, there’s `BSF`

(bit scan forward), `BSR`

(bit scan reverse), `TZCNT`

(trailing zero count) and `LZCNT`

(leading zero count). The first two are “old” (having existed since 386), the latter are very new. `BSF`

and `TZCNT`

do basically the same thing, with one difference: `BSF`

has “undefined” results on a zero input (what is the index of the first set bit in the integer 0?), whereas the “trailing zeros” definition for `TZCNT`

provides an obvious answer: the width of the register. `BSR`

returns the index of the “last” set bit (again undefined on 0 input), `LZCNT`

returns the number of leading zero bits, so these two actually produce different results.

`x86_bsf(x) = x86_tzcnt(x)`

unless x=0 (in which case`bsf`

is undefined).`x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x)`

unless x=0 (in which case`bsr`

is undefined).

In the following, I’ll only use `LZCNT`

and `TZCNT`

(use above table to convert where necessary) simply to avoid that nasty edge case at 0.

PPC has `cntlzw`

(Count Leading Zeros Word) and `cntlzd`

(Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there’s the old trick `x & -x`

which clears all but the lowest set bit in `x`

. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.

`x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x & -x)`

unless x=0 (in which case the PPC expression returns -1).`x86_lzcnt(x) = ppc_cntlz(x)`

.

ARM also has `CLZ`

(Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned `RBIT`

which reverses the bits in a word, which makes a trailing zero count easy to accomplish:

`x86_tzcnt(x) = arm_clz(arm_rbit(x))`

.`x86_lzcnt(x) = arm_clz(x)`

.

Finally, ARMs NEON also offers `VCLS`

(Vector Count Leading Sign Bits), which (quoting from the documentation) “counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit”. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have: `arm_cls(x) = x86_lzcnt(x ^ (x >> 1)) - 1`

(the shift here is an arithmetic shift). The expression `y = x ^ (x >> 1)`

gives a value that has bit `n`

set if and only if bits `n`

and `n + 1`

of x are the same. By induction, the number of leading zeros in `y`

is thus exactly the number of leading bits in `x`

that match the sign bit. This count includes the topmost (sign) bit, so it’s always at least 1, and the instruction definition I just quoted requires us to return the number of bits *following* the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we’re good.

Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.

]]>