Last time, I talked about DCTs in general and how the core problem in designing integer DCTs are the rotations. However, I promised an 8×8 IDCT that is suitable for SIMD implementation using 16-bit integer arithmetic without multiplies, and we’re not there yet. I’m going to be focusing on the “B2″ variant of the transform in this post, for concreteness. That means the starting point will be this implementation, which is just a scaled version of Plonka’s DCT factorization with the rotation approximations discussed last time.

### Normalization

The DCT described last time (and implemented with the Matlab/Octave code above) is a scaled DCT, and I didn’t mention the scale factors last time. However, they’re easy enough to solve for. We have a forward transform matrix, $M$. Since the rows are orthogonal (by construction), the inverse of $M$ (and hence our IDCT) is its transpose, up to scaling. We can write this as matrix equation $I = M^T S M$ where S ought to be a diagonal matrix. Solving for S yields $S = (M M^T)^{-1}$. This gives the total scaling when multiplying by both $M$ and $M^T$; what we really want is a normalizing diagonal matrix $N$ such that $\tilde{M} = NM$ is orthogonal, which means

$I = \tilde{M}^T \tilde{M} = M^T N^T N M = M^T (N^2) M$

since N is diagonal, so $N^2 = S$, and again because the two are diagonal this just means that the diagonal entries of N are the square roots of the diagonal entries of S. Then we multiply once by N just before quantization on the encoder side, and another time by N just after dequantization on the decoder side. In practice, this scaling can be folded directly into the quantization/dequantization process, but conceptually they’re distinct steps.

This normalization is the right one to use to compare against the “proper” orthogonal DCT-II. That said, with a non-lifting all-integer transform, normalizing to unit gain is a bad idea; we actually want some gain to reduce round-off error. A good compromise is to normalize everything to the gain of the DC coefficient, which is $\sqrt{8}$ for the 1D transform. Using this normalization ensures that DC, which usually contains a significant amount of the overall energy, is not subject to scaling-induced round-off error and thus reproduced exactly.

In the 2D transform, each coefficient takes part in first a column DCT and then a row DCT (or vice versa). That means the gain of a separable 2D 8×8 DCT is the square of the gain for the 1D DCT, which in our case means 8. And since the IDCT is just the transpose of the DCT, it has the same gain, so following up the 2D DCT with the corresponding 2D IDCT introduces another 8x gain increase, bringing the total gain for the transform → normalize → inverse transform chain up to 64.

Well, the input to the DCT in a video codec is the difference between the actual pixel values and the prediction (usually motion compensation or some geometric/gradient predictor), and with 8-bit input signals that means the input is the difference between two values in [0, 255]. The worst cases are when the predicted value is 0 and the actual value is 255 or vice versa, leading to a worst-case input value range of [-255, 255] for our integer DCT. Multiply this by a gain factor of 64 and you get [-16320, 16320]. Considering that the plan is to have a 16-bit integer implementation, this should make it obvious what the problem is: it all fits, but only just, and we have to be really careful about overflow and value ranges.

### Dynamic range

This all falls under the umbrella term of the dynamic range of the transform. The smallest representable non-zero sample values we support are ±1. With 16-bit integer, the largest values we can support are ±32767; -32768 is out since there’s no corresponding positive value. The ratio between the two is the total dynamic range we have to work with. For transforms, it’s customary to define the dynamic range relative to the input value range, not the smallest representable non-zero signal; in our case, the maximum total scaling we can tolerate along any signal path is 32767/255, or about 128.498; anything larger and we might get overflows.

Now, this problem might seem like it’s self-made: as described above, both the forward and inverse transforms have a total gain of 8x, which seems wasteful. Why don’t we, say, leave the 8x gain on the forward transform but try to normalize the inverse transform such that it doesn’t have any extra gain?

Well, yes and no. We can, but it’s addressing the wrong part of the problem. The figures I’ve been quoting so far are for the gain as measured in the 2-norm (Euclidean norm). But for dynamic range analysis, what we need to know is the infinity norm (aka “uniform norm” or “maximum norm”): what is the largest absolute value we’re going to encounter in our vectors? And it turns out that with respect to the infinity norm, the picture looks a bit different.

To see why, let’s start with a toy example. Say we just have two samples, and our transform is a single butterfly:

$B = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$

As covered last time, B is a scaled reflection, its own inverse up to said scaling, and its matrix 2-norm is $\sqrt{2} \approx 1.4142$. However, B’s infinity-norm is 2. For example, the input vector $(1, 1)^T$ (maximum value of 1) maps to $(2, 0)^T$ (maximum value of 2) under B. In other words, if we add (or subtract) two numbers, we might get a number back that is up to twice as big as either of the inputs – this shouldn’t come as a surprise. Now let’s apply the transpose of B, which happens to be B itself:

$B^T B = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}^2 = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$

This is just a scaled identity transform, and its 2-norm and infinity norm are 2 exactly. In other words, the 2-norm grew in the second application of B, but with respect to the infinity norm, things were already as bad as they were going to get after the forward transform. And it’s this infinity norm that we need to worry about if we want to avoid overflows.

### Scaling along the signal path

So what’s our actual matrix norms along the signal path, with the “B2″ DCT variant from last time? This is why I published the Octave code on Github. Let’s work through the 1D DCT first:

2> M=bink_dct_B2(eye(8));
3> S=diag(8 ./ diag(M*M'));
4> norm(M,2)
ans =  3.4324
5> norm(M,'inf')
ans =  8.7500
6> norm(S*M,2)
ans =  3.2962
7> norm(S*M,'inf')
ans =  8.4881
8> norm(M'*S*M,2)
ans =  8.0000
9> norm(M'*S*M,'inf')
ans =  8.0000


This gives both the 2-norms and infinity norms for first the forward DCT only (M), then forward DCT followed by scaling (S*M), and finally with the inverse transform as well (M'*S*M), although at that point it’s really just a scaled identity matrix so the result isn’t particularly interesting. We can do the same thing with 2D transforms using kron which computes the Kronecker product of two matrices (corresponding to the tensor product of linear maps, which is the mathematical background underlying separable transforms):

10> norm(kron(M,M),'inf')
ans =  76.563
11> norm(kron(S,S) * kron(M,M),2)
ans =  10.865
12> norm(kron(S,S) * kron(M,M),'inf')
ans =  72.047
13> norm(kron(M',M') * kron(S,S) * kron(M,M), 2)
ans =  64.000
14> norm(kron(M',M') * kron(S,S) * kron(M,M), 'inf')
ans =  64.000
15> norm(kron(M',M') * kron(S,S) * kron(M,M) - 64*eye(64), 'inf')
ans = 8.5487e-014
16> 64*eps
ans = 1.4211e-014


As you can see, the situation is very similar to what we saw in the butterfly example: while the 2-norm keeps growing throughout the entire process, the infinity norm actually maxes out right after the forward transform and stays roughly at the same level afterwards. The last two lines are just there to show that the entire 2D transform chain really does come out to be a scaled identity transform, to within a bit more than 6 ulps.

More importantly however, we do see a gain factor of around 76.6 right after the forward transform, and the inverse transform starts out with a total gain around 72; remember that this value has to be less than 128 for us to be able to do the transform using 16-bit integer arithmetic. As said before, it can fit, but it’s a close shave; we have less than 1 bit of headroom, so we need to be really careful about overflow – there’s no margin for sloppiness. And furthermore, while this means that the entire transform fits inside 16 bits, what we really need to make sure is that the way we’re computing it is free of overflows; we need to drill down a bit further.

For this, it helps to have a direct implementation of the DCT and IDCT; a direct implementation (still in Octave) for the DCT is here, and I’ve also checked in a direct IDCT taking an extra parameter that determines how many stages of the IDCT to compute. A “stage” is a set of mutually independent butterflies or rotations that may depend on previous stages (corresponding to “clean breaks” in a butterfly diagram or the various matrix factors in a matrix decomposition).

I’m not going to quote the entire code (it’s not very long, but there’s no point), but here’s a small snippet to show what’s going on:

  % odd stage 4
c4 = d4;
c5 = d5 + d6;
c7 = d5 - d6;
c6 = d7;

if stages > 1
% odd stage 3
b4 = c4 + c5;
b5 = c4 - c5;
b6 = c6 + c7;
b7 = c6 - c7;

% even stage 3
b0 = c0 + c1;
b1 = c0 - c1;
b2 = c2 + c2/4 + c3/2; % 5/4*c2 + 1/2*c3
b3 = c2/2 - c3 - c3/4; % 1/2*c3 - 5/4*c3

% ...
end


The trick is that within each stage, there are only integer additions and right-shifts (here written as divisions because this is Octave code, but in integer code it should really be shifts and not divisions) of values computed in a previous stage. The latter can never overflow. The former may have intermediate overflows in 16-bit integer, but provided that computations are done in two’s complement arithmetic, the results will be equal to the results of the corresponding exact integer computation modulo 216. So as long as said exact integer computation has a result that fits into 16 bits, the computed results are going to be correct, even if there are intermediate overflows. Note that this is true when computed using SIMD intrinsics that have explicit two’s complement semantics, but is not guaranteed to be true when implemented using 16-bit integers in C/C++ code, since signed overflow in C triggers undefined behavior!

Also note that we can’t compute values such as 5/4*c2 as (5*c2)/4, since the intermediate value 5*c2 can overflow – and knowing as we do already that we have less than 1 bit of headroom left, likely will for some inputs. One option is to do the shift first: 5*(c2/4). This works but throws away some of the least-significant bits. The form c2 + (c2/4) chosen here is a compromise; it has larger round-off error than the preferred (5*c2)/4 but will only overflow if the final result does. The findorth tool described last time automatically determines expressions in this form.

### Verification

Okay, that’s all the ingredients we need. We have the transform decomposed into separate stages; we know the value range at the input to each stage, and we know that if both the inputs and the outputs fit inside 16 bits, the results computed in each stage will be correct. So now all we need to do is verify that last bit – are the outputs of the intermediate stages all 16 bits or less?

2> I=eye(8);
3> F=bink_dct_B2(I);
4> S=diag(8 ./ diag(F*F'));
5> Inv1=bink_idct_B2_partial(I,1);
6> Inv2=bink_idct_B2_partial(I,2);
7> Inv3=bink_idct_B2_partial(I,3);
8> Inv4=bink_idct_B2_partial(I,4);
9> Scaled=kron(S,S) * kron(F,F);
10> norm(kron(I,I) * Scaled, 'inf')
ans =  72.047
11> norm(kron(I,Inv1) * Scaled, 'inf')
ans =  72.047
12> norm(kron(I,Inv2) * Scaled, 'inf')
ans =  77.811
13> norm(kron(I,Inv3) * Scaled, 'inf')
ans =  77.811
14> norm(kron(I,Inv4) * Scaled, 'inf')
ans =  67.905
15> norm(kron(Inv1,Inv4) * Scaled, 'inf')
ans =  67.905
16> norm(kron(Inv2,Inv4) * Scaled, 'inf')
ans =  73.337
17> norm(kron(Inv3,Inv4) * Scaled, 'inf')
ans =  73.337
18> norm(kron(Inv4,Inv4) * Scaled, 'inf')
ans =  64.000


Lines 2-4 compute the forward DCT and scaling matrices. 5 through 9 compute various partial IDCTs (Inv4 is the full IDCT) and the matrix representing the results after the 2D forward DCT and scaling. After that, we successively calculate the norms of more and more of the 2D IDCT: no IDCT, first stage of row IDCT, first and second stage of row IDCT, …, full row IDCT, full row IDCT and first stage of column IDCT, …, full 2D IDCT.

Long story short, the worst matrix infinity norm we have along the whole transform chain is 77.811, which is well below 128; hence the Bink B2 IDCT is overflow-free for 8-bit signals when implemented using 16-bit integers.

A similar process can be used for the forward DCT, but in our case, we just implement it in 32-bit arithmetic (and actually floating-point arithmetic at that, purely for convenience). The forward DCT is only used in the encoder, which spends a lot more time per block than the decoder ever does, and almost none of it in the DCT. Finally, there’s the scaling step itself; there’s ways to do this using only 16 bits on the decoder side as well (see for example “Low-complexity transform and quantization in H. 264/AVC”), but since the scaling only needs to be applied to non-zero coefficients and the post-quantization DCT coefficients in a video codec are very sparse, it’s not a big cost to do the scaling during entropy decoding using regular, scalar 32-bit fixed point arithmetic. And if the scaling is designed such that no intermediate values ever take more than 24 bits (which is much easier to satisfy than 16), this integer computation can be performed exactly using floating-point SIMD too, on platforms where 32-bit integer multiplies are slow.

And that’s it! Congratulations, you now know everything worth knowing about Bink 2.2′s integer DCTs. :)

This subject is fairly technical, and I’m going to proceed at a somewhat higher speed than usual; part of the reason for this post is just as future reference material for myself.

### DCTs

There’s several different types of discrete cosine transforms, or DCTs for short. All of them are linear transforms on N-dimensional vectors (i.e. they can be written as a N×N matrix), and they all relate to larger real DFTs (discrete Fourier transforms) in some way; the different types correspond to different boundary conditions and different types of symmetries. In signal processing and data compression, the most important variants are what’s called DCT-II through DCT-IV. What’s commonly called “the” DCT is the DCT-II; this is the transform used in, among other things, JPEG, MPEG-1 and MPEG-2. The DCT-III is the inverse of the DCT-II (up to scaling, depending on how the two are normalized; there’s different conventions in use). Thus it’s often called “the” IDCT (inverse DCT). The DCT-IV is its own inverse, and forms the basis for the MDCT (modified DCT), a lapped transform that’s at the heart of most popular perceptual audio codecs (MP3, AAC, Vorbis, and Opus, among others). DCTs I and V-VIII also exist but off the top of my head I can’t name any signal processing or data compression applications that use them (which doesn’t mean there aren’t any).

Various theoretical justifications exist for the DCT’s use in data compression; they are quite close to the optimal choice of orthogonal basis for certain classes of signals, the relationship to the DFT means there are fast (FFT-related) algorithms available to compute them, and they are a useful building block for other types of transforms. Empirically, after 25 years of DCT-based codecs, there’s little argument that they work fairly well in practice too.

Image (and video) codecs operate on 2D data; like the DFT, there are 2D versions of all DCTs that are separable: a 2D DCT of a M×N block decomposes into N 1D M-element DCTs on the columns followed by M N-element DCTs on the rows (or vice versa). JPEG and the MPEGs up to MPEG-4 ASP use 2D DCTs on blocks of 8×8 pixels. H.264 (aka MPEG-4 AVC) initially switched to 4×4 blocks, then added the 8×8 blocks back in as part of the “High” profile later. H.265 / HEVC has both sizes and also added support for 16×16 and 32×32 transform blocks. Bink 2 sticks with 8×8 DCTs on the block transform level.

### Algorithms

There are lots of different DCT algorithms, and I’m only going to mention a few of them. Like the DFT, DCTs have a recursive structure that allows them to be implemented in O(N log N) operations, instead of the O(N2) operations a full matrix-vector multiply would take. Unlike the FFT, which decomposes into nothing but smaller FFTs except for some multiplications by scalars (“twiddle factors”) plus a permutation, the recursive factorizations of one DCT type will usually contain other trigonometric transforms – both smaller DCTs and smaller DST (discrete sine transforms) of varying types.

The first important dedicated DCT algorithm is presented in [Chen77] (see references below), which provides a DCT factorization for any power-of-2 N which is substantially faster than computing the DCT using a larger FFT. Over a decade later, [LLM89] describes a family of minimum-complexity (in terms of number of arithmetic operations) solutions for N=8 derived using graph transforms, a separate algorithm that has no more than one multiply along any path (this is the version used in the IJG JPEG library), and a then-new fast algorithm for N=16, all in a 4-page paper. [AAN88] introduces a scaled algorithm for N=8 which greatly reduces the number of multiplications. A “scaled” DCT is one where the output coefficients have non-uniform scaling factors; compression applications normally follow up the DCT with a quantization stage and precede the IDCT with a dequantization stage. The scale factors can be folded into the quantization / dequantization steps, which makes them effectively “free”. From today’s perspective, the quest for a minimal number of multiplies in DCT algorithms seems quaint; fast, pipelined multipliers are available almost everywhere these days, and today’s mobile phone CPUs achieve floating point throughputs higher than those of 1989′s fastest supercomputers (2012 Nexus 4: 2.8 GFLOPS measured with RgbenchMM; Cray-2: 1.9 GFLOPS peak – let that sink in for a minute). That said, the “scaled DCT” idea is important for other reasons.

There’s also direct 2D methods that can reduce number of arithmetic operations further relative to a separable 1D implementation; however, the overall reduction isn’t very large, the 2D algorithms require considerably more code (instead of 2 loops processing N elements each, there is a single “unrolled” computation processing N2 elements), and they have data flow patterns that are unfriendly to SIMD or hardware implementations (separable filters, on the other hand, are quite simple to implement in a SIMD fashion).

For 1D DCTs and N=8, the situation hasn’t substantially changed since the 1980s. Larger DCTs (16 and up) have seen some improvement on their arithmetic operation costs in recent years [Plonka02] [Johnson07], with algorithms derived symbolically from split-radix FFTs (though again, how much of a win the better algorithms are heavily depends on the environment).

### Building blocks

Independent of the choice of DCT algorithm, they all break down into the following 3 basic components:

• Butterflies. A butterfly is the transform $(a, b) \mapsto (a + b, a - b)$ (they’re called that way because of how they’re drawn in diagram form). A butterfly is also its own inverse, up to a scale factor of two, since $(a+b) + (a-b) = 2a$ and likewise $(a+b) - (a-b) = 2b$.
• Planar rotations. Take a pair of values, interpret them as coordinates in the plane, and rotate them about the origin. I’ve written about them (even in the context of DCTs) before. The inverse of a rotation by θ radians is a rotation by -θ radians. There’s also planar reflections which are closely related to rotations and work out pretty much the same on the implementation side. Reflections are self-inverse, and in fact a butterfly is just a reflection scaled by $\sqrt{2}$.
• Scalar multiplies. Map $a \mapsto ca$ for some nonzero constant c. The inverse is scaling by 1/c.

There are also a few properties of the DCT-II that simplify the situation further. Namely, the DCT-II (when properly normalized) is an orthogonal transform, which means it can be decomposed purely into planar rotations and potentially a few sign flips at the end (which can be represented as scalar multiplications). Butterflies, being as they are a scaled reflection, are used because they are cheaper than a “proper” rotation/reflection, but they introduce a scale factor of $\sqrt{2}$. So scalar multiplications are used to normalize the scaling across paths that go through a different number of butterflies; but the DCT algorithms we’re considering here only have scaling steps at the end, which can conveniently be dropped if a non-normalized (scaled) DCT is acceptable.

What this all boils down to is that a DCT implementation basically boils down to which planar rotations are performed when, and which of the various ways to perform rotations are employed. There are tons of tricky trade-offs here, and which variant is optimal really depends on the target machine. Older standards (JPEG, the earlier MPEGs) didn’t specify exactly which IDCT was to be used by decoders, leaving this up to the implementation. Every codec had its own DCT/IDCT implementations, which caused problems when a video was encoded using codec A and decoded using codec B – basically, over time, the decoded image could drift from what the encoder intended. Some codecs even had multiple DCT implementations with different algorithms (say one for MMX, and another for SSE2-capable CPUs) so this problem occurred even between different machines using the same codec. And of course there’s the general problem that the “real” DCT involves irrational numbers as coefficients, so there’s really no efficient way to compute it exactly for arbitrary inputs.

### Integer transforms

The solution to all these problems is to pick a specific IDCT approximation and specify it exactly – preferably using (narrow) integer operations only, since floating-point is expensive in hardware implementations and getting 100% bit-identical results for floating-point calculations across multiple platforms and architectures is surprisingly hard in software.

So what denotes a good integer DCT? It depends on the application. As the title says, this post is about the 8×8 integer IDCT (and matching DCT) design used for Bink 2.2. We had the following requirements:

• Bit-exact implementation strictly required (Bink 2 will often go hundreds or even thousands of frames without a key frame). Whatever algorithm we use, it must be exactly the same on all targets.
• Must be separable, i.e. the 2D DCT factors into 1D DCT steps.
• Must be a close approximation to the DCT. The goal was to be within 2% in terms of the matrix 2-norm, and same for the coding gain compared to a “proper” DCT. Bink 2 was initially designed with a full-precision floating-point (or high-precision fixed point) DCT in mind and we wanted a drop-in solution that wouldn’t affect the rest of the codec too much.
• Basis vectors must be fully orthogonal (or very close) to allow for trellis quantization.
• Must have an efficient SIMD implementation on all of the following:
• x86 with SSE2 or higher in both 32-bit and 64-bit modes.
• ARM with NEON support.
• PowerPC with VMX/AltiVec (e.g. PS3 Cell PPU).
• Cell SPUs (PS3 again).
• PowerPC with VMX128 (Xbox 360 CPU).

Yep, lots of game consoles – RAD’s a game middleware company.

• 16-bit integer preferred, to enable 8-wide operation on 128-bit SIMD instruction sets. To be more precise, for input values in the range [-255,255], we want intermediate values to fit inside 16 bits (signed) through all stages of the transform.

This turns out to limit our design space greatly. In particular, VMX128 removes all integer multiplication operations from AltiVec – integer multiplication by scalars, where desired, has to be synthesized from adds, subtractions and shifts. This constraint ended up driving the whole design.

There’s other integer DCTs designed along similar lines, including several used in other video codecs. However, while there’s several that only use 16-bit integer adds and shifts, they are generally much coarser approximations to the “true” DCT than what we were targeting. Note that this doesn’t mean they’re necessarily worse for compression purposes, or that the DCT described here is “better”; it just means that we wanted a drop-in replacement for the full-precision DCT, and the transform we ended up with is a closer to that goal than published variants with similar complexity.

### Building an integer DCT

Most of the integer DCT designs I’m aware of start with a DCT factorization; the Bink 2 DCT is no exception, and starts from the DCT-II factorizations given in [Plonka02]. For N=8, this is equivalent to the factorization described in [LLM89]: by reordering the rows of the butterfly matrices in Plonka’s factorization and flipping signs to turn rotation-reflections into pure rotations, the DCT-II factorization from example 2.8 can be expressed as the butterfly diagram (click to zoom)

which corresponds to the DCT variation on the left of the 2nd row in figure 4 of [LLM89], with the standard even part. However, [LLM89] covers only N=8, N=16 and (implicitly) N=4, whereas [Plonka02] provides factorizations for arbitrary power-of-2 N, making it easy to experiment with larger transform sizes with a single basic algorithm. UPDATE: The previous version of the diagram had a sign flip in the rightmost butterfly. Thanks to “terop” for pointing this out in the comments!

Now, we’re allowing a scaled DCT, so the final scale factors of $\pm \sqrt{2}$ can be omitted. As explained above, this leaves butterflies (which have simple and obvious integer implementations) and exactly three rotations, one in the even part (the top half, called “even” since it produces the even-numbered DCT coefficients) and two in the odd part:

As customary for DCT factorizations, $c_k = \cos(k \pi / 2N)$ and $s_k = \sin(k \pi / 2N)$. Note that due to the corresponding trig identities, we have $c_{-k} = c_k$, $s_{-k} = -s_k$, $c_{N-k} = s_k$ and $s_{N-k} = c_k$, which means that the three rotations we see here (and indeed all rotations referenced in [LLM89]) can be expressed in terms of just $c_1$, $s_1$, $c_2$, $s_2$, $c_3$ and $s_3$.

Now, all we really need to do to get an integer DCT is to pick integer approximations for these rotations (preferably using only adds and shifts, see the constraints above!). One way to do so is the BinDCT [Liang01], which uses the decomposition of rotation into shears I explained previously and then approximates the shears using simple dyadic fractions. This is a cool technique, but yields transforms with basis vectors that aren’t fully orthogonal; how big the non-orthogonality is depends on the quality of approximation, with cheaper approximations being less orthogonal. While not a show-stopper, imperfect orthogonality violates the assumptions inherent in trellis quantization and thus reduces the quality of the rate-distortion optimization performed on the encoder side.

### Approximating rotations

So how do we approximate irrational rotations as integers (preferably small integers) with low error? The critical insight is that any matrix of the form

$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \quad c^2 + s^2 = 1$

is a planar rotation about some angle θ such that $c=\cos \theta, s=\sin \theta$. And more generally, any matrix of the form

$\begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \quad c^2 + s^2 \ne 0$

is a scaled planar rotation. Thus, if we’re okay with a scaled DCT (and we are), we can just pick two arbitrary integers c, s that aren’t both zero and we get a scaled integer rotation (about some arbitrary angle). To approximate a particular rotation angle θ, we want to ensure that $s/c \approx (\sin\theta)/(\cos\theta) = \tan\theta$. Since we’d prefer small integer values s and c, we can find suitable approximations simply by enumerating all possibilities with a suitable ratio, checking how closely they approximate the target angle, and using the best one. This is easily done with a small program, and by itself sufficient to find a good solution for the rotation in the even part of the transform.

The odd part is a bit trickier, because it contains two rotations and butterflies that combine the results from different rotations. Therefore, if the two rotations were scaled differently, such a butterfly will result not in a scaled DCT, but a different transformation altogether. So simply approximating the rotations individually won’t work; however, we can approximate the two rotations $(c_1,s_1), (c_2,s_2)$ simultaneously, and add the additional constraint that $c_1^2 + s_1^2 = c_2^2 + s_2^2$ (somewhat reminiscent of Pythagorean triples), thus guaranteeing that the norm of the two rotations is the same. Initially I was a bit worried that this might over-constrain the problem, but it turns out that even with this extra complication, there are plenty of solutions to choose from. As before, this problem is amenable to an exhaustive search over all small integers that have the right ratios between each other.

Finally, we need implementations of the resulting rotations using only integer additions/subtractions and shifts. Since we’re already fully integer, all this means is that we need to express the integer multiplications using only adds/subtracts and shifts. This is an interesting and very tricky problem by itself, and I won’t cover it here; see books like Hacker’s Delight for a discussion of the problem. I implemented a simple algorithm that just identifies runs of 0- and 1-bits in the binary representation of values. This is optimal for small integers (below 45) but not necessarily for larger ones, but it’s usually fairly good. And since the program already tries out a bunch of variants, I made it compute costs for the different ways to factor planar rotations as well.

The result is findorth.cpp, a small C++ program that finds candidate small integer approximations to the DCT rotations and determines their approximate cost (in number of adds/subtractions and shifts) as well. Armed with this and the rest of the factorization above, it’s easy to come up with several families of integer DCT-like transforms at various quality and cost levels and compare them to existing published ones:

Variant (c2,s2) (c1,s1) (c3,s3) Cost L2 err Gain (dB)
A1 (17,-7)/16 (8,-1)/8 (7,4)/8 32A 10S 0.072 8.7971
B1 (5,-2)/4 (8,-1)/8 (7,4)/8 30A 10S 0.072 8.7968
A2 (17,-7)/16 (19,-4)/16 (16,11)/16 38A 12S 0.013 8.8253
B2 (5,-2)/4 (19,-4)/16 (16,11)/16 36A 12S 0.013 8.8250
A3 (17,-7)/16 (65,-13)/64 (55,37)/64 42A 15S 0.003 8.8258
B3 (5,-2)/4 (65,-13)/64 (55,37)/64 40A 15S 0.012 8.8255
Real DCT - - - - 0.000 8.8259
BinDCT-L1 - - - 40A 22S 0.013 8.8257
H.264 - - - 32A 10S 0.078 8.7833
VC-1 - - - ? 0.078 8.7978

The cost is measured in adds/subtracts (“A”) and shifts (“S”); “L2 err” is the matrix 2-norm of the difference between the approximated DCT matrix and the true DCT, and “gain” denotes the coding gain assuming a first-order Gauss-Markov process with autocorrelation coefficient ρ=0.95 (a standard measure for the quality of transforms in image coding).

These are by far not all solutions that findorth finds, but they’re what I consider the most interesting ones, i.e. they’re at good points on the cost/approximation quality curve. The resulting transforms compare favorably to published transforms at similar cost (and approximation quality) levels; Bink 2.2 uses variant “B2″, which is the cheapest one that met our quality targets. The Octave code for everything in this post is available on Github.

The H.264 transform is patented, and I believe the VC-1 transform is as well. The transforms described in this post are not (to the best of my knowledge), and RAD has no intention of ever patenting them or keeping them secret. Part of the reason for this post is that we wanted to make an integer DCT that is of similar (or higher) quality and comparable cost than those in existing video standards freely available to everyone. And should one of the particular transforms derived in this post turn out to be patent-encumbered, findorth provides everything necessary to derive a (possibly slightly worse, or slightly more expensive) alternative. (Big thanks to my boss at RAD, Jeff Roberts, for allowing and even encouraging me to write all this up in such detail!)

This post describes the basic structure of the transform; I’ll later post a follow-up with the details of our 16-bit implementation, and the derivation for why 16 bits are sufficient provided input data is in the range of [-255,255].

### References

[Chen77]
Chen, Wen-Hsiung, C. H. Smith, and Sam Fralick. “A fast computational algorithm for the discrete cosine transform.” Communications, IEEE Transactions on 25.9 (1977): 1004-1009. PDF
[AAN88]
Yukihiro, Arai, Agui Takeshi, and Masayuki Nakajima. “A fast DCT-SQ scheme for images.” IEICE TRANSACTIONS (1976-1990) 71.11 (1988): 1095-1097.
[LLM89]
Loeffler, Christoph, Adriaan Ligtenberg, and George S. Moschytz. “Practical fast 1-D DCT algorithms with 11 multiplications.” Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on. IEEE, 1989. PDF
[Liang01]
Liang, Jie, and Trac D. Tran. “Fast multiplierless approximations of the DCT with the lifting scheme.” Signal Processing, IEEE Transactions on 49.12 (2001): 3032-3044. PDF
[Plonka02]
Plonka, Gerhard, and Manfred Tasche. “Split-radix algorithms for discrete trigonometric transforms.” (2002). PDF
[Johnson07]
Johnson, Steven G., and Matteo Frigo. “A modified split-radix FFT with fewer arithmetic operations.” Signal Processing, IEEE Transactions on 55.1 (2007): 111-119. PDF

A lot of bit manipulation operations exist basically everywhere: AND, OR, XOR/EOR, shifts, and so forth. Some exist only on some architectures but have obvious implementations everywhere else – NAND, NOR, equivalence/XNOR and so forth.

Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they’re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).

And then there’s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it’s not always obvious how.

### x86 bit scanning operations

On x86, there’s BSF (bit scan forward), BSR (bit scan reverse), TZCNT (trailing zero count) and LZCNT (leading zero count). The first two are “old” (having existed since 386), the latter are very new. BSF and TZCNT do basically the same thing, with one difference: BSF has “undefined” results on a zero input (what is the index of the first set bit in the integer 0?), whereas the “trailing zeros” definition for TZCNT provides an obvious answer: the width of the register. BSR returns the index of the “last” set bit (again undefined on 0 input), LZCNT returns the number of leading zero bits, so these two actually produce different results.

• x86_bsf(x) = x86_tzcnt(x) unless x=0 (in which case bsf is undefined).
• x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x) unless x=0 (in which case bsr is undefined).

In the following, I’ll only use LZCNT and TZCNT (use above table to convert where necessary) simply to avoid that nasty edge case at 0.

### PowerPC bit scanning operations

PPC has cntlzw (Count Leading Zeros Word) and cntlzd (Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there’s the old trick x & -x which clears all but the lowest set bit in x. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.

• x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x & -x) unless x=0 (in which case the PPC expression returns -1).
• x86_lzcnt(x) = ppc_cntlz(x).

### ARM bit scanning operations

ARM also has CLZ (Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned RBIT which reverses the bits in a word, which makes a trailing zero count easy to accomplish:

• x86_tzcnt(x) = arm_clz(arm_rbit(x)).
• x86_lzcnt(x) = arm_clz(x).

### Bonus

Finally, ARMs NEON also offers VCLS (Vector Count Leading Sign Bits), which (quoting from the documentation) “counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit”. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have: arm_cls(x) = x86_lzcnt(x ^ (x >> 1)) - 1 (the shift here is an arithmetic shift). The expression y = x ^ (x >> 1) gives a value that has bit n set if and only if bits n and n + 1 of x are the same. By induction, the number of leading zeros in y is thus exactly the number of leading bits in x that match the sign bit. This count includes the topmost (sign) bit, so it’s always at least 1, and the instruction definition I just quoted requires us to return the number of bits following the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we’re good.

Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.

Several long posts in the works; in the meantime, have this:

Bob, just out of the shower and still drowsy, gets a call: his dad just died. Stunned, he calls a few of his other relatives. 30 minutes later, he shows up at work, to meet his boss: “You’re late again, Bob. You had two warnings. That’s it – you’re fired.”

And so forth. If this is a story about Bob, these are high emotional stakes: Bob’s probably having one of the worst days of his life right now. This is his personal apocalypse, and we’re immersed in it, feeling every blow.

Zoom out. A thousand people in the city got the same call that morning. It’s a virus outbreak! Bob might still be one of the protagonists – the character the audience is supposed to identify with. He’s having a bad day, maybe the worst day of his life, sure – but he’s expected to get over it and somehow help the trained virologists through a series of contrived plot twists and stop the epidemic from spreading further.

Zoom out. Maybe there’s no virus outbreak. But there’s a giant asteroid heading for earth! In this kind of story, Bob might still be the everyman character who’s having a really bad day. And he will still get drawn into the plot somehow. We may be supposed to feel sorry for him, or the story might play what’s happening to Bob for laughs, but either way, there are bigger things at stake here! Get a grip, Bob!

Drama is not scale-invariant. Depending on the scope, an individual goes from the center of the universe in the story, to a piece on a chess board, to a bacterium under the microscope, and ultimately becomes completely invisible.

This, in short, is the problem I have with most time-travel stories.

Once your story widens its scope so far that the very substance of space and time takes center stage and there’s a series of delicate (or not-so-delicate) manipulations to make the “right” universe, the one where everyone goes the way it “should” be, be the “right” one (whatever that is supposed to mean), the characters and the world they inhabit become set dressing. You have put the entire universe in a neat little box, labeled it “the universe where Tim dies”, and then you zoom out and look at the tree of all possible timelines and how they branch, and you start rearranging things until you can’t see “the universe where Tim dies” in the part of the timeline that you care anymore.

In short, you are now writing a story about rebalancing a tree data structure. There happen to be people and so on in it, and you may even care about them, but at this point the scale and the stakes are so all-encompassing that it’s hard to really care, and all that’s left is a bunch of neat abstractions.

I like abstractions. I like writing abstractions, and writing about abstractions. I have written a multi-part series on multiple variants of the half-edge data structure on this very blog! And these kinds of stories tickle the part of my brain that likes abstractions and manipulating them.

But they also leave the part of my brain that is social and thinks about humans and their relationships oddly cold and unsatisfied, and I think that’s an inherent problem of this type of story, and I don’t think it can be fixed.

Another Android post. As you can tell by the previous post, I’ve been doing some testing on Android lately, and before that came building, deploying, and testing/debugging. As part of the latter, I was trying to get ndk-gdb to work, on which I spent about one and a half day full-time (without success), and then later about as much time waiting (and sometimes answering questions) when some Android devs on Twitter took pity on me and helped me figure the problem out. Since I found no mention of this issue in the usual places (Stack Overflow etc.), I’m writing it up here in case someone else runs into the same issues later.

### First problem: Android 4.3

The symptom here is that ndk-gdb won’t be able to successfully start the target app at all, and dies pretty early with an error. “Package is unknown”, “data directory not found” or something similar. This is apparently something that got broken in the update to Android 4.3 – see this issue in the Android bug tracker. If you run into this problem with Android 4.3 running on an “official” Google device (i.e. the Nexus-branded ones), it can be fixed by flashing the device to use the Google-provided Factory Images with Android 4.3. If you’re not on a Nexus device – well, sucks to be you; there’s some other workarounds mentioned in the issue, and Google says that the problem will be fixed in a “future release”, but for now you’re out of luck.

### Second problem: “Waiting for Debugger” – NDK r9 vs. JDK 7

The second problem occurs a bit further down the line: ndk-gdb actually manages to launch the app on the phone and gets into gdb, but the app then freezes showing a “Waiting for Debugger” screen and won’t continue no matter what you do. Note that there are lots of ways to get stuck at that screen, see Stack Overflow and the like; in particular, if you see that screen even when launching the app directly on the Android device (instead of starting it via ndk-gdb --start or ndk-gdb --launch on the host), this is a completely different problem and what I’m describing here doesn’t apply.

Anyway, this one took ages to figure out. After about two days (when I had managed to find the original problem I was trying to debug on a colleague’s machine, where ndk-gdb worked), I realized that everything seemed to work fine on his machine, which had an older Android NDK, but did not work on two of my machines, which were both using NDK r9. So I went over the change log for r9 to check if there was anything related to ndk-gdb, and indeed, there was this item:

• Updated ndk-gdb script so that the --start or --launch actions now wait for the GNU Debug Server, so that it can more reliably hit breakpoints set early in the execution path (such as breakpoints in JNI code). (Issue 41278)

Note: This feature requires jdb and produces warning about pending breakpoints. Specify the --nowait option to restore previous behavior.

Aha! Finally, a clue. So I tried running my projects with ndk-gdb --start --nowait, and indeed, that worked just fine (in retrospect, I should have searched for a way to disable the wait sooner, but hindsight is always 20/20). That was good enough for me, although it meant I didn’t get to enjoy the fix for the Android issue cited in the change log. This is annoying, but not hard to work around: just sleep for a few seconds early in your JNI main function to give the debugger time to attach. I was still curious about what’s going on though, but I had absolutely no clue how to proceed from there – digging into it any further would’ve required knowledge of internals that I just didn’t have.

This is when an Android developer on Twitter offered to step in and see if he could figure it out for me; all I had to do was give him some debug logs. Fair enough! And today around noon, he hit paydirt.

Okay, so here’s the problem: as the change log entry notes, the “wait for debugger” is handled in the Java part of the app and goes through jdb (the Java debugger), whereas the native code side is handled by gdbserver and gdb. And the problem was on the Java side of the fence, which I really don’t know anything about. Anyway, I could attach jdb just fine (and run jdb commands successfully), but the wait dialog on the Android phone just wouldn’t go away no matter what I did. It turns out that the problem was caused by me using JDK 7, when Android only officially supports JDK 6. Everything else that I’ve tried worked fine, and none of the build (or other) Android SDK scripts complained about the version mismatch, but apparently on the Android side, things won’t work correctly if a JDK 7 version of jdb tries to connect. And while you’re at it, make sure you’re using a 32-bit JDK too, even if you’re on a 64-bit machine; I didn’t make that mistake, but apparently that one can cause problems too. After I switched to the 32-bit JDK6u38 from here (the old Java JDK site, which unlike the new Oracle-hosted site won’t make you create an user account if you want to download old versions) things started working: I can now use ndk-gdb just fine, and it properly waits for the debugger to attach so I can set breakpoints as early as I like without resorting to the sleep hack.

### Summary (aka TL;DR)

Use Android 4.2 or older, or flash to the “factory image” if you want native debugging to even start.

If you’re using the NDK r9, make sure you’re using a 32-bit JDK 6 (not 7) or you might get stuck at the “Waiting for Debugger” prompt indefinitely.

Thanks to Branimir Karadžić for pointing out the first issue to me (if I hadn’t known that it was a general Android 4.3 thing, I would’ve wasted a lot of time on this), and huge thanks to Justin Webb for figuring out the second one!

Well done, Android. The Enrichment Center once again reminds you that Android Hell is a real place where you will be sent at the first sign of defiance.

RAD Game Tools, my employer, recently (version 2.2e) started shipping Bink 2.2 on Android, and we decided it was time to go over the example texture upload code in our SDK and see if there was any changes we should make – the original code was written years ago. So far, the answer seems to be “no”: what we’re doing seems to be about as good as we can expect when sticking to “mainstream” GL ES, reasonably widely-deployed extensions and the parts of Android officially exposed in the NDK. That said, I have done a bunch of performance measurements over the past few days, and they paint an interesting (if somewhat depressing) picture. A lot of people on Twitter seemed to be interested in my initial findings, so I asked my boss if it was okay if I published the “proper” results here and he said yes – hence this post.

### Setting

Okay, here’s what we’re measuring: we’re playing back a 1280×720 29.97fps Bink 2 video – namely, an earlier cut of this trailer that we have a very high-quality master version of (it’s one of our standard test videos); I’m sure the exact version of the video we use is on the Internet somewhere too, but I didn’t find it within the 2 minutes of googling, so here goes. We’re only using the first 700 frames of the video to speed up testing (700 frames is enough to get a decent sample).

Like most popular video codecs, Bink 2 produces output data in planar YUV, with the U/V color planes sub-sampled 2x both horizontally and vertically. These three planes get uploaded as 3 separate textures (which together form a “texture set”): one 1280×720 texture for luminance (Y) and two 640×360 textures for chrominance (Cb/Cr). (Bink and Bink 2 also support encoding an alpha channel, which adds another 1280×720 texture to the set). All three textures use the GL_LUMINANCE pixel format by default, with GL_UNSIGNED_BYTE data; that is, one byte per texel. This data is converted to RGB using a simple fragment shader.

Every frame, we upload new data for all 3 textures in a set using glTexSubImage2D from the internal video frame buffers, uploading the entire image (we could track dirty regions fairly easily, but with slow uploads this increases frame rate variance, which is a bad thing). We then draw a single quad using the 3 textures and our fragment shader. All very straightforward stuff.

Furthermore, we actually keep two texture sets around – everything is double-buffered. You will see why this is a good idea despite the increased memory consumption in a second.

### A wrinkle

One problem with GL ES targets is that the original GL ES went a bit overboard in removing core GL features. One important feature they removed was GL_UNPACK_ROW_LENGTH – this parameter sets the distance between adjacent rows in a client-specified image, counted in pixels. Why would you care about this? Simple: Say you have a 256×256 texture that you want to update from a system memory copy, but you know that you only changed the lower-left 128×128 pixels. By default, glTexSubImage2D with width = height = 128 will assume that the rows of the source image are 128 pixels wide and densely packed. Thus, to update just a 128×128 pixel region, you would have to either copy the lower left 128×128 pixels of your system memory texture into a smaller array that is densely packed, or call glTexSubImage2D 128 times, uploading a row at a time. Neither of these is very appealing from a performance perspective. But if you have GL_UNPACK_ROW_LENGTH, you can just set it to 256 and upload everything with a single call. Much nicer.

The reason Bink 2 needs this is because we support arbitrary-width videos, but like most video codecs, the actual coding is done in terms of larger units. For example, MPEG 1 through to H.264 all use 16×16-pixel “macroblocks”, and any video that is not a multiple of 16 pixels will get padded out to a multiple-of-16-size internally. Even if you didn’t need the extra data in the codec, you would still want adjacent rows in the plane buffers to be multiples of at least 16 pixels, simply so that every row is 16-byte aligned (an important magic number for a lot of SIMD instruction sets). Bink 2′s equivalent of macroblocks is 32×32 pixels in size, so we internally want rows to be a multiple of 32 pixels wide.

What this all means is that if you decide you really want a 65×65 pixel video, that’s fine, but we’re going to allocate our internal buffers as if it was 96 pixels wide (and 80 pixels tall – we can omit storage for the last 16 rows in the last macroblock). Which is where the unpack row length comes into play – if we have it, we can support “odd-sized” videos efficiently; if we don’t, we have to use the slower fallback, i.e. call glTexSubImage2D for every scan line individually. Luckily, there is the GL_EXT_unpack_subimage GL ES extension that adds this feature back in and is available on most recent devices; but for “odd” sizes on older devices, we’re stuck with uploading a row at a time.

That said, none of this affects our test video, since 1280 pixels width is a multiple of 32; I just though I’d mention it anyway since it’s one of random, non-obvious API compatibility issues you run into. Anyway, back to the subject.

Okay, so here’s what I did: Bink 2 decodes the video on another (or multiple other) threads. Periodically – ideally, 30 times a second – we upload the current frame and draw it to the screen. My test program will never drop any frames; in other words, we may run slower than 30fps, but we will always upload and render all 700 frames, and we will never run faster than 30fps (well, 29.997fps, but close enough).

Around the texture upload, my test program does this:

    // update the GL textures
clock_t start = clock();
Update_Bink_textures( &TextureSet, Bink );
clock_t total = clock() - start;

upload_stats.record( (float) ( 1000.0 * total / CLOCKS_PER_SEC ) );


where upload_stats is an instance of the RunStatistics class I used in the Optimizing Software Occlusion Culling series. This gives me order statistics, mean and standard deviation for the texture update times, in milliseconds.

I also have several different test variants that I run:

• GL_LUMINANCE tests upload the texture data as GL_LUMINANCE as explained above. This is the “normal” path.
• GL_RGBA tests upload the same bytes as a GL_RGBA texture, with all X coordinates (and the texture width) divided by 4. In other words, they transfer the same amount of data (and in fact the same data), just interpreted differently. This was done to check whether RGBA textures enjoy special optimizations in the drivers (spoiler: it seems like they do).
• use1x1 tests force all glTexSubImage2D calls to upload just 1×1 pixels – in other words, this gives us the cost of API overhead, possible synchronization and texture ghosting while virtually removing any per-pixel costs (such as CPU color space conversion, swizzling, DMA transfers or memory bandwidth).
• nodraw tests do all of the texture uploading, but then don’t actually draw the quad. This still measures processing time for the texture upload, but since the texture isn’t actually used, no synchronization or ghosting is ever necessary.
• uploadall uses glTexImage2D instead of glTexSubImage2D to upload the whole texture. In theory, this will guarantee to the driver that all existing texture data is overwritten – so while texture ghosting might still have to allocate memory for a new texture, it won’t have to copy the old contents at least. In practice, it’s not clear if the drivers actually make use of that fact. For obvious reasons, this and use1x1 are mutually exclusive, and I only ran this test on the PowerVR device.

### Results

So, without further ado, here’s the results on the 4 devices I tested: (apologies for the tiny font size, but that was the only way to squeeze it into the blog layout)

Device / GPU Format min 25th med 75th max avg sdev
2010 Droid X (PowerVR SGX 530) GL_LUMINANCE 14.190 15.472 17.700 20.233 70.893 19.704 5.955
GL_RGBA 11.139 13.245 14.221 14.832 28.412 14.382 1.830
GL_LUMINANCE use1x1 0.061 38.269 39.398 41.077 93.750 41.905 6.517
GL_RGBA use1x1 0.061 30.761 32.348 32.837 59.906 33.165 4.305
GL_LUMINANCE nodraw 9.979 12.726 13.427 14.985 29.632 13.854 1.788
GL_RGBA nodraw 5.188 10.376 11.291 12.024 26.215 10.864 2.013
GL_LUMINANCE use1x1 nodraw 0.030 0.061 0.061 0.092 0.733 0.086 0.058
GL_RGBA use1x1 nodraw 0.030 0.061 0.061 0.091 0.916 0.082 0.081
GL_LUMINANCE uploadall 13.611 15.106 17.822 19.653 73.944 19.312 6.145
GL_RGBA uploadall 7.171 12.543 13.489 14.282 34.119 13.751 1.854
GL_LUMINANCE uploadall nodraw 9.491 12.756 13.702 14.862 33.966 13.994 2.176
GL_RGBA uploadall nodraw 5.158 9.796 10.956 11.718 22.735 10.465 2.135
2012 Nexus 7 (Nvidia Tegra 3) GL_LUMINANCE 6.659 7.706 8.710 10.627 18.842 9.597 2.745
GL_RGBA 3.278 3.600 4.128 4.906 9.244 4.395 1.011
GL_LUMINANCE use1x1 0.298 0.361 0.421 0.567 1.843 0.468 0.151
GL_RGBA use1x1 0.297 0.354 0.422 0.561 1.687 0.468 0.152
GL_LUMINANCE nodraw 6.690 7.674 8.669 9.815 24.035 9.495 2.929
GL_RGBA nodraw 3.208 3.501 3.973 5.974 12.059 4.737 1.589
GL_LUMINANCE use1x1 nodraw 0.295 0.360 0.413 0.676 1.569 0.520 0.204
GL_RGBA use1x1 nodraw 0.270 0.327 0.404 0.663 1.946 0.506 0.234
2013 Nexus 7 (Qualcomm Adreno 320) GL_LUMINANCE 0.732 0.976 1.190 3.907 22.249 2.383 1.879
GL_RGBA 0.610 0.824 0.977 3.510 13.368 2.163 1.695
GL_LUMINANCE use1x1 0.030 0.061 0.061 0.091 3.143 0.080 0.187
GL_RGBA use1x1 0.030 0.061 0.091 0.092 4.303 0.104 0.248
GL_LUMINANCE nodraw 0.793 1.098 3.570 4.425 25.760 3.001 2.076
GL_RGBA nodraw 0.732 0.916 1.038 3.937 26.370 2.416 2.190
GL_LUMINANCE use1x1 nodraw 0.030 0.061 0.091 0.092 4.181 0.090 0.204
GL_RGBA use1x1 nodraw 0.030 0.061 0.091 0.122 4.272 0.114 0.292
2012 Nexus 10 (ARM Mali T604) GL_LUMINANCE 1.292 2.782 3.590 4.439 16.893 3.656 1.256
GL_RGBA 1.451 2.782 3.432 4.358 8.517 3.551 0.982
GL_LUMINANCE use1x1 0.193 0.284 0.369 0.670 17.598 0.862 2.230
GL_RGBA use1x1 0.100 0.147 0.199 0.313 20.896 0.656 2.349
GL_LUMINANCE nodraw 1.314 2.179 2.320 2.823 10.677 2.548 0.700
GL_RGBA nodraw 1.209 2.101 2.196 2.539 5.008 2.414 0.553
GL_LUMINANCE use1x1 nodraw 0.190 0.294 0.365 0.601 2.113 0.456 0.228
GL_RGBA use1x1 nodraw 0.094 0.119 0.162 0.288 2.771 0.217 0.162

Yes, bunch of raw data, no fancy graphs – not this time. Here’s my observations:

• GL_RGBA textures are indeed a good deal faster than luminance ones on most devices. However, the ratio is not big enough to make CPU-side color space conversion to RGB (or even just interleaving the planes into an RGBA layout on the CPU side) a win, so there’s not much to do about it.
• Variability between devices is huge. Hooray for fragmentation.
• Newer devices tend to have fairly reasonable texture upload times, but there’s still lots of variation.
• Holy crap does the Droid X show badly in this test – it has both really slow upload times and horrible texture ghosting costs, and that despite us already alternating between a pair of texture sets! I hope that’s a problem that’s been fixed in the meantime, but since I don’t have any newer PowerVR devices here to test with, I can’t be sure.

So, to summarize it in one word: Ugh.

I originally intended to write something different for part 2 of this series, but since I’ve started rewriting that article no less than 3 times at this point, I’m just gonna switch the order of topics around a bit.

### Transpose from even/odd interleaves

I already showed one instance of this last time (“variant 2”) for the 4×4 case, but let’s go at it a bit more systematically. Since we’ve already beaten this size to back, let’s spice things up a bit and do 8×8 this time:

a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
d0 d1 d2 d3 d4 d5 d6 d7
e0 e1 e2 e3 e4 e5 e6 e7
f0 f1 f2 f3 f4 f5 f6 f7
g0 g1 g2 g3 g4 g5 g6 g7
h0 h1 h2 h3 h4 h5 h6 h7


“Variant 2” from last time was the version where we started by interleaving rows not with their immediate neighbors, but with the rows that were two steps away. The key here is that we have to do multiple interleave steps to complete the transpose, and with every even-odd interleave, we space the original elements of a vector further apart. So if we interleaved rows A and B in the first step, we would have a0 b0 after the first step, but a0 xx b0 xx as soon as we interleaved that with something else. That’s not what we want. So instead, we start by interleaving rows A and E – after a total of 3 passes, that should put e0 where it’s supposed to go, 4 elements from a0.

The same argument goes for the other rows, too—so let’s just do an even-odd interleave between the entire first half and the last half of our rows:

a0 e0 a1 e1 a2 e2 a3 e3
a4 e4 a5 e5 a6 e6 a7 e7
b0 f0 b1 f1 b2 f2 b3 f3
b4 f4 b5 f5 b6 f6 b7 f7
c0 g0 c1 g1 c2 g2 c3 g3
c4 g4 c5 g5 c6 g6 c7 g7
d0 h0 d1 h1 d2 h2 d3 h3
d4 h4 d5 h5 d6 h6 d7 h7


For the next step, we want to interleave the row containing a0 with the row containing the elements that needs to end up 2 places away from a0 in the final result – namely, c0, and similar for the other two rows. Which again boils down to doing an even-odd interleave between the entire first and second halves:

a0 c0 e0 g0 a1 c1 e1 g1
a2 c2 e2 g2 a3 c3 e3 g3
a4 c4 e4 g4 a5 c5 e5 g5
a6 c6 e6 g6 a7 c7 e7 g7
b0 d0 f0 h0 b1 d1 f1 h1
b2 d2 f2 h2 b3 d3 f3 h3
b4 d4 f4 h4 b5 d5 f5 h5
b6 d6 f6 h6 b7 d7 f7 h7


At this point we’re just one turn of the crank away from the result we want, so let’s go for it and do one more round of interleaves…

a0 b0 c0 d0 e0 f0 g0 h0
a1 b1 c1 d1 e1 f1 g1 h1
a2 b2 c2 d2 e2 f2 g2 h2
a3 b3 c3 d3 e3 f3 g3 h3
a4 b4 c4 d4 e4 f4 g4 h4
a5 b5 c5 d5 e5 f5 g5 h5
a6 b6 c6 d6 e6 f6 g6 h6
a7 b7 c7 d7 e7 f7 g7 h7


and that’s it, 8×8 matrix successfully transposed.

This is nothing I haven’t shown you already, although in a different order than before. This form here makes the underlying algorithm much clearer, and also generalizes in the obvious way to larger sizes, should you ever need them. But that’s not the reason I’m talking about this. Time to get to the fun part!

### Rotations

Let’s look a bit closer at how the elements move during every pass. For this purpose, let’s just treat all of the elements as a 64-element array. The first row contains the first 8 elements, the second row contains the second 8 elements, and so forth. a0 starts out in row zero and column zero, the first element of the array, and stays there for the entire time (boring!). b3 starts out in row 1, column 3 – that’s element number 11 (1×8 + 3). Now, the algorithm above simply applies the same permutation to the overall array 3 times. Let’s look at how the array elements move in one step—for reasons that will become clear in a second, I’ll give the array indices in binary:

 From To 000 000 → 000 000 000 001 → 000 010 000 010 → 000 100 000 011 → 000 110 ⋮ 001 000 → 010 000 001 001 → 010 010 001 010 → 010 100 ⋮ 100 000 → 000 001 100 001 → 000 011 100 010 → 000 101 100 011 → 000 111 ⋮ 101 010 → 010 101 ⋮

Since we have 8 rows and 8 columns, the first 3 and last 3 bits in each index correspond to the row and column indices, respectively. Anyway, can you see the pattern? Even-odd interleaving the first half of the array with the second half in effect performs a bitwise rotate left on the element indices!

Among other things, this instantly explains why it takes exactly three passes to transpose a 8×8 matrix with this approach: the row and column indices take 3 bits each. So after 3 rotate-lefts, we’ve swapped the rows and columns—which is exactly what a matrix transpose does. Another salient point is that repeating even-odd interleaves like this will return us to our starting arrangement after 6 passes. This is easy to see once we know that such a step effectively rotates the bits of the element index; it’s not at all obvious when looking at the permutation by itself.

But it goes further than that. For one thing, the even/odd interleave construction really works for any even number of elements; it certainly works for all powers of two. So we’re not strictly limited to square matrices here. Say we have a 4×8 matrix (4 rows, 8 columns). That’s 32 elements total, or a 5-bit element index, laid out in binary like r1r0c2c1c0, where r are row index bits and c are column index bits. After two interleave passes, we’re at c2c1c0r1r0, corresponding to the transposed layout—a 8×4 matrix. To go from there to the original layout, we would have to run three more passes, rotating everything by another 3 bits, back to where we started.

Which illustrates another interesting point: for non-square matrices, going in one direction using this method can be much cheaper than going in the other direction. That’s because the even/odd interleave step only gives us a “rotate left” operation, and sometimes the shorter path would be to “rotate right” instead. On some architectures the corresponding deinterleave operation is available as well (e.g. ARM NEON), but often it’s not, at least not directly.

### Groups

Let’s step back for a moment, though. We have an operation: even/odd interleave between the first and second halves of sequences with 2k elements, which is really just a particular permutation. And now we know that we can get the inverse operation via repeated interleaving. Which means that our even-odd interleave generates a group. Now, as long as we really only do even-odd interleaves on the complete sequence, this group is a cyclic group of order k—it has to be: it’s a finite group generated by a single element, ergo a cyclic group, and we already know how long the cycle is, based on the “rotate left” property.

So to make matters a bit more interesting, let’s get back to the original topic of this article! Namely, the SIMD bit. While it’s convenient to build a complete matrix transpose out of a single operation on all elements simultaneously, namely a very wide even/odd interleave, that’s not how the actual code looks. We have fixed-width registers, and to synthesize anything wider than that, we have to break the data down into smaller chunks and process them individually anyway. However, we do need to have full even/odd interleaves to get a permutation group structure, so we can’t allow using single “low” or “high” interleave instructions without their opposite half.

What kind of model do we end up with? Let’s list the ingredients. We have:

• A set of n SIMD registers, each k “elements” wide. We’ll assume that k is a power of two. What an “element” is depends on context; a 128-bit SIMD registers might be viewed as consisting of 16 byte-sized elements, or 8 16-bit elements, or 4 32-bit elements… you get the idea.
• An even-odd interleave (perfect shuffle) operation between a pair of SIMD registers. For convenience, we’ll assume that the results are returned in the same 2 registers. Implementing this might require another temporary register depending on the architecture, but let’s ignore that for the purposes of this article; we only care about the state of the registers before and after interleaves, not during them.
• Finally, we assume that registers are completely interchangeable, and that we can “rename” them at will; that is, we’ll consider all permutations that can be performed by just renumbering the registers to be equivalent.

To explain the latter, we would consider an arrangement like this:

r0 = a0 b0 c0 d0
r1 = a1 b1 c1 d1


(where r0 and r1 correspond to SIMD registers) to be equivalent to:

r0 = a1 b1 c1 d1
r1 = a0 b0 c0 d0


or even

r3 = a0 b0 c0 d0
r5 = a1 b1 c1 d1


To rephrase it, we don’t care about differences in “register allocation”: as long as we get all the individual rows we need, any order will do.

### What permutations can be generated using interleaves?

This is a question I’ve been wondering about ever since I first saw the original MMX instructions, but I didn’t really spend much time thinking about it until fairly recently, when I got curious. So I don’t have a full answer, but I do have some interesting partial results.

Let’s get the trivial case out of the way first: if k=1 (that is, each register contains exactly one element), then clearly we can reach any permutation we want, and without doing any work to boot—every register contains one value, and as explained above, our model treats register-level permutations as free. However, as soon as we have multiple elements inside a single register, things start to get interesting.

### Permutations generated for n=2, k=2

We’re only permuting n×k = 4 elements here, so the groups in question are all small enough to enumerate their elements on a piece of paper, which makes this a good place to start. Also, with n=2, the “register permutation” side of things is really simple (we either swap the two registers or we don’t). For the even-odd interleave, we would normally have to specify which two registers to interleave—but since we only have two, we can just agree that we always want to interleave register 0 with register 1. Should we want the opposite order (interleave register 1 with register 0), we can simply swap the two registers beforehand. So our available operations boil down to just two permutations on 4 elements:

• Swap registers 0 and 1—this swaps the first two and the last two elements, so it corresponds to the permutation $0123 \mapsto 2301$ or (02)(13) in cycle notation. This is an involution: applying it twice swaps the elements back.
• Even/odd interleave between registers 0 and 1. This boils down to the permutation $0123 \mapsto 0213$ or (12) which swaps the two middle elements and is also an involution.

These are the only operations we permit, so we have a finite group that is generated by involutions: it must be a dihedral group. In fact, it turns out to be Dih4, the symmetry group of a square, which is of order 8. So using 2 registers, we can reach only 8 of the $4! = 24$ permutations of 4 elements. So what happens when we have more registers at our disposal?

### Permutations generated for n>2, k=2

The next smallest case is n=3, k=2, which gives us permutations of 6 elements. $6! = 720$, so this is still small enough to simply run a search, and that’s what I did. Or, to be more precise, I wrote a program that did the searching for me. It turns out that the even-odd interleave of the first two registers combined with arbitrary permutations on the registers (of which there are $3! = 6$) is enough to reach all 720 permutations in S6, the symmetric group on 6 elements. Beyond this, I can’t say much about how this representation of the group works out; it would be nice if there was an easy way to find shortest paths between permutations for example (which would have uses for code generation), but if there is, I don’t know how. That said, my understanding of group theory is fairly basic; I’d really appreciate input from someone with more experience dealing with finite groups here.

I can tell you what happens for n>3, though: we already know we can produce all permutations using only 3 registers. And using the exact same steps, we can reach any permutation of the first 3 registers for n>3, leaving the other registers untouched. But that’s in fact enough to generate an arbitrary permutation of n elements, as follows: Say we have n=4, and we start with

r0 = e0 e1
r1 = e2 e3
r2 = e4 e5
r3 = e6 e7


and we want to end up with the (arbitrarily chosen) permutation

r0 = e3 e7
r1 = e0 e5
r2 = e1 e4
r3 = e6 e2


To begin, we try to generate the value we want to end up in r3 (namely, e6 e2). First, we swap rows around so that we have the source values we need in rows 0 and 1 (that is, registers r0 and r1). In our example, that just requires swapping r0 with r3:

r0 = e6 e7
r1 = e2 e3
r2 = e4 e5
r3 = e0 e1


Next, we know that we can reach arbitrary permutations of the first three rows (registers). In particular, we can shuffle things around so that r0 contains the value we’d like to be in r3:

r0 = e6 e2
r1 = e7 e3
r2 = e4 e5
r3 = e0 e1


This is one way of doing it, but really any permutation that has e6 e2 in the first row would work. Anyway, now that we have produced the value we wanted, we can swap it back into r3:

r0 = e0 e1
r1 = e7 e3
r2 = e4 e5
r3 = e6 e2


We now have the value we want in r3, and all the remaining unused source elements remain in rows 0–2. And again, since we know we can achieve arbitrary permutations of 6 elements using 3 registers using the n=3 case, we’re done! For n>4, the proof works in the same way: we first generate rows 3, 4, …, n-1 one by one; once a row is done, we never touch it again (it can’t contain any source elements we need for the remaining rows, since we only allow permutations). In the end we will always arrive at a configuration that has all the remaining “unfinished” elements in rows 0–2, which is a configuration we can solve.

I don’t mean to suggest that this is an efficient way to solve this problem; quite the opposite, in fact. But it’s an easy way to prove that once we have n=3 solved, higher n’s don’t add any substantial difficulty.

### Summary

This covers the easiest cases, k=1 and k=2, and answers the question I originally wondered about in the positive: using only interleaves, you can produce arbitrary permutations of the input elements in registers, as long as you only have k=2 elements per register (for example, 32-bit values inside the 64-bit MMX registers) and at least 3 SIMD registers worth of storage. Without any nice bounds on the number of operations required or an algorithmic way to compute an optimal interleave/rename sequence, I’m the first one to admit that this has little practical relevance, but it’s cool stuff nonetheless! Coming up, I’ll talk a bit about the (more interesting) k=4 case, but I think this is enough material for a single blog post. Until next time!