The Z-transform is a standard tool in signal processing; however, most descriptions of it focus heavily on the mechanics of manipulation and don’t give any explanation for what’s going on. In this post, my goal is to cover only the basics (just enough to understand standard FIR and IIR filter formulations) but make it a bit clearer what’s actually going on. The intended audience is people who’ve seen and used the Z-transform before, but never understood why the terminology and notation are the way they are.
Polynomials
In an earlier draft, I tried to start directly with the standard Z-transform, but that immediately brings up a bunch of technicalities that make matters more confusing than necessary. Instead, I’m going to start with a simpler setting: let’s look at polynomials.
Not just any polynomials, mind. Say we have a finite sequence of real or complex numbers a_0, a_1, …, a_{n-1}. Then we can define a corresponding polynomial that has the values in that sequence as its coefficients:

A = a_0 + a_1 x + a_2 x^2 + ⋯ + a_{n-1} x^{n-1},
and of course there’s also a corresponding polynomial function A(x) that plugs in concrete values for x. Polynomials are closed under addition and scalar multiplication (both componentwise, or more accurately in this case, coefficient-wise) so they form a vector space and we can build linear combinations αA + βB. That’s reassuring, but not particularly interesting. What is interesting is the other fundamental operation we can do with polynomials: Multiply them. Let’s say we multiply two polynomials A and B to form a third polynomial C, and we want to find the corresponding coefficients:

C = A·B = c_0 + c_1 x + c_2 x^2 + ⋯
To find the coefficients of C, we simply need to sum across all the combinations of indices i, j such that i+j=k, which of course means that j=k-i:

c_k = Σ_{i+j=k} a_i b_j = Σ_i a_i b_{k-i}.
When I don’t specify a restriction on the summation index, that simply means “sum over all i for which the right-hand side is well-defined”. And speaking of notation, in the future, I don’t want to give a name to every individual expression just so we can look at the coefficients; instead, I’ll just write [x^k] AB to denote the coefficient of x^k in the product of A and B – which is of course our c_k. Anyway, this is simply the discrete convolution of the two input sequences, and we’re gonna milk this connection for all it’s worth.
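To make the coefficient formula concrete, here’s a small C sketch (function name and signature are mine, not from the text) that computes the coefficients of C = AB – which is, by the above, exactly discrete convolution of the two coefficient sequences:

```c
#include <stddef.h>

// c[k] = sum over i of a[i]*b[k-i], i.e. polynomial multiplication,
// aka discrete convolution. c must have room for na+nb-1 coefficients.
void poly_mul(const double *a, size_t na,
              const double *b, size_t nb, double *c)
{
    for (size_t k = 0; k < na + nb - 1; k++)
        c[k] = 0.0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            c[i + j] += a[i] * b[j]; // each (i, j) pair lands in c[i+j]
}
```

For example, (1 + x)·(1 + x) = 1 + 2x + x², i.e. convolving (1, 1) with (1, 1) gives (1, 2, 1).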
Knowing nothing but this, we can already do some basic filtering in this form if we want to: Suppose our a_i encode a sampled sound effect. A again denotes the corresponding polynomial, and let’s say B = 1 + x, corresponding to the sequence (1, 1). Then C=AB computes the convolution of A with the sequence (1, 1), i.e. each sample and its immediate successor are summed (this is a simple but unnormalized low-pass filter). Now so far, we’ve only really substituted one way to write convolution for another. There’s more to the whole thing than this, but for that we need to broaden our setting a bit.
Generating functions
The next step is to get rid of the fixed-length limitation. Instead of a finite sequence, we’re now going to consider potentially infinite sequences (a_0, a_1, a_2, …). A finite sequence is simply one where all but finitely many of the a_i are zero. Again, we can create a corresponding object that captures the whole sequence – only instead of a polynomial, it’s now a power series:

A = a_0 + a_1 x + a_2 x^2 + ⋯ = Σ_{k≥0} a_k x^k.
And the corresponding function A(x) is called the a_i’s generating function. Now we’re dealing with infinite series, so if we want to plug in an actual value for x, we have to worry about convergence issues. For the time being, we won’t do so, however; we simply treat the whole thing as a formal power series (essentially, an “infinite-degree polynomial”), and all the manipulations I’ll be doing are justified in that context even if the corresponding series don’t converge.
Anyway, the properties described above carry over: we still have linearity, and there’s a multiplication operation (the Cauchy product) that is the obvious generalization of polynomial multiplication (in fact, the formula I’ve written above for the ck still applies) and again matches discrete convolution. So why did I start with polynomials in the first place if everything stays pretty much the same? Frankly, mainly so I don’t have to whip out both infinite sequences and power series in the second paragraph; experience shows that when I start an article that way, the only people who continue reading are the ones who already know everything I’m talking about anyway. Let’s see whether my cunning plan works this time.
So what do we get out of the infinite sequences? Well, for one, we can now work on infinite signals – or, more usually, signals with a length that is ultimately finite, but not known in advance, as occurs with real-time processing. Above we saw a simple summing filter, generated from a finite sequence. That sequence is the filter’s “impulse response”, so called because it’s the result you get when applying the filter to the unit impulse signal (1 0 0 …). (The generating function corresponding to that signal is simply “1”, so this shouldn’t be much of a surprise). Filters where the impulse response has finite length are called “finite impulse response” or FIR filters. These filters have a straight-up polynomial as their generating function. But we can also construct filters with an infinite impulse response – IIR filters. And those are the filters where we actually get something out of going to generating functions in the first place.
Let’s look at the simplest infinite sequence we can think of: (1 1 1 …), simply an infinite series of ones. The corresponding generating function is

G_1 = 1 + x + x^2 + ⋯ = Σ_{k≥0} x^k.
And now let’s look at what we get when we convolve a signal a_i with this sequence:

s_k = Σ_{i=0}^{k} a_i.
Expanding out, we see that s_0 = a_0, s_1 = a_0 + a_1, s_2 = a_0 + a_1 + a_2 and so forth: convolution with G_1 generates a signal that, at each point in time, is simply the sum of all values up to that time. And if we actually had to compute things this way, this wouldn’t be very useful, because our filter would keep getting slower over time! Luckily, G_1 isn’t just an arbitrary function – it’s a geometric series, which means that for concrete values of x, we can compute G_1(x) as:

G_1(x) = 1 + x + x^2 + ⋯ = 1/(1 − x)   (for |x| < 1),
and more generally, for arbitrary c,

G_c(x) = 1 + cx + (cx)^2 + ⋯ = Σ_{k≥0} c^k x^k = 1/(1 − cx)   (for |cx| < 1).
If we apply the identity the other way round, we can turn such an expression of x back into a power series; in particular, when dealing with formal power series, the left-hand side is the definition of the expression on the right-hand side. This notation also suggests that G1 is the inverse (wrt. convolution) of (1 – x), and more generally that Gc is the inverse of (1 – cx). Verifying this makes for a nice exercise.
But what does that mean for us? It means that, given the expression

S = A G_c = A / (1 − cx),

we can treat it as the identity between power series that it is and multiply both sides by (1 – cx), giving:

S (1 − cx) = A,  i.e.  s_k − c s_{k−1} = a_k,

and thus

s_k = a_k + c s_{k−1},

i.e. we can compute s_k in constant time if we’re allowed to look at s_{k−1}. In particular, for the c=1 case we started with, this just means the obvious thing: don’t throw the partial sum away after every sample, instead just keep adding the most recent sample to the running total.
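In code, the recurrence s_k = a_k + c·s_{k−1} is one multiply-add per sample; here’s a minimal C sketch (the naming is mine) of convolving with G_c in constant time per sample:

```c
// Convolve a with the sequence (1, c, c^2, ...) via the recurrence
// s[k] = a[k] + c*s[k-1]: constant work per sample instead of O(k).
void gc_filter(double c, const double *a, double *s, int len)
{
    double prev = 0.0; // s[-1] = 0
    for (int k = 0; k < len; k++) {
        prev = a[k] + c * prev;
        s[k] = prev;
    }
}
```

With c = 1 this is exactly the running total for the summing filter.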
And here’s the thing: that’s everything you need to compute convolutions with almost any sequence that has a rational generating function, i.e. one that is a quotient of polynomials G = P/Q. Using the same trick as above, it’s easy to see what that means computationally. Say that

P = p_0 + p_1 x + ⋯ + p_m x^m

and

Q = q_0 + q_1 x + ⋯ + q_n x^n.

If our signal has the generating function A(x), then computing the filtered signal S boils down to evaluating

S = A P / Q,  i.e.  S Q = A P.

Along the same lines as before, comparing the coefficients of x^k on both sides, we have

s_k = (Σ_{i=0}^{m} p_i a_{k−i} − Σ_{i=1}^{n} q_i s_{k−i}) / q_0.
So again, we can compute the signal incrementally using a fixed amount of work (depending only on n and m) for every sample, provided that q_0 isn’t zero. The question is, do these rational functions still have a corresponding series expansion? After all, this is what we need to employ generating functions in the first place. Luckily, the answer is yes, again provided that q_0 isn’t zero. I’ll skip describing how exactly this works since we’ll be content to deal directly with the factored rational function form of our generating functions from here on out; if you want more details (and see just how useful the notion of a generating function turns out to be for all kinds of problems!), I recommend you look at the excellent “Concrete Mathematics” by Graham, Knuth and Patashnik or the by now freely downloadable “generatingfunctionology” by Wilf.
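Concretely, the identity S·Q = A·P gives the per-sample recurrence s_k = (Σ_{i=0}^{m} p_i a_{k−i} − Σ_{i=1}^{n} q_i s_{k−i}) / q_0; here’s a C sketch of one possible direct implementation (names are mine; indices below zero are treated as zero):

```c
// Filter input a through the rational generating function P/Q, where
// p[0..m] and q[0..n] hold the coefficients and q[0] != 0.
// Work per sample depends only on m and n.
void rational_filter(const double *p, int m, const double *q, int n,
                     const double *a, double *s, int len)
{
    for (int k = 0; k < len; k++) {
        double acc = 0.0;
        for (int i = 0; i <= m && i <= k; i++)
            acc += p[i] * a[k - i];   // feed-forward (FIR) part
        for (int i = 1; i <= n && i <= k; i++)
            acc -= q[i] * s[k - i];   // feedback (IIR) part
        s[k] = acc / q[0];
    }
}
```

Setting P = 1 and Q = 1 − x recovers the running-sum filter from before.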
At last, the Z-transform
At this point, we already have all the theory we need for FIR and IIR filters, but with a non-standard notation, motivated by the desire to make the connection to standard polynomials and generating functions more explicit. Let’s fix that up: in signal processing, it’s customary to write a signal x as a function x : ℤ → ℝ (or x : ℤ → ℂ), and it’s customary to write the argument in square brackets. So instead of dealing with sequences that consist of elements x_n, we now have functions with values at integer locations x[n]. And the (unilateral) Z-transform of our signal x is now the function

X(z) = Σ_{n=0}^{∞} x[n] z^{−n};
in other words, it’s basically a generating function, but this time the exponents are negative. I also assume that x[n] = 0 for all n<0, i.e. the signal starts at some defined point and we move that point to 0. This doesn’t make any fundamental difference for the things I’ve discussed so far: all the properties discussed above still hold, and indeed all the derivations will still work if you mechanically substitute x^k with z^{−k}. In particular, anything involving convolutions still works exactly the same. However, it does make a difference if you actually plug in concrete values for z, which we are about to do. Also note that our variable is now z, not x. Customarily, “z” is used to denote complex variables, and this is no exception – more in a minute. Next, the Z-transform of our filter’s impulse response (which is essentially the filter’s generating function, except now we evaluate at 1/z) is called the “transfer function” and has the general form
H(z) = P(1/z) / Q(1/z) = (p_0 + p_1 z^{−1} + ⋯ + p_m z^{−m}) / (q_0 + q_1 z^{−1} + ⋯ + q_n z^{−n}),

where P and Q are the same polynomials as above; in the DSP literature, H(z) is typically written as the quotient Y(z)/X(z) of the output and input transforms. You can factorize the numerator and denominator polynomials to get the zeroes and poles of a filter. They’re important concepts in IIR filter design, but fairly incidental to what I’m trying to do (give some intuition about what the Z-transform does and how it works), so I won’t go into further detail here.
The Fourier connection
One last thing: the relation of all this to frequency space, or: what do our filters actually do to frequencies? For this, we can use the discrete-time Fourier transform (DTFT, not to be confused with the Discrete Fourier Transform or DFT). The DTFT of a general signal x is

X(e^{iω}) = Σ_{n=−∞}^{∞} x[n] e^{−iωn}.

Now, in our case we’re only considering signals with x[n]=0 for n<0, so we get

X(e^{iω}) = Σ_{n=0}^{∞} x[n] e^{−iωn} = X(z) evaluated at z = e^{iω},

which means we can compute the DTFT of a signal by evaluating its Z-transform at exp(iω) – assuming the corresponding series converges. Now, if the Z-transform of our signal is in general series form, this is just a different notation for the same thing. But for our rational transfer functions H(z), this is a big deal, because evaluating their values at a given complex z is easy – it’s just a rational function, after all.
In fact, since we know that polynomial (and series) multiplication corresponds to convolution, we can now also easily see why convolution filters are useful to modify the frequency response (Fourier transform) of a signal: If we have a signal x with Z-transform X and a filter with transfer function H, the filtered signal y = x ∗ h has Z-transform Y and we get:

Y(e^{iω}) = X(e^{iω}) H(e^{iω}),

and in particular

|Y(e^{iω})| = |X(e^{iω})| · |H(e^{iω})|.
The first of these two equations is the discrete-time convolution theorem for Fourier transforms of signals: the DTFT of the convolution of the two signals is the point-wise product of the DTFTs of the original signal and the filter. The second shows us how filters can amplify or attenuate individual frequencies: if |H(e^{iω})| > 1, frequency ω will be amplified in the filtered signal, and if it’s less than 1, it will be dampened.
Conclusion and further reading
The purpose of this post was to illustrate a few key concepts and the connections between them:
- Polynomial/series multiplication and convolution are the same thing.
- The Z-transform is very closely related to generating functions, an extremely powerful technique for manipulating sequences.
- In particular, the transfer function of a filter isn’t just some arbitrary syntactic convention to tie together the filter coefficients; there’s a direct connection to corresponding sequence manipulations.
- The Fourier transform of filters is directly tied to the behavior of H(z) in the complex plane; computing the DTFT of an IIR filter’s impulse response directly would get messy, but the factored form of H(z) makes it easy.
- With this background, it’s also fairly easy to see why filters work in the first place.
I intentionally cover none of these aspects deeply; my experience is that most material on the subject does a great job of covering the details, at the expense of making it harder to see the big picture, so I wanted to try doing it the other way round. More details on series and generating functions can be found in the two books I cited above, and a good introduction to digital filters that supplies the details I omitted is Smith’s Introduction to Digital Filters.
I wanted to point someone to a short explanation of this today and noticed, with some surprise, that I couldn’t find something concise within 5 minutes of googling. So it seems worth writing up. I’m assuming you know what quaternions are and what they’re used for.
First, though, it seems important to point one thing out: there’s actually nothing special at all about either integration or differentiation of quaternion-valued functions. If you have a quaternion-valued function of one variable q(t), then

q'(t) = lim_{h→0} (q(t+h) − q(t)) / h,

same as for any real- or complex-valued function.
So what, then, is this post about? Simple: unit quaternions are commonly used to represent rotations (or orientations of rigid bodies), and rigid-body dynamics require integration of orientation over time. Almost all sources I could find just magically pull a solution out of thin air, and those that give a derivation tend to make it way more complicated than necessary. So let’s just do this from first principles. I’m assuming you know that multiplying two unit quaternions q_1 q_0 gives a unit quaternion representing the composition of the two rotations. Now say we want to describe the orientation q(t) of a rigid body rotating at constant angular velocity. Then we can write

q(t + 1) = q_ω q(t),

where q_ω describes the rotation the body undergoes in one time step. Since we have constant angular velocity, we will have

q(t + 2) = q_ω^2 q(t),

and more generally

q(t + k) = q_ω^k q(t)

for all nonnegative integer k by induction. So for even more general t we’d expect something like

q(t) = q_ω^t q(0).
Now, q_ω is a unit quaternion, which means it can be written in polar form

q_ω = cos θ + n sin θ,

where θ is some angle and n is a unit vector denoting the axis of rotation. That part is usually mentioned in every quaternion tutorial. Embedding real 3-vectors as the corresponding pure imaginary quaternions, i.e. writing just n for the quaternion

n_x i + n_y j + n_z k,

is usually also mentioned somewhere. What usually isn’t mentioned is the crucial piece of information that the polar form of a quaternion is, in fact, just the quaternion version of Euler’s formula: any unit complex number

z = cos θ + i sin θ

can be written as the complex exponential of a pure imaginary number,

z = e^{iθ},

and similarly any unit-length quaternion (and q_ω in particular) can be written as the exponential of a pure imaginary quaternion,

q_ω = cos θ + n sin θ = exp(nθ),

which gives us a natural definition for

q(t) = q_ω^t q(0) = exp(nθt) q(0).
Now, what if we want to write a differential equation for the behavior of q(t) over time? Just compute the derivative of q(t) as you would for any other function of t. Using the chain and product rules we get:

q'(t) = (d/dt) [exp(nθt)] q(0).

Now t and θ are real numbers, so the exponent is a real number times n; exp is defined as a power series, and for arbitrary real c we have

exp(cn) = Σ_{k≥0} (cn)^k / k! = Σ_{k≥0} c^k n^k / k!.

n commutes with powers of n, and real numbers commute with everything; therefore, n commutes with exp(cn) for arbitrary c, and thus

q'(t) = θn exp(nθt) q(0) = θn q(t).

The vector θn is in fact just half the angular velocity ω – remember that a unit quaternion rotates vectors by twice its half-angle θ, so ω = 2θn – which yields the oft-cited but seldom-derived equation:

q'(t) = ½ ω q(t).

This is usually quoted completely without context. In particular, it’s not usually mentioned that this q(t) describes the orientation of a body with constant angular velocity, and similar for the crucial link to the exponential function.
Linear interpolation. Lerp. The bread and butter of graphics programming. Well, turns out I have three tricks about lerps to share. One of them is really well known, the other two less so. So let’s get cracking!
Standard linear interpolation is just lerp(t, a, b) = (1-t)*a + t*b. You should already know this. At t=0 we get a, at t=1 we get b, and for inbetween values of t we interpolate linearly between the two. And of course we can also linearly extrapolate by using a t outside [0,1].
Past
The expression shown above has two multiplies. Now, it used to be the case that multiplies were really slow (and by the way they really aren’t anymore, so please stop doubling the number of adds in an expression to get rid of a multiply, no matter what your CS prof tells you). If multiplies are expensive, there’s a better (equivalent) expression you can get with some algebra: lerp(t, a, b) = a + t*(b-a). This is the first of the three tricks, and it’s really well known.
But in this setting (int multiplies are slow) you’re also really unlikely to have floating-point hardware, or to be working on floating-point numbers. Especially if we’re talking about pixel processing in computer graphics, it used to be all-integer for a really long time. It’s more likely to be all fixed-point, and in the fixed-point setting you typically have something like fixedpt_lerp(t, a, b) = a + ((t * (b-a)) >> 8). And more importantly, your a’s and b’s are also fixed-point (integer) numbers, typically with a very low range.
So here comes the second trick: for some applications (e.g. cross fades), you’re doing a lot of lerps with the same t (at least one per pixel). Now note that if t is fixed, the (t * (b - a)) >> 8 part really only depends on b-a, and that’s a small set of possible values: If a and b are byte values in [0,255], then d=b-a is in [-255,255]. So we really only ever need to do 511 multiplies based on t, to build a table indexed by d. Except that’s not true either, because we compute first t*0 then t*1 then t*2 and so forth, so we’re really just adding t every time, and we can build the whole table without any multiplies at all. So here we get trick two, for doing lots of lerps with constant t on integer values:
U8 lerp_tab[255*2+1]; // indexed by 255 + (b-a), with b-a in [-255,255]
// Build the lerp table incrementally: sum == t*i at step i, so no
// multiplies are needed. Entries wrap mod 256 (hence U8 is enough):
// the wrap cancels once the result is added to a and stored back
// into a U8 pixel.
for (int i=0, sum = 0; i < 256; i++, sum += t) {
    lerp_tab[255-i] = (U8) (-sum >> 8); // negative half of table
    lerp_tab[255+i] = (U8) ( sum >> 8); // positive half
}
// cross-fade (grayscale image here for simplicity)
for (int i=0; i < npixels; i++) {
    int a = src1[i];
    int b = src2[i];
    out[i] = a + lerp_tab[255 + b-a]; // == a + ((t*(b-a)) >> 8) mod 256
}
Look ma, no multiplies!
Present
But back to the present. Floating-point hardware is readily available on most platforms and purely counting arithmetic ops is a poor estimator for performance on modern architectures. Practically speaking, for almost all code, you’re never going to notice any appreciable difference in speed between the two versions
lerp_1(t, a, b) = (1-t)*a + t*b
lerp_2(t, a, b) = a + t*(b-a)
but you might notice something else: unlike the real numbers (which our mathematical definition of lerp is based on) and the integers (which our fixed-point lerps worked on), floating-point numbers don’t obey the arithmetic identities we’ve been using to derive this. In particular, for two floating point numbers we generally have a + (b-a) != b, so the second lerp expression is generally not correct at t=1! In contrast, with IEEE floating point, the first expression is guaranteed to return the exact value of a (b) at t=0 (t=1). So there’s actually some reason to prefer the first expression in that case, and using the second one tends to produce visible artifacts for some applications (you can also hardcode your lerp to return b exactly at t=1, but that’s just ugly and paying a data-dependent branch to get rid of one FLOP is a bad idea). While for pixel values you’re unlikely to care, it generally pays to be careful for mesh processing and the like; using the wrong expression can produce cracks in your mesh.
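Here’s a tiny C demonstration of that endpoint problem (the values are mine, chosen to make the rounding failure obvious): with a = 10^20 and b = 1, the difference b − a rounds to exactly −a, so lerp_2 returns 0 rather than b at t = 1, while lerp_1 is exact:

```c
// The two algebraically equivalent lerp formulations from the text.
float lerp_1(float t, float a, float b) { return (1 - t)*a + t*b; }
float lerp_2(float t, float a, float b) { return a + t*(b - a); }

// With a = 1e20f, b = 1.0f: b is far below one ulp of a, so b - a
// rounds to -1e20f exactly, and lerp_2(1, a, b) = a + (-a) = 0 != b.
// lerp_1(1, a, b) = 0*a + 1*b = b exactly.
```

The effect is extreme here, but the same thing happens (at smaller magnitude) for ordinary mesh coordinates.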
So what to do? Use the first form, which has one more arithmetic operation and does two multiplies, or the second form, which has one less operation but unfavorable rounding properties? Luckily, this dilemma is going away.
Future
Okay, “future” is stretching it a bit here, because for some platforms this “future” started in 1990, but I digress. Anyway, in the future we’d like to have a magic lerp instruction in our processors that solves this problem for us (and is also fast). Unfortunately, that seems very unlikely: Even GPUs don’t have one, and Michael Abrash never could get Intel to give him one either. However, he did get fused multiply-adds, and that’s one thing all GPUs have too, and it’s either already on the processor you’re using or soon coming to it. So if fused multiply-adds can help, then maybe we’re good. And it turns out they can.
A fused multiply-add is just an operation fma(a, b, c) = a*b + c that computes the inner expression using only one exact rounding step. And FMA-based architectures tend to compute regular adds and multiplies using the same circuitry, so all three operations cost roughly the same (in terms of latency anyway, not necessarily in terms of power). And while I say “fma” these chips usually support different versions with sign flips in different places too (implementing this in the HW is almost free); the second most important one is “fused negative multiply-subtract”, which does fnms(a, b, c) = -(a*b - c) = c - a*b. Let’s rewrite our lerp expressions using FMA:
lerp_1(t, a, b) = fma(t, b, (1 - t)*a)
lerp_2(t, a, b) = fma(t, b-a, a)
Both of these still have arithmetic operations left that aren’t in FMA form; lerp_1 has two leftover ops and lerp_2 has one. And so far, neither of them is significantly better than its original counterpart. However, lerp_1 has exactly one multiply and one subtract left; they’re just subtract-multiply (which we don’t have HW for), rather than multiply-subtract. Luckily, that one is easily remedied with some algebra: (1 - t)*a = 1*a - t*a = a - t*a, and that last expression is in fma (more accurately, fnms) form. So we get a third variant, and our third trick:
lerp_3(t, a, b) = fma(t, b, fnms(t, a, a))
Two operations, both FMA-class ops, and this is based on lerp_1 so it’s actually exact at the end points. Two dependent ops – as long as we don’t actually get a hardware lerp instruction, that’s the best we can hope for. So this version is both fast and accurate, and as you migrate to platforms with guaranteed FMA support, you should consider rewriting your lerps that way – it’s the lerp of the future!
It’s fairly well-known that Bezier curves can be evaluated using repeated linear interpolation – de Casteljau’s algorithm. It’s also fairly well-known that a generalization of this algorithm can be used to evaluate B-Splines: de Boor’s algorithm. What’s not as well known is that it’s easy to construct interpolating polynomials using a very similar approach, leading to an algorithm that is, in a sense, halfway between the two.
In the following, I’ll write lerp(t, a, b) = (1−t)a + tb for linear interpolation. I’ll stick with quadratic curves since they are the lowest-order curves to show “interesting” behavior for the purposes of this article; everything generalizes to higher degrees in the obvious way.
De Casteljau’s algorithm
De Casteljau’s algorithm is a well-known algorithm to evaluate Bezier Curves. There’s plenty of material on this elsewhere, so as usual, I’ll keep it brief. Assume we have three control points p_0, p_1, p_2. In the first stage, we construct three constant (degree-0) polynomials for the three control points:

p_0^0(t) = p_0,  p_1^0(t) = p_1,  p_2^0(t) = p_2.
These are then linearly interpolated to yield two linear (degree-1) polynomials:

p_0^1(t) = lerp(t, p_0^0(t), p_1^0(t))
p_1^1(t) = lerp(t, p_1^0(t), p_2^0(t)),
which we then interpolate linearly again to give the final result

p_0^2(t) = lerp(t, p_0^1(t), p_1^1(t)).
Note I give the construction of the full polynomials here; the actual de Casteljau algorithm gets rid of them immediately by evaluating all of them as soon as they appear (only ever doing linear interpolations). Anyway, the general construction rule we’ve been following is this:

p_i^{k+1}(t) = lerp(t, p_i^k(t), p_{i+1}^k(t)).
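In code, the quadratic case is just three lerps; a minimal C sketch with scalar control points (names are mine):

```c
static double lerp(double t, double a, double b) { return (1 - t)*a + t*b; }

// Quadratic de Casteljau: evaluate the Bezier curve with control
// points p0, p1, p2 at parameter t via repeated linear interpolation.
double decasteljau2(double t, double p0, double p1, double p2)
{
    double a = lerp(t, p0, p1); // first-stage lerp
    double b = lerp(t, p1, p2); // first-stage lerp
    return lerp(t, a, b);       // second stage: the curve point
}
```

For curves in 2D or 3D, apply the same scheme per coordinate.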
De Boor’s algorithm
De Boor’s algorithm is the equivalent to de Casteljau’s algorithm for B-Splines. Again, there’s plenty of material out there on it, so I’ll keep it brief: We again start with constant functions for our data points. This time, the exact formulas depend on the degree of the spline we’ll be using; I’ll be using degree 2 (quadratic) here. We’ll also need a knot vector

(t_0, t_1, t_2, t_3),

which determines where our knots are; knots are (very roughly) the t’s corresponding to the control points. I’ll be using slightly different indexing from what’s normally used to make the similarities more visible, and ignore issues such as picking the right set of control points to interpolate from:

p_0^0(t) = p_0,  p_1^0(t) = p_1,  p_2^0(t) = p_2.
Then we linearly interpolate based on the knot vector:

p_0^1(t) = lerp((t − t_0)/(t_2 − t_0), p_0^0(t), p_1^0(t))
p_1^1(t) = lerp((t − t_1)/(t_3 − t_1), p_1^0(t), p_2^0(t)),
and interpolate again one more time to get the result:

p_0^2(t) = lerp((t − t_1)/(t_2 − t_1), p_0^1(t), p_1^1(t)).
The general recursion formula for de Boor’s algorithm (with this indexing convention, which is non-standard, so do not use this for reference!) is this:

p_i^{k+1}(t) = lerp((t − t_{i+k})/(t_{i+2} − t_{i+k}), p_i^k(t), p_{i+1}^k(t)).
Interpolating polynomials from linear interpolation
There are multiple constructions for interpolating polynomials; the best-known are probably the Lagrange polynomials (which form a basis for the interpolating polynomials of degree n for a given set of nodes t_0, …, t_n) and the Newton polynomials (since polynomial interpolation has unique solutions, these give the same results, but the Newton formulation is more suitable for incremental evaluation).
What’s less well known is that interpolating polynomials also obey a simple triangular scheme based on repeated linear interpolation: Again, we start the same way with constant polynomials

p_0^0(t) = p_0,  p_1^0(t) = p_1,  p_2^0(t) = p_2,

and this time we have associated nodes t_0, t_1, t_2 and want to find the interpolating polynomial P such that

P(t_0) = p_0,
P(t_1) = p_1,
P(t_2) = p_2.

Same as before, we first try to find linear functions that solve part of the problem. A reasonable choice is:

p_0^1(t) = lerp((t − t_0)/(t_1 − t_0), p_0^0(t), p_1^0(t))
p_1^1(t) = lerp((t − t_1)/(t_2 − t_1), p_1^0(t), p_2^0(t)).
Note the construction here. p_0^1 is a linear polynomial that interpolates the data points (t_0, p_0) and (t_1, p_1), and we get it by interpolating between two simpler (degree-0) polynomials that interpolate only (t_0, p_0) and (t_1, p_1), respectively: we simply make sure that at t = t_0, we use p_0^0, and at t = t_1, we use p_1^0. All of this is easiest to visualize when t_0 < t_1 < t_2, but it in fact works with the nodes in any order. p_1^1 is constructed the same way.
To construct our final interpolating polynomial, we use the same trick again:

p_0^2(t) = lerp((t − t_0)/(t_2 − t_0), p_0^1(t), p_1^1(t)).

Note this one is a bit subtle. We linearly interpolate between two polynomials that both in turn interpolate (t_1, p_1); this means we already know that the result will also pass through this point. So t_1 is taken care of, and we only need to worry about t_0 and t_2 – and for each of the two, one of our two polynomials does the job, so we can do the linear interpolation trick again. The generalization of this approach to higher degrees requires that we make sure that both of our input polynomials at every step interpolate all of the middle points, so we only need to fix up the ends. But this is easy to arrange – the general pattern should be clear from the construction above. This gives us our recursive construction rule:

p_i^{k+1}(t) = lerp((t − t_i)/(t_{i+k+1} − t_i), p_i^k(t), p_{i+1}^k(t)).
All of this is, of course, not new; in fact, this is just Neville’s algorithm. But in typical presentations, this is derived purely algebraically from the properties of Newton interpolation and divided differences, and it’s not pointed out that the linear combination in the recurrence is, in fact, a linear interpolation – which at least to me makes everything much easier to visualize.
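The whole triangular scheme fits in a few lines of C (a sketch with my own naming; point count capped for simplicity) – note that the inner update is literally a lerp with factor (t − x_i)/(x_{i+k} − x_i):

```c
// Neville's algorithm: evaluate the unique polynomial through the
// points (x[i], y[i]), i = 0..n-1, at position t. Assumes n <= 16
// and pairwise distinct nodes x[i].
double neville(const double *x, const double *y, int n, double t)
{
    double p[16];
    for (int i = 0; i < n; i++)
        p[i] = y[i]; // the constant (degree-0) polynomials
    for (int k = 1; k < n; k++) {
        for (int i = 0; i + k < n; i++) {
            double u = (t - x[i]) / (x[i + k] - x[i]);
            p[i] = (1 - u) * p[i] + u * p[i + 1]; // a lerp
        }
    }
    return p[0];
}
```

Since u can leave [0, 1], most of these lerps are in fact extrapolations.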
The punchline
The really interesting bit to me though is that, starting from the exact same initial conditions, we get three different important interpolation / approximation algorithms that differ only in how they choose their interpolation factors:
de Casteljau: t
Neville: (t − t_i) / (t_{i+k+1} − t_i)
de Boor: (t − t_{i+k}) / (t_{i+2} − t_{i+k})
I think this is quite pretty. B-Splines with the right knot vector (e.g. [0,0,0,1,1,1] for the quadratic curves we’ve been using) are just Bezier Curves, that bit is well known. But what’s less well known is that Neville’s Algorithm (and hence regular polynomial interpolation) is just another triangular linear interpolation scheme that fits inbetween the two.
This post is part of the series “Debris: Opening the box“.
At the end of the last post, I described the basic “moving average” implementation for box filters and how to build a simple blur out of it. I also noted that the approach given only offers very coarse control of the blur radius, and that we’d like to have something better than box filters. So let’s fix both of these issues.
Subpixel resolution
In the previous article, I used ‘r’ to denote what is effectively the “radius” of a box filter, and our filters always had an odd number of taps 2r+1. So r=0 has 1 tap (which is 1, so this is just the identity – no blur), r=1 has 3 taps, r=2 has 5 taps and so forth. But how do we get intermediate stages, say r=0.1 or r=1.5?
Well, Jim Blinn once said that “All problems in computer graphics can be solved with a matrix inversion”. But here’s the thing: Jim Blinn lied to you. Disregard Jim Blinn. Because at least 50% of all problems in computer graphics can in fact be solved using linear interpolation without any matrices whatsoever. And this is one of them. We know how to handle r=0, and we know how to handle r=1. Want something in the middle? Well, just linearly fade in another 1 and then normalize the whole thing so the weights still sum to 1. To implement this, let’s split our radius into an integer and fractional component first:

r = m + α, where m = ⌊r⌋ and α ∈ [0, 1).
Then our blur kernel looks like this:

(α, 1, 1, …, 1, 1, α)
with exactly 2m+1 ones in the middle. This is still cheap to implement – for example, we could just treat the whole thing as a box filter with r=m, and then add the two samples at the ends with weight α after the fact (and then apply our usual normalization). This is perfectly reasonable and brings us up to four samples per pixel instead of two, but it's still a constant cost independent of the length of the filter, so we're golden.
However there's a slightly different equivalent formulation that fits in a bit better with the spirit of the filter and is very suitable when texture filtering hardware is available: consider what happens when we move one pixel to the right. Let's look at our five-tap case again (without normalization to keep things simple) and take another look at the differences between adjacent output samples:

s_{i+1} − s_i = (α x_{i+4} + (1 − α) x_{i+3}) − ((1 − α) x_{i−2} + α x_{i−3}).
On the right side, we add a new pixel at the right end with weight α, and we need to "upgrade" our old rightmost pixel to weight 1 by adding it in again weighted with (1-α); we do the same on the left side to subtract out the old samples. Interestingly, if you look at what we're actually adding and subtracting, it's just linear interpolation between two adjacent samples – i.e. linear filtering (the 1D version of bilinear filtering), the kind of thing that texture sampling hardware is great at!
Time to update our pseudocode: (I'm writing this using floats for convenience, in practice you might want to use fixed point)
// Convolve x with a box filter of "radius" r, ignoring boundary
// conditions.
float scale = 1.0f / (2.0f * r + 1.0f); // or use fixed point
int m = (int) r; // integer part of radius
float alpha = r - m; // fractional part
// Compute sum at first pixel: full-weight taps x[-m]..x[m], plus the
// two fractional-weight taps just past the ends.
float sum = x[0];
for (int i=1; i <= m; i++)
    sum += x[-i] + x[i];
sum += alpha * (x[-m-1] + x[m+1]);
// Generate output pixel, then update running sum for next pixel.
// lerp(t, a, b) = (1-t)*a + t*b = a + t * (b-a)
for (int i=0; i < n; i++) {
    y[i] = sum * scale;
    sum += lerp(alpha, x[i+m+1], x[i+m+2]);
    sum -= lerp(alpha, x[i-m], x[i-m-1]);
}
I think this is really pretty slick – especially when implemented on the GPU, where you can combine the linear interpolation and sampling steps into a single bilinear sample, for an inner loop with two texture samples and a tiny bit of arithmetic. However nice this may be, though, we’re still stuck with box filters – let’s fix that one too.
Unboxing
Okay, we know we can do box filters fast, but they're crappy, and we also know how to do arbitrary filters, but that's much slower. However, we can build better filters by using our box filters as a building block. In particular, we can apply our box filters multiple times. Now, to give you an understanding of what's going on, I'm going to switch to the continuous domain for a bit, because that gets rid of a few discretization artifacts like the in-between box filters introduced above (which make analysis trickier), and it's also nicer for drawing plots. As in the introduction, I'm not going to give you a proper definition or theory of convolution (neither discrete nor continuous); as usual, you can stop by Wikipedia if you're interested in details. What we need to know here are just two things: first, there is a continuous version of the discrete convolution we've been using so far that behaves essentially the same way (except now, we perform integration instead of the finite summation we had before), and second, convolution is associative, i.e. (f ∗ g) ∗ h = f ∗ (g ∗ h). In our case, we're gonna convolve our image I with a box filter b multiple times, and for this case we get something like

((I ∗ b) ∗ b) ∗ ⋯ ∗ b = I ∗ (b ∗ b ∗ ⋯ ∗ b).

Point being, applying a box filter p times is equivalent to filtering the image once with a filter that's a box convolved p-1 times with itself.
In other words, using just our single lowly box filter, we can build fancier filters just by applying it multiple times. So what filters do we get? Let’s plot the first few iterations, starting from a single box filter with radius 1. Note that in the continuous realm, there really is only one box filter – because we’re not on a discrete grid, we can generate other widths just by scaling the “time” value we pass into our filter function. This is one of the ways that looking at continuous filters makes our life easier here. Anyway, here’s the pretty pictures:
These all show the filter in question in red, plotted over a Gaussian with matching variance and scale in blue (ahem, not to suggest any conclusions or anything!). The triangle filter also has a name (which is just as inspired as "box filter", natch); the other ones I've labeled simply by what type of function they correspond to. Anyway, the point to take away here is that convolving a box with itself quickly leads to pretty smooth functions (the Cardinal B-splines, in fact) that get quite similar to a Gaussian very fast. In fact, as we keep increasing p (the order), they will converge towards a "real" Gaussian – proof here for the curious. (You can also stay with the discrete version and prove this using the Central Limit Theorem, but I like this version more.) The piecewise cubic function you get for p=4 above is already within 3% (absolute) error of the corresponding Gaussian. In practice, an exact match with a Gaussian isn't that important anyway as long as you're just blurring images, and just using p=3 is normally fine.
So, to get better filters, just blur multiple times. Easy! So easy, in fact, that I'm not gonna bore you with another piece of pseudocode – I trust you can imagine a "do this p times" outer loop on your own. Now, to be fair, the convolutions I showed you were done in the continuous domain, and none of them actually proves anything about the modified box filters we're using. In their defense, we can at least argue that our modified box filter definition is reasonable in that it's the discretization of the continuous box filter with radius r to a pixel grid, where the pixels themselves have a rectangular shape (this is a fairly good model of current displays). But if we actually wanted to prove consistency and convergence using our filters, we'd need to prove that our discrete version is a good approximation of the continuous version. I'll skip that here since it gets technical and doesn't really add anything if you just want fast blurs, but if you want to see proof (and also get an exact expression for the variance of our iterated modified box filters), this paper has the answers you seek.
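If you'd like to check numerically that iterated (integer-radius, unmodified) box filters behave as advertised, here's a small C++ sketch (helper names are mine). A (2r+1)-tap box kernel has variance r(r+1)/3, and since variances add under convolution, p passes should give variance p·r(r+1)/3:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Discrete convolution of two kernels (full output).
std::vector<double> convolve(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> c(a.size() + b.size() - 1, 0.0);
    for (size_t i = 0; i < a.size(); i++)
        for (size_t j = 0; j < b.size(); j++)
            c[i + j] += a[i] * b[j];
    return c;
}

// Equivalent kernel for p passes of a normalized (2r+1)-tap box filter.
std::vector<double> iterated_box_kernel(int r, int p) {
    std::vector<double> box(2 * r + 1, 1.0 / (2 * r + 1));
    std::vector<double> k = box;
    for (int i = 1; i < p; i++)
        k = convolve(k, box);
    return k;
}

// Variance of a symmetric kernel about its center tap.
double kernel_variance(const std::vector<double>& k) {
    double center = (k.size() - 1) / 2.0;
    double var = 0.0;
    for (size_t i = 0; i < k.size(); i++)
        var += k[i] * (i - center) * (i - center);
    return var;
}
```

For r=3, a single pass has variance 3·4/3 = 4, and four passes give 16, matching the "variances add" rule.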
Implementation notes
And that’s really all you need to know to get nice, fast quasi-Gaussian blurs, no matter how wide the kernel. For a GPU implementation, you should be able to more or less directly take my pseudo-code above and turn it into a Compute Shader: for the horizontal blur passes, have each thread in a group work on a different scan line (and for vertical passes, have each thread work on a different row). To get higher orders, you just keep iterating the passes, ping-ponging between destination render targets (or UAVs if you want to stick with DX11 terminology). Important caveat: I haven’t actually tried this (yet) and I don’t know how well it performs; you might need to whack the code a bit to actually get good performance. As usual with GPUs, once you have the basic algorithm you’re at best half-done; the rest is figuring out the right way to package and partition the data.
I can, however, tell you how to implement this nicely on CPUs. Computation-wise, we’ll stick exactly with the algorithm above, but we can eke out big wins by doing things in the right order to maximize cache hit ratios.
Let’s look at the horizontal passes first. These nicely walk the image left to right and access memory sequentially. Very nice locality of reference, very good for memory prefetchers. There’s not much you can do wrong here – however, one important optimization is to perform the multiple filter passes per scanline (or per group of scanlines) and not filter the whole image multiple times. That is, you want your code to do this:
for (int y=0; y < height; y++) {
    for (int pass=0; pass < num_passes; pass++) {
        blur_scanline(y, radius);
    }
}
and not
// Do *NOT* do it like this!
for (int pass=0; pass < num_passes; pass++) {
    for (int y=0; y < height; y++) {
        blur_scanline(y, radius);
    }
}
This is because a single scan line might fit inside your L1 data cache, and very likely fits fully inside your L2 cache. Unless your image is small, the whole image won't. So in the second version, by the time you get back to y=0 for the second pass, all of the image data has likely dropped out of the L1 and L2 caches and needs to be fetched again from memory (or the L3 cache if you're lucky). That's a completely unnecessary waste of time and memory bandwidth – so avoid this.
Second, vertical passes. If you implement them naively, they will likely be significantly slower (sometimes by an order of magnitude!) than the horizontal passes. The problem is that stepping through images vertically is not making good use of the cache: a cache line these days is usually 64 bytes (even 128 on some platforms), while a single pixel usually takes somewhere between 4 and 16 bytes (depending on the pixel format). So we keep fetching cache lines from memory (or lower cache levels) and only looking at between 1/16th and 1/4th of the data we get back! And since the cache operates at cache line granularity, that means we get an effective cache size of somewhere between a quarter and a sixteenth of the actual cache size (this goes both for L1 and L2). Ouch! And same as above with the multiple passes, if we do this, then it becomes a lot more likely that by the time we come back to any given cache line (once we’re finished with our current column of the image and proceeded to the next one), a lot of the image has dropped out of the closest cache levels.
There’s multiple ways to solve this problem. You can write your vertical blur pass so that it always works on multiple columns at a time – enough to always be reading (and writing) full cache lines. This works, but it means that the horizontal and vertical blur code are really two completely different loops, and on most architectures you’re likely to run out of registers when doing this, which also makes things slower.
A better approach (in my opinion anyway) is to have only one blur loop – let’s say the horizontal one. Now, to do a vertical blur pass, we first read a narrow stripe of N columns and write it out, transposed (i.e. rows and columns interchanged), into a scratch buffer. N is chosen so that we’re consuming full cache lines at a time. Then we run our multiple horizontal blur passes on the N scan lines in our scratch buffer, and finally transpose again as we’re writing out the stripe to the image. The upshot is that we only have one blur loop that gets used for both directions (and only has to be optimized once); the complexity gets shifted into the read and write loops, which are glorified copy loops (okay, with a transpose inside them) that are much easier to get right. And the scratch buffer is small enough to stay in the cache (maybe not L1 for large images, but probably L2) all the time.
However, we still have two separate copy loops, and now there’s this weird asymmetry between horizontal and vertical blurs. Can’t we do better? Let’s look at the sequence of steps that we effectively perform:
1. Horizontal blur pass.
2. Transpose.
3. Horizontal blur pass.
4. Transpose.
Currently, our horizontal blur does step 1, and our vertical blur does steps 2-4. But as should be obvious when you write it this way, it’s actually much more natural to write a function that does steps 1 and 2 (which are the same as steps 3 and 4, just potentially using a different blur radius). So instead of writing one copy loop that transposes from the input to a scratch buffer, and one copy loop that transposes from a scratch buffer to the output, just have the horizontal blur always write to the scratch buffer, and only ever use the second type of copy loop (scratch buffer to output with transpose).
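Here's a rough C++ sketch of such a combined pass (the names and the naive per-pixel inner sum are mine; a real version would use the sliding-window update from earlier): it blurs every row of the source and writes the result transposed, so running it twice yields a full separable blur with just this one loop.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One combined pass: box-blur each row of src (w x h, row-major,
// clamp-to-edge boundaries) and write each blurred row as a *column*
// of dst (which is h x w). Two such passes give a full separable blur,
// and there's only this one loop to optimize.
void blur_rows_transpose(const std::vector<float>& src, std::vector<float>& dst,
                         int w, int h, int r) {
    float scale = 1.0f / (2 * r + 1);
    for (int y = 0; y < h; y++) {
        const float* row = &src[(size_t)y * w];
        for (int x = 0; x < w; x++) {
            // Naive per-pixel sum for clarity; use the running-sum
            // update from the earlier pseudocode in practice.
            float sum = 0.0f;
            for (int d = -r; d <= r; d++) {
                int xs = x + d;
                if (xs < 0) xs = 0;
                if (xs >= w) xs = w - 1;
                sum += row[xs];
            }
            // Transposed write: output pixel (x, y) lands at
            // row x, column y of dst.
            dst[(size_t)x * h + y] = sum * scale;
        }
    }
}
```

Running the pass twice swaps width and height back, so the final image comes out in the original orientation.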
And that’s it! For reference, a version of the Werkkzeug3 Blur written using SSE2 intrinsics (and adapted from an earlier MMX-only version) is here – the function is GenBitmap::Blur. This one uses fixed-point arithmetic throughout, which gets a bit awkward in places, because we have to work on 32-bit numbers using the MMX set of arithmetic operations which doesn’t include any 32-bit multiplies. It also performs blurring and transpose in two separate steps, making a pass over the full image each time, for no good reason whatsoever. :) If there’s interest, I can write a more detailed explanation of exactly how that code works in a separate post; for now, I think I’ve covered enough material. Take care, and remember to blur responsibly!
This post is part of the series "Debris: Opening the box".
A bit of context first. As I explained before, the texture generators in RG2 / GenThree / Werkkzeug3 / Werkkzeug4 all use 16 bits of storage per color channel, with the actual values using only 15 bits (to work around some issues with the original MMX instruction set). Generated textures tend to go through lots of intermediate stages, and experience with some older texture generation experiments taught me that rounding down to 8 bits after every step comes at a noticeable cost in both quality and user convenience.
All those texture generators are written to use the CPU. For Werkkzeug4, this is simple because it’s using a very slightly modified version of the Werkkzeug3 texture generator; the others were written in the early 2000s, when Pixel Shaders were either not present or very limited, GPU arithmetic wasn’t standardized, video memory management and graphics drivers were crappy, there was no support for render target formats with more than 8 bits per pixel, and cards were on AGP instead of PCI Express so read-backs were seriously slow.
Because we were targeting CPU, we used algorithms that were appropriate for CPU execution, and one of the more CPU-centric ones was the blur we used – an improved version of “Blur v2” in RG2, which I wrote around 2001-2002 (not sure when exactly). This algorithm really doesn’t map well to the regular graphics pipeline, and for a long time I thought it was going to join the ranks of now-obsolete software rendering tricks. However, we now have Compute Shaders, and guess what – they’re actually a good fit for this algorithm! Which just goes to show that it’s worthwhile to know these tricks even when they’re not immediately useful right now.
Anyway, enough introduction. Time to talk blurs.
2D convolution
As usual, I’ll just describe the basics very briefly (there’s plenty of places on the web where you can get more details if you’re confused) so I can get to the interesting bits. Blurs are, in essence, 2D convolution filters, so each pixel of the output image is a linear combination (weighted sum) of the corresponding pixel in the input image and some of its neighbors. The set of weights (corresponding to the adjacent pixels) is called the “convolution kernel” or “filter kernel”. A typical example kernel would be
        | 1 2 1 |
 1/16 · | 2 4 2 |
        | 1 2 1 |

which happens to correspond to a simple blur filter (throughout this post, I'll use filters with odd dimensions, with the center of the kernel aligned with the output pixel). Different types of blur correspond to different kernels. There's also non-linear filters that can't be described this way, but I'll ignore those here. To apply such a filter, you simply sum the contributions of all adjacent input pixels, weighted by the corresponding value in the filter kernel, for each output pixel. This is nice, simple, and the amount of work per pixel directly depends on the size of the filter kernel: a 3×3 filter needs to sum 9 samples, a 5×5 filter 25 samples, and so forth. So the amount of work is roughly proportional to the blur radius (which determines the filter kernel size), squared.
For 3×3 pixels this is not a big deal. For 5×5 pixels it’s already borderline, and it’s very rare to see larger kernels implemented as direct 2D convolution filters.
Separable filters
Luckily, the filters we actually care about for blurs are separable: the 2D filter can be written as the sequential application of two 1D filters, one in the horizontal direction and one in the vertical direction. So instead of convolving the whole image with a 2D kernel, we effectively do two passes: first we blur all individual rows horizontally, then we blur all individual columns vertically. The kernel given above is separable, and can be factored into the product of two 1D kernels:
        | 1 2 1 |          | 1 |
 1/16 · | 2 4 2 |  =  1/4 · | 2 |  ×  1/4 · | 1 2 1 | .
        | 1 2 1 |          | 1 |
To apply this kernel, we first filter all pixels in the horizontal direction, then filter the result in the vertical direction. Both these passes sum 3 samples per pixel, so to apply our 3×3 kernel we need 6 samples total (in two passes) – a 33% reduction in number of samples taken, and so potentially up to 33% faster (with some major caveats; I’ll pass on this for a minute, but we’ll get there eventually). A 5×5 kernel takes 10 samples instead of 25 (60% reduction), a 7×7 kernel 14 instead of 49 (71% reduction), you get the idea.
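As a quick sanity check, here's a small C++ sketch (using the 3×3 binomial kernel with weights 1-2-1 from above as the example) verifying both the outer-product factorization and that two 1D passes match the direct 2D convolution at a pixel:

```cpp
#include <cassert>
#include <cmath>

// The 3x3 binomial blur kernel and its 1D factor [1 2 1]/4.
static const float K2D[3][3] = {
    {1/16.f, 2/16.f, 1/16.f},
    {2/16.f, 4/16.f, 2/16.f},
    {1/16.f, 2/16.f, 1/16.f},
};
static const float K1D[3] = {1/4.f, 2/4.f, 1/4.f};

// Check the factorization: K2D[i][j] == K1D[i] * K1D[j].
bool is_outer_product() {
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (std::fabs(K2D[i][j] - K1D[i] * K1D[j]) > 1e-6f)
                return false;
    return true;
}

// Direct 2D convolution at interior pixel (y, x) of a 5x5 image.
float direct2d(float img[5][5], int y, int x) {
    float s = 0;
    for (int i = -1; i <= 1; i++)
        for (int j = -1; j <= 1; j++)
            s += K2D[i + 1][j + 1] * img[y + i][x + j];
    return s;
}

// Same pixel via two 1D passes: horizontal first, then vertical.
float separable(float img[5][5], int y, int x) {
    float rows[3]; // horizontal pass on the three rows the vertical pass needs
    for (int i = -1; i <= 1; i++)
        rows[i + 1] = K1D[0] * img[y + i][x - 1] + K1D[1] * img[y + i][x]
                    + K1D[2] * img[y + i][x + 1];
    return K1D[0] * rows[0] + K1D[1] * rows[1] + K1D[2] * rows[2];
}
```

Both paths produce the same output pixel, which is exactly what separability promises.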
Also, nobody forces us to use the same kernel in both directions. We can have different kernels of different sizes if we want to; for example, we might want to have a stronger blur in the horizontal direction than in the vertical, or even use different filter types. If the width of the horizontal kernel is p, and the height of the vertical kernel is q, the two-pass approach takes p+q samples per pixel, vs. p×q for a full 2D convolution. A linear amount of work per pixel is quite a drastic improvement from the quadratic amount we had before, and while large-radius blurs may still not be fast, at least they’ll complete within a reasonable amount of time.
So far, this is all bog-standard. The interesting question is, can we do less than a linear amount of work per pixel and still get the right results? And if so, how much less? Logarithmic time? Maybe even constant time? Surprisingly, we can actually get down to (almost) constant time if we restrict our selection of filters, but let’s look at a more general approach first.
Logarithmic time per pixel: convolution and the discrete-time Fourier transform
But even for general kernels (separable or not), it turns out we can do better than linear time per pixel, if we're willing to do a bit more setup work. The key here is the convolution theorem: convolution is equivalent to pointwise multiplication in the frequency domain. In 1D, we can compute convolutions by transforming the row/column in question to the frequency domain using the Fast Fourier Transform (in the general case, we need to add padding around the image to get the boundary conditions right), multiplying point-wise with the FFT of the filter kernel, and then doing an inverse FFT. The forward/inverse FFTs have a cost of O(n log n) (where n is either the width or height of the image), and the pointwise multiplication is linear, so our total cost for transforming a full row/column with a separable filter is also O(n log n) – a logarithmic amount of work per pixel. And of course, we only compute the FFT of the filter kernel once. Note that the size of our FFT depends on both the size of the image and the size of the convolution kernel, so we didn't lose the dependency on filter kernel size completely, even though it may seem that way.
The same approach also works for non-separable filters, using a 2D FFT. The FFT as a linear transform is separable, so computing the FFT of a 2D image with N×N pixels is still a logarithmic amount of work per pixel. This time, we need the 2D FFT of the filter kernel to multiply with, but the actual convolution in frequency space is again just pointwise multiplication.
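To make the convolution theorem concrete, here's a small C++ sketch. I'm using a naive O(n²) DFT as a stand-in for a real FFT (same transform, just slower), and all the function names are mine:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;
static const double PI = 3.14159265358979323846;

// Naive DFT / inverse DFT -- O(n^2), a stand-in for a real FFT.
std::vector<cplx> dft(const std::vector<cplx>& x, bool inverse) {
    size_t n = x.size();
    std::vector<cplx> X(n);
    double sign = inverse ? 2.0 : -2.0;
    for (size_t k = 0; k < n; k++) {
        cplx acc = 0;
        for (size_t j = 0; j < n; j++)
            acc += x[j] * std::polar(1.0, sign * PI * k * j / n);
        X[k] = inverse ? acc / (double)n : acc;
    }
    return X;
}

// Convolution via the frequency domain: transform both signals, multiply
// pointwise, transform back. This computes *circular* convolution, so we
// zero-pad to length |a|+|b|-1 to make it match ordinary convolution --
// that's the padding mentioned in the text.
std::vector<double> convolve_via_dft(const std::vector<double>& a,
                                     const std::vector<double>& b) {
    size_t n = a.size() + b.size() - 1;
    std::vector<cplx> A(n), B(n);
    for (size_t i = 0; i < a.size(); i++) A[i] = a[i];
    for (size_t i = 0; i < b.size(); i++) B[i] = b[i];
    A = dft(A, false);
    B = dft(B, false);
    for (size_t i = 0; i < n; i++) A[i] *= B[i];
    A = dft(A, true);
    std::vector<double> c(n);
    for (size_t i = 0; i < n; i++) c[i] = A[i].real();
    return c;
}
```

Convolving [1, 2, 3] with [1, 1] this way reproduces the direct result [1, 3, 5, 3].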
While all of this is nice, general and very elegant, it all gets a bit awkward once you try to implement it; the FFT has higher dynamic range than its inputs (so if you compute the FFT of an 8-bit/channel image, 8 bits per pixel aren’t going to be enough), and the FFT of real signals is complex-valued (albeit with some symmetry properties). The padding also means that you end up enlarging the image considerably before you do the actual convolution, and combined with the higher dynamic range it means we need a lot of temporary space. So yes, you can implement filters this way, and for large radii it will win over the direct separable implementation, but at the same time it tends to be significantly slower (and certainly more cumbersome) for smaller kernels (which are important in practice), so it’s not an easy choice to make.
Thinking inside the box
Instead, let’s try something different. Instead of starting with a general filter kernel and trying to figure out how to apply it efficiently, let’s look for filters that are cheap to apply even without doing any FFTs. A very simple blur filter is the box filter: just take N sequential samples and average them together with equal weights. It’s called a box filter because that’s what a plot of the impulse response looks like. So for example a 5-tap 1D box filter would compute
y[n] = ( x[n-2] + x[n-1] + x[n] + x[n+1] + x[n+2] ) / 5.
Now, because all samples are weighted equally, it's very easy to compute the value at location n+1 from the value at location n – the filter just computes a moving average:

y[n+1] = y[n] + ( x[n+3] - x[n-2] ) / 5
The middle part of the sum stays the same, it’s just one sample coming in at one end and another dropping off at the opposite end. So once we have the value at one location, we can sweep through pixels sequentially and update the sum with two samples per pixel, independent of how wide the filter is. We compute the full box filter once for the first pixel, then just keep updating it. Here’s some pseudo-code:
// Compute box-filtered version of x with a (2*r+1)-tap filter,
// ignoring boundary conditions.
float scale = 1.0f / (2*r + 1); // or use fixed point
// Compute sum at first pixel. Remember this is an odd-sized
// box filter kernel.
float sum = x[0];
for (int i=1; i <= r; i++)
    sum += x[-i] + x[i];
// Generate output pixel, then update running sum for next pixel.
for (int i=0; i < n; i++) {
    y[i] = sum * scale;
    sum += x[i+r+1] - x[i-r];
}
In practice, you need to be careful near the borders of the image and define proper boundary conditions. The natural options are pretty much the boundary rules supported by texture samplers: define a border color for pixels outside the image region, clamp to the edge, wrap around (periodic extension), mirror (symmetric extension) and so forth. Some people also just decrease the size of the blur near the edge, but that is trickier to implement and tends to produce strange-looking results; I’d advise against it. Also, in practice you’ll support multiple color channels, but the whole thing is linear and we treat all color channels equally, so this doesn’t add any complications.
This is all very nice and simple. It's also fast – if we have an n-pixel image and an m-tap box filter, using this trick drops us from n×m samples down to m + 2n samples for a full column or row, and hence down from m samples per pixel to (m/n + 2) samples per pixel. Now the m/n term is generally less than 1, because it makes little sense to blur an image with a kernel wider than the image itself is, so this is for all practical purposes bounded by a constant amount of samples per pixel (less than 3).
There’s two problems, though: First, while box filters are cheap to compute, they make for fairly crappy blur filters. We’d really like to use better kernels, preferably Gaussian kernels (or something very similar). Second, the implementation as given only supports odd-sized blur kernels, so we can only increase the filter size in increments of 2 pixels at a time. That’s extremely coarse. We could try to support even-sized kernels too (it works exactly the same way), but even-sized kernels shift phase by half a pixel, and pixel granularity is still very coarse; what we really want is sub-pixel resolution, and this turns out to be reasonably easy to do.
That said, this post is already fairly long and I’m not even halfway through the material I want to cover, so I guess this is going to be another multi-parter. So we’ll pick up right here in part two.
Bonus: Summed area tables
I got one more though: summed area tables, also known as “integral images” in the Computer Vision community. Wikipedia describes the 2D version, but I’ll give you the 1D variant for comparison with the moving average approach. What you do is simply calculate the cumulative sums of all pixels in a row (column):
S[n] = x[0] + x[1] + … + x[n]

so

S[0] = x[0],
S[1] = x[0] + x[1],
S[2] = x[0] + x[1] + x[2],
…
S[n] = S[n-1] + x[n].
Now, the fundamental operation in box filtering was simply computing the sum across a contiguous span of pixels; let's write this as

B[l,r] = x[l] + x[l+1] + … + x[r]

which given S reduces to

B[l,r] = S[r] - S[l-1].
In other words, once we know the cumulative sums for a given row (which take n operations to compute), we can again compute arbitrary-width box filters using two samples per pixel. However, the cumulative sums usually perform more setup work per row/column than the moving average approach described above, and boundary conditions other than a constant color border are trickier to implement. They do have one big advantage though: once you have the cumulative sums, it’s really easy to use a different blur radius per pixel. Neither moving averages nor the FFT approach can do this, and it’s a pretty neat feature.
I’ve described the 1D version; the Wikipedia link describes actual summed area tables which are the 2D version, and the same approach also generalizes to higher dimensions. They all allow applying a N-dimensional box filter using a constant number of samples, and they all take linear time (in the number of elements in the dataset) to set up. As said, I won’t be using them here, but they’re definitely worth knowing about.
This is a cute little trick that would be way more useful if there was proper OS support for it. It’s useful in a fairly small number of cases, but it’s nice enough to be worth writing up.
Ring buffers
I’m assuming you know what a ring buffer is, and how it’s typically implemented. I’ve written about it before, focusing on different ways to keep count of the fill state (and what that means for the invariants). Ring buffers are a really nice data structure – the only problem is that everything that directly interfaces with ring buffers needs to be aware of the fact, to handle wrap-around correctly. With careful interface design, this can be done fairly transparently: if you use the technique I described in “Buffer-centric IO”, the producer has enough control over the consumer’s view on the data to make the wrap-around fully transparent. However, while this is a nice way of avoiding copies in the innards of an IO system, it’s too unwieldy to pass down to client code. A lot of code is written assuming data that is completely linear in memory, and using such code with ring buffers requires copying the data out to a linear block in memory first.
Unless you’re using what I’m calling a “magic ring buffer”, that is.
The idea
The underlying concept is quite simple: “unwrap” the ring by placing multiple identical copies of it right next to each other in memory. In theory you can have an arbitrary number of copies, but in practice two is sufficient in basically all practical use cases, so that’s what I’ll be using here.
Of course, keeping the multiple copies in sync is both annoying and tricky to get right, and essentially doing every memory access twice is a bad idea. Luckily, we don’t actually need to keep two physical copies of the data around. Really the only thing we need is to have the same physical data in two separate (logical, or virtual) memory locations – and virtual memory (paging) hardware is, at this point, ubiquitous. Using paging means we get some external constraints: buffers have to have sizes that are (at least) a multiple of the processor’s page size (potentially larger, if virtual memory is managed at a coarser granularity) and meet some alignment constraints. If those restrictions are acceptable, then really all we need is some way to get the OS to map the same physical memory multiple times into our address space, using calls available from user space.
The right facility turns out to be memory mapped files: modern OSes generally provide support for anonymous mmaps, which are “memory mapped files” that aren’t actually backed by any file at all (except maybe the swap file). And using the right incantations, these OSes can indeed be made to map the same region of physical memory multiple times into a contiguous virtual address range – exactly what we need!
The code
I’ve written a basic implementation of this idea in C++ for Windows. The code is available here. The implementation is a bit dodgy (see comments) since Windows won’t let me reserve a memory region for memory-mapping (as far as I can tell, this can only be done for allocations), so there’s a bit of a song and dance routine involved in trying to get an address we can map to – other threads might end up allocating the memory range we just found between us freeing it and completing our own mapping, so we might have to retry several times.
On the various Unix flavors, you can try the same basic principle, though you might actually need to create a backing file in some cases (I don’t see a way to do it without when relying purely on POSIX functionality). For Linux you should be able to do it using an anonymous shared mmap followed by remap_file_pages. No matter which Unix flavor you’re on, you can do it without a race condition, so that part is much nicer. (Though it’s really better without a backing file, or at most a backing file on a RAM disk, since you certainly don’t want to cause disk IO with this).
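For the Linux case specifically, a minimal sketch might look like this (using memfd_create, so it won't work on other Unixes as-is; magic_ring_alloc and its error handling are my own, and a real version would want a matching free function):

```cpp
#include <cassert>
#include <sys/mman.h>
#include <unistd.h>

// Map the same physical pages twice, back to back, so the ring buffer's
// contents appear linear across the wrap-around point.
// size must be a multiple of the page size. Returns nullptr on failure;
// free with munmap(base, 2 * size).
char* magic_ring_alloc(size_t size) {
    // Anonymous in-memory file holding the buffer's pages (Linux-specific).
    int fd = memfd_create("ring", 0);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, size) < 0) { close(fd); return nullptr; }
    // Reserve 2*size of contiguous address space...
    char* base = (char*)mmap(nullptr, 2 * size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { close(fd); return nullptr; }
    // ...then map the file's pages into both halves of the reservation.
    void* lo = mmap(base, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void* hi = mmap(base + size, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd); // the mappings keep the memory alive
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * size);
        return nullptr;
    }
    return base;
}
```

Because MAP_FIXED replaces pages inside our own PROT_NONE reservation, there's no window where another thread can grab the address range – this is the race-free property mentioned above.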
The code also has a small example to show it in action.
UPDATE: In the comments there is now also a link to two articles describing how to implement the same idea (and it turns out using pretty much the same trick) on MacOS X, and it turns out that Wikipedia has working code for a (race-condition free) POSIX variant as described earlier.
Coda: Why I originally wanted this
There’s a few cases where this kind of thing is useful (several of them IO-related, as mentioned in the introduction), but the case where I originally really wanted this was for inter-thread communication. The setting had one thread producing variable-sized commands and one thread consuming them, with a SPSC queue in between them. Without a magic ring buffer, doing this was a major hassle: wraparound could theoretically happen anywhere in the middle of a command (well, at any word boundary, anyway), so this case was detected and there was a special command to skip ahead in the ring buffer that was inserted whenever the “real” command would’ve wrapped around. With a magic ring buffer, all the logic and special cases just disappear, and so does some amount of wasted memory. It’s not huge, but it sure is nice.
The last part had a bit of theory and terminology groundwork. This part will be focusing on the computational side instead.
Slicing up matrices
In the previous parts, I’ve been careful to distinguish between vectors and their representation as coefficients (which depends on a choice of basis), and similarly I’ve tried to keep the distinction between linear transforms and matrices clear. This time, it’s all about matrices and their properties, so I’ll be a bit sloppier in this regard. Unless otherwise noted, assume we’re dealing exclusively with the vector spaces R^n (for various n) using their canonical bases. In this setting (all bases fixed in advance), we can uniquely identify linear transforms with their matrices, and that’s what I’ll do. However, I’ll be switching between scalars, vectors and matrices a lot, so to avoid confusion I’ll be a bit more careful about typography: lowercase letters like a, b denote scalars, lowercase bold-face letters like a, b denote column vectors, row vectors (when treated as vectors) will be written as the transpose of column vectors, like a^T, and matrices use upper-case bold-face letters like A. A vector a is made of constituent scalars a_i, and so is a matrix (with two sets of indices): A = (a_ij). Note all these overlap to a degree: we can write a 1×1 matrix as a scalar, vector or matrix, and similarly an n×1 (or 1×n) matrix can be written either as a vector or a matrix. In this context, matrices are the most general kind of object we’ll be dealing with, so unless something needs to be a vector or scalar, I’ll write it as a matrix.
All that said, let’s take another look at matrices. As I explained before (in part 1), the columns of a matrix contain the images of the basis vectors. Let’s give those vectors names:
A = ( a_1  a_2  ⋯  a_n ).
This is just taking the n column vectors making up A and giving them distinct names. This is useful when you look at a matrix product:
B A = ( B a_1  B a_2  ⋯  B a_n ).
You can prove this algebraically by expanding out the matrix product and doing a bunch of index manipulation, but I’d advise against it: don’t expand out into scalars unless you have exhausted every other avenue of attack – it’s tedious and extremely error-prone. You can also prove this by exploiting the correspondence between linear transforms and matrices. This is elegant, very short and makes for a nice exercise, so I won’t spoil it. But there’s another way to prove it algebraically, without expanding it out into scalars, that’s more in line with the spirit of this series: getting comfortable with manipulating matrix expressions.
The key is writing the as the result of a matrix expression. Now, as explained before, the i-th column of a matrix contains the image of the i-th basis vector. Since we’re using the canonical basis, our basis vectors are simply
…
and it’s easy to verify that (this one you’ll have to expand out if you want to prove it purely algebraically, but since the product is just with ones and zeros, it’s really easy to check). So multiplying
from the right by one of the
gives us the i-th column of a, and conversely, if we have all n column vectors, we can piece together the full matrix by gluing them together in the right order. So let’s look at our matrix product again: we wan’t the i-th column of the matrix product
, so we look at
exactly as claimed. As always, there’s a corresponding construction using row vectors. Using the dual basis of linear functionals (note no transpose this time!)
…
we can disassemble a matrix into its rows:
and since , we can write matrix products in terms of what happens to the row vectors too (though this time, we’re expanding in terms of the row vectors of the first not second factor):
ē_i (AB) = (ē_i A) B.
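If you want to convince yourself numerically, here’s a quick sanity check of both identities. I’m using numpy purely for illustration; the shapes and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # the a_i are the columns of A
B = rng.standard_normal((5, 4))
C = rng.standard_normal((3, 6))

# The i-th column of BA is B times the i-th column of A:
for i in range(3):
    assert np.allclose((B @ A)[:, i], B @ A[:, i])

# And the i-th row of AC is the i-th row of A times C:
for i in range(4):
    assert np.allclose((A @ C)[i, :], A[i, :] @ C)
```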
Gluing it back together
Above, the trick was to write the “slicing” step as a matrix product with one of the basis vectors – a_i = Ae_i and so forth. We’ll soon need to deal with the gluing step too, so let’s work out how to write that step as a matrix expression as well, so we can manipulate it easily.
Let’s say we have an m×2 matrix A that we’ve split into two column vectors a_1, a_2:
A = [ a_1 | a_2 ]. What’s the matrix expression that puts A back together from those columns again? So far we’ve expressed that as a concatenation; but to be able to manipulate it nicely, we want it to be a linear expression (same as everything else we’re dealing with). So it’s gotta be a sum: one term for a_1 and one for a_2. Well, the sum is supposed to be A, which is an m×2 matrix, so the summands all have to be m×2 matrices too. The a_i are column vectors (m×1 matrices), so for the result to be an m×2 matrix, we have to multiply from the right with a 1×2 matrix. From there, it’s fairly easy to see that the expression that re-assembles A from its column vectors is simply:
A = a_1 (1  0) + a_2 (0  1).
Note that the terms have the form column vector × row vector – this is the general form of the vector outer product (or dyadic product) that we first saw in the previous part. If A has more than two columns, this generalizes in the obvious way.
So what happens if we disassemble a matrix only to re-assemble it again? Really, this should be a complete no-op. Let’s check (writing ē_i for e_i as a row vector):
Σ_i a_i ē_i = Σ_i (A e_i) ē_i = A (Σ_i e_i ē_i) = A I = A.
For the last step, note that the summands are matrices that are all-zero, except for a single one in row i, column i. Adding all these together produces a matrix that’s zero everywhere except on the diagonal, where it’s all ones – in short, the n×n identity matrix. So yes, disassembling and re-assembling a matrix is indeed a no-op. Who would’ve guessed.
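Here’s the disassemble/re-assemble round trip checked numerically – again numpy, again an arbitrary example matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
n = A.shape[1]
E = np.eye(n)  # columns of E are the basis vectors e_i

# Glue A back together from its columns: sum of (A e_i) times e_i-as-a-row.
glued = sum(np.outer(A @ E[:, i], E[:, i]) for i in range(n))
assert np.allclose(glued, A)

# The reason it works: the outer products e_i ē_i sum to the identity.
assert np.allclose(sum(np.outer(E[:, i], E[:, i]) for i in range(n)), np.eye(n))
```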
Again, the same thing can be done for the rows; instead of multiplying by the row vectors ē_i from the right, you end up multiplying by the column vectors e_i from the left, but same difference. So that covers slicing a matrix into its constituent vectors (of either the row or column kind) and putting it back together. Things get a bit more interesting (and a lot more useful) when we allow more general submatrices.
Block matrices
In our first example above, we sliced A into n separate column vectors. But what if we slice it into just two parts, a left “half” and a right “half” (the sizes need not be the same), both of which are general matrices? Let’s try:
A = [ A_1 | A_2 ].
For the same reasons as with column vectors, multiplying a second matrix B from the left just ends up acting on the halves separately:
BA = B [ A_1 | A_2 ] = [ BA_1 | BA_2 ],
and the same also works with right multiplication on vertically stacked matrices. Easy, but not very interesting yet – we’re effectively just keeping some columns (rows) glued together through the whole process. It gets more interesting when you start slicing in the horizontal and vertical directions simultaneously, though:
A = [ A_11  A_12 ]
    [ A_21  A_22 ].
Note that for the stuff I’m describing here to work, the “cuts” between blocks need to be uniform across the whole matrix – that is, all matrices in a block column need to have the same width, and all matrices in a block row need to have the same height. So in our case, let’s say A is m×n and A_11 is a p×q matrix. Then A_12 must be a p×(n−q) matrix (the heights have to agree and A is n columns wide), A_21 is (m−p)×q, and A_22 is (m−p)×(n−q).
Adding block matrices is totally straightforward – it’s all element-wise anyway. Multiplying block matrices is more interesting. For regular matrix multiplication BA, we require that B has as many columns as A has rows; for block matrix multiplication, we’ll also require that B has as many block columns as A has block rows, and that all of the individual block sizes are compatible as well. Given all that, how does block matrix multiplication work? Originally I meant to give a proof here, but frankly it’s all notation and not very enlightening, so let’s skip straight to the punchline:
[ B_1 | B_2 ] [ A_11  A_12 ]  =  [ B_1 A_11 + B_2 A_21 | B_1 A_12 + B_2 A_22 ].
              [ A_21  A_22 ]
Block matrix multiplication works just like regular matrix multiplication: you compute the “dot product” between rows of B and columns of A. This is all independent of the sizes too – I show it here with a matrix of 1×2 blocks and a matrix of 2×2 blocks because that’s the smallest interesting example, but you can have arbitrarily many blocks involved.
So what does this mean? Two things: First, most big matrices that occur in practice have a natural block structure, and the above property means that for most matrix operations, we can treat the blocks as if they were scalars in a much smaller matrix. Even if you don’t deal with big matrices, working at block granularity is often a lot more convenient. Second, it means that big matrix products can be naturally expressed in terms of several smaller ones. Even when dealing with big matrices, you can just chop them up into smaller blocks that nicely fit in your cache, or your main memory if the matrices are truly huge.
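The punchline is easy to verify numerically. A sketch using numpy (the block sizes below are arbitrary, and `np.block` just does the gluing for us):

```python
import numpy as np

rng = np.random.default_rng(2)
# A split into 2x2 blocks, B into 1x2 blocks, matching the example in the text.
A11, A12 = rng.standard_normal((3, 2)), rng.standard_normal((3, 4))
A21, A22 = rng.standard_normal((5, 2)), rng.standard_normal((5, 4))
B1,  B2  = rng.standard_normal((6, 3)), rng.standard_normal((6, 5))

A = np.block([[A11, A12], [A21, A22]])  # 8x6
B = np.block([[B1, B2]])                # 6x8

# Block multiplication: treat the blocks like scalars in a smaller matrix.
blockwise = np.block([[B1 @ A11 + B2 @ A21, B1 @ A12 + B2 @ A22]])
assert np.allclose(B @ A, blockwise)
```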
All that said, the main advantage of block matrices as I see it is just that they add a nice, in-between level of granularity: dealing with the individual scalars making up a matrix is unwieldy and error-prone, but sometimes (particularly when you’re interested in some structure within the matrix) operating on the whole matrix at once is too coarse-grained.
Example: affine transforms in homogeneous coordinates
To show what I mean, let’s end with a familiar example, at least for graphics/game programmers: matrices representing affine transforms when using homogeneous coordinates. The matrices in question look like this:
M = [ A  t ]
    [ 0  1 ]
where A is an arbitrary square matrix and t is a translation vector (the 0 in the bottom row means a zero row vector, not a scalar 0!). So how does the product of two such matrices look? Well, this has obvious block structure, so we can use a block matrix product without breaking A up any further:
[ A_2  t_2 ] [ A_1  t_1 ]  =  [ A_2 A_1   A_2 t_1 + t_2 ]
[ 0    1   ] [ 0    1   ]     [ 0         1             ].
Note this works in any dimension – I just required that A was square, I never specified what the actual size was. This is a fairly simple example, but it’s a common case and handy to know.
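Here’s that product checked in numpy; the `affine` helper is just my name for the packing step, not anything standard:

```python
import numpy as np

def affine(A, t):
    """Pack a linear part A (n x n) and translation t (n) into an (n+1)x(n+1) homogeneous matrix."""
    n = A.shape[0]
    M = np.eye(n + 1)
    M[:n, :n] = A
    M[:n, n] = t
    return M

rng = np.random.default_rng(3)
A1, t1 = rng.standard_normal((3, 3)), rng.standard_normal(3)
A2, t2 = rng.standard_normal((3, 3)), rng.standard_normal(3)

# The block product says: the linear parts multiply, and the combined
# translation is A_2 t_1 + t_2 (the first translation pushed through the second map).
M = affine(A2, t2) @ affine(A1, t1)
assert np.allclose(M, affine(A2 @ A1, A2 @ t1 + t2))
```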
And that should be enough for this post. Next time, I plan to first review some basic identities (and their block matrix analogues) then start talking about matrix decompositions. Until then!
In the previous part I covered a bunch of basics. Now let’s continue with stuff that’s a bit more fun. Small disclaimer: In this series, I’ll be mostly talking about finite-dimensional, real vector spaces, and even more specifically ℝ^n for some n. So assume that’s the setting unless explicitly stated otherwise; I don’t want to bog the text down with too many technicalities.
(Almost) every product can be written as a matrix product
In general, most of the functions we call “products” share some common properties: they’re examples of “bilinear maps”, that is vector-valued functions of two vector-valued arguments which are linear in both of them. The latter means that if you hold either of the two arguments constant, the function behaves like a linear function of the other argument. Now we know that any linear function can be written as a matrix product f(x) = Mx for some matrix M, provided we’re willing to choose a basis.
Okay, now take one such product-like operation between vector spaces, let’s call it a ∗ b. What the above sentence means is that for any a, there is a corresponding matrix M_a such that a ∗ b = M_a b (and also a matrix N_b such that a ∗ b = N_b a, but let’s ignore that for a minute). Furthermore, since a product is linear in both arguments, M_a itself (respectively N_b) is a linear function of a (respectively b) too.
This is all fairly abstract. Let’s give an example: the standard dot product. The dot product of two (say, 3-dimensional) vectors a and b is the number a·b = a_1 b_1 + a_2 b_2 + a_3 b_3. This should be well known. Now let’s say we want to find this matrix – call it M_a – for some a. First, we have to figure out the correct dimensions. For fixed a, the dot product a·b is a scalar-valued function of the remaining argument b; so the matrix that represents “a-dot” maps a 3-vector to a scalar (1-vector); in other words, it’s a 1×3 matrix. In fact, as you can verify easily, the matrix representing “a-dot” is just “a” written as a row vector – or written as a matrix expression, M_a = a^T. For the full dot product expression, we thus get
a·b = a^T b = b^T a
(because the dot product is symmetric, we can swap the positions of the two arguments). This works for any dimension of the vectors involved, provided they match of course. More importantly, it works the other way round too – a 1-row matrix represents a scalar-valued linear function (more concisely called a “linear functional”), and in the case of the finite-dimensional spaces we’re dealing with, all such functions can be written as a dot product with a fixed vector.
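A tiny numpy illustration of “a-dot” as a 1×3 matrix (the name `M_a` mirrors the text, it’s not a standard API):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# The matrix representing "a-dot" is just a written as a row vector.
M_a = a.reshape(1, 3)

assert np.allclose(M_a @ b, np.dot(a, b))                # a·b = a^T b
assert np.allclose(b.reshape(1, 3) @ a, np.dot(a, b))    # = b^T a, by symmetry
```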
The same technique works for any given bilinear map. Especially if you already know a formula that works on coordinate vectors, in which case you can instantly write down the matrix (same as in part 1, just check what happens to your basis vectors). To give a second example, take the cross product in three dimensions. The corresponding matrix looks like this:
[a]_× = [  0    −a_3    a_2 ]
        [  a_3    0    −a_1 ]
        [ −a_2   a_1     0  ].
The [a]_× is standard notation for this construction. Note that in this case, because the cross product is vector-valued, we have a full 3×3 matrix – and not just any matrix: it’s a skew-symmetric matrix, i.e. M^T = −M. I might come back to those later.
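Here’s the cross-product matrix checked against a library cross product – numpy again, with a helper name (`cross_matrix`) that’s mine, not standard:

```python
import numpy as np

def cross_matrix(a):
    """The skew-symmetric matrix [a]_x, with cross_matrix(a) @ b == a x b."""
    return np.array([[    0, -a[2],  a[1]],
                     [ a[2],     0, -a[0]],
                     [-a[1],  a[0],     0]])

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

assert np.allclose(cross_matrix(a) @ b, np.cross(a, b))
# Skew-symmetry: M^T = -M.
assert np.allclose(cross_matrix(a).T, -cross_matrix(a))
```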
So what we have now is a systematic way to write any “product-like” function of a and b as a matrix product (with a matrix depending on one of the two arguments). This might seem like a needless complication, but there’s a purpose to it: being able to write everything in a common notation (namely, as a matrix expression) has two advantages: first, it allows us to manipulate fairly complex expressions using uniform rules (namely, the rules for matrix multiplication), and second, it allows us to go the other way – take a complicated-looking matrix expression and break it down into components that have obvious geometric meaning. And that turns out to be a fairly powerful tool.
Projections and reflections
Let’s take a simple example: assume you have a unit vector v, and a second, arbitrary vector x. Then, as you hopefully know, the dot product v·x is a scalar representing the length of the projection of x onto v. Take that scalar and multiply it by v again, and you get a vector that represents the component of x that is parallel to v:
x_∥ = v (v·x) = v (v^T x) = (v v^T) x.
See what happened there? Since it’s all just matrix multiplication, which is associative (we can place parentheses however we want), we can instantly get the matrix v v^T that represents parallel projection onto v. Similarly, we can get the matrix for the corresponding orthogonal component:
x_⊥ = x − x_∥ = Ix − (v v^T) x = (I − v v^T) x.
All it takes is the standard algebra trick of multiplying by 1 (or in this case, an identity matrix); after that, we just use linearity of matrix multiplication. You’re probably more used to exploiting it when working with vectors (stuff like 2x + 3x = 5x), but it works in both directions and with arbitrary matrices: A(B + C) = AB + AC and (A + B)C = AC + BC – matrix multiplication is another bilinear map.
Anyway, with the two examples above, we get a third one for free: we’ve just separated x into two components, x = x_∥ + x_⊥. If we keep the orthogonal part but flip the parallel component, we get a reflection about the plane through the origin with normal v. This is just x_⊥ − x_∥, which is again linear in x, and we can get the matrix for the whole by subtracting the two other matrices:
(I − v v^T) − v v^T = I − 2 v v^T.
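All three matrices are one `np.outer` away in numpy; here’s a quick check that they behave as described (random unit vector, random x):

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.standard_normal(3)
v /= np.linalg.norm(v)           # v must be a unit vector
x = rng.standard_normal(3)

P_par  = np.outer(v, v)                   # v v^T: parallel projection onto v
P_perp = np.eye(3) - P_par                # I - v v^T: orthogonal component
R      = np.eye(3) - 2 * np.outer(v, v)   # I - 2 v v^T: reflection about the plane with normal v

assert np.allclose(P_par @ x, v * np.dot(v, x))    # v (v·x)
assert np.allclose(P_par @ x + P_perp @ x, x)      # the two components sum to x
assert np.allclose(R @ x, P_perp @ x - P_par @ x)  # keep the perp part, flip the parallel part
```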
None of this is particularly fancy (and most of it you should know already), so why am I going through this? Two reasons. First off, it’s worth knowing, since all three special types of matrices tend to show up in a lot of different places. And second, they give good examples for transforms that are constructed by adding something to (or subtracting from) the identity map; these tend to show up in all kinds of places. In the general case, it’s hard to mentally visualize what the sum (or difference) of two transforms does, but orthogonal complements and reflections come with a nice geometric interpretation.
I’ll end this part here. See you next time!
I’m prone to ranting about people making their lives more complicated than necessary with regards to math. Which I’ve been doing quite often lately. So rather than just pointing fingers and shouting “you’re doing it wrong!”, I thought it would be a good idea to write a post about what I consider good practice when dealing with problems in linear algebra and (analytic) geometry. This is basically just a collection of short articles. I have enough material to fill several posts; I’m just gonna write more whenever I feel like it. (Don’t worry, I didn’t abandon my series on Debris, we just had a few gorgeous weekends in a row, and living as I do in the Pacific Northwest sunny weekend days are a limited resource, so I didn’t do a lot of writing lately).
Get comfortable manipulating matrix expressions
In high school algebra, you learn how to deal with problems such as solving 3x + 5 = 11 for x using algebraic manipulations. The exact same thing can be done with expressions involving vectors and matrices, albeit with slightly more restrictive rules. It’s worthwhile to get familiar (and comfortable) with this. The most important rules to keep in mind are:
- Vector and matrix addition/subtraction can be handled exactly the same way as their scalar counterparts.
- Since for matrices in general AB ≠ BA, multiplying an equation by a matrix must be done consistently – either from the left on both sides, or from the right on both sides.
- As an exception to the above, multiplying a matrix/vector etc. by a scalar is commutative, so the direction doesn’t matter, and the scalar can be moved around in the expression to wherever it is convenient.
- Multiplying any scalar equation by 0 turns it into “0 = 0” which is true but useless. For matrix equations, the equivalent is that the matrices you multiply by must be non-singular, i.e. they must be square and have non-zero determinant.
Example: Consider the map y = Ax + b. This is an affine transformation, i.e. a linear transformation followed by a translation. Let’s solve this for x:
y = Ax + b
⇔ y − b = Ax
⇔ A⁻¹(y − b) = x
⇔ x = A⁻¹y − A⁻¹b.
The only interesting step here is multiplying from the left by A⁻¹, the rest is very basic (and shown in more detail than I will use for the rest of this post). Note the result is in the same form as the expression we started with – meaning that the inverse of an affine map (when it exists) is again an affine map, not very surprisingly.
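The derivation above can be checked numerically in a few lines – numpy here, with an arbitrary (seeded) example:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))  # assumed non-singular (true for this seed)
b = rng.standard_normal(3)
x = rng.standard_normal(3)

y = A @ x + b                    # the affine map y = Ax + b
A_inv = np.linalg.inv(A)

# Solving for x gives another affine map: x = A^{-1} y - A^{-1} b.
assert np.allclose(A_inv @ y - A_inv @ b, x)
```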
Matrices depend on the choice of basis
This one is really important.
In linear algebra, the primary objects we deal with are vectors and linear transforms – that is, functions that map vectors to other vectors and are linear. Neither concept involves coordinates. Particularly among programmers, there’s this meme that “vector” is basically just fancy math-speak for an array of numbers. This is wrong. A vector is a mathematical object, just like the number ten. The number ten can be represented in lots of different ways: as text, as “10” in decimal notation, as “1010” in binary notation, as “X” using roman numerals and so forth. These representations are good for different things: decimal and binary have reasonably simple algorithms for addition, subtraction, multiplication and division. Roman numerals are reasonably easy to add and subtract but considerably trickier to multiply, and the textual representation is related to the pronunciation of numbers when we read them out loud, but completely useless to do arithmetic on.
Writing a vector as a tuple of numbers such as (1, 2, 3) is a particular representation that means “one times the first basis vector, plus two times the second basis vector, plus three times the third basis vector”. This representation depends on the choice of basis vectors. Any non-zero vector can end up represented as (1, 0, …, 0), given the right choice of basis. Similarly, a matrix depends on the basis for the source and destination vector spaces, and a given linear transform can have any number of different matrix representations, corresponding to different choices of basis vectors. The important bit here is that even though the numbers can be made almost anything, they’re still describing the same mathematical object, and everything that’s true for this mathematical object will be true no matter which representation we pick.
This means that we have a considerable amount of freedom in choosing a convenient representation. As long as we’re writing code that performs computation, we’ll still drop down to matrices and numbers eventually. But generally, we want to postpone that choice until we absolutely have to – the later we nail it down, the more information we have to make the resulting computation easier. And a lot of stuff can be done without ever choosing an explicit basis or coordinate system at all.
What’s in a matrix?
Now I’ve explained what the values in the coordinate representation of a vector mean, but I haven’t yet done the same for the values in a matrix. Time to fix this. Note that with a matrix, there are generally two vector spaces (and hence two bases) involved, so we have to be careful about which is which. Here’s a simple matrix-vector product expanded out using the usual matrix multiplication rule and then rearranged slightly:
Ax = (a_11 x_1 + a_12 x_2 + a_13 x_3, a_21 x_1 + a_22 x_2 + a_23 x_3)^T
   = x_1 (a_11, a_21)^T + x_2 (a_12, a_22)^T + x_3 (a_13, a_23)^T.
This is straightforward algebra, but it’s key to figuring out what matrices actually mean. Note that the vectors appearing in that last expression are just the first, second and third columns of our matrix A, respectively. Those are vectors in the “output” vector space, which is 2-dimensional here; there’s also a 3-dimensional “input” vector space that x lives in. Often the two spaces are the same and use the same basis, but that is not a requirement! Anyway, if we plug in x = (1, 0, 0)^T (corresponding to the first basis vector in our input vector space), the result will just be the first column of A. Similarly, we’ll get the second and third columns of A if we choose different “test vectors” with the one in a different place.
And this is the connection between matrices and the linear transforms they represent: to form a matrix, just transform the i-th vector of your “input” basis and write the resulting vector represented in your “output” basis into column i. This also means that it’s trivial to write down the matrix for a linear transform if you know what that transform does to your basis vectors. Note that here and elsewhere, if you use row vectors instead of column vectors, you write xA, and the role of rows and columns are reversed.
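As an illustration of building a matrix column by column (the 90° rotation here is just an example I picked, not from the text):

```python
import numpy as np

# Rotation by 90 degrees around z: we know what it does to the basis vectors,
# so we can write the matrix down column by column:
#   e_x -> e_y,  e_y -> -e_x,  e_z -> e_z
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

e = np.eye(3)
for i in range(3):
    # Column i of R is the image of the i-th basis vector.
    assert np.allclose(R @ e[:, i], R[:, i])
```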
Change of basis
Now to give an example for why these distinctions matter, let’s look at a concrete problem: suppose you have a 3D character model authored in a tool that uses a right-handed coordinate system with +x pointing to the right, +y pointing up, and +z pointing out of the screen, and you want to load it in a game that has a left-handed coordinate system with +x pointing right, +y pointing up (same as before) and +z pointing into the screen. Let’s also say that character comes (as character models tend to do) complete with a skeleton and transform hierarchy. Assume for simplicity that the transforms are all linear and represented by 3×3 matrices. In reality they’ll be at least affine and probably be represented either as homogeneous 4×4 matrices or using some combination of 3×3 rotation matrices or unit quaternions, translation vectors and scale factors, but that doesn’t change anything substantial here.
So what do we have to do to convert everything from the modeler’s coordinate system to our coordinate system? It’s just a few sign flips, right? How hard can it be? Turns out, harder than it looks. While it’s easy enough to convert vertex data, matrices present more of a challenge, and randomly flipping signs won’t get you anywhere – and even if you luck out eventually, you still won’t know why it works now (you can ask anyone who’s ever tried this).
Instead, let’s approach this systematically. Imagine the model you’re trying to load in 3D space. What we want to do is leave the model and transforms exactly as they are, but express them in a different basis (coordinate system). It’s important to get this right. We’re going to be modifying the matrices and coordinates, true, but we’re not actually changing the model itself. Computationally, “express this vector in a different basis that has the z axis pointing in the opposite direction”, “reflect this vector about the z=0 plane” and “flip the sign of the z coordinate” are all equivalent, but they correspond to different conceptual approaches. When you perform a change in basis, it’s clear what has to happen with matrices representing transforms (namely, they have to change, too). A reflection is a transform – you might want to apply it to the whole model (at which point it’s best expressed as another matrix you pre-multiply to your transform hierarchy), you might want to reflect part of it, or you might want to just reflect some vertices. All of these actually modify the model (not just your representation of it), and it’s not clear which one is the “right” one. And so forth.
Anyway, enough of philosophy, back to the subject at hand: We have two different bases for the same 3D space: the one used by the modeler (let’s call it basis M) and the one used by your game (basis G). We want to leave the model unchanged and change basis from M to G. The linear transform that leaves everything unchanged is the identity transform I that sends every vector to itself. As mentioned above, every matrix depends on not only the linear transform it represents, but also the basis chosen for the input and output vector space. Let’s make that explicit by adding them as subscripts. We can then write down a matrix I_{G←M} that takes a vector expressed in basis M, applies the identity transform (=leaves everything unchanged), then expresses the result in basis G. In our case, the x and y basis vectors in both G and M are the same, while a vector expressed as (x, y, z) in basis M corresponds to (x, y, −z) in basis G (and vice versa). Putting it all together, we can write down the change-of-basis matrix to G from M:
I_{G←M} = [ 1  0   0 ]
          [ 0  1   0 ]
          [ 0  0  −1 ].
If we wanted to go the opposite way (to M from G), we would use I_{M←G} = I_{G←M}⁻¹ – which in this case happens to be the same matrix, since flipping the z axis twice is a no-op. Now what can we do with this? Well, first off, we can use it to convert vectors expressed in M to vectors expressed in G:
I_{G←M} (x, y, z)^T = (x, y, −z)^T.
Who would’ve thought, we end up sign-flipping the z coordinate. Written like this, it just seems like a verbose (and inefficient) way to write down what we already know. But note that the subscript notation allows us to see at a glance which basis we’re currently in. More importantly, we can see when we’re trying to chain together transforms that work in different bases. This comes in handy when dealing with trickier objects, such as matrices: note that the transforms stored in the file will both accept and produce vectors expressed in basis M; by our convention, we’d write this as T_{M←M}. Now say we want to get the corresponding transform in basis G – well, we could try the same thing that worked for vectors and just multiply the right form of the identity transform from the left:
I_{G←M} T_{M←M} = T_{G←M}.
Oops – see what happened there? That gives us a matrix that produces vectors expressed in basis G, but still expects vectors expressed in basis M as input. To get the true representation of T in basis G, we need to throw in another change of basis at the input end:
T_{G←G} = I_{G←M} T_{M←M} I_{M←G}.
So to convert everything to basis G, we need to multiply all the vectors with I_{G←M} on the left, and multiply all matrices with I_{G←M} from the left and I_{M←G} from the right (the resulting double sign-flip “from both sides” is the bit that’s really hard to stumble upon if you approach the whole issue in “let’s just keep flipping signs until it works” mode). Alternatively, we can leave all the vectors and transforms as they are, happily working in basis M, but then throw in a single I_{G←M} at the top of the transform hierarchy of the model, so that all of it gets transformed into the “game” basis before the rest of the rendering / physics / whatever code gets to see it. Either way works. And not only that, I hope that by now it’s clear why they work.
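The whole change-of-basis dance fits in a few lines of numpy; the example transform below is arbitrary, the point is the flip on both sides:

```python
import numpy as np

# I_G_from_M flips the z axis; it is its own inverse, so I_M_from_G is the same matrix.
I_G_from_M = np.diag([1.0, 1.0, -1.0])
I_M_from_G = np.linalg.inv(I_G_from_M)
assert np.allclose(I_M_from_G, I_G_from_M)

# Some transform from the model file, working entirely in basis M
# (an arbitrary example matrix).
T_M_from_M = np.array([[0.0, -1.0, 0.0],
                       [1.0,  0.0, 0.0],
                       [0.0,  0.0, 1.0]])

# The change of basis needs the flip on BOTH sides:
T_G_from_G = I_G_from_M @ T_M_from_M @ I_M_from_G

# Check: transforming in M and then converting the result to G agrees with
# converting to G first and then transforming in G.
v_M = np.array([1.0, 2.0, 3.0])
assert np.allclose(I_G_from_M @ (T_M_from_M @ v_M), T_G_from_G @ (I_G_from_M @ v_M))
```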
Note: By the way, this convention of explicitly writing down the source and destination coordinate systems for every single matrix is good practice in all code that deals with multiple coordinate systems. Names like “world matrix” or “view matrix” only tell half the story. A while back I got into the habit of always writing both coordinate systems explicitly: world_from_model, view_from_world and so forth. It’s a bit more typing, but again, it makes it clear which coordinate system you’re in, what can be safely multiplied by what, and where inverses are necessary. It’s clear what view_from_world * world_from_model gives – namely, view_from_model. More importantly, it’s also clear that world_from_model * view_from_world doesn’t result in anything useful. This stuff prevents bugs – including the nasty kind where you keep staring at code for 30 minutes and say to yourself “well, it looks right to me”.
Implement your solution, not your derivation
This ties in with one of my older rants. This is not a general rule you should always follow, but it’s definitely the right thing to do for performance-sensitive code.
In the example above, note how I wrote all transforms involved as matrices. This is the right thing to do in deriving your solution, because it results in expressions that are easy to write down and manipulate. But once you have the solution, it’s time to look at the results and figure out how to implement them efficiently (provided that you’re dealing with performance-sensitive code that is). So if, for some perverse reason, you’re really bottlenecked by the performance of your change-of-basis code, a good first step would be to not do a full matrix-vector multiply – in the case above, you’d just sign-flip the z component of vectors, and similarly work out once on paper what has to happen to the entries of T during the change of basis.
I’ll give more (and more interesting) examples in upcoming posts, but for now, I think this is enough material in a single batch.



