# Fast blurs 2

*This post is part of the series “Debris: Opening the box“.*

At the end of the last post, I described the basic “moving average” implementation for box filters and how to build a simple blur out of it. I also noted that the approach given only offers very coarse control of the blur radius, and that we’d like to have something better than box filters. So let’s fix both of these issues.

### Subpixel resolution

In the previous article, I used ‘r’ to denote what is effectively the “radius” of a box filter, and our filters always had an odd number of taps 2r+1. So r=0 has 1 tap (which is 1, so this is just the identity – no blur), r=1 has 3 taps, r=2 has 5 taps and so forth. But how do we get intermediate stages, say r=0.1 or r=1.5?

Well, Jim Blinn once said that “All problems in computer graphics can be solved with a matrix inversion”. But here’s the thing: Jim Blinn lied to you. Disregard Jim Blinn. Because at least 50% of all problems in computer graphics can in fact be solved using linear interpolation without any matrices whatsoever. And this is one of them. We know how to handle r=0, and we know how to handle r=1. Want something in the middle? Well, just linearly fade in another 1 and then normalize the whole thing so the weights still sum to 1. To implement this, let’s split our radius into an integer and fractional component first:

Then our blur kernel looks like this:

with exactly 2m+1 ones in the middle. This is still cheap to implement – for example, we could just treat the whole thing as a box filter with r=m, and then add the two samples at the ends with weight α after the fact (and then apply our usual normalization). This is perfectly reasonable and brings us up to four samples per pixel instead of two, but it's still a constant cost independent of the length of the filter, so we're golden.

However there's a slightly different equivalent formulation that fits in a bit better with the spirit of the filter and is very suitable when texture filtering hardware is available: consider what happens when we move one pixel to the right. Let's look at our five-tap case again (without normalization to keep things simple) and take another look at the differences between adjacent output samples:

On the right side, we add a new pixel at the right end with weight α, and we need to "upgrade" our old rightmost pixel to weight 1 by adding it in again weighted with (1-α); we do the same on the left side to subtract out the old samples. Interestingly, if you look at what we're actually adding and subtracting, it's just linear interpolation between two adjacent samples – i.e. linear filtering (the 1D version of bilinear filtering), the kind of thing that texture sampling hardware is great at!

Time to update our pseudocode: (I'm writing this using floats for convenience, in practice you might want to use fixed point)

// Convolve x with a box filter of "radius" r, ignoring boundary // conditions. float scale = 1.0f / (2.0f * r + 1.0f); // or use fixed point int m = (int) r; // integer part of radius float alpha = r - m; // fractional part // Compute sum at first pixel float sum = x[0]; for (int i=0; i < m; i++) sum += x[-i] + x[i]; sum += alpha * (x[-m] + x[m]); // Generate output pixel, then update running sum for next pixel. // lerp(t, a, b) = (1-t)*a + t*b = a + t * (b-a) for (int i=0; i < n; i++) { y[i] = sum * scale; sum += lerp(alpha, x[i+m+1], x[i+m+2]); sum -= lerp(alpha, x[i-m], x[i-m-1]); }

I think this is really pretty slick – especially when implemented on the GPU, where you can combine the linear interpolation and sampling steps into a single bilinear sample, for an inner loop with two texture samples and a tiny bit of arithmetic. However nice this may be, though, we’re still stuck with box filters – let’s fix that one too.

### Unboxing

Okay, we know we can do box filters fast, but they’re crappy, and we also know how to do arbitrary filters, but that’s much slower. However, we can build better filters by using our box filters as a building block. In particular, we can apply our box filters multiple times. Now, to give you an understanding of what’s going on, I’m going to switch to the continuous domain for a bit, because that gets rid of a few discretization artifacts like the in-between box filters introduced above (which make analysis trickier), and it’s also nicer for drawing plots. As in the introduction, I’m not going to give you a proper definition or theory of convolution (neither discrete nor continuous); as usual, you can step by Wikipedia if you’re interested in details. What we need to know here are just two things: first, there is a continuous version of the discrete convolution we’ve been using so far that behaves essentially the same way (except now, we perform integration instead of the finite summation we had before), and second, convolution is associative, i.e. . In our case, we’re gonna convolve our image I with a box filter b multiple times, and for this case we get something like . Point being, applying a box filter p times is equivalent to filtering the image once with a filter that’s a box convolved p-1 times with itself.

In other words, using just our single lowly box filter, we can build fancier filters just by applying it multiple times. So what filters do we get? Let’s plot the first few iterations, starting from a single box filter with radius 1. Note that in the continuous realm, there really is only one box filter – because we’re not on a discrete grid, we can generate other widths just by scaling the “time” value we pass into our filter function. This is one of the ways that looking at continuous filters makes our life easier here. Anyway, here’s the pretty pictures:

These all show the filter in question in red, plotted over a Gaussian with matching variance () and scale in blue (ahem, not to suggest any conclusions or anything!). Triangle filters also have a name (which is just as inspired as “box filter”, natch), the other ones I’ve labeled simply by what type of function they correspond to. Anyway, the point to take away here is that convolving a box with itself quickly leads to pretty smooth functions (the Cardinal B-splines, in fact) that happen to get quite similar to a Gaussian very fast. In fact, as we keep increasing p (the order), they will converge towards a “real” Gaussian – proof here for the curious. (You can also stay with the discrete version and prove this using the Central Limit Theorem, but I like this version more). The piecewise cubic function you get for p=4 above is already within 3% (absolute) error of the corresponding Gaussian. In practice, exact match with a Gaussian isn’t that important anyway as long as you’re just blurring images, and just using p=3 is normally fine.

So, to get better filters, just blur multiple times. Easy! So easy in fact that I’m not gonna bore you with another piece of pseudocode – I trust you can imagine a “do this p times” outer loop on your own. Now, to be fair, the convolutions I showed you were done in the continuous domain, and none of them actually prove anything about the modified box filters we’re using. To their defense, we can at least argue that our modified box filter definition is reasonable in that it’s the discretization of the continuous box filter with radius r to a pixel grid, where the pixels themselves have a rectangular shape (this is a fairly good model of current displays). But if we actually wanted to prove consistency and convergence using our filters, we’d need to prove that our discrete version is a good approximation of the continuous version. I’ll skip that here since it gets technical and doesn’t really add anything if you just want fast blurs, but if you want to see proof (and also get an exact expression for the variance of our iterated modified box filters), this paper has the answers you seek.

### Implementation notes

And that’s really all you need to know to get nice, fast quasi-Gaussian blurs, no matter how wide the kernel. For a GPU implementation, you should be able to more or less directly take my pseudo-code above and turn it into a Compute Shader: for the horizontal blur passes, have each thread in a group work on a different scan line (and for vertical passes, have each thread work on a different row). To get higher orders, you just keep iterating the passes, ping-ponging between destination render targets (or UAVs if you want to stick with DX11 terminology). Important caveat: I haven’t actually tried this (yet) and I don’t know how well it performs; you might need to whack the code a bit to actually get good performance. As usual with GPUs, once you have the basic algorithm you’re at best half-done; the rest is figuring out the right way to package and partition the data.

I can, however, tell you how to implement this nicely on CPUs. Computation-wise, we’ll stick exactly with the algorithm above, but we can eke out big wins by doing things in the right order to maximize cache hit ratios.

Let’s look at the horizontal passes first. These nicely walk the image left to right and access memory sequentially. Very nice locality of reference, very good for memory prefetchers. There’s not much you can do wrong here – however, one important optimization is to perform the multiple filter passes *per scanline* (or per group of scanlines) and not filter the whole image multiple times. That is, you want your code to do this:

for (int y=0; y < height; y++) { for (int pass=0; pass < num_passes; pass++) { blur_scanline(y, radius); } }

and not

// Do *NOT* do it like this! for (int pass=0; pass < num_passes; pass++) { for (int y=0; y < height; y++) { blur_scanline(y, radius); } }

This is because a single scan line might fit inside your L1 data cache, and very likely fits fully inside your L2 cache. Unless your image is small, the whole image won’t. So in the second version, by the point you get back to y=0 for the second pass, all of the image data has likely dropped out of the L1 and L2 caches and needs to be fetched again from memory (or the L3 cache if you’re lucky). That’s a completely unnecessary waste of time and memory bandwidth – so avoid this.

Second, vertical passes. If you implement them naively, they will likely be significantly slower (sometimes by an order of magnitude!) than the horizontal passes. The problem is that stepping through images vertically is not making good use of the cache: a cache line these days is usually 64 bytes (even 128 on some platforms), while a single pixel usually takes somewhere between 4 and 16 bytes (depending on the pixel format). So we keep fetching cache lines from memory (or lower cache levels) and only looking at between 1/16th and 1/4th of the data we get back! And since the cache operates at cache line granularity, that means we get an effective cache size of somewhere between a quarter and a sixteenth of the actual cache size (this goes both for L1 and L2). Ouch! And same as above with the multiple passes, if we do this, then it becomes a lot more likely that by the time we come back to any given cache line (once we’re finished with our current column of the image and proceeded to the next one), a lot of the image has dropped out of the closest cache levels.

There’s multiple ways to solve this problem. You can write your vertical blur pass so that it always works on multiple columns at a time – enough to always be reading (and writing) full cache lines. This works, but it means that the horizontal and vertical blur code are really two completely different loops, and on most architectures you’re likely to run out of registers when doing this, which also makes things slower.

A better approach (in my opinion anyway) is to have only one blur loop – let’s say the horizontal one. Now, to do a vertical blur pass, we first read a narrow stripe of N columns and write it out, transposed (i.e. rows and columns interchanged), into a scratch buffer. N is chosen so that we’re consuming full cache lines at a time. Then we run our multiple horizontal blur passes on the N scan lines in our scratch buffer, and finally transpose again as we’re writing out the stripe to the image. The upshot is that we only have one blur loop that gets used for both directions (and only has to be optimized once); the complexity gets shifted into the read and write loops, which are glorified copy loops (okay, with a transpose inside them) that are much easier to get right. And the scratch buffer is small enough to stay in the cache (maybe not L1 for large images, but probably L2) all the time.

However, we still have two separate copy loops, and now there’s this weird asymmetry between horizontal and vertical blurs. Can’t we do better? Let’s look at the sequence of steps that we effectively perform:

- Horizontal blur pass.
- Transpose.
- Horizontal blur pass.
- Transpose.

Currently, our horizontal blur does step 1, and our vertical blur does steps 2-4. But as should be obvious when you write it this way, it’s actually much more natural to write a function that does steps 1 and 2 (which are the same as steps 3 and 4, just potentially using a different blur radius). So instead of writing one copy loop that transposes from the input to a scratch buffer, and one copy loop that transposes from a scratch buffer to the output, just have the horizontal blur always write to the scratch buffer, and only ever use the second type of copy loop (scratch buffer to output with transpose).

And that’s it! For reference, a version of the Werkkzeug3 Blur written using SSE2 intrinsics (and adapted from an earlier MMX-only version) is here – the function is `GenBitmap::Blur`

. This one uses fixed-point arithmetic throughout, which gets a bit awkward in places, because we have to work on 32-bit numbers using the MMX set of arithmetic operations which doesn’t include any 32-bit multiplies. It also performs blurring and transpose in two separate steps, making a pass over the full image each time, for no good reason whatsoever. :) If there’s interest, I can write a more detailed explanation of exactly how that code works in a separate post; for now, I think I’ve covered enough material. Take care, and remember to blur responsibly!

Aren’t you totally changing the transposition behavior when you go to combined-blue-tranpose? It seems like you suddenly now need a second buffer.

That is, in the original, step 2 is: transpose a narrow, tall column band to a wide, short row band. Step 4 is: transpose a wide, short row band to a narrow, tall column band. Under this scheme, you only need the scratch buffer, because you’re doing a loop over scratch-buffer-sized objects from step 2-4.

In the sequel, it appears you’re proposing to always loop over scratch-buffer-sized chunks for 1-2 and 3-4, and 2 and 4 are both ‘tranpose a wide, short row band to a narrow, tall column’. But that means that the first step 2 needs to be write a narrow tall column band of the image, despite having only processed a wide, short row band, which means it can’t write it back to the original buffer.

Yeah, I wrote the last part a bit sloppily, sorry for that. Bundling horizontal blur and transpose-in-output does indeed cost you the ability to do the processing in place, and if you’re memory-constrained, you might want to opt for one of the other variants.

Also, I’ve always wondered whether for the non-subpixel multipass blurs it would be worth doing it all in a single pass.

That is, a single step of a single blur reads two samples (at each end) to update its accumulator compute the blurred value at a single pixel.

If we consider the second step of a two-pass blur, those two samples it’s reading are themselves the result of a single blur pass. But it’s reading through them in a consistent streaming order. So instead of computing that as a separate blur pass, we could compute those values on the fly. Each of the two samples would be computed by reading two samples from the underlying image, so it would require 4 memory reads and three blur calculations (actually two of the memory operations read the same address). The main question is how it behaves cachewise. (It also maybe complicates the startup and shutdown phases if you’re not reading from a clamped texture so you’re trying to optimize the clamping out of the loop.)

Three passes naively requires reading 8 samples, but 4 in practice. But it does require seven blur calculations (versus three if multipass).

Things explode too exponentially fast with subpixel precision, since each pass needs to read four samples.

I kept the articles almost completely DSP-free since everything in there has a nice, obvious interpretation anyway, and I didn’t want to complicate the exposition unnecessarily. But if you want to explore “fused multi-pass” variants, your better off just using the Z-transform (Generating Functions) and looking at the ways you can factor the transfer function.

Our basic box filter is

`H(z) = z^(-n) + z^(-n+1) + ... + z^(n-1) + z^n`

(plus normalization, but that’s all linear so ignore this for now). Using the standard geometric summation formula, this is equal to H(z) = (z^(n+1) – z^(-n)) / (z – 1). Presto, that’s the moving average derivation in a single line. :) Multiple passes are just powers of H(z), and multiplying it out gives you a “fused” way to compute multiple passes – using a very rare beast, a n-poleFIRfilter.`H^2(z) = (z^(2n+2) - 2z + z^(-2n)) / (z - 1)^2`

, so we only get 3 reads (as you mention) but a 2-pole feedback stage (need to keep track of 2 previous output samples). That’s still totally reasonable.`H^3(z) = (z^(3n+3) - 3z^(n+2) + 3z^(1-n) - z^(-3n)) / (z - 1)^3`

. Reads 4 samples, has 3 poles. And so forth – you get the idea. The point being that all of this is still completely reasonable if you don’t mind the extra arithmetic on the output end. Haven’t looked at the subpixel case though; yes, things get messy there.I like the formulation of the incremental non-integer-radius blur!

So one issue is the precision of intermediates. If your pixels are U8, if you do one direction of blur and stuff it back into U8′s , and then do another direction of blur, you’re losing precision in the intermediate. (no big deal in a demo context, but not nice as a general way to do filters)

It’s almost as fast to keep a few rows of higher precision, whatever the width of the filter kernel is, of U16 pixels. Then you grab U8 horizontal, blur and shift left a few steps to get some fractional precision, write the output to the U16 row buffers. Each time you step a scan line you apply the vertical kernel to the row buffers and write out a full row. The row buffer is circular. As long as the cache is big enough to fit the entire row buffers, this is very fast. Actually this can be much faster when memory bandwidth is your limitation, but it makes the horizontal and vertical code totally different.

The other thing is, I think it’s faster to read-vertical and write horizontal, rather than read horizontal and write vertical. The reason is if you write horizontal you are always writing full cache lines, so you don’t have to fetch a line and just change one byte and flush it out again. The difference is pretty small I guess because the vertical cache misses will always dominate.

As mentioned in the first part, our texture generator keeps things at 15 bit/channel internally despite only ever outputting ARGB8888 images. The “in a demo context” bit here is misleading – with fully generated textures it’s not uncommon to have paths with 50+ image manipulation/filtering and several color correction operations in sequence, which will absolutely kill your precision at 8 bits per channel no matter how careful you are about rounding. Within a filter (e.g. the blur) you can keep extra precision, for the temporaries and round (or even dither) on output, but if you do a dozen or more dependent steps in sequence that won’t save you. With the 15 bits/channel, you do have headroom for this kind of thing, and you can afford to not have extra precision in the middle stages, which makes things simpler.

One thing I did not mention in the article is that for fixed-point linear interpolation on the input side, you want to not down-scale before you add into the filter sum; instead keep all the fractional bits. Again, don’t throw away precision unless you have to! There’s a balancing act between max size of blur kernels and bits in your texture; with the 15 bits/channel and 6 bits “subtexel” accuracy the filter uses, you get a max kernel width of 1024 pixels before you start overflowing into the sign bit in a 32-bit accumulator. That’s plenty to be sure, but still, it’s a noticeable limit.

“The other thing is, I think it’s faster to read-vertical and write horizontal, rather than read horizontal and write vertical. The reason is if you write horizontal you are always writing full cache lines”I did mention explicitly that you should write

stripesof rows, not single scan lines – and specifically, size the stripe so you always write full cache lines at a time. That accomplished pretty much the same thing.Another thing to be careful about is that with power-of-2 texture sizes, it’s really easy to screw yourself by hitting cache set aliasing issues. It can make a dramatic difference to just allocate 64 extra bytes per line just to make sure you hit all sets evenly.

Indeed! Quite hypocritical, since one of my greatest pet peeves is boners who leave “but what about this!” comments when I’ve already addressed that topic.

Presumably the ideal way to do multiple effects on an image is not to run multiple passes over the image but to run multiple passes over the scratch buffer (as long as the effects are H-V separable like blur).

Separability is not a very useful criterion here, few things besides straight convolution filters actually care.

We did have a version for mobile devices years ago (2004/2005, long before they had GPUs worth the name) that distinguished between three types of filters:

- 1:1 pixel transforms (this covers basically all color correction and compositing)

- Wants full image input but can produce output tile by tile (most operations that perform any kind of geometric transform, most generators)

- Wants full image, writes full image (this is rare; the first pass of the separable moving-average blur is an example)

The textures would be stored and, generally, processed tile by tile (16×16 pixels at up to 8 bytes/pixel if my memory serves me right), and most of the 1:1 transforms would live directly off the output of their predecessor without any of it ever being written to memory. The goal was for most processing to happen purely in the cache, to save on memory bandwidth (this was aiming for 16-32k of L1 data cache with no L2, hence the small tiles).

It was a cute idea and worked reasonably well, but we never seriously pursued this on PC which had faster memory so we were more compute limited. More importantly, all our PC tex gen work was shooting for small size, and while the tile stuff wasn’t bad, it did need a fair bit more infrastructure than the dead simple “just call functions on images” model we used otherwise.