I was just looking over SIMD FFT code I wrote in 2015 for Bink Audio and Miles Sound System to replace the old, all-scalar implementation we had been using since (presumably) the late 90s. That in turn reminded me of writing that code originally, reading a lot of papers on the subject, and how I wished at the time that there were a write-up going into what mattered (and why) and what didn’t. It’s 8 years too late to save myself some time, but it might still be handy for you!

I’ll try to cover everything interesting to users of FFTs first. I’ll do a follow-up post that is only really interesting for FFT implementers. All of this assumes some familiarity with DSP concepts.

### Big picture: what do you need that FFT for, and what kind?

FFTs get used for many things, but there’s 3 main classes of use cases that all have somewhat different requirements:

1. Actual spectrum analysis. You have some signal that you want to see the discrete Fourier spectrum of. Generally uses windowed FFTs (aka Short-Time Fourier Transforms, STFTs). Typically wants FFT coefficients in the “proper” (natural) order. Large transform sizes (let’s draw an arbitrary line in the sand and call any transform size above 4096 “large”) are common. All-real inputs are more common by far, and usually just one direction of transform (let’s call it “forward”) is used.
2. Convolution or polynomial multiplication. Here, you’re just using the FFT as an intermediate step in a convolution algorithm. For convolution, real inputs are more common by far. For polynomials, you see complex a bit more often, but still not that common. If you don’t want cyclical convolution (and you usually don’t), not only are your inputs and outputs often real, the second half of FFT inputs is usually all zeroes and the second half of IFFT outputs is usually discarded; this can be used to speed up the transform if you care to exploit it. When used for convolutions, you don’t typically care whether coefficients are in the right order or not (for convolution, all you’re doing is element-wise multiplications in frequency space; reordered coefficients cause no real problems.) Can use quite large transforms but often doesn’t, even for uses cases like fairly long convolution reverbs (see below). Uses both forward and inverse transforms, in about equal proportion if one side of the convolution (e.g. a filter kernel) is fixed and pre-processed in advance; twice as many forward as inverse transforms if not.
3. As building block in other DSP algorithms. Many other trigonometric transforms used in practice, e.g. DCTs and MDCTs, reduce to FFTs with some pre- or post-processing, sometimes both. The inner FFT here is usually a complex-valued one. Usually wants coefficients in order. It’s rare to see large transform sizes for this use case.

FFTs “want” to work in the complex domain and this is the most natural form for them. Real-valued signals can be turned into complex-valued signals by just adding all-0 imaginary parts, and then you can use a complex FFT, however this wastes memory and work; a “proper” real-input FFT doesn’t require blowing up the input into complex numbers, can compute just half the spectrum (the other half is implied by symmetry), and as a result uses about half the memory and is nearly twice as fast. One caveat: for a FFT that works on complex numbers, the difference between a “forward” and “inverse” FFT is very small (literally a sign flip), and there is not any universal agreement on which sign the “forward” transform should be (sigh), but it also doesn’t actually matter for many applications. Once you do a real FFT and corresponding IFFT, well the real FFT takes N real-valued inputs and outputs N/2 complex numbers (really N/2-1 complex numbers and 2 real numbers, but the two reals are frequently packed together into one complex value) and the real IFFT takes those N/2 complex numbers and outputs N reals, so they are not interchangeable and you need to use the right one.

### Big picture: transform sizes

In principle, any size FFT can be computed efficiently, in the sense that you get an O(N log N) asymptotic complexity bound no matter what. In practice, the practical FFT algorithms that you actually want to use in most cases need the transform size to be of a certain form to be efficient, and the arbitrary-size FFT algorithms work by reducing a size-N FFT to a different FFT of a nearby “convenient” size, again with pre-/post-processing.

The most important class of algorithms in practice (Cooley-Tukey and descendants) wants highly composite numbers with few different prime factors; powers of two (just factors of 2) are best. Having a single factor of 3 or 5 in there is also often supported, which gives you more convenient spacing of transform sizes: for example, if you just need slightly more than 1024 elements, you don’t have to jump all the way to 2048 but might instead use 5*256 = 1280.

If possible, and there’s no strong reasons to do anything else, stick with just powers of 2. If that’s not an option, try to make do with exactly one factor of 3 or 5 in there. Any other “odd” sizes will likely end up getting reduced to one of those sizes in the end anyhow.

### Big picture: libraries, interfaces, what to call?

If you’re a user of FFTs, and you’re interested mostly in real-valued data (you often are), check if your FFT implementation of choice supports real-valued input or has inverse transforms that produce real outputs from half a spectrum, instead of manually expanding real values to complex values with zero imaginary part. That’s a fairly easy near-2x speed-up, usually.

FFTs usually need some description of the transform on the side, depending on the transform type and size. This is a data structure that usually contains twiddle factors (which effectively means a sine/cosine table, sometimes in funny order) and some auxiliary data (tables or otherwise) used during coefficient reordering.

Some FFT libraries only support a small, restricted set of transform sizes (usually just certain powers of 2), and these can usually avoid a “setup” structure of “FFT plan” or similar entirely. It’s a simpler API but more restricted.

There’s in-place vs. out-of-place. This is really more of an implementation detail, but sometimes it matters. Some approaches lend themselves to a fully in-place implementation (where the input and output are the same array), others don’t. “Real” in-place needs less memory but has more complicated logistics and sometimes inefficiencies. Asa a result, often in-place functions exist but are just an API entry point you can call that internally allocates workspace memory, works out-of-place, then copies the results back. Sometimes you also have API functions that take an optional pointer to workspace memory; if those exist, that usually means they need it and if you don’t give them workspace memory, they’ll allocate it themselves and free it once they’re done, so allocating the workspace once and passing it in can save a lot of allocation churn if you’re doing a lot of transforms. In short, there’s a lot of variations here, and this part is often poorly documented, but this is something you’ll want to look into because eating a lot of unnecessary allocations and copies in this type of DSP functionality can contribute a significant cost.

Then there’s complex number format. There’s two main contenders: interleaved and split. Interleaved has complex numbers stored as pairs of real part and imaginary part right next to each other. This is the format most commonly used for complex number data types (like C99 built-in complex type or C++ `std::complex`). FFT algorithms can work in this form directly, and some do, but it can be advantageous to split the data into two arrays, one containing just the real parts and one just the imaginary parts of each value. In general, interleaved form is pretty much always supported, but if the library supports the split format, that is usually the faster one, and it might be more convenient for your code as well, so it could be worth looking into.

Finally, permuted coefficients vs. not. If you only want to use a FFT for convolutions, you might not care about the order coefficients are in. Some algorithms can produce a FFT with coefficients in a non-standard order at a lower cost. You don’t see them that much anymore for reasons I’ll get into later, but if you use a FFT library that supports this option, and you only care about convolutions, it might be interesting for you.

With that all said, I know the question of library recommendations is going to come up, so for what it’s worth: if you’re on an Apple platform, vDSP ships as part of the Accelerate framework, is always present, and looks fine to me (although I haven’t tried the DFT/FFT functionality in particular), so there’s at least a few platforms where there’s an obvious choice.

There’s always FFTW, which I have last used in the early 2010s and don’t have fond memories of. It’s a big library with external dependencies, not easy to build or use, a ton of code (around half a megabyte last I checked), and for a while anyway it seemed to lag behind quite a bit in SIMD instruction set support (that might have been fixed by now, or maybe that was just a configuration problem on my part). FFTW does have pretty much everything once you get over that integration hump, though.

Intel has FFT functionality as part of their MKL (Math Kernel Library). I haven’t used it for FFTs but MKL in general tends to offer excellent performance on Intel CPUs. They detect CPUs by model number and mysteriously fall back to their slowest unoptimized kernels on any x86 device that doesn’t report as Intel (unless you override it with an environment variable), which is a reason for me not to use it on general principle but YMMV. MKL is also fairly big (and of course x86-specific) so if distribution size is a problem for you and you just want a FFT this is probably not what you want.

The one that’s easiest for me to recommend is Mark Borgerding’s KISS FFT. It’s not the fastest FFT around but it’s small, simple, and easy to build and integrate. If you find yourself bottlenecked by FFT speed you’ll want to look into something more advanced but if you just need a FFT-type algorithm to replace a calculation that would otherwise be O(N2) with something that’s O(N log N) and doesn’t have a huge amount of wasted work, this fits the bill. You’re losing a constant factor relative to what you could be getting with proper vectorization so be advised that there’s faster options when that becomes a problem.

### Big picture: convolutions

Convolutions are a fundamental signal processing operation for filtering (and also for effects such as convolution reverb in audio processing), and by what’s usually called the Convolution Theorem, time-domain convolution (the operation we’re interested in) corresponds to frequency-domain point-wise multiplication. (Vice versa, frequency-domain convolution also corresponds to time-domain point-wise multiplication, which is the math behind how windowing affects the FFT).

The theory is this: convolving a length-N signal X by a length-M signal Y directly takes N*M multiply-adds (the first one is just a multiply, but close enough). If we take the first signal to be our data and the second to be our filter, that means convolving by Y takes M operations per sample. (It stands to reason that we have to some work for every output sample on average.) This gets bad if Y is long. In the aforementioned convolution reverb, Y could be a 5-second sample at 48kHz sampling rate. That’s 240,000 multiplies per sample, which is what we in the business refer to as “not ideal”. The convolution theorem tells us we can do a Fourier transform, then one multiply per sample in the frequency domain, which is 4 real multiplies, and finally we do a inverse Fourier transform back. Fourier transforms are (mumble mumble) something something O(N log N), or I guess in our case O(M log M), so maybe this is O(log M) arithmetic operations per sample, and log2(M) is 18. Even assuming some fairly high constant factors here we’d expect this to work out to ballpark a few hundreds of operations per sample, roughly a 3 orders of magnitude speed-up.

Only problem being: I glossed over all the details of how exactly that actually works. This is not the subject of this post, so I’ll keep it brief, but there’s a bunch of important techniques here to actually get this to work, and this is another one of those things that are often glossed over, so I want to at least put all the search keywords in here so you know what to look for if you actually want (or need) to implement this.

First, we want “discrete convolution”, which is the convolution operation that belongs with what’s usually called the DTFT (Discrete-Time Fourier Transform). The FFT is a realization of the DFT (Discrete Fourier Transform) not DTFT, so it’s a different type of transform. DFT is like DTFT but for periodic signals, and it also has a convolution operation it’s associated with for its version of the Convolution Theorem, but the convolution we get with the DFT/FFT is what’s called cyclical convolution – it wraps around the signal (which is assumed to be periodic). If we want the no-wraparound (discrete) version, we can get it from the cyclical version by padding the signal with extra zeroes and using a long enough transform length that no actual wrap-around takes place. For our length-N signal, length-M filter setup, we can pad both out to be N+M-1 samples long, and then use cyclical convolution (which is just point-wise multiplication of DFT/FFT coefficients) to get our discrete convolution result.

That works, but if we did that directly, we’d need the whole input signal in advance, and we also get O((N+M) log2(N+M)) operations for the whole job, so the amount of operations per output sample has a dependency on the input length. That’s not only worse than advertised, it also fundamentally doesn’t work in a setting where you get the signal in real-time, because N (the eventual length of your input signal) is not something you know while it’s happening, and you don’t know all the future samples yet. So the next step is to realize that no, of course you don’t need full knowledge of all samples past and future to do a convolution with a finite-length filter. What you can do is cut your source signal into small pieces, say also of size M, filter those individually, and then combine the results. This works just fine and the search term for the main approaches are “overlap-save” and “overlap-add” method. This gets the dependency on N out of there entirely, and we actually do have O(log(M)) asymptotic operation count per output sample now. Nice.

This is now good enough for offline editing or batch processing, but still doesn’t work for something like real-time audio, because we work in units of big blocks, and for our M around 240,000, we need blocks of roughly that many input samples to do the job. That’s 5 seconds of audio. If we need to wait until we have more than 5 seconds of audio to do another batch, that means we can’t do it without stuttering unless we introduce 5 seconds of latency. Once again we’re off by several orders of magnitude – we might be willing to live with 20-30ms of latency, although for Pro Audio a real 5ms or thereabouts would actually be nice. Direct convolution doesn’t have this delay. As soon as we have a new input sample, we can compute the corresponding output sample – although it might take us a lot of operations to do so. FFT-based convolution amortizes the effort greatly, but introduces unacceptably large latency when used directly with large kernels. So what do we do?

Well, several tricks. First, we can use the same “cut into pieces” trick we used on the input signal on the filter as well; for example, we might cut it into pieces 1024 samples long (duration 21.3ms at 48kHz sampling rate), which is getting towards the upper end of what we’re willing to tolerate. Then we can use overlap-add/overlap-save style techniques on these pieces to build our convolution with a 240,000-sample signal out of 235 convolutions by a length-1024 signal (this is effectively just another direct convolution, just between corresponding coefficients of 1024-sample blocks). We don’t need to redo all these every time, every 1024 samples we transform one new block of 1024 and keep the older ones around – the search term here is “Frequency-Domain Delay Line”. That adds a few hundred multiply-adds per sample to our tally, but a few hundred extra we can live with.

Next, that 1024-sample latency is also not actually necessary. It’s true it takes 21.3ms to complete a 1024-sample block, but that doesn’t mean you have to twiddle your thumbs while that’s going. The problem is really only with the “beginning” parts of Y – which get multiplied by the most recent samples from X. Once you’re 1024 samples into Y, those values get multiplied by “old” samples in X (at least 1024 samples old), so there’s not nearly as much of a rush. So we can use direct convolution with just the initial part of Y to keep our latency down (it can now again go about as low as direct convolution can), and then use optimized FFT-based convolution for the contribution from more distant parts of the signal. That contribution from the direct convolution is now contributing something like a 1024 multiply-adds per sample, which is starting to hurt, but at least we can hit low latency now.

Finally, the actual solution in practice: this realization that the big blocks are only a problem in terms of latency for recent operation is true in general, and lets us use this whole process recursively. For example, in our initial 1024 samples, the first 64 samples (1.3ms worth at 48kHz) are latency-critical and probably need direct convolution. Past that point, we might switch to length-64 blocks with FFT-based convolution. At some point past 1024 samples into the “past”, we might switch over to length-1024 blocks. And for further back-in-time samples, we might switch to length-16384 blocks or so (our 5s impulse response fits into 15 16,384-sample blocks, so the 15 resulting overlap-adds/overlap-saves we don’t worry about much, but if it were longer we might do this a few more time with increasing sizes). There’s some subtlety about what block sizes to use and when to switch (you might think it would be good to use pow2 sizes and greedily double the block size as soon as we can every time, but turns out that’s not work-optimal), and there’s also other concerns such as how much processing time the various transform sizes take and when to schedule them etc., but this is the actual “good” convolution algorithm for real-time applications, and it’s called “Non-Uniform Partitioned Block Convolution”.

All right. This was just a very brief, very rough overview, and I would absolutely not recommend trying to fill in the details from my few paragraphs, but especially with convolutions, I’ve several times seen code that really does a 2D FFT on a whole image or similar to convolve with a 50×50 pixel kernel, and anecdotally it seems that outside the domain of real-time audio, the better techniques are not nearly as well-known as they ought to be, hence this extended detour into convolution algorithms.

This concludes my notes on the subjects that might be interesting for FFT users. Next part will be for FFT implementers only, pretty much.

From → Coding, Maths