SSE: mind the gap!
SSE and SSE2 are available in every single x86-family CPU with 64-bit support. You too can play around with SIMD, which is great fun! Unfortunately, SSE2 level in particular also happens to be what is probably the most maddeningly non-orthogonal SIMD instruction set in the world, where operations are either available or not available for particular data types with little rhyme or reason, especially where integers are involved. Later revisions (especially starting around SSE4.1) fill in some of the more annoying gaps, but plenty of us are stuck with supporting the older CPUs for at least a few more years, and besides – not to mess with the authentic SSE experience – even on AVX2-supporting CPUs, there’s still a few of the classic gaps remaining.
So, here’s a list of tricks to get you around some of the more common, eh, “idiosyncrasies” of SSE and its descendants. This happens to be mostly focused on the integer side; the floating-point side is generally less, well, weird. I’ll keep the individual descriptions relatively brief since the whole point of this post is to collect lots of tricks. The assumption here is that you’re already somewhat familiar with the instructions, so I’ll not explain the basics (maybe another time). I’ll use the official Intel intrinsics (as exposed in C/C++) since that’s probably the most common way people interact with these instructions intentionally (awkward glance in the direction of auto-vectorization here. No making eye contact. Moving on.)
Branchless “select” (cond ? a : b)
The natural mode of operation in SIMD computations is to do things branchlessly. If some part of a computation is conditional, rather than doing the equivalent of an
if, it’s more typical to do both the computation for the “if” and “else” forks, and then merge the results based on the condition. The “select” I mean is the operation which takes the condition and both results and performs the rough equivalent of C’s ternary operator
cond ? a : b. You first evaluate both sides, giving
b. You then evaluate the condition using a SIMD compare, which returns a vector containing a bit mask that is has all bits set for lanes that meet
cond, and all bits clear for lanes that don’t.
This select operation can always be done using a few bitwise operations (which is well known), but starting in SSE 4.1 we get slightly more efficient variants too (less well known, and the reason I mention this):
- Integer (all vers):
_mm_or_si128(_mm_and_si128(a, cond), _mm_andnot_si128(cond, b)).
- 32-bit float (all vers):
_mm_or_ps(_mm_and_ps(a, cond), _mm_andnot_ps(cond, b)).
- 64-bit float (all vers):
_mm_or_pd(_mm_and_pd(a, cond), _mm_andnot_pd(cond, b)).
- Integer (SSE4.1+):
_mm_blendv_epi8(a, b, cond).
- 32-bit float (SSE4.1+):
_mm_blendv_ps(a, b, cond).
- 64-bit float (SSE4.1+):
_mm_blendv_pd(a, b, cond).
andnot operations don’t come in handy very often, but they’re the best choice here (pre-SSE4.1).
If you don’t want to use
cond but its logical negation, just switch the positions of
(!cond) ? a : b is the same as
cond ? b : a.
Unsigned integer compares
SSE, in all incarnations, offers precisely two types of integer comparisons: test for equality (
_mm_cmpeq_T, where t and T stand for various type suffixes) and test for signed greater-than (
_mm_cmpgt_T). Most other comparison types can be produced using nothing but logical negation and standard identities:
a == bis supported directly.
a != bis
!(a == b).
a > b(signed) is supported directly.
a < b(signed) is the same as
b > a(swap a and b).
a >= b(signed) is
!(a < b)(which in turn is
!(b > a)).
a <= b(signed) is
!(a > b).
See previous note on selection operations on how to get rid of the NOT in the most common use case. Conspicuously absent from that list is any type of unsigned ordered comparison. However, a trick that works is to bias both integers so that signed comparison does the right thing:
a > b(unsigned, 8-bit) is the same as
(a - 0x80) > (b - 0x80)(signed, 8-bit).
a > b(unsigned, 16-bit) is the same as
(a - 0x8000) > (b - 0x8000)(signed, 16-bit).
a > b(unsigned, 32-bit) is the same as
(a - 0x80000000) > (b - 0x80000000)(signed, 32-bit).
The same argument-swapping and NOT-ing tricks as above still apply to give you the other compare types. In general, the trick is to add (or subtract, or XOR – they all do the same thing in this particular case) the INT_MIN for the respective type from both operands before doing the compare. This turns the smallest possible unsigned integer, 0, into the smallest possible signed integer for the given type; after that, the ordering works out. In particular, when comparing against a constant, this addition (or subtraction, or XOR) can be baked into the constant operand, so the unsigned compare “only” ends up doing one more operation than a signed compare (instead of two).
A completely different approach is to use the unsigned integer min/max instructions (more about those in a second) to build less-or-equal or greater-or-equal comparisons:
a <= bif and only if
max(a, b) == b.
a >= bif and only if
min(a, b) == b.
The good news is that this reduces unsigned comparisons to either an unsigned min or a max, followed by an equality comparison, which is only 2 instead of 3 operations. The bad news is that the requisite unsigned min/max operations only exist for uint8s in SSE2. The uint16/uint32 variants were finally added with SSE4.1; if your minimum target is earlier, you’re stuck with the bias-then-compare variants above.
Integer min and max
SSE4.1 has the full set of integer min/max for 8-, 16- and 32-bit types, both signed and unsigned. So if you’re targeting SSE4.1 or later, good for you!
If you’re stuck with anything earlier, you’re decidedly more limited. In SSE2, you get integer min/max for uint8 and int16. If you need min/max for int8, uint16, or anything 32-bit, you’re on your own.
Luckily, we can just combine some of the techniques above to derive a solution. The general patterns here are:
max(a, b) == (a > b) ? a : b; min(a, b) == (a > b) ? b : a;
So this is just a combination of a compare and a “select” operation. When the compare is signed (the int8 and int32 cases), the comparison maps to a single SSE intrinsic. The unsigned compares (uint16 and uint32) can be solved using the bias-then-signed-compare trick which in turn gives us an unsigned min/max.
32-bit and 64-bit loads/stores
This one has nothing to do with the actual instruction set and everything to do with the intrinsics: yes, SSE2 has 32-bit (
MOVD) and 64-bit (
MOVQ) loads and stores, the standard intrinsics just do their best to confuse you about it:
- 64-bit loads are
_mm_loadl_epi64. This intrinsic takes a
__m128i *as an argument. Don’t take that seriously. The actual load is 64-bit sized, not 128-bit sized, and there is no alignment requirement.
- 64-bit stores are
_mm_storel_epi64. Again, the
__m128i *is confusing and does not mean that the actual store is 128-bit or that there are alignment requirements. It isn’t and there are not.
- 32-bit loads are even more hidden! Namely, you write
xis a pointer to a 32-bit integer. No direct load intrinsic, but compilers will turn this into a
MOVDwith memory operand where applicable.
- 32-bit stores, likewise:
*x = _mm_cvtsi128_si32(value). Now you know.
There’s lots of different ways to provide multiplies in a SIMD instruction set, and by now SSE has tried most of them in one form or another.
Let’s start with the (historically) first variant: multiplying 16-bit numbers. The relevant instructions originated in the Pentium MMX and compute the low and high halves (bottom and top 16 bits) of a signed 16-bit×16-bit product. MMX only has signed multiplies, but SSE also added a “high half of unsigned 16-bit times 16-bit product” instruction (the low halves of signed and unsigned products are identical), so we’re not gonna have to worry about that particular problem, not yet anyway.
These instructions are fine if you want the low or high halves of the product. What if you want the full 32-bit product of vectors of 16-bit values? You compute the low and high halves and then merge them using the “unpack” instructions. This is the standard approach, but not very obvious if you haven’t deal with this kind of thing before. So for a full 16×16→32-bit product (note this produces two vectors worth of results), we get:
// EITHER: a*b (16-bit lanes), signed __m128i lo16 = _mm_mullo_epi16(a, b); __m128i hi16 = _mm_mulhi_epi16(a, b); // OR: a*b (16-bit lanes), unsigned __m128i lo16 = _mm_mullo_epi16(a, b); __m128i hi16 = _mm_mulhi_epu16(a, b); // THEN: merge results __m128i res0 = _mm_unpacklo_epi16(lo16, hi16); // result lanes 0..3 __m128i res1 = _mm_unpackhi_epi16(lo16, hi16); // result lanes 4..7
But what if you’re working with 32-bit values? There is a 32×32→32-bit product (
_mm_mullo_epi32), but it was only added with SSE4.1, and it’s significantly slower than the other SSE2 multiplies in many implementations. So you might either not want to set your minimum target that high, or you might be looking for something quicker.
There’s full 32×32→64-bit products, which are available from SSE2 on as
_mm_mul_epu32 (unsigned). SSE4.1 adds the signed equivalent
_mm_mul_epi32 (UPDATE: An older version of this post incorrectly stated that PMULDQ was SSE2. Thanks Exophase for pointing it out!). These ones only compute two products (between the even lanes of the two sources) and place them in a 128-bit result. The odd 32-bit lanes are ignored completely, so if you want four 32×32→32-bit products, you need at least two of these multiplies and a lot of wrangling:
// res = _mm_mullo_epi32(a, b) equivalent using SSE2, via PMULUDQ. // even and odd lane products __m128i evnp = _mm_mul_epu32(a, b); __m128i odda = _mm_srli_epi64(a, 32); __m128i oddb = _mm_srli_epi64(b, 32); __m128i oddp = _mm_mul_epu32(odda, oddb); // merge results __m128i evn_mask = _mm_setr_epi32(-1, 0, -1, 0); __m128i evn_result = _mm_and_si128(evnp, evn_mask); __m128i odd_result = _mm_slli_epi64(oddp, 32); __m128i res = _mm_or_si128(evn_result, odd_result);
It works, but it’s a mouthful.
But what if you’re using 32-bit vector lanes, but happen to know that the numbers we’re trying to multiply are in fact in the range [-32768,32767] (i.e. representable as signed 16-bit integers)? We could try narrowing the 32-bit lanes into 16 bits then using the 16×16→32 sequences above, but is that really the best we can do?
It is not:
_mm_madd_epi16), MMX/SSE2’s amazing and strange (but mostly amazing) dot product operation, has our back, for we can do this:
// a and b have 32-bit lanes with values that fit in int16s. // produces the 32-bit result // res[i] = a[i] * b[i] // clears high 16 bits of every 32-bit lane __m128i bm = _mm_and_si128(b, _mm_set1_epi32(0xffff)); // after this, madd_epi16 does what we want! __m128i res = _mm_madd_epi16(a, bm); // can swap role of a and b above too, when convenient.
That’s a lot shorter than narrowing to 16-bit first would be! Alas, it only works for int16 (signed). What if we’re working in 32-bit lanes with values that fit inside a uint16 (unsigned)? It’s not quite as slick, but still, better than narrowing to 16-bit first or dealing with the logistics when synthesizing 32×32→32-bit muls from
// a and b have 32-bit lanes with values that fit in uint16s, // i.e. a[i] == (uint16)a[i] and same for b[i]. // // produces the 32-bit result // res[i] = a[i] * b[i] // compute low and high 16-bit products __m128i lop = _mm_mullo_epi16(a, b); __m128i hip = _mm_mulhi_epu16(a, b); // merge results __m128i res = _mm_or_si128(lop, _mm_slli_epi32(hip, 16));
Horizontal adds, dot products etc. (float)
SSE3 adds horizontal adds
_mm_hadd_pd) and SSE4.1 throws in the dot-product instructions
Generally, don’t expect these operations to be magic. They exist in the instruction set but are fast precisely nowhere; in all x86 implementations I’m familiar with, they just turn into a canned sequence of more basic (SSE2-level) operations. So more often that not, you will end up requiring a higher minimum CPU target for little to no speed gain. Caveat: these instructions are a smaller than their replacement instruction sequence, so using them can reduce code size slightly. But still, don’t expect this to be fast.
If you want good SIMD performance, don’t lean on horizontal and dot-product style operations; process data in batches (not just one vec4 at a time) and transpose on input, or use a SoA layout to begin with.
The other kind of horizontal adds, dot products etc. (integer)
SSE does have a bunch of horizontal add and dot product-style operations that don’t suck, but they’re on the integer pipe, and not what you’d expect.
_mm_hadd_epi32). These are SSSE3 and later only and OK but not particularly great (similar disclaimer as for the floating-point horizontal adds applies).
No, I’m talking about
_mm_madd_epi16, SSE2 with its ancestor around since the original MMX),
_mm_sad_epu8, SSE2) and
_mm_maddubs_epi16, SSSE3). The official manuals describe what these instructions do, so I won’t bother going into too much detail, but here’s the basic idea:
PMADDUBSW are 2-element dot-product instructions between pairs of adjacent SIMD lanes.
PMADDWD computes two int16 by int16 products for each pair of 16-bit lanes and sums the 32-bit integer products to yield the 32-bit result lanes.
PMADDUBSW computes two uint8 by int8 products for each pair of 8-bit lanes and sums the 16-bit integer products to yield the 16-bit result lanes. These can be used to compute dot products of this problem; but they also have “degenerate” configurations that are very useful:
_mm_madd_epi16(x, _mm_set1_epi16(1))sums the 16-bit even and odd lanes of x in pairs to yield 32-bit results.
_mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), _mm_setr_epi8(1, -1, 1, -1, ..., 1, -1))happens to be the fastest way to compute the 16-bit signed differences between 8-bit unsigned vectors
bon processors that support SSSE3.
- The 16-bit multiply example above shows another special configuration.
Long story short, these dot product instructions are surprisingly versatile in decidedly non-obvious ways.
_mm_sad_epu8, SSE2). This one is intended for motion estimation in video codecs, but it also happens to be the one actually really fast horizontal add you get on x86. In particular,
_mm_sad_epu8(x, _mm_setzero_si128()) computes two 16-bit horizontal sums of groups of 8 uint8 lanes in a single, and quite fast, operation. We can do the same trick we did for compares in reverse to compute the sum of 8 int8s instead: add (or subtract, or XOR)
x (before the
PSADBW), the subtract 128×8 from the resulting 16-bit sums.
To be continued!
There’s a lot more of these, but this feels like enough to chew on for a single blog post. So there will be a sequel covering, at least, integer widening/narrowing and variable shifts. Until then!