Bit scanning equivalencies
A lot of bit manipulation operations exist basically everywhere: AND, OR, XOR/EOR, shifts, and so forth. Some exist only on some architectures but have obvious implementations everywhere else – NAND, NOR, equivalence/XNOR and so forth.
Some exist as built-in instructions on some architectures and require fairly lengthy multi-instruction sequences when they’re not available. Some examples would be population count (number of bits set; available on recent x86s and POWER but not PowerPC), or bit reversal (available on ARMv6T2 and above).
And then there’s bit scanning operations. All of these can be done fairly efficiently on most architectures, but it’s not always obvious how.
x86 bit scanning operations
On x86, there’s
BSF (bit scan forward),
BSR (bit scan reverse),
TZCNT (trailing zero count) and
LZCNT (leading zero count). The first two are “old” (having existed since 386), the latter are very new.
TZCNT do basically the same thing, with one difference:
BSF has “undefined” results on a zero input (what is the index of the first set bit in the integer 0?), whereas the “trailing zeros” definition for
TZCNT provides an obvious answer: the width of the register.
BSR returns the index of the “last” set bit (again undefined on 0 input),
LZCNT returns the number of leading zero bits, so these two actually produce different results.
x86_bsf(x) = x86_tzcnt(x)unless x=0 (in which case
x86_bsr(x) = (reg_width - 1) - x86_lzcnt(x)unless x=0 (in which case
In the following, I’ll only use
TZCNT (use above table to convert where necessary) simply to avoid that nasty edge case at 0.
PowerPC bit scanning operations
cntlzw (Count Leading Zeros Word) and
cntlzd (Count Leading Zeros Double Word) but no equivalent for trailing zero bits. We can get fairly close though: there’s the old trick
x & -x which clears all but the lowest set bit in
x. As long as x is nonzero, this value has exactly one bit set, and we can determine its position using a leading zero count.
x86_tzcnt(x) = (reg_width - 1) - ppc_cntlz(x & -x)unless x=0 (in which case the PPC expression returns -1).
x86_lzcnt(x) = ppc_cntlz(x).
ARM bit scanning operations
ARM also has
CLZ (Count Leading Zeros) but nothing for trailing zero bits. But it also offers the aforementioned
RBIT which reverses the bits in a word, which makes a trailing zero count easy to accomplish:
x86_tzcnt(x) = arm_clz(arm_rbit(x)).
x86_lzcnt(x) = arm_clz(x).
Finally, ARMs NEON also offers
VCLS (Vector Count Leading Sign Bits), which (quoting from the documentation) “counts the number of consecutive bits following the topmost bit, that are the same as the topmost bit”. Well, we can do that on all architectures I mentioned as well, using only ingredients we already have:
arm_cls(x) = x86_lzcnt(x ^ (x >> 1)) - 1 (the shift here is an arithmetic shift). The expression
y = x ^ (x >> 1) gives a value that has bit
n set if and only if bits
n + 1 of x are the same. By induction, the number of leading zeros in
y is thus exactly the number of leading bits in
x that match the sign bit. This count includes the topmost (sign) bit, so it’s always at least 1, and the instruction definition I just quoted requires us to return the number of bits following the topmost bit that match it. So we subtract 1 to get the right result. Since we can do a fast leading zero count on all quoted platforms, we’re good.
Conclusion: bit scans / zero bit counts in either direction can be done fairly efficiently on all architectures covered, but you need to be careful when zeros are involved.