Reputation: 13
I have found the following code for counting bits in a 4-bytes integer in Arm assembly in the minimal number of instructions:
;R0 - value
;R1 = number of ones
;Uses R0-R1
;81 cycles worst case, 4 cycles best case, exit when r1=0
mov r1,r0,lsr #31
loop movs r0,r0,lsl #2
adc r1,r1,r0,lsr #31
bne loop
mov pc,r14
Can you please explain what is the algorithm here? I can't understand it though I know what all instructions do.
Upvotes: 1
Views: 4261
Reputation: 365812
mov r1,r0,lsr #31 @ start with r1 = the high bit of r0 (right shift by 31 bits)
loop movs r0,r0,lsl #2 @ left shift r0 by 2, and set flags on the result
adc r1,r1,r0,lsr #31
bne loop @ loop if r0 is non-zero (testing flags set by movs)
The add-with-carry is a neat trick: r0 >> 31
is the high bit, and the carry flag is the bit that was shifted out by the movs r0,r0,lsl #2
(I assume ARM works this way like x86, otherwise the algorithm doesn't make sense.)
So it adds 2 bits per iteration to the popcnt total: the high bit, and the last bit shifted out.
This is not the fastest way to popcnt.
A previous question: Fastest way to count number of 1s in a register, ARM assembly addressed this, with one answer using vcnt.8 d0, d0
and then a horizontal sum of the four 8bit counts. None of the answers there mentions this two-bits-at-a-time loop.
Even without a LUT (e.g. 8-bit LUT to look up the count for each byte) or a dedicated popcnt instruction, there are bithacks, like this one from Sean Anderson's collection:
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
constant-time no branches, but does require several 32bit constants (takes two instructions each to get into regs on ARM)
Upvotes: 2