Counting bits in arm assembly

Question

I have found the following code for counting bits in a 4-bytes integer in Arm assembly in the minimal number of instructions:

  ;R0 - value
  ;R1 = number of ones
  ;Uses R0-R1
  ;81 cycles worst case, 4 cycles best case, exit when r1=0

mov    r1,r0,lsr #31
loop   movs     r0,r0,lsl #2    
       adc  r1,r1,r0,lsr #31    
       bne  loop
       mov  pc,r14

Can you please explain what is the algorithm here? I can't understand it though I know what all instructions do.

Peter Cordes · Accepted Answer

       mov    r1,r0,lsr #31        @ start with r1 = the high bit of r0 (right shift by 31 bits)
loop   movs     r0,r0,lsl #2       @ left shift r0 by 2, and set flags on the result
       adc  r1,r1,r0,lsr #31
       bne  loop                   @ loop if r0 is non-zero (testing flags set by movs)

The add-with-carry is a neat trick: r0 >> 31 is the high bit, and the carry flag is the bit that was shifted out by the movs r0,r0,lsl #2 (I assume ARM works this way like x86, otherwise the algorithm doesn't make sense.)

So it adds 2 bits per iteration to the popcnt total: the high bit, and the last bit shifted out.

This is not the fastest way to popcnt.

A previous question: Fastest way to count number of 1s in a register, ARM assembly addressed this, with one answer using vcnt.8 d0, d0 and then a horizontal sum of the four 8bit counts. None of the answers there mentions this two-bits-at-a-time loop.

Even without a LUT (e.g. 8-bit LUT to look up the count for each byte) or a dedicated popcnt instruction, there are bithacks, like this one from Sean Anderson's collection:

v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

constant-time no branches, but does require several 32bit constants (takes two instructions each to get into regs on ARM)

Counting bits in arm assembly

Answers (1)

Related Questions