R. Moravskyi
R. Moravskyi

Reputation: 31

Counting '1' in number in C

My task was to print all whole numbers from 2 to N(for which in binary amount of '1' is bigger than '0')

int CountOnes(unsigned int x)
{ 
    unsigned int iPassedNumber = x; // number to be modifed
    unsigned int iOriginalNumber = iPassedNumber;
    unsigned int iNumbOfOnes = 0;

    while (iPassedNumber > 0)
    {
        iPassedNumber = iPassedNumber >> 1 << 1;  //if LSB was '1', it turns to '0'

        if (iOriginalNumber - iPassedNumber == 1) //if diffrence == 1, then we increment numb of '1'
        {
            ++iNumbOfOnes;
        }

        iOriginalNumber = iPassedNumber >> 1; //do this to operate with the next bit
        iPassedNumber = iOriginalNumber; 
    }
    return (iNumbOfOnes);
}

Here is my function to calculate the number of '1' in binary. It was my homework in college. However, my teacher said that it would be more efficient to

{ 
   if(n%2==1)
      ++CountOnes;
   else(n%2==0)
      ++CountZeros;
}

In the end, I just messed up and don`t know what is better. What do you think about this?

Upvotes: 3

Views: 991

Answers (2)

Ahmed Masud
Ahmed Masud

Reputation: 22372

I used gcc compiler for the experiment below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.

When trying to figure out the most optimized method for doing something you want to see what kind of code the compiler produces. Look at the CPU's manual and see which operations are fast and which are slow on that particular architecture. Although there are general guidelines. And of course if there are ways you can reduce the number of instructions that a CPU has to perform.

I decided to show you a few different methods (not exhaustive) and give you a sample of how to go about looking at optimization of small functions (like this one) manually. There are more sophisticated tools that help with larger and more complex functions, however this approach should work with pretty much anything:

Note

All assembly code was produced using:

gcc -O99 -o foo -fprofile-generate foo.c

followed by

gcc -O99 -o foo -fprofile-use foo.c

On -fprofile-generate

The double compile makes gcc really let's gcc work (although -O99 most likely does that already) however mileage may vary based on which version of gcc you may be using.

On with it:

Method I (you)

Here is the disassembly of your function:

CountOnes_you:
.LFB20:
        .cfi_startproc
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L5
        .p2align 4,,10
        .p2align 3
.L4:
        movl    %edi, %edx
        xorl    %ecx, %ecx
        andl    $-2, %edx
        subl    %edx, %edi
        cmpl    $1, %edi
        movl    %edx, %edi
        sete    %cl
        addl    %ecx, %eax
        shrl    %edi
        jne     .L4
        rep ret
        .p2align 4,,10
        .p2align 3
.L5:
        rep ret
        .cfi_endproc

At a glance

Approximately 9 instructions in a loop, until the loop exits

Method II (teacher)

Here is a function which uses your teacher's algo:

int CountOnes_teacher(unsigned int x)
{
    unsigned int one_count = 0;
    while(x) {
        if(x%2)
            ++one_count;
        x >>= 1;
    }
    return one_count;
}

Here's the disassembly of that:

CountOnes_teacher:
.LFB21:
        .cfi_startproc
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L12
        .p2align 4,,10
        .p2align 3
.L11:
        movl    %edi, %edx
        andl    $1, %edx
        cmpl    $1, %edx
        sbbl    $-1, %eax
        shrl    %edi
        jne     .L11
        rep ret
        .p2align 4,,10
        .p2align 3
.L12:
        rep ret
        .cfi_endproc

At a glance:

5 instructions in a loop until the loop exits

Method III

Here is Kernighan's method:

 int CountOnes_K(unsigned int x) {
      unsigned int count;
      for(count = 0; ; x; count++) {
          x &= x - 1; // clear least sig bit
      }
      return count;
 }

Here's the disassembly:

CountOnes_k:
.LFB22:
        .cfi_startproc
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L19
        .p2align 4,,10
        .p2align 3
.L18: 
        leal    -1(%rdi), %edx
        addl    $1, %eax
        andl    %edx, %edi
        jne     .L18  ; loop is here
        rep ret
        .p2align 4,,10
        .p2align 3
.L19:
        rep ret
        .cfi_endproc

At a glance

3 instructions in a loop.

Some commentary before continuing

As you can see the compiler doesn't really use the best way when you employ % to count (which was used by both you and your teacher).

Krenighan method is pretty optimized, least number of operations in the loop). It is instructional to compare Krenighan to the naive method of counting, while on the surface it may look the same it's really not!

for (c = 0; v; v >>= 1)
{
  c += v & 1;
}

This method sucks compared to Krenighans. Here if you have say the 32nd bit set this loop will run 32 times, whereas Krenighan's will not!

But all these methods are still rather sub-par because they loop.

If we combine a couple of other piece of (implicit) knowledge into our algorithms we can get rid of loops all together. Those are, 1 the size of our number in bits, and the size of a character in bits. With these pieces and by realizing that we can filter out bits in chunks of 14, 24 or 32 bits given that we have a 64 bit register.

So for instance, if we look at a 14-bit number then we can simply count the bits by:

 (n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

uses % but only once for all numbers between 0x0 and 0x3fff

For 24 bits we use 14 bits and then something similar for the remaining 10 bits:

  ((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f 
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) 
 % 0x1f;

But we can generalize this concept by realizing the patterns in the numbers above and realize that the magic numbers are actually just compliments (look at the hex numbers closely 0x8000 + 0x400 + 0x200 + 0x1) shifted

We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits) (no loops) O(1):

CountOnes_best(unsigned int n) {
    const unsigned char_bits = sizeof(unsigned char) << 3;
    typedef __typeof__(n) T; // T is unsigned int in this case;
    n = n - ((n >> 1) & (T)~(T)0/3); // reuse n as a temporary 
    n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
    n = (n + (n >> 4)) & (T)~(T)0/255*15;
    return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
} 


CountOnes_best:
.LFB23:
        .cfi_startproc
        movl    %edi, %eax
        shrl    %eax
        andl    $1431655765, %eax
        subl    %eax, %edi
        movl    %edi, %edx
        shrl    $2, %edi
        andl    $858993459, %edx
        andl    $858993459, %edi
        addl    %edx, %edi
        movl    %edi, %ecx
        shrl    $4, %ecx
        addl    %edi, %ecx
        andl    $252645135, %ecx
        imull   $16843009, %ecx, %eax
        shrl    $24, %eax
        ret
        .cfi_endproc

This may be a bit of a jump from (how the heck did you go from previous to here), but just take your time to go over it.

The most optimized method was first mentioned in Software Optimization Guide for AMD Athelon™ 64 and Opteron™ Processor, my URL of that is broken. It is also well explained on the very excellent C bit twiddling page

I highly recommend going over the content of that page it really is a fantastic read.

Upvotes: 5

Clifford
Clifford

Reputation: 93476

Even better that your teacher's suggestion:

   if( n & 1 ) {
      ++ CountOnes;
   }
   else {
      ++ CountZeros;
   }

n % 2 has an implicit divide operation which the compiler is likely to optimise, but you should not rely on it - divide is a complex operation that takes longer on some platforms. Moreover there are only two options 1 or 0, so if it is not a one, it is a zero - there is no need for the second test in the else block.

Your original code is overcomplex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, and the number of iterations. Also the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted to count the zeros so you'd need four variables to complete the assignment). The following:

unsigned int n = x; // number to be modifed
int ones = 0 ;
int zeroes = 0 ;

while( i > 0 )
{
   if( (n & 1) != 0 )
   {
      ++ones ;
   }
   else
   {
      ++zeroes ;
   }

   n >>= 1 ;
}

has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.

Upvotes: 2

Related Questions