Speed difference for bitwise operations

Question

Below two code blocks calculate the Hamming distance between two integers. They are the same, but why is the first one faster than the second one on LeetCode?

Fast:

int hammingDistance(int x, int y) {
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += (x&1) ^ (y&1);
        x >>= 1;
        y >>= 1;
    }
    return count;
}

Slow:

int hammingDistance(int x, int y) {
    int n = x ^ y;
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

**** Update ****

I have written a test code on my Mac machine:

#include 
#include "cstdio"

int hamm_fast(int x, int y) {
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += (x&1) ^ (y&1);
        x >>= 1;
        y >>= 1;
    }
    return count;
}

int hamm_slow(int x, int y) {
    int n = x ^ y;
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

int main()
{
    clock_t begin;
    clock_t end;
    double time_spent;

    // benchmark fast
    begin = clock();
    for (int i = 0; i < 100000; ++i)
        hamm_fast(100,100);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Fast: %f ms
", time_spent*1e3);

    // benchmark slow
    begin = clock();
    for (int i = 0; i < 100000; ++i)
        hamm_slow(100,100);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Slow: %f ms
", time_spent*1e3);

    return 0;
}

And compile and run as

g++ ham.cpp && ./a.out

And the two solutions turn out to be similar. For example:

Fast: 7.233000 ms
Slow: 6.963000 ms

Actually the slow one is faster ...

Jonathan Leffler · Accepted Answer

As I noted in a comment, you need to be careful with the benchmarking to ensure that an aggressive optimizer cannot optimize the function calls away.

An aggressive optimizer could remove the test loops because you pass the same values to the Hamming distance functions on each iteration and ignore the returned value — which would leave you with nothing to measure.

If it were my testing, I'd have a test function containing the timing and function calls. The function call would be inside two loops, and I'd pass the two loop indexes as arguments to the function, and sum the return values, and print the sum at the end to help ensure that the functions produce the same results. I'd also make it run for seconds, not milliseconds.

Here's my code. It uses timing code available in my SOQ (Stack Overflow Questions) repository on GitHub as files timer.c and timer.h in the src/libsoq sub-directory.

#include "timer.h"
#include 

#define L1_MIN 0
#define L1_MAX 10240
#define L2_MIN 0
#define L2_MAX 10240

static int hamm_fast(int x, int y)
{
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += (x & 1) ^ (y & 1);
        x >>= 1;
        y >>= 1;
    }
    return count;
}

static int hamm_slow(int x, int y)
{
    int n = x ^ y;
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

static void tester(const char *tag, int (*function)(int x, int y))
{
    Clock t;
    clk_init(&t);

    clk_start(&t);
    int sum = 0;
    for (int i = L1_MIN; i < L1_MAX; i++)
    {
        for (int j = L2_MIN; j < L2_MAX; j++)
            sum += (*function)(i, j);
    }
    clk_stop(&t);

    char buffer[32];
    int iterations = (L1_MAX - L1_MIN) * (L2_MAX - L2_MIN);
    printf("%s sum = %d (%d iterations) %s
", tag, sum, iterations,
            clk_elapsed_us(&t, buffer, sizeof(buffer)));
}

int main(void)
{
    for (int i = 0; i < 10; i++)
    {
        tester("Fast", hamm_fast);
        tester("Slow", hamm_slow);
    }
    return 0;
}

The output I got on one run was:

Fast sum = 710934528 (104857600 iterations) 2.461100
Slow sum = 710934528 (104857600 iterations) 1.181584
Fast sum = 710934528 (104857600 iterations) 2.480401
Slow sum = 710934528 (104857600 iterations) 1.182961
Fast sum = 710934528 (104857600 iterations) 2.466685
Slow sum = 710934528 (104857600 iterations) 1.197394
Fast sum = 710934528 (104857600 iterations) 2.435806
Slow sum = 710934528 (104857600 iterations) 1.175533
Fast sum = 710934528 (104857600 iterations) 2.384162
Slow sum = 710934528 (104857600 iterations) 1.184161
Fast sum = 710934528 (104857600 iterations) 2.376042
Slow sum = 710934528 (104857600 iterations) 1.191555
Fast sum = 710934528 (104857600 iterations) 2.389027
Slow sum = 710934528 (104857600 iterations) 1.169186
Fast sum = 710934528 (104857600 iterations) 2.393707
Slow sum = 710934528 (104857600 iterations) 1.209600
Fast sum = 710934528 (104857600 iterations) 2.423526
Slow sum = 710934528 (104857600 iterations) 1.204585
Fast sum = 710934528 (104857600 iterations) 2.515968
Slow sum = 710934528 (104857600 iterations) 1.196783

As you can see, the 'fast' code is about twice as slow as the 'slow' code. That's primarily because the 'fast' code is doing many more operations per loop than the 'slow' code. The 'fast' code does 2 & operations, 1 ^ operation, and 2 >>= operations compared to 1 & and 1 >>= in the 'slow' code. But the results are apparently the same; that's the good news. The functions are equivalent in terms of result, but not in terms of speed.

Compilation on a MacBook Pro running macOS 10.13.6 High Sierra, using GCC 8.2.0.

Compilation command line (source file spot79.c):

$ gcc -O3 -g -I./inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes \
>     -Wstrict-prototypes spot79.c -o spot79 -L./lib -lsoq
$

The timer.h header was in the ./inc directory and the soq library was in ./lib — that's simply my build setup.

Speed difference for bitwise operations

Answers (1)

Related Questions