Compute sum of bits efficiently with SSE

Question

I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.

I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary in a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i] is an unsigned long long int, and into its j-th bit is stored either 0 or 1. As a result, I make 64 sums, each of N bit, in parallel.

In lines 80-105 I make this sum by using 64-bit arithmetic.

In lines 107-134 I do it by using SSE: I store the first half of the sum bit[0], ...., bit[N/2-1] in the first 64 bits of _m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ...., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves to each other, see lines 125-132. This takes a long time, and makes me lose the gain obtained with SSE.

I am running this on an Intel(R) i7-4980HQ CPU @ 2.80GHz with gcc 7.2.0.

Do you know a way around this?

Compute sum of bits efficiently with SSE

Answers (1)

Related Questions