user2138251
user2138251

Reputation:

Compute sum of bits efficiently with SSE

I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.

I am trying to sum N bits, bit[0], ..., bit[N-1], and write the result in binary in a vector result[0], ..., result[bits_N-1], where bits_N is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i] is an unsigned long long int, and into its j-th bit is stored either 0 or 1. As a result, I make 64 sums, each of N bit, in parallel.

In lines 80-105 I make this sum by using 64-bit arithmetic.

In lines 107-134 I do it by using SSE: I store the first half of the sum bit[0], ...., bit[N/2-1] in the first 64 bits of _m128i objects BIT[0], ..., BIT[N/2-1], respectively. Similarly, I store bit[N/2], ...., bit[N-1] in the last 64 bits of BIT[0], ..., BIT[N/2-1], respectively, and sum all the BITs. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves to each other, see lines 125-132. This takes a long time, and makes me lose the gain obtained with SSE.

I am running this on an Intel(R) i7-4980HQ CPU @ 2.80GHz with gcc 7.2.0.

Do you know a way around this?

Upvotes: 0

Views: 380

Answers (1)

Aki Suihkonen
Aki Suihkonen

Reputation: 20037

The low part can be trivially saved with movq instruction or either _mm_storel_epi64 (__m128i* mem_addr, __m128i a); intrinsic storing to memory, or _mm_cvtsi128_si64 storing to register.

There is also a _mm_storeh_pd counterpart, which requires cast to pd and may cause a stall due to mixing floating points and integers.

The top part can be of course moved to low part with _mm_shuffle_epi(src, 0x4e) and then saved with movq.

Upvotes: 1

Related Questions