Reputation:
I have done a calculation using SSE to improve the performance of my code, of which I include a minimal working example. I have included comments and the compilation line to make it as clear as possible, please ask if you need any clarification.
I am trying to sum N bits, bit[0], ..., bit[N-1]
, and write the result in binary in a vector result[0], ..., result[bits_N-1]
, where bits_N
is the number of bits needed to write N in binary. This sum is performed bit-by-bit: each bit[i]
is an unsigned long long int, and into its j-th bit is stored either 0 or 1. As a result, I make 64 sums, each of N bit, in parallel.
In lines 80-105 I make this sum by using 64-bit arithmetic.
In lines 107-134 I do it by using SSE: I store the first half of the sum bit[0], ...., bit[N/2-1]
in the first 64 bits of _m128i objects BIT[0], ..., BIT[N/2-1]
, respectively. Similarly, I store bit[N/2], ...., bit[N-1]
in the last 64 bits of BIT[0], ..., BIT[N/2-1]
, respectively, and sum all the BIT
s. So far everything works fine, and the 128-bit sum takes the same time as the 64-bit one. However, to collect the final result I need to sum the two halves to each other, see lines 125-132. This takes a long time, and makes me lose the gain obtained with SSE.
I am running this on an Intel(R) i7-4980HQ CPU @ 2.80GHz with gcc 7.2.0.
Do you know a way around this?
Upvotes: 0
Views: 380
Reputation: 20037
The low part can be trivially saved with movq
instruction or either _mm_storel_epi64 (__m128i* mem_addr, __m128i a);
intrinsic storing to memory, or _mm_cvtsi128_si64
storing to register.
There is also a _mm_storeh_pd
counterpart, which requires cast to pd
and may cause a stall due to mixing floating points and integers.
The top part can be of course moved to low part with _mm_shuffle_epi(src, 0x4e)
and then saved with movq
.
Upvotes: 1