Reputation: 2200
Thanks to some of you, I have already used SSE to speed up one of the functions in my scientific C++ app. It uses SSE instructions to compare huge vectors of ints.
The final version of the optimized SSE function is:
int getBestDiffsSse(int nodeId, const vector<int> &goalNodeIdTemp) {
    int positionNodeId = 2 * nodeId * nof;
    int myNewIndex = 2 * nof;
    int result[4] __attribute__((aligned(16))) = {0};
    __m128i vresult = _mm_set1_epi32(0);
    __m128i v1, v2, vmax;
    for (int k = 0; k < myNewIndex; k += 4) {
        v1 = _mm_loadu_si128((__m128i *) & distances[positionNodeId + k]);
        v2 = _mm_loadu_si128((__m128i *) & goalNodeIdTemp[k]);
        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *) result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}
where
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
and
int *distances;
distances = new int[size];
where size is huge (18M x 64)
My naive question is: do you believe I could get a better speedup if a) the array distances is aligned and b) the vector goalNodeIdTemp is aligned? And c) how do I do that?
I've seen some posts about memalign or align_malloc, but I have not understood how to use them for a dynamic array or a vector. Or, since I am talking about ints, is alignment not an issue? Keep in mind that I am using Ubuntu 12.04 and gcc, so a solution for the Visual Studio compiler is not an option.
Added questions: First of all, is the following code enough to align the dynamic array (keep in mind that the declaration and the initialization have to be kept separate)?
int *distances __attribute__((aligned(16)));
distances = new int[size];
Second, in order to align the vector goalNodeIdTemp, do I need to write the entire code for a custom vector allocator? Is there a simpler alternative?
I need your help. Thanks in advance
Upvotes: 0
Views: 2169
Reputation: 64253
There are several things you can do to improve performance a bit:
Move the declaration
__m128i v1, v2, vmax;
out of the loop, although the compiler most likely does that already.
Use _mm_load_si128, which requires 16-byte aligned addresses. If distances and goalNodeIdTemp were properly aligned, you could also use raw pointers. Something like this:
__m128i *v1 = (__m128i *) & distances[positionNodeId + k];
__m128i *v2 = (__m128i *) & goalNodeIdTemp[k];
For any further optimizations, you need to look at the generated assembly code.
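As a rough illustration of the aligned-load variant, here is a minimal sketch. The function name maxPairSumAligned and its parameters are made up for this example, and it leaves out the xor/sub step with vke and vko to stay focused on the loads. It assumes both pointers are 16-byte aligned and that n is a multiple of 4. To stay within SSE2 it replaces _mm_max_epi32 (SSE4.1) with a compare-and-select:

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>     // uintptr_t

// Hypothetical helper: returns the largest a[i] + b[i], but never less than 0,
// because the accumulator starts at zero as in the question's loop.
// Assumes a and b are 16-byte aligned and n is a multiple of 4.
int maxPairSumAligned(const int *a, const int *b, int n) {
    assert(reinterpret_cast<uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<uintptr_t>(b) % 16 == 0);
    assert(n % 4 == 0);

    __m128i vresult = _mm_setzero_si128();
    for (int k = 0; k < n; k += 4) {
        // Aligned loads: these fault if the address is not a multiple of 16.
        __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i *>(a + k));
        __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i *>(b + k));
        __m128i vsum = _mm_add_epi32(v1, v2);

        // SSE2 stand-in for _mm_max_epi32: pick vsum where vsum > vresult.
        __m128i gt = _mm_cmpgt_epi32(vsum, vresult);
        vresult = _mm_or_si128(_mm_and_si128(gt, vsum),
                               _mm_andnot_si128(gt, vresult));
    }

    // Horizontal maximum over the four lanes.
    int lanes[4] __attribute__((aligned(16)));
    _mm_store_si128(reinterpret_cast<__m128i *>(lanes), vresult);
    int best = lanes[0];
    for (int i = 1; i < 4; ++i)
        if (lanes[i] > best) best = lanes[i];
    return best;
}
```

Note that in the original function the load address is &distances[positionNodeId + k], so besides the base pointer being aligned, positionNodeId has to be a multiple of 4 for every load to land on a 16-byte boundary.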
Do you believe I could get a better speed up if both: a) array distances is aligned b) vector goalNodeIdTemp is aligned
Yes, you will get a small performance boost. Nothing spectacular, but if every cycle counts, it may be noticeable.
how do I do that?
To have goalNodeIdTemp aligned, you have to use a special allocator for std::vector (see for example here how to do it).
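A minimal sketch of such an allocator could look like the following. The names Aligned16Allocator and AlignedIntVector are invented for this example, and it assumes posix_memalign is available (it is on Linux with gcc). It is written in the older, pre-C++11 allocator style so that it should also build with the gcc shipped on Ubuntu 12.04, though that has not been checked here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>   // posix_memalign, free
#include <new>       // std::bad_alloc, placement new
#include <stdint.h>  // uintptr_t
#include <vector>

// Hypothetical allocator: every block it returns starts on a 16-byte boundary.
template <typename T>
class Aligned16Allocator {
public:
    typedef T value_type;
    typedef T *pointer;
    typedef const T *const_pointer;
    typedef T &reference;
    typedef const T &const_reference;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;

    template <typename U>
    struct rebind { typedef Aligned16Allocator<U> other; };

    Aligned16Allocator() {}
    template <typename U>
    Aligned16Allocator(const Aligned16Allocator<U> &) {}

    pointer address(reference x) const { return &x; }
    const_pointer address(const_reference x) const { return &x; }

    pointer allocate(size_type n, const void * = 0) {
        void *p = 0;
        // 16 is a power of two and a multiple of sizeof(void *), as required.
        if (posix_memalign(&p, 16, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<pointer>(p);
    }
    void deallocate(pointer p, size_type) { free(p); }

    size_type max_size() const { return static_cast<size_type>(-1) / sizeof(T); }

    void construct(pointer p, const T &value) { new (static_cast<void *>(p)) T(value); }
    void destroy(pointer p) { p->~T(); }
};

// The allocator is stateless, so any two instances are interchangeable.
template <typename T, typename U>
bool operator==(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) { return false; }

typedef std::vector<int, Aligned16Allocator<int> > AlignedIntVector;
```

With this, goalNodeIdTemp would be declared as AlignedIntVector, and the parameter type of getBestDiffsSse would have to change accordingly, since a vector with a different allocator is a different type from vector<int>.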
To align distances, you have to be a bit careful. See here how to allocate aligned memory.
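As for the snippet in the question: putting __attribute__((aligned(16))) on the declaration of distances only aligns the pointer variable itself, not the block that new int[size] returns. One possible way to get an aligned block on Linux with gcc is posix_memalign. The sketch below is only an illustration; the helper names allocAlignedInts and isAligned16 are made up for it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>   // posix_memalign, free
#include <stdint.h>  // uintptr_t

// Hypothetical helper: allocate `count` ints starting on a 16-byte boundary.
// Returns NULL on failure. The block must be released with free(), not delete[].
int *allocAlignedInts(std::size_t count) {
    void *p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0)
        return NULL;
    return static_cast<int *>(p);
}

// Hypothetical helper: true if p is usable with _mm_load_si128 / _mm_store_si128.
bool isAligned16(const void *p) {
    return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}
```

The declaration and the initialization can still be kept separate: declare int *distances; as before, then later assign distances = allocAlignedInts(size); and call free(distances) when done. Only every fourth int starts a 16-byte block, so an aligned load from &distances[i] is valid only when i is a multiple of 4.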
Upvotes: 1