C++ SSE and aligned array of ints and vector of ints

Question

Thanks to some of you, I have already used SSE instructions to speed up one of the functions in my scientific app (see the earlier question "C++ use SSE instructions for comparing huge vectors of ints").

The final version of the optimized SSE function is:

int getBestDiffsSse(int nodeId, const vector<int> &goalNodeIdTemp) {
    int positionNodeId = 2 * nodeId * nof;
    int myNewIndex = 2 * nof;
    int result[4] __attribute__((aligned(16))) = {0};

    __m128i vresult = _mm_set1_epi32(0);
    __m128i v1, v2, vmax;

    for (int k = 0; k < myNewIndex; k += 4) {
        v1 = _mm_loadu_si128((__m128i *) & distances[positionNodeId + k]);
        v2 = _mm_loadu_si128((__m128i *) & goalNodeIdTemp[k]);
        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *) result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}

where

const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);

and

int* distances;
distances = new int[size];

where size is huge (18M x 64).

My naive question is: do you think I could get a better speed-up if (a) the array distances is aligned and (b) the vector goalNodeIdTemp is aligned? And (c) how do I do that?

I've seen some posts about memalign or align_malloc, but I have not understood how to use them for a dynamic array or a vector. Or, since I am dealing with ints, is alignment not an issue? Keep in mind that I am using Ubuntu 12.04 and gcc, so a solution for the Visual Studio compiler is not an option.

Added questions: First, is the following code enough to align the dynamic array? (Keep in mind that the definition and the initialization have to be kept separate.)

int *distances __attribute__((aligned(16)));
distances = new int[size];

Second, to align the vector goalNodeIdTemp, do I need to write an entire custom vector allocator? Is there a simpler alternative?

I need your help. Thanks in advance

BЈовић · Accepted Answer

There are several things you can do to improve performance a bit:

Take __m128i v1, v2, vmax; out of the loop, although the compiler most likely does that already.
Make sure distances is properly aligned.
Instead of using std::vector, align the data and pass a pointer. Then use _mm_load_si128.

If distances and goalNodeIdTemp were properly aligned, you could use raw pointers, something like this:

__m128i *v1 = (__m128i *) & distances[positionNodeId + k];
__m128i *v2 = (__m128i *) & goalNodeIdTemp[k];
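
As a rough sketch of what the loop might look like once both buffers are aligned, the function below uses _mm_load_si128 instead of _mm_loadu_si128. The name bestDiffsAligned and its pointer-plus-count signature are made up for illustration. It assumes both pointers are 16-byte aligned and that count is a multiple of 4. To keep the sketch compilable with plain SSE2, the maximum is computed with a compare-and-select; with SSE4.1 enabled you could keep _mm_max_epi32 as in the question.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <algorithm>

// Hypothetical aligned variant of the question's loop.
// Assumes: a and b are 16-byte aligned, count is a multiple of 4.
int bestDiffsAligned(const int *a, const int *b, int count) {
    // Masks from the question: XOR with -1 followed by subtracting -1
    // negates a lane; lanes with mask 0 are left unchanged.
    const __m128i vke = _mm_set_epi32(0, -1, 0, -1);  // negates even lanes
    const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);  // negates odd lanes
    __m128i vresult = _mm_setzero_si128();

    for (int k = 0; k < count; k += 4) {
        // Aligned loads: these fault if the address is not 16-byte aligned.
        __m128i v1 = _mm_load_si128(reinterpret_cast<const __m128i *>(a + k));
        __m128i v2 = _mm_load_si128(reinterpret_cast<const __m128i *>(b + k));
        v1 = _mm_sub_epi32(_mm_xor_si128(v1, vke), vke);
        v2 = _mm_sub_epi32(_mm_xor_si128(v2, vko), vko);
        __m128i vsum = _mm_add_epi32(v1, v2);
        // SSE2-only replacement for _mm_max_epi32(vresult, vsum).
        __m128i gt = _mm_cmpgt_epi32(vsum, vresult);
        vresult = _mm_or_si128(_mm_and_si128(gt, vsum),
                               _mm_andnot_si128(gt, vresult));
    }

    alignas(16) int result[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(result), vresult);
    return std::max(std::max(result[0], result[1]),
                    std::max(result[2], result[3]));
}
```

Note that with the question's indexing, the start of each row (positionNodeId) is only 16-byte aligned if the base pointer is aligned and 2 * nof is a multiple of 4.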

For any further optimization, you need to look at the generated assembly code.

Do you believe I could get a better speed up if both: a) array distances is aligned b) vector goalNodeIdTemp is aligned

Yes, you will get a small performance boost. Nothing spectacular, but if every cycle counts, it may be noticeable.

how do I do that?

To have goalNodeIdTemp aligned, you have to use a special allocator for std::vector (see for example here how to do it).
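
A minimal sketch of such an allocator might look like the following. It assumes a C++11 standard library (which only needs this reduced allocator interface) and a POSIX system such as the Ubuntu/gcc setup from the question. The names Aligned16Allocator and AlignedIntVector are invented here for illustration, and the alignment is hard-coded to 16 bytes.

```cpp
#include <cstddef>
#include <cstdlib>  // posix_memalign, free
#include <new>      // std::bad_alloc
#include <vector>

// Hypothetical minimal C++11 allocator that hands out 16-byte-aligned storage.
template <class T>
struct Aligned16Allocator {
    typedef T value_type;

    Aligned16Allocator() {}
    template <class U>
    Aligned16Allocator(const Aligned16Allocator<U> &) {}

    T *allocate(std::size_t n) {
        // Guard against overflow in n * sizeof(T).
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void *p = 0;
        if (posix_memalign(&p, 16, n * sizeof(T)) != 0) throw std::bad_alloc();
        return static_cast<T *>(p);
    }

    // Memory from posix_memalign is released with free.
    void deallocate(T *p, std::size_t) { std::free(p); }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U>
bool operator==(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) {
    return true;
}
template <class T, class U>
bool operator!=(const Aligned16Allocator<T> &, const Aligned16Allocator<U> &) {
    return false;
}

// Drop-in replacement for std::vector<int> with aligned storage.
typedef std::vector<int, Aligned16Allocator<int> > AlignedIntVector;
```

With this, goalNodeIdTemp would be declared as AlignedIntVector, and the function parameter type would have to change accordingly, since a vector with a different allocator is a different type.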

To align distances, you have to be a bit careful. See here how to allocate aligned memory.
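
For the dynamic array, one option on Linux with gcc is posix_memalign. The sketch below assumes that route. The helper names allocAlignedInts and freeAlignedInts are hypothetical. Note that writing __attribute__((aligned(16))) on the pointer declaration only aligns the pointer variable itself, not the memory that new int[size] returns.

```cpp
#include <cstddef>
#include <cstdlib>  // posix_memalign, free
#include <new>      // std::bad_alloc

// Hypothetical helper: allocate count ints on a 16-byte boundary.
int *allocAlignedInts(std::size_t count) {
    void *p = 0;
    // posix_memalign returns 0 on success; the alignment must be a
    // power of two and a multiple of sizeof(void *).
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0) throw std::bad_alloc();
    return static_cast<int *>(p);
}

// Memory from posix_memalign must be released with free, not delete[].
void freeAlignedInts(int *p) { std::free(p); }
```

The declaration and the initialization can still be kept apart: declare int *distances; first and later assign distances = allocAlignedInts(size);.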
