Using Intel Intrinsics to quickly find sum of array of integers

Question

I've been doing a task on an online judge: implement int sum(const int* array, unsigned int len) so that it returns the array of the sum. len can be 200,000 and this function can be called 200,000 times; and my program has to execute in under 0.9s.

Currently, my code looks like this:

#include 
#include 

int sum(const int* array, unsigned int len) {
    register int i = 8, s = 0;
    __m256i sm = _mm256_loadu_si256((void *)(array));
    for (; i+8 < len; i += 8) {
        const __m256i x = _mm256_loadu_si256((void *)(array+i));
        sm = _mm256_add_epi32(sm, x);
    }
    sm = _mm256_hadd_epi32(sm, sm);
    sm = _mm256_hadd_epi32(sm, sm);
    s = _mm256_extract_epi32(sm, 0);
    s += _mm256_extract_epi32(sm, 4);
    for(; i < len; ++i) s += array[i];
    return s;
}

However, this code does not pass as the judge reports Time limit exceeded.

Could anyone point out which instructions are expensive time-wise, and how to speed up my code?

Jerry Coffin · Accepted Answer

Doing a quick check, it looks like most reasonably recent processors provide two load ports, and two ports for addition, so at least theoretically you should get a decent gain by unrolling two iterations of the loop (though if the data is very big, it's probably going to pretty quickly come down to the bandwidth to main memory).

As with any AVX operation, you want to assure the data you're working with is properly aligned. Older processors will fault if the data is misaligned. Newer ones will work, but you'll get a pretty serious speed penalty.

Using Intel Intrinsics to quickly find sum of array of integers

Answers (2)

Related Questions