Reputation: 1935
I've been doing a task on an online judge: implement int sum(const int* array, unsigned int len)
so that it returns the array of the sum. len
can be 200,000 and this function can be called 200,000 times; and my program has to execute in under 0.9s.
Currently, my code looks like this:
#include <immintrin.h>
#include <stdio.h>
int sum(const int* array, unsigned int len) {
register int i = 8, s = 0;
__m256i sm = _mm256_loadu_si256((void *)(array));
for (; i+8 < len; i += 8) {
const __m256i x = _mm256_loadu_si256((void *)(array+i));
sm = _mm256_add_epi32(sm, x);
}
sm = _mm256_hadd_epi32(sm, sm);
sm = _mm256_hadd_epi32(sm, sm);
s = _mm256_extract_epi32(sm, 0);
s += _mm256_extract_epi32(sm, 4);
for(; i < len; ++i) s += array[i];
return s;
}
However, this code does not pass as the judge reports Time limit exceeded
.
Could anyone point out which instructions are expensive time-wise, and how to speed up my code?
Upvotes: 2
Views: 1710
Reputation: 1935
Implementing @JerryCoffin's suggestions:
#include <immintrin.h>
#include <stdio.h>
int sum(const int* array, unsigned int len) {
if(len < 60) {
int s = 0;
for(int i = 0; i < len; ++i) s += array[i];
return s;
}
register int i = 0, s = 0;
__m256i sm = _mm256_loadu_si256((void *)(array+i));
__m256i sm2 = _mm256_loadu_si256((void *)(array+i+8));
i += 16;
for (; i+16 < len; i += 16) {
const __m256i x = _mm256_loadu_si256((void *)(array+i));
sm = _mm256_add_epi32(sm, x);
const __m256i y = _mm256_loadu_si256((void *)(array+i+8));
sm2 = _mm256_add_epi32(sm2, y);
}
sm = _mm256_add_epi32(sm, sm2);
sm = _mm256_hadd_epi32(sm, sm);
sm = _mm256_hadd_epi32(sm, sm);
s += _mm256_extract_epi32(sm, 0);
s += _mm256_extract_epi32(sm, 4);
for(; i < len; ++i) s += array[i];
return s;
}
Interestingly, because the function is called so many times, consuming the ints until the array is aligned actually takes more time than using loadu
.
Upvotes: 0
Reputation: 490593
Doing a quick check, it looks like most reasonably recent processors provide two load ports, and two ports for addition, so at least theoretically you should get a decent gain by unrolling two iterations of the loop (though if the data is very big, it's probably going to pretty quickly come down to the bandwidth to main memory).
As with any AVX operation, you want to assure the data you're working with is properly aligned. Older processors will fault if the data is misaligned. Newer ones will work, but you'll get a pretty serious speed penalty.
Upvotes: 2