I'm trying to use SSE intrinsics to add two 32-bit signed int arrays. But I'm getting very poor performance compared to a linear addition.
Platform - Intel Core i3 550, GCC 4.4.3, Ubuntu 10.04 (bit old, yeah)
#define ITER 1000
typedef union sint4_u {
    __m128i v;
    sint32_t x[4];
} sint4;
The functions:
void compute(sint32_t *a, sint32_t *b, sint32_t *c) {
    sint32_t len = 96000;
    sint32_t i, j;
    __m128i x __attribute__ ((aligned(16)));
    __m128i y __attribute__ ((aligned(16)));
    sint4 z;
    for(j = 0; j < ITER; j++) {
        for(i = 0; i < len; i += 4) {
            x = _mm_set_epi32(a[i + 0], a[i + 1], a[i + 2], a[i + 3]);
            y = _mm_set_epi32(b[i + 0], b[i + 1], b[i + 2], b[i + 3]);
            z.v = _mm_add_epi32(x, y);
            c[i + 0] = z.x[3];
            c[i + 1] = z.x[2];
            c[i + 2] = z.x[1];
            c[i + 3] = z.x[0];
        }
    }
    return;
}
void compute_s(sint32_t *a, sint32_t *b, sint32_t *c) {
    sint32_t len = 96000;
    sint32_t i, j;
    for(j = 0; j < ITER; j++) {
        for(i = 0; i < len; i++) {
            c[i] = a[i] + b[i];
        }
    }
    return;
}
The results:
➜ C gcc -msse4.2 simd.c
➜ C ./a.out
Time Elapsed (SSE): 612.520000 mS
Time Elapsed (Scalar): 401.713000 mS
➜ C gcc -O3 -msse4.2 simd.c
➜ C ./a.out
Time Elapsed (SSE): 135.124000 mS
Time Elapsed (Scalar): 46.438000 mS
With -O3, the SSE version is about 3 times slower than the scalar one (!!). What am I doing wrong? Even if I skip storing the results back to c in compute, it still takes an extra 100 ms without any optimizations.
EDIT: As suggested in the comments, I replaced _mm_set with _mm_load. Here are the updated times:
➜ C gcc audproc.c -msse4
➜ C ./a.out
Time Elapsed (SSE): 303.931000 mS
Time Elapsed (Scalar): 413.701000 mS
➜ C gcc -O3 audproc.c -msse4
➜ C ./a.out
Time Elapsed (SSE): 82.532000 mS
Time Elapsed (Scalar): 48.104000 mS
Much, much better, but still nowhere close to the theoretical gain of 4x. Also, why is my vectorized version slower at -O3? And how do I get rid of this warning? (I tried adding __vector__ to my declaration but got more warnings instead. :( )
audproc.c: In function ‘compute’:
audproc.c:54: warning: passing argument 1 of ‘_mm_load_si128’ from incompatible pointer type
/usr/lib/gcc/i486-linux-gnu/4.4.3/include/emmintrin.h:677: note: expected ‘const long long int __vector__ *’ but argument is of type ‘const sint32_t *’
Reputation: 213060
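As already mentioned in the comments, in order to get the performance benefits of SIMD you should avoid scalar operations in your loop, i.e. get rid of the _mm_set_epi32 pseudo-intrinsics and the union for storing SIMD results. Each _mm_set_epi32 is built from several scalar loads and shuffles, and reading the result back through the union element by element adds more scalar work, so that overhead outweighs the single vector add. Loading and storing whole vectors through a pointer cast to __m128i * also removes the incompatible pointer type warning. Here is a fixed version of your function: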
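void compute(const sint32_t *a, const sint32_t *b, sint32_t *c)
{
    sint32_t len = 96000;   /* must be a multiple of 4 for this loop */
    sint32_t i, j;
    for(j = 0; j < ITER; j++)
    {
        for(i = 0; i < len; i += 4)
        {
            /* load four 32-bit ints at once; the loadu/storeu forms do not require 16-byte alignment */
            __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
            __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
            /* one instruction adds all four lanes */
            __m128i z = _mm_add_epi32(x, y);
            /* store all four results directly, no union needed */
            _mm_storeu_si128((__m128i *)&c[i], z);
        }
    }
}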
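The loop above relies on len being a multiple of 4. As a minimal sketch of the same technique for arbitrary lengths, the following adds a scalar tail loop after the vector body. The function name add_i32_sse and its size_t length parameter are hypothetical (they are not from the original post), and it uses the standard int32_t in place of the poster's sint32_t typedef.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original post:
   computes c[i] = a[i] + b[i] for i in [0, n), for any n. */
void add_i32_sse(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    size_t i = 0;

    /* Vector body: four 32-bit lanes per iteration; unaligned pointers are fine. */
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i y = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&c[i], _mm_add_epi32(x, y));
    }

    /* Scalar tail: the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Calling add_i32_sse(a, b, c, 96000) would cover the case in the question. If the arrays are known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 can replace the unaligned variants with the same pointer casts.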