a3mlord

Reputation: 1060

Wrong result in vectorization with SSE

The code below generates the following output:

6 6 0 140021597270387

which means that only the first two positions are calculated correctly. However, I am dealing with longs (4 bytes) and __m128i can hold more than 4 longs.

long* AA = (long*)malloc(32*sizeof(long));
long* BB = (long*)malloc(32*sizeof(long));

for(i = 0; i<4;i++){
    AA[i] = 2;
    BB[i] = 3;
}

__m128i* m1 = (__m128i*) AA;
__m128i* m2 = (__m128i*) BB;

__m128i m3 = _mm_mul_epu32(m1[0],m2[0]);

long* CC = (long*) malloc(16 * sizeof(long));
CC = (long*)&m3;

for (i = 0; i < 4; i++)
    printf("%ld \n",CC[i]);

Allocating with memalign instead:

long* AA = (long*) memalign(16 * sizeof(long), 16);

(and likewise for the remaining arrays) causes a segmentation fault. Can somebody comment?

Thanks

Upvotes: 3

Views: 320

Answers (1)

Paul R

Reputation: 212929

1) don't use an indeterminate-sized type like long; use a specific fixed-width type such as uint32_t

2) don't use malloc - it's not guaranteed to return 16-byte-aligned memory; use memalign or an equivalent*

3) don't cast the result of malloc (or any other function returning void *) in C

4) no need to allocate yet another buffer just to print results

Fixed code:

uint32_t* AA = memalign(16, 32*sizeof(uint32_t));    // memalign(alignment, size)
uint32_t* BB = memalign(16, 32*sizeof(uint32_t));

for (i = 0; i < 4; i++){
    AA[i] = 2;
    BB[i] = 3;
}

__m128i* m1 = (__m128i*)AA;
__m128i* m2 = (__m128i*)BB;

__m128i m3 = _mm_mul_epu32(m1[0], m2[0]);    // 2 x 32x32->64 bit unsigned multiplies (elements 0 and 2) -> m3

uint64_t* CC = (uint64_t*)&m3;

for (i = 0; i < 2; i++)                      // display 2 x 64 bit result values
    printf("%llu\n", (unsigned long long)CC[i]);  // cast so the argument matches %llu on every platform
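
To make it clearer why only two results come back, here is a minimal sketch of which elements _mm_mul_epu32 actually multiplies. The helper name mul_even_lanes is hypothetical (it is not part of the original code), and unaligned loads are used so the sketch does not depend on how the arrays were allocated:

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_mul_epu32, _mm_loadu_si128 */
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * _mm_mul_epu32 takes the low 32 bits of each 64-bit half of its inputs,
 * i.e. elements 0 and 2 of a 4 x uint32_t array, and produces two 64-bit
 * products. Elements 1 and 3 are ignored. */
void mul_even_lanes(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i prod = _mm_mul_epu32(va, vb);              /* a[0]*b[0], a[2]*b[2] */
    _mm_storeu_si128((__m128i *)out, prod);            /* unaligned store */
}
```

Calling it with a = {2, 100, 5, 100} and b = {3, 100, 7, 100} should leave 6 and 35 in out; the 100s never take part in the multiplication.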

*Note that, depending on your platform, you may need to use a call other than memalign in order to allocate suitably aligned memory, e.g. posix_memalign, _mm_malloc or _aligned_malloc (WIN32).
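
As a rough sketch of one of those alternatives, the following uses _mm_malloc, assuming a compiler whose SSE headers declare it (GCC and Clang do via the intrinsics headers). The helper names alloc_u32_aligned16 and mul_aligned_demo are hypothetical. Note the argument order, (size, alignment), which is the reverse of memalign's (alignment, size):

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumed to also declare _mm_malloc/_mm_free */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: `count` uint32_t values on a 16-byte boundary,
 * or NULL on failure. Release the memory with _mm_free, not free. */
uint32_t *alloc_u32_aligned16(size_t count)
{
    return _mm_malloc(count * sizeof(uint32_t), 16);  /* (size, alignment) */
}

/* Hypothetical demo: same computation as the fixed code, with aligned loads.
 * Returns 0 on success, -1 if an allocation failed. */
int mul_aligned_demo(uint64_t out[2])
{
    uint32_t *a = alloc_u32_aligned16(4);
    uint32_t *b = alloc_u32_aligned16(4);
    int rc = -1;

    if (a != NULL && b != NULL) {
        for (int i = 0; i < 4; i++) {
            a[i] = 2;
            b[i] = 3;
        }
        /* Aligned loads are safe here because both buffers are 16-byte aligned. */
        __m128i va = _mm_load_si128((const __m128i *)a);
        __m128i vb = _mm_load_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
        rc = 0;
    }
    if (a != NULL) _mm_free(a);
    if (b != NULL) _mm_free(b);
    return rc;
}
```

On success, out should hold the two 64-bit products 6 and 6, matching the fixed code above.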

Upvotes: 4
