Nik Kovac

Reputation: 195

How to use SSE to process an array of ints using a condition

I'm new to SSE and my knowledge is limited. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple. I have an array of unsigned ints, and I only check for elements that are >= or <= some constant. As a result, I need an array with the elements that passed the condition. I'm thinking of using '_mm_cmpge_ps' as a mask, but this intrinsic works on floats, not ints!? :(

Any suggestions or help would be very much appreciated.

Upvotes: 0

Views: 1105

Answers (3)

Paul R

Reputation: 212969

It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.

#include <emmintrin.h>    // SSE2 intrinsics

for (int i = 0; i < N; i += 4)
{
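    __m128i v = _mm_load_si128((const __m128i *)&a[i]);               // load 4 ints (aligned)
    __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));  // v >= MIN_VAL (signed compare)
    __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));  // v <= MAX_VAL (signed compare)
    __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);                       // all ones where both hold
    v = _mm_and_si128(v, vcmp);                                       // zero the non-matching lanes
    _mm_store_si128((__m128i *)&a[i], v);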
}

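Note that a needs to be 16-byte aligned and N needs to be a multiple of 4. If these constraints are a problem, it's not too hard to extend the code to cope with them. The comparisons are also signed, so values above INT_MAX in an unsigned array would need extra handling.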
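One way those constraints could be lifted, as a rough sketch rather than a definitive version: use unaligned loads and stores, and finish the leftover elements with a scalar loop. The function name mask_out_of_range and its signature are made up for illustration. It also assumes the data is unsigned (as in the question) and flips the sign bit of both operands so that the signed SSE2 compares give unsigned ordering.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the answer above): zero every element of
// a[0..n) that lies outside the inclusive range [min_val, max_val].
void mask_out_of_range(std::uint32_t* a, std::size_t n,
                       std::uint32_t min_val, std::uint32_t max_val) {
    // SSE2 only offers signed 32-bit compares. XORing both operands with
    // 0x80000000 maps unsigned ordering onto signed ordering.
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    const __m128i vmin = _mm_xor_si128(_mm_set1_epi32(static_cast<int>(min_val)), bias);
    const __m128i vmax = _mm_xor_si128(_mm_set1_epi32(static_cast<int>(max_val)), bias);

    std::size_t i = 0;
    // Vector part: 4 elements per iteration, no alignment requirement.
    for (; i + 4 <= n; i += 4) {
        __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_xor_si128(v, bias);
        __m128i below   = _mm_cmplt_epi32(vb, vmin);   // v < min_val
        __m128i above   = _mm_cmpgt_epi32(vb, vmax);   // v > max_val
        __m128i outside = _mm_or_si128(below, above);
        v = _mm_andnot_si128(outside, v);              // (~outside) & v
        _mm_storeu_si128(reinterpret_cast<__m128i*>(a + i), v);
    }
    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        if (a[i] < min_val || a[i] > max_val) a[i] = 0;
    }
}
```

Comparing against the bounds directly (instead of MIN_VAL - 1 and MAX_VAL + 1) also sidesteps overflow when a bound sits at the edge of the type's range.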

Upvotes: 2

user2088790

Reputation:

Here you go. Here are three functions.

The first function, foo_v1, is based on Paul R's answer.

The second function, foo_v2, is based on a popular question from today, "Fastest way to determine if an integer is between two integers (inclusive) with known sets of values".

The third function, foo_v3, uses Agner Fog's vectorclass, which I added only to show how much easier and cleaner it is to use his class. If you don't have the class, just comment out the #include "vectorclass.h" line and the foo_v3 function. I used Vec8ui, which means it will use AVX2 if available and break the vector into two Vec4ui otherwise, so you don't have to change your code to get the benefit of AVX2.

#include <stdio.h>
#include <nmmintrin.h>                 // SSE4.2 header; also provides SSE4.1 (_mm_min_epu32)
#include "vectorclass.h"

void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
        __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
        __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    // if ((unsigned)(number-lower) <= (upper-lower))
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
        __m128i min_ab = _mm_min_epu32(dv,_mm_set1_epi32(MAX_VAL-MIN_VAL));
        __m128i vcmp = _mm_cmpeq_epi32(dv,min_ab);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    // if ((unsigned)(number-lower) <= (upper-lower))
    for (int i = 0; i < N; i += 8) {
        Vec8ui va = Vec8ui().load(&a[i]);
        va &= (va - MIN_VAL) <= (MAX_VAL-MIN_VAL);
        va.store(&a[i]);
    }
}

int main() {
    const int N = 16;
    int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
    for(int i=0; i<N; i++) {
        a[i] = i;
    }
    foo_v2(N, a, 7, 3);
    for(int i=0; i<N; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    _mm_free(a);
}
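For reference, the scalar trick that foo_v2 vectorizes can be sketched like this. The helper name in_range is hypothetical, and the sketch assumes lower <= upper.

```cpp
#include <cstdint>

// Hypothetical helper: true when lower <= x <= upper (assumes lower <= upper).
inline bool in_range(std::uint32_t x, std::uint32_t lower, std::uint32_t upper) {
    // If x < lower, the unsigned subtraction wraps around to a very large
    // value, so one unsigned comparison rejects both x < lower and x > upper.
    return (x - lower) <= (upper - lower);
}
```

With lower = 3 and upper = 7, as in main above, it accepts exactly 3 through 7, so the test program should print 0 0 0 3 4 5 6 7 0 0 0 0 0 0 0 0.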

Upvotes: 1

Kevin MOLCARD

Reputation: 2218

The first place to look might be the Intel® Intrinsics Guide.

Upvotes: 0
