Reputation: 195
I'm new to SSE, and limited in knowledge. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple. I have an array of unsigned ints, and I only check for elements that are >=, or <= than some constant. As result, I need an array with elements that passed condition. I'm thinking to use 'mm_cmpge_ps' as a mask, but this construct work over floats not ints!? :(
any suggestion, help is very much appreciated.
Upvotes: 0
Views: 1105
Reputation: 212969
It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.
#include <emmintrin.h> // SSE2 intrinsics
for (int i = 0; i < N; i += 4)
{
__m128i v = _mm_load_si128(&a[i]);
__m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
__m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
__m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
v = _mm_and_si128(v, vcmp);
_mm_store_si128(&a[i], v);
}
Note that a
needs to be 16 byte aligned and N
needs to be a multiple of 4 - if these constraints are a problem then it's not too hard to extend the code to cope with this.
Upvotes: 2
Reputation:
Here you go. Here are three functions.
The first function,foo_v1
, is based on Paul R's answer.
The second function,foo_v2
, is based on a popular question today Fastest way to determine if an integer is between two integers (inclusive) with known sets of values
The third function, foo_v3
uses Agner Fog's vectorclass which I added only to show how much easier and cleaner it is to use his class. If you don't have the class then just comment out the #include "vectorclass.h"
line and the foo_v3
function. I used Vec8ui
which means it will use AVX2 if available and break it into two Vec4ui otherwise so you don't have to change your code to get the benefit of AVX2.
#include <stdio.h>
#include <nmmintrin.h> // SSE4.2
#include "vectorclass.h"
void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
for (int i = 0; i < N; i += 4) {
__m128i v = _mm_load_si128((const __m128i*)&a[i]);
__m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
__m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
__m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
v = _mm_and_si128(v, vcmp);
_mm_store_si128((__m128i*)&a[i], v);
}
}
void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
//if ((unsigned)(number-lower) < (upper-lower))
for (int i = 0; i < N; i += 4) {
__m128i v = _mm_load_si128((const __m128i*)&a[i]);
__m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
__m128i min_ab = _mm_min_epu32(dv,_mm_set1_epi32(MAX_VAL-MIN_VAL));
__m128i vcmp = _mm_cmpeq_epi32(dv,min_ab);
v = _mm_and_si128(v, vcmp);
_mm_store_si128((__m128i*)&a[i], v);
}
}
void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
//if ((unsigned)(number-lower) < (upper-lower))
for (int i = 0; i < N; i += 8) {
Vec8ui va = Vec8ui().load(&a[i]);
va &= (va - MIN_VAL) <= (MAX_VAL-MIN_VAL);
va.store(&a[i]);
}
}
int main() {
const int N = 16;
int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
for(int i=0; i<N; i++) {
a[i] = i;
}
foo_v2(N, a, 7, 3);
for(int i=0; i<N; i++) {
printf("%d ", a[i]);
} printf("\n");
_mm_free(a);
}
Upvotes: 1