Implementation and performance of using bitsets with SSE

Question

I am trying to speed up my method using SSE (On Visual Studio). I am a novice in the area. The main data types I work with in my method are bitsets of size 32 and the logical operation I mainly use is the AND operation (with _BitScanForward scarcely used). I was wondering if SSE instructions can be used to speed up my procedures.

This is how I am doing right now (I am completely done and cannot compare results directly):

I load the operands (bitsets) using _mm_set_ps. I use the to_ulong() on bitsets to convert them to unsigned long integers:

__m128 v1 = _mm_set_ps(b1.to_ulong(),b2.to_ulong(),b3.to_ulong(),b4.to_ulong());
__m128 v2 = _mm_set1_ps(b.to_ulong())

This is followed by the actual AND operation:

__m128 v3 = _mm_and_ps(v1,v2);

At this point, I have two questions:

Is the way I am doing it (converting bitsets to unsigned long integers using to_ulong()) a good way to do it? I suspect that there is a large overhead that may kill the potential performance improvement I may get out of using SSE.
What is the best way to store v3 back on memory in the shape of 4 bitsets? I am planning to use the _mm_storeu_ps intrinsic.

Paul R · Accepted Answer

A couple of things:

if your bit sets are basically 32 bit ints then you should be using a suitable integer SIMD type, i.e. __m128i, not floating point (__m128)
_mm_set_XXX macros are relatively expensive - unlike regular SSE intrinsics they can generate quite a few instructions - if all you are doing is one AND operation then any performance benefit from the _mm_and_XXX operation will be more than wiped out by the cost of the _mm_set_XXX operations

Ideally if you just want to AND a bunch of bit sets in arrays then the code should look something like this:

const int N = 1024;

int32_t b1[N]; // 2 x arrays of input bit sets
int32_t b2[N];
int32_t b3[N]; // 1 x array of output bit sets

for (int i = 0; i < N; i += 4)
{
    __m128i v1 = _mm_loadu_si128(&b1[i]); // load input bits sets
    __m128i v2 = _mm_loadu_si128(&b2[i]);
    __m128i v3 = _mm_and_si128(v1, v2);   // do the bitwise AND
    _mm_storeu_si128(&b3[i], v3);         // store the result
}

If you just want to AND an array in-place with a fixed mask then it would simplify to this:

const int N = 1024;

int32_t b1[N]; // input/output array of bit sets

const __m128i v2 = _mm_set1_epi32(0x12345678); // mask

for (int i = 0; i < N; i += 4)
{
    __m128i v1 = _mm_loadu_si128(&b1[i]); // load input bits sets
    __m128i v3 = _mm_and_si128(v1, v2);   // do the bitwise AND
    _mm_storeu_si128(&b1[i], v3);         // store the result
}

Note: for better performance make sure your input/output arrays are 16 byte aligned and then use _mm_load_si128/_mm_store_si128 rather than their unaligned counterparts as above.

Implementation and performance of using bitsets with SSE

Answers (1)

Related Questions