Nick

Reputation: 180

Optimizing a logic AND operation for x86

I am trying to optimize an algorithm that masks an array. The initial code looks like this:

void mask(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *m)
{
    unsigned int rep=size_x*size_y;
    while (rep--)
    {            
        *(source++) &= *(m++);
    }
}

I have tried loop unrolling plus prefetching:

void mask_LU4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{                             // in place
    unsigned int rep = (size_x * size_y) / 4;
    while (rep--)
    {
        _mm_prefetch((const char *)&source[16], _MM_HINT_T0); // one cache line (64 B) ahead
        _mm_prefetch((const char *)&mask[16], _MM_HINT_T0);
        source[0] &= mask[0];
        source[1] &= mask[1];
        source[2] &= mask[2];
        source[3] &= mask[3];
        source += 4;
        mask += 4;
    }
}

and to use SSE intrinsics:

void fmask_SIMD(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{                             // in place; assumes 16-byte-aligned pointers
    __m128i *s = (__m128i *) source;
    __m128i *m = (__m128i *) mask;
    unsigned int rep = (size_x * size_y) / 4;
    while (rep--)
    {
        *s = _mm_and_si128(*s, *m);
        s++;
        m++;
    }
}

However, the performance is the same. I have tried software prefetching in both the SIMD and the loop-unrolled versions, but I couldn't see any improvement. Any ideas on how I could optimize this algorithm?

P.S.1: I am using gcc 4.8.1 and I compile with -march=native and -Ofast.

P.S.2: I am using an Intel Core i5-3470 @ 3.2 GHz (Ivy Bridge); L1 DCache 4×32 KB (8-way), L2 4×256 KB, L3 6 MB, RAM DDR3 4 GB (dual channel, DRAM @ 798.1 MHz).

Upvotes: 2

Views: 163

Answers (2)

kmargar

Reputation: 1

Your problem might be memory bound, but that does not mean you cannot process more per cycle. Usually when you have a low-payload operation (like yours here; it's only an AND, after all), it makes sense to combine many loads and stores. In most CPUs, most loads will be combined by the L2 cache into a single cache-line load (especially if they're consecutive). I'd suggest increasing the loop unrolling to at least 4 SIMD packets here, along with the prefetching, as in the sketch below. You'd still be memory bound, but you would get fewer cache misses, giving you slightly better performance.
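
For illustration, a minimal sketch of that combination (the function name is mine; it assumes 16-byte-aligned pointers and an element count divisible by 16):

#include <stdint.h>
#include <emmintrin.h>

void mask_SIMD_LU4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{                             // in place; 4 SIMD packets (16 uint32_t) per iteration
    __m128i *s = (__m128i *) source;
    __m128i *m = (__m128i *) mask;
    unsigned int rep = (size_x * size_y) / 16;
    while (rep--)
    {
        _mm_prefetch((const char *)(s + 16), _MM_HINT_T0);  // 256 bytes (4 cache lines) ahead
        _mm_prefetch((const char *)(m + 16), _MM_HINT_T0);
        s[0] = _mm_and_si128(s[0], m[0]);
        s[1] = _mm_and_si128(s[1], m[1]);
        s[2] = _mm_and_si128(s[2], m[2]);
        s[3] = _mm_and_si128(s[3], m[3]);
        s += 4;
        m += 4;
    }
}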

Upvotes: -1

Z boson

Reputation: 33679

Your operation is memory bandwidth bound. However, that does not necessarily mean your operation is achieving the maximum memory bandwidth. To get closer to the maximum memory bandwidth you need to use multiple threads. Using OpenMP (add -fopenmp to GCC's options) you can do this:

#pragma omp parallel for
for(int i=0; i<rep; i++) { source[i] &= m[i]; }

If you wanted to leave source unmodified and write to a different destination, you could use streaming store instructions like this:

#pragma omp parallel for
for(int i=0; i<rep/4; i++) {
    __m128i m4 = _mm_load_si128((__m128i*)&m[4*i]);
    __m128i s4 = _mm_load_si128((__m128i*)&source[4*i]);
    s4 = _mm_and_si128(s4,m4);
    _mm_stream_si128((__m128i*)&dest[4*i], s4);
}

This would not be any faster than using the same destination and source. However, if you already planned to use a destination not equal to the source, this would likely be faster (for sufficiently large rep) than using _mm_store_si128.
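
Note that streaming stores are weakly ordered: if another thread reads dest afterwards, an _mm_sfence() after the loop is needed to guarantee the streamed data is globally visible.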

Upvotes: 2
