Royi

Reputation: 4953

Loop Vectorization 001

I have a Vectorization optimization problem.

I have an array of structs, pDst, where each struct has three fields: 'red', 'green' and 'blue'.
The field type may be 'char', 'short' or 'float'. This is given and cannot be altered.
There is another array, pSrc, which represents an image [RGB] - namely an array of 3 pointers, each of which points to one plane of the image.
Each plane is allocated independently as an IPP plane-oriented image (namely, each plane is formed independently - 'ippiMalloc_32f_C1'): http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch3/functn_Malloc.html.

We would like to copy it as described in the following code:

for(int y = 0; y < imageHeight; ++y)
{
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red     = pSrc[0][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green   = pSrc[1][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue    = pSrc[2][x + y * pSrcRowStep];
    }
} 

Yet, in this form the compiler can't vectorize the code.
At first it says:

"loop was not vectorized: existence of vector dependence."

When I use #pragma ivdep to help the compiler (since there is no dependence), I get the following message:

"loop was not vectorized: dereference too complex."

Does anyone have an idea how to enable vectorization?
I use Intel Compiler 13.0.
Thanks.

Update:

If I edit the code as follows:

Ipp32f *redChannel   = pSrc[0];
Ipp32f *greenChannel = pSrc[1];
Ipp32f *blueChannel  = pSrc[2];
for(int y = 0; y < imageHeight; ++y)
{
    #pragma ivdep
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red     = redChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green   = greenChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue    = blueChannel[x + y * pSrcRowStep];
    }
}

For output types of 'char' and 'short' the loop is vectorized.
Yet for type 'float' it is not.
Instead I get the following message:

loop was not vectorized: vectorization possible but seems inefficient.

How could that be?

Upvotes: 1

Views: 595

Answers (2)

Anoop - Intel

Reputation: 354

In the following code, #pragma ivdep does indeed make the compiler ignore the assumed vector dependence, but its heuristics/cost analysis concluded that vectorizing the loop would not be efficient:

Ipp32f *redChannel   = pSrc[0];
Ipp32f *greenChannel = pSrc[1];
Ipp32f *blueChannel  = pSrc[2];
for(int y = 0; y < imageHeight; ++y)
{
    #pragma ivdep
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red     = redChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green   = greenChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue    = blueChannel[x + y * pSrcRowStep];
    }
}

The vectorization will be inefficient since the operation copies contiguous blocks of memory from the source to non-contiguous memory locations at the destination; in other words, a scatter is happening here. If you still want to enforce vectorization and see whether there is any performance improvement over the non-vectorized version, use #pragma simd instead of #pragma ivdep, as shown below:

#include <ipp.h>

struct Dest {
    float red;
    float green;
    float blue;
};

void foo(Dest *pDst, Ipp32f **pSrc, int imageHeight, int imageWidth, int pSrcRowStep, int pDstRowStep){
    Ipp32f *redChannel   = pSrc[0];
    Ipp32f *greenChannel = pSrc[1];
    Ipp32f *blueChannel  = pSrc[2];
    for(int y = 0; y < imageHeight; ++y)
    {
        #pragma simd
        for(int x = 0; x < imageWidth; ++x)
        {
            pDst[x + y * pDstRowStep].red     = redChannel[x + y * pSrcRowStep];
            pDst[x + y * pDstRowStep].green   = greenChannel[x + y * pSrcRowStep];
            pDst[x + y * pDstRowStep].blue    = blueChannel[x + y * pSrcRowStep];
        }
    }
    return;
}

The corresponding vectorization report is:

$ icpc -c test.cc -vec-report2
test.cc(14): (col. 9) remark: SIMD LOOP WAS VECTORIZED
test.cc(11): (col. 5) remark: loop was not vectorized: not inner loop

More documentation on pragma simd is available at https://software.intel.com/en-us/node/514582.

Upvotes: 1

CAFxX

Reputation: 30301

Something along these lines should work (char version, untested; also keep in mind that the __m128i pointers must be properly aligned!):

#include <emmintrin.h> // SSE2 intrinsics

void interleave_16px_to_rgb0(__m128i *red, __m128i *green, __m128i *blue, __m128i *dest) {
  __m128i zero = _mm_setzero_si128();
  // Byte-interleave red with green and blue with zero:
  // rg_lo = r0 g0 r1 g1 ... r7 g7, bz_lo = b0 0 b1 0 ... b7 0, etc.
  __m128i rg_lo = _mm_unpacklo_epi8(*red, *green);
  __m128i rg_hi = _mm_unpackhi_epi8(*red, *green);
  __m128i bz_lo = _mm_unpacklo_epi8(*blue, zero);
  __m128i bz_hi = _mm_unpackhi_epi8(*blue, zero);
  // Interleave the 16-bit pairs, low halves first, so the pixels land in order:
  dest[0] = _mm_unpacklo_epi16(rg_lo, bz_lo); // r0 g0 b0 0 ... r3 g3 b3 0
  dest[1] = _mm_unpackhi_epi16(rg_lo, bz_lo); // r4 g4 b4 0 ... r7 g7 b7 0
  dest[2] = _mm_unpacklo_epi16(rg_hi, bz_hi); // r8 g8 b8 0 ... r11 g11 b11 0
  dest[3] = _mm_unpackhi_epi16(rg_hi, bz_hi); // r12 g12 b12 0 ... r15 g15 b15 0
}

This will take 16 bytes from each plane:

r0 r1 r2 ... r15
g0 g1 g2 ... g15
b0 b1 b2 ... b15

and interleave them like so, writing out 16x4 bytes starting from *dest:

r0 g0 b0 0 r1 g1 b1 0 r2 g2 b2 0 ... r15 g15 b15 0

It goes without saying that you can use the same family of functions to interleave other data types as well.
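For the float case from the question, the analogous sequence would use the _ps unpack/move intrinsics. A minimal sketch (the helper name and the use of unaligned loads/stores are my own choices, not from the answer; four float pixels are handled per call since __m128 holds four floats):

```cpp
#include <xmmintrin.h> // SSE

// Interleave 4 pixels from three float planes into r g b 0 order.
void interleave_4px_to_rgb0_ps(const float *red, const float *green,
                               const float *blue, float *dest) {
  __m128 r = _mm_loadu_ps(red);            // r0 r1 r2 r3
  __m128 g = _mm_loadu_ps(green);          // g0 g1 g2 g3
  __m128 b = _mm_loadu_ps(blue);           // b0 b1 b2 b3
  __m128 zero = _mm_setzero_ps();
  __m128 rg_lo = _mm_unpacklo_ps(r, g);    // r0 g0 r1 g1
  __m128 rg_hi = _mm_unpackhi_ps(r, g);    // r2 g2 r3 g3
  __m128 bz_lo = _mm_unpacklo_ps(b, zero); // b0 0 b1 0
  __m128 bz_hi = _mm_unpackhi_ps(b, zero); // b2 0 b3 0
  _mm_storeu_ps(dest + 0,  _mm_movelh_ps(rg_lo, bz_lo)); // r0 g0 b0 0
  _mm_storeu_ps(dest + 4,  _mm_movehl_ps(bz_lo, rg_lo)); // r1 g1 b1 0
  _mm_storeu_ps(dest + 8,  _mm_movelh_ps(rg_hi, bz_hi)); // r2 g2 b2 0
  _mm_storeu_ps(dest + 12, _mm_movehl_ps(bz_hi, rg_hi)); // r3 g3 b3 0
}
```

Using unaligned loads/stores sidesteps the alignment caveat above at a small cost; with guaranteed 16-byte alignment you could switch to _mm_load_ps/_mm_store_ps.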


Update: better yet, since you already have IPP, you should try to use what it provides instead of reinventing the wheel. From a quick check, it appears that ippiCopy_8u_P3C3R or ippiCopy_8u_P4C4R is what you are looking for.
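Since the planes in the question are 32-bit float, the planar-to-packed copy would presumably be the _32f_ variant. A minimal sketch, assuming IPP is installed and that ippiCopy_32f_P3C3R follows the usual P3C3R convention (planar sources, packed destination, steps in bytes; the wrapper name and parameter names here are hypothetical):

```cpp
#include <ipp.h>

// Hypothetical usage sketch: copy three 32f planes into packed r g b triplets.
// pSrc, srcStepBytes, pPacked, dstStepBytes and roi are assumed to be set up
// by the caller as in the question.
IppStatus copyPlanarToPacked(const Ipp32f *pSrc[3], int srcStepBytes,
                             Ipp32f *pPacked, int dstStepBytes,
                             IppiSize roi) {
    return ippiCopy_32f_P3C3R(pSrc, srcStepBytes, pPacked, dstStepBytes, roi);
}
```

Note that the C3 variant writes packed r g b with no fourth component, which matches the question's three-field struct (assuming the struct has no padding).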

Upvotes: 1
