SinisterMJ

Reputation: 3509

Bit shift whole block of memory efficiently

I have the following code, which runs after performing a Sobel operation:

short* tempBufferVert = new short[width * height];
ippiFilterSobelVertBorder_8u16s_C1R(pImg, width, tempBufferVert, width * 2, dstSize, IppiMaskSize::ippMskSize3x3, IppiBorderType::ippBorderConst, 0, pBufferVert);
for (int i = 0; i < width * height; i++)
    tempBufferVert[i] >>= 2;

The frustrating thing is that the bit shift is the longest-running operation of them all; the IPP Sobel is so optimized it runs faster than my stupid bit shift. How can I optimize the bit shift, or are there IPP or other options (AVX?) to perform a bit shift on the whole block of memory (but preserve the sign of the short, which >>= does in the Visual Studio implementation)?

Upvotes: 0

Views: 1867

Answers (2)

Richard Hodges

Reputation: 69902

C++ optimisers perform a lot better with iterator-based loops than with indexing loops.

This is because the compiler can make assumptions about how address arithmetic behaves at overflow. To make the same assumptions when using an index into an array, you would have to happen to pick the correct datatype for the index by luck.

The shift code can be expressed as:

// Arithmetic right-shift every element in [first, last) by 'bits'.
void shift(short* first, short* last, int bits)
{
  while (first != last) {
    *first++ >>= bits;
  }
}

int test(int width, int height)
{
  short* tempBufferVert = new short[width * height];
  shift(tempBufferVert, tempBufferVert + (width * height), 2);
  return 0;
}

With the correct optimisations enabled, this will be vectorised: https://godbolt.org/g/oJ8Boj

Note how the middle of the loop becomes:

.L76:
        vmovdqa ymm0, YMMWORD PTR [r9+rdx]
        add     r8, 1
        vpsraw  ymm0, ymm0, 2
        vmovdqa YMMWORD PTR [r9+rdx], ymm0
        add     rdx, 32
        cmp     rsi, r8
        ja      .L76
        lea     rax, [rax+rdi*2]
        cmp     rcx, rdi
        je      .L127
        vzeroupper

Upvotes: 1

Paul R

Reputation: 213060

Firstly, make sure you are compiling with optimisation enabled (e.g. -O3), and then check whether your compiler is auto-vectorizing the right-shift loop. If it's not, then you can probably get a significant improvement with SSE:

#include <emmintrin.h> // SSE2

for (int i = 0; i < width * height; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // v >>= 2
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}

(Note: assumes width*height is a multiple of 8.)
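If width * height isn't guaranteed to be a multiple of 8, you could do something like this (untested sketch): vectorise the bulk of the array and finish the leftover elements with a plain scalar loop:

#include <emmintrin.h> // SSE2

int n = width * height;
int i = 0;

// Process 8 shorts per iteration while at least 8 elements remain.
for (; i <= n - 8; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // arithmetic shift preserves the sign
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}

// Scalar clean-up for the 0..7 remaining elements.
for (; i < n; i++)
    tempBufferVert[i] >>= 2;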

You can probably do even better with some loop unrolling and/or using AVX2, but this may be enough for your needs as it stands.
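For example, an AVX2 version might look something like this (untested sketch; assumes width * height is a multiple of 16 and that AVX2 is available), processing 16 shorts per iteration:

#include <immintrin.h> // AVX2

for (int i = 0; i < width * height; i += 16)
{
    __m256i v = _mm256_loadu_si256((__m256i *)&tempBufferVert[i]);
    v = _mm256_srai_epi16(v, 2); // arithmetic shift: sign of each short is preserved
    _mm256_storeu_si256((__m256i *)&tempBufferVert[i], v);
}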

Upvotes: 1
