SinisterMJ

Reputation: 3509

Bit shift whole block of memory efficiently

I have the following code, which runs after performing a Sobel operation:

short* tempBufferVert = new short[width * height];
ippiFilterSobelVertBorder_8u16s_C1R(pImg, width, tempBufferVert, width * 2, dstSize, IppiMaskSize::ippMskSize3x3, IppiBorderType::ippBorderConst, 0, pBufferVert);
for (int i = 0; i < width * height; i++)
    tempBufferVert[i] >>= 2;

The frustrating thing is that the bit shift is the longest-running operation of them all; the IPP Sobel is so optimized it runs faster than my stupid bit shift. How can I optimize the bit shift, or are there IPP or other options (AVX?) to perform a bit shift on the whole block of memory (but preserve the sign of the short, which >>= does in the Visual Studio implementation)?

Upvotes: 0

Views: 1867

Answers (2)

Richard Hodges

Reputation: 69902

C++ optimisers perform a lot better with iterator-based loops than with indexing loops.

This is because the compiler can make assumptions about how address arithmetic behaves at overflow. To make the same assumptions when using an index into an array, you would have to happen to pick the correct datatype for the index by luck.

The shift code can be expressed as:

// Arithmetic right-shift every element in [first, last) by 'bits'.
void shift(short* first, short* last, int bits)
{
  while (first != last) {
    *first++ >>= bits;
  }
}

int test(int width, int height)
{
  short* tempBufferVert = new short[width * height];
  shift(tempBufferVert, tempBufferVert + (width * height), 2);
  return 0;
}

With the correct optimisations enabled, this will be vectorised: https://godbolt.org/g/oJ8Boj

Note how the middle of the loop becomes:

.L76:
        vmovdqa ymm0, YMMWORD PTR [r9+rdx]
        add     r8, 1
        vpsraw  ymm0, ymm0, 2
        vmovdqa YMMWORD PTR [r9+rdx], ymm0
        add     rdx, 32
        cmp     rsi, r8
        ja      .L76
        lea     rax, [rax+rdi*2]
        cmp     rcx, rdi
        je      .L127
        vzeroupper

Upvotes: 1

Paul R

Reputation: 213060

Firstly, make sure you are compiling with optimisation enabled (e.g. -O3), and then check whether your compiler is auto-vectorizing the right-shift loop. If it's not, then you can probably get a significant improvement with SSE:

#include <emmintrin.h> // SSE2

for (int i = 0; i < width * height; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // v >>= 2
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}

(Note: assumes width*height is a multiple of 8.)
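If width * height isn't guaranteed to be a multiple of 8, you could do something like this (untested sketch): vectorise the bulk of the array and finish the leftover elements with a plain scalar loop:

#include <emmintrin.h> // SSE2

int n = width * height;
int i = 0;

// Process 8 shorts per iteration while at least 8 elements remain.
for (; i <= n - 8; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // arithmetic shift preserves the sign
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}

// Scalar clean-up for the 0..7 remaining elements.
for (; i < n; i++)
    tempBufferVert[i] >>= 2;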

You can probably do even better with some loop unrolling and/or using AVX2, but this may be enough for your needs as it stands.
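For example, an AVX2 version might look something like this (untested sketch; assumes width * height is a multiple of 16 and that AVX2 is available), processing 16 shorts per iteration:

#include <immintrin.h> // AVX2

for (int i = 0; i < width * height; i += 16)
{
    __m256i v = _mm256_loadu_si256((__m256i *)&tempBufferVert[i]);
    v = _mm256_srai_epi16(v, 2); // arithmetic shift: sign of each short is preserved
    _mm256_storeu_si256((__m256i *)&tempBufferVert[i], v);
}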

Upvotes: 1
