Reputation: 3509
I have the following code, which runs after performing a Sobel operation:
short* tempBufferVert = new short[width * height];
ippiFilterSobelVertBorder_8u16s_C1R(pImg, width, tempBufferVert, width * 2, dstSize, IppiMaskSize::ippMskSize3x3, IppiBorderType::ippBorderConst, 0, pBufferVert);
for (int i = 0; i < width * height; i++)
    tempBufferVert[i] >>= 2;
The frustrating thing is that the bit shift is the longest-running operation of the lot; the IPP Sobel is so well optimised that it runs faster than my simple bit shift. How can I optimise the bit shift, or are there IPP or other options (AVX?) to right-shift a whole block of memory while preserving the sign of the shorts (which >>= does in the Visual Studio implementation)?
Upvotes: 0
Views: 1867
Reputation: 69902
C++ optimisers perform a lot better with iterator-based loops than with indexing loops.
This is because the compiler can make assumptions about how the address arithmetic behaves when the index would overflow. To make the same assumptions with an index into an array, you would have to happen to pick the correct datatype for the index.
The shift code can be expressed as:
void shift(short* first, short* last, int bits)
{
    while (first != last) {
        *first++ >>= bits;
    }
}

void test(int width, int height)
{
    short* tempBufferVert = new short[width * height];
    shift(tempBufferVert, tempBufferVert + (width * height), 2);
    delete[] tempBufferVert;   // don't leak the buffer
}
Which will (with correct optimisations enabled) be vectorised: https://godbolt.org/g/oJ8Boj
Note how the middle of the loop becomes a vpsraw, a packed arithmetic right shift that preserves the sign of each short:
.L76:
vmovdqa ymm0, YMMWORD PTR [r9+rdx]
add r8, 1
vpsraw ymm0, ymm0, 2
vmovdqa YMMWORD PTR [r9+rdx], ymm0
add rdx, 32
cmp rsi, r8
ja .L76
lea rax, [rax+rdi*2]
cmp rcx, rdi
je .L127
vzeroupper
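For comparison, an indexed version with a pointer-sized index type (the "correct datatype" mentioned above) would look something like the sketch below; whether it vectorises as nicely depends on your compiler, so check the generated code:
#include <cstddef>

void shift_indexed(short* buf, std::size_t count, int bits)
{
    // On typical targets std::size_t matches the pointer width, so the
    // subscript arithmetic maps directly onto the address arithmetic
    // that the optimiser reasons about.
    for (std::size_t i = 0; i < count; ++i) {
        buf[i] >>= bits;
    }
}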
Upvotes: 1
Reputation: 213060
Firstly make sure you are compiling with optimisation enabled (e.g. -O3), and then check whether your compiler is auto-vectorizing the right shift loop. If it's not then you can probably get a significant improvement with SSE:
#include <emmintrin.h> // SSE2
for (int i = 0; i < width * height; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // v >>= 2
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}
(Note: this assumes width*height is a multiple of 8.)
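If it isn't, one option is to do the bulk of the work with the SSE loop and finish the last few elements with a scalar tail, e.g. something like:
int n = width * height;
int i = 0;
for (; i + 8 <= n; i += 8)          // vector part: 8 shorts per iteration
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2);
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}
for (; i < n; ++i)                  // scalar tail: remaining 0..7 elements
    tempBufferVert[i] >>= 2;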
You can probably do even better with some loop unrolling and/or using AVX2, but this may be enough for your needs as it stands.
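For reference, an AVX2 variant of the same idea might look roughly like this (it assumes AVX2 support and the corresponding compiler flag, e.g. -mavx2 or /arch:AVX2):
#include <immintrin.h> // AVX2

int n = width * height;
int i = 0;
for (; i + 16 <= n; i += 16)        // 16 shorts per 256-bit register
{
    __m256i v = _mm256_loadu_si256((__m256i *)&tempBufferVert[i]);
    v = _mm256_srai_epi16(v, 2);    // arithmetic shift keeps the sign
    _mm256_storeu_si256((__m256i *)&tempBufferVert[i], v);
}
for (; i < n; ++i)                  // scalar tail
    tempBufferVert[i] >>= 2;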
Upvotes: 1