Reputation: 862
My own implementation came back to bite me when I tried to optimize the following with SSE4:
std::distance(byteptr, std::mismatch(byteptr, byteptr + lenght, dataptr).first)
This compares the bytes at byteptr with those at dataptr and returns the index of the first mismatching byte. I really do need the raw speed: I am processing so much memory that RAM bandwidth is already a bottleneck. Fetching and comparing 16 bytes at a time with SSE4 should give a speed boost over comparing one byte at a time.
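For reference, the scalar expression above can be wrapped in a small function that serves as a baseline to check any SIMD version against. This is only a sketch; the name mismatch_pos_scalar is hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Scalar baseline (hypothetical helper name): index of the first byte
// where a and b differ, or len if the first len bytes are identical.
std::size_t mismatch_pos_scalar(const char* a, const char* b, std::size_t len) {
    return static_cast<std::size_t>(
        std::distance(a, std::mismatch(a, a + len, b).first));
}
```

Comparing the result of the SSE function against this baseline on random buffers, including buffers that contain zero bytes, is one way to narrow down which inputs fail.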
Here is my current code, which I could not get working. It uses GCC SSE intrinsics and needs SSE4.2:
// define SIMD 128-bit type of bytes.
typedef char v128i __attribute__ ((vector_size(16)));

// mask of four low bits set.
const uintptr_t aligned_16_imask = (uintptr_t)15;
// mask of four low bits unset.
const uintptr_t aligned_16_mask = ~aligned_16_imask;

inline unsigned int cmp_16b_sse4(v128i *a, v128i *b) {
    return __builtin_ia32_pcmpistri128(__builtin_ia32_lddqu((char*)a), *b, 0x18);
}

size_t memcmp_pos(const char * ptr1, const char * ptr2, size_t lenght)
{
    size_t nro = 0;
    size_t cmpsz;
    size_t alignlen = lenght & aligned_16_mask;

    // process 16-bytes at time.
    while(nro < alignlen) {
        cmpsz = cmp_16b_sse4((v128i*)ptr1, (v128i*)ptr2);
        ptr1 += cmpsz;
        ptr2 += cmpsz;
        nro += cmpsz;
        // if compare failed return now.
        if(cmpsz < 16)
            return nro;
        if(cmpsz != 16)
            break;
    }

    // process remainder 15 bytes:
    while( *ptr1 == *ptr2 && nro < lenght) {
        ++nro;
        ++ptr1;
        ++ptr2;
    }
    return nro;
}
When I test the above function, it works most of the time, but in some cases it fails.
Upvotes: 1
Views: 138
Reputation: 29042
One known problem with pcmpistri is that it always reads the full 16 bytes, even beyond the end of the variable. This becomes a problem at a page boundary, where allocated memory borders unallocated memory. See here (scroll down to "Renat Saifutdinov").
This can be avoided by using only aligned reads of the source, even if unaligned reads are supported; see this SO answer.
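The reasoning behind aligned reads can be sketched with plain address arithmetic: a 16-byte read that starts on a 16-byte boundary cannot straddle two pages, because the page size is a multiple of 16. The helpers below are hypothetical names, and the 4096-byte page size is an assumption (typical on x86, but not guaranteed).

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 4096-byte pages. Query the OS for the real value if it matters.
constexpr std::uintptr_t kAssumedPageSize = 4096;

// True if a read of `width` bytes starting at `addr` touches two pages.
constexpr bool read_crosses_page(std::uintptr_t addr, std::size_t width) {
    return width != 0 &&
           (addr / kAssumedPageSize) != ((addr + width - 1) / kAssumedPageSize);
}

// Round an address down to the previous 16-byte boundary.
constexpr std::uintptr_t align_down_16(std::uintptr_t addr) {
    return addr & ~static_cast<std::uintptr_t>(15);
}
```

For example, an unaligned 16-byte read starting 8 bytes before a page end crosses into the next page, while the aligned block containing that address does not.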
This could be one reason why your code fails.
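As a point of comparison, here is a minimal sketch that never reads past the given length. It swaps pcmpistri for the SSE2 byte-compare and movemask intrinsics, so it is not the original approach; the function name memcmp_pos_bounded is hypothetical. Note also that pcmpistri is an implicit-length instruction that treats a zero byte as the end of the string, which may matter if the buffers hold binary data; the compare used here has no such rule.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8
#endif

// Hypothetical name. Returns the index of the first differing byte,
// or len if the first len bytes are equal.
std::size_t memcmp_pos_bounded(const char* p1, const char* p2, std::size_t len) {
    std::size_t i = 0;
#if defined(__SSE2__)
    // Only full 16-byte blocks are loaded, so every read stays inside len.
    for (; len - i >= 16; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p1 + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p2 + i));
        // Bit k of eq is set when byte k of the two blocks is equal.
        unsigned eq = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)));
        if (eq != 0xFFFFu) {
            unsigned diff = ~eq & 0xFFFFu;  // set bits mark mismatching bytes
            std::size_t k = 0;
            while ((diff & 1u) == 0) {      // find the lowest mismatching byte
                diff >>= 1;
                ++k;
            }
            return i + k;
        }
    }
#endif
    // Scalar tail: the bound is checked before either pointer is dereferenced.
    while (i < len && p1[i] == p2[i]) {
        ++i;
    }
    return i;
}
```

The unaligned loads trade some speed for simplicity. Since no load extends past len, the page-boundary issue described above does not arise, as long as both buffers really are len bytes long.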
Upvotes: 2