Result with/without SSE SIMD operation is different

Question

I'm trying to sum all the elements of an array of unsigned char,

but the result of cv::Mat sum is different from my SSE result (code below).

With SSE, the sum comes out bigger than without. Why?

For example, I got 2042115 for the SSE sum, but cv::Mat's sum gives 2041104.

    __m128i srcVal;
    __m128i src16bitlo;
    __m128i src16bithi;
    __m128i src32bitlolo;
    __m128i src32bitlohi;
    __m128i src32bithilo;
    __m128i src32bithihi;
    __m128i vsum = _mm_setzero_si128();
    for (int i = 0; i < nSrcSize; i += 16)
    {

        srcVal = _mm_loadu_si128((__m128i*) (pSrc + i));

        src16bitlo = _mm_unpacklo_epi8(srcVal, _mm_setzero_si128());
        src16bithi = _mm_unpackhi_epi8(srcVal, _mm_setzero_si128());

        src32bitlolo = _mm_unpacklo_epi16(src16bitlo, _mm_setzero_si128());
        src32bitlohi = _mm_unpackhi_epi16(src16bitlo, _mm_setzero_si128());
        src32bithilo = _mm_unpacklo_epi16(src16bithi, _mm_setzero_si128());
        src32bithihi = _mm_unpackhi_epi16(src16bithi, _mm_setzero_si128());

        vsum = _mm_add_epi32(src32bitlolo, vsum);
        vsum = _mm_add_epi32(src32bitlohi, vsum);
        vsum = _mm_add_epi32(src32bithilo, vsum);
        vsum = _mm_add_epi32(src32bithihi, vsum);


        // cout << "sumSrc : " << sumSrc << endl;
    }
    int sumSrc = vsum.m128i_i32[0] + vsum.m128i_i32[1] + vsum.m128i_i32[2] + vsum.m128i_i32[3];
    //int check = sumSrc;

    int remainSize = nSrcSize % 16;
    if (remainSize > 0)
    {
        unsigned char* arrTemp = new unsigned char[16]();  // initialize to 0
        memcpy(arrTemp, pSrc + nSrcSize - remainSize -1, remainSize);
        __m128i srcVal = _mm_loadu_si128((__m128i*)arrTemp);
        vsum = _mm_sad_epu8(srcVal, _mm_setzero_si128());
        sumSrc += vsum.m128i_i16[0] + vsum.m128i_i16[1] + vsum.m128i_i16[2] + vsum.m128i_i16[3] + vsum.m128i_i16[4] + vsum.m128i_i16[5] + vsum.m128i_i16[6] + vsum.m128i_i16[7];
    }

Peter Cordes · Accepted Answer

You have 2 bugs:

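i < nSrcSize can be true when the final vector extends past nSrcSize, so the last iteration reads past the end of the array and also sums the tail bytes that the cleanup code then adds a second time. Since you're already using signed int i, you can use i < nSrcSize - 15 to find the highest i value that can load a full 16 bytes from i+0 to i+15. Or use nSrcSize & -16U as the loop bound if you're using size_t.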

new unsigned char[16] without the trailing () doesn't zero the memory, so you'd be summing some extra garbage; the () form shown in the question does value-initialize the bytes to zero. Either way, you do not need new, and you forgot to delete[] it, so you're leaking that memory! You could use a local array, not dynamically allocating anything.

alignas(16) unsigned char arrTemp[16] = {0};  // implicitly initializes later elements to 0

But that variable-size memcpy is not great for efficiency, and reloading the memcpy result will cause a store-forwarding stall. OTOH, you can just _mm_add_epi32(vsum, cleanup_sad) and only do one horizontal vector sum.
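As a minimal sketch of the fixes above (not a tuned implementation), the function below keeps the question's unpack-and-add loop, bounds it with i < nSrcSize - 15, uses a zeroed local array for the tail, and adds the cleanup psadbw result into vsum so there is only one horizontal sum. The function name sum_bytes_unpack and the lanes array are made up for illustration. It also copies the tail from pSrc + i rather than from the question's pSrc + nSrcSize - remainSize - 1, which appears to start one byte early.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstring>      // std::memcpy

// Hypothetical name: the question's loop with the fixes discussed above applied.
// The 32-bit lanes can overflow for very large inputs; this sketch ignores that.
int sum_bytes_unpack(const unsigned char* pSrc, int nSrcSize)
{
    assert(nSrcSize >= 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i vsum = zero;

    int i = 0;
    // Run only while a full 16-byte load from i+0 to i+15 stays inside the array.
    for (; i < nSrcSize - 15; i += 16)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pSrc + i));
        __m128i lo16 = _mm_unpacklo_epi8(v, zero);   // bytes 0..7  -> 16-bit
        __m128i hi16 = _mm_unpackhi_epi8(v, zero);   // bytes 8..15 -> 16-bit
        vsum = _mm_add_epi32(vsum, _mm_unpacklo_epi16(lo16, zero));
        vsum = _mm_add_epi32(vsum, _mm_unpackhi_epi16(lo16, zero));
        vsum = _mm_add_epi32(vsum, _mm_unpacklo_epi16(hi16, zero));
        vsum = _mm_add_epi32(vsum, _mm_unpackhi_epi16(hi16, zero));
    }

    const int remainSize = nSrcSize - i;   // 0..15 bytes not yet summed
    if (remainSize > 0)
    {
        // Local array: nothing to delete, later elements are zero-initialized.
        alignas(16) unsigned char arrTemp[16] = {0};
        std::memcpy(arrTemp, pSrc + i, remainSize);
        __m128i tail = _mm_load_si128(reinterpret_cast<const __m128i*>(arrTemp));
        // psadbw leaves two small sums in 32-bit lanes 0 and 2; lanes 1 and 3 are zero.
        __m128i cleanup_sad = _mm_sad_epu8(tail, zero);
        vsum = _mm_add_epi32(vsum, cleanup_sad);
    }

    // Single horizontal sum, without the MSVC-only .m128i_i32 member.
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), vsum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

This still pays for the variable-size memcpy and the reload, so it is mainly a correctness fix for the original code.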

Even more efficient might be branching yourself on the size (instead of passing the work to memcpy) and doing an 8-byte and 4-byte chunk with SIMD loads. Or once there are fewer than 8 bytes left, do an 8-byte load that won't cross a cache-line boundary to get it all.

Check if a load that starts at the first byte you want would cross a 64-byte boundary. If no, do it and 64-bit left shift to shift in zeros, which you can safely hsum. If yes, then do a load that ends with the last byte you want, and do a right shift. You have to calculate the shift count as 8 * (8 - bytes_to_keep). You can use a scalar shift and then _mm_cvtsi64_si128 into a SIMD vector, or directly _mm_loadl_epi64 (movq) and use a SIMD shift. (Unfortunately SSE/AVX doesn't have variable-count byte-shifts, only bit-shifts, and the count needs to be in another SIMD vector.)
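The boundary check and shift described above could look roughly like this for a tail of 1 to 7 bytes. This is only a sketch: sum_tail_under8 is a hypothetical helper name, and the trick relies on an 8-byte load that stays inside the cache line of a valid byte not faulting on x86, which is outside what the C++ standard itself guarantees (sanitizers may flag it).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums the n bytes p[0..n-1], for 1 <= n <= 7,
// using one 8-byte load that may touch bytes outside that range.
unsigned sum_tail_under8(const unsigned char* p, std::size_t n)
{
    assert(n >= 1 && n <= 7);
    // Bit-shift count 8 * (8 - bytes_to_keep), placed in a SIMD vector.
    const __m128i count = _mm_cvtsi32_si128(static_cast<int>(8 * (8 - n)));
    const std::uintptr_t offsetInLine = reinterpret_cast<std::uintptr_t>(p) & 63;

    __m128i v;
    if (offsetInLine <= 56)
    {
        // p[0..7] does not cross a 64-byte boundary: the wanted bytes are the low n.
        v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));          // movq
        v = _mm_sll_epi64(v, count);   // drops the high 8-n bytes, shifts in zeros
    }
    else
    {
        // Load 8 bytes ending at the last wanted byte: the wanted bytes are the high n.
        v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p + n - 8));  // movq
        v = _mm_srl_epi64(v, count);   // drops the low 8-n bytes, shifts in zeros
    }

    // psadbw against zero: the byte sum ends up in the low 16 bits of the low qword.
    const __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
    return static_cast<unsigned>(_mm_cvtsi128_si32(sad));
}
```

In a real cleanup path you would probably keep the psadbw result as a vector and add it into the accumulator instead of extracting it here.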

FYI, psadbw against zero will horizontally sum a vector of unsigned char into two qwords much more efficiently than your SIMD loop. See "Fastest way to horizontally sum SSE unsigned byte vector". See also how "How to count character occurrences using SIMD" uses it in the outer loop to accumulate vectors of bytes into a SIMD vector with wider elements.

You're already using psadbw in the cleanup, but you're adding up all eight of the 16-bit elements, even though six of them are zero.
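To illustrate the psadbw point, here is a sketch that uses _mm_sad_epu8 against zero in the main loop and only combines the two 64-bit halves at the end. The name sum_bytes_sad is hypothetical, and the tail is handled with a plain scalar loop purely to keep the example short; the bound uses ~size_t(15), which plays the same role as the & -16U mask mentioned earlier.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical name: byte sum with psadbw accumulating into two 64-bit lanes.
std::uint64_t sum_bytes_sad(const unsigned char* pSrc, std::size_t nSrcSize)
{
    assert(pSrc != nullptr || nSrcSize == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i vsum = zero;   // two qword accumulators

    // Largest multiple of 16 that is <= nSrcSize.
    const std::size_t vecEnd = nSrcSize & ~static_cast<std::size_t>(15);
    std::size_t i = 0;
    for (; i < vecEnd; i += 16)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pSrc + i));
        // Each qword of the psadbw result holds the sum of 8 bytes (at most 2040).
        vsum = _mm_add_epi64(vsum, _mm_sad_epu8(v, zero));
    }

    // Only two elements are non-zero, so the horizontal sum is one add.
    const __m128i hi = _mm_unpackhi_epi64(vsum, vsum);
    const __m128i total = _mm_add_epi64(vsum, hi);
    std::uint64_t sum = 0;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&sum), total);

    // Scalar tail, chosen for simplicity rather than speed.
    for (; i < nSrcSize; ++i)
        sum += pSrc[i];
    return sum;
}
```

Because the accumulators are 64 bits wide, this version does not need the widening unpacks and is not limited by 32-bit lane overflow.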
