Carlos
Carlos

Reputation: 33

SSE optimization of sum of squared differences

I've recently found that my program spend most time in the following simple function:

void SumOfSquaredDifference(
    const uint8_t * a, size_t aStride, const uint8_t * b, size_t bStride, 
    size_t width, size_t height, uint64_t * sum)
{
    *sum = 0;
    for(size_t row = 0; row < height; ++row)
    {
        int rowSum = 0;
        for(size_t col = 0; col < width; ++col)
        {
            int d = a[col] - b[col];
            rowSum += d*d;
        }
        *sum += rowSum;
        a += aStride;
        b += bStride;
    }
}

This function finds a sum of squared difference of two 8-bit gray images. I think that there is the way to improve its performance with using SSE, but I don't have an experience in this area. Could anybody help me?

Upvotes: 3

Views: 1909

Answers (2)

Cris Luengo
Cris Luengo

Reputation: 60494

GCC has switches that encourage it to vectorize the code. For example, the -mfma switch gives me about 25% speed increase on simple loops like this, using doubles. I imagine it's even better with 8-bit integers. I prefer that over hand-written optimizations because your code stays readable.

That said, there are a few old tricks that can speed up your loop:

  • Don't index, increment your pointer in every loop iteration. You do this in the outer loop, you should do the same in the inner loop. You can create a new pointer before going into the inner loop, so the +=stride stays valid.

  • Don't assign into the sum pointer inside your loop, use a local variable to accumulate and copy to the output when done. You use rowSum, but only in the inner loop. Use that variable across both loops instead.

Upvotes: 0

ErmIg
ErmIg

Reputation: 4038

Of course, you can improve your code. This an example of optimization of your function with using SSE2:

const __m128i Z = _mm_setzero_si128();
const size_t A = sizeof(__m128i);

inline __m128i SquaredDifference(__m128i a, __m128i b)
{
    const __m128i aLo = _mm_unpacklo_epi8(a, Z);
    const __m128i bLo = _mm_unpacklo_epi8(b, Z);
    const __m128i dLo = _mm_sub_epi16(aLo, bLo);

    const __m128i aHi = _mm_unpackhi_epi8(a, Z);
    const __m128i bHi = _mm_unpackhi_epi8(b, Z);
    const __m128i dHi = _mm_sub_epi16(aHi, bHi);

    return _mm_add_epi32(_mm_madd_epi16(dLo, dLo), _mm_madd_epi16(dHi, dHi));
}

inline __m128i HorizontalSum32(__m128i a)
{
    return _mm_add_epi64(_mm_unpacklo_epi32(a, Z), _mm_unpackhi_epi32(a, Z));
}

inline uint64_t ExtractSum64(__m128i a)
{
    uint64_t  _a[2];
    _mm_storeu_si128((__m128i*)_a, a);
    return _a[0] + _a[1];
}

void SumOfSquaredDifference(
    const uint8_t *a, size_t aStride, const uint8_t *b, size_t bStride, 
    size_t width, size_t height, uint64_t * sum)
{
    assert(width%A == 0 && width < 0x10000);
    __m128i fullSum = Z;
    for(size_t row = 0; row < height; ++row)
    {
        __m128i rowSum = Z;
        for(size_t col = 0; col < width; col += A)
        {
            const __m128i a_ = _mm_loadu_si128((__m128i*)(a + col));
            const __m128i b_ = _mm_loadu_si128((__m128i*)(b + col)); 
            rowSum = _mm_add_epi32(rowSum, SquaredDifference(a_, b_));
        }
        fullSum = _mm_add_epi64(fullSum, HorizontalSum32(rowSum));
        a += aStride;
        b += bStride;
    }
    *sum = ExtractSum64(fullSum);
}

This example is a few simplified (it doesn't work if the image width isn't multiple of 16). Full version of the algorithm is here.

And some magic from SSSE3 version:

const __m128i K_1FF = _mm_set1_epi16(0x1FF);

inline __m128i SquaredDifference(__m128i a, __m128i b)
{
    const __m128i lo = _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), K_1FF);
    const __m128i hi = _mm_maddubs_epi16(_mm_unpackhi_epi8(a, b), K_1FF);
    return _mm_add_epi32(_mm_madd_epi16(lo, lo), _mm_madd_epi16(hi, hi));
}

The magic description (see _mm_maddubs_epi16):

K_1FF -> {-1, 1, -1, 1, ...};
_mm_unpacklo_epi8(a, b) -> {a0, b0, a1, b1, ...};
_mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), K_1FF) -> {b0 - a0, b1 - a1, ...};

Upvotes: 9

Related Questions