Bogi

Reputation: 2618

Outer-loop vectorization with SSE and NEON

I want to vectorize the following loop on ARM NEON and SSE:

for (int i = 0; i < n; ++i) {
   b[i][0] = 0.0;
   for (int j = 1; j < n; ++j) {
      b[i][j] = b[i][j - 1] + a[i][j]; 
   }
}

The inner loop has a loop-carried dependency (b[i][j] depends on b[i][j - 1]), so it cannot be vectorized in the straightforward way. Instead, it can be vectorized using outer-loop vectorization.

Assuming a and b contain float32 values, outer-loop vectorization works by running 4 instances of the inner loop in parallel, one per SIMD lane. Here is what the iterations of the vectorized loop look like:

First iteration:  { { i = 0, j = 1 }, { i = 1, j = 1 }, { i = 2, j = 1 }, { i = 3 , j = 1 } }
Second iteration: { { i = 0, j = 2 }, { i = 1, j = 2 }, { i = 2, j = 2 }, { i = 3 , j = 2 } }

Etc.
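
Written out in scalar form, one step of this scheme processes 4 rows in lockstep, roughly like the following sketch (an illustration, not working code):

for (int i = 0; i + 3 < n; i += 4) {
   b[i][0] = b[i + 1][0] = b[i + 2][0] = b[i + 3][0] = 0.0;
   for (int j = 1; j < n; ++j) {
      b[i][j]     = b[i][j - 1]     + a[i][j];
      b[i + 1][j] = b[i + 1][j - 1] + a[i + 1][j];
      b[i + 2][j] = b[i + 2][j - 1] + a[i + 2][j];
      b[i + 3][j] = b[i + 3][j - 1] + a[i + 3][j];
   }
}

Each group of 4 assignments maps to one vector operation, but the 4 loads a[i][j] … a[i + 3][j] come from 4 different rows, so they are strided rather than contiguous in memory.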

I want to vectorize this loop using NEON and SSE. Neither of them supports the gathers and scatters that a straightforward vectorization of this form would need. Do you have any idea how to vectorize this loop efficiently?

Upvotes: 2

Views: 188

Answers (1)

Soonts

Reputation: 21956

The algorithm you have in your code is called an “exclusive prefix sum”.

I’m not sure SIMD is going to bring much profit, but still, for SSE try the following version. Micro-kernel:

#include <immintrin.h>

// Given a vector of [ a, b, c, d ], compute a vector of [ a, a+b, a+b+c, a+b+c+d ]
// The function requires SSE 4.1 support
__m128 inclusivePrefixSum( __m128 v )
{
    // a*2, a+b, c*2, c+d
    __m128 tmp = _mm_add_ps( v, _mm_moveldup_ps( v ) );
    // a, a+b, c, c+d
    tmp = _mm_blend_ps( tmp, v, 0b0101 );

    // Broadcasted a+b
    __m128 res = _mm_shuffle_ps( tmp, tmp, _MM_SHUFFLE( 1, 1, 1, 1 ) );
    // a*2+b, (a+b)*2, a+b+c, a+b+c+d
    res = _mm_add_ps( res, tmp );
    // a, a+b, a+b+c, a+b+c+d
    res = _mm_blend_ps( res, tmp, 0b0011 );
    return res;
}
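
As a quick sanity check of the micro-kernel, feeding [ 1, 2, 3, 4 ] through it should produce [ 1, 3, 6, 10 ], e.g.:

#include <stdio.h>

int main( void )
{
    __m128 v = _mm_setr_ps( 1.0f, 2.0f, 3.0f, 4.0f );
    float result[ 4 ];
    _mm_storeu_ps( result, inclusivePrefixSum( v ) );
    // Expected output: 1 3 6 10
    printf( "%g %g %g %g\n", result[ 0 ], result[ 1 ], result[ 2 ], result[ 3 ] );
    return 0;
}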

Usage:

    // One past the last j value covered by complete groups of 4 elements, counting from j = 1
    const size_t innerEndAligned = 1 + ( ( n - 1 ) / 4 ) * 4;
    for( size_t i = 0; i < n; i++ )
    {
        float* const pa = a[ i ];
        float* const pb = b[ i ];

        __m128 acc = _mm_setzero_ps();
        // Store b[ i ][ 0 ] = 0.0
        _mm_store_ss( pb, acc );
        size_t j;
        for( j = 1; j < innerEndAligned; j += 4 )
        {
            // Load 4 elements
            __m128 vec = _mm_loadu_ps( pa + j );
            // Compute prefix sum
            vec = inclusivePrefixSum( vec );
            // Increment by the accumulator
            vec = _mm_add_ps( vec, acc );
            // Store 4 output elements
            _mm_storeu_ps( pb + j, vec );
            // Broadcast the last lane from the vector,
            // which contains cumulative sum of the row so far, into the accumulator
            acc = _mm_shuffle_ps( vec, vec, _MM_SHUFFLE( 3, 3, 3, 3 ) );
        }

        // Handle the last 0-3 elements of the row with scalar code
        for( ; j < n; j++ )
        {
            __m128 v = _mm_load_ss( pa + j );
            acc = _mm_add_ss( acc, v );
            _mm_store_ss( pb + j, acc );
        }
    }
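
For NEON, the same overall structure should port directly. Here is an untested sketch of an equivalent micro-kernel using AArch64 intrinsics; the vextq_f32 shifts, which slide zeroes into the low lanes, play the role of the shuffle/blend pairs above:

#include <arm_neon.h>

// Given a vector of [ a, b, c, d ], compute a vector of [ a, a+b, a+b+c, a+b+c+d ]
float32x4_t inclusivePrefixSum( float32x4_t v )
{
    const float32x4_t zero = vdupq_n_f32( 0.0f );
    // Add the vector shifted up by 1 lane, zero-filled: [ a, a+b, b+c, c+d ]
    v = vaddq_f32( v, vextq_f32( zero, v, 3 ) );
    // Add the result shifted up by 2 lanes, zero-filled: [ a, a+b, a+b+c, a+b+c+d ]
    v = vaddq_f32( v, vextq_f32( zero, v, 2 ) );
    return v;
}

The surrounding loop stays the same, with vld1q_f32 / vst1q_f32 for the loads and stores, and vdupq_laneq_f32( vec, 3 ) to broadcast the last lane into the accumulator.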

Upvotes: 1
