Reputation: 2618
I want to vectorize the following loop on ARM NEON and SSE:
for (int i = 0; i < n; ++i) {
b[i][0] = 0.0;
for (int j = 1; j < n; ++j) {
b[i][j] = b[i][j - 1] + a[i][j];
}
}
This loop has a loop-carried dependency, so it cannot be vectorized simply. Instead, it can be vectorized using outer-loop vectorization.
Under the assumption that a
and b
are float32, outer-loop vectorization functions by running 4 instances of the inner loop in parallel. Here is what the iterations of the vectorized loop look like:
First iteration: { { i = 0, j = 1 }, { i = 1, j = 1 }, { i = 2, j = 1 }, { i = 3 , j = 1 } }
Second iteration: { { i = 0, j = 2 }, { i = 1, j = 2 }, { i = 2, j = 2 }, { i = 3 , j = 2 } }
Etc.
I want to vectorize this loop using NEON and SSE. Neither of them support gathers and scatters, which are needed to simply vectorize this loop. Do you have any idea how to vectorize this loop efficiently?
Upvotes: 2
Views: 188
Reputation: 21956
The algorithm which you have in your code is called “exclusive prefix sum”.
I’m not sure SIMD going to bring much profit, but still, about SSE, try the following version. Micro-kernel:
// Given a vector of [ a, b, c, d ], compute a vector of [ a, a+b, a+b+c, a+b+c+d ]
// The function requires SSE 4.1 support
__m128 inclusivePrefixSum( __m128 v )
{
// a*2, a+b, c*2, c+d
__m128 tmp = _mm_add_ps( v, _mm_moveldup_ps( v ) );
// a, a+b, c, c+d
tmp = _mm_blend_ps( tmp, v, 0b0101 );
// Broadcasted a+b
__m128 res = _mm_shuffle_ps( tmp, tmp, _MM_SHUFFLE( 1, 1, 1, 1 ) );
// a*2+b, (a+b)*2, a+b+c, a+b+c+d
res = _mm_add_ps( res, tmp );
// a, a+b, a+b+c, a+b+c+d
res = _mm_blend_ps( res, tmp, 0b0011 );
return res;
}
Usage:
const size_t innerEndAligned = 1 + ( ( n - 1 ) / 4 ) * 4;
for( size_t i = 0; i < n; i++ )
{
float* const pa = a[ i ];
float* const pb = b[ i ];
__m128 acc = _mm_setzero_ps();
_mm_store_ss( pb, acc );
size_t j;
for( j = 1; j < innerEndAligned; j += 4 )
{
// Load 4 elements
__m128 vec = _mm_loadu_ps( pa + j );
// Compute prefix sum
vec = inclusivePrefixSum( vec );
// Increment by the accumulator
vec = _mm_add_ps( vec, acc );
// Store 4 output elements
_mm_storeu_ps( pb + j, vec );
// Broadcast the last lane from the vector,
// which contains cumulative sum of the row so far, into the accumulator
acc = _mm_shuffle_ps( vec, vec, _MM_SHUFFLE( 3, 3, 3, 3 ) );
}
// Handle the last 0-3 elements of the row with scalar code
for( ; j < n; j++ )
{
__m128 v = _mm_load_ss( pa + j );
acc = _mm_add_ss( acc, v );
_mm_store_ss( pb + j, acc );
}
}
Upvotes: 1