markzzz

Reputation: 47995

How to read optimally from an array (in memory) when the array positions come from a SIMD vector?

I have code like this:

const rack::simd::float_4 pos = phase * waveTable.mLength;
const rack::simd::int32_4 pos0 = pos;
const rack::simd::float_4 frac = pos - (rack::simd::float_4)pos0;
rack::simd::float_4 v0;
rack::simd::float_4 v1;
for (int v = 0; v < 4; v++) {
    v0[v] = waveTable.mSamples[pos0[v]];
    v1[v] = waveTable.mSamples[pos0[v] + 1]; // mSamples size is waveTable.mLength + 1 for interpolation wraparound
}

oversampleBuffer[i] = v0 + (v1 - v0) * frac;

It takes phase (normalized) and linearly interpolates between two adjacent samples stored in waveTable.mSamples (one float each).

The positions are inside rack::simd::float_4, which is basically 4 aligned floats defined as __m128. After some benchmarks, that part of the code takes a noticeable amount of time (I guess due to a lot of cache misses).

It is built with -march=nocona, so I can use MMX, SSE, SSE2 and SSE3.

How would you optimize this code? Thanks

Upvotes: 1

Views: 323

Answers (1)

Soonts

Reputation: 21956

Your code is not too efficient for multiple reasons.

  1. You’re setting individual lanes of SIMD vectors with scalar code. Processors can’t quite do that, but the compiler pretends they can. Unfortunately, these compiler-generated workarounds are slow: they typically involve a round trip to memory and back (see the sketch after this list).

  2. Generally, you should avoid writing loops with very small trip counts like 2 or 4. Sometimes the compiler unrolls them and you’re OK, but other times it doesn’t, and the CPU ends up mispredicting way too many branches.

  3. Finally, processors can load 64-bit values with a single instruction. Since you’re loading consecutive pairs from the table, you can use one 64-bit load instead of two 32-bit ones.

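To illustrate point 1: with a runtime lane index, a scalar write like v0[v] = waveTable.mSamples[pos0[v]] forces the compiler to go through memory. The hypothetical helper below is roughly equivalent to what gets generated, not the actual compiler output:

#include <xmmintrin.h>

// Spill the vector, overwrite one lane with a scalar store, reload the vector.
// Doing this per lane, per vector, means the loads/stores dominate the loop.
inline __m128 insertLane( __m128 vec, int lane, float value )
{
    alignas( 16 ) float tmp[ 4 ];
    _mm_store_ps( tmp, vec );
    tmp[ lane ] = value;
    return _mm_load_ps( tmp );
}
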
Here’s a fixed version (untested). This assumes you’re building for PCs i.e. using SSE SIMD.

// SSE 4.1 intrinsics ( _mm_floor_ps, _mm_extract_epi32 ) come from <smmintrin.h>
#include <smmintrin.h>
#include <stdint.h>

// Load a vector with rsi[ i0 ], rsi[ i0 + 1 ], rsi[ i1 ], rsi[ i1 + 1 ]
inline __m128 loadFloats( const float* rsi, int i0, int i1 )
{
    // Casting load indices to unsigned, otherwise compiler will emit sign extension instructions
    __m128d res = _mm_load_sd( (const double*)( rsi + (uint32_t)i0 ) );
    res = _mm_loadh_pd( res, (const double*)( rsi + (uint32_t)i1 ) );
    return _mm_castpd_ps( res );
}

__m128 interpolate4( const float* waveTableData, uint32_t waveTableLength, __m128 phase )
{
    // Convert wave table length into floats.
    // Consider doing that outside of the inner loop, and passing the __m128.
    const __m128 length = _mm_set1_ps( (float)waveTableLength );

    // Compute integer indices, and the fraction
    const __m128 pos = _mm_mul_ps( phase, length );
    const __m128 posFloor = _mm_floor_ps( pos );    // BTW this one needs SSE 4.1, workarounds are expensive
    const __m128 frac = _mm_sub_ps( pos, posFloor );
    const __m128i posInt = _mm_cvtps_epi32( posFloor );

    // Abuse 64-bit load instructions to load pairs of values from the table.
    // If you have AVX2, can use _mm256_i32gather_pd instead, will load all 8 floats with 1 (slow) instruction.
    const __m128 s01 = loadFloats( waveTableData, _mm_cvtsi128_si32( posInt ), _mm_extract_epi32( posInt, 1 ) );
    const __m128 s23 = loadFloats( waveTableData, _mm_extract_epi32( posInt, 2 ), _mm_extract_epi32( posInt, 3 ) );

    // Shuffle into the correct order, i.e. gather even/odd elements from the vectors
    const __m128 v0 = _mm_shuffle_ps( s01, s23, _MM_SHUFFLE( 2, 0, 2, 0 ) );
    const __m128 v1 = _mm_shuffle_ps( s01, s23, _MM_SHUFFLE( 3, 1, 3, 1 ) );

    // Finally, linear interpolation between these vectors.
    const __m128 diff = _mm_sub_ps( v1, v0 );
    return _mm_add_ps( v0, _mm_mul_ps( frac, diff ) );
}
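
For completeness, the call site from the original Rack code could look something like this. This assumes rack::simd::float_4 exposes its underlying __m128 as the member v and can be constructed from one, which is how VCV Rack’s simd wrapper appears to be defined; adjust to the real API if it differs:

// Hypothetical call site; variable names taken from the question.
oversampleBuffer[ i ] = rack::simd::float_4(
    interpolate4( waveTable.mSamples, waveTable.mLength, phase.v ) );

And if you ever raise the baseline to AVX2, the two loadFloats calls can be replaced with a single gather, as the comment in the function suggests. A sketch of that replacement (untested):

#include <immintrin.h>

// Each 64-bit "double" lane holds a consecutive pair of floats from the table.
// The scale is 4 because the indices are in float units ( 4 bytes each ).
const __m256d gathered = _mm256_i32gather_pd( (const double*)waveTableData, posInt, 4 );
const __m256 pairs = _mm256_castpd_ps( gathered );
const __m128 s01 = _mm256_castps256_ps128( pairs );
const __m128 s23 = _mm256_extractf128_ps( pairs, 1 );
// The rest of the function ( shuffles + lerp ) stays the same.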

The assembly looks good. Modern compilers even use FMA automatically when it’s available (GCC by default; clang needs -ffp-contract=fast to contract across C statements, not just within one expression).


Just saw the update. Consider switching your target to SSE 4.1. The Steam hardware survey says its market penetration is 98.76%. If you are still supporting prehistoric CPUs like the Pentium 4, the workaround for _mm_floor_ps is in DirectXMath, and instead of _mm_extract_epi32 you can use _mm_srli_si128 + _mm_cvtsi128_si32; a sketch of both fallbacks follows below.
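
Here’s a minimal sketch of what those pre-SSE 4.1 fallbacks could look like (untested). The floor emulation assumes the inputs convert to int32 exactly, which holds for wavetable positions; DirectXMath’s XMVectorFloor handles the general case:

#include <emmintrin.h>

// SSE2-only replacement for _mm_floor_ps, valid when |x| is small enough
// that conversion to int32 is exact ( true for table positions here ).
inline __m128 floorSse2( __m128 x )
{
    const __m128 truncated = _mm_cvtepi32_ps( _mm_cvttps_epi32( x ) ); // rounds toward zero
    // Where truncation rounded up ( negative non-integer inputs ), subtract 1
    const __m128 needFix = _mm_cmplt_ps( x, truncated );
    return _mm_sub_ps( truncated, _mm_and_ps( needFix, _mm_set1_ps( 1.0f ) ) );
}

// SSE2-only replacement for _mm_extract_epi32( v, lane ) with a constant lane:
// shift the wanted lane down to position 0, then move it to a scalar register.
inline int extractLaneSse2( __m128i v, int lane )
{
    switch( lane )
    {
        case 1: v = _mm_srli_si128( v, 4 ); break;
        case 2: v = _mm_srli_si128( v, 8 ); break;
        case 3: v = _mm_srli_si128( v, 12 ); break;
    }
    return _mm_cvtsi128_si32( v );
}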

Even if you need to support an old baseline like SSE3 only, -mtune=generic or even -mtune=haswell might be a good idea along with -march=nocona, so that inlining and other code-gen choices are still good for a range of CPUs, not just the Pentium 4.
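
For example, a build line combining the old instruction-set baseline with newer tuning might look like this (compiler and file name are placeholders):

g++ -O3 -march=nocona -mtune=generic -c oscillator.cpp -o oscillator.o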

Upvotes: 3
