C++ SSE Optimisation with multiple functions

Question

I have some code that is structurally similar to the code below. There are several small SSE helper functions, a larger one that does most of the work, and the public function that organises the data, runs the large function in a loop and deals with any leftover data.

This gave about a 2x speed boost over the scalar implementation, but I would like to obtain more if possible. As well as some conceptual issues, there were some things in the disassembly that I did not like (I only looked at x86 VC++ 2010 in detail, but support x86 and GCC).

For at least some targets I can only use SSE and SSE2 here, but if it is worth a separate build I could possibly use newer instruction sets as well.

Problem 1:

All the small helpers got inlined into the large helper nicely, but the large one was not inlined into foo.

However, even though it is only referenced by one function in one source file and there are plenty of registers (looking at the algorithm, I am pretty sure it needs at most 12 XMM registers, apart from loading the data arrays), the compiler seems to want to follow the normal calling conventions for fooHelper.

So after putting data into XMM registers in foo, it puts them back on the stack and passes pointers. Then, after the loops and the tidy-up code, it loads that stack back into XMM registers so I can unload it again...

I guess I could force it to inline fooHelper, but that is a very large number of duplicated instructions because it wouldn't use 4 XMM registers to do the job. I could also avoid SSE in foo itself, which would remove the load/store issue there, but fooHelper would still be doing completely unnecessary loads and stores on those 4 state variables...

Ideally, since this is a private function, a way to ignore the normal calling conventions would be nice. I am sure this will come up in lots of other larger pieces of SSE code where I don't really want everything fully inlined.
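One way to limit the cost without inlining the large helper is to move the block loop into it, so the by-reference state is read into locals once per call and written back once, instead of once per block. The following is only a minimal sketch under that assumption: `processBlocks` and `mixBlock` are hypothetical names, and the two-instruction body merely stands in for the real work, which is elided in the question.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

namespace {
// Hypothetical stand-in for the real per-block work (elided in the question).
inline void mixBlock(__m128i &a, __m128i &b, __m128i x)
{
    a = _mm_add_epi32(a, x);
    b = _mm_xor_si128(b, a);
}

// Processes `blocks` consecutive blocks of 4 ints in a single call.
void processBlocks(const int *data, std::size_t blocks, __m128i &aRef, __m128i &bRef)
{
    // Copy the state into locals so the compiler is free to keep it in registers.
    __m128i a = aRef;
    __m128i b = bRef;
    for (std::size_t i = 0; i < blocks; ++i)
    {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i *>(data + 4 * i));
        mixBlock(a, b, x);
    }
    // Write the state back once, after the last block.
    aRef = a;
    bRef = b;
}
}
```

With this shape the padded tail block can reuse the same function with a block count of 1, so the large body still exists only once in the binary.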

Problem 2:

The implementation is basically working on 4 state vectors organised as AAAA, BBBB, CCCC, DDDD, such that the code can simply be written as if it is working with A, B, C and D as separate variables, while processing all 4 data streams at once.

However, the output itself is in the form ABCD, ABCD, ABCD, ABCD, and the input is also 4 separate buffers, requiring _mm_set_epi32 to load it.

Is there a better way to deal with these inputs and outputs (the format of which cannot practically be changed)?
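One possibility that stays within SSE2 is to load four ints from each buffer with a single unaligned load and then transpose the 4x4 block in registers, which produces the same vectors as four `_mm_set_epi32` calls. This is a sketch, not the code from the question; `loadTransposed` is a hypothetical name, and the buffers are assumed to hold at least four readable ints each.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Fills col[i] with the same value as
// _mm_set_epi32(d1[i], d2[i], d3[i], d4[i]) for i = 0..3.
inline void loadTransposed(const int *d1, const int *d2, const int *d3, const int *d4,
                           __m128i col[4])
{
    // _mm_set_epi32 places its LAST argument in lane 0, so d4 becomes row 0.
    __m128i r0 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(d4));
    __m128i r1 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(d3));
    __m128i r2 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(d2));
    __m128i r3 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(d1));

    // Interleave 32-bit lanes of row pairs.
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);  // r0[0] r1[0] r0[1] r1[1]
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);  // r2[0] r3[0] r2[1] r3[1]
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);  // r0[2] r1[2] r0[3] r1[3]
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);  // r2[2] r3[2] r2[3] r3[3]

    // Combine 64-bit halves to finish the transpose.
    col[0] = _mm_unpacklo_epi64(t0, t1);      // element 0 of every stream
    col[1] = _mm_unpackhi_epi64(t0, t1);      // element 1 of every stream
    col[2] = _mm_unpacklo_epi64(t2, t3);      // element 2 of every stream
    col[3] = _mm_unpackhi_epi64(t2, t3);      // element 3 of every stream
}
```

This replaces sixteen scalar loads and four `_mm_set_epi32` calls per block with four vector loads and eight unpack instructions; whether that is faster in practice would need measuring on the targets in question.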

namespace
{
    void fooHelperA(__m128i &a, __m128i b, __m128i x, int s)
    {
        ...small function (<5 sse operations)...
    }
    ...bunch of other small functions...

    //
    void fooHelper(        
         const int *data1, const int *data2, const int *data3, const int *data4,
         __m128i &a, __m128i &b, __m128i &c, __m128i &d)
    {
        //Get the current piece of data
        // Named x (not c) so it does not clash with the state parameter c; integer data needs __m128i
        __m128i x = _mm_set_epi32(data1[0], data2[0], data3[0], data4[0]);
        ...do stuff with data...
        fooHelperA(a, b, x, 5);
        ...
        x = _mm_set_epi32(data1[1], data2[1], data3[1], data4[1]);
        ...
        fooHelperA(b, a, x, 7);
        ... lots more code ...
        x = _mm_set_epi32(data1[3], data2[3], data3[3], data4[3]);
        ...
    }
}
void foo(
    const char*data1, const char *data2, const float *data3, const char *data4,
    int*out1, int*out2, int*out3, int*out4,
    size_t len)
{
    __m128i a = _mm_setzero_si128();
    __m128i b = _mm_setzero_si128();
    __m128i c = _mm_setzero_si128();
    __m128i d = _mm_setzero_si128();
    while (len >= 16) //expected to loop <25 times for datasets in question
    {
        fooHelper((const int*)data1, (const int*)data2, (const int*)data3, (const int*)data4, a,b,c,d);
        data1 += 16;
        data2 += 16;
        data3 += 16;
        data4 += 16;
        len -= 16;
    }
    if (len)
    {
        int buffer[4][4];
        ...pad data into buffer...
        fooHelper(buffer[0], buffer[1], buffer[2], buffer[3], a,b,c,d);
    }
    ALIGNED(16, int[4][4]) tmp;
    _mm_store_si128((__m128i*)tmp[0], a);
    _mm_store_si128((__m128i*)tmp[1], b);
    _mm_store_si128((__m128i*)tmp[2], c);
    _mm_store_si128((__m128i*)tmp[3], d);

    // Row k of tmp holds state vector k (a, b, c, d), so out*[k] reads tmp[k]
    out1[0] = tmp[0][0];
    out2[0] = tmp[0][1];
    out3[0] = tmp[0][2];
    out4[0] = tmp[0][3];

    out1[1] = tmp[1][0];
    out2[1] = tmp[1][1];
    out3[1] = tmp[1][2];
    out4[1] = tmp[1][3];

    out1[2] = tmp[2][0];
    out2[2] = tmp[2][1];
    out3[2] = tmp[2][2];
    out4[2] = tmp[2][3];

    out1[3] = tmp[3][0];
    out2[3] = tmp[3][1];
    out3[3] = tmp[3][2];
    out4[3] = tmp[3][3];
}

alexbuisson · Accepted Answer

Some advice,

1) Looking at your code and data description, it seems you could gain a lot by moving your data organization from SoA (struct of arrays, your AAAA vectors) to AoS (array of structs), where the input data is already organized as ABCD. You would then have one big input vector (4x bigger).

2) Take care of your data alignment. For now it does not matter, since you already pay a penalty for the _mm_set_epi32 calls, but if you switch to AoS you should be able to use a fast load (memory to XMM register).

3) The end of the function is a bit strange (I cannot test it right now). I really don't understand why you need a 2D tmp array.

4) Interleaving (and the inverse operation) can be done using one of the existing examples of SoA/AoS conversion. Intel wrote a lot of papers on this topic when promoting the SIMD instruction sets.
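As an illustration of points 3 and 4, the tmp array and the sixteen scalar stores at the end of foo could be replaced by an in-register transpose followed by four vector stores. The sketch below assumes the intent is that out1 receives lane 0 of a, b, c and d, out2 lane 1, and so on; `storeTransposed` is a hypothetical name. It uses the `_MM_TRANSPOSE4_PS` macro from `xmmintrin.h`, which only shuffles bits, so casting the integer vectors to float vectors and back does not change their values.

```cpp
#include <xmmintrin.h>  // SSE: _MM_TRANSPOSE4_PS
#include <emmintrin.h>  // SSE2: integer vectors and casts

// Writes lane k of a, b, c, d (in that order) to the k-th output buffer.
inline void storeTransposed(__m128i a, __m128i b, __m128i c, __m128i d,
                            int *out1, int *out2, int *out3, int *out4)
{
    // Reinterpret the integer vectors as float vectors; no conversion happens.
    __m128 r0 = _mm_castsi128_ps(a);
    __m128 r1 = _mm_castsi128_ps(b);
    __m128 r2 = _mm_castsi128_ps(c);
    __m128 r3 = _mm_castsi128_ps(d);

    // After the transpose, r0 = (a0, b0, c0, d0), r1 = (a1, b1, c1, d1), ...
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    // Unaligned stores, since nothing is known about the output alignment.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out1), _mm_castps_si128(r0));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out2), _mm_castps_si128(r1));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out3), _mm_castps_si128(r2));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out4), _mm_castps_si128(r3));
}
```

If the output buffers are known to be 16-byte aligned, `_mm_store_si128` could be used instead of the unaligned store.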

good luck, alex
