user1181950

Reputation: 819

Unknown SSE bottleneck

I have some generic code that I am trying to move to SSE to speed it up, since it's getting called a lot. The code in question is basically something like this:

for (int i = 1; i < mysize; ++i)
{
    buf[i] = myMin(buf[i], buf[i - 1] + offset);
}

where myMin is your simple min function, (a < b) ? a : b (I've looked at the disassembly and there are jumps in there).

My SSE code (which has gone through several iterations of speedups) is now in this form:

float tmpf = *(tmp - 1); // the element just before this block (the carried dependency)
__m128 off = _mm_set_ss(offset);
for (int l = 0; l < mysize; l += 4)
{
    __m128 post = _mm_load_ps(tmp);
    __m128 pre = _mm_move_ss(post, _mm_set_ss(tmpf));
    pre = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(0, 3, 2, 1));
    pre = _mm_add_ss(pre, off);
    post = _mm_min_ss(post, pre);

    // reversed
    pre = _mm_shuffle_ps(post, post, _MM_SHUFFLE(2, 1, 0, 3));
    post = _mm_add_ss(post, off);
    pre = _mm_min_ss(pre, post);

    post = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(2, 1, 0, 3));
    pre = _mm_add_ss(pre, off);
    post = _mm_min_ss(post, pre);

    // reversed
    pre = _mm_shuffle_ps(post, post, _MM_SHUFFLE(2, 1, 0, 3));
    post = _mm_add_ss(post, off);
    pre = _mm_min_ss(pre, post);

    post = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(2, 1, 0, 3));
    _mm_store_ps(tmp, post);
    tmpf = tmp[3]; // carry the block's last (updated) element into the next iteration
    tmp += 4;
}

Ignoring any edge-case scenarios (which I've handled fine, and whose overhead is negligible given the size of buf/tmp), can anyone explain why the SSE version is 2x slower? VTune keeps attributing it to L1 misses, but as far as I can see it should make 4x fewer trips to L1 and has no branches/jumps, so it should be faster, but it's not. What am I missing here?

Thanks

EDIT: So I did find something else in a separate test case. I didn't think this would matter, but alas it did. mysize above is actually not that big (about 30-50), but there are a LOT of these buffers and they are all processed serially. In that case, the ternary expression is faster than SSE. However, if the situation is reversed, with mysize in the millions and only 30-50 buffers, the SSE version is faster. Any idea why? I would have thought the memory interactions would be the same for both, including pre-emptive prefetching etc.

Upvotes: 2

Views: 230

Answers (2)

Z boson

Reputation: 33689

Here's a branchless solution you could try:

for (int i = 1; i < mysize; i++) {
    float a = buf[i];
    float b = buf[i-1] + offset;
    buf[i] = b + (a<b)*(a-b);   // selects a when a<b, otherwise b, with no branch
}

Here is the assembly:

.L6:
addss   xmm0, xmm4
movss   xmm1, DWORD PTR [rax]
movaps  xmm2, xmm1
add rax, 4
movaps  xmm3, xmm6
cmpltss xmm2, xmm0
subss   xmm1, xmm0
andps   xmm3, xmm2
andnps  xmm2, xmm5
orps    xmm2, xmm3
mulss   xmm1, xmm2
addss   xmm0, xmm1
movss   DWORD PTR [rax-4], xmm0
cmp rax, rdx
jne .L6

But the version with a branch is probably already better:

for (int i = 1; i < mysize; i++) {
     float a = buf[i];
     float b = buf[i-1] + offset;
     buf[i] = a<b ? a : b;
}

Here is the assembly:

.L15:
addss   xmm0, xmm2
movss   xmm1, DWORD PTR [rax]
add rax, 4
minss   xmm1, xmm0
movss   DWORD PTR [rax-4], xmm1
cmp rax, rdx
movaps  xmm0, xmm1
jne .L15

This produces code which is branchless anyway, using minss (the cmp rax, rdx / jne pair only implements the loop counter).

Finally, here is code you can use with MSVC which produces the same branchless assembly as GCC:

__m128 offset4 = _mm_set1_ps(offset);
for (int i = 1; i < mysize; i++) {
    __m128 a = _mm_load_ss(&buf[i]);
    __m128 b = _mm_load_ss(&buf[i-1]);
    b = _mm_add_ss(b, offset4);
    a = _mm_min_ss(a,b);
    _mm_store_ss(&buf[i], a);
}

Here is another form you can try which uses a branch:

__m128 offset4 = _mm_set1_ps(offset);
for (int i = 1; i < mysize; i++) {
    __m128 a = _mm_load_ss(&buf[i]);
    __m128 b = _mm_load_ss(&buf[i-1]);
    b = _mm_add_ss(b, offset4);
    if(_mm_comilt_ss(b, a))      // store only when buf[i-1] + offset is actually smaller
        _mm_store_ss(&buf[i], b);
}

Upvotes: 0

gnasher729

Reputation: 52622

If this code is performance critical, you'll have to look at the data that you get. It's the serial dependency that's killing you, and you need to get rid of it.

One very small value in buf[i] will influence a lot of the following values. For example, if offset = 1, buf[0] = 0, and all other values are > 1 million, that one value will influence the next one million. On the other hand, that kind of thing might happen very rarely.

If it is rare, then you check, fully vectorised, whether buf[i] > buf[i-1] + offset, replace it where that holds, and keep track of where changes were made, without worrying that the lowered buf[i] values could trickle onwards. Then you look at where changes were made, and re-check them.
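For illustration, a minimal sketch of that two-pass idea (the function name clampPass, the unaligned loads, and the tail handling are my assumptions, not from the answer):

#include <xmmintrin.h>

// Pass 1: vectorised min against the left neighbours, recording whether
// any lane actually changed. Pass 2 (rare): serial re-check, since a
// lowered value may need to trickle further to the right.
void clampPass(float* buf, int mysize, float offset)
{
    const __m128 off4 = _mm_set1_ps(offset);
    int changed = 0;
    int i = 1;
    for (; i + 4 <= mysize; i += 4)
    {
        __m128 cur  = _mm_loadu_ps(&buf[i]);     // buf[i..i+3]
        __m128 prev = _mm_loadu_ps(&buf[i - 1]); // left neighbours; buf[i-1] may
                                                 // already be lowered, which only helps
        __m128 cand = _mm_add_ps(prev, off4);    // buf[i-1] + offset
        changed |= _mm_movemask_ps(_mm_cmplt_ps(cand, cur));
        _mm_storeu_ps(&buf[i], _mm_min_ps(cur, cand));
    }
    for (; i < mysize; ++i)                      // scalar tail
        if (buf[i - 1] + offset < buf[i]) { buf[i] = buf[i - 1] + offset; changed = 1; }
    if (changed)                                 // rare case: re-run the serial rule
        for (int k = 1; k < mysize; ++k)
            if (buf[k - 1] + offset < buf[k]) buf[k] = buf[k - 1] + offset;
}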

In extreme cases, say buf[i] is always between 0 and 1, and offset > 0.5, you know that buf[i] cannot influence buf[i + 2] at all, so you just ignore the serial dependency and do everything in parallel, fully vectorised.
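A minimal sketch of that fully parallel form (clampParallel is my name; it assumes the guarantee above holds, i.e. a lowered element can never lower its successor, so minning against the original left neighbour is already exact):

#include <xmmintrin.h>

// One vectorised pass, always comparing against the ORIGINAL left
// neighbour; the carry preserves the original value across the store.
void clampParallel(float* buf, int mysize, float offset)
{
    const __m128 off4 = _mm_set1_ps(offset);
    float carry = buf[0];                        // original left neighbour of lane 0
    int i = 1;
    for (; i + 4 <= mysize; i += 4)
    {
        __m128 cur = _mm_loadu_ps(&buf[i]);      // buf[i..i+3], still unmodified
        // build [carry, buf[i], buf[i+1], buf[i+2]] from the unmodified block
        __m128 prev = _mm_shuffle_ps(cur, cur, _MM_SHUFFLE(2, 1, 0, 0));
        prev = _mm_move_ss(prev, _mm_set_ss(carry));
        carry = buf[i + 3];                      // save the original before the store
        _mm_storeu_ps(&buf[i], _mm_min_ps(cur, _mm_add_ps(prev, off4)));
    }
    for (; i < mysize; ++i)                      // scalar tail, same rule
    {
        float cand = carry + offset;
        carry = buf[i];
        if (cand < buf[i]) buf[i] = cand;
    }
}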

On the other hand, if you have some tiny values in your buffer that influence large numbers of consecutive values, then you start with the first value buf[0] and, fully vectorised, check whether buf[i] > buf[0] + i * offset, replacing those values with the ramp value, until the check fails; the element where it fails becomes the new starting point.
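A minimal sketch of that ramp propagation (propagateRamp, the restart bookkeeping, and the scalar tail are my structure, not the answer's):

#include <xmmintrin.h>

// Force buf[s+k] down to buf[s] + k*offset, four lanes at a time, and
// restart the ramp at the first element that is already below it.
void propagateRamp(float* buf, int mysize, float offset)
{
    int s = 0;                                    // current anchor element
    while (s + 1 < mysize)
    {
        __m128 forced = _mm_add_ps(_mm_set1_ps(buf[s]),
                          _mm_set_ps(4 * offset, 3 * offset, 2 * offset, offset));
        const __m128 step = _mm_set1_ps(4 * offset);
        int i = s + 1, next = -1;
        for (; i + 4 <= mysize && next < 0; i += 4)
        {
            __m128 v = _mm_loadu_ps(&buf[i]);
            int m = _mm_movemask_ps(_mm_cmpgt_ps(v, forced)); // lanes above the ramp
            _mm_storeu_ps(&buf[i], _mm_min_ps(v, forced));
            if (m != 0xF)                         // some lane broke the ramp
            {
                int k = 0;
                while (m & (1 << k)) ++k;         // first lane that was not forced
                next = i + k;                     // restart the ramp there
            }
            forced = _mm_add_ps(forced, step);
        }
        if (next < 0)                             // ramp survived the vector part
        {
            for (; i < mysize; ++i)               // finish the tail serially
                if (buf[i - 1] + offset < buf[i]) buf[i] = buf[i - 1] + offset;
            return;
        }
        s = next;
    }
}

Note that lanes stored after a mid-block break may still be slightly too high; the restart from the break element lowers them again, so the result converges to the serial answer.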

You say "the values can be anything". If that is the case, for example if buf[i] is chosen randomly anywhere between 0 and 1,000,000 and offset is not very large, then you will have elements buf[i] which force lots of following elements to be buf[i] + (k - i) * offset. For example, if offset = 1 and you find buf[i] is about 10,000, then it will force on average about 100 values to be equal to buf[i] + (k - i) * offset.

Upvotes: 1
