Reputation: 7611
Yes, I read "SIMD code runs slower than scalar code". No, it's not really a duplicate.
I have been doing 2D math stuff for a while and am in the process of porting my codebase from C to C++. There are a few walls I've hit with C that mean I really need polymorphism, but that's another story. Anyway, I considered this a while ago, and it presented a perfect opportunity to write a 2D vector class, including SSE implementations of the common math operations. Yes, I know there are libraries out there, but I wanted to try it myself to understand what's going on, and I don't use anything more complicated than +=.
My implementation is via <immintrin.h>, with a

union {
    __m128d ss;
    struct {
        double x;
        double y;
    };
};
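For context, a compilable sketch of that union-based layout with an SSE +=, roughly what the question describes (the class name and constructor are my own; the anonymous struct inside a union is a common compiler extension, accepted by GCC, Clang, and MSVC):

```cpp
#include <immintrin.h>

// Union layout from the question: one __m128d overlapping two doubles.
struct vec2 {
    union {
        __m128d ss;
        struct { double x, y; };
    };

    vec2(double px, double py) { x = px; y = py; }

    // SSE path: one packed addpd covers both components.
    vec2& operator+=(const vec2& o) {
        ss = _mm_add_pd(ss, o.ss);
        return *this;
    }
};
```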
SSE seemed slow, so I looked at the generated ASM output. After fixing something stupid pointer-wise, I ended up with the following sets of instructions, run a billion times in a loop (the processor is an AMD Phenom II at 3.7 GHz):
SSE enabled: 1.1 to 1.8 seconds (varies)
add $0x1, %eax
addpd %xmm0, %xmm1
cmp $0x3b9aca00, %eax
jne 4006c8
SSE disabled: 1.0 seconds (pretty constant)
add $0x1, %eax
addsd %xmm0, %xmm3
cmp $0x3b9aca00, %eax
addsd %xmm2, %xmm1
jne 400630
The only conclusion I can draw from this is that addsd is faster than addpd here, and that pipelining means the extra instruction is compensated for by the ability to partially overlap more independent operations.
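For anyone wanting to reproduce the comparison, here is a minimal sketch of the kind of loops being timed (my own reconstruction, not the OP's exact code; the iteration count is a parameter, and note an optimizing compiler may autovectorize the scalar version unless told otherwise):

```cpp
#include <immintrin.h>

struct pair { double x, y; };

// Scalar version: two independent addsd chains per iteration,
// like the second listing in the question.
pair scalar_loop(pair inc, long n) {
    double ax = 0.0, ay = 0.0;
    for (long i = 0; i < n; ++i) { ax += inc.x; ay += inc.y; }
    return pair{ax, ay};
}

// Packed version: a single addpd dependency chain per iteration,
// like the first listing.
pair packed_loop(pair inc, long n) {
    __m128d v   = _mm_set_pd(inc.y, inc.x);  // low lane = x
    __m128d acc = _mm_setzero_pd();
    for (long i = 0; i < n; ++i) acc = _mm_add_pd(acc, v);
    double out[2];
    _mm_storeu_pd(out, acc);
    return pair{out[0], out[1]};
}
```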
So my question is: is this worth it, and in practice will it actually help, or should I just not bother with the stupid optimization and let the compiler handle it in scalar mode?
Upvotes: 4
Views: 1256
Reputation: 365707
Just for the record, Agner Fog's instruction tables confirm that K10 runs addpd and addsd with identical performance: 1 m-op for the FADD unit, with 4-cycle latency. The earlier K8 only had 64-bit execution units, and split addpd into two m-ops.
So both loops have a 4 cycle loop-carried dependency chain. The scalar loop has two separate 4c dep chains, but that still only keeps the FADD unit occupied half the time (instead of 1/4).
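The same latency-hiding idea applies to the packed code: with multiple independent accumulators, several 4-cycle addpd chains run in flight at once instead of one. A sketch (my own example, not from the answer; assumes n is a multiple of 8):

```cpp
#include <immintrin.h>
#include <cstddef>

// Four independent accumulators so the 4-cycle addpd latency overlaps;
// the useful count depends on the CPU's latency x throughput product.
__m128d sum_pd(const double* p, std::size_t n) {
    __m128d s0 = _mm_setzero_pd(), s1 = _mm_setzero_pd(),
            s2 = _mm_setzero_pd(), s3 = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 8) {
        s0 = _mm_add_pd(s0, _mm_loadu_pd(p + i));
        s1 = _mm_add_pd(s1, _mm_loadu_pd(p + i + 2));
        s2 = _mm_add_pd(s2, _mm_loadu_pd(p + i + 4));
        s3 = _mm_add_pd(s3, _mm_loadu_pd(p + i + 6));
    }
    // Combine the partial sums at the end.
    return _mm_add_pd(_mm_add_pd(s0, s1), _mm_add_pd(s2, s3));
}
```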
Other parts of the pipeline must be coming into play, perhaps code alignment or just instruction ordering. AMD is more sensitive to that than Intel, IIRC. I'm not curious enough to read up on the K10 pipeline and figure out if there's an explanation in Agner Fog's docs.
K10 doesn't fuse cmp/jcc into a single op, so having them split up isn't actually a problem. (Bulldozer-family CPUs do, and of course Intel does).
Upvotes: 2
Reputation: 6357
This requires more loop unrolling and maybe cache prefetching. Your arithmetic density is very low: 1 operation for 2 memory operations, so you need to jam as many of these into your pipeline as possible.
Also, don't use a union; use __m128d directly and use _mm_load_pd to fill your __m128d from your data. A __m128d in a union generates bad code, where every element does a stack-register-stack dance, which is detrimental.
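A minimal sketch of that suggestion: keep the data as plain doubles and move it through __m128d only for the arithmetic (function name mine; I use the unaligned load/store variants since the alignment of the caller's data is unknown):

```cpp
#include <immintrin.h>

// Load two doubles, add them packed, store the result; no union,
// so the compiler keeps the values in XMM registers throughout.
void add2(const double* a, const double* b, double* out) {
    __m128d va = _mm_loadu_pd(a);   // use _mm_load_pd instead if the
    __m128d vb = _mm_loadu_pd(b);   // data is 16-byte aligned
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```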
Upvotes: 7
Reputation: 383
2D math isn't that processor-intensive (compared to 3D math), so I highly doubt it's worth sinking that much time into it. It's worth optimizing only if it actually shows up as a bottleneck.
I've done some SSE tests on my rigs (AMD APU @ 3 GHz x 4; old Intel CPU @ 1.8 GHz x 2) and have found SSE to be of benefit in most of the cases I've tested. However, this was for 3D operations, not 2D.
The scalar code has more opportunity for parallelism, IIRC: four registers used instead of two, and fewer dependencies. If register contention becomes greater, the vectorized code may run better. Take that with a grain of salt, though; I haven't put it to the test.
Upvotes: 1