Ricky

Reputation: 1683

Why is my application not able to reach Core i7 920 peak FP performance?

I have a question about the peak FP performance of my Core i7 920. I have an application that does a lot of MAC operations (basically a convolution operation), yet even with multi-threading and SSE instructions I fall short of the CPU's peak FP performance by a factor of ~8x. While trying to find the reason for this I ended up with a simplified code snippet, running on a single thread and without SSE instructions, which performs equally badly:

for(i=0; i<49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}

If I'm correct (the data and other_data arrays are all floating-point), this piece of code requires:

49335264 * 2 = 98670528 FLOPs

It executes in ~150 ms (I'm very sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).

This means the performance of this code snippet is:

98670528 / (150 * 10^-3) / 10^9 ≈ 0.66 GFLOP/s

whereas the peak performance of this CPU should be 2 * 3.2 = 6.4 GFLOP/s (2 FP units, 3.2 GHz processor), right?
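
(Put another way, taking that 2-FLOP-per-cycle figure at face value, the gap is 6.4 / 0.66 ≈ 10x for this scalar, single-threaded snippet, i.e. roughly an order of magnitude below peak.)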

Is there any explanation for this huge gap? Because i cannot explain it.

Thanks a lot in advance, and i could really use your help!

Upvotes: 3

Views: 652

Answers (4)

Gunther Piez

Reputation: 30449

I would use SSE.

Edit: I ran some more tests myself and discovered that your program is neither limited by memory bandwidth (the theoretical limit is about 3-4 times higher than your result) nor by floating-point performance (with an even higher limit); it is limited by lazy allocation of memory pages by the OS.

#include <chrono>
#include <iostream>
#include <x86intrin.h>

using namespace std::chrono;

static const unsigned size = 49335264;

float data[size], other_data[size], other_data2[size];

int main() {
#if 0
    // Touch the arrays once so the OS maps real pages (see below).
    for (unsigned i = 0; i < size; i++) {
        data[i] = i;
        other_data[i] = i;
        other_data2[i] = i;
    }
#endif
    system_clock::time_point start = system_clock::now();

    for (unsigned i = 0; i < size; i++)
        data[i] += other_data[i] * other_data2[i];

    // system_clock's native tick may be finer than microseconds, so convert explicitly.
    microseconds timeUsed = duration_cast<microseconds>(system_clock::now() - start);

    std::cout << "Used " << timeUsed.count() << " us, "
              << 2*size/(timeUsed.count()/1e6*1e9) << " GFLOPS\n";
}

Compile with g++ -O3 -march=native -std=c++0x. The program gives

Used 212027 us, 0.465368 GFLOPS

as output, although the hot loop translates to

400848:       vmovaps 0xc234100(%rdx),%ymm0
400850:       vmulps 0x601180(%rdx),%ymm0,%ymm0
400858:       vaddps 0x17e67080(%rdx),%ymm0,%ymm0
400860:       vmovaps %ymm0,0x17e67080(%rdx)
400868:       add    $0x20,%rdx
40086c:       cmp    $0xbc32f80,%rdx
400873:       jne    400848 <main+0x18>

This means it is fully vectorized, using 8 floats per iteration and even taking advantage of AVX. After playing around with streaming instructions like movntdq, which didn't buy anything, I decided to actually initialize the arrays with something - otherwise they remain zero pages, which only get mapped to real memory when they are written to. Changing the #if 0 to #if 1 immediately yields

Used 48843 us, 2.02016 GFLOPS

which comes pretty close to the memory bandwidth of the system (4 floats of 4 bytes each per two FLOPs, so 2 GFLOP/s corresponds to about 16 GB/s; the theoretical limit is 2 channels of DDR3 at 10.667 GB/s each).
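
If you want to keep the #if 0 block disabled, any code that writes each element once before the timed region has the same effect. A minimal sketch, assuming the same global arrays as in the program above:

#include <algorithm>  // std::fill

// Touch every element once so the OS maps real pages before the timed loop.
std::fill(data, data + size, 1.0f);
std::fill(other_data, other_data + size, 2.0f);
std::fill(other_data2, other_data2 + size, 3.0f);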

Upvotes: 5

Ricky

Reputation: 1683

I looked at the assembly code for the multiply-accumulate in the code snippet from my first post, and it looks like this:

movq  0x80(%rbx), %rcx
movq  0x138(%rbx), %rdi
movq  0x120(%rbx), %rdx
movq  (%rcx), %rsi
movq  0x8(%rdi), %r8
movq  0x8(%rdx), %r9
movssl  0x1400(%rsi), %xmm0
mulssl  0x90(%r8), %xmm0
addssl  0x9f8(%r9), %xmm0
movssl  %xmm0, 0x9f8(%r9)

From the total cycle count I estimated that each multiply-accumulate takes ~10 cycles to execute.

The problem seems to be that the compiler is unable to pipeline the execution of the loop, even though there are no inter-iteration dependencies. Am I correct?
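
For reference, a hand-vectorized sketch of this loop using SSE intrinsics is shown below. It is only an illustration: it assumes flat, 16-byte-aligned arrays whose length is a multiple of 4, which the indirect addressing in the assembly above suggests my real data structures may not satisfy.

#include <xmmintrin.h>

// Illustrative hand-vectorized multiply-accumulate: 4 floats per iteration.
// Assumes data, a and b are 16-byte aligned and n is a multiple of 4.
void mac_sse(float* data, const float* a, const float* b, unsigned n)
{
    for (unsigned i = 0; i < n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        _mm_store_ps(data + i, _mm_add_ps(_mm_load_ps(data + i), prod));
    }
}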

Does anybody have any other ideas / solutions for this?

Thanks for the help so far!

Upvotes: 0

Igor ostrovsky

Reputation: 7392

As High Performance Mark explained, your test is very likely to be memory bound rather than compute-bound.

One thing I'd like to add is that to quantify this effect, you can modify the test so that it operates on data that fits into the L1 cache:

for(i=0, j=0; i<6166908; i++) 
{ 
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++; 
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++; 
    data[j] += other_data[j] * other_data2[j]; j++; 
    data[j] += other_data[j] * other_data2[j]; j++; 

    if ((j & 1023) == 0) j = 0;
} 
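
The wrap-around on j & 1023 keeps the working set at 3 arrays * 1024 floats * 4 bytes = 12 KB, which fits comfortably in the 32 KB L1 data cache of a Nehalem core such as the i7 920.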

The performance of this version of the code should be much closer to the theoretical peak FLOPS. Of course, it presumably doesn't solve your original problem, but hopefully it helps you understand what's going on.

Upvotes: 1

High Performance Mark

Reputation: 78364

The explanation is simple: while your processor can perform (say) 6.4 GFLOP/s, your memory subsystem can only feed data in and out at about 1/10th that rate (a broad rule of thumb for most current commodity CPUs). So achieving a sustained FLOP rate of 1/8th of the theoretical maximum for your processor is actually very good performance.

Since you seem to be dealing with several hundred megabytes of data (the three arrays total roughly 590 MB), which is far larger than the caches on your processor, your computation is I/O bound.
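
To put rough numbers on that: each pair of FLOPs in the loop touches 3 loads and 1 store of 4-byte floats, i.e. 16 bytes per 2 FLOPs, or 8 bytes per FLOP. Sustaining the 6.4 GFLOP/s peak would therefore need about 51 GB/s of memory traffic, well beyond what the i7 920's triple-channel DDR3 interface (on the order of 25 GB/s theoretical) can deliver, so the loop cannot get anywhere near peak while streaming over the whole arrays.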

Upvotes: 3
