Reputation: 1683
I have a question about the peak FP performance of my Core i7 920. I have an application that does a lot of MAC operations (basically a convolution operation), and I fall short of the CPU's peak FP performance by a factor of ~8x even when using multi-threading and SSE instructions. While trying to find the reason for this I ended up with a simplified code snippet, running on a single thread and without SSE instructions, which performs equally badly:
for (i = 0; i < 49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}
If I'm correct (the data and other_data arrays are all FP), this piece of code requires:
49335264 * 2 = 98670528 FLOPs
It executes in ~150 ms (I'm very sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).
This means the performance of this code snippet is:
98670528 / (150 × 10^-3) / 10^9 = 0.66 GFLOPS
whereas the peak performance of this CPU should be 2 * 3.2 GFLOPS (2 FP units, 3.2 GHz processor), right?
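Spelled out, the gap for this single-threaded, non-SSE snippet is

\[ \frac{P_{\text{peak}}}{P_{\text{measured}}} = \frac{2 \times 3.2}{0.66} \approx 9.7 \]

so roughly a factor of 10.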
Is there any explanation for this huge gap? Because I cannot explain it.
Thanks a lot in advance, and I could really use your help!
Upvotes: 3
Views: 652
Reputation: 30449
I would use SSE.
Edit: I ran some more tests myself and discovered that your program is neither limited by memory bandwidth (the theoretical limit is about 3-4 times higher than your result) nor by floating-point performance (the limit there is even higher); it is limited by lazy allocation of memory pages by the OS.
#include <chrono>
#include <iostream>
#include <x86intrin.h>

using namespace std::chrono;

static const unsigned size = 49335264;
float data[size], other_data[size], other_data2[size];

int main() {
#if 0
    // Optional initialization; without it the arrays remain untouched zero pages.
    for (unsigned i = 0; i < size; i++) {
        data[i] = i;
        other_data[i] = i;
        other_data2[i] = i;
    }
#endif
    system_clock::time_point start = system_clock::now();
    for (unsigned i = 0; i < size; i++)
        data[i] += other_data[i] * other_data2[i];
    microseconds timeUsed = duration_cast<microseconds>(system_clock::now() - start);

    std::cout << "Used " << timeUsed.count() << " us, "
              << 2 * size / (timeUsed.count() / 1e6 * 1e9) << " GFLOPS\n";
}
Compile with g++ -O3 -march=native -std=c++0x. The program gives
Used 212027 us, 0.465368 GFLOPS
as output, even though the hot loop compiles to
400848: vmovaps 0xc234100(%rdx),%ymm0
400850: vmulps 0x601180(%rdx),%ymm0,%ymm0
400858: vaddps 0x17e67080(%rdx),%ymm0,%ymm0
400860: vmovaps %ymm0,0x17e67080(%rdx)
400868: add $0x20,%rdx
40086c: cmp $0xbc32f80,%rdx
400873: jne 400848 <main+0x18>
This means it is fully vectorized, using 8 floats per iteration and even taking advantage of AVX.
After playing around with streaming instructions like movntdq, which didn't buy anything, I decided to actually initialize the arrays with something; otherwise they remain zero pages, which only get mapped to real memory when they are written to. Changing the #if 0 to #if 1 immediately yields
Used 48843 us, 2.02016 GFLOPS
which comes pretty close to the memory bandwidth of the system (4 floats of 4 bytes each per two FLOPs gives 16 bytes of traffic, so about 16 GB/s at 2 GFLOPS; the theoretical limit is 2 channels of DDR3 at 10.667 GB/s each).
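If lazy allocation is the whole story, fully initializing the arrays shouldn't even be necessary: touching one element per page before the timed loop ought to have the same effect. A minimal sketch of what could replace the #if 0 block (assuming 4 KiB pages; the written value doesn't matter):

    // Sketch: instead of initializing every element, write one value per 4 KiB
    // page so the OS has to back each page with real memory before the timed loop.
    // floatsPerPage and the value 1.0f are arbitrary illustrations.
    const unsigned floatsPerPage = 4096 / sizeof(float);
    for (unsigned i = 0; i < size; i += floatsPerPage) {
        data[i] = other_data[i] = other_data2[i] = 1.0f;
    }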
Upvotes: 5
Reputation: 1683
I looked at the assembly code generated for the multiply-accumulate in the code snippet from my first post, and it looks like:
movq 0x80(%rbx), %rcx
movq 0x138(%rbx), %rdi
movq 0x120(%rbx), %rdx
movq (%rcx), %rsi
movq 0x8(%rdi), %r8
movq 0x8(%rdx), %r9
movssl 0x1400(%rsi), %xmm0
mulssl 0x90(%r8), %xmm0
addssl 0x9f8(%r9), %xmm0
movssl %xmm0, 0x9f8(%r9)
I estimated from the total number of cycles that it takes ~10 cycles to execute the multiply-accumulate.
The problem seems to be that the compiler is unable to pipeline the execution of the loop, even though there are no inter-loop dependencies. Am I correct?
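Judging from that disassembly, the base pointers of the arrays appear to be reloaded from memory for every element, which presumably also blocks vectorization. One obvious thing to try is hoisting them into restrict-qualified locals so the compiler can keep them in registers and knows the store cannot alias the loads. A rough sketch (the struct and member names are invented, since I haven't posted the surrounding code):

    // Sketch only: Buffers and its members are made-up names standing in for
    // however the real code reaches the three arrays.  Loading the pointers once
    // into __restrict__-qualified locals lets the compiler keep them in registers
    // and tells it that the store to out[] cannot alias the two input arrays.
    struct Buffers {
        float* data;
        float* other_data;
        float* other_data2;
    };

    void mac(Buffers* b, unsigned n)
    {
        float*       __restrict__ out = b->data;
        const float* __restrict__ a   = b->other_data;
        const float* __restrict__ c   = b->other_data2;

        for (unsigned i = 0; i < n; i++)
            out[i] += a[i] * c[i];
    }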
Does anybody have any other ideas / solutions for this?
Thanks for the help so far!
Upvotes: 0
Reputation: 7392
As High Performance Mark explained, your test is very likely to be memory bound rather than compute-bound.
One thing I'd like to add is that to quantify this effect, you can modify the test so that it operates on data that fits into the L1 cache:
for (i = 0, j = 0; i < 6166908; i++)
{
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    // wrap j after 1024 elements so the working set stays at 3 x 4 KB, well inside L1
    if ((j & 1023) == 0) j = 0;
}
The performance of this version of the code should be closer to the theoretical FLOPS maximum. Of course, it presumably doesn't solve your original problem, but hopefully it helps you understand what's going on.
Upvotes: 1
Reputation: 78364
The explanation is simple: while your processor can deliver (say) 6.4 GFLOPS, your memory subsystem can only feed data in and out at about 1/10th of that rate (a broad rule of thumb for most current commodity CPUs). So achieving a sustained FLOPS rate of 1/8th of the theoretical maximum for your processor is actually very good performance.
Since you seem to be dealing with about 370MB of data, which is probably larger than the caches on your processor, your computation is I/O bound.
Upvotes: 3