Reputation: 335
I've written some C++ backpropagation code which I'm running on an i9-9900K under Ubuntu 18.04.
The issue I'm seeing is that multithreaded performance gets progressively worse with newer versions of g++.
Single-threaded benchmarks improve as expected with newer g++ versions:
g++ 4.8: 5437 cycles/s
g++ 5.5: 5929 cycles/s
g++ 6.5: 5932 cycles/s
g++ 7.4: 6117 cycles/s
g++ 8.3: 6921 cycles/s
Multi-threaded benchmarks (14 pthreads on 8 cores) degrade significantly with newer versions:
g++ 4.8: 25456 cycles/s
g++ 5.5: 17212 cycles/s
g++ 6.5: 18616 cycles/s
g++ 7.4: 17054 cycles/s
g++ 8.3: 14797 cycles/s
I've seen similar behavior on CentOS 7.6 and Clear Linux as well. Across all tested operating systems, the fastest performance came from using 14 threads with g++ 4.8.
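For context, the benchmark has roughly this shape: worker pthreads each run training cycles in a loop while a shared counter tracks cycles/s. This is a minimal sketch with a made-up dummy workload and working-set size, not my actual backprop code:

    // Minimal sketch of the benchmark structure (hypothetical workload and sizes).
    // Build with: g++ -std=c++11 -Ofast -pthread sketch.cpp
    #include <pthread.h>
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static std::atomic<long> cycles(0);
    static std::atomic<bool> stop(false);
    static volatile float sink; // keep the optimizer from deleting the work

    // Stand-in for one backprop training cycle: a vectorizable weight update.
    static void training_cycle(std::vector<float>& w) {
        for (float& x : w) x = x * 0.999f + 0.001f;
    }

    static void* worker(void*) {
        std::vector<float> weights(1 << 16, 1.0f); // per-thread working set (assumed size)
        while (!stop.load(std::memory_order_relaxed)) {
            training_cycle(weights);
            cycles.fetch_add(1, std::memory_order_relaxed);
        }
        sink = weights[0];
        return nullptr;
    }

    int main() {
        const int nthreads = 14; // 14 pthreads on 8 cores, as benchmarked above
        std::vector<pthread_t> tids(nthreads);
        for (pthread_t& t : tids) pthread_create(&t, nullptr, worker, nullptr);
        std::this_thread::sleep_for(std::chrono::seconds(10));
        stop.store(true);
        for (pthread_t& t : tids) pthread_join(t, nullptr);
        std::printf("%ld cycles/s\n", cycles.load() / 10);
        return 0;
    }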
Here are the compilation flags I'm using: g++ -c -std=c++11 -march=native -Ofast
Am I using the wrong compilation flags? I've tried -O3; the degradation is similar, though less extreme (and slower overall than -Ofast):
g++ 4.8 -O3: 17256 cycles/s
g++ 5.5 -O3: 15129 cycles/s
g++ 6.5 -O3: 15779 cycles/s
g++ 7.4 -O3: 15736 cycles/s
g++ 8.3 -O3: 13361 cycles/s
I suspect I'm running into a memory-bandwidth bottleneck with this many cores. Are there any compilation options that can reduce the memory pressure from so many threads?
Upvotes: 4
Views: 167
Reputation: 335
Further testing revealed that the issue was related to the -march=native optimization flag.
g++ 4.8 treats the i9-9900K natively as core-avx2, which activates MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AES, and PCLMUL.
g++ 4.9 and greater treat the i9-9900K natively as broadwell, which activates MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2, F16C, RDSEED, ADCX, and PREFETCHW.
Apparently the AVX/AVX2 code generation this enables somehow results in over-optimization for this workload.
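If you want to check what -march=native resolves to on your own machine, g++ will report it (the exact output format differs between versions; command shown for illustration):

    g++ -march=native -Q --help=target | grep march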
Removing the -march flag altogether fixed the issue. Disabling AVX with -mno-avx and -mno-avx2 also worked.
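In other words, either of these flag sets avoids the problem (the same flags as above, with the fix applied):

    g++ -c -std=c++11 -Ofast
    g++ -c -std=c++11 -march=native -mno-avx -mno-avx2 -Ofast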
Upvotes: 1