Reputation: 40118
I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.
#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;    // memory from new[] must be released with delete[], not free()
}
As you see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the time it takes. In the upper case, the inner loop variable is unsigned; in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
g++ -O3 -march=native -std=c++11 test.cpp -o test
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB of random data):
As you see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test
Running test 1 gives almost the same result, and it is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:
uint64_t size = atol(argv[1]) << 20;
to
uint64_t size = 1 << 20;
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:
Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

Wait, what? Now both versions dropped to the slow number of 15 GB/s. So replacing a non-constant by a constant value even leads to slow code in both cases for Clang!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so the effect does not seem to be specific to Haswell. And because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:
static uint64_t size=atol(argv[1])<<20;
Here are my results with g++:
Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.
Can you explain these results? Especially:

- What is the difference between the u32 and u64 versions?
- How can replacing a non-constant by a constant buffer size trigger less optimal code?
- How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could again mix up the results totally. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad at small changes. What do you think? Is there another way to reliably get the code with the most performance?
Here is the disassembly for the various results:
26 GB/s version from g++ / u32 / non-const bufsize:
0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8
13 GB/s version from g++ / u64 / non-const bufsize:
0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00
15 GB/s version from clang++ / u64 / non-const bufsize:
0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50
20 GB/s version from g++ / u32&u64 / const bufsize:
0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68
15 GB/s version from clang++ / u32&u64 / const bufsize:
0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?
No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I had never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage class of the size variable can make a huge difference, as we saw with the insertion of the static keyword! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.
Upvotes: 1651
Views: 199937
Reputation: 7457
This is not an answer, but feedback with a few compilers of 2021, on an Intel Coffee Lake i9-9900K.
With Microsoft compiler (VS2019), toolset v142:
unsigned    209695540000    1.8322 sec     28.6152 GB/s
uint64_t    209695540000    3.08764 sec    16.9802 GB/s
With Intel compiler 2021:
unsigned    209695540000    1.70845 sec    30.688 GB/s
uint64_t    209695540000    1.57956 sec    33.1921 GB/s
According to Mysticial's answer below, the Intel compiler is aware of the false data dependency, but the Microsoft compiler is not.
For the Intel compiler, I used /QxHost (optimize for the host CPU's architecture) and /Oi (enable intrinsic functions), and #include <nmmintrin.h> instead of #include <immintrin.h>.
Full compile command: /GS /W3 /QxHost /Gy /Zi /O2 /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" //fprofile-instr-use "x64\Release\" /Fp"x64\Release\Benchmark.pch".
The assembly from ICC, decompiled by IDA 7.5:
int __cdecl main(int argc, const char **argv, const char **envp)
{
  int v6; // er13
  _BYTE *v8; // rsi
  unsigned int v9; // edi
  unsigned __int64 i; // rbx
  unsigned __int64 v11; // rdi
  int v12; // ebp
  __int64 v13; // r14
  __int64 v14; // rbx
  unsigned int v15; // eax
  unsigned __int64 v16; // rcx
  unsigned int v17; // eax
  unsigned __int64 v18; // rcx
  __int64 v19; // rdx
  unsigned int v20; // eax
  int result; // eax
  std::ostream *v23; // rbx
  char v24; // dl
  std::ostream *v33; // rbx
  std::ostream *v41; // rbx
  __int64 v42; // rdx
  unsigned int v43; // eax
  int v44; // ebp
  __int64 v45; // r14
  __int64 v46; // rbx
  unsigned __int64 v47; // rax
  unsigned __int64 v48; // rax
  std::ostream *v50; // rdi
  char v51; // dl
  std::ostream *v58; // rdi
  std::ostream *v60; // rdi
  __int64 v61; // rdx
  unsigned int v62; // eax

  __asm
  {
    vmovdqa [rsp+98h+var_58], xmm8
    vmovapd [rsp+98h+var_68], xmm7
    vmovapd [rsp+98h+var_78], xmm6
  }
  if ( argc == 2 )
  {
    v6 = atol(argv[1]) << 20;
    _R15 = v6;
    v8 = operator new[](v6);
    if ( v6 )
    {
      v9 = 1;
      for ( i = 0i64; i < v6; i = v9++ )
        v8[i] = rand();
    }
    v11 = (unsigned __int64)v6 >> 3;
    v12 = 0;
    v13 = Xtime_get_ticks_0();
    v14 = 0i64;
    do
    {
      if ( v6 )
      {
        v15 = 4;
        v16 = 0i64;
        do
        {
          v14 += __popcnt(*(_QWORD *)&v8[8 * v16])
               + __popcnt(*(_QWORD *)&v8[8 * v15 - 24])
               + __popcnt(*(_QWORD *)&v8[8 * v15 - 16])
               + __popcnt(*(_QWORD *)&v8[8 * v15 - 8]);
          v16 = v15;
          v15 += 4;
        }
        while ( v11 > v16 );
        v17 = 4;
        v18 = 0i64;
        do
        {
          v14 += __popcnt(*(_QWORD *)&v8[8 * v18])
               + __popcnt(*(_QWORD *)&v8[8 * v17 - 24])
               + __popcnt(*(_QWORD *)&v8[8 * v17 - 16])
               + __popcnt(*(_QWORD *)&v8[8 * v17 - 8]);
          v18 = v17;
          v17 += 4;
        }
        while ( v11 > v18 );
      }
      v12 += 2;
    }
    while ( v12 != 10000 );
    _RBP = 100 * (Xtime_get_ticks_0() - v13);
    std::operator___std::char_traits_char___(std::cout, "unsigned\t");
    v23 = (std::ostream *)std::ostream::operator<<(std::cout, v14);
    std::operator___std::char_traits_char____0(v23, v24);
    __asm
    {
      vmovq xmm0, rbp
      vmovdqa xmm8, cs:__xmm@00000000000000004530000043300000
      vpunpckldq xmm0, xmm0, xmm8
      vmovapd xmm7, cs:__xmm@45300000000000004330000000000000
      vsubpd xmm0, xmm0, xmm7
      vpermilpd xmm1, xmm0, 1
      vaddsd xmm6, xmm1, xmm0
      vdivsd xmm1, xmm6, cs:__real@41cdcd6500000000
    }
    v33 = (std::ostream *)std::ostream::operator<<(v23);
    std::operator___std::char_traits_char___(v33, " sec \t");
    __asm
    {
      vmovq xmm0, r15
      vpunpckldq xmm0, xmm0, xmm8
      vsubpd xmm0, xmm0, xmm7
      vpermilpd xmm1, xmm0, 1
      vaddsd xmm0, xmm1, xmm0
      vmulsd xmm7, xmm0, cs:__real@40c3880000000000
      vdivsd xmm1, xmm7, xmm6
    }
    v41 = (std::ostream *)std::ostream::operator<<(v33);
    std::operator___std::char_traits_char___(v41, " GB/s");
    LOBYTE(v42) = 10;
    v43 = std::ios::widen((char *)v41 + *(int *)(*(_QWORD *)v41 + 4i64), v42);
    std::ostream::put(v41, v43);
    std::ostream::flush(v41);
    v44 = 0;
    v45 = Xtime_get_ticks_0();
    v46 = 0i64;
    do
    {
      if ( v6 )
      {
        v47 = 0i64;
        do
        {
          v46 += __popcnt(*(_QWORD *)&v8[8 * v47])
               + __popcnt(*(_QWORD *)&v8[8 * v47 + 8])
               + __popcnt(*(_QWORD *)&v8[8 * v47 + 16])
               + __popcnt(*(_QWORD *)&v8[8 * v47 + 24]);
          v47 += 4i64;
        }
        while ( v47 < v11 );
        v48 = 0i64;
        do
        {
          v46 += __popcnt(*(_QWORD *)&v8[8 * v48])
               + __popcnt(*(_QWORD *)&v8[8 * v48 + 8])
               + __popcnt(*(_QWORD *)&v8[8 * v48 + 16])
               + __popcnt(*(_QWORD *)&v8[8 * v48 + 24]);
          v48 += 4i64;
        }
        while ( v48 < v11 );
      }
      v44 += 2;
    }
    while ( v44 != 10000 );
    _RBP = 100 * (Xtime_get_ticks_0() - v45);
    std::operator___std::char_traits_char___(std::cout, "uint64_t\t");
    v50 = (std::ostream *)std::ostream::operator<<(std::cout, v46);
    std::operator___std::char_traits_char____0(v50, v51);
    __asm
    {
      vmovq xmm0, rbp
      vpunpckldq xmm0, xmm0, cs:__xmm@00000000000000004530000043300000
      vsubpd xmm0, xmm0, cs:__xmm@45300000000000004330000000000000
      vpermilpd xmm1, xmm0, 1
      vaddsd xmm6, xmm1, xmm0
      vdivsd xmm1, xmm6, cs:__real@41cdcd6500000000
    }
    v58 = (std::ostream *)std::ostream::operator<<(v50);
    std::operator___std::char_traits_char___(v58, " sec \t");
    __asm { vdivsd xmm1, xmm7, xmm6 }
    v60 = (std::ostream *)std::ostream::operator<<(v58);
    std::operator___std::char_traits_char___(v60, " GB/s");
    LOBYTE(v61) = 10;
    v62 = std::ios::widen((char *)v60 + *(int *)(*(_QWORD *)v60 + 4i64), v61);
    std::ostream::put(v60, v62);
    std::ostream::flush(v60);
    free(v8);
    result = 0;
  }
  else
  {
    std::operator___std::char_traits_char___(std::cerr, "usage: array_size in MB");
    LOBYTE(v19) = 10;
    v20 = std::ios::widen((char *)&std::cerr + *((int *)std::cerr + 1), v19);
    std::ostream::put(std::cerr, v20);
    std::ostream::flush(std::cerr);
    result = -1;
  }
  __asm
  {
    vmovaps xmm6, [rsp+98h+var_78]
    vmovaps xmm7, [rsp+98h+var_68]
    vmovaps xmm8, [rsp+98h+var_58]
  }
  return result;
}
And the disassembly of main:
.text:0140001000 .686p
.text:0140001000 .mmx
.text:0140001000 .model flat
.text:0140001000
.text:0140001000 ; ===========================================================================
.text:0140001000
.text:0140001000 ; Segment type: Pure code
.text:0140001000 ; Segment permissions: Read/Execute
.text:0140001000 _text segment para public 'CODE' use64
.text:0140001000 assume cs:_text
.text:0140001000 ;org 140001000h
.text:0140001000 assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
.text:0140001000
.text:0140001000 ; =============== S U B R O U T I N E =======================================
.text:0140001000
.text:0140001000
.text:0140001000 ; int __cdecl main(int argc, const char **argv, const char **envp)
.text:0140001000 main proc near ; CODE XREF: __scrt_common_main_seh+107↓p
.text:0140001000 ; DATA XREF: .pdata:ExceptionDir↓o
.text:0140001000
.text:0140001000 var_78 = xmmword ptr -78h
.text:0140001000 var_68 = xmmword ptr -68h
.text:0140001000 var_58 = xmmword ptr -58h
.text:0140001000
.text:0140001000 push r15
.text:0140001002 push r14
.text:0140001004 push r13
.text:0140001006 push r12
.text:0140001008 push rsi
.text:0140001009 push rdi
.text:014000100A push rbp
.text:014000100B push rbx
.text:014000100C sub rsp, 58h
.text:0140001010 vmovdqa [rsp+98h+var_58], xmm8
.text:0140001016 vmovapd [rsp+98h+var_68], xmm7
.text:014000101C vmovapd [rsp+98h+var_78], xmm6
.text:0140001022 cmp ecx, 2
.text:0140001025 jnz loc_14000113E
.text:014000102B mov rcx, [rdx+8] ; String
.text:014000102F call cs:__imp_atol
.text:0140001035 mov r13d, eax
.text:0140001038 shl r13d, 14h
.text:014000103C movsxd r15, r13d
.text:014000103F mov rcx, r15 ; size
.text:0140001042 call ??_U@YAPEAX_K@Z ; operator new[](unsigned __int64)
.text:0140001047 mov rsi, rax
.text:014000104A test r15d, r15d
.text:014000104D jz short loc_14000106E
.text:014000104F mov edi, 1
.text:0140001054 xor ebx, ebx
.text:0140001056 mov rbp, cs:__imp_rand
.text:014000105D nop dword ptr [rax]
.text:0140001060
.text:0140001060 loc_140001060: ; CODE XREF: main+6C↓j
.text:0140001060 call rbp ; __imp_rand
.text:0140001062 mov [rsi+rbx], al
.text:0140001065 mov ebx, edi
.text:0140001067 inc edi
.text:0140001069 cmp rbx, r15
.text:014000106C jb short loc_140001060
.text:014000106E
.text:014000106E loc_14000106E: ; CODE XREF: main+4D↑j
.text:014000106E mov rdi, r15
.text:0140001071 shr rdi, 3
.text:0140001075 xor ebp, ebp
.text:0140001077 call _Xtime_get_ticks_0
.text:014000107C mov r14, rax
.text:014000107F xor ebx, ebx
.text:0140001081 jmp short loc_14000109F
.text:0140001081 ; ---------------------------------------------------------------------------
.text:0140001083 align 10h
.text:0140001090
.text:0140001090 loc_140001090: ; CODE XREF: main+A2↓j
.text:0140001090 ; main+EC↓j ...
.text:0140001090 add ebp, 2
.text:0140001093 cmp ebp, 2710h
.text:0140001099 jz loc_140001184
.text:014000109F
.text:014000109F loc_14000109F: ; CODE XREF: main+81↑j
.text:014000109F test r13d, r13d
.text:01400010A2 jz short loc_140001090
.text:01400010A4 mov eax, 4
.text:01400010A9 xor ecx, ecx
.text:01400010AB nop dword ptr [rax+rax+00h]
.text:01400010B0
.text:01400010B0 loc_1400010B0: ; CODE XREF: main+E7↓j
.text:01400010B0 popcnt rcx, qword ptr [rsi+rcx*8]
.text:01400010B6 add rcx, rbx
.text:01400010B9 lea edx, [rax-3]
.text:01400010BC popcnt rdx, qword ptr [rsi+rdx*8]
.text:01400010C2 add rdx, rcx
.text:01400010C5 lea ecx, [rax-2]
.text:01400010C8 popcnt rcx, qword ptr [rsi+rcx*8]
.text:01400010CE add rcx, rdx
.text:01400010D1 lea edx, [rax-1]
.text:01400010D4 xor ebx, ebx
.text:01400010D6 popcnt rbx, qword ptr [rsi+rdx*8]
.text:01400010DC add rbx, rcx
.text:01400010DF mov ecx, eax
.text:01400010E1 add eax, 4
.text:01400010E4 cmp rdi, rcx
.text:01400010E7 ja short loc_1400010B0
.text:01400010E9 test r13d, r13d
.text:01400010EC jz short loc_140001090
.text:01400010EE mov eax, 4
.text:01400010F3 xor ecx, ecx
.text:01400010F5 db 2Eh
.text:01400010F5 nop word ptr [rax+rax+00000000h]
.text:01400010FF nop
.text:0140001100
.text:0140001100 loc_140001100: ; CODE XREF: main+137↓j
.text:0140001100 popcnt rcx, qword ptr [rsi+rcx*8]
.text:0140001106 add rcx, rbx
.text:0140001109 lea edx, [rax-3]
.text:014000110C popcnt rdx, qword ptr [rsi+rdx*8]
.text:0140001112 add rdx, rcx
.text:0140001115 lea ecx, [rax-2]
.text:0140001118 popcnt rcx, qword ptr [rsi+rcx*8]
.text:014000111E add rcx, rdx
.text:0140001121 lea edx, [rax-1]
.text:0140001124 xor ebx, ebx
.text:0140001126 popcnt rbx, qword ptr [rsi+rdx*8]
.text:014000112C add rbx, rcx
.text:014000112F mov ecx, eax
.text:0140001131 add eax, 4
.text:0140001134 cmp rdi, rcx
.text:0140001137 ja short loc_140001100
.text:0140001139 jmp loc_140001090
.text:014000113E ; ---------------------------------------------------------------------------
.text:014000113E
.text:014000113E loc_14000113E: ; CODE XREF: main+25↑j
.text:014000113E mov rsi, cs:__imp_?cerr@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cerr
.text:0140001145 lea rdx, aUsageArraySize ; "usage: array_size in MB"
.text:014000114C mov rcx, rsi ; std::ostream *
.text:014000114F call std__operator___std__char_traits_char___
.text:0140001154 mov rax, [rsi]
.text:0140001157 movsxd rcx, dword ptr [rax+4]
.text:014000115B add rcx, rsi
.text:014000115E mov dl, 0Ah
.text:0140001160 call cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
.text:0140001166 mov rcx, rsi
.text:0140001169 mov edx, eax
.text:014000116B call cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
.text:0140001171 mov rcx, rsi
.text:0140001174 call cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
.text:014000117A mov eax, 0FFFFFFFFh
.text:014000117F jmp loc_1400013E2
.text:0140001184 ; ---------------------------------------------------------------------------
.text:0140001184
.text:0140001184 loc_140001184: ; CODE XREF: main+99↑j
.text:0140001184 call _Xtime_get_ticks_0
.text:0140001189 sub rax, r14
.text:014000118C imul rbp, rax, 64h ; 'd'
.text:0140001190 mov r14, cs:__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cout
.text:0140001197 lea rdx, aUnsigned ; "unsigned\t"
.text:014000119E mov rcx, r14 ; std::ostream *
.text:01400011A1 call std__operator___std__char_traits_char___
.text:01400011A6 mov rcx, r14
.text:01400011A9 mov rdx, rbx
.text:01400011AC call cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@_K@Z ; std::ostream::operator<<(unsigned __int64)
.text:01400011B2 mov rbx, rax
.text:01400011B5 mov rcx, rax ; std::ostream *
.text:01400011B8 call std__operator___std__char_traits_char____0
.text:01400011BD vmovq xmm0, rbp
.text:01400011C2 vmovdqa xmm8, cs:__xmm@00000000000000004530000043300000
.text:01400011CA vpunpckldq xmm0, xmm0, xmm8
.text:01400011CF vmovapd xmm7, cs:__xmm@45300000000000004330000000000000
.text:01400011D7 vsubpd xmm0, xmm0, xmm7
.text:01400011DB vpermilpd xmm1, xmm0, 1
.text:01400011E1 vaddsd xmm6, xmm1, xmm0
.text:01400011E5 vdivsd xmm1, xmm6, cs:__real@41cdcd6500000000
.text:01400011ED mov r12, cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@N@Z ; std::ostream::operator<<(double)
.text:01400011F4 mov rcx, rbx
.text:01400011F7 call r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:01400011FA mov rbx, rax
.text:01400011FD lea rdx, aSec ; " sec \t"
.text:0140001204 mov rcx, rax ; std::ostream *
.text:0140001207 call std__operator___std__char_traits_char___
.text:014000120C vmovq xmm0, r15
.text:0140001211 vpunpckldq xmm0, xmm0, xmm8
.text:0140001216 vsubpd xmm0, xmm0, xmm7
.text:014000121A vpermilpd xmm1, xmm0, 1
.text:0140001220 vaddsd xmm0, xmm1, xmm0
.text:0140001224 vmulsd xmm7, xmm0, cs:__real@40c3880000000000
.text:014000122C vdivsd xmm1, xmm7, xmm6
.text:0140001230 mov rcx, rbx
.text:0140001233 call r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:0140001236 mov rbx, rax
.text:0140001239 lea rdx, aGbS ; " GB/s"
.text:0140001240 mov rcx, rax ; std::ostream *
.text:0140001243 call std__operator___std__char_traits_char___
.text:0140001248 mov rax, [rbx]
.text:014000124B movsxd rcx, dword ptr [rax+4]
.text:014000124F add rcx, rbx
.text:0140001252 mov dl, 0Ah
.text:0140001254 call cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
.text:014000125A mov rcx, rbx
.text:014000125D mov edx, eax
.text:014000125F call cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
.text:0140001265 mov rcx, rbx
.text:0140001268 call cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
.text:014000126E xor ebp, ebp
.text:0140001270 call _Xtime_get_ticks_0
.text:0140001275 mov r14, rax
.text:0140001278 xor ebx, ebx
.text:014000127A jmp short loc_14000128F
.text:014000127A ; ---------------------------------------------------------------------------
.text:014000127C align 20h
.text:0140001280
.text:0140001280 loc_140001280: ; CODE XREF: main+292↓j
.text:0140001280 ; main+2DB↓j ...
.text:0140001280 add ebp, 2
.text:0140001283 cmp ebp, 2710h
.text:0140001289 jz loc_14000131D
.text:014000128F
.text:014000128F loc_14000128F: ; CODE XREF: main+27A↑j
.text:014000128F test r13d, r13d
.text:0140001292 jz short loc_140001280
.text:0140001294 xor eax, eax
.text:0140001296 db 2Eh
.text:0140001296 nop word ptr [rax+rax+00000000h]
.text:01400012A0
.text:01400012A0 loc_1400012A0: ; CODE XREF: main+2D6↓j
.text:01400012A0 xor ecx, ecx
.text:01400012A2 popcnt rcx, qword ptr [rsi+rax*8]
.text:01400012A8 add rcx, rbx
.text:01400012AB xor edx, edx
.text:01400012AD popcnt rdx, qword ptr [rsi+rax*8+8]
.text:01400012B4 add rdx, rcx
.text:01400012B7 xor ecx, ecx
.text:01400012B9 popcnt rcx, qword ptr [rsi+rax*8+10h]
.text:01400012C0 add rcx, rdx
.text:01400012C3 xor ebx, ebx
.text:01400012C5 popcnt rbx, qword ptr [rsi+rax*8+18h]
.text:01400012CC add rbx, rcx
.text:01400012CF add rax, 4
.text:01400012D3 cmp rax, rdi
.text:01400012D6 jb short loc_1400012A0
.text:01400012D8 test r13d, r13d
.text:01400012DB jz short loc_140001280
.text:01400012DD xor eax, eax
.text:01400012DF nop
.text:01400012E0
.text:01400012E0 loc_1400012E0: ; CODE XREF: main+316↓j
.text:01400012E0 xor ecx, ecx
.text:01400012E2 popcnt rcx, qword ptr [rsi+rax*8]
.text:01400012E8 add rcx, rbx
.text:01400012EB xor edx, edx
.text:01400012ED popcnt rdx, qword ptr [rsi+rax*8+8]
.text:01400012F4 add rdx, rcx
.text:01400012F7 xor ecx, ecx
.text:01400012F9 popcnt rcx, qword ptr [rsi+rax*8+10h]
.text:0140001300 add rcx, rdx
.text:0140001303 xor ebx, ebx
.text:0140001305 popcnt rbx, qword ptr [rsi+rax*8+18h]
.text:014000130C add rbx, rcx
.text:014000130F add rax, 4
.text:0140001313 cmp rax, rdi
.text:0140001316 jb short loc_1400012E0
.text:0140001318 jmp loc_140001280
.text:014000131D ; ---------------------------------------------------------------------------
.text:014000131D
.text:014000131D loc_14000131D: ; CODE XREF: main+289↑j
.text:014000131D call _Xtime_get_ticks_0
.text:0140001322 sub rax, r14
.text:0140001325 imul rbp, rax, 64h ; 'd'
.text:0140001329 mov rdi, cs:__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cout
.text:0140001330 lea rdx, aUint64T ; "uint64_t\t"
.text:0140001337 mov rcx, rdi ; std::ostream *
.text:014000133A call std__operator___std__char_traits_char___
.text:014000133F mov rcx, rdi
.text:0140001342 mov rdx, rbx
.text:0140001345 call cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@_K@Z ; std::ostream::operator<<(unsigned __int64)
.text:014000134B mov rdi, rax
.text:014000134E mov rcx, rax ; std::ostream *
.text:0140001351 call std__operator___std__char_traits_char____0
.text:0140001356 vmovq xmm0, rbp
.text:014000135B vpunpckldq xmm0, xmm0, cs:__xmm@00000000000000004530000043300000
.text:0140001363 vsubpd xmm0, xmm0, cs:__xmm@45300000000000004330000000000000
.text:014000136B vpermilpd xmm1, xmm0, 1
.text:0140001371 vaddsd xmm6, xmm1, xmm0
.text:0140001375 vdivsd xmm1, xmm6, cs:__real@41cdcd6500000000
.text:014000137D mov rcx, rdi
.text:0140001380 call r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:0140001383 mov rdi, rax
.text:0140001386 lea rdx, aSec ; " sec \t"
.text:014000138D mov rcx, rax ; std::ostream *
.text:0140001390 call std__operator___std__char_traits_char___
.text:0140001395 vdivsd xmm1, xmm7, xmm6
.text:0140001399 mov rcx, rdi
.text:014000139C call r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
.text:014000139F mov rdi, rax
.text:01400013A2 lea rdx, aGbS ; " GB/s"
.text:01400013A9 mov rcx, rax ; std::ostream *
.text:01400013AC call std__operator___std__char_traits_char___
.text:01400013B1 mov rax, [rdi]
.text:01400013B4 movsxd rcx, dword ptr [rax+4]
.text:01400013B8 add rcx, rdi
.text:01400013BB mov dl, 0Ah
.text:01400013BD call cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
.text:01400013C3 mov rcx, rdi
.text:01400013C6 mov edx, eax
.text:01400013C8 call cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
.text:01400013CE mov rcx, rdi
.text:01400013D1 call cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
.text:01400013D7 mov rcx, rsi ; Block
.text:01400013DA call cs:__imp_free
.text:01400013E0 xor eax, eax
.text:01400013E2
.text:01400013E2 loc_1400013E2: ; CODE XREF: main+17F↑j
.text:01400013E2 vmovaps xmm6, [rsp+98h+var_78]
.text:01400013E8 vmovaps xmm7, [rsp+98h+var_68]
.text:01400013EE vmovaps xmm8, [rsp+98h+var_58]
.text:01400013F4 add rsp, 58h
.text:01400013F8 pop rbx
.text:01400013F9 pop rbp
.text:01400013FA pop rdi
.text:01400013FB pop rsi
.text:01400013FC pop r12
.text:01400013FE pop r13
.text:0140001400 pop r14
.text:0140001402 pop r15
.text:0140001404 retn
.text:0140001404 main endp
See also the Coffee Lake specification update, "POPCNT instruction may take longer to execute than expected".
Upvotes: 11
Reputation: 471489
Culprit: False Data Dependency (and the compiler isn't even aware of it)
On Sandy/Ivy Bridge and Haswell processors, the instruction:
popcnt src, dest
appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, it will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).
Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: output unmodified for input = 0. (But there is no way to take advantage of that with intrinsics - only AMD documents it, and compilers don't expose it.)
(Yes, these instructions all run on the same execution unit.)
This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.
The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns registers to the variables.

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain, depending on what the register allocator decided to do:
- 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
- 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
- 20 GB/s has a chain: popcnt-popcnt → next iteration
- 26 GB/s has a chain: popcnt-popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.
Here are the results:
Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)
g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
Different Registers: 18.6195 GB/s
.L4:
movq (%rbx,%rax,8), %r8
movq 8(%rbx,%rax,8), %r9
movq 16(%rbx,%rax,8), %r10
movq 24(%rbx,%rax,8), %r11
addq $4, %rax
popcnt %r8, %r8
add %r8, %rdx
popcnt %r9, %r9
add %r9, %rcx
popcnt %r10, %r10
add %r10, %rdi
popcnt %r11, %r11
add %r11, %rsi
cmpq $131072, %rax
jne .L4
Same Register: 8.49272 GB/s
.L9:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# This time reuse "rax" for all the popcnts.
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L9
Same Register with broken chain: 17.8869 GB/s
.L14:
movq (%rbx,%rdx,8), %r9
movq 8(%rbx,%rdx,8), %r10
movq 16(%rbx,%rdx,8), %r11
movq 24(%rbx,%rdx,8), %rbp
addq $4, %rdx
# Reuse "rax" for all the popcnts.
xor %rax, %rax # Break the cross-iteration dependency by zeroing "rax".
popcnt %r9, %rax
add %rax, %rcx
popcnt %r10, %rax
add %rax, %rsi
popcnt %r11, %rax
add %rax, %r8
popcnt %rbp, %rax
add %rax, %rdi
cmpq $131072, %rdx
jne .L14
So what went wrong with the compiler?
It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of them.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.
(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
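A hedged sketch of what that compensation looks like, written as GNU inline asm in the style of the test code below (it is the same idiom as the "broken chain" variant above, and the same xor-before-popcnt pattern visible in the ICC disassembly earlier in this thread); popcnt_nodep is a hypothetical helper name, not GCC's actual output:

static inline uint64_t popcnt_nodep(uint64_t x) {
    uint64_t dst;
    __asm__("xor %0, %0    \n\t"   // zero the destination: breaks the false dependency
            "popcnt %1, %0"        // popcnt no longer waits on a stale value in dst
            : "=&r" (dst)          // early clobber: dst is written before x is consumed
            : "r" (x));
    return dst;
}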
Why does the CPU have such a false dependency?
We can speculate: it runs on the same execution unit as bsf/bsr, which do have an output dependency. (How is POPCNT implemented in hardware?) For those instructions, Intel documents the integer result for input = 0 as "undefined" (with ZF = 1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.
AMD processors do not appear to have this false dependency.
The full test code is below for reference:
#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    uint64_t size = 1 << 20;

    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %4   \n\t"
                    "add %4, %0      \n\t"
                    "popcnt %5, %5   \n\t"
                    "add %5, %1      \n\t"
                    "popcnt %6, %6   \n\t"
                    "add %6, %2      \n\t"
                    "popcnt %7, %7   \n\t"
                    "add %7, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    // This time reuse "rax" for all the popcnts.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;    // memory from new[] must be released with delete[], not free()
}
An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnts that are in the (false) dependency chain.
False Chain 0: 41959360000 0.57748 sec 18.1578 GB/s
False Chain 1: 41959360000 0.585398 sec 17.9122 GB/s
False Chain 2: 41959360000 0.645483 sec 16.2448 GB/s
False Chain 3: 41959360000 0.929718 sec 11.2784 GB/s
False Chain 4: 41959360000 1.23572 sec 8.48557 GB/s
Upvotes: 1757
Reputation: 183
TL;DR: Use __builtin intrinsics instead; they might happen to help.

I was able to make gcc 4.8.4 (and even 4.7.3 on gcc.godbolt.org) generate optimal code for this by using __builtin_popcountll, which uses the same assembly instruction but gets lucky and happens to produce code that doesn't have an unexpectedly long loop-carried dependency caused by the false dependency bug.

I am not 100% sure of my benchmarking code, but the objdump output seems to share my views. I use some other tricks (++i vs i++) to make the compiler unroll the loop for me without any movl instruction (strange behaviour, I must say).
Results:
Count: 20318230000 Elapsed: 0.411156 seconds Speed: 25.503118 GB/s
Benchmarking code:
#include <stdint.h>
#include <stddef.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t builtin_popcnt(const uint64_t* buf, size_t len){
    uint64_t cnt = 0;
    for(size_t i = 0; i < len; ++i){
        cnt += __builtin_popcountll(buf[i]);
    }
    return cnt;
}

int main(int argc, char** argv){
    if(argc != 2){
        printf("Usage: %s <buffer size in MB>\n", argv[0]);
        return -1;
    }
    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = (uint64_t*)malloc((size/8)*sizeof(*buffer));

    // Spoil copy-on-write memory allocation on *nix
    for (size_t i = 0; i < (size / 8); i++) {
        buffer[i] = random();
    }
    uint64_t count = 0;
    clock_t tic = clock();
    for(size_t i = 0; i < 10000; ++i){
        count += builtin_popcnt(buffer, size/8);
    }
    clock_t toc = clock();
    printf("Count: %lu\tElapsed: %f seconds\tSpeed: %f GB/s\n", count,
           (double)(toc - tic) / CLOCKS_PER_SEC,
           ((10000.0*size)/(((double)(toc - tic)*1e+9) / CLOCKS_PER_SEC)));
    return 0;
}
Compile options:
gcc --std=gnu99 -mpopcnt -O3 -funroll-loops -march=native bench.c -o bench
GCC version:
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
Linux kernel version:
3.19.0-58-generic
CPU information:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 70
model name : Intel(R) Core(TM) i7-4870HQ CPU @ 2.50 GHz
stepping : 1
microcode : 0xf
cpu MHz : 2494.226
cache size : 6144 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm arat pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid xsaveopt
bugs :
bogomips : 4988.45
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Upvotes: 10
Reputation: 12344
OK, I want to provide a small answer to one of the sub-questions that the OP asked and that doesn't seem to be addressed in the existing answers. Caveat: I have not done any testing, code generation, or disassembly; I just wanted to share a thought for others to possibly expound upon.

Why does the static keyword change the performance?

The line in question:
uint64_t size = atol(argv[1])<<20;
I would look at the assembly generated for accessing size and see whether there are extra steps of pointer indirection involved in the non-static version.

Since there is only one copy of the variable whether it was declared static or not, and the size doesn't change, I theorize that the difference is the location of the memory used to back the variable, along with where it is used further down in the code.
OK, to start with the obvious: remember that all local variables (along with parameters) of a function are provided space on the stack for use as storage. Now, obviously, the stack frame for main() never gets cleaned up and is only generated once. OK, what about making it static? Well, in that case the compiler knows to reserve space in the global data space of the process, so the location cannot be cleared by the removal of a stack frame. But we still have only one location, so what is the difference? I suspect it has to do with how memory locations on the stack are referenced.

When the compiler is generating the symbol table, it just makes an entry for a label along with relevant attributes, like size, etc. It knows that it must reserve the appropriate space in memory, but it doesn't actually pick that location until somewhat later in the process, after doing liveness analysis and possibly register allocation. How, then, does the linker know what address to provide to the machine code for the final assembly? It either knows the final location or knows how to arrive at it. With a stack variable, it is pretty simple to refer to a location based on two elements: the pointer to the stack frame and an offset into the frame. This is basically because the linker cannot know the location of the stack frame before runtime.
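To make the two addressing forms concrete, here is a hedged sketch (hypothetical variable names; the comments describe typical x86-64 codegen, not the OP's actual binaries):

static uint64_t size_static;   // backed by the data segment: loads use a fixed,
                               // link-time-known address (RIP-relative on x86-64)

int main() {
    uint64_t size_local = 42;  // backed by main()'s stack frame: loads use
                               // stack-pointer + offset, resolved only at run time
    return (int)(size_local + size_static);
}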
Upvotes: -5
Reputation: 2050
First of all, try to estimate peak performance. Examine the Intel optimization manual (https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf), in particular Appendix C.

In your case, it's table C-10 that shows the POPCNT instruction has latency = 3 clocks and throughput = 1 clock. The throughput gives your maximal rate in clocks: multiply by the core frequency, and by 8 bytes in the case of popcnt64, to get your best possible bandwidth number. For the OP's 3.50 GHz Haswell, that is 3.5 GHz × 1 popcnt/clock × 8 bytes = 28 GB/s.

Now examine what the compiler did and sum up the throughputs of all the other instructions in the loop. This gives the best possible estimate for the generated code.

Finally, look at the data dependencies between instructions in the loop, as they force a latency-sized delay instead of a throughput-sized one: split the instructions of a single iteration into data flow chains, compute the latency along each chain, and naively take the maximum. This gives a rough estimate that takes data flow dependencies into account.
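A hedged worked example using those table numbers (latency 3, throughput 1): if all four popcnts of one unrolled iteration land on a single serial chain, that iteration costs at least 4 × 3 = 12 cycles for 32 bytes, i.e. about 32 B / 12 cycles × 3.5 GHz ≈ 9 GB/s; with independent chains, the throughput bound of 4 × 1 = 4 cycles for the same 32 bytes allows the full ≈ 28 GB/s.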
However, in your case, just writing the code the right way would eliminate all these complexities. Instead of accumulating into the same count variable, accumulate into different ones (like count0, count1, ... count8) and sum them up at the end. Or even create an array counts[8] and accumulate into its elements - perhaps it will even get vectorized and you will get much better throughput.
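A minimal sketch of that multiple-accumulator pattern, using four accumulators to match the question's 4-way unroll (assuming the buffer, size, and count variables from the question's code):

uint64_t count0 = 0, count1 = 0, count2 = 0, count3 = 0;
for (uint64_t i = 0; i < size/8; i += 4) {
    // Each accumulator is its own dependency chain, so the four
    // popcnts of an iteration can execute independently.
    count0 += _mm_popcnt_u64(buffer[i]);
    count1 += _mm_popcnt_u64(buffer[i+1]);
    count2 += _mm_popcnt_u64(buffer[i+2]);
    count3 += _mm_popcnt_u64(buffer[i+3]);
}
count = count0 + count1 + count2 + count3;

This is essentially what the inline-assembly answer above does by hand, and what the subset_counts answer below does with an array.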
P.S. And never run a benchmark for just a second. First warm up the core, then run the loop for at least 10 seconds, or better 100 seconds; otherwise you will be testing the power management firmware and the DVFS implementation in hardware :)

P.P.S. I have heard endless debates on how long a benchmark should really run. The smartest folks even ask why 10 seconds and not 11 or 12. I admit this is funny in theory. In practice, you just run the benchmark a hundred times in a row and record the deviations. That IS funny. Most people change the source and then run the bench exactly ONCE to capture a new performance record. Do the right things right.
Still not convinced? Just use the C version of the benchmark above by assp1r1n3 (https://stackoverflow.com/a/37026212/9706746) and try 100 instead of 10000 in the retry loop.
My 7960X shows, with RETRY=100:
Count: 203182300 Elapsed: 0.008385 seconds Speed: 12.505379 GB/s
Count: 203182300 Elapsed: 0.011063 seconds Speed: 9.478225 GB/s
Count: 203182300 Elapsed: 0.011188 seconds Speed: 9.372327 GB/s
Count: 203182300 Elapsed: 0.010393 seconds Speed: 10.089252 GB/s
Count: 203182300 Elapsed: 0.009076 seconds Speed: 11.553283 GB/s
with RETRY=10000:
Count: 20318230000 Elapsed: 0.661791 seconds Speed: 15.844519 GB/s
Count: 20318230000 Elapsed: 0.665422 seconds Speed: 15.758060 GB/s
Count: 20318230000 Elapsed: 0.660983 seconds Speed: 15.863888 GB/s
Count: 20318230000 Elapsed: 0.665337 seconds Speed: 15.760073 GB/s
Count: 20318230000 Elapsed: 0.662138 seconds Speed: 15.836215 GB/s
P.P.P.S. Finally, on the "accepted answer" and other mysteries ;-)

Let's use assp1r1n3's answer - he has a 2.5 GHz core. POPCNT has 1 clock throughput, and his code is using 64-bit popcnt. So the math is 2.5 GHz × 1 clock × 8 bytes = 20 GB/s for his setup. He is seeing 25 GB/s, perhaps due to turbo boost to around 3 GHz.

Thus go to ark.intel.com and look up the i7-4870HQ: https://ark.intel.com/products/83504/Intel-Core-i7-4870HQ-Processor-6M-Cache-up-to-3-70-GHz-?q=i7-4870HQ

That core could run at up to 3.7 GHz, so the real maximal rate for his hardware is 29.6 GB/s. So where are the other 4 GB/s? Perhaps they are spent on loop logic and other surrounding code within each iteration.

Now, where is this false dependency? The hardware runs at almost peak rate. Maybe my math is bad; it happens sometimes :)
P.P.P.P.S. People are still suggesting that the HW erratum is the culprit, so I followed the suggestion and created an inline asm example, see below.

On my 7960X, the first version (with a single output to cnt0) runs at 11 MB/s, and the second version (with outputs to cnt0, cnt1, cnt2 and cnt3) runs at 33 MB/s. And one could say - voila! It's the output dependency.

OK, maybe. But the point I am making is that it does not make sense to write code like this, and it's not an output dependency problem but dumb code generation. We are not testing hardware; we are writing code to unleash maximal performance. You could expect that HW OOO should rename and hide those "output dependencies", but, gosh, just do the right things right and you will never face any mystery.
uint64_t builtin_popcnt1a(const uint64_t* buf, size_t len)
{
    uint64_t cnt0, cnt1, cnt2, cnt3;
    cnt0 = cnt1 = cnt2 = cnt3 = 0;
    uint64_t val = buf[0];
#if 0
    __asm__ __volatile__ (
        "1:\n\t"
        "popcnt %2, %1\n\t"
        "popcnt %2, %1\n\t"
        "popcnt %2, %1\n\t"
        "popcnt %2, %1\n\t"
        "subq $4, %0\n\t"
        "jnz 1b\n\t"
        : "+q" (len), "=q" (cnt0)
        : "q" (val)
        :
    );
#else
    __asm__ __volatile__ (
        "1:\n\t"
        "popcnt %5, %1\n\t"
        "popcnt %5, %2\n\t"
        "popcnt %5, %3\n\t"
        "popcnt %5, %4\n\t"
        "subq $4, %0\n\t"
        "jnz 1b\n\t"
        : "+q" (len), "=q" (cnt0), "=q" (cnt1), "=q" (cnt2), "=q" (cnt3)
        : "q" (val)
        :
    );
#endif
    return cnt0;
}
Upvotes: -3
Reputation: 28921
I tried this with Visual Studio 2013 Express, using a pointer instead of an index, which sped up the process a bit. I suspect this is because the addressing is offset + register instead of offset + register + (register << 3). C++ code:
uint64_t* bfrend = buffer + (size/8);
uint64_t* bfrptr;

// ...
{
    startP = chrono::system_clock::now();
    count = 0;
    for (unsigned k = 0; k < 10000; k++){
        // Tight unrolled loop with uint64_t
        for (bfrptr = buffer; bfrptr < bfrend;){
            count += __popcnt64(*bfrptr++);
            count += __popcnt64(*bfrptr++);
            count += __popcnt64(*bfrptr++);
            count += __popcnt64(*bfrptr++);
        }
    }
    endP = chrono::system_clock::now();
    duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
    cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
         << (10000.0*size)/(duration) << " GB/s" << endl;
}
Assembly code (r10 = bfrptr, r15 = bfrend, rsi = count, rdi = buffer, r13 = k):
$LL5@main:
mov r10, rdi
cmp rdi, r15
jae SHORT $LN4@main
npad 4
$LL2@main:
mov rax, QWORD PTR [r10+24]
mov rcx, QWORD PTR [r10+16]
mov r8, QWORD PTR [r10+8]
mov r9, QWORD PTR [r10]
popcnt rdx, rax
popcnt rax, rcx
add rdx, rax
popcnt rax, r8
add r10, 32
add rdx, rax
popcnt rax, r9
add rsi, rax
add rsi, rdx
cmp r10, r15
jb SHORT $LL2@main
$LN4@main:
dec r13
jne SHORT $LL5@main
Upvotes: 14
Reputation: 700
Have you tried passing -funroll-loops -fprefetch-loop-arrays to GCC?
I get the following results with these additional optimizations:
[1829] /tmp/so_25078285 $ cat /proc/cpuinfo |grep CPU|head -n1
model name : Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz
[1829] /tmp/so_25078285 $ g++ --version|head -n1
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
[1829] /tmp/so_25078285 $ g++ -O3 -march=native -std=c++11 test.cpp -o test_o3
[1829] /tmp/so_25078285 $ g++ -O3 -march=native -funroll-loops -fprefetch-loop-arrays -std=c++11 test.cpp -o test_o3_unroll_loops__and__prefetch_loop_arrays
[1829] /tmp/so_25078285 $ ./test_o3 1
unsigned 41959360000 0.595 sec 17.6231 GB/s
uint64_t 41959360000 0.898626 sec 11.6687 GB/s
[1829] /tmp/so_25078285 $ ./test_o3_unroll_loops__and__prefetch_loop_arrays 1
unsigned 41959360000 0.618222 sec 16.9612 GB/s
uint64_t 41959360000 0.407304 sec 25.7443 GB/s
Upvotes: 11
Reputation: 47010
I can't give an authoritative answer, but I can provide an overview of a likely cause. This reference shows pretty clearly that for the instructions in the body of your loop there is a 3:1 ratio between latency and throughput. It also shows the effects of multiple dispatch. Since there are (give or take) three integer units in modern x86 processors, it's generally possible to dispatch three instructions per cycle.

So between peak pipeline and multiple-dispatch performance and the failure of these mechanisms, we have a factor of six in performance. It's pretty well known that the complexity of the x86 instruction set makes it quite easy for quirky breakage to occur. The document above has a great example:

The Pentium 4 performance for 64-bit right shifts is really poor. 64-bit left shift as well as all 32-bit shifts have acceptable performance. It appears that the data path from the upper 32 bits to the lower 32 bits of the ALU is not well designed.
I personally ran into a strange case where a hot loop ran considerably slower on a specific core of a four-core chip (AMD, if I recall). We actually got better performance on a map-reduce calculation by turning that core off.

Here my guess is contention for integer units: that the popcnt, loop counter, and address calculations can all just barely run at full speed with the 32-bit-wide counter, but the 64-bit counter causes contention and pipeline stalls. Since there are only about 12 cycles total, potentially 4 cycles with multiple dispatch, per loop body execution, a single stall could reasonably affect run time by a factor of 2.

The change induced by using a static variable, which I'm guessing just causes a minor reordering of instructions, is another clue that the 32-bit code is at some tipping point for contention.
I know this is not a rigorous analysis, but it is a plausible explanation.
Upvotes: 12
Reputation: 3911
This is not an answer, but it's hard to read if I put the results in a comment.

I get these results with a Mac Pro (Westmere 6-core Xeon 3.33 GHz). I compiled it with clang -O3 -msse4 -lstdc++ a.cpp -o a (-O2 gives the same result).
uint64_t size=atol(argv[1])<<20;
unsigned 41950110000 0.811198 sec 12.9263 GB/s
uint64_t 41950110000 0.622884 sec 16.8342 GB/s
uint64_t size=1<<20;
unsigned 41950110000 0.623406 sec 16.8201 GB/s
uint64_t 41950110000 0.623685 sec 16.8126 GB/s
I also tried reversing the for statement: for (uint64_t i=size/8;i>0;i-=4). This gives the same result and proves that the compiler is smart enough not to divide size by 8 on every iteration (as expected).
The speed factor comes in three parts:
code cache: uint64_t
version has larger code size, but this does not have an effect on my Xeon CPU. This makes the 64-bit version slower.
Instructions used. Note not only the loop count, but the buffer is accessed with a 32-bit and 64-bit index on the two versions. Accessing a pointer with a 64-bit offset requests a dedicated 64-bit register and addressing, while you can use immediate for a 32-bit offset. This may make the 32-bit version faster.
Instructions are only emitted on the 64-bit compile (that is, prefetch). This makes 64-bit faster.
The three factors together match with the observed seemingly conflicting results.
Upvotes: 36
Reputation:
I coded up an equivalent C program to experiment, and I can confirm this strange behaviour. What's more, gcc believes the 64-bit integer (which should probably be a size_t anyway...) to be better, as using uint_fast32_t causes gcc to use a 64-bit uint.

I did a bit of mucking around with the assembly:

Simply take the 32-bit version and replace all 32-bit instructions/registers with the 64-bit version in the inner popcount loop of the program. Observation: the code is just as fast as the 32-bit version!

This is obviously a hack, as the size of the variable isn't really 64-bit, since other parts of the program still use the 32-bit version, but as long as the inner popcount loop dominates performance, this is a good start.

I then copied the inner-loop code from the 32-bit version of the program, hacked it up to be 64-bit, and fiddled with the registers to make it a replacement for the inner loop of the 64-bit version. This code also runs as fast as the 32-bit version.

My conclusion is that this is bad instruction scheduling by the compiler, not an actual speed/latency advantage of 32-bit instructions.

(Caveat: I hacked up the assembly and could have broken something without noticing. I don't think so.)
Upvotes: 63
Reputation: 283793
Have you tried moving the reduction step outside the loop? Right now you have a data dependency that really isn't needed.
Try:
uint64_t subset_counts[4] = {};
for (unsigned k = 0; k < 10000; k++){
    // Tight unrolled loop with unsigned
    unsigned i = 0;
    while (i < size/8) {
        subset_counts[0] += _mm_popcnt_u64(buffer[i]);
        subset_counts[1] += _mm_popcnt_u64(buffer[i+1]);
        subset_counts[2] += _mm_popcnt_u64(buffer[i+2]);
        subset_counts[3] += _mm_popcnt_u64(buffer[i+3]);
        i += 4;
    }
}
count = subset_counts[0] + subset_counts[1] + subset_counts[2] + subset_counts[3];
You also have some weird aliasing going on that I'm not sure conforms to the strict-aliasing rules.
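For what it's worth, a hedged sketch of an initialization that sidesteps the aliasing question entirely (assuming the buffer and size variables from the question's code): build each word from random bytes and store it through the uint64_t lvalue itself:

for (uint64_t i = 0; i < size/8; ++i) {
    uint64_t w = 0;
    for (int b = 0; b < 8; ++b)
        w = (w << 8) | static_cast<uint64_t>(rand() % 256);  // pack 8 random bytes
    buffer[i] = w;  // no char* view of the uint64_t objects needed
}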
Upvotes: 11