Reputation: 43619
I'm debugging an application that is running quite a bit slower when built as a 64-bit Linux ELF executable than as a 32-bit Linux ELF executable. Using Rational (IBM) Quantify, I tracked much of the performance difference down to (drum roll...) memset. Oddly, memset is taking a lot longer in the 64-bit executable.
I am even able to see this with a small, simple application:
#include <stdlib.h>
#include <string.h>

#define BUFFER_LENGTH 8000000

int main()
{
    /* One large heap buffer, zeroed repeatedly so memset dominates the run time. */
    unsigned char* buffer = malloc(BUFFER_LENGTH * sizeof(unsigned char));
    for (int i = 0; i < 10000; i++)
        memset(buffer, 0, BUFFER_LENGTH * sizeof(unsigned char));
}
I build like this:
$ gcc -m32 -std=gnu99 -g -O3 ms.c
and
$ gcc -m64 -std=gnu99 -g -O3 ms.c
The wall-clock time as reported by time is longer for the -m64 build, and Quantify confirms that the extra time is being spent in memset.
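For reference, here is a variant that times just the memset loop with clock_gettime instead of timing the whole process. This is only a sketch, not the exact code I profiled; on older glibc it may need -lrt, and the final read of buffer[0] is there only to discourage the optimizer from dropping the work.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFFER_LENGTH 8000000

int main()
{
    unsigned char* buffer = malloc(BUFFER_LENGTH);
    if (buffer == NULL)
        return 1;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Same workload as above: 10000 passes of zeroing the buffer. */
    for (int i = 0; i < 10000; i++)
        memset(buffer, 0, BUFFER_LENGTH);

    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    /* Touch the buffer so the zeroing stays observable. */
    printf("memset loop: %.3f s (buffer[0] = %d)\n", seconds, buffer[0]);

    free(buffer);
    return 0;
}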
So far I've tested in VirtualBox and VMWare (but not bare-metal Linux; I realize I need to do that next). The amount of extra time spent seems to vary a bit from one system to the next.
What's going on here? Is there a well-known issue that my Google-foo is not able to uncover?
EDIT: The disassembly (gcc ... -S) on my system shows that memset is being invoked as an external function:
32-bit:
.LBB2:
.loc 1 14 0
movl $8000000, 8(%esp)
.loc 1 12 0
addl $1, %ebx
.loc 1 14 0
movl $0, 4(%esp)
movl %esi, (%esp)
call memset
64-bit:
.LBB2:
.loc 1 14 0
xorl %esi, %esi
movl $8000000, %edx
movq %rbp, %rdi
.LVL1:
.loc 1 12 0
addl $1, %ebx
.loc 1 14 0
call memset
System:
Upvotes: 12
Views: 2365
Reputation: 86353
I can confirm that on my non-virtualized Mandriva Linux system the x86_64 version is slightly (about 7%) slower. In both cases the memset() library function is called, regardless of the instruction set word size.
A casual look at the assembly code of both library implementations reveals that the x86_64 version is significantly more complex. I assume that this has mostly to do with the fact that the 32-bit version has to deal with only 4 possible alignment cases, versus the 8 possible alignment cases of the 64-bit version. It also seems that the x86_64 memset() loop has been more extensively unrolled, perhaps due to different compiler optimizations.
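If you want to rule out an unlucky buffer alignment in your test, a quick check like the following (just a sketch, not code from my system) prints where the malloc'd buffer falls relative to a 16-byte boundary:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define BUFFER_LENGTH 8000000

int main()
{
    unsigned char* buffer = malloc(BUFFER_LENGTH);
    if (buffer == NULL)
        return 1;
    /* glibc malloc normally returns 8-byte (32-bit) or 16-byte (64-bit)
       aligned blocks, so memset() should hit its aligned fast path either way. */
    printf("buffer %% 16 = %lu\n", (unsigned long)((uintptr_t)buffer % 16));
    free(buffer);
    return 0;
}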
Another factor that could account for the slower operation is the increased memory I/O associated with the use of a 64-bit word size. Both code and metadata (pointers, etc.) generally get larger in 64-bit applications.
Also, keep in mind that the library implementations included in most distributions are targeted to whatever CPU the maintainers consider to be the current lowest common denominator for each processor family. This may leave the 64-bit processors at a disadvantage, since the 32-bit instruction set has been stable for some time now.
Upvotes: 1
Reputation: 24546
I believe that virtualization is the culprit: I have been running some benchmarks of my own (random number generation in bulk, sequential searches; also 64-bit) and found that the code runs ~2x slower within Linux in VirtualBox than natively under Windows. The funny thing is that the code does no I/O (except a simple printf now and then, in between timings) and uses little memory (all data fits into the L1 cache), so one would think that page table management and TLB overheads could be excluded.
This is mysterious indeed. I have noticed that VirtualBox reports to the VM that SSE 4.1 and SSE 4.2 instructions are not supported, even though the CPU supports them, and the program using them runs fine(!) in a VM. I have no time to investigate the issue further, but you REALLY should time it on a real machine. Unfortunately, my program won't run on 32 bits, so I couldn't test for a slowdown in 32-bit mode.
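If you want to see what the VM actually advertises to the guest, a quick probe like this works on a reasonably recent GCC (4.8 or later); it is just a sketch, not part of my benchmark:

#include <stdio.h>

int main()
{
    /* __builtin_cpu_supports() asks GCC's runtime CPU detection what the
       (possibly virtualized) processor claims to support. */
    __builtin_cpu_init();
    printf("sse4.1: %d\n", __builtin_cpu_supports("sse4.1"));
    printf("sse4.2: %d\n", __builtin_cpu_supports("sse4.2"));
    return 0;
}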
Upvotes: 1
Reputation: 7147
When compiling your example code the compiler sees the fixed block size (~8 MB) and decides to use the library version. Try the same code with much smaller blocks (memset'ing just a few bytes) and compare the disassembly; see the sketch below.
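For example, a tiny fixed-size case like this (purely an illustrative sketch) is typically expanded inline by GCC at -O3 instead of calling the library memset, so the library implementation drops out of the picture:

#include <string.h>

/* With a small, known size, GCC at -O3 typically emits a few inline stores
   here rather than a call to the library memset(). */
void clear_small(unsigned char* p)
{
    memset(p, 0, 16);
}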
That said, I do not know why the x64 version is slower; I suspect there is an issue in your time measurement code.
From the changelog of gcc 4.3:
Code generation of block move (memcpy) and block set (memset) was rewritten. GCC can now pick the best algorithm (loop, unrolled loop, instruction with rep prefix or a library call) based on the size of the block being copied and the CPU being optimized for. A new option -minline-stringops-dynamically has been added. With this option string operations of unknown size are expanded such that small blocks are copied by in-line code, while for large blocks a library call is used. This results in faster code than -minline-all-stringops when the library implementation is capable of using cache hierarchy hints. The heuristic choosing the particular algorithm can be overwritten via -mstringop-strategy. Newly also memset of values different from 0 is inlined.
Hope this explains what the compiler designers are trying to do (even if this is for another version) ;-)
Upvotes: 0