Reputation: 25645
I am looking at the performance of memchr
-like functions and made an interesting observation.
This is check.c
with 3 implementations to find the offset of a \n
character in a string:
#include <stdlib.h>
size_t mem1(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x == '\n') return (p - s);
p++;
}
}
size_t mem2(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
p++;
}
}
size_t mem3(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x == '\n' || x == '\0') return (p - s);
p++;
}
}
size_t mem4(const char *s)
{
const char *p = s;
while (1)
{
const char x = *p;
if (x <= '$' && (x == '\n')) return (p - s);
p++;
}
}
I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello"
- that is 10000 times asdf
, then the newline to find, and then hello
. Of course all 3 implementations return the same offset: 40000 as expected.
Interestingly, when compiled with gcc -O2
, the run times on that string are:
mem1
: 16 usmem2
: 12 usmem3
: 25 usmem4
: 16 us(I'm using the criterion library to measure these times with statistical accuracy.)
I cannot explain this to myself. Why is mem2
so much faster than the other two?
--
The assembly as generated by gcc -S -O2 -o check.asm check.c
:
mem1:
.LFB14:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L9
.L6:
addq $1, %rax
cmpb $10, (%rax)
jne .L6
subq %rdi, %rax
ret
.L9:
xorl %eax, %eax
ret
mem2:
.LFB15:
movq %rdi, %rax
jmp .L13
.L19:
cmpb $10, %dl
je .L14
.L11:
addq $1, %rax
.L13:
movzbl (%rax), %edx
cmpb $36, %dl
jg .L11
testb %dl, %dl
jne .L19
.L14:
subq %rdi, %rax
ret
mem3:
.LFB16:
movzbl (%rdi), %edx
testb %dl, %dl
je .L26
cmpb $10, %dl
movq %rdi, %rax
jne .L27
jmp .L26
.L30:
cmpb $10, %dl
je .L23
.L27:
addq $1, %rax
movzbl (%rax), %edx
testb %dl, %dl
jne .L30
.L23:
subq %rdi, %rax
ret
.L26:
xorl %eax, %eax
ret
mem4:
.LFB17:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L38
.L36:
addq $1, %rax
cmpb $10, (%rax)
jne .L36
subq %rdi, %rax
ret
.L38:
xorl %eax, %eax
ret
Any explanation is very appreciated!
Upvotes: 12
Views: 311
Reputation: 153338
With a cache, the test of mem1()
takes the brunt of filling the cache.
Run the mem1()
test first and again as last and use the 2nd time as it reflects a primed cache like the other tests. Confident it will be faster and a more fair time comparison.
Upvotes: 1
Reputation: 20914
My best guess is it's to do with register dependency - if you look at the 3-instruction main loop in mem1
, you have a circular dependency on rax
. Naïvely, this means each instruction has to wait for the last one to finish - in practice it means if the instructions aren't retired quickly enough the microarchitecture may run out of registers to rename and just give up and stall for a bit.
In mem2
the fact that there are 4 instructions in the loop - and possibly also the fact that there's more of an explicit pipeline in the use of both rax
and edx/dl
- is probably giving the out-of-order execution hardware an easier time thus it ends up pipelining more efficiently.
I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.
Edit: Out of interest, I've tested mem1
and mem2
on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to the exact same assembly code. Calling either function with the given string 1,000,000 times in a loop and using Linux's time
, I get ~19s for mem1
and ~18.8s for mem2
- much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...
Upvotes: 3
Reputation: 16761
Your input is such that makes mem2
faster. Every letter in the input apart from '\n' has value larger than '$', so if
condition is false from the first part of the expression (x <= '$'), and second part of the expression (x == '\n' || x == '\0') is never executed. If you would use "####" instead of "abcd" I suspect the execution would become slower.
Upvotes: 2