I am looking at the performance of memchr -like functions and made an interesting observation. This is check.c with 3 implementations to find the offset of a \n character in a string: #include <stdlib.h> size_t mem1(const char *s) { const char *p = s; while (1) { const char x = *p; if (x == '\n') return (p - s); p++; } } size_t mem2(const char *s) { const char *p = s; while (1) { const char x = *p; if (x <= '$' && (x == '\n' || x == '\0')) return (p - s); p++; } } size_t mem3(const char *s) { const char *p = s; while (1) { const char x = *p; if (x == '\n' || x == '\0') return (p - s); p++; } } size_t mem4(const char *s) { const char *p = s; while (1) { const char x = *p; if (x <= '$' && (x == '\n')) return (p - s); p++; } } I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" - that is 10000 times asdf , then the newline to find, and then hello . Of course all 3 implementations return the same offset: 40000 as expected. Interestingly, when compiled with gcc -O2 , the run times on that string are: mem1 : 16 us mem2 : 12 us mem3 : 25 us mem4 : 16 us (I'm using the criterion library to measure these times with statistical accuracy.) I cannot explain this to myself. Why is mem2 so much faster than the other two? -- The assembly as generated by gcc -S -O2 -o check.asm check.c : mem1: .LFB14: cmpb $10, (%rdi) movq %rdi, %rax je .L9 .L6: addq $1, %rax cmpb $10, (%rax) jne .L6 subq %rdi, %rax ret .L9: xorl %eax, %eax ret mem2: .LFB15: movq %rdi, %rax jmp .L13 .L19: cmpb $10, %dl je .L14 .L11: addq $1, %rax .L13: movzbl (%rax), %edx cmpb $36, %dl jg .L11 testb %dl, %dl jne .L19 .L14: subq %rdi, %rax ret mem3: .LFB16: movzbl (%rdi), %edx testb %dl, %dl je .L26 cmpb $10, %dl movq %rdi, %rax jne .L27 jmp .L26 .L30: cmpb $10, %dl je .L23 .L27: addq $1, %rax movzbl (%rax), %edx testb %dl, %dl jne .L30 .L23: subq %rdi, %rax ret .L26: xorl %eax, %eax ret mem4: .LFB17: cmpb $10, (%rdi) movq %rdi, %rax je .L38 .L36: addq $1, %rax cmpb $10, (%rax) jne .L36 subq %rdi, %rax ret .L38: xorl %eax, %eax ret Any explanation is very appreciated!

Reputation: 25715

Why is my more complicated C loop faster?

I am looking at the performance of memchr-like functions and made an interesting observation.

This is check.c with 3 implementations to find the offset of a \n character in a string:

#include <stdlib.h>

size_t mem1(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x == '\n') return (p - s);
    p++;
  }
}

size_t mem2(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
    p++;
  }
}

size_t mem3(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x == '\n' || x == '\0') return (p - s);
    p++;
  }
}

size_t mem4(const char *s)
{
  const char *p = s;
  while (1)
  {
    const char x = *p;
    if (x <= '$' && (x == '\n')) return (p - s);
    p++;
  }
}

I run these functions on a string of bytes which can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" - that is 10000 times asdf, then the newline to find, and then hello. Of course all 3 implementations return the same offset: 40000 as expected.

Interestingly, when compiled with gcc -O2, the run times on that string are:

mem1: 16 us
mem2: 12 us
mem3: 25 us
mem4: 16 us

(I'm using the criterion library to measure these times with statistical accuracy.)

I cannot explain this to myself. Why is mem2 so much faster than the other two?

The assembly as generated by gcc -S -O2 -o check.asm check.c:

mem1:
.LFB14:
  cmpb  $10, (%rdi)
  movq  %rdi, %rax
  je  .L9
.L6:
  addq  $1, %rax
  cmpb  $10, (%rax)
  jne .L6
  subq  %rdi, %rax
  ret
.L9:
  xorl  %eax, %eax
  ret


mem2:
.LFB15:
  movq  %rdi, %rax
  jmp .L13
.L19:
  cmpb  $10, %dl
  je  .L14
.L11:
  addq  $1, %rax
.L13:
  movzbl  (%rax), %edx
  cmpb  $36, %dl
  jg  .L11
  testb %dl, %dl
  jne .L19
.L14:
  subq  %rdi, %rax
  ret


mem3:
.LFB16:
  movzbl  (%rdi), %edx
  testb %dl, %dl
  je  .L26
  cmpb  $10, %dl
  movq  %rdi, %rax
  jne .L27
  jmp .L26
.L30:
  cmpb  $10, %dl
  je  .L23
.L27:
  addq  $1, %rax
  movzbl  (%rax), %edx
  testb %dl, %dl
  jne .L30
.L23:
  subq  %rdi, %rax
  ret
.L26:
  xorl  %eax, %eax
  ret


mem4:
.LFB17:
  cmpb  $10, (%rdi)
  movq  %rdi, %rax
  je  .L38
.L36:
  addq  $1, %rax
  cmpb  $10, (%rax)
  jne .L36
  subq  %rdi, %rax
  ret
.L38:
  xorl  %eax, %eax
  ret

Any explanation is very appreciated!

Upvotes: 12

Answers (3)

chux

Reputation: 154242

With a cache, the test of mem1() takes the brunt of filling the cache.

Run the mem1() test first and again as last and use the 2nd time as it reflects a primed cache like the other tests. Confident it will be faster and a more fair time comparison.

Upvotes: 1

Notlikethat

Reputation: 20984

My best guess is it's to do with register dependency - if you look at the 3-instruction main loop in mem1, you have a circular dependency on rax. Naïvely, this means each instruction has to wait for the last one to finish - in practice it means if the instructions aren't retired quickly enough the microarchitecture may run out of registers to rename and just give up and stall for a bit.

In mem2 the fact that there are 4 instructions in the loop - and possibly also the fact that there's more of an explicit pipeline in the use of both rax and edx/dl - is probably giving the out-of-order execution hardware an easier time thus it ends up pipelining more efficiently.

I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.

Edit: Out of interest, I've tested mem1 and mem2 on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to the exact same assembly code. Calling either function with the given string 1,000,000 times in a loop and using Linux's time, I get ~19s for mem1 and ~18.8s for mem2 - much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...

Upvotes: 3

Dialecticus

Reputation: 16779

Your input is such that makes mem2 faster. Every letter in the input apart from '\n' has value larger than '$', so if condition is false from the first part of the expression (x <= '$'), and second part of the expression (x == '\n' || x == '\0') is never executed. If you would use "####" instead of "abcd" I suspect the execution would become slower.

Upvotes: 2

Why is my more complicated C loop faster?

Answers (3)

Related Questions