Why is this simple loop faster in Go than in C?

Question

I was trying to find out whether Go's loop performance is as good as C's, but was surprised to find that, for my simple test, the C version takes twice as long as the Go version.

C Version:

#include <stdio.h>

int main() {
  int i = 0, a = 0;

  while (i < 1e9) {
    a = (a + i) % 42;
    i = i + 1;
  }
  printf("%d\n", a);
}


$ gcc -o main main.c && time ./main # tried -O0 as well; the result is similar
36
./main  10.53s user 0.08s system 98% cpu 10.769 total

Go Version:

package main

import "fmt"

func main() {
    a := int32(0)
    for i := int32(0); i < 1e9; i++ {
        a = (a + i) % 42
    }
    fmt.Println(a)
}


$ time go run main.go
36
colorgo run main.go  5.27s user 0.14s system 93% cpu 5.816 total

(tested on Darwin, amd64)

For this simple algorithm, shouldn't both of them produce nearly identical machine code? Is this due to compiler optimization? Cache efficiency?

Please help me understand! Thanks!

OneOfOne · Accepted Answer

It all boils down to the assembly generated.

go tool 6g -S (21 instructions):

MOVL    $0,SI
MOVL    SI,"".a+8(FP)
MOVL    $0,CX
CMPL    CX,$1000000000
JGE     $0,58
ADDL    CX,SI
MOVL    $818089009,BP
MOVL    SI,AX
IMULL   BP,
MOVL    DX,BX
SARL    $3,BX
MOVL    SI,BP
SARL    $31,BP
SUBL    BP,BX
IMULL   $42,BX
SUBL    BX,SI
MOVL    SI,"".a+8(FP)
INCL    ,CX #point A
NOP     ,
CMPL    CX,$1000000000
JLT     $0,16
RET     ,

gcc -O3 -march=native -S (17 instructions):

leal    (%rsi,%rcx), %edi
addl    $1, %ecx
vxorpd  %xmm0, %xmm0, %xmm0
vcvtsi2sd       %ecx, %xmm0, %xmm0
movl    %edi, %eax
imull   %r8d
movl    %edi, %eax
sarl    $31, %eax
sarl    $3, %edx
movl    %edx, %esi
subl    %eax, %esi
imull   $42, %esi, %esi
subl    %esi, %edi
vucomisd        %xmm0, %xmm1
movl    %edi, %esi
ja      .L2
subq    $8, %rsp

gcc -O3 -march=native -S (14 instructions, after replacing 1e9 with 1000000000):

leal    (%rdx,%rcx), %esi
addl    $1, %ecx
movl    %esi, %eax
imull   %edi
movl    %esi, %eax
sarl    $31, %eax
sarl    $3, %edx
subl    %eax, %edx
imull   $42, %edx, %edx
subl    %edx, %esi
movl    %esi, %edx
cmpl    $1000000000, %ecx
jne     .L2
subq    $8, %rsp

Timing:

$ gcc -O3 -march=native loop.c; and time ./a.out
36
2.92user 0.00system 0:02.93elapsed 99%CPU
$ go build -o loop loop.go; and time ./loop
36
2.89user 0.00system 0:02.90elapsed 99%CPU
$ gcc -O3 -march=native loop_nofp.c; and time ./a.out
36
2.92user 0.00system 0:02.94elapsed 99%CPU (0avgtext+0avgdata 1312maxresident)

I have no idea, I'm leaving this for now until a proper answer is posted.

//edit

Changing the C code to use a for loop, to match the Go version, produced different assembly but exactly the same timing.

#include <stdint.h>
#include <stdio.h>

int main() {
    int32_t i = 0, a = 0;
    for (i = 0; i < 1e9; i++) {
        a = (a + i) % 42;
    }
    printf("%d\n", a);
    return 0;
}
