While evaluating another question I stumbled upon a case where two different Julia programs generate the same code but take different amounts of time to execute.
using BenchmarkTools
test(n) = [g() for i = 1:n]
Case 1:
g() = 0;
@btime test(1000);
1.020 μs (1 allocation: 7.94 KiB)
Code 1:
code_native(g,())
.text
Filename: In[2]
pushq %rbp
movq %rsp, %rbp
Source line: 1
xorl %eax, %eax
popq %rbp
retq
nopl (%rax,%rax)
@code_native test(1000)
.text
Filename: In[1]
pushq %rbp
movq %rsp, %rbp
Source line: 2
subq $16, %rsp
xorl %eax, %eax
testq %rdi, %rdi
cmovnsq %rdi, %rax
movq $1, -16(%rbp)
movq %rax, -8(%rbp)
movabsq $collect, %rax
leaq -16(%rbp), %rdi
callq *%rax
addq $16, %rsp
popq %rbp
retq
nopw %cs:(%rax,%rax)
Case 2:
g() = UInt8(0);
@btime test(1000);
142.603 ns (1 allocation: 1.06 KiB)
Code 2:
code_native(g,())
.text
Filename: In[8]
pushq %rbp
movq %rsp, %rbp
Source line: 1
xorl %eax, %eax
popq %rbp
retq
nopl (%rax,%rax)
@code_native test(1000)
.text
Filename: In[11]
pushq %rbp
movq %rsp, %rbp
Source line: 2
subq $16, %rsp
xorl %eax, %eax
testq %rdi, %rdi
cmovnsq %rdi, %rax
movq $1, -16(%rbp)
movq %rax, -8(%rbp)
movabsq $collect, %rax
leaq -16(%rbp), %rdi
callq *%rax
addq $16, %rsp
popq %rbp
retq
nopw %cs:(%rax,%rax)
Different timings but the same code seems very strange to me. Could someone explain what is happening here?
Upvotes: 0
Views: 123
The time difference is not due to the different function g()
used in each case, but to the amount of memory that must be zeroed as a result.
In case 1, 8 bytes * 1000 = 8000 bytes need to be allocated and zeroed.
In case 2, 1 byte * 1000 = 1000 bytes need to be allocated and zeroed.
This can be seen from the results of @btime. A clearer example:
julia> @btime zeros(1000);
767.300 ns (1 allocation: 7.94 KiB)
julia> @btime zeros(125);
128.849 ns (1 allocation: 1.06 KiB)
Here zeros(n) simply returns an array of n Int zeros. Notice that the allocated amounts match the amounts in the question.
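The factor-of-eight difference follows directly from the element sizes; a quick sanity check (assuming a 64-bit system, where sizeof(Int) == 8 — the @btime figures above are slightly larger than the raw payload because they include the array header):

```julia
# Element sizes explain the ~8x difference in allocated bytes.
@assert sizeof(Int) == 8      # assumes a 64-bit system
@assert sizeof(UInt8) == 1

# Raw data payload for 1000 elements:
@assert 1000 * sizeof(Int)   == 8000   # case 1: Int zeros
@assert 1000 * sizeof(UInt8) == 1000   # case 2: UInt8 zeros
```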
UPDATE
Stefan pointed out that, curiously, the output of @code_native
for both g()
and test(Int)
is the same in both runs. This raises the question: how does the computer know whether it is allocating UInt8s or Ints?
Since g()
is redefined and test(Int)
depends on it, the world-age mechanism (introduced in Julia 0.5/0.6 to deal with redefinitions) triggers a recompilation of test(Int)
when it is invoked after the redefinition. The new test(Int)
has similar @code_native
output (on an x86 target machine), but the referenced $collect
value differs between the two compilations. To make this visible, the @code_llvm
output shows a difference in the name suffixes between the versions:
define %jl_value_t addrspace(10)* @julia_test_62122(i64) #0 !dbg !5 {
top:
:
:
%5 = call %jl_value_t addrspace(10)* @julia_collect_62123(%Generator addrspace(11)* nocapture readonly %4)
ret %jl_value_t addrspace(10)* %5
}
vs.
define %jl_value_t addrspace(10)* @julia_test_62151(i64) #0 !dbg !5 {
top:
:
:
%5 = call %jl_value_t addrspace(10)* @julia_collect_62152(%Generator addrspace(11)* nocapture readonly %4)
ret %jl_value_t addrspace(10)* %5
}
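The world-age recompilation can also be observed at the value level rather than in the generated code; a minimal sketch, run as a top-level script (each top-level statement sees the latest world, so the second call to test compiles against the new g):

```julia
g() = 0
test(n) = [g() for i = 1:n]

a = test(3)
@assert eltype(a) == Int

# Redefining g() bumps the world age; the next top-level call to test
# triggers a fresh compilation of test(Int) against the new g().
g() = UInt8(0)
b = test(3)
@assert eltype(b) == UInt8

# The element data shrinks accordingly: 8 bytes vs. 1 byte per element.
@assert sizeof(a) == 3 * sizeof(Int)
@assert sizeof(b) == 3
```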
A closer-to-the-metal approach is to dig out the machine code for the two versions:
0x55, 0x48, 0x89, 0xe5, 0x48, 0x8b, 0x06, 0x48, 0x8b, 0x38, 0x48, 0xb8, 0xa0, 0x52, 0xa1, 0x21, 0x7e, 0x7e, 0x00, 0x00, 0xff, 0xd0, 0x5d, 0xc3
vs.
0x55, 0x48, 0x89, 0xe5, 0x48, 0x8b, 0x06, 0x48, 0x8b, 0x38, 0x48, 0xb8, 0x10, 0x58, 0xa1, 0x21, 0x7e, 0x7e, 0x00, 0x00, 0xff, 0xd0, 0x5d, 0xc3
Note that 0xc3 is the x86 opcode for the ret
instruction. To get at the machine code, you need to go down the rabbit hole of the nested objects/arrays under methods(test).
Upvotes: 3