Reputation:
function c1()
    x::UInt64 = 0
    while x <= (10^8 * 10)
        x += 1
    end
end

function c2()
    x::UInt64 = 0
    while x <= (10^9)
        x += 1
    end
end

function c3()
    x::UInt64 = 0
    y::UInt64 = 10^8 * 10
    while x <= y
        x += 1
    end
end

function c4()
    x::UInt64 = 0
    y::UInt64 = 10^9
    while x <= y
        x += 1
    end
end
Should be the same, right?
@time c1()
0.019102 seconds (40.99 k allocations: 2.313 MiB)
@time c1()
0.000003 seconds (4 allocations: 160 bytes)
@time c2()
9.205925 seconds (47.89 k allocations: 2.750 MiB)
@time c2()
9.015212 seconds (4 allocations: 160 bytes)
@time c3()
0.019848 seconds (39.23 k allocations: 2.205 MiB)
@time c3()
0.000003 seconds (4 allocations: 160 bytes)
@time c4()
0.705712 seconds (47.41 k allocations: 2.719 MiB)
@time c4()
0.760354 seconds (4 allocations: 160 bytes)
Upvotes: 4
Views: 338
Reputation: 18530
2nd UPDATE: Check out hckr's answer. It's much better than mine.
UPDATE: This is not a comprehensive answer, just as much as I've been able to puzzle out; I've had to give up for now due to time constraints.
I'm probably not the best person to answer this question, since as far as compiler optimization is concerned, I know just enough to be dangerous. Hopefully someone who understands Julia's compiler a little better will stumble across this question and give a more comprehensive response, because from what I can see, your c2 function is doing an awful lot of work that it shouldn't need to.
So, there are at least two issues at play here. Firstly, as it stands, both c1 and c2 will always return nothing. For some reason I don't understand, the compiler is able to work this out in the case of c1, but not in the case of c2. Consequently, after compilation, c1 runs almost instantly because the loop in the algorithm is never actually performed. Indeed:
julia> @btime c1()
1.535 ns (0 allocations: 0 bytes)
You can also see this using @code_native c1() and @code_native c2(). The former is only a couple of lines long while the latter contains many more instructions. Also worth noting that the former does not contain any references to the function <=, indicating that the condition in the while loop has been completely optimized out.
We can deal with this first issue by adding a return x statement at the bottom of both your functions, which forces the compiler to actually engage with the question of what the final value of x will be.
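For concreteness, here is c1 with that change (c2 gets the same one-line addition):

function c1()
    x::UInt64 = 0
    while x <= (10^8 * 10)
        x += 1
    end
    return x    # forces the compiler to reason about the final value of x
end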
However, if you do this, you'll note that c1 is still about 10 times faster than c2, which is the second puzzling issue about your example.
It seems to me that even with a return x, a sufficiently clever compiler has all the information it should need to skip the loop entirely. That is, it knows at compile time the start value of x, the exact transformation applied inside the loop, and the exact value of the terminating condition. Surprisingly enough, if you run @code_native c1() (after adding return x at the bottom), you'll notice that it has indeed worked out the function return value right there in the native code (cmpq $1000000001):
julia> @code_native c1()
.text
; Function c1 {
; Location: REPL[2]:2
movq $-1, %rax
nopw (%rax,%rax)
; Location: REPL[2]:3
; Function <=; {
; Location: int.jl:436
; Function <=; {
; Location: int.jl:429
L16:
addq $1, %rax
cmpq $1000000001, %rax # imm = 0x3B9ACA01
;}}
jb L16
; Location: REPL[2]:6
retq
nopl (%rax)
;}
So I'm not really sure why it is still doing any work at all!
For reference, here is the output of @code_native c2() (after adding return x):
julia> @code_native c2()
.text
; Function c2 {
; Location: REPL[3]:2
pushq %r14
pushq %rbx
pushq %rax
movq $-1, %rbx
movabsq $power_by_squaring, %r14
nopw %cs:(%rax,%rax)
; Location: REPL[3]:3
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: intfuncs.jl:220
L32:
addq $1, %rbx
movl $10, %edi
movl $9, %esi
callq *%r14
;}}}
; Function <=; {
; Location: int.jl:436
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
testq %rax, %rax
;}}}
js L59
cmpq %rax, %rbx
jbe L32
; Location: REPL[3]:6
L59:
movq %rbx, %rax
addq $8, %rsp
popq %rbx
popq %r14
retq
nopw %cs:(%rax,%rax)
;}
There is clearly a lot of additional work going on here for c2 that doesn't make much sense to me. Hopefully someone more familiar with Julia's internals can shed some light on this.
Upvotes: 1
Reputation: 5583
This is about Julia's compile-time optimization of literals using power-by-squaring. Julia is able to optimize x^p at compile time if the exponent can be reached solely via power-by-squaring, or if the power is 0, 1, 2 or 3. I believe this is done by lowering x^p to x^Val{p} for integer p and using compiler specialization (or inlining plus a kind of metaprogramming; I am not sure what the correct term is here, but it's the sort of thing you will find in Lisp, and similar techniques are used for source-to-source auto-differentiation in Julia, see Zygote.jl) to lower the code to a constant when p is 0, 1, 2, 3 or a power of 2.
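You can see this lowering for yourself in the REPL (the exact printing varies by Julia version, but the literal_pow call with the exponent wrapped in a Val type is the key part):

julia> Meta.@lower x^3    # lowers to a call like Base.literal_pow(^, x, Val{3}()),
                          # putting the exponent into the type domain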
Julia lowers 10^8 to an inlined literal_pow call (and then power_by_squaring), and this gets lowered down to a constant. Julia then lowers constant * 10 to get another constant, realizes the whole while loop is unnecessary, removes the loop, and so on, all at compile time.
If you change 10^8 to 10^7 in c1, you will see that it evaluates the number and the loop at runtime. However, if you replace 10^8 with 10^4 or 10^2, you will see that it handles all the computation at compile time. I think Julia is not specifically set up to compile-time optimize when the exponent is a power of 2; rather, the compiler just turns out to be able to optimize (lower the code to a constant) exactly in that case.
The case in which p is 1, 2 or 3 is hard-coded in Julia. This is again optimized by lowering the code to an inlined version of literal_pow and then compiler specialization.
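For reference, these hard-coded cases are ordinary inlined methods in Base (paraphrased from intfuncs.jl; the exact definitions vary a little between Julia versions):

@inline literal_pow(::typeof(^), x, ::Val{0}) = one(x)
@inline literal_pow(::typeof(^), x, ::Val{1}) = x
@inline literal_pow(::typeof(^), x, ::Val{2}) = x * x
@inline literal_pow(::typeof(^), x, ::Val{3}) = x * x * x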
You can use the @code_llvm and @code_native macros to see what's going on. Let's try.
julia> f() = 10^8*10
julia> g() = 10^7*10
julia> @code_native f()
.text
; Function f {
; Location: In[101]:2
movl $1000000000, %eax # imm = 0x3B9ACA00
retq
nopw %cs:(%rax,%rax)
;}
julia> @code_native g()
.text
; Function g {
; Location: In[104]:1
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: In[104]:1
pushq %rax
movabsq $power_by_squaring, %rax
movl $10, %edi
movl $7, %esi
callq *%rax
;}}}
; Function *; {
; Location: int.jl:54
addq %rax, %rax
leaq (%rax,%rax,4), %rax
;}
popq %rcx
retq
;}
See how f() turns out to be just a constant, while g() evaluates things at run-time.
I think Julia started this integer exponentiation trick around this commit, if you would like to dig more.
EDIT: Let's compile-time optimize c2
I have also prepared a function to compute integer-integer exponents, with which Julia will also optimize for non-power-of-2 exponents. I am not sure it is correct in all cases, though.
@inline function ipow(base::Int, exp::Int)
    # Power by squaring: consume one bit of the exponent per iteration.
    result = 1
    flag = true
    while flag
        if exp & 1 > 0
            result *= base    # low bit set: fold the current base into the result
        end
        exp >>= 1
        base *= base          # square the base for the next bit
        flag = exp != 0
    end
    return result
end
Now replace the 10^9 in c2 with ipow(10, 9), and enjoy the power of compile-time optimization.
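For concreteness, the rewritten function looks like this (the loop is unchanged; only the bound differs):

function c2()
    x::UInt64 = 0
    while x <= ipow(10, 9)    # inlined and folded to a constant at compile time
        x += 1
    end
end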
Also see this question for more on power-by-squaring.
Please don't use this function as is, since it tries to inline all exponentiation, whether or not the arguments are literals. You wouldn't want that.
Upvotes: 5