Reputation: 259
There are a lot of posts about the speed of vectorized versus devectorized code in Julia, but my question concerns a very simple piece of code that I ran in both Python and Julia to compare performance. I have a huge Python codebase that I will transcribe to Julia if and only if functions like this one are significantly sped up.
I am running everything in Jupyter notebooks. I know that Julia runs faster from a .jl file; however, I also compared against running from the terminal, and the difference in this case is negligible.
The Python code uses broadcasting and vectorization:
import numpy as np

def test(x0, dx, i):
    q = np.arange(-x0, x0, dx)
    z = np.zeros(shape=(i, len(q)), dtype=np.complex128)
    B = np.exp(-1j*q)
    for s in range(1, i):
        # A[k, j] = exp(-(q[j] - q[k])**2) * exp(-1j*s*q[j]), summed over j below
        A = np.exp(-(q - q[:, np.newaxis])**2) * np.exp(-1j*s*q)
        z[s] = B * np.sum(A, axis=1) * dx
    return z
while in the Julia translation I attempted to use for loops, but ended up using a comprehension once, since it was faster than another for loop:
function test(x0,dx,n)
    z = zeros(ComplexF64, (n, length(-x0:dx:x0-dx)))
    B = exp.(-1im*(-x0:dx:x0-dx))
    for s in 1:n-1
        for (i,q) in enumerate(-x0:dx:x0-dx)
            A = [exp(-(q-Q)^2)exp(-1im*s*Q) for Q in -x0:dx:x0-dx]
            z[s,i] = B[i]*sum(A)*dx
        end
    end
    z
end
The results are
%timeit test(10,.1,10)
2.81 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@btime test(10,.1,10)
55.075 ms (2176212 allocations: 61.20 MiB)
That is, the Julia code is much slower. I'm absolutely sure I'm doing something wrong here, since the allocations shouldn't be that huge. I tried to optimize this as much as I could, but I started learning Julia only a couple of days ago and couldn't get much further. Any tips on how to improve performance here are hugely appreciated.
Upvotes: 4
Views: 1151
Reputation: 6492
Just for comparison, I want to add an optimized Python (Numba) solution. The solution by Oscar Smith still looks a bit slow, but that may also be the result of a slower processor.
Comparisons like this one, optimized Numba vs. optimized Julia code, are a really good way to learn how to write efficient Julia.
Code
import numpy as np
import numba as nb

def Test_orig(x0, dx, i):
    q = np.arange(-x0, x0, dx)
    z = np.zeros(shape=(i, len(q)), dtype=np.complex128)
    B = np.exp(-1j*q)
    for s in range(1, i):
        A = np.exp(-(q - q[:, np.newaxis])**2) * np.exp(-1j*s*q)
        z[s] = B * np.sum(A, axis=1) * dx
    return z

@nb.njit(fastmath=True, parallel=True)
def Test_nb(x0, dx, n):
    q = np.arange(-x0, x0, dx)
    B = np.exp(-1j*q)
    z = np.zeros(shape=(n, q.shape[0]), dtype=np.complex128)
    # Precompute the Gaussian kernel exp(-(q_i - q_j)**2) once
    TMP_1 = np.empty(shape=(q.shape[0], q.shape[0]), dtype=np.complex128)
    for i in nb.prange(q.shape[0]):
        for j in range(q.shape[0]):
            TMP_1[i, j] = np.exp(-(q[i] - q[j])**2)
    # Precompute the complex phases exp(-1j*s*q_j) for every s
    TMP_2 = np.empty(shape=(n, q.shape[0]), dtype=np.complex128)
    for s in nb.prange(1, n):
        for j in range(q.shape[0]):
            TMP_2[s, j] = np.exp(-1j*s*q[j])
    for s in nb.prange(1, n):
        for i in range(q.shape[0]):
            acc = 0.j
            for j in range(q.shape[0]):
                acc += TMP_1[i, j] * TMP_2[s, j]
            z[s, i] = acc * B[i] * dx
    return z
Timings
%timeit res_1=Test_orig(10,.1,10)
2.64 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res_2=Test_nb(10,.1,10)
192 µs ± 7.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#@nb.njit(fastmath=True,parallel=False,cache=True) -> single threaded, caching possible
472 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#@nb.njit(fastmath=False,parallel=False,cache=True) -> without fastmath compiler flag
643 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
res_1=Test_orig(10,.1,10)
res_2=Test_nb(10,.1,10)
np.allclose(res_1,res_2)
True
Edit: Julia Timings (function from Oscar Smith)
@benchmark test(10,.1,10)
BenchmarkTools.Trial:
memory estimate: 1.16 MiB
allocs estimate: 1014
--------------
minimum time: 643.072 μs (0.00% GC)
median time: 662.869 μs (0.00% GC)
mean time: 729.127 μs (4.15% GC)
maximum time: 34.947 ms (97.25% GC)
--------------
samples: 6841
evals/sample: 1
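The remaining gap to the parallel Numba version is threading. A hypothetical threaded Julia analogue of the prange loops (my sketch, not from either answer; it assumes Julia is started with multiple threads, e.g. julia -t 4) could look like:

using Base.Threads

# Sketch of a threaded analogue of Test_nb: precompute the two tables,
# then parallelize the outer loop over s.
function test_threads(x0, dx, n)
    q  = collect(-x0:dx:x0-dx)
    m  = length(q)
    B  = exp.(-1im .* q)
    T1 = [exp(-(q[i] - q[j])^2) for i in 1:m, j in 1:m]  # real Gaussian kernel
    T2 = [exp(-1im*s*q[j]) for s in 1:n-1, j in 1:m]     # complex phases
    z  = zeros(ComplexF64, n, m)
    @threads for s in 1:n-1
        for i in 1:m
            acc = zero(ComplexF64)
            @inbounds for j in 1:m
                acc += T1[i, j] * T2[s, j]
            end
            z[s+1, i] = acc * B[i] * dx  # row 1 stays zero, matching the Python layout
        end
    end
    z
end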
Upvotes: 4
Reputation: 6398
There were several things initially holding this back. The first and biggest was the comprehension in the inner loop, which was allocating a ton, as you correctly diagnosed; fixing it brought the time to 16 ms. Once that was solved, the biggest issue was the way the same complex exponentials were repeatedly recomputed (1.6 ms).
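Neither intermediate version is shown below, but a rough reconstruction of what fixing both issues looks like, before the linear-algebra rewrite (my sketch, not the answer's actual code):

# Sketch: the comprehension is replaced by a scalar accumulator, and the
# complex phases exp(-im*s*q) are hoisted out of the innermost loop.
function test_loops(x0, dx, n)
    qs = collect(-x0:dx:x0-dx)
    m  = length(qs)
    E  = exp.(-1im .* qs)      # E[j]^s == exp(-1im*s*qs[j])
    z  = zeros(ComplexF64, n, m)
    for s in 1:n-1
        Es = E .^ s            # phases for this s, computed once
        for i in 1:m
            acc = zero(ComplexF64)
            @inbounds for j in 1:m
                acc += exp(-(qs[i] - qs[j])^2) * Es[j]
            end
            z[s, i] = E[i] * acc * dx
        end
    end
    z
end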
Once both of those issues were addressed, recognizing the linear algebra in the problem both made the code cleaner and let Julia call into BLAS for more efficient matrix multiplication (900 μs). Here is the updated code, which is about a factor of 3 faster than the equivalent NumPy:
using LinearAlgebra

function test(x0,dx,n)
    Q = collect(-x0:dx:x0-dx)
    A = complex.(exp.(-(Q .- transpose(Q)).^2))  # complex so the multiply below is one BLAS call
    B = exp.(-im .* transpose(1:n-1) .* Q)       # length(Q) × (n-1) phase matrix
    z = Matrix{ComplexF64}(undef, n-1, length(Q))
    for (i, q) in enumerate(Q)
        total = transpose(@view A[:, i]) * B     # row vector times matrix in one BLAS call
        z[:, i] = dx*exp(-q*im) .* total
    end
    z
end
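A quick sanity check on the shape, with the same arguments as the timings: this version keeps only the nonzero rows, so

z = test(10, .1, 10)
size(z)  # (9, 200): one row per s = 1:9; the all-zero row of the original is dropped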
In response to the Numba attempt, here is one more version that gets down to 725 μs by working around a performance bug where multiplying a real array by a complex one first converts the real array to complex. Manually coding the multiplication yields this:
function test(x0,dx,n)
    Q = collect(-x0:dx:x0-dx)
    A = exp.(-(Q .- transpose(Q)).^2)    # kept real this time
    B = exp.(-im .* transpose(1:n-1) .* Q)
    rB = real.(B)
    iB = imag.(B)
    z = Matrix{ComplexF64}(undef, n-1, length(Q))
    for (i, q) in enumerate(Q)
        At = transpose(@view A[:, i])
        total = (At * rB) .+ (At * iB).*im   # two real BLAS products instead of one complex one
        z[:, i] = dx*exp(-q*im) .* total
    end
    z
end
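The trick is sound because for a real matrix A and complex B, A*B == A*real.(B) .+ (A*imag.(B)).*im, so one complex multiply becomes two real BLAS multiplies. A quick check:

A = rand(4, 4)              # real matrix
B = rand(ComplexF64, 4, 3)  # complex matrix
A * real.(B) .+ (A * imag.(B)) .* im ≈ A * B  # true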
Upvotes: 8