Reputation: 163
I would like to convert some of my code for a scattering problem from MATLAB and Python to Julia. But after a day of reading the docs, searching on Phind, and reading posts like Parallel computing in Julia - running a simple for-loop on multiple cores, I still cannot understand Julia's parallelism, nor MATLAB's for that matter.
TL;DR: Let's forget about the MATLAB vs Julia comparison. Someone please explain to me why @distributed is 3 to sometimes 4 times faster than @threads here. If you want to know what led me to this question, or want to copy the following code, please read the rest. Thank you.
In Julia's Manual we can see:
Julia's multi-threading provides the ability to schedule Tasks simultaneously on more than one thread or CPU core, sharing memory. This is usually the easiest way to get parallelism on one's PC or on a single large multi-core server. Julia's multi-threading is composable. When one multi-threaded function calls another multi-threaded function, Julia will schedule all the threads globally on available resources, without oversubscribing. https://docs.julialang.org/en/v1/manual/parallel-computing/
If I understand it correctly, all it says is that when there is only "1" server, parallelism can be achieved by means of threads. So if someone wants to convert a MATLAB parfor loop to its Julia equivalent, then @threads will do the job, right? Wrong!
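A quick aside: for @threads to matter at all, Julia must be started with more than one thread, e.g. with julia -t 8 or JULIA_NUM_THREADS=8. A quick REPL check (8 here because my CPU has 8 logical threads):

julia> Threads.nthreads()   # number of threads this session was started with
8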
Imagine I want to convert this Matlab (2023a) code to Julia (1.9.2) code:
tic
k = zeros(10000, 100, 100);
for i = 1:10000
    k(i, :, :) = rand(100, 100)*rand(100, 100)* ...
        rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
        rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
        rand(100, 100)*rand(100, 100);
end
toc
The equivalent is (roughly) the following (@fastmath makes the performance worse):
function f()
    k = zeros(10000, 100, 100)
    @simd for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))
    end
    return k
end
@time t = f();
I am not using @btime, because after setting BenchmarkTools.DEFAULT_PARAMETERS.samples = 1 it crashes my Jupyter Lab (and also VS Code without Jupyter) for some reason.
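For reference, the default BenchmarkTools workflow, without touching DEFAULT_PARAMETERS, would be:

using BenchmarkTools
@btime f();       # runs multiple samples and reports the minimum time
@benchmark f()    # full statistics: median, mean, allocations, GC time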
Anyway, my expectation was to get roughly the same execution time after compiling the Julia code once, even though all I hear and see is that Julia's "for" loop is faster. That turned out to be wrong: the MATLAB version is noticeably more performant on my Intel Core i7-8550U (4 cores, 8 logical threads), roughly 15~20% faster. So I looked at my CPU usage and saw that MATLAB applies some form of parallelism automatically, using ~90% of the CPU. I told myself that maybe the MATLAB JIT is clever and I am comparing apples to oranges, so I changed the Julia code to the following:
function f()
    k = zeros(10000, 100, 100)
    Threads.@threads for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))
    end
    return k
end
@time t = f();
And I got almost the same performance! "OK, that is it", I told myself: MATLAB uses parallelism under the hood automatically, Julia doesn't, and that is fine. But I was wrong. So I changed the MATLAB code to
tic
k = zeros(10000, 100, 100);
parfor i = 1:10000
    k(i, :, :) = rand(100, 100)*rand(100, 100)* ...
        rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
        rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
        rand(100, 100)*rand(100, 100);
end
toc
and it became 4 times faster than before! But wait, wasn't MATLAB already using parallelism? Why is this code faster, then? We will come back to that later; let's talk about Julia for now. I had just followed the documentation and written parallel code that I thought would be as performant as the MATLAB code, yet MATLAB ended up 4 times faster. What was wrong? After a bit of digging, I changed the Julia code to the following:
using Distributed, SharedArrays

addprocs(8)

function f()
    k = SharedArray{Float64}(10000, 100, 100)
    @sync @distributed for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))*
            rand(Float64, (100, 100))*rand(Float64, (100, 100))
    end
    return k
end
@time t = f();
And I got the same performance as the parfor MATLAB code. OK, so far so good: when one wants parallelism, one should use @distributed then. But
what is up with the Julia documentation then? Why does it sound as if, when we have one server/PC, @threads would somehow be equivalent to @distributed? I am not the only one confused by these macros, and all the suggestions I see on the internet amount to trying both and seeing which one works better for a given problem! Here are some examples: https://www.reddit.com/r/Julia/comments/cynbbb/question_multiprocessing_vs_multithreading/ https://discourse.julialang.org/t/overhead-of-distributed-module-vs-threads/21588 This makes no sense to me, though. There should be one proper way of doing things for a given problem. I understand that more eccentric macros like @fastmath may need testing, but it makes no sense to me that there is no "proper" way of doing parallelism in Julia. Maybe there is, but I cannot work it out from the documentation.
Why doesn't Julia use this multi-threading under the hood, just like MATLAB does? Is there some language limitation, or is it this way by design, and if so, why? Now, every time I want to convert some non-parallel (without parfor) MATLAB code to Julia, I need to add @threads to get the same performance. Isn't this redundant?
It does not make sense to me that the normal MATLAB loop and the multi-threaded Julia code both use almost 4 cores, yet MATLAB's parfor and the multi-process Julia code outperform them significantly (4 times!) in terms of speed. In both scenarios they are using the same amount of resources (CPU-wise, not RAM), so where does this performance gain come from?! I have 24GB of RAM, so that is not the bottleneck here.
And yes, I literally started learning Julia yesterday, so go easy on me.
Upvotes: 2
Views: 2852
Reputation: 163
My original post became super long, therefore I will add some of my insights here, thanks to @DNF and @Przemyslaw Szufel, who helped me learn some important aspects of this language.
@time begin
    A = Matrix{Float64}(undef, 100, 100)
    for i = 1:10
        A .= rand(100, 100)
    end
end
0.000797 seconds (32 allocations: 860.047 KiB)
Whereas
using Random

@time begin
    A = Matrix{Float64}(undef, 100, 100)
    for i = 1:10
        rand!(A)
    end
end
0.000138 seconds (2 allocations: 78.172 KiB)
It may be obvious to Julia users, but coming from MATLAB, with the exception of sparse matrices, I cannot think of any similar scenario that works this way. For example, in MATLAB
tic
A = zeros(100, 100);
for i = 1:10
    A = rand(100, 100);
end
toc
rand(100, 100) immediately destroys zeros(100, 100) in memory and replaces it with the new values. It doesn't allocate new space in memory; it overwrites it (and yes, there is no need for the zeros(100, 100) preallocation here). So the very concept of preallocation, even though it exists in both languages, is in practice widely different. I wish there were a clearer hint about this in the documentation.
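A minimal sketch of the distinction that tripped me up: plain = rebinds the name to a freshly allocated array, while .= and rand! reuse the existing memory:

using Random

A = zeros(100, 100)
B = A                  # B and A are two names for the same array
A .= rand(100, 100)    # in-place copy into A's memory (the rand call still allocates a temporary)
@assert B === A        # still the same array
rand!(A)               # in-place and allocation-free: overwrites A directly
A = rand(100, 100)     # rebinding: a brand-new array; B still points at the old one
@assert B !== A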
using Distributed, SharedArrays
@everywhere using Random, LinearAlgebra

function distribute()
    k = SharedArray{Float64}(100, 100, 10000)
    A = Matrix{Float64}(undef, 100, 100)
    B = Matrix{Float64}(undef, 100, 100)
    Y = Matrix{Float64}(undef, 100, 100)
    @inbounds @sync @distributed for i = 1:10000
        rand!(A)
        for j = 1:9
            mul!(Y, A, rand!(B))
            A .= Y
        end
        k[:, :, i] = A
    end
    return k
end

distribute();   # warm-up run to compile first
@time t = distribute();
2.671872 seconds (2.11 k allocations: 325.055 KiB)
function thread()
    k = Array{Float64}(undef, 100, 100, 10000)
    # note: A, B and Y are shared by all threads here
    A = Matrix{Float64}(undef, 100, 100)
    B = Matrix{Float64}(undef, 100, 100)
    Y = Matrix{Float64}(undef, 100, 100)
    @inbounds Threads.@threads for i = 1:10000
        rand!(A)
        for j = 1:9
            mul!(Y, A, rand!(B))
            A .= Y
        end
        k[:, :, i] = A
    end
    return k
end

thread()   # warm-up run to compile first
@time t = thread();
8.549158 seconds (59 allocations: 763.174 MiB, 0.07% gc time)
Even after solving the allocations issue, @distributed still beats @threads! I also tried using MKL, and the results were interesting:
@everywhere using MKL
@time t = thread();
4.133106 seconds (633 allocations: 763.212 MiB, 0.29% gc time)
@time t = distribute();
4.689101 seconds (2.17 k allocations: 326.586 KiB)
I don't know how to make sense of that: MKL degrades @distributed speed significantly. Nevertheless, getting roughly the same speed in both scenarios gave me some peace. I also tried MersenneTwister, but I couldn't see any real difference in terms of speed. The distribute function without MKL beats MATLAB easily, by a margin of 30% or so. With MKL, however, I see no difference, which I probably shouldn't, and which is a good thing.
Edit 1:
It turns out that adding LinearAlgebra.BLAS.set_num_threads(1) makes a huge difference for the thread function (see @PrzemyslawSzufel's comment below):
LinearAlgebra.BLAS.set_num_threads(1)
@time k = thread();
2.455945 seconds (344 allocations: 763.190 MiB, 0.04% gc time)
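For anyone else hitting this: BLAS keeps its own thread pool that can be inspected and resized at runtime, so the fix is one line:

using LinearAlgebra
BLAS.get_num_threads()    # size of BLAS's own thread pool, independent of Threads.nthreads()
BLAS.set_num_threads(1)   # one BLAS thread, so the @threads loop owns all the cores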
Upvotes: 0
Reputation: 42244
In computing (using Julia syntax in the points below) you have the following parallelism types:

- @simd - utilize the Single Instruction, Multiple Data (SIMD) feature of a CPU
- @async - coroutines, aka green threads (good for IO parallelization)
- @threads - multithreading; requires setting the JULIA_NUM_THREADS environment variable or the -t parameter when starting Julia
- @distributed - multiprocessing on a single machine or multiple machines; you need to use the -p Julia command line option, the --machine-file option, or the addprocs() command

Now, having this knowledge, I will briefly answer your questions:
Multi-threading is on a single machine.
Multi-processing can occur on a single machine or multiple machines.
Multiprocessing on multiple machines is called distributed computing. Julia has basically the same API for multiprocessing and distributed computing; both are available via using Distributed
Finally note that in multi-processing a process can be single or multi-threaded.
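To make these concrete, here is a minimal sketch of each flavor, assuming Julia was started with e.g. julia -t 4 and that you are fine adding 4 worker processes:

using Distributed
addprocs(4)                      # 4 worker processes (or start Julia with -p 4)

# @simd: hint that the loop iterations may be reordered/vectorized
function simd_sum(x)
    s = 0.0
    @simd for v in x
        s += v
    end
    return s
end

# @async: a green thread; returns immediately, good for overlapping IO
t = @async sleep(1)
wait(t)

# @threads: split iterations across the threads Julia was started with
counts = zeros(Int, Threads.nthreads())
Threads.@threads for i in 1:1000
    counts[Threads.threadid()] += 1   # each thread writes only its own slot
end

# @distributed: split iterations across worker processes, here with a (+) reduction
total = @distributed (+) for i in 1:1000
    i
end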
The above concepts are not specific to Julia in any way (although Julia has nice APIs to achieve such functionality). You definitely should have a look at: https://en.wikipedia.org/wiki/Thread_(computing)#Threads_vs_processes
Julia's BLAS for linear algebra uses multi-threading independently of the Julia configuration; try LinearAlgebra.BLAS.get_num_threads(). Some other Julia libraries also use multi-threading, and for those it depends on the value of the JULIA_NUM_THREADS environment variable or the -t parameter.
So several libraries use it under the hood.
However, writing multi-threaded code (in any programming language, Julia included) is difficult due to possible parallelization problems such as deadlocks, data races, or false sharing. It is not auto-magical and requires several hours spent on understanding the concepts.
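For example, here is a sketch of the classic data race, plus one way to fix it with an atomic counter:

c = 0
Threads.@threads for i in 1:1_000_000
    global c += 1    # unsynchronized read-modify-write: the final value is usually far below 1_000_000
end

a = Threads.Atomic{Int}(0)
Threads.@threads for i in 1:1_000_000
    Threads.atomic_add!(a, 1)    # atomic increment: a[] is exactly 1_000_000
end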
Julia code can outperform MATLAB code. However, writing performant code is not easy. Perhaps start by testing for yourself all the tips from this list: https://docs.julialang.org/en/v1/manual/performance-tips/
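To pick one classic tip from that page: hot loops belong inside functions, because untyped globals defeat the compiler:

x = rand(10_000)

# Slow: a loop in global scope, accumulating into an untyped global
s = 0.0
for v in x
    global s += v
end

# Fast: the same loop inside a function is type-stable and compiles to tight code
function total(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end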
For the code that you are running here, Threads.@threads seems to be the best approach (assuming that you plan to use around 8-16 threads).
If you run this long enough, your main problem is allocations. You could use in-place multiplication and hence allocate much less memory (and have fewer garbage collector events):
using Random, LinearAlgebra

begin
    Y = Matrix{Float64}(undef, 100, 100)
    A = rand(100, 100)
    B = Matrix{Float64}(undef, 100, 100)
    for i = 1:10
        rand!(B)
        mul!(Y, A, B)    # in-place: Y = A*B without allocating a result matrix
        if i < 10
            A .= Y
        end
    end
    Y
end;
The code above has (6 allocations: 234.52 KiB), while the normal multiplication has (38 allocations: 1.45 MiB). This will make a noticeable difference when the workload is huge.
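For completeness, a sketch (the thread_inplace name is mine) of combining this with Threads.@threads; the buffers live inside the loop body because one shared A/B/Y would be a data race across threads:

using Random, LinearAlgebra

function thread_inplace()
    k = Array{Float64}(undef, 100, 100, 10000)
    Threads.@threads for i = 1:10000
        # per-iteration buffers: threads never share them
        A = rand(100, 100)
        B = Matrix{Float64}(undef, 100, 100)
        Y = Matrix{Float64}(undef, 100, 100)
        for j = 1:9
            mul!(Y, A, rand!(B))
            A .= Y
        end
        k[:, :, i] = A
    end
    return k
end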
"Just tell me why distributed is 4 times faster than threads in Julia"
OK, this is tricky. There are a few possible reasons (and most likely they combine for the observed behavior):

- Garbage collection: each worker process runs its own garbage collector, while all threads share one, so allocation-heavy code is handled better by @distributed than Threads.@threads. Additionally, from my empirical observations, beyond 8 jobs Julia processes seem to scale better than threads. That is why you should consider in-place operations (LinearAlgebra.mul!) as suggested by myself and DNF.
- LinearAlgebra.BLAS is not composable with the threading mechanism. This means that 8 processes end up utilizing e.g. a total of 32 threads (depending on your machine). This, combined with the garbage collector, makes the multi-threaded and distributed scenarios harder to compare.

In conclusion, for garbage-collector-heavy jobs you will generally see better performance with multi-processing (using Distributed) than with multi-threading.
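One way to check whether the garbage collector is the culprit is to look at the gctime field returned by @timed, e.g. with the asker's thread function:

stats = @timed thread()
stats.time      # total elapsed seconds
stats.gctime    # seconds spent in garbage collection
stats.bytes     # bytes allocated during the call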
MATLAB is using MKL and MersenneTwister. When benchmarking, it might be worth writing the Julia code the same way.
Multiplying two rands (this uses Xoshiro256++ and OpenBLAS) yields the following result:
julia> @btime rand(100,100)*rand(100,100);
118.500 μs (6 allocations: 234.52 KiB)
Let's try MersenneTwister and MKL (see also https://github.com/JuliaLinearAlgebra/MKL.jl):
julia> using MKL, Random
julia> mt = MersenneTwister();
julia> @btime rand($mt, 100,100)*rand($mt, 100,100);
68.900 μs (6 allocations: 234.52 KiB)
It is worth knowing that switching this code to MKL and MersenneTwister is a 2x speed-up.
Upvotes: 4