Paradoxy

Reputation: 163

Understanding Julia multi-thread / multi-process design

I would like to convert some of my code for a scattering problem from MATLAB and Python to Julia. But after a day of reading the docs, searching on Phind, and reading posts like Parallel computing in Julia - running a simple for-loop on multiple cores, I still cannot understand Julia's parallelism (nor MATLAB's, for that matter).

TL;DR: Let's forget about the MATLAB vs Julia comparison. Could someone please explain why @distributed is 3 to sometimes 4 times faster than @threads here? If you want to know what led me to this question, or want to copy the following code, please read the rest. Thank you.

threads vs distributed

In Julia's Manual we can see:

Julia's multi-threading provides the ability to schedule Tasks simultaneously on more than one thread or CPU core, sharing memory. This is usually the easiest way to get parallelism on one's PC or on a single large multi-core server. Julia's multi-threading is composable. When one multi-threaded function calls another multi-threaded function, Julia will schedule all the threads globally on available resources, without oversubscribing. https://docs.julialang.org/en/v1/manual/parallel-computing/

If I understand it correctly, all it says is that when there is only one server, parallelism can be achieved by means of threads. So if someone wants to convert a MATLAB parfor loop to its Julia equivalent, then @threads will do the job, right? Wrong!
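One detail worth keeping in mind throughout: Threads.@threads can only use the threads the session was started with (e.g. `julia -t 8` or the JULIA_NUM_THREADS environment variable). A one-line sanity check:

```julia
# Threads.@threads runs serially unless Julia was launched with extra
# threads, e.g. `julia -t auto` or JULIA_NUM_THREADS=8.
println("running with ", Threads.nthreads(), " thread(s)")
```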

Imagine I want to convert this Matlab (2023a) code to Julia (1.9.2) code:

tic
k = zeros(10000, 100, 100);
for i = 1:10000 
    k(i, :, :) = rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)*...
    rand(100, 100)*rand(100, 100);
end
toc

The equivalent is (roughly) the following (@fastmath makes the performance worse):

function f()
    k = zeros(10000, 100, 100)
    @simd for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))
    end
    return k
end

@time t = f();

I am not using @btime because, after setting BenchmarkTools.DEFAULT_PARAMETERS.samples = 1, it crashes my Jupyter Lab (and also VS Code without Jupyter) for some reason.

Anyway, my expectation was roughly the same execution time after compiling the Julia code once, even though all I keep hearing is that Julia's for loops are faster. That turned out to be wrong: the MATLAB version is noticeably more performant on my Intel Core i7-8550U (4 cores, 8 logical threads), roughly 15~20% faster. So I looked at my CPU and saw that MATLAB applies some form of parallelism automatically, using ~90% of the CPU. I told myself that maybe the MATLAB JIT is clever and I am comparing apples to oranges, so I changed the Julia code to the following:

function f()
    k = zeros(10000, 100, 100)
    Threads.@threads for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))
    end
    return k
end

@time t = f();

And I got almost the same performance! "OK, that's it," I told myself: MATLAB uses parallelism under the hood automatically, and Julia doesn't. Fine. But I was wrong. So I changed the MATLAB code to

tic
k = zeros(10000, 100, 100);
parfor i = 1:10000 
    k(i, :, :) = rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)* ...
    rand(100, 100)*rand(100, 100)*rand(100, 100)*...
    rand(100, 100)*rand(100, 100);
end
toc

and it became 4 times faster than before! But wait, wasn't MATLAB already using parallelism? Why is this code faster, then? We will come back to that later; let's talk about Julia for now. So I had just followed the documentation and written parallel code that I thought would be as performant as the MATLAB code, but MATLAB ended up being 4 times faster in the end. What was wrong? After a bit of digging, I changed the Julia code to the following:

using Distributed
addprocs(8)
@everywhere using SharedArrays

function f()
    k = SharedArray{Float64}(10000, 100, 100)
    @sync @distributed for i = 1:10000
        @inbounds k[i, :, :] = rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))*
                               rand(Float64, (100, 100))*rand(Float64, (100, 100))
    end
    return k
end

@time t = f();

And I got the same performance as the parfor MATLAB code. OK, so far so good: when one wants to achieve parallelism, one should use @distributed then. But

  1. what is up with the Julia documentation then? Why does it sound as if, when we have one server/PC, @threads would somehow be equivalent to @distributed? I am not the only one confused by these macros, and all the suggestions I see on the internet amount to trying both of them to see which one works better for a given problem! Here are some examples: https://www.reddit.com/r/Julia/comments/cynbbb/question_multiprocessing_vs_multithreading/ https://discourse.julialang.org/t/overhead-of-distributed-module-vs-threads/21588 This makes no sense, though. There should be one proper way of doing things for a given problem. I understand that more eccentric macros like @fastmath may need testing, but it makes no sense to me that there is no "proper" way of doing parallelism in Julia. Maybe there is, but I cannot understand it from the documentation.

  2. Why doesn't Julia use this multi-threading under the hood, just like MATLAB does? Is there some kind of language limitation, or is it like this by design, and if so, why? Now, every time I want to convert some non-parallel (without parfor) MATLAB code to Julia, I need to add @threads to get the same performance. Isn't this redundant?

  3. It does not make sense to me that a normal MATLAB loop and multi-threaded Julia code both use almost 4 cores, yet MATLAB's parfor and multi-process Julia code outperform them significantly (4 times!) in terms of speed. In both scenarios they use the same amount of resources (CPU-wise, not RAM), so where does this performance gain come from?! I have 24 GB of RAM, so that is not a bottleneck here.

And yes, I literally started learning Julia yesterday, so go easy on me.

Upvotes: 2

Views: 2852

Answers (2)

Paradoxy

Reputation: 163

My original post became super long, so I will add some of my insights here, thanks to @DNF and @Przemyslaw Szufel, who helped me learn some important aspects of this language.

  1. There is a huge difference between MATLAB and Julia which is not explicitly mentioned at https://docs.julialang.org/en/v1/manual/noteworthy-differences/: in Julia, a call like rand(100, 100) allocates a new array in memory instead of overwriting existing data. Here is an example:
@time begin
    A = Matrix{Float64}(undef, 100, 100)
    for i = 1:10
        A .= rand(100, 100)
    end
end
0.000797 seconds (32 allocations: 860.047 KiB)

Whereas

using Random

@time begin
    A = Matrix{Float64}(undef, 100, 100)
    for i = 1:10
        rand!(A)
    end
end
0.000138 seconds (2 allocations: 78.172 KiB)

It may be obvious to Julia users, but coming from MATLAB, with the exception of sparse matrices, I cannot think of any similar scenario that works that way. For example, in MATLAB

tic
A = zeros(100, 100);
for i = 1:10
    A = rand(100, 100);
end
toc

rand(100, 100) will immediately destroy zeros(100, 100) in memory and replace it with new values. It doesn't allocate new space in memory; it overwrites it (and yes, there is no need for the zeros(100, 100) preallocation here). So the very concept of preallocation, even though it exists in both languages, is in practice widely different. I wish there were a clearer hint in the documentation.
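The difference is easy to verify with `@allocated` (a minimal sketch; exact byte counts vary by Julia version):

```julia
using Random

A = Matrix{Float64}(undef, 100, 100)

# rand(100, 100) materializes a brand-new array (~78 KiB of Float64s),
# which the broadcast then copies into A:
println(@allocated A .= rand(100, 100))

# rand!(A) fills the existing buffer in place: no new array,
# so this should report ~0 bytes once compiled.
println(@allocated rand!(A))
```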

  2. Both the MATLAB JIT and the Julia compiler can potentially convert a normal for loop to a multi-threaded one under the hood. MATLAB seems to be more aggressive in this respect, because it uses MKL. Nevertheless, I changed my Julia code according to the suggestions given here to the following:
using Distributed
addprocs(8)
@everywhere using SharedArrays, Random, LinearAlgebra

function distribute()
    k = SharedArray{Float64}(100, 100, 10000)
    A = Matrix{Float64}(undef, 100, 100)
    B = Matrix{Float64}(undef, 100, 100)
    Y = Matrix{Float64}(undef, 100, 100)
    @inbounds @sync @distributed for i = 1:10000
        rand!(A)
        for j = 1:9
            mul!(Y, A, rand!(B))
            A .= Y
        end
        k[:, :, i] = A
    end
    return k
end

distribute();
@time t = distribute();
2.671872 seconds (2.11 k allocations: 325.055 KiB)
function thread()
    k = Array{Float64}(undef, 100, 100, 10000)
    A = Matrix{Float64}(undef, 100, 100)
    B = Matrix{Float64}(undef, 100, 100)
    Y = Matrix{Float64}(undef, 100, 100)
    @inbounds Threads.@threads for i = 1:10000
        rand!(A)
        for j = 1:9
            mul!(Y, A, rand!(B))
            A .= Y
        end
        k[:, :, i] = A
    end
    return k
end
thread()
@time t = thread();
8.549158 seconds (59 allocations: 763.174 MiB, 0.07% gc time)
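One thing worth flagging: in the thread function above, all threads write into the same A, B, Y scratch buffers, which is a data race (the @distributed version is safe because each worker receives its own copy of the captured variables). A race-free sketch that keeps the in-place style gives each chunk of the range its own private buffers; the name `thread_safe` and the chunking scheme are mine, not from the code above:

```julia
using Random, LinearAlgebra

function thread_safe()
    k = Array{Float64}(undef, 100, 100, 10000)
    # split 1:10000 into one contiguous chunk per thread
    chunks = collect(Iterators.partition(1:10000, cld(10000, Threads.nthreads())))
    Threads.@threads for chunk in chunks
        # per-chunk scratch buffers => no sharing between threads
        A = Matrix{Float64}(undef, 100, 100)
        B = Matrix{Float64}(undef, 100, 100)
        Y = Matrix{Float64}(undef, 100, 100)
        for i in chunk
            rand!(A)
            for j in 1:9
                mul!(Y, A, rand!(B))   # in-place multiply, as before
                A .= Y
            end
            k[:, :, i] = A
        end
    end
    return k
end
```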

Even after solving the allocation issue, @distributed still beats @threads! I also tried using MKL. The results were interesting:

@everywhere using MKL
@time t = thread();
4.133106 seconds (633 allocations: 763.212 MiB, 0.29% gc time)
@time t = distribute();
4.689101 seconds (2.17 k allocations: 326.586 KiB)

I don't know how to make sense of that; MKL degrades @distributed's speed significantly. Nevertheless, getting roughly the same speed in both scenarios gave me some peace. I also tried MersenneTwister, but I couldn't see any real difference in terms of speed. The distribute function without MKL beats MATLAB easily, by a margin of 30% or so. With MKL, however, I see no difference, which I probably shouldn't, which is a good thing.

Edit 1:

It turns out that adding LinearAlgebra.BLAS.set_num_threads(1) makes a huge difference for the thread function (see @PrzemyslawSzufel's comment below):

LinearAlgebra.BLAS.set_num_threads(1)
@time k = thread();
2.455945 seconds (344 allocations: 763.190 MiB, 0.04% gc time)

Upvotes: 0

Przemyslaw Szufel

Reputation: 42244

In computing you have the following parallelism types (using Julia syntax in the points below):

  • @simd - utilize the Single Instruction, Multiple Data (SIMD) feature of a CPU
  • @async - coroutines aka green threads (good for IO parallelization)
  • @threads - multi-threading; requires setting the JULIA_NUM_THREADS environment variable or the -t parameter when starting Julia
  • @distributed - multiprocessing on a single machine or multiple machines; you need to use the -p Julia command line option, the --machine-file option, or the addprocs() command
  • GPU computing - perhaps start with CUDA.jl
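A quick way to see which of these resources the current session actually has available (a sketch; the numbers depend on how you started Julia):

```julia
using Distributed

# Threads come from `julia -t N` / JULIA_NUM_THREADS;
# workers come from `julia -p N` or a later addprocs(N).
println("threads: ", Threads.nthreads())
println("workers: ", nworkers())   # 1 means no extra processes yet
```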

Now, having this knowledge, let me briefly answer your questions:

Ad 1.

Multi-threading is on a single machine.

Multi-processing can occur on a single machine or multiple machines.

Multiprocessing on multiple machines is called distributed computing. Julia has basically the same API for multiprocessing and distributed computing - both are available via using Distributed.

Finally note that in multi-processing a process can be single or multi-threaded.

The above concepts are not specific to Julia in any way (although Julia has nice APIs to achieve such functionality). You definitely should have a look at: https://en.wikipedia.org/wiki/Thread_(computing)#Threads_vs_processes

Ad 2.

Julia's BLAS for linear algebra uses multi-threading independently of the Julia configuration; try LinearAlgebra.BLAS.get_num_threads(). Some other Julia libraries use multi-threading too, and this depends on the value of the JULIA_NUM_THREADS environment variable or the -t parameter. So several libraries use it under the hood. However, writing multi-threaded code (in any programming language, Julia included) is difficult due to possible parallelization problems such as deadlocks, race conditions, or false sharing. It is not auto-magical and requires several hours spent on understanding the concepts.
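The BLAS thread pool can be inspected and pinned directly; this is the knob that matters when mixing Threads.@threads with matrix multiplication:

```julia
using LinearAlgebra

# BLAS has its own thread pool, independent of `julia -t`:
println(BLAS.get_num_threads())

# When each Julia thread already runs its own mul!, pinning BLAS to one
# thread avoids oversubscription:
BLAS.set_num_threads(1)
```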

Ad. 3.

Julia code can outperform MATLAB code. However, writing performant code is not easy. Perhaps start by testing yourself against everything on this list: https://docs.julialang.org/en/v1/manual/performance-tips/ For the code that you are running here, Threads.@threads seems to be the best approach (assuming that you plan to use around 8-16 threads).

Note on the code

If you run this long enough, your main problem is allocations. You could use in-place multiplication and hence allocate much less memory (and have fewer garbage collector events).

using Random, LinearAlgebra

begin
    Y = Matrix{Float64}(undef, 100, 100)
    A = rand(100, 100)
    B = Matrix{Float64}(undef, 100, 100)
    for i = 1:10
        rand!(B)
        mul!(Y, A, B)
        if i < 10
            A .= Y
        end
    end
    Y
end;

The code above makes (6 allocations: 234.52 KiB), while the normal multiplication makes (38 allocations: 1.45 MiB). This will make a noticeable difference when the workload is huge.

EDIT

Just tell me why distributed is 4 time faster than threads in Julia

OK, this is tricky. There are a few possible reasons (and most likely they combine to produce the observed behavior):

  • Your code is performing a huge number of allocations, which kicks in the garbage collector. It is always easier to perform garbage collection on 8 separate, disconnected processes than on a single multi-threaded one. When lots of garbage collection is involved, it might be better to use @distributed than Threads.@threads. Additionally, from my empirical observations, beyond 8 jobs Julia processes seem to scale better than threads. That is why you should consider in-place operations (LinearAlgebra.mul!), as suggested by myself and DNF.
  • As noted by myself and DNF, LinearAlgebra.BLAS is not composable with the threading mechanism. This means that 8 processes end up utilizing e.g. a total of 32 threads (depending on your machine). This, combined with the garbage collector, makes the multi-threaded and distributed scenarios harder to compare.

In conclusion, for garbage-collector-heavy jobs you will generally see better performance with multi-processing (using Distributed) than with multi-threading.
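One way to check whether a workload is in this GC-heavy regime is `@timed`, which reports how much of the run was spent in garbage collection (a sketch with a deliberately allocation-heavy loop):

```julia
stats = @timed begin
    s = 0.0
    for i in 1:1000
        s += sum(rand(100, 100))   # fresh 100×100 allocation every pass
    end
    s
end
# A large GC fraction suggests in-place operations (or Distributed) will help.
println("GC fraction: ", round(stats.gctime / stats.time; digits = 3))
```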

EDIT 2

MATLAB uses MKL and MersenneTwister. When benchmarking, it might be worth writing the Julia code the same way.

Multiplying two rands (which uses Xoshiro256++ and OpenBLAS) yields the following result:

julia> @btime rand(100,100)*rand(100,100);
  118.500 μs (6 allocations: 234.52 KiB)

Let's try MersenneTwister and MKL (see also https://github.com/JuliaLinearAlgebra/MKL.jl ):

julia> using MKL, Random

julia> mt = MersenneTwister();

julia> @btime rand($mt, 100,100)*rand($mt, 100,100);
  68.900 μs (6 allocations: 234.52 KiB)

It is worth knowing that changing this code to MKL and MT is a 2x speed-up.

Upvotes: 4
