My Julia for loop is of the "embarrassingly parallel" form:

M = Array{Float64}(undef, 200, 100, 100)
for i in 1:200
    M[i, :, :] = construct_row(i)
end
where construct_row(i) is some function returning a 100x100 matrix.
I would like to use the 64 cores available to me (in fact 272 hardware threads, because of hyperthreading) to parallelize this loop by having each value of i run on its own thread. So I prefix the for loop with Threads.@threads.
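Concretely, the threaded version I have in mind is just the same loop with the macro added:

Threads.@threads for i in 1:200
    M[i, :, :] = construct_row(i)  # each iteration writes its own disjoint slice of M
end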
As far as I understand, this is an obviously thread-safe situation: each iteration writes to a disjoint slice of M, so no synchronization is necessary. However, after reading https://discourse.julialang.org/t/poor-performance-while-multithreading-julia-1-0/20325/9, I am concerned about this comment by foobar_lv2:
The most deadly pattern would be e.g. a 4xN matrix, where thread k reads and writes into M[k, :]: Obviously threadsafe for humans, very non-obviously threadsafe for your poor CPU that will run in circles.
So: Will multithreading work in the way I described for my for loop, or am I missing some major issue here?
Julia arrays are column-major, so you get the best performance when each thread mutates adjacent memory cells. That means the index you parallelize over should be the last one, not the first: allocate M as Array{Float64}(undef, 100, 100, 200) and write:
Threads.@threads for i in 1:200
    @inbounds (@view M[:, :, i]) .= construct_row(i)
end
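Note that Threads.@threads only runs in parallel if Julia was started with multiple threads, e.g. by setting the JULIA_NUM_THREADS environment variable (or, on newer Julia versions, the -t/--threads startup flag). A quick check:

julia> Threads.nthreads()    # returns 1 if threading was not enabled at startup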
To illustrate, let's test on a Julia session running 4 threads (@btime comes from the BenchmarkTools package):

julia> using BenchmarkTools

julia> const m = Array{Float64}(undef, 100, 100, 100);
julia> @btime Threads.@threads for i in 1:100
           (@view m[:, :, i]) .= reshape(1.0:10_000.0, 100, 100)
       end
  572.500 μs (19 allocations: 2.39 KiB)
julia> @btime Threads.@threads for i in 1:100
           (@view m[i, :, :]) .= reshape(1.0:10_000.0, 100, 100)
       end
  1.051 ms (21 allocations: 2.45 KiB)
You can see that mutating adjacent cells yields roughly a 2x speedup.
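The column-major layout can also be read off directly from strides, which reports the element step along each dimension; the first dimension is the contiguous one:

julia> strides(m)
(1, 100, 10000)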
With such a large number of cores you should also consider multiprocessing with SharedArrays; see the sketch below. My experience is that above 16 or 32 threads, multiprocessing can yield better performance than multithreading. This is, however, case-specific and needs appropriate benchmarks.
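A minimal sketch of that approach, keeping the column-major-friendly layout. The worker count and the trivial construct_row are placeholder assumptions to adapt to your setup:

using Distributed
addprocs(16)                                    # assumption: 16 workers; tune to your machine
@everywhere using SharedArrays
@everywhere construct_row(i) = fill(Float64(i), 100, 100)   # hypothetical stand-in

M = SharedArray{Float64}(100, 100, 200)         # parallelized index last, as above
@sync @distributed for i in 1:200
    M[:, :, i] = construct_row(i)               # each worker fills a disjoint contiguous slice
end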