math_lover

Reputation: 956

Julia threadsafe loop parallelism for matrix construction

My Julia for loop is of the "embarrassingly parallel" form

M = Array{Float64}(undef, 200, 100, 100)

for i in 1:200
    M[i, :, :] = construct_row(i)
end

where construct_row(i) is some function returning a 100x100 matrix.

I would like to use the 64 cores available to me (in fact 272 hardware threads, because of hyperthreading) to parallelize this loop by having each i value run on its own thread, so I preface the for loop with Threads.@threads.
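Concretely, the threaded version I have in mind looks like this (a minimal sketch, assuming Julia was started with enough threads, e.g. julia -t 64):

M = Array{Float64}(undef, 200, 100, 100)

Threads.@threads for i in 1:200
    M[i, :, :] = construct_row(i)   # each iteration writes only its own slice
end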

As far as I understand, this is an obviously thread-safe situation and no synchronization is necessary. However, after reading https://discourse.julialang.org/t/poor-performance-while-multithreading-julia-1-0/20325/9, I am concerned about this comment by foobar_lv2:

The most deadly pattern would be e.g. a 4xN matrix, where thread k reads and writes into M[k, :]: Obviously threadsafe for humans, very non-obviously threadsafe for your poor CPU that will run in circles.

So: Will multithreading work in the way I described for my for loop, or am I missing some major issue here?

Upvotes: 2

Views: 689

Answers (1)

Przemyslaw Szufel

Reputation: 42194

Julia matrices are column major, so you will get the best performance when each thread mutates adjacent memory cells. Hence the loop should slice along the last dimension:

M = Array{Float64}(undef, 100, 100, 200)   # the "row" index moved to the last dimension

Threads.@threads for i in 1:200
    @inbounds (@view M[:, :, i]) .= construct_row(i)
end
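Note that Threads.@threads only runs in parallel if Julia itself was started with multiple threads, e.g. via julia -t 64 or the JULIA_NUM_THREADS environment variable; you can verify this in the REPL:

julia> Threads.nthreads()   # reports the number of threads Julia was started with
64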

To illustrate, let's test on a Julia session running 4 threads:

julia> using BenchmarkTools

julia> const m = Array{Float64}(undef, 100, 100, 100);

julia> @btime Threads.@threads for i in 1:100
          (@view m[:,:,i]) .= reshape(1.0:10_000.0,100,100)
       end
  572.500 μs (19 allocations: 2.39 KiB)

julia> @btime Threads.@threads for i in 1:100
          (@view m[i,:,:]) .= reshape(1.0:10_000.0,100,100)
       end
  1.051 ms (21 allocations: 2.45 KiB)

You can see that mutating adjacent cells yields roughly 2x better performance.

With such a huge number of cores you should also consider multiprocessing (via the Distributed standard library) together with SharedArrays. In my experience, above 16 or 32 threads multiprocessing can yield better performance than multithreading. This is, however, case-specific and needs appropriate benchmarking.
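A minimal sketch of that approach, assuming construct_row is defined in a file that every worker can load (the file name construct_row.jl and the worker count are placeholders):

using Distributed
addprocs(32)                               # one worker per core you want to use

@everywhere include("construct_row.jl")    # make construct_row available on every worker
@everywhere using SharedArrays

M = SharedArray{Float64}(100, 100, 200)    # memory shared by all workers on this machine

@sync @distributed for i in 1:200
    @inbounds (@view M[:, :, i]) .= construct_row(i)
end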

Upvotes: 3
