Reputation: 1981
I'm solving a PDE using an implicit scheme, which at every time step I can split into two matrices that are then connected by a boundary condition (also at every time step). I'm trying to speed up the process by using multiprocessing to invert both matrices at the same time.
Here's what this looks like in a minimal (non-PDE-solving) example.
using Distributed
using LinearAlgebra
function backslash(N, T, b, exec)
    A = zeros(N, N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i - j) <= 1 && (A[i, j] += -α)
        i == j && (A[i, j] += 3 * α + 1)
    end
    A = Tridiagonal(A)
    a = zeros(N, 4, T)
    if exec == "parallel"
        for i = 1:T
            @distributed for j = 1:2
                a[:, j, i] = A \ b[:, i]
            end
        end
    elseif exec == "single"
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A \ b[:, i]
            end
        end
    end
    return a
end
b = rand(1000, 1000)
a_single = @time backslash(1000, 1000, b, "single");
a_parallel = @time backslash(1000, 1000, b, "parallel");
a_single == a_parallel
Here comes the problem: the last line evaluates to true, with a 6-fold speed-up; however, only a 2-fold speed-up should be possible. What am I getting wrong?
Upvotes: 2
Views: 170
Reputation: 42244
There are two problems with your code:
- the @distributed loop exits prematurely (it returns without waiting for the work to finish),
- the @distributed loop does not collect the results.
Hence you have:
julia> addprocs(2);
julia> sum(backslash(1000, 1000, b, "single")), sum(backslash(1000, 1000, b, "parallel"))
(999810.3418359067, 0.0)
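To see the first problem in isolation: without a reduction, @distributed merely spawns the work on the workers and returns a Task, so your outer loop races ahead before anything is computed. A minimal sketch (not from the original code):
t = @distributed for j = 1:2
    sleep(1)   # stand-in for the real work
end
typeof(t)  # Task: the loop body may not have run yet
wait(t)    # blocks until all iterations are done; @sync @distributed waits in place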
So in order to make your code work you need to collect the data from the distributed loop, which can be done like this:
function backslash2(N, T, b, exec)
    A = zeros(N, N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i - j) <= 1 && (A[i, j] += -α)
        i == j && (A[i, j] += 3 * α + 1)
    end
    A = Tridiagonal(A)
    a = zeros(N, 4, T)
    if exec == :parallel
        for i = 1:T
            aj = @distributed (append!) for j = 1:2
                [A \ b[:, i]]
            end
            # you could consider using SharedArrays instead
            a[:, 1, i] .= aj[1]
            a[:, 2, i] .= aj[2]
        end
    elseif exec == :single
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A \ b[:, i]
            end
        end
    end
    return a
end
Now you have equal results:
julia> sum(backslash2(1000, 1000, b, :single)) == sum(backslash2(1000, 1000, b, :parallel))
true
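As the comment in backslash2 hints, a SharedArray lets every process write into the same array directly, avoiding the (append!) reduction. A sketch, assuming workers were added with addprocs and that N, T, A, and b are defined as in backslash2:
using Distributed, SharedArrays
a = SharedArray{Float64}(N, 4, T)   # memory visible to all local processes
for i = 1:T
    @sync @distributed for j = 1:2  # @sync waits for the iterations to finish
        a[:, j, i] = A \ b[:, i]    # A and b are serialized to the workers
    end
end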
However, distributed execution is very inefficient for loop bodies that take only a few milliseconds to run, so in this example the @distributed code will take many times longer than the serial version: you dispatch a distributed job 1000 times, and each dispatch itself costs a few milliseconds. Perhaps your production task takes longer, in which case it makes sense. Or you might consider Threads.@threads instead.
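For illustration, a threaded variant of the inner loop could look like this (a sketch; it assumes Julia was started with several threads, e.g. julia -t 2):
using Base.Threads
for i = 1:T
    Threads.@threads for j = 1:2    # threads share memory and @threads waits,
        a[:, j, i] = A \ b[:, i]    # so each iteration safely writes its own slice of a
    end
end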
Last but not least, BLAS might be configured to be multi-threaded, in which case parallelizing further on a single machine might bring no benefit at all (it depends on the scenario).
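If you want to check or control that, a sketch:
using LinearAlgebra
BLAS.get_num_threads()   # how many threads BLAS currently uses
BLAS.set_num_threads(1)  # pin BLAS to one thread before benchmarking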
Upvotes: 1