Reputation: 11
I am trying to understand whether data of type SharedArray is moved across processes and therefore causes overhead.
After defining the variables in my Main module (process 1), I called pmap on the Array (im1, im2) and SharedArray (im1_shared, im2_shared) versions of my data:
pmap(x -> someFunction(im1, im2, x), iterations)
pmap(x -> someFunction(im1_shared, im2_shared, x), iterations)
So im1, im2 and im1_shared, im2_shared act as fixed (captured) arguments, while slices are taken at the positions given by the iterator x and processed by the workers.
Using
@fetchfrom 2 varinfo()
I get:
im1 122.070 MiB 4000×4000 Array{Float64,2}
im2 122.070 MiB 4000×4000 Array{Float64,2}
im1_shared 122.071 MiB 4000×4000 SharedArray{Float64,2}
im2_shared 122.071 MiB 4000×4000 SharedArray{Float64,2}
So my thoughts and confusions on this:
The workers (here worker 2) also list the SharedArrays. So I am thinking that one of the following two scenarios must be at least partially correct:
2.1. varinfo() lists all variables in the local workspace of the workers, but the SharedArrays are not actually stored in the workers' local memory. They are listed only to indicate that a worker has access to them.
In this case, if a SharedArray is stored only once and all workers have access to it, why isn't this type the default in Julia, to minimize overhead in the first place?
2.2. The SharedArrays were also copied to each worker, hence the 122 MiB per SharedArray on each worker. The only advantage of SharedArrays over Arrays would then be that every worker can access them; the stored data has to be copied to each worker either way.
In this case, the only way to avoid the overhead would be something like a distributed array (see the sketch below for what I mean), letting workers operate only on chunks they already hold in their local memory, right?
Could you please help me sort out which of these two scenarios (2.1 and 2.2) is correct?
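To make 2.2 concrete, what I mean by "a distributed array" is something along the lines of the external DistributedArrays.jl package. This is only a sketch of the idea, not something I have benchmarked here:
@everywhere using DistributedArrays   # external package: ] add DistributedArrays
# each worker stores and operates only on its own chunk,
# so no whole-array copies need to be moved around
d = drand(1000, 1000)                 # the chunks live on the workers, not on the master
chunk_sums = [remotecall_fetch(D -> sum(localpart(D)), p, d) for p in workers()]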
UPDATE 1: Here is a working example:
using Distributed
nprocs() == 1 && addprocs(2)        # make sure worker processes exist
@everywhere using SharedArrays      # SharedArray must be available on all processes
@everywhere using InteractiveUtils  # to call varinfo() on all workers
### FUNCTIONS
@everywhere function foo(x::Array{Float64, 2}, y::Array{Float64, 2}, t::Int64)
    # just take a slice of both arrays at the given offset and sum the values
    x_slice = x[t:t+5, t:t+5]
    y_slice = y[t:t+5, t:t+5]
    return x_slice + y_slice
end
@everywhere function fooShared(x::SharedArray{Float64, 2}, y::SharedArray{Float64, 2}, t::Int64)
    # just take a slice of both arrays at the given offset and sum the values
    x_slice = x[t:t+5, t:t+5]
    y_slice = y[t:t+5, t:t+5]
    return x_slice + y_slice
end
### DATA
n = 1000
#the two Arrays
im1 = rand(1.0:2.0, n, n)
im2 = copy(im1);
#The two shared arrays
im1_shared = SharedArray(im1)
im2_shared = SharedArray(im2);
@fetchfrom 2 varinfo() # im1_shared and im2_shared are not listed yet, as expected
pmap(x -> foo(im1, im2, x), [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
pmap(x -> fooShared(im1_shared, im2_shared, x), [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
@fetchfrom 2 varinfo() # im1_shared and im2_shared are now listed
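To see whether the im1_shared listed on worker 2 is a separate 122 MiB copy or just a view of the same memory, I could do a check like this (sketch only; it relies on the Main bindings created on worker 2 by the pmap call above):
im1_shared[1, 1] = 42.0             # mutate the array on the master process
@fetchfrom 2 Main.im1_shared[1, 1]  # if worker 2 maps the same shared memory, this returns 42.0
                                    # rather than the old value of an independent copy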
Upvotes: 1
Views: 989
Reputation: 42214
A SharedArray is shared among many Julia processes via memory mapping (https://docs.julialang.org/en/v1/stdlib/Mmap/index.html). The data can be initialized in the following way:
using Distributed
Distributed.addprocs(2);
@everywhere using SharedArrays
@everywhere function ff(ss::SharedArray)
    println(myid(), " ", localindices(ss))
    for ind in localindices(ss)
        ss[ind] = rand(1.0:2.0)
    end
end
And now let us perform the actual initialization:
julia> s = SharedArray{Float64}((1000,1000),init=ff)
From worker 2: 2 1:500000
From worker 3: 3 500001:1000000
1000×1000 SharedArray{Float64,2}:
2.0 1.0 1.0 1.0 1.0 2.0 … 2.0 1.0 2.0 2.0 1.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 2.0 1.0 2.0
2.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0 1.0 1.0 2.0
⋮ ⋮ ⋱ ⋮
1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 1.0 1.0 1.0
1.0 2.0 1.0 2.0 2.0 1.0 2.0 2.0 1.0 1.0 1.0
2.0 2.0 1.0 2.0 1.0 2.0 2.0 1.0 1.0 2.0 2.0
You can see that each worker initialized a separate part of the array that it works on.
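Once the data is initialized, the same pattern works for the actual computation: each process touches only its local chunk and the partial results are combined on the master. A minimal sketch (partial_sum is just an illustrative helper, not part of any API):
# each process sums only the part of the SharedArray it "owns" (its localindices);
# processes without a local chunk simply contribute 0.0
@everywhere function partial_sum(ss::SharedArray)
    acc = 0.0
    for ind in localindices(ss)
        acc += ss[ind]
    end
    return acc
end

total = sum(fetch(@spawnat p partial_sum(s)) for p in procs(s))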
Upvotes: 1