Reputation: 11
I am trying to understand whether data of type SharedArray is moved across processes and therefore causes overhead.
After defining the variables in my Main module (process 1), I called pmap on the Array (im1, im2) and SharedArray (im1_shared, im2_shared) versions of my data:
pmap(x -> someFunction(im1, im2, x), iterations)
pmap(x -> someFunction(im1_shared, im2_shared, x), iterations)
So im1, im2 and im1_shared, im2_shared act as fixed (captured) arguments, while slices are taken at the positions given by the iterator x and processed by the workers.
Using
@fetchfrom 2 varinfo()
I get:
im1 122.070 MiB 4000×4000 Array{Float64,2}
im2 122.070 MiB 4000×4000 Array{Float64,2}
im1_shared 122.071 MiB 4000×4000 SharedArray{Float64,2}
im2_shared 122.071 MiB 4000×4000 SharedArray{Float64,2}
So my thoughts and confusions on this:
The workers (here worker 2) also list the SharedArrays. So I am thinking that one of the following two scenarios must be at least partially correct:
2.1. varinfo() lists all variables in the local workspace of the workers, but the SharedArrays are not actually stored in the workers' local memory. They are listed only to indicate that a worker has access to them.
In this case, if a SharedArray is stored only once and all workers have access to it, why isn't this type the default in Julia, to minimize overhead in the first place?
2.2. The SharedArrays were also copied to each worker, hence the 122 MiB per SharedArray on each worker. The only advantage of SharedArrays over Arrays would then be that every worker can access them; the stored data has to be copied to each worker either way.
In this case, the only way to avoid the overhead would be something like a distributed array (see the sketch below for what I mean), letting workers operate only on chunks they already hold in their local memory, right?
Could you please help me sort out which of these two scenarios (2.1 and 2.2) is correct?
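To make 2.2 concrete, what I mean by "a distributed array" is something along the lines of the external DistributedArrays.jl package. This is only a sketch of the idea, not something I have benchmarked here:
@everywhere using DistributedArrays   # external package: ] add DistributedArrays
# each worker stores and operates only on its own chunk,
# so no whole-array copies need to be moved around
d = drand(1000, 1000)                 # the chunks live on the workers, not on the master
chunk_sums = [remotecall_fetch(D -> sum(localpart(D)), p, d) for p in workers()]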
UPDATE 1: Here is a working example:
using Distributed
nprocs() == 1 && addprocs(2)        # make sure worker processes exist
@everywhere using SharedArrays      # SharedArray must be available on all processes
@everywhere using InteractiveUtils  # to call varinfo() on all workers
### FUNCTIONS
@everywhere function foo(x::Array{Float64, 2}, y::Array{Float64, 2}, t::Int64)
    # just take a slice of both arrays at the given offset and sum the values
    x_slice = x[t:t+5, t:t+5]
    y_slice = y[t:t+5, t:t+5]
    return x_slice + y_slice
end
@everywhere function fooShared(x::SharedArray{Float64, 2}, y::SharedArray{Float64, 2}, t::Int64)
    # just take a slice of both arrays at the given offset and sum the values
    x_slice = x[t:t+5, t:t+5]
    y_slice = y[t:t+5, t:t+5]
    return x_slice + y_slice
end
### DATA
n = 1000
#the two Arrays
im1 = rand(1.0:2.0, n, n)
im2 = copy(im1);
#The two shared arrays
im1_shared = SharedArray(im1)
im2_shared = SharedArray(im2);
@fetchfrom 2 varinfo() # im1_shared and im2_shared are not listed yet, as expected
pmap(x -> foo(im1, im2, x), [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
pmap(x -> fooShared(im1_shared, im2_shared, x), [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
@fetchfrom 2 varinfo() # im1_shared and im2_shared are now listed
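To see whether the im1_shared listed on worker 2 is a separate 122 MiB copy or just a view of the same memory, I could do a check like this (sketch only; it relies on the Main bindings created on worker 2 by the pmap call above):
im1_shared[1, 1] = 42.0             # mutate the array on the master process
@fetchfrom 2 Main.im1_shared[1, 1]  # if worker 2 maps the same shared memory, this returns 42.0
                                    # rather than the old value of an independent copy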
Upvotes: 1
Views: 989
Reputation: 42214
A SharedArray is shared among many Julia processes via memory mapping (https://docs.julialang.org/en/v1/stdlib/Mmap/index.html). The data can be initialized in the following way:
using Distributed
Distributed.addprocs(2);
@everywhere using SharedArrays
@everywhere function ff(ss::SharedArray)
    println(myid(), " ", localindices(ss))
    for ind in localindices(ss)
        ss[ind] = rand(1.0:2.0)
    end
end
And now let us perform the actual initialization:
julia> s = SharedArray{Float64}((1000,1000),init=ff)
From worker 2: 2 1:500000
From worker 3: 3 500001:1000000
1000×1000 SharedArray{Float64,2}:
2.0 1.0 1.0 1.0 1.0 2.0 … 2.0 1.0 2.0 2.0 1.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 2.0 1.0 2.0
2.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0 1.0 1.0 2.0
⋮ ⋮ ⋱ ⋮
1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 1.0 1.0 1.0
1.0 2.0 1.0 2.0 2.0 1.0 2.0 2.0 1.0 1.0 1.0
2.0 2.0 1.0 2.0 1.0 2.0 2.0 1.0 1.0 2.0 2.0
You can see that each worker initialized a separate part of the array that it works on.
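Once the data is initialized, the same pattern works for the actual computation: each process touches only its local chunk and the partial results are combined on the master. A minimal sketch (partial_sum is just an illustrative helper, not part of any API):
# each process sums only the part of the SharedArray it "owns" (its localindices);
# processes without a local chunk simply contribute 0.0
@everywhere function partial_sum(ss::SharedArray)
    acc = 0.0
    for ind in localindices(ss)
        acc += ss[ind]
    end
    return acc
end

total = sum(fetch(@spawnat p partial_sum(s)) for p in procs(s))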
Upvotes: 1