varantir

Reputation: 6854

Saving data in parallel in julia

I am confronted with a problem when submitting many jobs to a cluster, where each job calculates some data and saves it (as a .jld file containing many variables) to some drive, for example like this:

using JLD

function f(savedir, pid, params)
    ...
    # each job writes its own file, named after its job id
    save(joinpath(savedir, "$(pid).jld"), result)
end

After the calculation I need to process the data, loading each .jld file to access its variables individually. Even though the final reduction is rather small, this takes a lot of time. I thought about saving it all to one .jld file, but then I run into the problem that the file may be accessed by several jobs at the same time, since they run in parallel. I also thought about collecting the data in an out-of-core fashion using JuliaDB, but in the end I do not see why this should be any better. I know that this could be solved with some database server, but that seems to be overkill for my problem. How do you deal with this kind of problem?

Best,

v.

Upvotes: 1

Views: 427

Answers (1)

Przemyslaw Szufel

Reputation: 42234

If the data is small, simply use the IOBuffer mechanism to serialize the results on the workers and send them to the master:

using Distributed, Serialization
addprocs(4)
@everywhere using Distributed, Serialization

rrs = @distributed (hcat) for i in 1:12
    b = IOBuffer()
    myres = (rand(), randn(), myid()) # emulates some big computation
                                      # that you are running
    serialize(b, myres)
    take!(b) # the serialized bytes; all results have equal length here,
             # so the (hcat) reduction yields a Matrix{UInt8}
end

And here is sample code deserializing the results back:

julia> for i in 1:size(rrs,2)
           res = deserialize(IOBuffer(@view rrs[:, i]))
           println(res)
       end
(0.8656737453513623, 1.0594978554855077, 2)
(0.6637467726391784, 0.35682413048990763, 2)
(0.32579653913039386, 0.2512902466296038, 2)
(0.3033490905926888, 1.7662416364260713, 3)
...
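
Note that the (hcat) reduction only works because every tuple above serializes to the same number of bytes. If your results can vary in size, a minimal variation (same setup as above) is to collect one byte vector per result with a (vcat) reduction:

rrs = @distributed (vcat) for i in 1:12
    b = IOBuffer()
    serialize(b, (rand(), randn(), myid()))
    [take!(b)] # one-element Vector, so (vcat) builds a Vector of byte vectors
end

for bytes in rrs
    println(deserialize(IOBuffer(bytes)))
end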

If your data is too big and your cluster is distributed, then you need some other orchestration mechanism. One lightweight solution that I sometimes use is the following set of bash scripts: https://github.com/pszufe/KissCluster. This tool is built around the following bash command, which is very useful in any file-based scenario:

nohup seq $start $end | xargs --max-args=1 --max-procs=$nproc julia run.jl &>> somelogfile.txt &
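
For context, here is a minimal sketch of what run.jl could look like; the file name comes from the command above, while savedir and the computation are hypothetical placeholders:

# run.jl -- invoked as `julia run.jl <jobid>` by the xargs line above
using JLD

jobid = parse(Int, ARGS[1])      # the index piped in by `seq`
savedir = "results/"             # hypothetical output directory
result = Dict("x" => rand())     # stand-in for the real computation
save(joinpath(savedir, "$(jobid).jld"), result)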

Nevertheless, when possible, consider using Julia's Distributed package.
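
For example, when each result fits in the master's memory, a plain pmap already serializes each worker's return value and ships it back for you; a minimal sketch:

using Distributed
addprocs(4)

# pmap runs the function on the workers and returns the results to the master
results = pmap(1:12) do i
    (rand(), randn(), myid()) # stand-in for the real computation
end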

Upvotes: 2
