Reputation: 85
In a regression test, I have a 1000*100 pandas DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1000, 100)))
The first column is the y label and the others are x1-x99. I need to pick three or seven of the x variables to fit y, run each regression, collect all the outputs, and find the best choice.
I found that in the Ray project, calling ray.put(object) stores the large array in shared memory, where it can be accessed by all of the worker processes without creating copies.
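Roughly, the pattern in Ray looks like the sketch below (model stands for whatever regression routine is used, e.g. statsmodels OLS; it is not defined here):

import ray
from itertools import combinations

ray.init()
df_ref = ray.put(df)  # the DataFrame is stored once in Ray's shared object store

@ray.remote
def fit_one(data, cols):
    # Ray resolves df_ref from the object store instead of pickling the DataFrame into every task
    return model(data.iloc[:, 0], data.iloc[:, list(cols)]).fit()

results = ray.get([fit_one.remote(df_ref, c) for c in combinations(range(1, 100), 3)])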
There are a great many combinations (161700 + 3921225 + ...), and it is fine for the workers to read only the base DataFrame, since they do not communicate with each other; they just need to return their output to the main process.
Is there something similar in Dask that avoids copying the data into each worker? It might look like:
dask.put(df)
Then each worker might run its own jobs, something like:
import dask
from itertools import combinations

rt = []
for c in combinations(range(1, 100), 3):  # column 0 is y, so only columns 1-99 are candidate predictors
    (i, j, k) = c
    # model stands for the regression routine, e.g. statsmodels OLS
    rt.append(dask.delayed(model)(df.iloc[:, 0], df.iloc[:, [i, j, k]]).fit())
rt = dask.compute(*rt)
That way I would avoid creating every y, X copy in the main process and sending each y, X pair to all the workers?
Upvotes: 0
Views: 820
Reputation: 902
Ray uses PyArrow's Plasma object store under the hood to share data between processes on a single machine.
While Dask does not explicitly support Plasma, you can quite easily use it to store shared data and read it back from within worker functions. A worker function can retrieve the data from Plasma as long as it knows the Plasma ObjectID under which the data is stored.
Plasma example code here.
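For concreteness, here is a minimal sketch of that pattern. It assumes a Plasma store has already been started (e.g. with plasma_store -m 1000000000 -s /tmp/plasma), a PyArrow version old enough to still ship the pyarrow.plasma module, and a plain least-squares fit as a stand-in for your model; adapt the details to your setup.

from itertools import combinations

import numpy as np
import pandas as pd
import pyarrow.plasma as plasma
from dask.distributed import Client

def fit_one(oid_bytes, cols):
    # Each worker connects to the local Plasma store and reads the shared array;
    # only the 20-byte ObjectID travels through the Dask scheduler, not the data.
    store = plasma.connect("/tmp/plasma")
    arr = store.get(plasma.ObjectID(oid_bytes))
    y, X = arr[:, 0], arr[:, list(cols)]
    # Ordinary least squares as a placeholder for whatever model you run.
    coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return cols, coef, resid

if __name__ == "__main__":
    df = pd.DataFrame(np.random.random((1000, 100)))

    # Put the underlying NumPy array into Plasma once, in the main process.
    store = plasma.connect("/tmp/plasma")
    oid = store.put(df.to_numpy())

    client = Client()  # local cluster of worker processes
    futures = [client.submit(fit_one, oid.binary(), c)
               for c in combinations(range(1, 100), 3)]
    results = client.gather(futures)

With 161700+ combinations you will probably also want to batch several fits per task, but the key point is that each task only carries the ObjectID, not a copy of the DataFrame.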
Upvotes: 3