Reputation: 37
I am trying to send a large numpy ndarray through client.scatter(np_ndarray). The np_ndarray is about 10 GB, and I am getting this error: msgpack Could not serialize object of type ndarray. As an alternative, I used pickle when creating my client, like this: Client(self.adr, serializers=['dask', 'pickle']).
- Is there a limit on the data size that msgpack can handle?
- Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
- I noticed that there is a project for Msgpack-Numpy. Are you planning to add support for it in dask, or should I open an issue in dask about it?
- When I initialize my client this way, what are the main advantages and disadvantages?
Thank you!
Upvotes: 1
Views: 1635
Reputation: 794
To answer your other three questions:
Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
Yes, Dask selects a default serializer depending on your data; ref: Dask Docs - Serialization
I noticed that there is a project for Msgpack-Numpy. Are you planning to add support for it in dask, or should I open an issue in dask about it?
I checked with a Dask contributor, and it looks like there is no plan to support it, now or in the near future. That said, please feel free to start a discussion to gather more thoughts. :)
When I initialize my client this way, what are the main advantages and disadvantages?
Serialization in Dask is tricky, so it's hard to define (dis)advantages. But, generally speaking, manually specifying serializers is not recommended.
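For illustration, here is a minimal sketch of the two setups, assuming a placeholder scheduler address; the serializers/deserializers keyword arguments mirror the one used in the question, and leaving them out lets Dask choose per object:

from dask.distributed import Client

# Default: Dask picks a serializer per object (its own 'dask' protocol,
# 'pickle', 'msgpack', ... depending on what is being transferred)
client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address

# Manually restricting the families, as in the question; objects that the
# listed families cannot handle will fail to transfer
client_restricted = Client(
    "tcp://127.0.0.1:8786",
    serializers=["dask", "pickle"],
    deserializers=["dask", "pickle"],
)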
Upvotes: 2
Reputation: 16551
If you would like to use msgpack, then the maximum size of a single Binary or String object is about 4.3 GB ((2^32)-1 bytes); see the docs:
- a value of an Integer object is limited from -(2^63) upto (2^64)-1
- maximum length of a Binary object is (2^32)-1
- maximum byte size of a String object is (2^32)-1
There is a discussion of some strategies here. Specifically, if it's possible to encode the object as a string or bytes, that data can be split into multiple parts and each part sent individually; the receiving side then concatenates the parts and decodes them (a rough sketch follows below). Another option is streaming.
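Here is a rough sketch of the chunking idea, using plain pickle and msgpack rather than Dask's machinery (the chunk size and helper names are just illustrative):

import pickle
import msgpack
import numpy as np

CHUNK_SIZE = 2**30  # 1 GiB, comfortably under msgpack's (2^32)-1 byte limit

def to_msgpack_chunks(arr):
    # Encode the array as bytes, then pack it as several msgpack-safe pieces
    blob = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
    return [msgpack.packb(blob[i:i + CHUNK_SIZE])
            for i in range(0, len(blob), CHUNK_SIZE)]

def from_msgpack_chunks(packed):
    # Unpack each piece, concatenate, and decode back into an array
    blob = b"".join(msgpack.unpackb(p) for p in packed)
    return pickle.loads(blob)

large_numpy = np.ones((1000, 1000))        # stand-in for the real 10 GB array
messages = to_msgpack_chunks(large_numpy)  # send these one at a time
restored = from_msgpack_chunks(messages)   # receiver reassembles and decodes
assert np.array_equal(large_numpy, restored)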
Upvotes: 1
Reputation: 16551
Rather than sending large data to workers, it might be more efficient to store the data (locally or remotely, as appropriate) and ask workers to load it. Something like this:
from joblib import dump, load

# Persist the large array once, somewhere the workers can also reach it
path_to_pickle = 'large_numpy.pickle'
dump(large_numpy, path_to_pickle)

def myfunc(path_to_pickle):
    # Each worker loads the data itself instead of receiving it over the network
    large_numpy = load(path_to_pickle)
    # do something

fut = client.submit(myfunc, path_to_pickle)
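Once the worker finishes, the result comes back through the future; for example, assuming client is an already-connected dask.distributed Client:

result = fut.result()  # block until myfunc completes on the worker
# or, for several futures at once: results = client.gather(list_of_futures)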
Upvotes: 0