Reputation: 37
I am trying to send a large numpy ndarray through client.scatter(np_ndarray). The np_ndarray is about 10 GB, and I am getting this error: msgpack Could not serialize object of type ndarray. As an alternative, I used pickle when creating my client, like this: Client(self.adr, serializers=['dask', 'pickle']).
- Is there a limit on the data size that msgpack can handle?
- Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
- I noticed that there is a project for Msgpack-Numpy. Are you planning to add support for it in dask, or should I open an issue in dask about it?
- When I initialize my client this way, what are the main advantages and disadvantages?
Thank you!
Upvotes: 1
Views: 1635
Reputation: 794
To answer your other three questions:
Is msgpack always used when data is sent by scatter, or does dask decide on the protocol depending on the data type?
Yes, Dask selects a default serializer depending on your data; ref: Dask Docs - Serialization
I noticed that there is a project for Msgpack-Numpy. Are you planning to add support for it in dask, or should I open an issue in dask about it?
I checked with a Dask contributor, and it looks like there is no plan to support it, now or in the near future. That said, please feel free to start a discussion to gather more thoughts. :)
When I initialize my client this way, what are the main advantages and disadvantages?
Serialization in Dask is tricky, so it's hard to define (dis)advantages. But, generally speaking, manually specifying serializers is not recommended.
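For illustration, here is a minimal sketch of the two setups, assuming a placeholder scheduler address; the serializers/deserializers keyword arguments mirror the one used in the question, and leaving them out lets Dask choose per object:

from dask.distributed import Client

# Default: Dask picks a serializer per object (its own 'dask' protocol,
# 'pickle', 'msgpack', ... depending on what is being transferred)
client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address

# Manually restricting the families, as in the question; objects that the
# listed families cannot handle will fail to transfer
client_restricted = Client(
    "tcp://127.0.0.1:8786",
    serializers=["dask", "pickle"],
    deserializers=["dask", "pickle"],
)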
Upvotes: 2
Reputation: 16551
If you would like to use msgpack, then the maximum size of a single Binary or String object is about 4.3 GB ((2^32)-1 bytes); see the docs:
- a value of an Integer object is limited from -(2^63) upto (2^64)-1
- maximum length of a Binary object is (2^32)-1
- maximum byte size of a String object is (2^32)-1
There is a discussion of some strategies here. Specifically, if it's possible to encode the object as a string or bytes, that data can be split into multiple parts and each part sent individually; the receiving side then concatenates the parts and decodes them (a rough sketch follows below). Another option is streaming.
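Here is a rough sketch of the chunking idea, using plain pickle and msgpack rather than Dask's machinery (the chunk size and helper names are just illustrative):

import pickle
import msgpack
import numpy as np

CHUNK_SIZE = 2**30  # 1 GiB, comfortably under msgpack's (2^32)-1 byte limit

def to_msgpack_chunks(arr):
    # Encode the array as bytes, then pack it as several msgpack-safe pieces
    blob = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
    return [msgpack.packb(blob[i:i + CHUNK_SIZE])
            for i in range(0, len(blob), CHUNK_SIZE)]

def from_msgpack_chunks(packed):
    # Unpack each piece, concatenate, and decode back into an array
    blob = b"".join(msgpack.unpackb(p) for p in packed)
    return pickle.loads(blob)

large_numpy = np.ones((1000, 1000))        # stand-in for the real 10 GB array
messages = to_msgpack_chunks(large_numpy)  # send these one at a time
restored = from_msgpack_chunks(messages)   # receiver reassembles and decodes
assert np.array_equal(large_numpy, restored)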
Upvotes: 1
Reputation: 16551
Rather than sending large data to workers, it might be more efficient to store the data (locally or remotely, as appropriate) and ask workers to load it. Something like this:
from joblib import dump, load

# Persist the large array once, somewhere the workers can also reach it
path_to_pickle = 'large_numpy.pickle'
dump(large_numpy, path_to_pickle)

def myfunc(path_to_pickle):
    # Each worker loads the data itself instead of receiving it over the network
    large_numpy = load(path_to_pickle)
    # do something

fut = client.submit(myfunc, path_to_pickle)
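Once the worker finishes, the result comes back through the future; for example, assuming client is an already-connected dask.distributed Client:

result = fut.result()  # block until myfunc completes on the worker
# or, for several futures at once: results = client.gather(list_of_futures)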
Upvotes: 0