Hatshepsut

Reputation: 6662

Parallelizing without pickle

Alex Gaynor explains some problems with pickle in his talk "Pickles are for delis, not software", including security, reliability, and human-readability. I am generally wary of using pickle on data in my Python programs. As a general rule, I much prefer to pass my data around as JSON or another serialization format that I specify myself, manually.

The situation I'm interested in is this: I've gathered some data in my Python program, and I want to run an embarrassingly parallel task over it many times.

As far as I know, the nicest parallelization library for doing this in Python right now is dask-distributed, followed by joblib-parallel, concurrent.futures, and multiprocessing.
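For concreteness, here is roughly the standard route I'd take today with concurrent.futures; task and data are just placeholders for my real workload:

from concurrent.futures import ProcessPoolExecutor

def task(record):
    # placeholder for the embarrassingly parallel work
    return sum(record)

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # arguments and results cross the process boundary via pickle
        results = list(pool.map(task, data))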

However, all of these solutions use pickle for serialization. Given the various issues with pickle, I'm inclined to simply send a JSON array to a subprocess of GNU parallel, along the lines of the sketch below. But of course, this feels like a hack, and it loses all the fancy goodness of Dask.
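Concretely, the hack I have in mind looks something like this, assuming a hypothetical worker.py that reads one JSON record per line from stdin and prints one JSON result per line (the GNU parallel flags may need adjusting):

import json
import subprocess

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# One JSON document per line; GNU parallel runs worker.py once per record.
proc = subprocess.run(
    ["parallel", "--pipe", "-N1", "python", "worker.py"],
    input="\n".join(json.dumps(record) for record in data),
    capture_output=True,
    text=True,
)
results = [json.loads(line) for line in proc.stdout.splitlines()]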

Is it possible to specify a different default serialization format for my data, but continue to parallelize in Python, preferably with Dask, without resorting to pickle or GNU parallel?

Upvotes: 2

Views: 726

Answers (2)

Kirk Strauser

Reputation: 30957

Pickles are bad for long-term storage ("what if my class definition changes after I've persisted something to a database?") and terrible for accepting as user input:

import os

def foo():
    os.system('rm -rf /')   # arbitrary, destructive code

payload = {'lol': foo}      # imagine this arriving via an untrusted pickle

But I don't think there's any problem at all with using them in this specific case. Suppose you're passing around datetime objects. Do you really want to write your own ad-hoc JSON adapter to serialize and deserialize them? I mean, you can, but do you want to? Pickles are well specified, and the process is fast. That's kind of exactly what you want here, where you're neither persisting the intermediate serialized object nor accepting objects from third parties. You're literally passing them from yourself to yourself.
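For reference, the kind of ad-hoc adapter I'm talking about looks roughly like this (a sketch; the '__datetime__' tag is just an illustrative convention):

import json
from datetime import datetime

def encode(obj):
    # Called by json.dumps for objects it doesn't know how to serialize
    if isinstance(obj, datetime):
        return {'__datetime__': obj.isoformat()}
    raise TypeError(f'Cannot serialize {type(obj)}')

def decode(dct):
    # Called by json.loads for every decoded dict
    if '__datetime__' in dct:
        return datetime.fromisoformat(dct['__datetime__'])
    return dct

payload = json.dumps({'when': datetime.now()}, default=encode)
restored = json.loads(payload, object_hook=decode)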

I'd highly recommend picking the library you want to use -- you like Dask? Go for it! -- and not worrying about its innards until such time as you specifically have to care. In the mean time, concentrate on the parts of your program that are unique to your problem. Odds are good that the underlying serialization format won't be one of them.

Upvotes: 0

mdurant

Reputation: 28684

The page http://distributed.dask.org/en/latest/protocol.html is worth a read regarding how Dask passes information around a set of distributed workers and the scheduler. As you can see there, (cloud)pickle enters the picture for things like functions, which we want to be able to pass to workers so that they can execute them, but data is generally sent via fairly efficient msgpack serialisation; there would be no way to serialise functions with JSON. In fact, there is a fairly flexible internal dispatch mechanism for deciding what gets serialised with what mechanism, but there is no need to get into that here.
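(For the curious, the registration hooks described in the distributed serialisation docs look roughly like the sketch below; YourType and its payload handling are placeholders for a class you control, and you should check the docs for the exact API in your version.)

from distributed.protocol import dask_serialize, dask_deserialize

class YourType:
    # stand-in for a class you control
    def __init__(self, payload: bytes):
        self.payload = payload

@dask_serialize.register(YourType)
def _serialize(x):
    header = {}              # msgpack-friendly metadata
    frames = [x.payload]     # raw payload as bytes-like frames
    return header, frames

@dask_deserialize.register(YourType)
def _deserialize(header, frames):
    return YourType(frames[0])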

I would also claim that pickle is a fine way to serialise some things when passing between processes, so long as you have gone to the trouble to ensure consistent environments between them, which is an assumption that Dask makes.

-edit-

You could of course include function names or escapes in JSON, but I would suggest that's just as brittle as pickle anyway.
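That is, something along the lines of the sketch below, which only keeps working as long as both sides maintain exactly the same lookup table:

import json

def double(x):
    return 2 * x

REGISTRY = {'double': double}   # sender and receiver must keep this in sync

message = json.dumps({'func': 'double', 'arg': 21})    # "function by name" in JSON
decoded = json.loads(message)
result = REGISTRY[decoded['func']](decoded['arg'])     # KeyError as soon as names drift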

Upvotes: 2
