Reputation: 255
I'm implementing various kinds of data processing pipelines using dask.distributed. Usually the original data is read from S3 and, in the end, the processed (large) collection is written to CSV on S3 as well.
I can run the processing asynchronously and monitor progress, but I've noticed that all the to_xxx() methods that store collections to file(s) seem to be synchronous calls. One downside is that the call blocks, potentially for a very long time. Another is that I cannot easily construct a complete graph to be executed later.
Is there a way to run e.g. to_csv() asynchronously and get a future object instead of blocking?
PS: I'm pretty sure that I could implement asynchronous storage myself, e.g. by converting the collection to delayed() objects and storing each partition (see the sketch below). But this seems like a common case - unless I missed an already existing feature, it would be nice to have something like this included in the framework.
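For reference, here is roughly what I have in mind as the DIY workaround. This is a minimal sketch, not a tested implementation: the bucket paths and the write_partition helper are made up for illustration, and writing pandas frames directly to s3:// paths assumes s3fs is installed.

import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # assumes a scheduler is already running

df = dd.read_csv('s3://bucket-in/data-*.csv')  # hypothetical input location

@dask.delayed
def write_partition(part, path):
    # part is a plain pandas DataFrame here; write it to its own file
    part.to_csv(path, index=False)
    return path

tasks = [write_partition(part, 's3://bucket-out/part-%05d.csv' % i)  # hypothetical output location
         for i, part in enumerate(df.to_delayed())]
futures = client.compute(tasks)  # returns immediately with a list of futures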
Upvotes: 3
Views: 355
Reputation: 57319
Most to_* functions have a compute=True keyword argument that can be set to compute=False. In these cases they will return a sequence of delayed values that you can then compute asynchronously:
values = df.to_csv('s3://...', compute=False)  # build the write tasks without running them
futures = client.compute(values)               # submit to the cluster; returns futures immediately
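Since client.compute() returns immediately, you can keep working while the writes run and block only when you choose to. A small usage sketch (progress and wait are standard dask.distributed helpers; futures is the list from the snippet above):

from dask.distributed import progress, wait

progress(futures)  # asynchronous progress display
wait(futures)      # block here only when you actually need the writes finished

The delayed values can also be combined with other delayed computations before calling compute, which covers the "complete graph to be executed later" part of the question.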
Upvotes: 2