Reputation: 6287
I have a large number of futures pointing to data which I subsequently need to aggregate remotely in another task:
futures = [client.submit(get_data, i) for i in range(0, LARGE_NUMBER)]
agg_future = client.submit(aggregate, futures)
The issue with the above is that the client warns about the size of the argument being submitted, because of the large number of futures in the list.
If I were willing to pull the data back to the client, I would simply use gather to collect the results:
agg_local = aggregate(client.gather(futures))
This, however, is exactly what I would like to avoid. Is there a way (ideally non-blocking) to effectively gather the futures' results within a remote task, without the client complaining about the size of the list of futures being aggregated?
Upvotes: 0
Views: 332
Reputation: 28683
If your workload really suits the creation of many, many futures and aggregating them in a single function, you can safely ignore the warning and continue.
However, you might find it more efficient to perform something like the tree summation example from the docs. That example is written for dask.delayed, but a client.submit version would look very similar, essentially replacing the line
lazy = add(L[i], L[i + 1])
with
lazy = client.submit(agg_func, L[i], L[i + 1])
but you will have to figure out a version of your aggregation function that can work pairwise and still produce the grand result. This will likely leave a fairly large number of futures in play on the scheduler, which may add latency, so profile to see what works well!
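For concreteness, here is a minimal sketch of the pairwise tree-reduction loop. Plain function calls stand in for client.submit so it runs without a cluster, and agg_func is a hypothetical placeholder for your own pairwise aggregation; in a real cluster each call in the inner loop would be the client.submit line shown above, and only the final future would be gathered.

```python
def agg_func(a, b):
    # Hypothetical pairwise aggregation; replace with your own function.
    # It must be associative so the tree order does not change the result.
    return a + b

# Imagine these are the results behind futures from client.submit(get_data, i)
L = list(range(16))

# Repeatedly combine neighbours until a single value remains.
while len(L) > 1:
    new_L = []
    for i in range(0, len(L) - 1, 2):
        # On a cluster this would be:
        #   new_L.append(client.submit(agg_func, L[i], L[i + 1]))
        new_L.append(agg_func(L[i], L[i + 1]))
    if len(L) % 2:          # odd element count: carry the last one forward
        new_L.append(L[-1])
    L = new_L

result = L[0]  # the grand aggregate
```

Each level halves the number of outstanding tasks, so the reduction finishes in O(log n) rounds instead of funnelling every future into one task.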
Upvotes: 2
Reputation: 1464
I think you could probably do this within a worker:
>>> from dask.distributed import get_client
>>> def f():
...     client = get_client()
...     futures = client.map(lambda x: x + 1, range(10))  # spawn many tasks
...     results = client.gather(futures)
...     return sum(results)
>>> future = client.submit(f)
>>> future.result()
This example is lifted almost directly from the docs on get_client.
Upvotes: 1