Reputation: 57
What is the best way to distribute a task across a dataset when the computation uses a relatively expensive-to-create resource or object?
# in pandas
import pandas as pd

df = pd.read_csv(...)
foo = Foo()  # expensive initialization
result = df.apply(lambda x: foo.do(x))

# in dask? is it possible to scatter foo to the workers?
client.scatter(...)
I plan on using this with dask_jobqueue with SGECluster.
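For reference, a minimal sketch of what the scatter idea might look like, assuming a distributed.Client already connected to the cluster (Foo is the placeholder class from above, and some_row is a hypothetical piece of the data):

from dask.distributed import Client

client = Client(cluster)  # cluster: e.g. an SGECluster from dask_jobqueue

foo = Foo()                                       # still created locally
foo_future = client.scatter(foo, broadcast=True)  # replicate to every worker

def do(row, foo):
    return foo.do(row)

# workers resolve foo_future to their local copy before calling do
future = client.submit(do, some_row, foo_future)

Note that this still pays the construction cost on the client and requires foo to be picklable; the answer below sidesteps both by constructing Foo on a worker.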
Upvotes: 1
Views: 61
Reputation: 57319
import dask

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
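For completeness, a sketch of how this might fit together end to end with dask_jobqueue; the cluster parameters, CSV path, and the axis=1/meta details are illustrative assumptions, not part of the answer above:

import dask
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import SGECluster

cluster = SGECluster(cores=4, memory="8GB")  # illustrative resources
cluster.scale(10)                            # request ten workers
client = Client(cluster)

df = dd.read_csv("data-*.csv")  # hypothetical path
foo = dask.delayed(Foo)()       # Foo is constructed lazily, on a worker

def do(row, foo):
    return foo.do(row)

# row-wise apply; meta tells dask the output name/dtype so it can build
# the graph without running the function eagerly
result = df.apply(do, axis=1, foo=foo, meta=("result", "object"))
result.compute()

Because foo is a Delayed object passed as an explicit argument, dask materializes it on the workers and shares it across tasks, rather than pickling a client-side instance into every task closure.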
Upvotes: 1