cjlovering
cjlovering

Reputation: 57

(Dask) How to distribute expensive resource needed for computation?

What is the best way to distribute a task across a dataset that uses a relatively expensive-to-create resource or object for the computation.

# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))

# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...

I plan on using this with dask_jobqueue with SGECluster.

Upvotes: 1

Views: 61

Answers (1)

MRocklin
MRocklin

Reputation: 57319

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda

Upvotes: 1

Related Questions