Reputation: 57
What is the best way to distribute a task across a dataset when the computation uses a relatively expensive-to-create resource or object?
# in pandas
import pandas as pd

df = pd.read_csv(...)
foo = Foo()  # expensive initialization
result = df.apply(lambda x: foo.do(x))

# in dask? is it possible to scatter foo to the workers?
client.scatter(...)
I plan on using this with dask_jobqueue with SGECluster.
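For reference, a minimal sketch of what the scatter idea might look like, assuming a distributed.Client already connected to the cluster (Foo is the placeholder class from above, and some_row is a hypothetical piece of the data):

from dask.distributed import Client

client = Client(cluster)  # cluster: e.g. an SGECluster from dask_jobqueue

foo = Foo()                                       # still created locally
foo_future = client.scatter(foo, broadcast=True)  # replicate to every worker

def do(row, foo):
    return foo.do(row)

# workers resolve foo_future to their local copy before calling do
future = client.submit(do, some_row, foo_future)

Note that this still pays the construction cost on the client and requires foo to be picklable; the answer below sidesteps both by constructing Foo on a worker.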
Upvotes: 1
Views: 61
Reputation: 57319
import dask

foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
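For completeness, a sketch of how this might fit together end to end with dask_jobqueue; the cluster parameters, CSV path, and the axis=1/meta details are illustrative assumptions, not part of the answer above:

import dask
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import SGECluster

cluster = SGECluster(cores=4, memory="8GB")  # illustrative resources
cluster.scale(10)                            # request ten workers
client = Client(cluster)

df = dd.read_csv("data-*.csv")  # hypothetical path
foo = dask.delayed(Foo)()       # Foo is constructed lazily, on a worker

def do(row, foo):
    return foo.do(row)

# row-wise apply; meta tells dask the output name/dtype so it can build
# the graph without running the function eagerly
result = df.apply(do, axis=1, foo=foo, meta=("result", "object"))
result.compute()

Because foo is a Delayed object passed as an explicit argument, dask materializes it on the workers and shares it across tasks, rather than pickling a client-side instance into every task closure.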
Upvotes: 1