Reputation: 198
I am using read_csv() to read a long list of CSV files and return two dataframes. I have managed to speed this up by using dask. Unfortunately, I have not been able to return multiple variables when using dask.
The minimum working example below replicates my issue:
import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed(nout=2)
def function(a):
    d = 0
    c = a + a
    if a > 4:  # random condition to make c and d of different lengths
        d = a * a
    return pd.DataFrame([c])#, pd.DataFrame([d])

list = [1,2,3,4,5]
dfs = [delayed(function)(int) for int in list]
ddf = dd.from_delayed(dfs)
ddf.compute()
Any ideas to resolve this issue are appreciated. Thanks.
Upvotes: 1
Views: 425
Reputation: 16551
The delayed decorator has an nout parameter, so something like this might work:
from dask import delayed

@delayed(nout=2)
def function(a, b):
    c = a + b
    d = a * b
    return c, d

delayed_c, delayed_d = function(2, 3)
From the question it's not clear at which step dataframes come in, but the key part of the question (returning more than one value from dask delayed) is answered by using nout; see this answer for full details.
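As a quick check that nout=2 really yields two separate delayed objects, a minimal sketch along these lines should work. It reuses the function from the snippet above; the name add_and_multiply is made up for illustration, and dask.compute is used to evaluate both results in one call:

```python
import dask
from dask import delayed


@delayed(nout=2)
def add_and_multiply(a, b):
    # Hypothetical name; same body as the function in the answer above.
    c = a + b
    d = a * b
    return c, d


# Because nout=2, the Delayed result can be unpacked into two Delayed objects.
delayed_c, delayed_d = add_and_multiply(2, 3)

# Evaluate both in a single call; this returns a tuple of concrete values.
c, d = dask.compute(delayed_c, delayed_d)
print(c, d)
```

Without nout, the unpacking on the left-hand side would fail, because a plain Delayed object has no known length.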
Update:
The delayed function in the updated question returns a tuple of dataframes. This means that either dd.from_delayed should be called on each element of the tuple, or the tuple should be unpacked:
dfs = [delayed_value for int in list for delayed_value in function(int)]
ddf = dd.from_delayed(dfs)
ddf.compute()
Upvotes: 2