Reputation: 43
I am trying to add multiple columns to a dask dataframe to store the results of an apply function. This is my first question on Stack Overflow, so I hope it isn't too long!
Currently I have this working piece of code:
from dask import dataframe as dd
from multiprocessing import cpu_count
nCores = cpu_count()
import dask.multiprocessing
dask.config.set(scheduler='processes')
def dfFunc(varA, varB):
    # Some calculations...
    return NewValue
ddf = dd.from_pandas(weather, npartitions=nCores)
ddf['NewCol1'] = ddf.map_partitions(lambda df: df.apply(lambda x: dfFunc(x['VarA'],x['VarB']), axis=1))
res = ddf.compute()
Essentially, I create a dask dataframe from a pandas dataframe 'weather', then I apply the function 'dfFunc' to each row of the dataframe.
This piece of code works fine: the output 'res' is the original weather dataframe with a new column called 'NewCol1'.
My confusion is this: if I want my function to return a list rather than a single value, how do I create multiple columns in the dask dataframe?
From previous threads, returning a list is supposed to add multiple columns to a pandas dataframe. Hence I tried changing the lines
return NewValue
ddf['NewCol1'] =
To the following:
return [NewValue1,NewValue2]
ddf =
However, this does not seem to work with a dask dataframe, or I just don't know how to code it correctly, because I end up with a single column that holds a list of values in each row:
X Y
val val [NewValue1,NewValue2]
As a bonus, I would like to name these columns as part of the same step, but since ddf.compute() returns a pandas dataframe, adding column names afterwards shouldn't be too difficult.
Upvotes: 2
Views: 2815
Reputation: 43
It turns out there is already a similar question on Stack Overflow that I missed, or at least one whose answers solve this problem:
Dask Dataframe split column of list into multiple columns
Upvotes: 1