Pandas-Dask DataFrame Apply Function with List Return

Question

I am trying to add multiple columns to a Dask DataFrame to store the results of an apply function. This is my first question on Stack Overflow, so I hope it isn't too long!

Currently I have this working piece of code:

from dask import dataframe as dd
from multiprocessing import cpu_count
nCores = cpu_count()

import dask.multiprocessing
# Run tasks with the process-based scheduler
dask.config.set(scheduler='processes')

def dfFunc(varA, varB):
    # Some calculations...
    return NewValue

# 'weather' is an existing pandas DataFrame; use one partition per core
ddf = dd.from_pandas(weather, npartitions=nCores)
# Apply dfFunc row by row within each partition
ddf['NewCol1'] = ddf.map_partitions(lambda df: df.apply(lambda x: dfFunc(x['VarA'], x['VarB']), axis=1))
res = ddf.compute()

Essentially, I create a Dask DataFrame from a pandas DataFrame 'weather', then apply the function 'dfFunc' to each row of the DataFrame.

This piece of code works fine: the output 'res' is the original weather DataFrame with a new column called 'NewCol1'.

My confusion is this: if I want my function to return a list rather than a single value, how do I go about creating multiple columns in the Dask DataFrame?

From looking at previous threads, returning a list is supposed to add multiple columns to a pandas DataFrame. Hence I changed the lines

return NewValue
ddf['NewCol1'] =

to the following:

return [NewValue1,NewValue2]
ddf =

However, this does not seem to work so well with a Dask DataFrame, or I just don't know how to code it correctly, because I end up with a single column holding a list of values:

X    Y    
val  val  [NewValue1,NewValue2]
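The single list-valued column can be reproduced in plain pandas, which is what runs inside each partition. The following is a minimal sketch with made-up data and a hypothetical two-value 'df_func' (neither is from my real code): a row-wise apply that returns a list gives a Series of lists, while returning a pd.Series gives one column per value.

```python
import pandas as pd

# Made-up stand-in data; the real 'weather' frame is not shown here.
df = pd.DataFrame({'VarA': [1, 2], 'VarB': [3, 4]})

def df_func(var_a, var_b):
    # Hypothetical calculations returning two values per row.
    return [var_a + var_b, var_a * var_b]

# Returning a list: pandas keeps each list as one cell of a single Series.
as_lists = df.apply(lambda x: df_func(x['VarA'], x['VarB']), axis=1)

# Returning a pd.Series: pandas expands the values into separate columns.
as_columns = df.apply(lambda x: pd.Series(df_func(x['VarA'], x['VarB'])), axis=1)

print(type(as_lists).__name__, as_columns.shape)
```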

As a bonus, I would like to assign names to these columns in the same step, but since ddf.compute() returns a pandas DataFrame, adding column names afterwards shouldn't be too difficult.
