Dask Dataframe - multiple rows from each row

Question

I have this dask dataframe that has two columns, one of which contains tuples (or arrays). What I want is to have a new dataframe that has a row for each element of the tuple in each row.

An example dataframe can be constructed like this:

import pandas as pd
import dask.dataframe as dd
tmp = pd.DataFrame({'name': range(10), 'content': [range(i) for i in range(10)]})
ddf = dd.from_pandas(tmp, npartitions=1)

It is shaped like this:

ddf:   name    content
       0       ()
       1       (0)
       2       (0, 1)
       3       (0, 1, 2)
       ...

My goal is to have something that looks like this:

ddf:   name    element
       1       0
       2       0
       2       1
       3       0
       3       1
       3       2
       ...

Thank you in advance for your help.

Actually, my ultimate goal is to count the occurrences in 'element', which is straightforward if I can get to the last df I showed. If you know another, maybe easier, way to achieve this, I would really appreciate it if you shared it.

Ben.T · Accepted Answer

You can transform the pandas dataframe tmp into the shape you want by doing:

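# Expand each 'content' value into columns, then stack them back into rows.
tmp_2 = (tmp.set_index('name')['content']
            .apply(pd.Series).stack().astype(int)
            # drop(..., 1) with a positional axis no longer works in pandas 2.x
            .reset_index().drop(columns='level_1').rename(columns={0: 'content'}))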

and then create your ddf the same way.
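
As a possible alternative to the apply(pd.Series).stack() chain, pandas 0.25 and later provide explode, which does this kind of reshaping in one step. The following is only a sketch on the example data from the question; the name tmp_3 and the conversion of each range to a list are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

# Same example data as in the question; each range is turned into a list
# here so the cells are plain list-likes (an assumption of this sketch).
tmp = pd.DataFrame({'name': range(10),
                    'content': [list(range(i)) for i in range(10)]})

# explode() emits one row per list element; an empty list becomes a single
# NaN row, so those rows are dropped before casting back to int.
tmp_3 = (tmp.explode('content')
            .dropna(subset=['content'])
            .rename(columns={'content': 'element'})
            .astype({'element': int})
            .reset_index(drop=True))

print(tmp_3.head(6))
```

The row for name 0 disappears because its content is empty, which matches the target layout shown in the question.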

This is done in pandas rather than in dask, but as you said in a comment, you might be able to replicate it in dask.
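
For the ultimate goal stated in the question, counting how often each element occurs, the intermediate dataframe may not be needed at all. Below is a minimal pandas sketch on the same example data; the variable name counts is illustrative. dask.dataframe also exposes explode and value_counts, so a similar chain may carry over to ddf, but that is an assumption that has not been tested here.

```python
import pandas as pd

# Same example data as in the question, with ranges converted to lists
# (an assumption of this sketch).
tmp = pd.DataFrame({'name': range(10),
                    'content': [list(range(i)) for i in range(10)]})

# Flatten the 'content' column to one value per row, drop the NaN produced
# by the empty list, then count how often each element appears.
counts = (tmp['content'].explode()
             .dropna()
             .astype(int)
             .value_counts()
             .sort_index())

print(counts)
```

In this data, element k appears once for every name greater than k, so element 0 is counted 9 times and element 8 once.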
