Reputation: 162
I have this dask dataframe that has two columns, one of which contains tuples (or arrays). What I want is to have a new dataframe that has a row for each element of the tuple in each row.
An example dataframe can be constructed like this:
import pandas as pd
import dask.dataframe as dd
tmp = pd.DataFrame({'name': range(10), 'content': [tuple(range(i)) for i in range(10)]})
ddf = dd.from_pandas(tmp, npartitions=1)
It is shaped like this:
ddf:  name  content
      0     ()
      1     (0,)
      2     (0, 1)
      3     (0, 1, 2)
      ...
My goal is to have something that looks like this:
ddf:  name  element
      1     0
      2     0
      2     1
      3     0
      3     1
      3     2
      ...
Thank you in advance for your help.
Actually, my ultimate goal is to count the occurrences in 'element', which is straightforward if I can get to the last df I showed. If you know another (maybe easier) way to achieve this, I would really appreciate it if you shared it.
Upvotes: 2
Views: 543
Reputation: 29635
You can transform the dataframe tmp into the shape you want by doing:
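# Spread each tuple over columns, then stack back to one row per element (NaN padding is dropped)
tmp_2 = (tmp.set_index('name')['content']
            .apply(pd.Series).stack().astype(int)
            .reset_index().drop(columns='level_1').rename(columns={0:'content'}))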
and then create your ddf the same way.
This is done in pandas rather than dask, but as you said in a comment, you might be able to replicate it in dask.
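For the stated end goal of counting occurrences, a shorter route may be explode followed by value_counts. The sketch below is pandas only and assumes pandas 0.25 or later, where DataFrame.explode is available. The names exploded and counts are made up for illustration, and the content column is renamed to element to match the question. Recent dask versions expose an explode method on their dataframes as well, so the same idea may carry over, but that is not tested here.

```python
import pandas as pd

# Same example frame as in the question, with the tuples made explicit.
tmp = pd.DataFrame({'name': range(10),
                    'content': [tuple(range(i)) for i in range(10)]})

# explode() emits one row per tuple element. An empty tuple becomes a single
# NaN row, so dropna() removes it before casting the column back to int.
exploded = (tmp.explode('content')
               .dropna(subset=['content'])
               .rename(columns={'content': 'element'})
               .astype({'element': int})
               .reset_index(drop=True))

# Count how often each element occurs, ordered by element value.
counts = exploded['element'].value_counts().sort_index()
print(counts)
```

If only the counts are needed, the intermediate frame does not have to be kept around; the value_counts call can be chained directly onto the exploded column.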
Upvotes: 1