Reputation: 1227
I have the following DataFrame df
:
center status devices
1 Green [d1, d2]
1 Green [d5, d1, d2]
2 Green []
3 Green [d5, d6]
I need to unfold lists in the column devices
. The goal is to group data by center
and device
, and then count the number of observations per group.
The expected result would be the following one:
center device count
1 d1 2
1 d2 2
1 d5 1
3 d5 1
3 d6 1
Upvotes: 3
Views: 823
Reputation: 323376
Using unnesting after filter the empty list out then groupby
size
unnesting(df[df.devices.astype(bool)],['devices']).groupby(['center','devices']).size().reset_index(name='count')
Out[214]:
center devices count
0 1 d1 2
1 1 d2 2
2 1 d5 1
3 3 d5 1
4 3 d6 1
def unnesting(df, explode):
idx=df.index.repeat(df[explode[0]].str.len())
df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
df1.index=idx
return df1.join(df.drop(explode,1),how='left')
Upvotes: 2
Reputation: 863611
First flatten list
s and then aggregate by DataFrameGroupBy.size
:
#create Series
s = df.set_index('center')['devices']
#create DataFrame, reshape by stack and conver MultiIndex to columns
df = pd.DataFrame(s.values.tolist(), index=s.index).stack().reset_index()
df.columns= ['center','i','devices']
#aggregate count
df = df.groupby(['center','devices']).size().reset_index(name='count')
print (df)
center device count
0 1 d1 2
1 1 d2 2
2 1 d5 1
3 3 d5 1
4 3 d6 1
Another solution for improve performance:
from itertools import chain
df = pd.DataFrame({
'devices' : list(chain.from_iterable(df['devices'].tolist())),
'center' : df['center'].values.repeat(df['devices'].str.len())
})
df = df.groupby(['center','devices']).size().reset_index(name='count')
print (df)
center devices count
0 1 d1 2
1 1 d2 2
2 1 d5 1
3 3 d5 1
4 3 d6 1
Upvotes: 3