Reputation: 1048
df_tier
device_id month_id net_grp watch_hours class
843 201707 TCH 0.250277 Ser
896 201803 NJV 0.820833 Ser
748 201711 SHX 2.461111 Sr
115 201712 SPE 1.478055 Opter
107 201802 US 2.575555 Opter
249 201710 ECC 3.869166 Ser
786 201711 NCK 0.563888 Opter
183 201802 HO 1.690555 Opter
752 201712 LC 0.993611 Opter
I am doing a pivot operation on a dataset; the end result will contain around 5 million rows and 600 columns. Above is a sample of the data from which the pivot is created.
#Pivot
df_tier.pivot_table(index='device_id',
                    columns='net_grp',
                    values='watch_hours',
                    aggfunc='sum', fill_value=0).reset_index()
This operation on a sample of 200,000 records takes around 93.7 ms, whereas the following groupby:
#Groupby
df_tier.groupby(['device_id','net_grp']).agg({'watch_hours':['sum']}).unstack(level='net_grp').fillna(0).reset_index().set_index('device_id')
takes around 15 ms.
However, the output of pivot is more usable because it has proper column names, whereas the column names from groupby are MultiIndex tuples:
('watch_hours', 'sum', 'TCH'), ('watch_hours', 'sum', 'SPE'), ('watch_hours', 'sum', 'NCK')
How do I get only the net_grp value, say TCH, from these as the column names?
I want to use groupby here because of performance issues when I run on the full data. Can someone please help me get the column names right?
Thanks
Upvotes: 2
Views: 1253
Reputation: 863166
I suggest using sum
instead of agg
, which removes the first and second unnecessary levels from the MultiIndex
in the columns:
df1 = (df_tier.groupby(['device_id','net_grp'])['watch_hours']
.sum()
.unstack(level='net_grp', fill_value=0))
print (df1)
net_grp ECC HO LC NCK NJV SHX \
device_id
107 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
115 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
183 0.000000 1.690555 0.000000 0.000000 0.000000 0.000000
249 3.869166 0.000000 0.000000 0.000000 0.000000 0.000000
748 0.000000 0.000000 0.000000 0.000000 0.000000 2.461111
752 0.000000 0.000000 0.993611 0.000000 0.000000 0.000000
786 0.000000 0.000000 0.000000 0.563888 0.000000 0.000000
843 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
896 0.000000 0.000000 0.000000 0.000000 0.820833 0.000000
net_grp SPE TCH US
device_id
107 0.000000 0.000000 2.575555
115 1.478055 0.000000 0.000000
183 0.000000 0.000000 0.000000
249 0.000000 0.000000 0.000000
748 0.000000 0.000000 0.000000
752 0.000000 0.000000 0.000000
786 0.000000 0.000000 0.000000
843 0.000000 0.250277 0.000000
896 0.000000 0.000000 0.000000
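To match the shape of the pivot output exactly (a plain device_id column plus one column per net_grp), the unstacked result can additionally drop the columns axis name and reset the index. A minimal runnable sketch, using a small illustrative sample in place of the full data:

```python
import pandas as pd

# Small sample mirroring the question's data (values are illustrative)
df_tier = pd.DataFrame({
    'device_id': [843, 896, 748],
    'net_grp': ['TCH', 'NJV', 'SHX'],
    'watch_hours': [0.250277, 0.820833, 2.461111],
})

# groupby + sum keeps a single column level, so unstack yields
# plain net_grp names as columns
df1 = (df_tier.groupby(['device_id', 'net_grp'])['watch_hours']
              .sum()
              .unstack(level='net_grp', fill_value=0)
              .rename_axis(columns=None)   # drop the 'net_grp' axis label
              .reset_index())              # device_id back as a column

print(df1.columns.tolist())  # ['device_id', 'NJV', 'SHX', 'TCH']
```

Note that unstack orders the new columns alphabetically, so the column order may differ from pivot_table only if the source order differs.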
If you want to use the solution with agg
, it is possible to remove the first and second levels with droplevel
:
df1 = (df_tier.groupby(['device_id','net_grp'])
.agg({'watch_hours':['sum']})
.unstack(level='net_grp', fill_value=0))
df1.columns = df1.columns.droplevel([0,1])
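A quick self-contained check of the droplevel approach, again on an illustrative sample (the three-level columns are ('watch_hours', 'sum', net_grp) after unstack):

```python
import pandas as pd

# Illustrative sample in place of the full data
df_tier = pd.DataFrame({
    'device_id': [843, 896, 748],
    'net_grp': ['TCH', 'NJV', 'SHX'],
    'watch_hours': [0.250277, 0.820833, 2.461111],
})

df1 = (df_tier.groupby(['device_id', 'net_grp'])
              .agg({'watch_hours': ['sum']})
              .unstack(level='net_grp', fill_value=0))

# Dropping the first two column levels leaves only the net_grp names
df1.columns = df1.columns.droplevel([0, 1])

print(df1.columns.tolist())  # ['NJV', 'SHX', 'TCH']
```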
Upvotes: 2