Shuvayan Das

Reputation: 1048

pivot_table vs groupby: column names

df_tier

device_id   month_id    net_grp     watch_hours class
843         201707      TCH         0.250277    Ser
896         201803      NJV         0.820833    Ser
748         201711      SHX         2.461111    Sr
115         201712      SPE         1.478055    Opter
107         201802      US          2.575555    Opter
249         201710      ECC         3.869166    Ser
786         201711      NCK         0.563888    Opter
183         201802      HO          1.690555    Opter
752         201712      LC          0.993611    Opter

I am doing a pivot operation on a dataset; the end result will contain around 5 million rows and 600 columns. Above is a sample of the data from which the pivot is created.
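
For reproducibility, the sample above can be built like this (a sketch reconstructed from the table shown, not code from the original post):

import pandas as pd

df_tier = pd.DataFrame({
    'device_id':   [843, 896, 748, 115, 107, 249, 786, 183, 752],
    'month_id':    [201707, 201803, 201711, 201712, 201802,
                    201710, 201711, 201802, 201712],
    'net_grp':     ['TCH', 'NJV', 'SHX', 'SPE', 'US', 'ECC', 'NCK', 'HO', 'LC'],
    'watch_hours': [0.250277, 0.820833, 2.461111, 1.478055, 2.575555,
                    3.869166, 0.563888, 1.690555, 0.993611],
    'class':       ['Ser', 'Ser', 'Sr', 'Opter', 'Opter', 'Ser', 'Opter', 'Opter', 'Opter'],
})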

#Pivot
df_tier.pivot_table(index=['device_id'],
                    columns='net_grp',
                    values='watch_hours',
                    aggfunc=sum, fill_value=0).reset_index()

This operation on a sample of 200,000 records takes around 93.7 ms. When I do a groupby instead, as below:

#Groupby
(df_tier.groupby(['device_id', 'net_grp']).agg({'watch_hours': ['sum']})
        .unstack(level='net_grp').fillna(0)
        .reset_index().set_index('device_id'))

it takes around 15 ms.
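
For reference, one way to check both timings yourself, assuming df_tier is defined as in the snippet above (the exact numbers will depend on data size and hardware):

import timeit

pivot_time = timeit.timeit(
    lambda: df_tier.pivot_table(index=['device_id'], columns='net_grp',
                                values='watch_hours', aggfunc=sum, fill_value=0),
    number=10)

groupby_time = timeit.timeit(
    lambda: (df_tier.groupby(['device_id', 'net_grp'])
                    .agg({'watch_hours': ['sum']})
                    .unstack(level='net_grp')
                    .fillna(0)),
    number=10)

print(pivot_time / 10, groupby_time / 10)   # average seconds per run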

However, the output of the pivot is more usable as it contains proper column names, whereas the column names from the groupby are not proper:

(watch_hours,sum,TCH),(watch_hours,sum,SPE),(watch_hours,sum,NCK)

How do I get only the net_grp value, say TCH, as the column names?

I want to use groupby here because of performance issues when I run on the whole data. Can someone please help me get the column names right?

Thanks

Upvotes: 2

Views: 1253

Answers (1)

jezrael

Reputation: 863166

I suggest using sum instead of agg, which removes the first and second unnecessary levels from the MultiIndex in columns:

df1 = (df_tier.groupby(['device_id','net_grp'])['watch_hours']
              .sum()
              .unstack(level='net_grp', fill_value=0))
print (df1)
net_grp         ECC        HO        LC       NCK       NJV       SHX  \
device_id                                                               
107        0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
115        0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
183        0.000000  1.690555  0.000000  0.000000  0.000000  0.000000   
249        3.869166  0.000000  0.000000  0.000000  0.000000  0.000000   
748        0.000000  0.000000  0.000000  0.000000  0.000000  2.461111   
752        0.000000  0.000000  0.993611  0.000000  0.000000  0.000000   
786        0.000000  0.000000  0.000000  0.563888  0.000000  0.000000   
843        0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
896        0.000000  0.000000  0.000000  0.000000  0.820833  0.000000   

net_grp         SPE       TCH        US  
device_id                                
107        0.000000  0.000000  2.575555  
115        1.478055  0.000000  0.000000  
183        0.000000  0.000000  0.000000  
249        0.000000  0.000000  0.000000  
748        0.000000  0.000000  0.000000  
752        0.000000  0.000000  0.000000  
786        0.000000  0.000000  0.000000  
843        0.000000  0.250277  0.000000  
896        0.000000  0.000000  0.000000 
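
If you also want the same layout as the pivot_table call in the question (device_id as a regular column and no name on the columns axis), a small optional follow-up, not part of the original answer:

df1 = df1.reset_index()
df1.columns.name = None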

If you want to use the solution with agg, it is possible to remove the first and second levels by droplevel:

df1 = (df_tier.groupby(['device_id','net_grp'])
              .agg({'watch_hours':['sum']})
              .unstack(level='net_grp', fill_value=0))

df1.columns = df1.columns.droplevel([0,1])
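
An equivalent way to keep only the last level, if you prefer (a small variation on the same idea, not from the original answer):

df1.columns = df1.columns.get_level_values(2)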

Upvotes: 2
