dartdog
dartdog

Reputation: 10862

Get rid of excess Labels on Pandas DataFrames

so I got a DataFrame by doing:

dfgrp=df.groupby(['CCS_Category_ICD9','Gender'])['f0_'].sum()
ndf=pd.DataFrame(dfgrp)
ndf
                            f0_
CCS_Category_ICD9   Gender  
1                      F    889
                       M    796
                       U    2
2                      F    32637
                       M    33345
                       U    34

Where f0_ is the sum of the counts by Gender All I really want is a simple one level dataframe similar to this which I got via

ndf=ndf.unstack(level=1)
ndf
                   f0_
   Gender          F        M        U
CCS_Category_ICD9           
1                    889.0     796.0    2.0
2                    32637.0   33345.0  34.0
3                    2546.0    1812.0   NaN
4                   347284.0   213782.0 34.0

But what I want is:

CCS_Category_ICD9    F         M         U      
1                    889.0     796.0    2.0
2                    32637.0   33345.0  34.0
3                    2546.0    1812.0   NaN
4                   347284.0   213782.0 34.0

I cannot figure out how to flatten or get rid of the levels associated with f0_ and Gender All I need is the "M","F","U" column headings so I have a simple one level dataframe. I have tried reset_index and set_index along with several other variations, with no luck...

At the end I want to have a simple crosstab with row and column totals (which my example does not show..

well I did (as suggested in one answer):

ndf = ndf.f0_.unstack()
ndf

Which gave me:

Gender  F      M            U
CCS_Category_ICD9           
1   889.0     796.0     2.0
2   32637.0   33345.0   34.0
3   2546.0    1812.0    NaN
4   347284.0  213782.0  34.0

Followed by:

 nndf=ndf.reset_index(['CCS_Category_ICD9','F','M','U'])
 nndf
 Gender CCS_Category_ICD9   F     M         U
  0     1                889.0    796.0     2.0
  1     2                32637.0  33345.0   34.0
  2     3                2546.0   1812.0    NaN
  3     4                347284.0 213782.0  34.0
  4     5                3493.0   7964.0    1.0
  5     6                12295.0  9998.0    4.0

Which just about does it But I cannot change the index name from Gender to something like Idx no matter what I do I get an extra row added with the New name ie a row titled Idx just under Gender.. Also is there a more straight forward solution?

Upvotes: 2

Views: 1536

Answers (3)

unutbu
unutbu

Reputation: 879611

Generally, use df.pivot when you want use a column as the row index and another column as the column index. Use df.pivot_table when you need to aggregate values due to rows with duplicate (row,column) pairs.

In this case, instead of df.groupby(...)[...].sum().unstack() you could use df.pivot_table:

import numpy as np
import pandas as pd
N = 100
df = pd.DataFrame({'CCS': np.random.choice([1,2], size=N),
                   'Gender':np.random.choice(['F','M','U'], size=N),
                   'f0':np.random.randint(10, size=N)})
result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum')
result.columns.name = None
result = result.reset_index()

yields

   CCS   F    M   U
0    1  89  104  90
1    2  66   65  65

Notice that after calling pivot_table(), the DataFrame result has named index and column Indexes:

In [176]: result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum'); result
Out[176]: 
Gender   F    M   U
CCS                
1       89  104  90
2       66   65  65

The index is named CSS:

In [177]: result.index
Out[177]: Int64Index([1, 2], dtype='int64', name='CCS')  

and the columns index is named Gender:

In [178]: result.columns
Out[178]: Index(['F', 'M', 'U'], dtype='object', name='Gender') # <-- notice the name='Gender'

To remove the name from an Index, assign None to the name attribute:

In [179]: result.columns.name = None

In [180]: result
Out[180]: 
      F   M   U
CCS            
1    95  68  67
2    82  63  68

Though it's not needed here, to remove names from the levels of a MultiIndex, assign a list of Nones to the names (plural) attribute:

result.columns.names = [None]*numlevels

Upvotes: 0

piRSquared
piRSquared

Reputation: 294278

Because ndf is a pd.DataFrame it has a column index. When you performed unstack() it appends the last level from the row index to the column index. Since columns already had f0_, you got a second level. To flatten the way you'd like, call unstack() on the column instead.

ndf = ndf.f0_.unstack()

The text Gender is the name of the column index. If you want to get rid of it, you have to overwrite the name attribute for that object.

ndf.columns.name = None

Use this right after the ndf.f0_.unstack()

Upvotes: 0

Stefan
Stefan

Reputation: 42875

You can

df.loc[:, 'f0_'] 

for the DataFrame resulting from .unstack(), ie, select the first level of your MultiIndex columns which only leaves the gender level , or alternatively

df.columns = df.columns.droplevel()

see MultiIndex.droplevel docs

Upvotes: 1

Related Questions