Reputation: 10862
so I got a DataFrame by doing:
dfgrp=df.groupby(['CCS_Category_ICD9','Gender'])['f0_'].sum()
ndf=pd.DataFrame(dfgrp)
ndf
f0_
CCS_Category_ICD9 Gender
1 F 889
M 796
U 2
2 F 32637
M 33345
U 34
Where f0_ is the sum of the counts by Gender All I really want is a simple one level dataframe similar to this which I got via
ndf=ndf.unstack(level=1)
ndf
f0_
Gender F M U
CCS_Category_ICD9
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
But what I want is:
CCS_Category_ICD9 F M U
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
I cannot figure out how to flatten or get rid of the levels associated with f0_ and Gender All I need is the "M","F","U" column headings so I have a simple one level dataframe. I have tried reset_index and set_index along with several other variations, with no luck...
At the end I want to have a simple crosstab with row and column totals (which my example does not show..
well I did (as suggested in one answer):
ndf = ndf.f0_.unstack()
ndf
Which gave me:
Gender F M U
CCS_Category_ICD9
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
Followed by:
nndf=ndf.reset_index(['CCS_Category_ICD9','F','M','U'])
nndf
Gender CCS_Category_ICD9 F M U
0 1 889.0 796.0 2.0
1 2 32637.0 33345.0 34.0
2 3 2546.0 1812.0 NaN
3 4 347284.0 213782.0 34.0
4 5 3493.0 7964.0 1.0
5 6 12295.0 9998.0 4.0
Which just about does it But I cannot change the index name from Gender to something like Idx no matter what I do I get an extra row added with the New name ie a row titled Idx just under Gender.. Also is there a more straight forward solution?
Upvotes: 2
Views: 1536
Reputation: 879611
Generally, use df.pivot
when you want use a column as the row index and another column as the column index. Use df.pivot_table
when you need to aggregate values due to rows with duplicate (row,column) pairs.
In this case, instead of df.groupby(...)[...].sum().unstack()
you could use
df.pivot_table
:
import numpy as np
import pandas as pd
N = 100
df = pd.DataFrame({'CCS': np.random.choice([1,2], size=N),
'Gender':np.random.choice(['F','M','U'], size=N),
'f0':np.random.randint(10, size=N)})
result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum')
result.columns.name = None
result = result.reset_index()
yields
CCS F M U
0 1 89 104 90
1 2 66 65 65
Notice that after calling pivot_table()
, the DataFrame result
has named
index and column Indexes
:
In [176]: result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum'); result
Out[176]:
Gender F M U
CCS
1 89 104 90
2 66 65 65
The index is named CSS
:
In [177]: result.index
Out[177]: Int64Index([1, 2], dtype='int64', name='CCS')
and the columns index is named Gender
:
In [178]: result.columns
Out[178]: Index(['F', 'M', 'U'], dtype='object', name='Gender') # <-- notice the name='Gender'
To remove the name from an Index
, assign None
to the name
attribute:
In [179]: result.columns.name = None
In [180]: result
Out[180]:
F M U
CCS
1 95 68 67
2 82 63 68
Though it's not needed here, to remove names from the levels of a MultiIndex,
assign a list of None
s to the names
(plural) attribute:
result.columns.names = [None]*numlevels
Upvotes: 0
Reputation: 294278
Because ndf
is a pd.DataFrame
it has a column index. When you performed unstack()
it appends the last level from the row index to the column index. Since columns already had f0_
, you got a second level. To flatten the way you'd like, call unstack()
on the column instead.
ndf = ndf.f0_.unstack()
The text Gender
is the name of the column index. If you want to get rid of it, you have to overwrite the name attribute for that object.
ndf.columns.name = None
Use this right after the ndf.f0_.unstack()
Upvotes: 0