Reputation: 21
Using python 3.7 and pandas 0.23.4. I'm trying to make pivot tables with ordered categorical data. If I include margins, the subtotals don't seem to be in the correct order.
import pandas as pd
m='male'
f='female'
data = {'num': [0,1,2,3,4,5,6,7,8,9],
'age': [1,2,2,3,3,3,3,1,2,3],
'sex': [f,f,f,f,f,f,f,m,m,m]}
df = pd.DataFrame(data=data)
df['age1'] = pd.Categorical(df['age'],categories=[3,2,1],ordered=True)
df['sex1'] = pd.Categorical(df['sex'],categories=[m,f],ordered=True)
pd.pivot_table(df,values='num',index='age1',columns='sex1',aggfunc='count',margins=True)
Output (incorrect margins order, the 'All' sums are not in the right rows or columns):
sex1 male female All
age1
3 1 4 2
2 1 2 3
1 1 1 5
All 7 3 10
Expected output (correct margins order):
sex1 male female All
age1
3 1 4 5
2 1 2 3
1 1 1 2
All 3 7 10
In this example it might be better to create the categories with ordered=False. However much of my data is automatically ordered (using pd.cut) so I would like to know if this is intended behavior, and if so, is there a way to remove the ordering on a category that was created with an order?
Edit- here's an example using pd.cut. I changed the 'age' column values to appear in reverse of the cut order.
import pandas as pd
m='male'
f='female'
data = {'num': [0,1,2,3,4,5,6,7,8,9],
'age': [3,3,3,3,2,2,1,1,2,3],
'sex': [f,f,f,f,f,f,f,m,m,m]}
df = pd.DataFrame(data=data)
df['cut'] = pd.cut(df['age'],[1,2,3,4],labels=['<2','2','>2'],right=False)
pd.pivot_table(df,values='num',index='cut',columns='sex',aggfunc='count',margins=True)
Output, again with incorrect row margins (corresponding to the ordered category from pd.cut).
sex female male All
cut
<2 1 1 5
2 2 1 3
>2 4 1 2
All 7 3 10
Expected output would be the correct row margin order.
sex female male All
cut
<2 1 1 2
2 2 1 3
>2 4 1 5
All 7 3 10
Upvotes: 2
Views: 1205
Reputation: 1844
Here is a fix For your original question. Remove the ordered argument from df['age1']
and df['sex1']
My changes:
import pandas as pd
m = 'male'
f = 'female'
data = {'num': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
'age': [1, 2, 2, 3, 3, 3, 3, 1, 2, 3],
'sex': [f, f, f, f, f, f, f, m, m, m]}
df = pd.DataFrame(data=data)
df['age1'] = pd.Categorical(df['age'], categories=[3, 2, 1])
df['sex1'] = pd.Categorical(df['sex'], categories=[m, f])
pd.pivot_table(df, values='num', index='age1',
columns='sex1', aggfunc='count', margins=True)
Output:
sex male female All
age1
3 1 4 5
2 1 2 3
1 1 1 2
All 3 7 10
From the docs: Whether or not this categorical is treated as a ordered categorical. If True, the resulting categorical will be ordered. An ordered categorical respects, when sorted, the order of its categories attribute
So it looks like when you pass ordered=True into pd.Categorical the pivot table will not sort the aggregate based on the category but with the variable appears in your dataframe. If you look at your dataframe, female comes before male so ordered will keep that sort for the aggregate.
So to answer your question this is the intended behavior of ordered, and you should be careful doing this if you will be sorting your dataframe elsewhere in your script.
Upvotes: 1