Reputation: 2506
I have run into the following problem of sorting the row and column headers.
Here is how to reproduce this:
X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor
B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor
g = X.groupby([A,B])['x'].mean() #do a two-way bucketing
print g
#this gives the following and so far so good
x y
[-2.315, -0.843] [-2.58, -0.567] -1.041167
(-0.567, 0.0321] -1.722926
(0.0321, 0.724] -1.245856
(0.724, 3.478] -1.240876
(-0.843, -0.228] [-2.58, -0.567] -0.576264
(-0.567, 0.0321] -0.501709
(0.0321, 0.724] -0.522697
(0.724, 3.478] -0.506259
(-0.228, 0.382] [-2.58, -0.567] 0.175768
(-0.567, 0.0321] 0.214353
(0.0321, 0.724] 0.113650
(0.724, 3.478] -0.013758
(0.382, 2.662] [-2.58, -0.567] 0.983807
(-0.567, 0.0321] 1.214640
(0.0321, 0.724] 0.808608
(0.724, 3.478] 1.515334
Name: x, dtype: float64
#Now let's make a two way table and here is the problem:
HTML(g.unstack().to_html())
This shows:
y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478] [-2.58, -0.567]
x
(-0.228, 0.382] 0.214353 0.113650 -0.013758 0.175768
(-0.843, -0.228] -0.501709 -0.522697 -0.506259 -0.576264
(0.382, 2.662] 1.214640 0.808608 1.515334 0.983807
[-2.315, -0.843] -1.722926 -1.245856 -1.240876 -1.041167
Note how the headers are no longer sorted. I am wondering what is a good way to solve this problem so as to making interactive work easy.
To further track down where the problem is, run the following:
g.unstack().columns
It gives me this: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)
Now compare this with B.levels:
B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)
Obviously the order originally in Factor is lost.
Now to make the matter even worse, let's do a multi-level cross table:
g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())
It shows something like:
y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478]
x
(-0.228, 0.382] x 0.214353 0.113650 -0.013758
y -0.293465 0.321836 1.180369
(-0.843, -0.228] x -0.501709 -0.522697 -0.506259
y -0.204811 0.324571 1.167005
(0.382, 2.662] x 1.214640 0.808608 1.515334
y -0.195446 0.161198 1.074532
[-2.315, -0.843] x -1.722926 -1.245856 -1.240876
y -0.392896 0.335471 1.730513
With both the row and column labels sorted incorrectly.
Thanks.
Upvotes: 17
Views: 7311
Reputation: 375695
This seems like a little bit of a hack, but here goes:
In [11]: g_unstacked = g.unstack()
In [12]: g_unstacked
Out[12]:
y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565]
x
(-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337
(-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180
(0.625, 1.639] 0.987372 1.006496 0.830710 1.202158
[-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679
Making use the fact that unique
preserves order* (grabbing the unique first terms in from the index of g):
In [13]: g.index.get_level_values(0).unique()
Out[13]:
array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
'(0.625, 1.639]'], dtype=object)
As you can see, these are in the correct order.
Now you can reindex
by this:
In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
Out[14]:
y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565]
[-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679
(-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180
(-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337
(0.625, 1.639] 0.987372 1.006496 0.830710 1.202158
Which is now in the correct order.
Update (I missed that the columns were also not in order).
You can use the same trick for the columns (you'll have to chain these operations):
In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)
* this is the reason Series unique is significantly faster than np.unique
.
Upvotes: 11