Tom Bennett
Tom Bennett

Reputation: 2506

Pandas DataFrame.unstack() Changes Order of Row and Column Headers

I have run into the following problem of sorting the row and column headers.

Here is how to reproduce this:

X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor
B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor

g = X.groupby([A,B])['x'].mean() #do a two-way bucketing


print g 

#this gives the following and so far so good

x                 y               
[-2.315, -0.843]  [-2.58, -0.567]    -1.041167
                  (-0.567, 0.0321]   -1.722926
                  (0.0321, 0.724]    -1.245856
                  (0.724, 3.478]     -1.240876
(-0.843, -0.228]  [-2.58, -0.567]    -0.576264
                  (-0.567, 0.0321]   -0.501709
                  (0.0321, 0.724]    -0.522697
                  (0.724, 3.478]     -0.506259
(-0.228, 0.382]   [-2.58, -0.567]     0.175768
                  (-0.567, 0.0321]    0.214353
                  (0.0321, 0.724]     0.113650
                  (0.724, 3.478]     -0.013758
(0.382, 2.662]    [-2.58, -0.567]     0.983807
                  (-0.567, 0.0321]    1.214640
                  (0.0321, 0.724]     0.808608
                  (0.724, 3.478]      1.515334
Name: x, dtype: float64

#Now let's make a two way table and here is the problem:

HTML(g.unstack().to_html())

This shows:

y                 (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  [-2.58, -0.567]
x                                                                                   
(-0.228, 0.382]           0.214353         0.113650       -0.013758         0.175768
(-0.843, -0.228]         -0.501709        -0.522697       -0.506259        -0.576264
(0.382, 2.662]            1.214640         0.808608        1.515334         0.983807
[-2.315, -0.843]         -1.722926        -1.245856       -1.240876        -1.041167

Note how the headers are no longer sorted. I am wondering what is a good way to solve this problem so as to making interactive work easy.

To further track down where the problem is, run the following:

g.unstack().columns

It gives me this: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)

Now compare this with B.levels:

B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)

Obviously the order originally in Factor is lost.

Now to make the matter even worse, let's do a multi-level cross table:

g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())

It shows something like:

y                   (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  
x                                                                       
(-0.228, 0.382]  x          0.214353         0.113650       -0.013758   
                 y         -0.293465         0.321836        1.180369   
(-0.843, -0.228] x         -0.501709        -0.522697       -0.506259   
                 y         -0.204811         0.324571        1.167005   
(0.382, 2.662]   x          1.214640         0.808608        1.515334   
                 y         -0.195446         0.161198        1.074532   
[-2.315, -0.843] x         -1.722926        -1.245856       -1.240876   
                 y         -0.392896         0.335471        1.730513  

With both the row and column labels sorted incorrectly.

Thanks.

Upvotes: 17

Views: 7311

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375695

This seems like a little bit of a hack, but here goes:

In [11]: g_unstacked = g.unstack()

In [12]: g_unstacked
Out[12]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
x
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679

Making use the fact that unique preserves order* (grabbing the unique first terms in from the index of g):

In [13]: g.index.get_level_values(0).unique()
Out[13]:
array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
       '(0.625, 1.639]'], dtype=object)

As you can see, these are in the correct order.

Now you can reindex by this:

In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
Out[14]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158

Which is now in the correct order.

Update (I missed that the columns were also not in order).
You can use the same trick for the columns (you'll have to chain these operations):

In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)

* this is the reason Series unique is significantly faster than np.unique.

Upvotes: 11

Related Questions