Matt P.
Matt P.

Reputation: 153

Pandas groupby indexing preservation

I have a dataframe in pandas that looks like this:

In[12]: a
Out[12]: 
    alpha  beta  gamma           G1           G2           G3  \
0     7         1          1     5.617164     9.048353     2.229996   
1     7         2          2  1469.200000  1475.224000  1468.049000   
2     7         3         62     5.595052     8.975431     2.203285   
3     7         4         62   355.046700   -29.822800     6.901535   
4     7         5         62   706.328700   -67.995310     9.047969   
5     7         6         62   882.444500   -87.075830    10.673920   
6     7         7         62   706.259600   -67.769160     9.366320   
7     7         8         62   355.401700   -29.832250     6.651977   
8     7         9         62     5.669636     9.041898     2.168643   
9     7        10         12     5.663535     9.039821     2.166068   
10    7        11         22  1469.277000  1475.238000  1468.002000   

            G4          G5         G6  
0    -67.46422   -22.76504  -142.4664  
1   1397.88500  1450.74000  1328.4590  
2    -67.48421   -22.68970  -142.5102  
3    -68.16720   -22.94553  -134.2837  
4    -68.53533   -22.72636  -128.2633  
5    -68.89483   -22.90121  -129.0006  
6    -68.86677   -22.81106  -127.4094  
7    -68.27580   -22.86239  -133.6061  
8    -67.44511   -22.82177  -142.4721  
9    -67.44169   -22.79920  -142.4393  
10  1397.89300  1450.75400  1328.5100  

I am using pandas.groupby function to group the dataframe by the columns 'alpha' and 'beta'. When I run the function, the output of the groups looks as intended, with indexing order preserved.

b = a.groupby(['alpha','beta'])

b.groups
Out[18]: 
{(7, 1): Index(['0'], dtype='object'),
 (7, 2): Index(['1'], dtype='object'),
 (7, 3): Index(['2'], dtype='object'),
 (7, 4): Index(['3'], dtype='object'),
 (7, 5): Index(['4'], dtype='object'),
 (7, 6): Index(['5'], dtype='object'),
 (7, 7): Index(['6'], dtype='object'),
 (7, 8): Index(['7'], dtype='object'),
 (7, 9): Index(['8'], dtype='object'),
 (7, 10): Index(['9'], dtype='object'),
 (7, 11): Index(['10'], dtype='object')}

However, the function I am using in the future requires the values inside the tuples of this groupby function to be strings, not integers. Thus I simply change the indicies in 'a' to be strings instead of integers through the following:

a[['alpha','beta']] = a[['alpha','beta']].astype(str)

I then perform the same function as previous to group the dataframe by 'alpha' and 'beta'. HOWEVER, when I do this, the order of indexing is no longer preserved:

b = a.groupby(['alpha','beta'])

b.groups
Out[26]: 
{('7', '1'): Index(['0'], dtype='object'),
 ('7', '10'): Index(['9'], dtype='object'),
 ('7', '11'): Index(['10'], dtype='object'),
 ('7', '2'): Index(['1'], dtype='object'),
 ('7', '3'): Index(['2'], dtype='object'),
 ('7', '4'): Index(['3'], dtype='object'),
 ('7', '5'): Index(['4'], dtype='object'),
 ('7', '6'): Index(['5'], dtype='object'),
 ('7', '7'): Index(['6'], dtype='object'),
 ('7', '8'): Index(['7'], dtype='object'),
 ('7', '9'): Index(['8'], dtype='object')}

As you can see, some of the index locations have "flip-flopped". I have tried forcing the groupby input sort=False, but it returns the same thing. On one note, I would like to know how to stop this from happening, but I am also curious as to WHY this is happening (trying to understand pandas more). Is there a hierarchical order that pandas is looking for when the dtype is an 'object' instead of an 'integer'?

Thanks in advance.

Upvotes: 1

Views: 96

Answers (2)

jezrael
jezrael

Reputation: 862431

Reason is groupby sorting by default, so for avoid it need:

b = a.groupby(['alpha','beta'], sort=False)

Sorting stings is lexicographically, so it seems get flip-flopped values.

EDIT: Unfortunately it does not help and reason is because dictionary under python 3.6 are not orderable.

Upvotes: 1

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

I would like to know how to stop this from happening

As @jezrael already suggested, you can stop the sorting in the groupby if your source data is sorted already according to what you need and the output will be as desired -

b = a.groupby(['alpha','beta'], sort=False)

The reason why you may not be able to see this in your output is that dict in python is not orderable.

I am also curious as to WHY this is happening

When pandas sorts, it also keeps in mind the data type of the columns by which you are sorting. In case(1) (before you cast it to a str), it treats it as an integer and sorts accordingly.

The moment you convert it to a str, it sorts lexicographically -

{('7', '1'): Index(['0'], dtype='object'),
 ('7', '10'): Index(['9'], dtype='object'),
 ('7', '11'): Index(['10'], dtype='object'),
 ('7', '2'): Index(['1'], dtype='object'),
 ('7', '3'): Index(['2'], dtype='object'),
 ('7', '4'): Index(['3'], dtype='object'),

Notice here how ('7', '1') comes first and then the second is ('7', '10'). Basically, by character by character basis, 1 should be followed by 10, and not 2

Is there a hierarchical order that pandas is looking for when the dtype is an object instead of an integer?

Yes the hierarchy matters as well as the data type. If you are sorting by two columns, one int and one str (in that order), then it will first sort numerically and then sort lexicographically.

Hope this helps.

Upvotes: 0

Related Questions