Reputation: 153
I have a dataframe in pandas that looks like this:
In[12]: a
Out[12]:
alpha beta gamma G1 G2 G3 \
0 7 1 1 5.617164 9.048353 2.229996
1 7 2 2 1469.200000 1475.224000 1468.049000
2 7 3 62 5.595052 8.975431 2.203285
3 7 4 62 355.046700 -29.822800 6.901535
4 7 5 62 706.328700 -67.995310 9.047969
5 7 6 62 882.444500 -87.075830 10.673920
6 7 7 62 706.259600 -67.769160 9.366320
7 7 8 62 355.401700 -29.832250 6.651977
8 7 9 62 5.669636 9.041898 2.168643
9 7 10 12 5.663535 9.039821 2.166068
10 7 11 22 1469.277000 1475.238000 1468.002000
G4 G5 G6
0 -67.46422 -22.76504 -142.4664
1 1397.88500 1450.74000 1328.4590
2 -67.48421 -22.68970 -142.5102
3 -68.16720 -22.94553 -134.2837
4 -68.53533 -22.72636 -128.2633
5 -68.89483 -22.90121 -129.0006
6 -68.86677 -22.81106 -127.4094
7 -68.27580 -22.86239 -133.6061
8 -67.44511 -22.82177 -142.4721
9 -67.44169 -22.79920 -142.4393
10 1397.89300 1450.75400 1328.5100
I am using pandas.groupby function to group the dataframe by the columns 'alpha' and 'beta'. When I run the function, the output of the groups looks as intended, with indexing order preserved.
b = a.groupby(['alpha','beta'])
b.groups
Out[18]:
{(7, 1): Index(['0'], dtype='object'),
(7, 2): Index(['1'], dtype='object'),
(7, 3): Index(['2'], dtype='object'),
(7, 4): Index(['3'], dtype='object'),
(7, 5): Index(['4'], dtype='object'),
(7, 6): Index(['5'], dtype='object'),
(7, 7): Index(['6'], dtype='object'),
(7, 8): Index(['7'], dtype='object'),
(7, 9): Index(['8'], dtype='object'),
(7, 10): Index(['9'], dtype='object'),
(7, 11): Index(['10'], dtype='object')}
However, the function I am using in the future requires the values inside the tuples of this groupby function to be strings, not integers. Thus I simply change the indicies in 'a' to be strings instead of integers through the following:
a[['alpha','beta']] = a[['alpha','beta']].astype(str)
I then perform the same function as previous to group the dataframe by 'alpha' and 'beta'. HOWEVER, when I do this, the order of indexing is no longer preserved:
b = a.groupby(['alpha','beta'])
b.groups
Out[26]:
{('7', '1'): Index(['0'], dtype='object'),
('7', '10'): Index(['9'], dtype='object'),
('7', '11'): Index(['10'], dtype='object'),
('7', '2'): Index(['1'], dtype='object'),
('7', '3'): Index(['2'], dtype='object'),
('7', '4'): Index(['3'], dtype='object'),
('7', '5'): Index(['4'], dtype='object'),
('7', '6'): Index(['5'], dtype='object'),
('7', '7'): Index(['6'], dtype='object'),
('7', '8'): Index(['7'], dtype='object'),
('7', '9'): Index(['8'], dtype='object')}
As you can see, some of the index locations have "flip-flopped". I have tried forcing the groupby input sort=False
, but it returns the same thing. On one note, I would like to know how to stop this from happening, but I am also curious as to WHY this is happening (trying to understand pandas more). Is there a hierarchical order that pandas is looking for when the dtype is an 'object' instead of an 'integer'?
Thanks in advance.
Upvotes: 1
Views: 96
Reputation: 862431
Reason is groupby
sorting by default, so for avoid it need:
b = a.groupby(['alpha','beta'], sort=False)
Sorting stings is lexicographically, so it seems get flip-flopped
values.
EDIT: Unfortunately it does not help and reason is because dictionary
under python 3.6
are not orderable.
Upvotes: 1
Reputation: 9081
I would like to know how to stop this from happening
As @jezrael already suggested, you can stop the sorting in the groupby
if your source data is sorted already according to what you need and the output will be as desired -
b = a.groupby(['alpha','beta'], sort=False)
The reason why you may not be able to see this in your output is that dict
in python
is not orderable.
I am also curious as to WHY this is happening
When pandas sorts, it also keeps in mind the data type of the columns by which you are sorting. In case(1) (before you cast it to a str
), it treats it as an integer and sorts accordingly.
The moment you convert it to a str
, it sorts lexicographically -
{('7', '1'): Index(['0'], dtype='object'),
('7', '10'): Index(['9'], dtype='object'),
('7', '11'): Index(['10'], dtype='object'),
('7', '2'): Index(['1'], dtype='object'),
('7', '3'): Index(['2'], dtype='object'),
('7', '4'): Index(['3'], dtype='object'),
Notice here how ('7', '1')
comes first and then the second is ('7', '10')
. Basically, by character by character basis, 1
should be followed by 10
, and not 2
Is there a hierarchical order that pandas is looking for when the dtype is an object
instead of an integer
?
Yes the hierarchy matters as well as the data type. If you are sorting by two columns, one int
and one str
(in that order), then it will first sort numerically and then sort lexicographically.
Hope this helps.
Upvotes: 0