I am trying to sort a pandas DataFrame by the values in two columns. For some reason the integers are being treated as strings, even though a few lines of code earlier those values were still integers. I am not sure what caused the change, but anyway:
df =
contig pos ref haplotype_block hap_X hap_Y odds_ratio My_hap Sp_hap
2 5207 T 1856 T A 167.922 T A
2 5238 G 1856 C G - C G
2 5723 A 1856 A T - A T
2 5867 C 1856 T C - T C
2 155667 G 2816 G * 1.0 N N
2 155670 T 2816 T * - N N
2 67910 C 2 C T 0.21600000000000003 T C
2 67941 A 2 A T - T A
2 68016 A 2 A G - G A
2 118146 C 132 T C 1369.0 T C
2 118237 A 132 C A - C A
2 118938 A 1157 T A 0.002 A T
df.sort_values(by=['contig', 'pos'], inplace=True, ascending=False)
print(df)  # gives me:
contig pos ref haplotype_block hap_X hap_Y odds_ratio My_hap Sp_hap
2 118146 C 132 T C 1369.0 T C
2 118237 A 132 C A - C A
2 118938 A 1157 T A 0.002 A T
2 155667 G 2816 G * 1.0 N N
2 155670 T 2816 T * - N N
2 5207 T 1856 T A 167.922 T A
2 5238 G 1856 C G - C G
2 5723 A 1856 A T - A T
2 5867 C 1856 T C - T C
......
So it is sorting only on the leading digits of the values in both columns (contig and pos). Why is this happening, and is there a simple, memory-efficient way to solve it?
Thanks,
Post edit details:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 9 columns):
contig 333 non-null int64
pos 333 non-null object
ref 333 non-null object
haplotype_block 333 non-null int64
hap_X 333 non-null object
hap_Y 333 non-null object
odds_ratio 333 non-null object
My_hap 333 non-null object
Sp_hap 333 non-null object
dtypes: int64(2), object(7)
memory usage: 23.5+ KB
None
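The info() output above lists pos as object, which usually means the column holds Python strings rather than integers. Strings are compared character by character, so "118146" sorts before "5207". A minimal sketch that reproduces the effect, assuming a hypothetical toy frame (the name toy and its four rows are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical toy frame: 'pos' holds strings, as the object dtype above suggests
toy = pd.DataFrame({
    "contig": [2, 2, 2, 2],
    "pos": ["5207", "155667", "67910", "118146"],
})

print(toy.dtypes)  # 'contig' is int64, 'pos' is not a numeric dtype

# Strings compare character by character, so "118146" comes before "5207"
as_text = toy.sort_values(by=["contig", "pos"])["pos"].tolist()
print(as_text)
```

If the same ordering shows up on the real frame, the column was most likely read or rebuilt as text somewhere upstream.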
Upvotes: 2
Views: 2998
Reputation: 7255
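Convert the values to integers. Per the info() output, pos is stored as object (strings), which is why it sorts lexicographically; contig is already int64, so converting it is harmless but not strictly needed: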
df['contig'] = df['contig'].astype(int)
df['pos'] = df['pos'].astype(int)
Then sort in place:
df.sort_values(by=['contig', 'pos'], inplace=True, ascending=True)
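As a possible variation on the conversion above, pd.to_numeric can be used when it is unclear whether every entry in the column parses as a number. This is only a sketch on a hypothetical toy frame (toy and sorted_toy are illustrative names, not part of the original code):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "contig": [2, 2, 2, 2],
    "pos": ["5207", "155667", "67910", "118146"],
})

# pd.to_numeric parses the strings; with errors="coerce", any entry that
# cannot be parsed becomes NaN instead of raising an exception
toy["pos"] = pd.to_numeric(toy["pos"], errors="coerce")

# Sort numerically and renumber the index
sorted_toy = toy.sort_values(by=["contig", "pos"]).reset_index(drop=True)
print(sorted_toy)
```

With all-integer strings the column comes back as an integer dtype; if any value is coerced to NaN the column becomes float, so it is worth checking for NaN before relying on the result.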
Thanks,
Upvotes: 2