Newkid
Newkid

Reputation: 365

Pandas sort_values does not sort numbers correctly

I have sorted a dataframe by a specific column but the answer that pandas spits out is not exactly correct.

Here is the code I have used:

league_dataframe.sort_values('overall_league_position')

The result that the sort method yields values in column 'overall league position' are not sorted in ascending order which is the default for the method.

enter image description here

What am I doing wrong?

Upvotes: 24

Views: 49246

Answers (2)

cottontail
cottontail

Reputation: 23071

If the dtype of the column should be numeric, then converting to a numeric dtype before sorting is the correct approach. However, if you want to preserve the string/object dtype but sort naturally anyway, you can pass a pd.to_numeric as a key to sort_values().

df = pd.DataFrame({"Col": ['1', '2', '3', '10', '20', '19']})
df = df.sort_values('Col', key=pd.to_numeric)

This is similar to sorted(mylist, key=float) in vanilla Python where the type of the actual data is not modified but that of the keys change. However, unlike sorted(), the function to pass as a key to sort_values() must be vectorized (as in, it must return the entire column of sorting keys); so float cannot work but pd.to_numeric can.

A more illustrative example is to sort a column of percentages naturally. In that case, we can pass a lambda function that strips the percentage symbols and converts the column into a numeric one as the sorting key.

df = pd.DataFrame({"Col": ['1%', '10%', '3%', '2%', '20%', '19%']})
df = df.sort_values('Col', key=lambda x: pd.to_numeric(x.str.rstrip('%'), errors='coerce'))
# if the data is clean, can use `astype` as well
df = df.sort_values('Col', key=lambda x: x.str.rstrip('%').astype(float))

result

Upvotes: 0

cs95
cs95

Reputation: 402333

For whatever reason, you seem to be working with a column of strings, and sort_values is returning you a lexsorted result.

Here's an example.

df = pd.DataFrame({"Col": ['1', '2', '3', '10', '20', '19']})
df

  Col
0   1
1   2
2   3
3  10
4  20
5  19

df.sort_values('Col')

  Col
0   1
3  10
5  19
1   2
4  20
2   3

The remedy is to convert it to numeric, either using .astype or pd.to_numeric.

df.Col = df.Col.astype(float)

Or,

df.Col = pd.to_numeric(df.Col, errors='coerce')
df.sort_values('Col')

   Col
0    1
1    2
2    3
3   10
5   19
4   20

The only difference b/w astype and pd.to_numeric is that the latter is more robust at handling non-numeric strings (they're coerced to NaN), and will attempt to preserve integers if a coercion to float is not necessary (as is seen in this case).

Upvotes: 36

Related Questions