Reputation: 365
I have sorted a dataframe by a specific column but the answer that pandas spits out is not exactly correct.
Here is the code I have used:
league_dataframe.sort_values('overall_league_position')
The result that the sort method yields values in column 'overall league position' are not sorted in ascending order which is the default for the method.
What am I doing wrong?
Upvotes: 24
Views: 49246
Reputation: 23071
If the dtype of the column should be numeric, then converting to a numeric dtype before sorting is the correct approach. However, if you want to preserve the string/object dtype but sort naturally anyway, you can pass a pd.to_numeric
as a key to sort_values()
.
df = pd.DataFrame({"Col": ['1', '2', '3', '10', '20', '19']})
df = df.sort_values('Col', key=pd.to_numeric)
This is similar to sorted(mylist, key=float)
in vanilla Python where the type of the actual data is not modified but that of the keys change. However, unlike sorted()
, the function to pass as a key to sort_values()
must be vectorized (as in, it must return the entire column of sorting keys); so float
cannot work but pd.to_numeric
can.
A more illustrative example is to sort a column of percentages naturally. In that case, we can pass a lambda function that strips the percentage symbols and converts the column into a numeric one as the sorting key.
df = pd.DataFrame({"Col": ['1%', '10%', '3%', '2%', '20%', '19%']})
df = df.sort_values('Col', key=lambda x: pd.to_numeric(x.str.rstrip('%'), errors='coerce'))
# if the data is clean, can use `astype` as well
df = df.sort_values('Col', key=lambda x: x.str.rstrip('%').astype(float))
Upvotes: 0
Reputation: 402333
For whatever reason, you seem to be working with a column of strings, and sort_values
is returning you a lexsorted result.
Here's an example.
df = pd.DataFrame({"Col": ['1', '2', '3', '10', '20', '19']})
df
Col
0 1
1 2
2 3
3 10
4 20
5 19
df.sort_values('Col')
Col
0 1
3 10
5 19
1 2
4 20
2 3
The remedy is to convert it to numeric, either using .astype
or pd.to_numeric
.
df.Col = df.Col.astype(float)
Or,
df.Col = pd.to_numeric(df.Col, errors='coerce')
df.sort_values('Col')
Col
0 1
1 2
2 3
3 10
5 19
4 20
The only difference b/w astype
and pd.to_numeric
is that the latter is more robust at handling non-numeric strings (they're coerced to NaN
), and will attempt to preserve integers if a coercion to float is not necessary (as is seen in this case).
Upvotes: 36