Reputation: 157
How can I rank a DataFrame based on 2 columns?
On below example, col_b
would the tie breaker for col_a
.
DataFrame:
df = pd.DataFrame({'col_a':[0,0,0,1,1,1], 'col_b':[5,2,8,3,7,4]})
df
col_a col_b
0 0 5
1 0 2
2 0 8
3 1 3
4 1 7
5 1 4
Expected Output:
col_a col_b Rank
0 0 5 2
1 0 2 1
2 0 8 3
3 1 3 4
4 1 7 6
5 1 4 5
Upvotes: 4
Views: 7848
Reputation: 107287
Here is a one-line approach using sort_values
:
In [135]: df['rank'] = df.sort_values(['col_a', 'col_b'])['col_b'].index + 1
In [136]: df
Out[136]:
col_a col_b rank
0 0 5 2
1 0 2 1
2 0 8 3
3 1 3 4
4 1 7 6
5 1 4 5
The logic behind this snippet: Basically, the DataFrame.sort_values
function accepts multiple column names and returns a sorted copy of the dataframe based on the order of passed column names. The default sorting order is ascending
which is what we want. If you wanted another order you could pass the order as an iterable of booleans to the ascending
keyword argument. At the end the new indices of the column_b
is what we want (plus one).
Upvotes: 4
Reputation: 7994
Using numpy
's argsort
method.
df = pd.DataFrame({'col_a':[0,0,0,1,1,1], 'col_b':[5,2,8,3,7,4]})
df["rank"] = np.argsort(
df.values.copy().view(dtype=[('x', int), ('y', int)]).flatten(),
order=("x","y")
) + 1
col_a col_b rank
0 0 5 2
1 0 2 1
2 0 8 3
3 1 3 4
4 1 7 6
5 1 4 5
Upvotes: 1
Reputation: 43494
Here is one way. Create a temp DataFrame by sorting the columns and re-indexing. Then use the new index as the rank and join back to the original df.
temp_df = df.sort_values(['col_a', 'col_b']).reset_index()
temp_df['rank'] = temp_df.index + 1
print(temp_df)
# index col_a col_b rank
#0 1 0 2 1
#1 0 0 5 2
#2 2 0 8 3
#3 3 1 3 4
#4 5 1 4 5
#5 4 1 7 6
The column 'index'
corresponds to the index in the original DataFrame. Use this to join temp_df
back to df
and select the columns you want:
df = df.join(temp_df.set_index('index'), rsuffix="_r")[['col_a', 'col_b', 'rank']]
print(df)
# col_a col_b rank
#0 0 5 2
#1 0 2 1
#2 0 8 3
#3 1 3 4
#4 1 7 6
#5 1 4 5
Upvotes: 3
Reputation: 157
Found my own solution: Create a tuple with the columns and rank it. Won't handle different ascending/descending order, but it is good for my problem.
df['rank'] = df[['col_a','col_b']].apply(tuple, 1).rank()
Upvotes: 2