Reputation: 2798
I am trying to rank a pandas data frame based on two columns. I can rank it based on one column, but how can to rank it based on two columns? 'SaleCount', then 'TotalRevenue'?
import pandas as pd
df = pd.DataFrame({'TotalRevenue':[300,9000,1000,750,500,2000,0,600,50,500],
'Date':['2016-12-02' for i in range(10)],
'SaleCount':[10,100,30,35,20,100,0,30,2,20],
'shops':['S3','S2','S1','S5','S4','S8','S6','S7','S9','S10']})
df['Rank'] = df.SaleCount.rank(method='dense',ascending = False).astype(int)
#df['Rank'] = df.TotalRevenue.rank(method='dense',ascending = False).astype(int)
df.sort_values(['Rank'], inplace=True)
print(df)
current output:
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-06 100 2000 S8 1
3 2016-12-04 35 750 S5 2
2 2016-12-03 30 1000 S1 3
7 2016-12-08 30 600 S7 3
9 2016-12-10 20 500 S10 4
4 2016-12-05 20 500 S4 4
0 2016-12-01 10 300 S3 5
8 2016-12-09 2 50 S9 6
6 2016-12-07 0 0 S6 7
I'm trying to generate an output like this:
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 2
3 2016-12-02 35 750 S5 3
2 2016-12-02 30 1000 S1 4
7 2016-12-02 30 600 S7 5
9 2016-12-02 20 500 S10 6
4 2016-12-02 20 500 S4 6
0 2016-12-02 10 300 S3 7
8 2016-12-02 2 50 S9 8
6 2016-12-02 0 0 S6 9
Upvotes: 25
Views: 55132
Reputation: 1
If you use groupby you're not actually creating a multi-factor rank, you're just ranking by one value and subsorting.
This isn't the most elegant solution, but you could just create rank columns for each type of value you want to rank, aggregate and sort them by the aggregate.
This shows multi-factor rank before sorting.
TotalRevenue SaleCount shops TotRevRnk SaleCntRnk ShopRnk AggRank
1 9000 100 S2 1 1 4 6
5 2000 100 S8 2 1 1 4
3 750 35 S5 4 2 3 9
2 1000 30 S1 3 3 5 11
7 600 30 S7 5 3 2 10
(Using the data from the example above for comparison purposes) If you sort the aggregate rank, you can see that the first two items flip position. So, if you want a multi-factor rank, rather then just cascading sorts this is probably a better solution.
Upvotes: 0
Reputation: 310
This function will rank successively by a list of columns and supports ranking with groups (something that cannot be done if you just order all rows by multiple columns).
def rank_multicol(
df: pd.DataFrame,
rank_by: List[str],
group_by: Optional[List[str]] = None,
ascending: Union[List[bool], bool] = True,
rank_col_name: str = 'rank',
) - > pd.DataFrame:
df_aux = df.copy()
columns_to_group_by = [] if group_by is None else group_by
if type(ascending) is bool:
ascending = [ascending for _ in range(len(rank_by))]
elif len(ascending) != len(rank_by):
raise ValueError("`ascending` must be a scalar or have the same length of `rank_by`.")
for idx, feature in enumerate(rank_by):
# TODO: Optimize if no untying is required
if columns_to_group_by:
df_to_rank = df_aux.groupby(columns_to_group_by)
else:
df_to_rank = df_aux.copy()
ranks = (
df_to_rank
[feature]
.rank(ascending=ascending[idx], method='min')
.rename(rank_col_name)
)
if rank_col_name in df_aux:
df_aux[rank_col_name] = ranks + (df_aux[rank_col_name] - 1)
else:
df_aux[rank_col_name] = ranks
columns_to_group_by.append(feature)
return df_aux
Upvotes: 0
Reputation: 294258
pd.factorize
will generate unique values for each unique element of a iterable. We only need to sort in the order we'd like, then factorize. In order to do multiple columns, we convert the sorted result to tuples.
cols = ['SaleCount', 'TotalRevenue']
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
f, i = pd.factorize(tups)
factorized = pd.Series(f + 1, tups.index)
df.assign(Rank=factorized)
Date SaleCount TotalRevenue shops Rank
1 2016-12-02 100 9000 S2 1
5 2016-12-02 100 2000 S8 2
3 2016-12-02 35 750 S5 3
2 2016-12-02 30 1000 S1 4
7 2016-12-02 30 600 S7 5
4 2016-12-02 20 500 S4 6
9 2016-12-02 20 500 S10 6
0 2016-12-02 10 300 S3 7
8 2016-12-02 2 50 S9 8
6 2016-12-02 0 0 S6 9
Upvotes: 13
Reputation: 59549
sort_values
+ GroupBy.ngroup
This will give the dense
ranking.
Columns should be sorted in the desired order prior to the groupby. Specifying sort=False
within the groupby
then respects this sorting so that groups are labeled in the order they appear within the sorted DataFrame.
cols = ['SaleCount', 'TotalRevenue']
df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1
Output:
print(df.sort_values('Rank'))
TotalRevenue Date SaleCount shops Rank
1 9000 2016-12-02 100 S2 1
5 2000 2016-12-02 100 S8 2
3 750 2016-12-02 35 S5 3
2 1000 2016-12-02 30 S1 4
7 600 2016-12-02 30 S7 5
4 500 2016-12-02 20 S4 6
9 500 2016-12-02 20 S10 6
0 300 2016-12-02 10 S3 7
8 50 2016-12-02 2 S9 8
6 0 2016-12-02 0 S6 9
Upvotes: 7
Reputation: 18628
The generic way to do that is to group the desired fiels in a tuple, whatever the types.
df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
.rank(method='dense',ascending=False).astype(int)
df.sort_values("Rank")
TotalRevenue Date SaleCount shops Rank
1 9000 2016-12-02 100 S2 1
5 2000 2016-12-02 100 S8 2
3 750 2016-12-02 35 S5 3
2 1000 2016-12-02 30 S1 4
7 600 2016-12-02 30 S7 5
4 500 2016-12-02 20 S4 6
9 500 2016-12-02 20 S10 6
0 300 2016-12-02 10 S3 7
8 50 2016-12-02 2 S9 8
6 0 2016-12-02 0 S6 9
Upvotes: 31
Reputation: 33938
(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)
However here's a shortcut if you know that TotalRevenue
is constrained to some range e.g. 0 to MAX_REVENUE=100,000 ; directly manipulate them as nonnegative integers:
df['Rank'] = (df['SaleCount']*MAX_REVENUE + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank2')
Upvotes: 1
Reputation: 29711
Another way would be to type-cast both the columns of interest to str
and combine them by concatenating them. Convert these back to numerical values so that they could be differentiated based on their magnitude.
In method=dense
, ranks of duplicated values would remain unchanged. (Here: 6)
Since you want to rank these in their descending order, specifying ascending=False
in Series.rank()
would let you achieve the desired result.
col1 = df["SaleCount"].astype(str)
col2 = df["TotalRevenue"].astype(str)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')
Upvotes: 8