Reputation: 2798

Pandas rank by multiple columns

I am trying to rank a pandas data frame based on two columns. I can rank it based on one column, but how can to rank it based on two columns? 'SaleCount', then 'TotalRevenue'?

import pandas as pd

df = pd.DataFrame({'TotalRevenue':[300,9000,1000,750,500,2000,0,600,50,500],
    'Date':['2016-12-02' for i in range(10)],
    'SaleCount':[10,100,30,35,20,100,0,30,2,20],
    'shops':['S3','S2','S1','S5','S4','S8','S6','S7','S9','S10']})

df['Rank'] = df.SaleCount.rank(method='dense',ascending = False).astype(int)

#df['Rank'] = df.TotalRevenue.rank(method='dense',ascending = False).astype(int)
df.sort_values(['Rank'], inplace=True)

print(df)

current output:

    Date        SaleCount   TotalRevenue    shops   Rank
1   2016-12-02  100          9000            S2      1
5   2016-12-06  100          2000            S8      1
3   2016-12-04  35           750             S5      2
2   2016-12-03  30           1000            S1      3
7   2016-12-08  30           600             S7      3
9   2016-12-10  20           500             S10     4
4   2016-12-05  20           500             S4      4
0   2016-12-01  10           300             S3      5
8   2016-12-09  2            50              S9      6
6   2016-12-07  0            0               S6      7

I'm trying to generate an output like this:

    Date        SaleCount   TotalRevenue    shops   Rank
1   2016-12-02  100          9000            S2      1
5   2016-12-02  100          2000            S8      2
3   2016-12-02  35           750             S5      3
2   2016-12-02  30           1000            S1      4
7   2016-12-02  30           600             S7      5
9   2016-12-02  20           500             S10     6
4   2016-12-02  20           500             S4      6
0   2016-12-02  10           300             S3      7
8   2016-12-02  2            50              S9      8
6   2016-12-02  0            0               S6      9

Upvotes: 25

Answers (7)

Tyrin Avery

Reputation: 1

If you use groupby you're not actually creating a multi-factor rank, you're just ranking by one value and subsorting. This isn't the most elegant solution, but you could just create rank columns for each type of value you want to rank, aggregate and sort them by the aggregate.
This shows multi-factor rank before sorting.

   TotalRevenue  SaleCount shops  TotRevRnk SaleCntRnk ShopRnk AggRank
1          9000  100        S2     1        1           4        6
5          2000  100        S8     2        1           1        4
3           750  35         S5     4        2           3        9
2          1000  30         S1     3        3           5        11
7           600  30         S7     5        3           2        10

(Using the data from the example above for comparison purposes) If you sort the aggregate rank, you can see that the first two items flip position. So, if you want a multi-factor rank, rather then just cascading sorts this is probably a better solution.

Upvotes: 0

VaM

Reputation: 310

This function will rank successively by a list of columns and supports ranking with groups (something that cannot be done if you just order all rows by multiple columns).

def rank_multicol(
    df: pd.DataFrame,
    rank_by: List[str],
    group_by: Optional[List[str]] = None,
    ascending: Union[List[bool], bool] = True,
    rank_col_name: str = 'rank',
) - > pd.DataFrame:
    df_aux = df.copy()
    columns_to_group_by = [] if group_by is None else group_by
    if type(ascending) is bool:
        ascending = [ascending for _ in range(len(rank_by))]
    elif len(ascending) != len(rank_by):
        raise ValueError("`ascending` must be a scalar or have the same length of `rank_by`.")

    for idx, feature in enumerate(rank_by):
        # TODO: Optimize if no untying is required
        if columns_to_group_by:
            df_to_rank = df_aux.groupby(columns_to_group_by)
        else:
            df_to_rank = df_aux.copy()
        ranks = (
            df_to_rank
            [feature]
            .rank(ascending=ascending[idx], method='min')
            .rename(rank_col_name)
        )
        if rank_col_name in df_aux:
            df_aux[rank_col_name] = ranks + (df_aux[rank_col_name] - 1)
        else:
            df_aux[rank_col_name] = ranks

        columns_to_group_by.append(feature)
    return df_aux

Upvotes: 0

piRSquared

Reputation: 294258

pd.factorize will generate unique values for each unique element of a iterable. We only need to sort in the order we'd like, then factorize. In order to do multiple columns, we convert the sorted result to tuples.

cols = ['SaleCount', 'TotalRevenue']
tups = df[cols].sort_values(cols, ascending=False).apply(tuple, 1)
f, i = pd.factorize(tups)
factorized = pd.Series(f + 1, tups.index)

df.assign(Rank=factorized)

         Date  SaleCount  TotalRevenue shops  Rank
1  2016-12-02        100          9000    S2     1
5  2016-12-02        100          2000    S8     2
3  2016-12-02         35           750    S5     3
2  2016-12-02         30          1000    S1     4
7  2016-12-02         30           600    S7     5
4  2016-12-02         20           500    S4     6
9  2016-12-02         20           500   S10     6
0  2016-12-02         10           300    S3     7
8  2016-12-02          2            50    S9     8
6  2016-12-02          0             0    S6     9

Upvotes: 13

ALollz

Reputation: 59549

`sort_values` + `GroupBy.ngroup`

This will give the dense ranking.

Columns should be sorted in the desired order prior to the groupby. Specifying sort=False within the groupby then respects this sorting so that groups are labeled in the order they appear within the sorted DataFrame.

cols = ['SaleCount', 'TotalRevenue']
df['Rank'] = df.sort_values(cols, ascending=False).groupby(cols, sort=False).ngroup() + 1

Output:

print(df.sort_values('Rank'))

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9

Upvotes: 7

B. M.

Reputation: 18628

The generic way to do that is to group the desired fiels in a tuple, whatever the types.

df["Rank"] = df[["SaleCount","TotalRevenue"]].apply(tuple,axis=1)\
             .rank(method='dense',ascending=False).astype(int)

df.sort_values("Rank")

   TotalRevenue        Date  SaleCount shops  Rank
1          9000  2016-12-02        100    S2     1
5          2000  2016-12-02        100    S8     2
3           750  2016-12-02         35    S5     3
2          1000  2016-12-02         30    S1     4
7           600  2016-12-02         30    S7     5
4           500  2016-12-02         20    S4     6
9           500  2016-12-02         20   S10     6
0           300  2016-12-02         10    S3     7
8            50  2016-12-02          2    S9     8
6             0  2016-12-02          0    S6     9

Upvotes: 31

smci

Reputation: 33938

(The correct way to rank two (nonnegative) int columns is as per Nickil Maveli's answer, to cast them to string, concatenate them and cast back to int.)

However here's a shortcut if you know that TotalRevenue is constrained to some range e.g. 0 to MAX_REVENUE=100,000 ; directly manipulate them as nonnegative integers:

df['Rank'] = (df['SaleCount']*MAX_REVENUE + df['TotalRevenue']).rank(method='dense', ascending=False).astype(int)

df.sort_values('Rank2')

Upvotes: 1

Nickil Maveli

Reputation: 29711

Another way would be to type-cast both the columns of interest to str and combine them by concatenating them. Convert these back to numerical values so that they could be differentiated based on their magnitude.

In method=dense, ranks of duplicated values would remain unchanged. (Here: 6)

Since you want to rank these in their descending order, specifying ascending=False in Series.rank() would let you achieve the desired result.

col1 = df["SaleCount"].astype(str) 
col2 = df["TotalRevenue"].astype(str)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')

Upvotes: 8

Pandas rank by multiple columns

Answers (7)

sort_values + GroupBy.ngroup

Related Questions

`sort_values` + `GroupBy.ngroup`