Python 3: Rank dataframe using multiple columns

Question

I have a dataframe "df_orders_raw" like this:

ORD_KEY    ORD_DT_KEY  ORD_TM_KEY  SKU_KEY    QTY  
933915294  7598        13          401550750  1
933915294  7598        13          409868344  1
933915294  7598        13          428852481  1
933919765  7598        13          432771563  2
933892844  7598        24          429377565  1
933892844  7598        24          413100006  1
933892844  7598        24          433950159  1
933908232  7598        38          427905127  1
933882107  7598        40          415639133  1
933882107  7598        40          428839526  1
933918543  7598        44          429503890  1
933918543  7598        44          429501262  1

I want to add an ORD_RANK column to this frame ranking data by ORD_DT_KEY, ORD_TM_KEY, ORD_KEY meaning, data should be grouped by ORD_DT_KEY first, and then ORD_TM_KEY will break first level ties followed by ORD_KEY.

Resulting ranks should look as below:

ORD_KEY    ORD_DT_KEY  ORD_TM_KEY  SKU_KEY    QTY  ORD_RANK
933915294  7598        13          401550750  1    1
933915294  7598        13          409868344  1    1
933915294  7598        13          428852481  1    1
933919765  7598        13          432771563  2    2
933892844  7598        24          429377565  1    3
933892844  7598        24          413100006  1    3
933892844  7598        24          433950159  1    3
933908232  7598        38          427905127  1    4
933882107  7598        40          415639133  1    5
933882107  7598        40          428839526  1    5
933918543  7598        44          429503890  1    6
933918543  7598        44          429501262  1    6

After reading through a bunch of similar QnAs, my current code looks as below:

df_orders = df_orders_raw.copy() 
col1 = df_orders['ORD_DT_KEY'].astype(str)
col2 = df_orders['ORD_TM_KEY'].astype(str)
col3 = df_orders['ORD_KEY'].astype(str)
key = col1+col2+col3

df_orders['ORD_RANK'] = (key).astype(float).rank(method='dense').astype(int)

This code works for most cases but starts breaking when key becomes very large (10^17 or larger) since it starts rounding up float numbers. It worked in Python 2.7 since I was using "long" instead of "float", but that's not supported in Python 3.5. If I use (key).astype(int), it throws an OverflowError. After reading similar questions, people suggested using float but that gives me another issue. How can I modify code to make it work for all cases?

BENY · Accepted Answer

Using ngroup

df['ORD_RANK']=df.groupby(['ORD_DT_KEY','ORD_TM_KEY','ORD_KEY']).ngroup()+1
df
Out[1010]: 
      ORD_KEY  ORD_DT_KEY  ORD_TM_KEY    SKU_KEY  QTY  ORD_RANK
0   933915294        7598          13  401550750    1         1
1   933915294        7598          13  409868344    1         1
2   933915294        7598          13  428852481    1         1
3   933919765        7598          13  432771563    2         2
4   933892844        7598          24  429377565    1         3
5   933892844        7598          24  413100006    1         3
6   933892844        7598          24  433950159    1         3
7   933908232        7598          38  427905127    1         4
8   933882107        7598          40  415639133    1         5
9   933882107        7598          40  428839526    1         5
10  933918543        7598          44  429503890    1         6
11  933918543        7598          44  429501262    1         6

Python 3: Rank dataframe using multiple columns

Answers (1)

Related Questions