srgam
srgam

Reputation: 366

How to create a rank from a df with Pandas

I have a table that is cronologically sorted, with an state and an amount fore each date. The table looks as follows:

Date State Amount
01/01/2022 1 1233.11
02/01/2022 1 16.11
03/01/2022 2 144.58
04/01/2022 1 298.22
05/01/2022 2 152.34
06/01/2022 2 552.01
07/01/2022 3 897.25

To generate the dataset:

pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})

I want to add a column called rank that is increased when the state changes. So if you have twenty times state 1, it is just rank 1. If then you have state 2, when the state 1 appears again, the rank is increased. That is, if for two days in a row State is 1, Rank is 1. Then, another state appears. When State 1 appears again, Rank would increment to 2.

I want to add a column called "Rank" which has a value that increments itself if a given state appears again. It is like a counter amount of times that state appear consecutively. That it, if state. An example would be as follows:

Date State Amount Rank
01/01/2022 1 1233.11 1
02/01/2022 1 16.11 1
03/01/2022 2 144.58 1
04/01/2022 1 298.22 2
05/01/2022 2 152.34 2
06/01/2022 2 552.01 2
07/01/2022 3 897.25 1

This could be also understanded as follows:

Date State Amount Rank_State1 Rank_State2 Rank_State2
01/01/2022 1 1233.11 1
02/01/2022 1 16.11 1
03/01/2022 2 144.58 1
04/01/2022 1 298.22 2
05/01/2022 2 152.34 2
06/01/2022 2 552.01 2
07/01/2022 3 897.25 1

Does anyone know how to build that Rank column starting from the previous table?

Upvotes: 0

Views: 82

Answers (1)

James_SO
James_SO

Reputation: 1387

Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.

Here's one way you can do it - maybe not the most elegant, but I think it does what you need

import pandas as pd
someDF = pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})

someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()

def groupOccurrence(someRow):
    sa = someRow["StateAccumulator"]
    s = str(someRow["state"])
    stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split())\
                    + int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
    return stateRank


someDF["Rank"] = someDF.apply(lambda x: groupOccurrence(x), axis=1)

If I understand correctly, this is the result you want - "Rank" is intended to represent the number of times a given set of contiguous states have appeared:

          date  state  amount StateAccumulator  Rank
0   01/08/2022      1     144                1     1
1   02/08/2022      1     142               11     1
2   03/08/2022      2     166              112     1
3   04/08/2022      2     144             1122     1
4   05/08/2022      3     142            11223     1
5   06/08/2022      1     166           112231     2
6   07/08/2022      1     144          1122311     2
7   08/08/2022      2     142         11223112     2
8   09/08/2022      2     166        112231122     2
9   10/08/2022      2     142       1122311222     2
10  11/08/2022      1     166      11223112221     3

Notes:

  • instead of the somewhat hacky string cumsum method I'm using here, you could probably use a list accumulation function and then use a pandas split-apply-combine method to do the counting in the lambda function
  • you would then apply a state change boolean, and do a cumsum on the state change boolean, filtered/grouped on the state value (so, how many state changes do we have for any given state)
  • state change boolean is done like this: someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
  • so for a given state at a given row, you'd count how many state changes had occurred in the previous rows.

Upvotes: 1

Related Questions