Reputation: 366
I have a table that is cronologically sorted, with an state and an amount fore each date. The table looks as follows:
Date | State | Amount |
---|---|---|
01/01/2022 | 1 | 1233.11 |
02/01/2022 | 1 | 16.11 |
03/01/2022 | 2 | 144.58 |
04/01/2022 | 1 | 298.22 |
05/01/2022 | 2 | 152.34 |
06/01/2022 | 2 | 552.01 |
07/01/2022 | 3 | 897.25 |
To generate the dataset:
pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})
I want to add a column called rank that is increased when the state changes. So if you have twenty times state 1, it is just rank 1. If then you have state 2, when the state 1 appears again, the rank is increased. That is, if for two days in a row State is 1, Rank is 1. Then, another state appears. When State 1 appears again, Rank would increment to 2.
I want to add a column called "Rank" which has a value that increments itself if a given state appears again. It is like a counter amount of times that state appear consecutively. That it, if state. An example would be as follows:
Date | State | Amount | Rank |
---|---|---|---|
01/01/2022 | 1 | 1233.11 | 1 |
02/01/2022 | 1 | 16.11 | 1 |
03/01/2022 | 2 | 144.58 | 1 |
04/01/2022 | 1 | 298.22 | 2 |
05/01/2022 | 2 | 152.34 | 2 |
06/01/2022 | 2 | 552.01 | 2 |
07/01/2022 | 3 | 897.25 | 1 |
This could be also understanded as follows:
Date | State | Amount | Rank_State1 | Rank_State2 | Rank_State2 |
---|---|---|---|---|---|
01/01/2022 | 1 | 1233.11 | 1 | ||
02/01/2022 | 1 | 16.11 | 1 | ||
03/01/2022 | 2 | 144.58 | 1 | ||
04/01/2022 | 1 | 298.22 | 2 | ||
05/01/2022 | 2 | 152.34 | 2 | ||
06/01/2022 | 2 | 552.01 | 2 | ||
07/01/2022 | 3 | 897.25 | 1 |
Does anyone know how to build that Rank column starting from the previous table?
Upvotes: 0
Views: 82
Reputation: 1387
Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.
Here's one way you can do it - maybe not the most elegant, but I think it does what you need
import pandas as pd
someDF = pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})
someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()
def groupOccurrence(someRow):
sa = someRow["StateAccumulator"]
s = str(someRow["state"])
stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split())\
+ int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
return stateRank
someDF["Rank"] = someDF.apply(lambda x: groupOccurrence(x), axis=1)
If I understand correctly, this is the result you want - "Rank" is intended to represent the number of times a given set of contiguous states have appeared:
date state amount StateAccumulator Rank
0 01/08/2022 1 144 1 1
1 02/08/2022 1 142 11 1
2 03/08/2022 2 166 112 1
3 04/08/2022 2 144 1122 1
4 05/08/2022 3 142 11223 1
5 06/08/2022 1 166 112231 2
6 07/08/2022 1 144 1122311 2
7 08/08/2022 2 142 11223112 2
8 09/08/2022 2 166 112231122 2
9 10/08/2022 2 142 1122311222 2
10 11/08/2022 1 166 11223112221 3
Notes:
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
Upvotes: 1