Reputation: 768
Let's say we have a dataframe like this:
Id Policy_id Start_Date End_Date Fee1 Fee2 Last_dup
0 b123 2019/02/24 2019/03/23 0 23 0
1 b123 2019/03/24 2019/04/23 0 23 0
2 b123 2019/04/24 2019/05/23 10 23 1
3 c123 2018/09/01 2019/09/30 10 0 0
4 c123 2018/10/01 2019/10/31 10 0 1
5 d123 2017/02/24 2019/03/23 0 0 0
6 d123 2017/03/24 2019/04/23 0 0 1
The column Last_dup is the result of applying .duplicated (answer). Subtracting Start_Date from End_Date always gives 30 days in this case, for simplicity. My goal is to detect a change in Fee1 and Fee2 over the last 2 months for each Policy_id.
So first, I want to locate the last row of each Policy_id, then go up from there and compare the fees between months to detect a change.
The expected result:
Id Policy_id Start_Date End_Date Fee1 Fee2 Last_dup Changed
0 b123 2019/02/24 2019/03/23 0 23 0 0
1 b123 2019/03/24 2019/04/23 0 23 0 0
2 b123 2019/04/24 2019/05/23 10 23 1 1
3 c123 2018/09/01 2019/09/30 10 0 0 0
4 c123 2018/10/01 2019/10/31 10 0 1 0
5 d123 2017/02/24 2019/03/23 0 0 0 0
6 d123 2017/03/24 2019/04/23 0 0 1 0
I need to start from the row where Last_dup is 1, then go up and compare changes in FeeX. Thanks!
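For reference, the goal described above can be sketched in plain pandas without walking rows manually, assuming rows are already sorted by Start_Date within each Policy_id (column names follow the question's schema):

```python
import pandas as pd

df = pd.DataFrame({
    'Policy_id': ['b123', 'b123', 'b123', 'c123', 'c123', 'd123', 'd123'],
    'Fee1':      [0, 0, 10, 10, 10, 0, 0],
    'Fee2':      [23, 23, 23, 0, 0, 0, 0],
})

# Fees from the previous row of the same policy (NaN for each policy's
# first row, which therefore counts as "no change").
prev = df.groupby('Policy_id')[['Fee1', 'Fee2']].shift()

# A row is "changed" when either fee differs from the previous row
# of the same policy; mask out the NaN first rows.
changed = (df['Fee1'].ne(prev['Fee1']) | df['Fee2'].ne(prev['Fee2'])) \
    & prev['Fee1'].notna()
df['Changed'] = changed.astype(int)
print(df['Changed'].tolist())  # [0, 0, 1, 0, 0, 0, 0]
```

This reproduces the Changed column from the expected result table above.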
Upvotes: 0
Views: 263
Reputation: 611
I think adding a "transaction number column" for each policy will make this easier. Then you can just de-dupe the transactions to see if there are "changed" rows.
Look at the following for example:
import pandas as pd
dat = [['b123', 234, 522], ['b123', 234, 522], ['c123', 34, 23],
['c123', 38, 23], ['c123', 34, 23]]
cols = ['Policy_id', 'Fee1', 'Fee2']
df = pd.DataFrame(dat, columns=cols)
df['transaction_id'] = 1
# Running count per policy: numbers each policy's rows 1, 2, 3, ...
df['transaction_id'] = df.groupby('Policy_id').cumsum()['transaction_id']
# Drop repeated (Policy_id, Fee1, Fee2) combinations, then re-attach
# each surviving row's transaction number (the join is on the index)
df2 = df[cols].drop_duplicates()
final_df = df2.join(df[['transaction_id']])
The output is:
Policy_id Fee1 Fee2 transaction_id
0 b123 234 522 1
2 c123 34 23 1
3 c123 38 23 2
And since b123 only has one transaction after de-duping, you know that nothing changed. Something had to change with c123.
You can get all the changed transactions with final_df[final_df.transaction_id > 1].
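Put together as a runnable whole (repeating the snippet above), the filter looks like this:

```python
import pandas as pd

dat = [['b123', 234, 522], ['b123', 234, 522], ['c123', 34, 23],
       ['c123', 38, 23], ['c123', 34, 23]]
cols = ['Policy_id', 'Fee1', 'Fee2']
df = pd.DataFrame(dat, columns=cols)
df['transaction_id'] = 1
df['transaction_id'] = df.groupby('Policy_id').cumsum()['transaction_id']
df2 = df[cols].drop_duplicates()
final_df = df2.join(df[['transaction_id']])

# Rows numbered above 1 survived de-duplication, i.e. their fee
# combination differs from an earlier transaction of the same policy.
changed = final_df[final_df.transaction_id > 1]
print(changed)  # one row: c123, Fee1=38, Fee2=23, transaction_id=2
```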
As mentioned, you might have to do some other math with the dates, but this should get you most of the way there.
Edit: If you want to only look at the last two months, you can filter the DataFrame prior to running the above.
To do this, make a variable for your cutoff date like so:
from datetime import date, timedelta
filtered_date = date.today() - timedelta(days=60)
Then I would use the pyjanitor package's filter_date method. Just filter on whichever column you want; Start_Date seems most reasonable.
import janitor
final_df.filter_date("Start_Date", start=filtered_date)
Once you run import janitor, final_df will magically have the filter_date method available.
You can see more filter_date examples here.
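If you would rather not add a dependency, the same date filter can be done in plain pandas. A small sketch, using the question's Start_Date format; the fixed cutoff below stands in for date.today() - timedelta(days=60):

```python
import pandas as pd

df = pd.DataFrame({
    'Policy_id': ['b123', 'b123', 'b123'],
    'Start_Date': ['2019/02/24', '2019/03/24', '2019/04/24'],
})

# In practice: cutoff = pd.Timestamp(date.today() - timedelta(days=60))
cutoff = pd.Timestamp('2019-04-01')

# Parse the string dates once, then keep only rows on or after the cutoff
dates = pd.to_datetime(df['Start_Date'], format='%Y/%m/%d')
recent = df[dates >= cutoff]
print(recent['Start_Date'].tolist())  # ['2019/04/24']
```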
Upvotes: 1