Soufiane Sabiri

Reputation: 768

How to detect change in last 2 months starting from specific row in Pandas DataFrame

Let's say we have a dataframe like this:

Id  Policy_id   Start_Date  End_Date    Fee1    Fee2    Last_dup
0   b123        2019/02/24  2019/03/23  0       23      0
1   b123        2019/03/24  2019/04/23  0       23      0
2   b123        2019/04/24  2019/05/23  10      23      1
3   c123        2018/09/01  2019/09/30  10      0       0
4   c123        2018/10/01  2019/10/31  10      0       1
5   d123        2017/02/24  2019/03/23  0       0       0
6   d123        2017/03/24  2019/04/23  0       0       1

The column Last_dup is the result of applying .duplicated() (see the linked answer).

For simplicity, the subtraction of End_Date and Start_Date is always 30 days in this example. My goal is to detect changes in Fee1 and Fee2 over the last 2 months for each Policy_id.

So first, I want to locate the last element of each Policy_id, then go up from that last element, comparing the fees between months to detect any change.

The expected result:

Id  Policy_id   Start_Date  End_Date    Fee1    Fee2    Last_dup    Changed
0   b123        2019/02/24  2019/03/23  0       23      0           0
1   b123        2019/03/24  2019/04/23  0       23      0           0
2   b123        2019/04/24  2019/05/23  10      23      1           1
3   c123        2018/09/01  2019/09/30  10      0       0           0
4   c123        2018/10/01  2019/10/31  10      0       1           0
5   d123        2017/02/24  2019/03/23  0       0       0           0
6   d123        2017/03/24  2019/04/23  0       0       1           0

I need to start from the specific row where Last_dup is 1, then go up and compare the FeeX values for changes. Thanks!
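For illustration, here is a minimal sketch of the kind of comparison I mean, using groupby and shift (fee columns as in the table above; the date columns are omitted since the rows are already in month order):

```python
import pandas as pd

df = pd.DataFrame({
    'Policy_id': ['b123', 'b123', 'b123', 'c123', 'c123', 'd123', 'd123'],
    'Fee1':      [0, 0, 10, 10, 10, 0, 0],
    'Fee2':      [23, 23, 23, 0, 0, 0, 0],
})

# Previous month's fees within each policy (NaN on each policy's first row)
prev = df.groupby('Policy_id')[['Fee1', 'Fee2']].shift()

# A row "changed" if any fee differs from the previous row of the same policy;
# the first row of each policy has no predecessor, so it is never flagged
df['Changed'] = (df[['Fee1', 'Fee2']].ne(prev).any(axis=1)
                 & prev.notna().any(axis=1)).astype(int)
```

This reproduces the Changed column in the expected result above.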

Upvotes: 0

Views: 263

Answers (1)

Sam

Reputation: 611

I think adding a "transaction number column" for each policy will make this easier. Then you can just de-dupe the transactions to see if there are "changed" rows.

Look at the following for example:

import pandas as pd

dat = [['b123', 234, 522], ['b123', 234, 522], ['c123', 34, 23],
       ['c123', 38, 23], ['c123', 34, 23]]

cols = ['Policy_id', 'Fee1', 'Fee2']

df = pd.DataFrame(dat, columns=cols)

# Number each policy's rows 1, 2, 3, ... in order
df['transaction_id'] = df.groupby('Policy_id').cumcount() + 1

df2 = df[cols].drop_duplicates()

final_df = df2.join(df[['transaction_id']])

The output is:

  Policy_id  Fee1  Fee2  transaction_id
0      b123   234   522               1
2      c123    34    23               1
3      c123    38    23               2

And since b123 only has one transaction after de-duping, you know that nothing changed. Something had to change with c123.

You can get all the changed transactions with final_df[final_df.transaction_id > 1].

As mentioned, you might have to do some other math with the dates, but this should get you most of the way there.

Edit: If you want to only look at the last two months, you can filter the DataFrame prior to running the above.

How to do this:

Make a variable for your filtered date like so:

from datetime import date, timedelta
filtered_date = date.today() - timedelta(days=60)

Then I would use the pyjanitor package and its filter_date method. Filter on whichever date column you want; Start_Date seems the most reasonable choice here.

import janitor

final_df = final_df.filter_date("Start_Date", start_date=filtered_date)

Once you run import janitor, final_df will magically have the filter_date method available.

You can see more filter_date examples here.
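If you would rather not add the pyjanitor dependency, a plain-pandas boolean mask does the same filtering (a minimal sketch, assuming Start_Date holds YYYY/MM/DD strings as in the question; the as-of date is fixed here for reproducibility, but you would use date.today() in practice):

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame({
    'Policy_id':  ['b123', 'b123', 'c123'],
    'Start_Date': ['2019/02/24', '2019/04/24', '2018/10/01'],
})

# Cutoff roughly two months before the as-of date
as_of = date(2019, 5, 1)
filtered_date = as_of - timedelta(days=60)

# Parse the date strings and keep only the recent rows
starts = pd.to_datetime(df['Start_Date'], format='%Y/%m/%d')
recent = df[starts >= pd.Timestamp(filtered_date)]
```

With this data only the 2019/04/24 row survives the filter.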

Upvotes: 1
