Reputation: 347
I have the following dataframe with ("ID", "Month" and "status"). Status is regarding "Churn"= 1 and 'Not Churn" = 2. I want to delete all rows for ID's who are already churned except the first appearance. For example:
Dataframe
ID Month Status
2310 201708 2
2310 201709 2
2310 201710 1
2310 201711 1
2310 201712 1
2310 201801 1
2311 201704 2
2311 201705 2
2311 201706 2
2311 201707 2
2311 201708 2
2311 201709 2
2311 201710 1
2311 201711 1
2311 201712 1
2312 201708 2
2312 201709 2
2312 201710 2
2312 201711 1
2312 201712 1
2312 201801 1
After deleting I should have the following dataframe
ID Month Status
2310 201708 2
2310 201709 2
2310 201710 1
2311 201704 2
2311 201705 2
2311 201706 2
2311 201707 2
2311 201708 2
2311 201709 2
2311 201710 1
2312 201708 2
2312 201709 2
2312 201710 2
2312 201711 1
I tried the following- first to find min date for each customer ID and status=1
df1=df[df.Status==1].groupby('ID')['Month'].min()
then I have to delete all rows for each ID with status 1 greater than min value for MOnth.
Upvotes: 1
Views: 227
Reputation: 715
If you're familiar with DataFrame.idxmin
to get the indices of the elements of the most recent month, you could try:
# find minimum months
min_df = df.groupby(['ID','Status'])['Month'].idxmin().reset_index(drop=True)
# find indices of status 2 rows
df2 = df[df['Status'].eq(2)].index.to_series()
# append indices together
idx_df = min_df.append(df2).drop_duplicates()
# filter indices
df_new = df.iloc[idx_df].sort_index()
print(df_new)
ID Month Status
0 2310 201708 2
1 2310 201709 2
2 2310 201710 1
6 2311 201704 2
7 2311 201705 2
8 2311 201706 2
9 2311 201707 2
10 2311 201708 2
11 2311 201709 2
12 2311 201710 1
15 2312 201708 2
16 2312 201709 2
17 2312 201710 2
18 2312 201711 1
Update
Or, you may think about using GroupBy.apply
:
df1 = df.groupby(['ID','Status']).apply(lambda x: (x['Status'].eq(2)) | (x['Month'].eq(x['Month'].min())))
df1 = df1.reset_index(level=['ID','Status'], drop=True)
df_new = df.loc[df1]
print(df_new)
ID Month Status
0 2310 201708 2
1 2310 201709 2
2 2310 201710 1
6 2311 201704 2
7 2311 201705 2
8 2311 201706 2
9 2311 201707 2
10 2311 201708 2
11 2311 201709 2
12 2311 201710 1
15 2312 201708 2
16 2312 201709 2
17 2312 201710 2
18 2312 201711 1
Update 2
However, if you're simply wanting to remove all of the status 1 rows that come after the row with the earliest month, then you could simply sort_values
and transform
:
df = df.sort_values(by=['ID','Month']).reset_index(drop=True)
df = df[df.groupby('ID')['Status'].transform(lambda x: ~(x.duplicated() & (x == 1)))]
print(df)
ID Month Status
0 2310 201708 2
1 2310 201709 2
2 2310 201710 1
6 2311 201704 2
7 2311 201705 2
8 2311 201706 2
9 2311 201707 2
10 2311 201708 2
11 2311 201709 2
12 2311 201710 1
15 2312 201708 2
16 2312 201709 2
17 2312 201710 2
18 2312 201711 1
Upvotes: 1
Reputation: 153510
IIUC, you can use groupby
with transform
with boolean logic and then boolean indexing:
df[df.groupby('ID')['Status'].transform(lambda x: ~(x.duplicated() & (x == 1)))]
Output:
ID Month Status
0 2310 201708 2
1 2310 201709 2
2 2310 201710 1
6 2311 201704 2
7 2311 201705 2
8 2311 201706 2
9 2311 201707 2
10 2311 201708 2
11 2311 201709 2
12 2311 201710 1
15 2312 201708 2
16 2312 201709 2
17 2312 201710 2
18 2312 201711 1
Upvotes: 1