Reputation: 789
I have a dataframe df
:-
ID | Date | Event |
---|---|---|
1 | 30-10-2013 | Success |
1 | 08-11-2013 | Success |
1 | 06-12-2013 | Success |
1 | 24-02-2014 | Click |
1 | 24-02-2014 | Form |
1 | 04-03-2014 | |
1 | 15-04-2014 | Success |
1 | 16-04-2014 | Click |
1 | 17-05-2014 | Success |
1 | 21-06-2014 | |
1 | 01-07-2014 | Webpage |
1 | 03-07-2014 | |
2 | 05-07-2014 | Form |
2 | 06-08-2014 | Webpage |
2 | 07-09-2014 | Success |
I want to remove rows which have Event
Success
if the Event
starts with Success
for each ID
(sorted in chronological order) and also remove the events(rows) after the last Success
Event for each ID
.
Expected :-
ID | Date | Event |
---|---|---|
1 | 24-02-2014 | Click |
1 | 24-02-2014 | Form |
1 | 04-03-2014 | |
1 | 15-04-2014 | Success |
1 | 16-04-2014 | Click |
1 | 17-05-2014 | Success |
2 | 05-07-2014 | Form |
2 | 06-08-2014 | Webpage |
2 | 07-09-2014 | Success |
Upvotes: 1
Views: 98
Reputation: 1594
Provided the dataframe is already sorted, this should work:
df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1) & (x.iloc[-1] == "Success"))
result = df.loc[df["keep"]].drop(columns=["keep", "n"])
A bit of explanation:
Updated version (according to the comments):
df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1))
df = df.loc[df["keep"]] # remove leading "Success" rows
df["keep"] = df.groupby("ID")["n"].transform(lambda x: x != x.max() if len(x.unique()) > 1 else True)
df = df.loc[df["keep"]] # remove trailing rows after last "Success"
Upvotes: 2