GroupBy Remove leading rows and last rows based on a column value

Question

I have a dataframe df :-

ID	Date	Event
1	30-10-2013	Success
1	08-11-2013	Success
1	06-12-2013	Success
1	24-02-2014	Click
1	24-02-2014	Form
1	04-03-2014	Email
1	15-04-2014	Success
1	16-04-2014	Click
1	17-05-2014	Success
1	21-06-2014	Email
1	01-07-2014	Webpage
1	03-07-2014	Email
2	05-07-2014	Form
2	06-08-2014	Webpage
2	07-09-2014	Success

I want to remove rows which have Event Success if the Event starts with Success for each ID (sorted in chronological order) and also remove the events(rows) after the last Success Event for each ID. Expected :-

ID	Date	Event
1	24-02-2014	Click
1	24-02-2014	Form
1	04-03-2014	Email
1	15-04-2014	Success
1	16-04-2014	Click
1	17-05-2014	Success
2	05-07-2014	Form
2	06-08-2014	Webpage
2	07-09-2014	Success

bzu · Accepted Answer

Provided the dataframe is already sorted, this should work:

df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1) & (x.iloc[-1] == "Success"))
result = df.loc[df["keep"]].drop(columns=["keep", "n"])

A bit of explanation:

"n" numbers a group of rows containing one "Success", using this trick: https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
"keep" creates a filter based on row group containing more than 1 row (not only a single "Success") and the last row being "Success"

Updated version (according to the comments):

df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1))
df = df.loc[df["keep"]]  # remove leading "Success" rows
df["keep"] = df.groupby("ID")["n"].transform(lambda x: x != x.max() if len(x.unique()) > 1 else True)
df = df.loc[df["keep"]]  # remove trailing rows after last "Success"

GroupBy Remove leading rows and last rows based on a column value

Answers (1)

Related Questions