Scope
Scope

Reputation: 789

GroupBy Remove leading rows and last rows based on a column value

I have a dataframe df :-

ID Date Event
1 30-10-2013 Success
1 08-11-2013 Success
1 06-12-2013 Success
1 24-02-2014 Click
1 24-02-2014 Form
1 04-03-2014 Email
1 15-04-2014 Success
1 16-04-2014 Click
1 17-05-2014 Success
1 21-06-2014 Email
1 01-07-2014 Webpage
1 03-07-2014 Email
2 05-07-2014 Form
2 06-08-2014 Webpage
2 07-09-2014 Success

I want to remove rows which have Event Success if the Event starts with Success for each ID (sorted in chronological order) and also remove the events(rows) after the last Success Event for each ID. Expected :-

ID Date Event
1 24-02-2014 Click
1 24-02-2014 Form
1 04-03-2014 Email
1 15-04-2014 Success
1 16-04-2014 Click
1 17-05-2014 Success
2 05-07-2014 Form
2 06-08-2014 Webpage
2 07-09-2014 Success

Upvotes: 1

Views: 98

Answers (1)

bzu
bzu

Reputation: 1594

Provided the dataframe is already sorted, this should work:

df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1) & (x.iloc[-1] == "Success"))
result = df.loc[df["keep"]].drop(columns=["keep", "n"])

A bit of explanation:

Updated version (according to the comments):

df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1))
df = df.loc[df["keep"]]  # remove leading "Success" rows
df["keep"] = df.groupby("ID")["n"].transform(lambda x: x != x.max() if len(x.unique()) > 1 else True)
df = df.loc[df["keep"]]  # remove trailing rows after last "Success"

Upvotes: 2

Related Questions