Reputation: 2811
I need to select the last row for each user_id and date, but when the last value in the metric column is 'leave', select the last 2 rows (if they exist). My data:
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ['enter', 'stay', 'leave', 'enter', 'leave', 'enter'],
    "date": ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02'],
})
#result
   user_id  subscription metric        date
0        1             1  enter  2020-01-01
1        1             1   stay  2020-01-01
2        1             2  leave  2020-03-01
3        2             3  enter  2020-01-01
4        2             4  leave  2020-01-01
5        2             5  enter  2020-01-02
Expected output:
   user_id  subscription metric        date
1        1             1   stay  2020-01-01
2        1             2  leave  2020-03-01
3        2             3  enter  2020-01-01  # kept because the last metric is 'leave' within the group [user_id, date]
4        2             4  leave  2020-01-01
5        2             5  enter  2020-01-02
What I've tried: drop_duplicates and groupby. Both give the same result, keeping only the last row:
df.drop_duplicates(['user_id', 'date'], keep='last')
#or
df.groupby(['user_id', 'date']).tail(1)
Upvotes: 1
Views: 144
Reputation: 16683
You can use boolean masking: build three boolean Series, a, b, and c, each True or False per row, then keep the rows where a, b, or c is True by combining them with the or operator |:
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
# fill_value=False keeps c boolean instead of leaving a NaN in the last row
c = a.shift(-1, fill_value=False) & ~b
df = df[a | b | c]
print(a, b, c)
df
# a: groups by the two required keys plus the cumulative count within each group, so every row is evaluated on its own; a is True wherever "metric" is 'leave'.
0 False
1 False
2 True
3 False
4 True
5 False
Name: metric, dtype: bool
# b: if a group has only one row, keep that row.
0 False
1 False
2 True
3 False
4 False
5 True
Name: metric, dtype: bool
# c: use .shift(-1) to flag the row immediately before a 'leave' row. For the condition to hold, that row's group must have more than one row.
0 False
1 True
2 False
3 True
4 False
5 False
Name: metric, dtype: bool
Out[18]:
   user_id  subscription metric        date
1        1             1   stay  2020-01-01
2        1             2  leave  2020-03-01
3        2             3  enter  2020-01-01
4        2             4  leave  2020-01-01
5        2             5  enter  2020-01-02
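Note that a is True on every 'leave' row, not only on a group's final row, and c looks at the next row regardless of which group it belongs to, so the masks rely on the row order of this sample. As a hedged alternative (a sketch, not part of the original answer), one could rank rows from the end of each group and compare against the group's last metric. The names keys, pos_from_end, ends_with_leave and keep are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ["enter", "stay", "leave", "enter", "leave", "enter"],
    "date": ["2020-01-01", "2020-01-01", "2020-03-01",
             "2020-01-01", "2020-01-01", "2020-01-02"],
})

keys = ["user_id", "date"]
grouped = df.groupby(keys)

# 0 for the last row of each group, 1 for the second-to-last, and so on.
pos_from_end = grouped.cumcount(ascending=False)

# True on every row of a group whose final metric is 'leave'.
ends_with_leave = grouped["metric"].transform("last").eq("leave")

# Always keep the last row; keep the second-to-last only if the group ends with 'leave'.
keep = pos_from_end.eq(0) | (pos_from_end.eq(1) & ends_with_leave)
result = df[keep]
print(result)
```

On the sample data this keeps the rows with index 1 through 5, matching the expected output, and it involves no Python-level loop over groups.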
Upvotes: 2
Reputation: 28709
This is one way, though in my opinion it is slow, since it iterates through the groups:
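df["date"] = pd.to_datetime(df["date"])
df = df.assign(metric_is_leave=df.metric.eq("leave"))
pd.concat(
    [
        # keep the last two rows when the group's final metric is 'leave', else only the last row;
        # the [:, :-1] slice drops the helper column
        value.iloc[-2:, :-1] if value.metric_is_leave.iloc[-1] else value.iloc[-1:, :-1]
        for key, value in df.groupby(["user_id", "date"])
    ]
)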
   user_id  subscription metric       date
1        1             1   stay 2020-01-01
2        1             2  leave 2020-03-01
3        2             3  enter 2020-01-01
4        2             4  leave 2020-01-01
5        2             5  enter 2020-01-02
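One edge case the sample data does not cover is a group that contains 'leave' but does not end with it. The sketch below runs the same per-group idea on hypothetical data (the frame edge and the helper last_rows are invented for illustration, not taken from the question) and decides using only the group's final row:

```python
import pandas as pd

def last_rows(group: pd.DataFrame) -> pd.DataFrame:
    """Return the last 2 rows if the group's final metric is 'leave', else the last row."""
    n = 2 if group["metric"].iloc[-1] == "leave" else 1
    return group.tail(n)  # tail(2) on a one-row group simply returns that row

# Hypothetical data: user 1 has a 'leave' in the middle, user 2 ends with 'leave'.
edge = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 2, 3, 4, 5, 6],
    "metric": ["enter", "leave", "enter", "enter", "stay", "leave"],
    "date": ["2020-01-01"] * 6,
})

result = pd.concat(
    [last_rows(g) for _, g in edge.groupby(["user_id", "date"], sort=False)]
).sort_index()
print(result)
```

Here user 1 contributes only its final 'enter' row (index 2), while user 2 contributes its last two rows (index 4 and 5).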
Upvotes: 1