Terry

Reputation: 2811

Select the last 2 values in a groupby with a condition

I need to select the last row for each user_id and date, but when the last value in the metric column is 'leave', I need the last 2 rows (if they exist). My data:

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ['enter', 'stay', 'leave', 'enter', 'leave', 'enter'],
    "date": ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02'],
})
# result
    user_id subscription    metric  date
0   1       1               enter   2020-01-01
1   1       1               stay    2020-01-01
2   1       2               leave   2020-03-01
3   2       3               enter   2020-01-01
4   2       4               leave   2020-01-01
5   2       5               enter   2020-01-02

Expected output:

    user_id subscription    metric  date
1   1       1               stay    2020-01-01
2   1       2               leave   2020-03-01
3   2       3               enter   2020-01-01 # kept because the last metric is 'leave' within group [user_id, date]
4   2       4               leave   2020-01-01
5   2       5               enter   2020-01-02

What I've tried: drop_duplicates and groupby. Both give the same result, keeping only the last row of each group:

df.drop_duplicates(['user_id', 'date'], keep='last')
#or
df.groupby(['user_id', 'date']).tail(1)

Upvotes: 1

Views: 144

Answers (2)

David Erickson

Reputation: 16683

You can use boolean masking: build three boolean conditions, a, b, and c, then keep the rows where at least one of them is True by combining them with the or operator |:

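# a: True where metric is 'leave' (the cumcount key makes each row its own group)
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
# b: True where the (user_id, date) group has exactly one row
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
# c: True where the next row is 'leave' and the row's own group has more than one row
c = a.shift(-1, fill_value=False) & ~b
df = df[a | b | c]
print(a, b, c)
df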

#a groups by the two required keys plus the cumulative count within each group. That puts every row in its own group, so the mask is True for each row whose "metric" is 'leave' (in this data, always the last row of its group).
0    False
1    False
2     True
3    False
4     True
5    False
Name: metric, dtype: bool

#b if a group has a count of one, then you want to keep its only row.
0    False
1    False
2     True
3    False
4    False
5     True
Name: metric, dtype: bool

#c uses .shift(-1) to flag the row immediately before a 'leave' row. For the condition to be satisfied, the count for that row's group must be > 1.
0    False
1    True
2    False
3    True
4    False
5    False
Name: metric, dtype: bool

Out[18]: 
   user_id  subscription metric        date
1        1             1   stay  2020-01-01
2        1             2  leave  2020-03-01
3        2             3  enter  2020-01-01
4        2             4  leave  2020-01-01
5        2             5  enter  2020-01-02
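
Note that the masks above look at the next row of the frame, not the next row of the same group, so they rely on how the sample data happens to be laid out. As a minimal sketch of a group-aware variant (not the answer's method; the names `pos_from_end`, `last_is_leave`, and `keep` are made up for illustration), each row's position can be counted from the end of its own group:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ["enter", "stay", "leave", "enter", "leave", "enter"],
    "date": ["2020-01-01", "2020-01-01", "2020-03-01",
             "2020-01-01", "2020-01-01", "2020-01-02"],
})

groups = df.groupby(["user_id", "date"])

# Position of each row counted from the end of its group (0 = last row).
pos_from_end = groups.cumcount(ascending=False)

# True for every row of a group whose last metric is 'leave'.
last_is_leave = groups["metric"].transform("last").eq("leave")

# Always keep the last row; keep the second-to-last row only
# when the group ends in 'leave'.
keep = pos_from_end.eq(0) | (pos_from_end.eq(1) & last_is_leave)
result = df[keep]
print(result)
```

On the question's data this keeps rows 1 to 5. It assumes the rows are already in the intended order within each (user_id, date) group.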

Upvotes: 2

sammywemmy

Reputation: 28709

This is one way, though in my opinion a slow one, since it iterates over the groups in Python:

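df["date"] = pd.to_datetime(df["date"])

# Helper column: True where the metric is 'leave'
df = df.assign(metric_is_leave=df.metric.eq("leave"))

# Keep the last two rows when the group's last metric is 'leave', otherwise only the last row;
# the `:-1` column slice drops the helper column again
pd.concat(
    [
        value.iloc[-2:, :-1] if value.metric_is_leave.iloc[-1] else value.iloc[-1:, :-1]
        for key, value in df.groupby(["user_id", "date"])
    ]
)
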
  user_id   subscription    metric  date
1      1        1           stay    2020-01-01
2      1        2          leave    2020-03-01
3      2        3          enter    2020-01-01
4      2        4          leave    2020-01-01
5      2        5          enter    2020-01-02
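
The question's condition is on the last metric of each group, so a group that contains 'leave' earlier but ends on another value should contribute only one row. A minimal sketch of the same loop written against the last row, using a hypothetical helper `last_rows` and a made-up edge-case frame that is not part of the original data:

```python
import pandas as pd


def last_rows(frame, keys=("user_id", "date")):
    """Keep the last row per group, or the last two when the group ends in 'leave'."""
    pieces = []
    for _, group in frame.groupby(list(keys), sort=False):
        # Decide on the group's final metric only, not on any earlier row.
        n = 2 if group["metric"].iloc[-1] == "leave" else 1
        pieces.append(group.tail(n))
    return pd.concat(pieces)


# Hypothetical edge case: 'leave' appears in the group but is not its last row.
edge = pd.DataFrame({
    "user_id": [7, 7, 7],
    "subscription": [1, 2, 3],
    "metric": ["enter", "leave", "enter"],
    "date": ["2020-05-01"] * 3,
})
print(last_rows(edge))
```

Here only the final 'enter' row is returned, because the group does not end in 'leave'.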

Upvotes: 1
