Reputation: 1416
I have a dataframe where every row is a product transaction. I want to sort by date (ascending), group by product id and find every product id where there's been transaction type X and no proceeding event type Y or Z. The sorting and grouping are straightforward, but how to accomplish that last part?
idx id evt_type date price
57773 10 listed 2021-11-19 14:51:30.964379 250.00
59013 10 sold 2021-11-21 15:00:48.439708 250.00
60111 10 listed 2021-11-25 00:18:08.863694 255.00
60806 10 sold 2021-11-27 10:24:31.431779 255.00
61445 10 refund_req 2021-11-27 11:40:39.033327 NaN
61455 10 refunded 2021-11-27 11:49:39.033327 255.00
60808 10 listed 2021-11-27 12:30:05.368266 280.00
62177 10 sold 2021-11-30 13:21:07.421889 280.00
61887 10 refund_req 2021-11-30 13:21:07.421889 NaN
63742 10 listed 2021-12-04 00:27:00.393276 290.00
64222 10 sold 2021-12-30 13:21:07.421889 290.00
In the above example, I have filter out only product id 10 and I only care about the fact that there was a refund_req event without a following refunded event. So groupby 'id' and return ids where there's a refund_req event_type without a following refund event_type.
Current code:
df.sort_values('date').groupby('id')
But I'm not sure about the aggregation type that delivers the desired ids.
I was thinking about iterating over all known product ids and attempting from there...
EDIT1:
This gets me partly there
df.sort_values('date').groupby(['id', 'evt_type']).agg({'created': 'max'})
Which outputs:
id evt_type date
10 listed 2021-11-15 20:47:51.364352
sold 2022-01-10 15:07:42.048301
refund_req 2021-11-30 15:51:41.443962
refunded 2021-22-27 00:55:55.05162
TLDR I want all "id"s where there is a refund_req "evt_type" not followed by (higher "date") a refunded "evt_type"
Upvotes: 0
Views: 233
Reputation: 1352
How about:
# Calculate last date of refund_req and refunded event types for each id
last_refund_reqs = (
df
.loc[df['evt_type'] == 'refund_req']
.groupby('id')
[['date']].max()
)
last_refunded = (
df
.loc[df['evt_type'] == 'refunded']
.groupby('id')
[['date']].max()
)
# Merge and compare last dates - see those where refund_req
# comes after the last refunded or where there are no refunded events
merged = last_refund_reqs.join(
last_refunded, lsuffix='_refund_req', rsuffix='_refunded', how='left'
)
merged.loc[
(merged['date_refunded'] < merged['date_refund_req']) |
merged['date_refunded'].isna()
]
Upvotes: 1