Verbal_Kint
Verbal_Kint

Reputation: 1416

pandas filter on sorted row sequence

I have a dataframe where every row is a product transaction. I want to sort by date (ascending), group by product id and find every product id where there's been transaction type X and no proceeding event type Y or Z. The sorting and grouping are straightforward, but how to accomplish that last part?

 idx    id  evt_type   date                      price
57773   10  listed  2021-11-19 14:51:30.964379  250.00
59013   10  sold    2021-11-21 15:00:48.439708  250.00
60111   10  listed  2021-11-25 00:18:08.863694  255.00
60806   10  sold    2021-11-27 10:24:31.431779  255.00
61445   10  refund_req  2021-11-27 11:40:39.033327  NaN
61455   10  refunded    2021-11-27 11:49:39.033327  255.00
60808   10  listed  2021-11-27 12:30:05.368266  280.00
62177   10  sold    2021-11-30 13:21:07.421889  280.00
61887   10  refund_req  2021-11-30 13:21:07.421889  NaN
63742   10  listed  2021-12-04 00:27:00.393276  290.00
64222   10  sold    2021-12-30 13:21:07.421889  290.00

In the above example, I have filter out only product id 10 and I only care about the fact that there was a refund_req event without a following refunded event. So groupby 'id' and return ids where there's a refund_req event_type without a following refund event_type.

Current code:

df.sort_values('date').groupby('id')

But I'm not sure about the aggregation type that delivers the desired ids.

I was thinking about iterating over all known product ids and attempting from there...

EDIT1:

This gets me partly there

df.sort_values('date').groupby(['id', 'evt_type']).agg({'created': 'max'})

Which outputs:

id   evt_type           date
10  listed      2021-11-15 20:47:51.364352
    sold        2022-01-10 15:07:42.048301
    refund_req  2021-11-30 15:51:41.443962
    refunded    2021-22-27 00:55:55.05162

TLDR I want all "id"s where there is a refund_req "evt_type" not followed by (higher "date") a refunded "evt_type"

Upvotes: 0

Views: 233

Answers (1)

ozacha
ozacha

Reputation: 1352

How about:

# Calculate last date of refund_req and refunded event types for each id
last_refund_reqs = (
    df
    .loc[df['evt_type'] == 'refund_req']
    .groupby('id')
    [['date']].max()
)
last_refunded = (
    df
    .loc[df['evt_type'] == 'refunded']
    .groupby('id')
    [['date']].max()
)

# Merge and compare last dates - see those where refund_req 
# comes after the last refunded or where there are no refunded events
merged = last_refund_reqs.join(
    last_refunded, lsuffix='_refund_req', rsuffix='_refunded', how='left'
)
merged.loc[
    (merged['date_refunded'] < merged['date_refund_req']) | 
    merged['date_refunded'].isna()
]

Upvotes: 1

Related Questions