Reputation: 97
I converted below data into DataFrame which looks as below
data = [
{"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "a", "value": 1},
{"start_ts": "2018-05-14 11:54:33", "end_ts": "2018-05-14 12:54:33", "product": "a", "value": 1},
{"start_ts": "2018-05-14 13:54:33", "end_ts": "2018-05-14 14:54:33", "product": "a", "value": 1},
{"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "b", "value": 1}
]
product start_ts end_ts value
0 a 2018-05-14 10:54:33 2018-05-14 11:54:33 1
1 a 2018-05-14 11:54:33 2018-05-14 12:54:33 1
2 a 2018-05-14 13:54:33 2018-05-14 14:54:33 1
3 b 2018-05-14 10:54:33 2018-05-14 11:54:33 1
I'm trying to bucketize above DF rows into one row by finding contiguous timestamp fields (where start_ts is equal to the prior row's end_ts for a product) for a product and sum the value column like below.
Expected:
product start_ts end_ts value
0 a 2018-05-14 10:54:33 2018-05-14 12:54:33 2
1 a 2018-05-14 13:54:33 2018-05-14 14:54:33 1
2 b 2018-05-14 10:54:33 2018-05-14 11:54:33 1
I'm unable to get the expected above value using the code below
def merge_dates(grp):
date_groups = (grp['start_ts'] != grp['end_ts'].shift())
return grp.groupby(date_groups).agg({'start_ts': 'first', 'end_ts': 'last'})
df.groupby(["product"]).apply(merge_dates)
Need some advice. Any help would be greatly appreciated!
Thanks
Upvotes: 1
Views: 156
Reputation: 5215
I believe this will work:
df.groupby(['product', (df.start_ts != df.end_ts.shift()).cumsum()], \
as_index=False).agg({'start_ts':min, 'end_ts':max, 'value':sum})
# product end_ts start_ts value
# 0 a 2018-05-14 12:54:33 2018-05-14 10:54:33 2
# 1 a 2018-05-14 14:54:33 2018-05-14 13:54:33 1
# 2 b 2018-05-14 11:54:33 2018-05-14 10:54:33 1
This approach groups by product and by a cumsum
of the boolean series created by df.start_ts != df.end_ts.shift()
; the boolean series serves as a counter that increases by one each time df.start_ts
does not equal the previous row's end_ts
(i.e. df.end_ts.shift()
), thus indicating when a new group should start.
Upvotes: 3