Sunil
Sunil

Reputation: 97

Combine Pandas DataFrame Rows by Timestamp and Column

I converted below data into DataFrame which looks as below

data = [
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "a", "value": 1},
       {"start_ts": "2018-05-14 11:54:33", "end_ts": "2018-05-14 12:54:33", "product": "a", "value": 1}, 
       {"start_ts": "2018-05-14 13:54:33", "end_ts": "2018-05-14 14:54:33", "product": "a", "value": 1},          
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "b", "value": 1}
   ]

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 11:54:33 1
1   a       2018-05-14 11:54:33 2018-05-14 12:54:33 1
2   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
3   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm trying to bucketize above DF rows into one row by finding contiguous timestamp fields (where start_ts is equal to the prior row's end_ts for a product) for a product and sum the value column like below.

Expected:

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 12:54:33 2
1   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
2   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm unable to get the expected above value using the code below

def merge_dates(grp):
    date_groups = (grp['start_ts'] != grp['end_ts'].shift())
    return grp.groupby(date_groups).agg({'start_ts': 'first', 'end_ts': 'last'})   

df.groupby(["product"]).apply(merge_dates)

Need some advice. Any help would be greatly appreciated!

Thanks

Upvotes: 1

Views: 156

Answers (1)

cmaher
cmaher

Reputation: 5215

I believe this will work:

df.groupby(['product', (df.start_ts != df.end_ts.shift()).cumsum()], \
           as_index=False).agg({'start_ts':min, 'end_ts':max, 'value':sum})

#   product              end_ts            start_ts  value
# 0       a 2018-05-14 12:54:33 2018-05-14 10:54:33      2
# 1       a 2018-05-14 14:54:33 2018-05-14 13:54:33      1
# 2       b 2018-05-14 11:54:33 2018-05-14 10:54:33      1

This approach groups by product and by a cumsum of the boolean series created by df.start_ts != df.end_ts.shift(); the boolean series serves as a counter that increases by one each time df.start_ts does not equal the previous row's end_ts (i.e. df.end_ts.shift()), thus indicating when a new group should start.

Upvotes: 3

Related Questions