Benjamin Creusot
Benjamin Creusot

Reputation: 83

Grouping multiple datetime dataframe rows into a single one

I have a dataframe with some date time interval. I am trying to squash them into single events. I have the start and end time alongside the respective duration for every event.

What I have

           start_time            end_time  duration  id
0 2020-01-01 00:00:00 2020-01-01 00:30:00        30   A
1 2020-01-01 00:30:00 2020-01-01 01:00:00        30   B
2 2020-01-01 01:00:00 2020-01-01 01:30:00        30   C
3 2020-01-01 01:30:00 2020-01-01 02:00:00        30   D
4 2020-01-04 05:00:00 2020-01-04 05:30:00        30   E
5 2020-01-04 05:30:00 2020-01-04 06:00:00        30   F
6 2020-01-04 06:00:00 2020-01-04 06:30:00        30   G
7 2020-01-04 06:30:00 2020-01-04 07:00:00        30   H
8 2020-01-04 20:30:00 2020-01-04 21:00:00        30   I

What I'm trying to squash it into

           start_time            end_time  duration  id
0 2020-01-01 00:00:00 2020-01-01 02:00:00       120   A
4 2020-01-04 05:00:00 2020-01-04 07:00:00       120   E
8 2020-01-04 20:30:00 2020-01-04 21:00:00        30   I

I looked for group and merging options in pandas but I didn't manage to to what I want.

Upvotes: 2

Views: 78

Answers (2)

ansev
ansev

Reputation: 30920

Groupby.agg with Series.dt.date

 new_df =( df.groupby(df['end_time'].dt.date,as_index = False)
             .agg({'start_time':'first',
                    'end_time':'last',
                    'duration':'sum',
                    'id':'first'})
         )
print(new_df)

           start_time            end_time  duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00       120  A
1 2020-01-04 05:00:00 2020-01-04 07:00:00       120  E

Upvotes: 2

dkhara
dkhara

Reputation: 715

You can use DataFrame.shift to compare end times with the shifted start times and set any identical pairs to null:

df['flag'] = df['start_time'].shift(-1)
df.loc[df['end_time'] == df['flag'], 'flag'] = pd.NaT
print(df)                                                              
           start_time            end_time  duration id                flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00        30  A                 NaT
1 2020-01-01 00:30:00 2020-01-01 01:00:00        30  B                 NaT
2 2020-01-01 01:00:00 2020-01-01 01:30:00        30  C                 NaT
3 2020-01-01 01:30:00 2020-01-01 02:00:00        30  D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00        30  E                 NaT
5 2020-01-04 05:30:00 2020-01-04 06:00:00        30  F                 NaT
6 2020-01-04 06:00:00 2020-01-04 06:30:00        30  G                 NaT
7 2020-01-04 06:30:00 2020-01-04 07:00:00        30  H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00        30  I                 NaT

Then use DataFrame.bfill to backfill those nulls with the start time that violate your interval condition. You'll need to manually set the null value yourself for the last value.

df['flag'] = df['flag'].bfill().fillna(df['end_time'].iloc[-2])
print(df)                                                         
           start_time            end_time  duration id                flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00        30  A 2020-01-04 05:00:00
1 2020-01-01 00:30:00 2020-01-01 01:00:00        30  B 2020-01-04 05:00:00
2 2020-01-01 01:00:00 2020-01-01 01:30:00        30  C 2020-01-04 05:00:00
3 2020-01-01 01:30:00 2020-01-01 02:00:00        30  D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00        30  E 2020-01-04 20:30:00
5 2020-01-04 05:30:00 2020-01-04 06:00:00        30  F 2020-01-04 20:30:00
6 2020-01-04 06:00:00 2020-01-04 06:30:00        30  G 2020-01-04 20:30:00
7 2020-01-04 06:30:00 2020-01-04 07:00:00        30  H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00        30  I 2020-01-04 07:00:00

Now do as ansev suggested:

df = df.groupby('flag').agg({'start_time':'first','end_time':'last','duration':'sum','id':'first'}).reset_index(drop=True)
print(df)                                                                         
           start_time            end_time  duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00       120  A
1 2020-01-04 20:30:00 2020-01-04 21:00:00        30  I
2 2020-01-04 05:00:00 2020-01-04 07:00:00       120  E

Upvotes: 0

Related Questions