Reputation: 83
I have a dataframe with some date time interval. I am trying to squash them into single events. I have the start and end time alongside the respective duration for every event.
What I have
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
What I'm trying to squash it into
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
4 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
I looked for group and merging options in pandas but I didn't manage to to what I want.
Upvotes: 2
Views: 78
Reputation: 30920
Groupby.agg
with Series.dt.date
new_df =( df.groupby(df['end_time'].dt.date,as_index = False)
.agg({'start_time':'first',
'end_time':'last',
'duration':'sum',
'id':'first'})
)
print(new_df)
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
1 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
Upvotes: 2
Reputation: 715
You can use DataFrame.shift
to compare end times with the shifted start times and set any identical pairs to null:
df['flag'] = df['start_time'].shift(-1)
df.loc[df['end_time'] == df['flag'], 'flag'] = pd.NaT
print(df)
start_time end_time duration id flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A NaT
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B NaT
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C NaT
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E NaT
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F NaT
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G NaT
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I NaT
Then use DataFrame.bfill
to backfill those nulls with the start time that violate your interval condition. You'll need to manually set the null value yourself for the last value.
df['flag'] = df['flag'].bfill().fillna(df['end_time'].iloc[-2])
print(df)
start_time end_time duration id flag
0 2020-01-01 00:00:00 2020-01-01 00:30:00 30 A 2020-01-04 05:00:00
1 2020-01-01 00:30:00 2020-01-01 01:00:00 30 B 2020-01-04 05:00:00
2 2020-01-01 01:00:00 2020-01-01 01:30:00 30 C 2020-01-04 05:00:00
3 2020-01-01 01:30:00 2020-01-01 02:00:00 30 D 2020-01-04 05:00:00
4 2020-01-04 05:00:00 2020-01-04 05:30:00 30 E 2020-01-04 20:30:00
5 2020-01-04 05:30:00 2020-01-04 06:00:00 30 F 2020-01-04 20:30:00
6 2020-01-04 06:00:00 2020-01-04 06:30:00 30 G 2020-01-04 20:30:00
7 2020-01-04 06:30:00 2020-01-04 07:00:00 30 H 2020-01-04 20:30:00
8 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I 2020-01-04 07:00:00
Now do as ansev suggested:
df = df.groupby('flag').agg({'start_time':'first','end_time':'last','duration':'sum','id':'first'}).reset_index(drop=True)
print(df)
start_time end_time duration id
0 2020-01-01 00:00:00 2020-01-01 02:00:00 120 A
1 2020-01-04 20:30:00 2020-01-04 21:00:00 30 I
2 2020-01-04 05:00:00 2020-01-04 07:00:00 120 E
Upvotes: 0