Reputation: 403
I have a timeseries dataframe that has the next structure:
Loc | Event | Start | End |
A | aaa |2018-08-30 00:26:29 |2018-08-30 00:26:59|
A | aaa |2018-08-30 00:26:30 |2018-08-30 00:26:47|
A | aaa |2018-08-30 00:38:05 |2018-08-30 00:39:09|
A | aaa |2018-08-30 00:40:31 |2018-08-30 00:40:41|
A | aaa |2018-08-30 00:57:17 |2018-08-30 00:57:28|
A | aaa |2018-08-30 00:57:36 |2018-08-30 00:58:07|
A | aaa |2018-08-30 00:57:53 |2018-08-30 00:59:23|
A | aaa |2018-08-30 00:57:58 |2018-08-30 00:58:11|
A | aaa |2018-08-30 00:58:00 |2018-08-30 00:58:08|
A | aaa |2018-08-30 01:27:58 |2018-08-30 01:28:58|
My goal is to aggregate group of events into one event based on the interval between events. As you can see from the example above, there are many records that last several seconds and are generated also within few seconds. Such events must be combined into one taking the start time of the first event in group and last time of the last event in such group.
As a result, the next output must be achieved:
Loc | Event | Start | End |
A | aaa |2018-08-30 00:26:29 |2018-08-30 00:26:59|
A | aaa |2018-08-30 00:26:30 |2018-08-30 00:26:47|
A | aaa |2018-08-30 00:38:05 |2018-08-30 00:39:09|
A | aaa |2018-08-30 00:40:31 |2018-08-30 00:40:41|
A | aaa |2018-08-30 00:57:17 |2018-08-30 00:58:08|
A | aaa |2018-08-30 01:27:58 |2018-08-30 01:28:58|
For now I managed to achieve this with many additional variables and for loops and it's quite slow. So any ideas how to achieve this with pandas methods are very welcome.
Upvotes: 0
Views: 38
Reputation: 5730
To solve this you need:
Here You go (some comments in code):
import pandas as pd
from io import StringIO
from dateutil.relativedelta import relativedelta
data = StringIO("""
Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")
# load data into data frame
df = pd.read_csv(data, sep='|')
# convert string to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
def get_result(df_filtered):
df2 = pd.DataFrame({'Loc':df_filtered['Loc'].unique(), 'Event':df_filtered['Event'].unique(), 'Start':df_filtered['Start'].min(), 'End':df_filtered['End'].max()})
start_end = df_filtered['Start'].max()
return df2, start_end
# get datetime group range
min_datetime = df['Start'].min()
max_datetime = min_datetime + relativedelta(minutes=6)
# define variables for while loop
end_end = df['Start'].max()
start_end = min_datetime
results_list = []
while end_end > start_end:
# filter rows by dates
df_filtered = df[(df['Start'] >= min_datetime) & (df['End'] < max_datetime)]
# get result and new start datetime
df2, start_end = get_result(df_filtered)
# get new values for max and min datetime
df_start = df[df['Start'] > start_end]
min_datetime = df_start['Start'].min()
max_datetime = min_datetime + relativedelta(minutes=6)
# append df with results
results_list.append(df2)
df = pd.concat(results_list)
Output:
Loc Event Start End
0 A aaa 2018-08-30 00:26:29 2018-08-30 00:26:59
0 A aaa 2018-08-30 00:38:05 2018-08-30 00:40:41
0 A aaa 2018-08-30 00:57:17 2018-08-30 00:59:23
0 A aaa 2018-08-30 01:27:58 2018-08-30 01:28:58
Upvotes: 1