Aggregating events. Reducing df size. In search of more elegant way

Question

I have a timeseries dataframe that has the next structure:

Loc | Event |       Start        |         End       |  
 A  |  aaa  |2018-08-30 00:26:29 |2018-08-30 00:26:59|  
 A  |  aaa  |2018-08-30 00:26:30 |2018-08-30 00:26:47|  
 A  |  aaa  |2018-08-30 00:38:05 |2018-08-30 00:39:09|  
 A  |  aaa  |2018-08-30 00:40:31 |2018-08-30 00:40:41|  
 A  |  aaa  |2018-08-30 00:57:17 |2018-08-30 00:57:28|
 A  |  aaa  |2018-08-30 00:57:36 |2018-08-30 00:58:07|  
 A  |  aaa  |2018-08-30 00:57:53 |2018-08-30 00:59:23|  
 A  |  aaa  |2018-08-30 00:57:58 |2018-08-30 00:58:11|  
 A  |  aaa  |2018-08-30 00:58:00 |2018-08-30 00:58:08|  
 A  |  aaa  |2018-08-30 01:27:58 |2018-08-30 01:28:58|

My goal is to aggregate group of events into one event based on the interval between events. As you can see from the example above, there are many records that last several seconds and are generated also within few seconds. Such events must be combined into one taking the start time of the first event in group and last time of the last event in such group.

As a result, the next output must be achieved:

Loc | Event |       Start        |         End       |  
 A  |  aaa  |2018-08-30 00:26:29 |2018-08-30 00:26:59|  
 A  |  aaa  |2018-08-30 00:26:30 |2018-08-30 00:26:47|  
 A  |  aaa  |2018-08-30 00:38:05 |2018-08-30 00:39:09|  
 A  |  aaa  |2018-08-30 00:40:31 |2018-08-30 00:40:41|  
 A  |  aaa  |2018-08-30 00:57:17 |2018-08-30 00:58:08|
 A  |  aaa  |2018-08-30 01:27:58 |2018-08-30 01:28:58|

For now I managed to achieve this with many additional variables and for loops and it's quite slow. So any ideas how to achieve this with pandas methods are very welcome.

Zaraki Kenpachi · Accepted Answer

To solve this you need:

create while loop to go thru each datetime range
slice you df with new datetime range
get results from sliced df
calculate new out variable for while loop
collect df with results into list and then concat them

Here You go (some comments in code):

import pandas as pd
from io import StringIO
from dateutil.relativedelta import relativedelta


data = StringIO("""
Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")

# load data into data frame
df = pd.read_csv(data, sep='|')

# convert string to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])


def get_result(df_filtered):
    df2 = pd.DataFrame({'Loc':df_filtered['Loc'].unique(), 'Event':df_filtered['Event'].unique(), 'Start':df_filtered['Start'].min(), 'End':df_filtered['End'].max()})
    start_end = df_filtered['Start'].max()

    return df2, start_end


# get datetime group range
min_datetime = df['Start'].min()
max_datetime = min_datetime + relativedelta(minutes=6)
# define variables for while loop
end_end = df['Start'].max()
start_end = min_datetime

results_list = []
while end_end > start_end:
    # filter rows by dates
    df_filtered = df[(df['Start'] >= min_datetime) & (df['End'] < max_datetime)]
    # get result and new start datetime
    df2, start_end = get_result(df_filtered)
    # get new values for max and min datetime
    df_start = df[df['Start'] > start_end]
    min_datetime = df_start['Start'].min()
    max_datetime = min_datetime + relativedelta(minutes=6)
    # append df with results
    results_list.append(df2)


df = pd.concat(results_list)

Output:

  Loc Event               Start                 End
0   A   aaa 2018-08-30 00:26:29 2018-08-30 00:26:59
0   A   aaa 2018-08-30 00:38:05 2018-08-30 00:40:41
0   A   aaa 2018-08-30 00:57:17 2018-08-30 00:59:23
0   A   aaa 2018-08-30 01:27:58 2018-08-30 01:28:58

Aggregating events. Reducing df size. In search of more elegant way

Answers (1)

Related Questions