fparaggio
fparaggio

Reputation: 453

Pandas event dataframe (start stop) to regular time series

I am dealing with a dataset of events. Every row has a start time, an end time and a value.

It looks like this one:

df = pd.DataFrame({'start': ['2015-01-05 12:21:00', '2015-01-05 18:01:23', '2015-01-05 23:11:01'], 'end': ['2015-01-05 13:18:45', '2015-01-05 21:03:51', '2015-01-05 12:08:11'], 'value': [3, 4, 5]})

    end start   value
0   2015-01-07 11:18:45 2015-01-07 11:35:00 3
1   2015-01-07 12:08:11 2015-01-07 23:11:01 5
2   2015-01-07 17:03:51 2015-01-07 18:01:23 4

I need to resample it in a 10 minutes regular time series like the following:

    time values
1   2015-01-07 11:01:00  NULL
2   2015-01-07 11:11:00  3
3   2015-01-07 11:21:00  3
4   2015-01-07 11:31:00  3
5   2015-01-07 11:41:00  NULL
...
..  2015-01-07 12:11:00  5
...
..  2015-01-07 17:01:00  5,4
...

What is the most efficient way to deal with this resampling? I also highlighted the possibility of multiple events with the last value "5,4", what could be the best way to deal with it?

Thanks!

Upvotes: 1

Views: 1778

Answers (1)

ALollz
ALollz

Reputation: 59549

Well if you eventually need to merge with other DataFrames it will likely make your life easier to standardize the time period always assuming an interval goes from :00-:10 for instance.

In this case one possibility is to transform the time span dataframe into one where each period is listed out explicitly, and then you can groupby the time period and then apply(list) to get the output you want.

df['temp_id'] = range(len(df))
df_time = (df[['temp_id', 'start', 'end']].set_index('temp_id').stack()
          .reset_index(level=-1, drop=True).rename('time').to_frame())

df_time = (df_time.groupby('temp_id').apply(lambda x: x.set_index('time')
            .resample('10T').asfreq()).reset_index())
df_time = df_time.merge(df[['value', 'temp_id']]).groupby('time')['value'].apply(list).to_frame()

If you then want to include the indices that have nothing you can just reindex this dataframe.

df_time.reindex(pd.date_range(start=df_time.index.min(),
     end=df_time.index.max(), freq='10T'))

#                    value
#2015-01-07 11:10:00   [3]
#2015-01-07 11:20:00   [3]
#2015-01-07 11:30:00   [3]
#2015-01-07 11:40:00   NaN
#2015-01-07 11:50:00   NaN
#...
#2015-01-07 17:40:00  [4, 5]
#2015-01-07 17:50:00  [4, 5]
#2015-01-07 18:00:00  [4, 5]
#2015-01-07 18:10:00     [4]
#...

Upvotes: 1

Related Questions