philMarius
philMarius

Reputation: 684

Pandas: calculating average behaviour of time series data around an event

I have two dataframes with date time information, df_stream is a stream of events and df_events are specific events in time. For example:

Stream (blue) and event (red)

The blue is the stream and the red line is the event. The window is 30min either side of the event and the time granularity is 30s.

How do I produce a dataframe / graph of the average behaviour of the stream for all events for a given time window and time granularity?

The stream dataframe looks like:

                     streamEventId
DateTime                          
2020-08-20 10:39:24       44791313
2020-08-20 10:40:30       44791721
2020-08-20 10:40:54       44694121
2020-08-20 10:41:16       44902962
2020-08-20 10:42:04       44622569

The event dataframe looks like:

            DateTime
0 2020-11-17 09:49:00
1 2020-11-17 10:49:00
2 2020-11-17 11:11:00
3 2020-11-17 11:16:00
4 2020-11-17 12:11:00

I've managed to get graphs of each event with their respective windows printed but am struggling to combine the logic. Also I use iterrows which I am not a fan of.

My current approach:

for i in df_events[["DateTime"]].iterrows():
    date_time = i[1].values[0]
    before = date_time - pd.Timedelta(window)
    after = date_time + pd.Timedelta(window)

    df_stream_temp = df_stream.loc[before:after].copy()
    plt.figure(figsize=(20, 2))
    df_stream_mva = (
        df_stream.streamEventId.groupby(pd.Grouper(freq="30s"))
        .count()
        .loc[before:after]
    )

    y_height = df_stream_mva.max()

    ax = df_stream_mva.plot()
    plt.vlines(df_events.DateTime.to_list(), 0, y_height, color="lightcoral")
    ax.set_ylim([0, y_height]) 

Which gives a nice series of graphs:

Example of graphs of events in stream

And so on...

I would like to calculate and then plot the average of the above with confidence intervals.

Upvotes: 1

Views: 377

Answers (1)

philMarius
philMarius

Reputation: 684

Sat down and worked it out.

Using a similar logic to my iterrows example above, I looped through df_events, created before and after time window limits and used these to filter df_stream grouped by a given time frequency. Then recalculated the index using time delta indexes to get a "distance from event" index. This was then appended to a list that was then combined using pd.concat with an axis of 1. Example below:

window = "30 min"
freq = "30s"

collect_list = []
for i in df_events[["DateTime"]].iterrows():
    date_time = i[1].values[0]
    before = date_time - pd.Timedelta(window)
    after = date_time + pd.Timedelta(window)

    df_stream_window = (
        df_stream.streamEventId.groupby(pd.Grouper(freq=freq))
        .count()
        .loc[before:after]
        .reset_index()
        .rename(columns={"streamEventId": "stream events"})
    )

    df_stream_window = df_stream_window.set_index(
        pd.to_timedelta((df_stream_window.DateTime - pd.to_datetime(date_time)))
        / np.timedelta64("1", "m")
    )

    collect_list.append(df_stream_window)

df_collect = pd.concat(collect_list, axis=1)

This generated the below dataframe:

                    DateTime  stream events            DateTime  \
DateTime                                                          
-30.0    2020-11-17 09:19:00              5 2020-11-17 10:19:00   
-29.5    2020-11-17 09:19:30              5 2020-11-17 10:19:30   
-29.0    2020-11-17 09:20:00              3 2020-11-17 10:20:00   
-28.5    2020-11-17 09:20:30              2 2020-11-17 10:20:30   
-28.0    2020-11-17 09:21:00              3 2020-11-17 10:21:00   

        stream events            DateTime  stream events  \
DateTime                                                     
-30.0                 3 2020-11-17 10:41:00              6   
-29.5                 5 2020-11-17 10:41:30              6   
-29.0                 3 2020-11-17 10:42:00              4   
-28.5                 3 2020-11-17 10:42:30              1   
-28.0                 1 2020-11-17 10:43:00              4   

....

This can then be used in plotting with estimator="mean" in Seaborn:

ax = sns.lineplot(data=df_collect, estimator="mean")
plt.axvline(0, color="lightcoral")
ax.set_xlabel("Time before / after event (min)")
ax.set_ylabel("# Stream Events")

Average stream behaviour around event

Upvotes: 1

Related Questions