Reputation: 3071
I have a pandas DataFrame like this:
2011-5-5 12:43 noEvent CarA otherColumns...
2011-5-5 12:45 noEvent CarA ...
2011-5-5 12:49 EVENT CarA ...
2011-5-5 12:51 noEvent CarA ...
(no data - jumps in time)
2011-5-6 12:52 EVENT CarA ...
2011-5-6 12:59 noEvent CarA ...
2011-5-6 13:00 noEvent CarA ...
2011-5-5 12:43 noEvent CarB ...
2011-5-5 12:45 noEvent CarB ...
2011-5-5 12:49 noEvent CarB ...
2011-5-5 12:51 noEvent CarB ...
(no data - jumps in time)
2011-5-6 12:52 noEvent CarB ...
2011-5-6 12:52 EVENT CarB ...
2011-5-6 13:00 noEvent CarB ...
Explanation:
I need to perform some calculations on the rows within +/- 2 minutes of each event, separately for each car.
I am confused about how to do this... How can I filter this DataFrame to get those windows?
The desired result would look like this
-2min
2011-5-5 12:49 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarB ...
+2min
Some info:
I don't know where to start.
Upvotes: 0
Views: 383
Reputation: 107767
Consider a cross join merge between an events-only filtered DataFrame and the full DataFrame, then subset the records that fall within +/- 2 minutes of an event for the same car:
Data frame setup (example posted data)
import pandas as pd
import datetime
df = pd.DataFrame({'Date': ['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
'5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
'5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
'5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
'5/6/2011 12:52', '5/6/2011 13:00'],
'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent','EVENT',
'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
'noEvent', 'noEvent', 'EVENT', 'noEvent'],
'Car': ['CarA', 'CarA', 'CarA', 'CarA', 'CarA',
'CarA', 'CarA', 'CarB', 'CarB','CarB',
'CarB', 'CarB', 'CarB', 'CarB']})
df['Date'] = pd.to_datetime(df['Date'])
# Car Date Event
# 0 CarA 2011-05-05 12:43:00 noEvent
# 1 CarA 2011-05-05 12:45:00 noEvent
# 2 CarA 2011-05-05 12:49:00 EVENT
# 3 CarA 2011-05-05 12:51:00 noEvent
# 4 CarA 2011-05-06 12:52:00 EVENT
# 5 CarA 2011-05-06 12:59:00 noEvent
# 6 CarA 2011-05-06 13:00:00 noEvent
# 7 CarB 2011-05-05 12:43:00 noEvent
# 8 CarB 2011-05-05 12:45:00 noEvent
# 9 CarB 2011-05-05 12:49:00 noEvent
# 10 CarB 2011-05-05 12:51:00 noEvent
# 11 CarB 2011-05-06 12:52:00 noEvent
# 12 CarB 2011-05-06 12:52:00 EVENT
# 13 CarB 2011-05-06 13:00:00 noEvent
Cross join (returns the full M x N combination set between the two DataFrames)
df['key'] = 1
# EVENTS DF
eventsdf = df[df['Event']=='EVENT']
# CROSS JOIN DF
crossdf = pd.merge(df, eventsdf, on='key')
crossdf = crossdf[((crossdf['Date_x'] <= crossdf['Date_y']
+ datetime.timedelta(minutes=2)) &
(crossdf['Date_x'] >= crossdf['Date_y']
- datetime.timedelta(minutes=2))) &
(crossdf['Car_x'] == crossdf['Car_y'])].sort_values('Date_x')
finaldf = crossdf[['Car_x', 'Date_x', 'Event_x']].drop_duplicates().sort_values('Car_x')
finaldf.columns = ['Car', 'Date', 'Event']
# Car Date Event
# 6 CarA 2011-05-05 12:49:00 EVENT
# 9 CarA 2011-05-05 12:51:00 noEvent
# 13 CarA 2011-05-06 12:52:00 EVENT
# 35 CarB 2011-05-06 12:52:00 noEvent
# 38 CarB 2011-05-06 12:52:00 EVENT
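On pandas 1.2+, the same cross join can be written without the helper key column by passing how='cross' to merge. A minimal sketch of that variant using the posted data (the _evt suffix is my own naming choice, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
                            '5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
                            '5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
                            '5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
                            '5/6/2011 12:52', '5/6/2011 13:00']),
    'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent', 'EVENT',
              'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
              'noEvent', 'noEvent', 'EVENT', 'noEvent'],
    'Car': ['CarA'] * 7 + ['CarB'] * 7})

# cross join the full frame against the events-only frame;
# left columns keep their names, right columns get the _evt suffix
events = df[df['Event'] == 'EVENT']
cross = df.merge(events, how='cross', suffixes=('', '_evt'))

# keep rows of the same car within +/- 2 minutes of an event
window = pd.Timedelta(minutes=2)
near = cross[(cross['Car'] == cross['Car_evt']) &
             (cross['Date'] - cross['Date_evt']).abs().le(window)]
result = near[['Car', 'Date', 'Event']].drop_duplicates().sort_values(['Car', 'Date'])
```

This yields the same five rows as the key-column version above, without the need to add and later ignore the artificial key.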
Upvotes: 1
Reputation: 97331
Group by the Car column first and process each group as follows:
Create the test data first:
import pandas as pd
import numpy as np
np.random.seed(1)
idx = pd.date_range("2016-03-01 10:00:00", "2016-03-01 20:00:00", freq="S")
idx = idx[np.random.randint(0, len(idx), 10000)].sort_values()
evt = np.array(["no event", "event"])[(np.random.rand(len(idx)) < 0.0005).astype(int)]
df = pd.DataFrame({"event":evt, "value":np.random.randint(0, 10, len(evt))}, index=idx)
Find the event rows and the row indices at +/- 10 seconds around each event:
event_time = df.index[df.event == "event"]
delta = pd.Timedelta(10, unit="s")
start_idx = df.index.searchsorted(event_time - delta).tolist()
end_idx = df.index.searchsorted(event_time + delta).tolist()
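As a quick illustration of how searchsorted locates those window boundaries (a toy array, not the answer's data): with the default side='left' it returns the first position where the value could be inserted while keeping the array sorted, so slicing [start:end] selects the rows inside each window.

```python
import numpy as np

a = np.array([10, 20, 20, 30])
left = np.searchsorted(a, 20)                 # 1: first slot where 20 fits (side='left')
right = np.searchsorted(a, 20, side='right')  # 3: slot just past the existing 20s
inside = a[left:right]                        # array([20, 20]): values inside the window
```

Note that the answer uses the default side='left' for end_idx as well, so a row falling exactly at event_time + delta is excluded; pass side='right' there if the upper bound should be inclusive.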
Create the mask array:
mask = np.zeros(df.shape[0], dtype=bool)
evt_id = np.zeros(df.shape[0], dtype=int)
for i, (s, e) in enumerate(zip(start_idx, end_idx)):
mask[s:e] = True
evt_id[s:e] = i
Use the mask array to filter rows; here I create an event_id column to group the events:
df_event = df[mask].copy()  # copy so adding the column doesn't raise SettingWithCopyWarning
df_event["event_id"] = evt_id[mask]
the output:
event value event_id
2016-03-01 13:51:48 no event 0 0
2016-03-01 13:51:51 event 8 0
2016-03-01 13:51:53 no event 3 0
2016-03-01 13:52:00 no event 1 0
2016-03-01 14:21:00 no event 2 1
2016-03-01 14:21:00 no event 5 1
2016-03-01 14:21:00 no event 0 1
2016-03-01 14:21:02 no event 1 1
2016-03-01 14:21:04 no event 2 1
2016-03-01 14:21:06 no event 0 1
2016-03-01 14:21:07 event 1 1
2016-03-01 14:21:16 no event 1 1
2016-03-01 14:21:16 no event 9 1
2016-03-01 15:09:42 no event 1 2
2016-03-01 15:09:49 event 7 2
2016-03-01 15:09:54 no event 3 2
2016-03-01 15:09:55 no event 3 2
2016-03-01 15:09:58 no event 5 2
2016-03-01 15:09:58 no event 9 2
2016-03-01 17:36:44 no event 8 3
2016-03-01 17:36:44 no event 2 3
2016-03-01 17:36:44 no event 9 3
2016-03-01 17:36:45 no event 2 3
2016-03-01 17:36:49 event 9 3
2016-03-01 17:36:50 no event 6 3
2016-03-01 17:36:54 no event 1 3
2016-03-01 17:36:56 no event 1 3
2016-03-01 18:51:37 no event 5 4
2016-03-01 18:51:37 no event 3 4
2016-03-01 18:51:42 no event 0 4
2016-03-01 18:51:47 event 9 4
2016-03-01 18:51:55 no event 4 4
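To apply this per car, as the opening sentence suggests, the window logic can be wrapped in a function and run through groupby. A sketch using the question's Date/Event/Car columns (the function name is my own, and I use side='right' on the upper bound so the +/- 2 minute window is inclusive at both ends):

```python
import numpy as np
import pandas as pd

def event_windows(g, delta):
    """Return the rows of one car's group within +/- delta of any EVENT row.

    Assumes g is sorted by Date, so searchsorted can locate window bounds."""
    evt_times = g.loc[g['Event'] == 'EVENT', 'Date']
    starts = g['Date'].searchsorted(evt_times - delta)
    ends = g['Date'].searchsorted(evt_times + delta, side='right')
    mask = np.zeros(len(g), dtype=bool)
    for s, e in zip(starts, ends):
        mask[s:e] = True
    return g[mask]

df = pd.DataFrame({
    'Date': pd.to_datetime(['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
                            '5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
                            '5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
                            '5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
                            '5/6/2011 12:52', '5/6/2011 13:00']),
    'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent', 'EVENT',
              'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
              'noEvent', 'noEvent', 'EVENT', 'noEvent'],
    'Car': ['CarA'] * 7 + ['CarB'] * 7})

out = (df.sort_values('Date')
         .groupby('Car', group_keys=False)
         .apply(event_windows, delta=pd.Timedelta(minutes=2)))
```

With group_keys=False the result keeps the original row index rather than adding a Car group level, so it slots back into the original frame's shape.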
Upvotes: 1