Reputation: 3071
I have a pandas DataFrame like this:
2011-5-5 12:43 noEvent CarA otherColumns...
2011-5-5 12:45 noEvent CarA ...
2011-5-5 12:49 EVENT CarA ...
2011-5-5 12:51 noEvent CarA ...
(no data - jumps in time)
2011-5-6 12:52 EVENT CarA ...
2011-5-6 12:59 noEvent CarA ...
2011-5-6 13:00 noEvent CarA ...
2011-5-5 12:43 noEvent CarB ...
2011-5-5 12:45 noEvent CarB ...
2011-5-5 12:49 noEvent CarB ...
2011-5-5 12:51 noEvent CarB ...
(no data - jumps in time)
2011-5-6 12:52 noEvent CarB ...
2011-5-6 12:52 EVENT CarB ...
2011-5-6 13:00 noEvent CarB ...
Explanation:
I need to perform some calculations on the rows within +/- 2 minutes of each event, separately for each car.
I am confused about how to do this... How can I filter this DataFrame to get those windows?
The desired result would look like this
-2min
2011-5-5 12:49 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarA ...
+2min
-2min
2011-5-6 12:52 EVENT CarB ...
+2min
Some info:
I don't know where to start.
Upvotes: 0
Views: 383
Reputation: 107767
Consider a cross join merge between an events-only filtered DataFrame and the full DataFrame, then subset the records that fall within +/- 2 minutes of an event for the same car:
Data frame setup (example posted data)
import pandas as pd
import datetime
df = pd.DataFrame({'Date': ['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
'5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
'5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
'5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
'5/6/2011 12:52', '5/6/2011 13:00'],
'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent','EVENT',
'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
'noEvent', 'noEvent', 'EVENT', 'noEvent'],
'Car': ['CarA', 'CarA', 'CarA', 'CarA', 'CarA',
'CarA', 'CarA', 'CarB', 'CarB','CarB',
'CarB', 'CarB', 'CarB', 'CarB']})
df['Date'] = pd.to_datetime(df['Date'])
# Car Date Event
# 0 CarA 2011-05-05 12:43:00 noEvent
# 1 CarA 2011-05-05 12:45:00 noEvent
# 2 CarA 2011-05-05 12:49:00 EVENT
# 3 CarA 2011-05-05 12:51:00 noEvent
# 4 CarA 2011-05-06 12:52:00 EVENT
# 5 CarA 2011-05-06 12:59:00 noEvent
# 6 CarA 2011-05-06 13:00:00 noEvent
# 7 CarB 2011-05-05 12:43:00 noEvent
# 8 CarB 2011-05-05 12:45:00 noEvent
# 9 CarB 2011-05-05 12:49:00 noEvent
# 10 CarB 2011-05-05 12:51:00 noEvent
# 11 CarB 2011-05-06 12:52:00 noEvent
# 12 CarB 2011-05-06 12:52:00 EVENT
# 13 CarB 2011-05-06 13:00:00 noEvent
Cross join (returns the full M x N combination set between the two DataFrames)
df['key'] = 1
# EVENTS DF
eventsdf = df[df['Event']=='EVENT']
# CROSS JOIN DF
crossdf = pd.merge(df, eventsdf, on='key')
crossdf = crossdf[((crossdf['Date_x'] <= crossdf['Date_y']
+ datetime.timedelta(minutes=2)) &
(crossdf['Date_x'] >= crossdf['Date_y']
- datetime.timedelta(minutes=2))) &
(crossdf['Car_x'] == crossdf['Car_y'])].sort_values('Date_x')
finaldf = crossdf[['Car_x', 'Date_x', 'Event_x']].drop_duplicates().sort_values('Car_x')
finaldf.columns = ['Car', 'Date', 'Event']
# Car Date Event
# 6 CarA 2011-05-05 12:49:00 EVENT
# 9 CarA 2011-05-05 12:51:00 noEvent
# 13 CarA 2011-05-06 12:52:00 EVENT
# 35 CarB 2011-05-06 12:52:00 noEvent
# 38 CarB 2011-05-06 12:52:00 EVENT
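On pandas 1.2+, the same cross join can be written without the helper key column by passing how='cross' to merge. A minimal sketch of that variant using the posted data (the _evt suffix is my own naming choice, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
                            '5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
                            '5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
                            '5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
                            '5/6/2011 12:52', '5/6/2011 13:00']),
    'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent', 'EVENT',
              'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
              'noEvent', 'noEvent', 'EVENT', 'noEvent'],
    'Car': ['CarA'] * 7 + ['CarB'] * 7})

# cross join the full frame against the events-only frame;
# left columns keep their names, right columns get the _evt suffix
events = df[df['Event'] == 'EVENT']
cross = df.merge(events, how='cross', suffixes=('', '_evt'))

# keep rows of the same car within +/- 2 minutes of an event
window = pd.Timedelta(minutes=2)
near = cross[(cross['Car'] == cross['Car_evt']) &
             (cross['Date'] - cross['Date_evt']).abs().le(window)]
result = near[['Car', 'Date', 'Event']].drop_duplicates().sort_values(['Car', 'Date'])
```

This yields the same five rows as the key-column version above, without the need to add and later ignore the artificial key.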
Upvotes: 1
Reputation: 97331
Group by the Car column first and process each group as follows:
Create the test data first:
import pandas as pd
import numpy as np
np.random.seed(1)
idx = pd.date_range("2016-03-01 10:00:00", "2016-03-01 20:00:00", freq="S")
idx = idx[np.random.randint(0, len(idx), 10000)].sort_values()
evt = np.array(["no event", "event"])[(np.random.rand(len(idx)) < 0.0005).astype(int)]
df = pd.DataFrame({"event":evt, "value":np.random.randint(0, 10, len(evt))}, index=idx)
Find the event rows and the row indices at +/- 10 seconds around each event:
event_time = df.index[df.event == "event"]
delta = pd.Timedelta(10, unit="s")
start_idx = df.index.searchsorted(event_time - delta).tolist()
end_idx = df.index.searchsorted(event_time + delta).tolist()
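As a quick illustration of how searchsorted locates those window boundaries (a toy array, not the answer's data): with the default side='left' it returns the first position where the value could be inserted while keeping the array sorted, so slicing [start:end] selects the rows inside each window.

```python
import numpy as np

a = np.array([10, 20, 20, 30])
left = np.searchsorted(a, 20)                 # 1: first slot where 20 fits (side='left')
right = np.searchsorted(a, 20, side='right')  # 3: slot just past the existing 20s
inside = a[left:right]                        # array([20, 20]): values inside the window
```

Note that the answer uses the default side='left' for end_idx as well, so a row falling exactly at event_time + delta is excluded; pass side='right' there if the upper bound should be inclusive.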
Create the mask array:
mask = np.zeros(df.shape[0], dtype=bool)
evt_id = np.zeros(df.shape[0], dtype=int)
for i, (s, e) in enumerate(zip(start_idx, end_idx)):
mask[s:e] = True
evt_id[s:e] = i
Use the mask array to filter rows; here I create an event_id column to group the events:
df_event = df[mask].copy()  # copy so adding the column doesn't raise SettingWithCopyWarning
df_event["event_id"] = evt_id[mask]
the output:
event value event_id
2016-03-01 13:51:48 no event 0 0
2016-03-01 13:51:51 event 8 0
2016-03-01 13:51:53 no event 3 0
2016-03-01 13:52:00 no event 1 0
2016-03-01 14:21:00 no event 2 1
2016-03-01 14:21:00 no event 5 1
2016-03-01 14:21:00 no event 0 1
2016-03-01 14:21:02 no event 1 1
2016-03-01 14:21:04 no event 2 1
2016-03-01 14:21:06 no event 0 1
2016-03-01 14:21:07 event 1 1
2016-03-01 14:21:16 no event 1 1
2016-03-01 14:21:16 no event 9 1
2016-03-01 15:09:42 no event 1 2
2016-03-01 15:09:49 event 7 2
2016-03-01 15:09:54 no event 3 2
2016-03-01 15:09:55 no event 3 2
2016-03-01 15:09:58 no event 5 2
2016-03-01 15:09:58 no event 9 2
2016-03-01 17:36:44 no event 8 3
2016-03-01 17:36:44 no event 2 3
2016-03-01 17:36:44 no event 9 3
2016-03-01 17:36:45 no event 2 3
2016-03-01 17:36:49 event 9 3
2016-03-01 17:36:50 no event 6 3
2016-03-01 17:36:54 no event 1 3
2016-03-01 17:36:56 no event 1 3
2016-03-01 18:51:37 no event 5 4
2016-03-01 18:51:37 no event 3 4
2016-03-01 18:51:42 no event 0 4
2016-03-01 18:51:47 event 9 4
2016-03-01 18:51:55 no event 4 4
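To apply this per car, as the opening sentence suggests, the window logic can be wrapped in a function and run through groupby. A sketch using the question's Date/Event/Car columns (the function name is my own, and I use side='right' on the upper bound so the +/- 2 minute window is inclusive at both ends):

```python
import numpy as np
import pandas as pd

def event_windows(g, delta):
    """Return the rows of one car's group within +/- delta of any EVENT row.

    Assumes g is sorted by Date, so searchsorted can locate window bounds."""
    evt_times = g.loc[g['Event'] == 'EVENT', 'Date']
    starts = g['Date'].searchsorted(evt_times - delta)
    ends = g['Date'].searchsorted(evt_times + delta, side='right')
    mask = np.zeros(len(g), dtype=bool)
    for s, e in zip(starts, ends):
        mask[s:e] = True
    return g[mask]

df = pd.DataFrame({
    'Date': pd.to_datetime(['5/5/2011 12:43', '5/5/2011 12:45', '5/5/2011 12:49',
                            '5/5/2011 12:51', '5/6/2011 12:52', '5/6/2011 12:59',
                            '5/6/2011 13:00', '5/5/2011 12:43', '5/5/2011 12:45',
                            '5/5/2011 12:49', '5/5/2011 12:51', '5/6/2011 12:52',
                            '5/6/2011 12:52', '5/6/2011 13:00']),
    'Event': ['noEvent', 'noEvent', 'EVENT', 'noEvent', 'EVENT',
              'noEvent', 'noEvent', 'noEvent', 'noEvent', 'noEvent',
              'noEvent', 'noEvent', 'EVENT', 'noEvent'],
    'Car': ['CarA'] * 7 + ['CarB'] * 7})

out = (df.sort_values('Date')
         .groupby('Car', group_keys=False)
         .apply(event_windows, delta=pd.Timedelta(minutes=2)))
```

With group_keys=False the result keeps the original row index rather than adding a Car group level, so it slots back into the original frame's shape.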
Upvotes: 1