Reputation: 12085
Assume there are two Pandas Series (or DataFrames), both containing different datetime values. For example, one series/frame contains messages and another one contains specific events. Now I would be interested in filtering out all messages which were posted right after (meaning: within n minutes after the event) any event occurred. How could I do that using Pandas?
(Besides using two nested for-loops, I am hoping for something more pandas-ish and maybe more efficient, like using groupby or similar.)
Some sample data could be:
import pandas as pd
messages = pd.DataFrame([
[pd.to_datetime("2000-01-01 09:00:00"), "non-relevant msg 1"],
[pd.to_datetime("2000-01-01 09:02:11"), "non-relevant msg 2"],
[pd.to_datetime("2000-01-01 09:03:30"), "relevant msg 1"],
[pd.to_datetime("2000-01-01 09:04:30"), "relevant msg 2"],
[pd.to_datetime("2000-01-01 09:10:11"), "non-relevant msg 3"],
[pd.to_datetime("2000-01-01 10:00:15"), "relevant again 1"],
[pd.to_datetime("2000-01-01 10:03:15"), "relevant again 2"],
[pd.to_datetime("2000-01-01 10:07:00"), "non-relevant msg 4"],
], columns=["created_at", "text"])
events = pd.Series([
pd.to_datetime("2000-01-01 09:02:59"),
pd.to_datetime("2000-01-01 10:00:00"),
])
n = pd.Timedelta("5min")
Which should give the following output:
output = pd.DataFrame([
[pd.to_datetime("2000-01-01 09:03:30"), "relevant msg 1"],
[pd.to_datetime("2000-01-01 09:04:30"), "relevant msg 2"],
[pd.to_datetime("2000-01-01 10:00:15"), "relevant again 1"],
[pd.to_datetime("2000-01-01 10:03:15"), "relevant again 2"],
], columns=["created_at", "text"])
Upvotes: 0
Views: 51
Reputation: 42916
"I am hoping for something more panda-ish and maybe more efficient". Yes there's a more efficient way of getting your expected result by using numpy
and pandas
functionality's.
Party inspired by this answer.
import numpy as np

a = messages['created_at'].to_numpy()
bh = (events + n).to_numpy()  # high thresholds: event time + n
bl = events.to_numpy()        # low thresholds: event time

# Broadcast every message time against every event window; np.where
# returns the row indices (i) of the messages inside any window.
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
messages.loc[i].reset_index(drop=True)
created_at text
0 2000-01-01 09:03:30 relevant msg 1
1 2000-01-01 09:04:30 relevant msg 2
2 2000-01-01 10:00:15 relevant again 1
3 2000-01-01 10:03:15 relevant again 2
Explanation
First we convert our created_at column to a numpy array and create our high and low date thresholds: low = events and high = events + n.
Then we use np.where on a broadcasted comparison: a[:, None] turns the message times into a column vector, so each message time is compared against every event window at once. np.where returns the row indices of our messages dataframe where the datetime lies between our thresholds; we store these indices in i.
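To make the broadcasting step concrete, here is a minimal sketch with made-up integers (stand-ins, not the question's data) showing how a[:, None] compares every element against every threshold at once:
import numpy as np

a = np.array([1, 2, 6])   # stand-ins for message times
bl = np.array([2, 5])     # window starts
bh = np.array([3, 7])     # window ends

# a[:, None] has shape (3, 1); compared with shape-(2,) arrays it
# broadcasts to a (3, 2) boolean matrix: entry [m, e] is True when
# "message" m lies inside "event window" e.
mask = (a[:, None] >= bl) & (a[:, None] <= bh)
print(mask)
# [[False False]
#  [ True False]
#  [False  True]]
i, j = np.where(mask)
print(i)  # [1 2] -> a[1] and a[2] each fall inside some window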
Since we have our indices, we can simply use .loc to get the rows we want (this works here because messages has a default RangeIndex; with a non-default index, .iloc would be the safer choice).
Note: if your pandas version is lower than 0.24.0, use .values instead of .to_numpy().
Upvotes: 1
Reputation: 1500
If I understand correctly, there should be a few ways to solve your problem; finding an efficient one is really the issue here.
I would probably use apply with a for-loop inside, using a function like:
def follows_event(time, events=events, gap=pd.Timedelta('5min')):
    """Return True if `time` falls within `gap` after any event."""
    follows = False
    for event in list(events):
        if event < time and event + gap > time:
            follows = True
            break
    return follows
Once that's set up, you can simply use it to create a column that tells you whether there's an event in the 5 minutes preceding each message, and do with that as you will.
messages['follows_event'] = messages.created_at.apply(follows_event)
If you want to remove the messages posted during that gap, use:
df_filtered = messages[~messages.follows_event]
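Conversely, to reproduce the expected output from the question (keeping only the flagged messages), you can invert the filter; a small sketch building on the column created above:
# Keep the flagged messages and drop the helper column to match the
# question's expected output frame.
output = messages[messages.follows_event].drop(columns='follows_event').reset_index(drop=True)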
Upvotes: 1
Reputation: 601
This is what I understand of your question, but it would be clearer if you posted what the answer should look like.
filtered_dfs = []
for event in events:
    # keep messages that fall within [event, event + n]
    condition = messages.created_at.between(event, event + n)
    filtered_dfs.append(messages.loc[condition])
This is what the two filtered dfs look like:
#Output
created_at text
2 2000-01-01 09:03:30 relevant msg 1
3 2000-01-01 09:04:30 relevant msg 2
created_at text
5 2000-01-01 10:00:15 relevant again 1
6 2000-01-01 10:03:15 relevant again 2
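To get a single frame like the expected output in the question, one option (my addition, not part of the original answer) is to concatenate the pieces:
# Combine the per-event frames; drop_duplicates guards against messages
# that fall into overlapping event windows.
output = pd.concat(filtered_dfs).drop_duplicates().reset_index(drop=True)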
Upvotes: 1