miho

Reputation: 12085

Filter Pandas Series of datetimes through another Series of datetimes

Assume there are two Pandas Series (or DataFrames), both containing different datetime values. For example, one series/frame contains messages and another one contains specific events. Now I would be interested in filtering out all messages that were posted right after (meaning: within n minutes after the event) any event occurred. How could I do that using Pandas?

(Besides using two nested for-loops, I am hoping for something more panda-ish and maybe more efficient. Like using groupby or similar.)

Some sample data could be:

import pandas as pd
messages = pd.DataFrame([
    [pd.to_datetime("2000-01-01 09:00:00"), "non-relevant msg 1"],
    [pd.to_datetime("2000-01-01 09:02:11"), "non-relevant msg 2"],
    [pd.to_datetime("2000-01-01 09:03:30"), "relevant msg 1"],
    [pd.to_datetime("2000-01-01 09:04:30"), "relevant msg 2"],
    [pd.to_datetime("2000-01-01 09:10:11"), "non-relevant msg 3"],
    [pd.to_datetime("2000-01-01 10:00:15"), "relevant again 1"],
    [pd.to_datetime("2000-01-01 10:03:15"), "relevant again 2"],
    [pd.to_datetime("2000-01-01 10:07:00"), "non-relevant msg 4"],
], columns=["created_at", "text"])
events = pd.Series([
    pd.to_datetime("2000-01-01 09:02:59"),
    pd.to_datetime("2000-01-01 10:00:00"),
])
n = pd.Timedelta("5min")

Which should give the following output:

output = pd.DataFrame([
    [pd.to_datetime("2000-01-01 09:03:30"), "relevant msg 1"],
    [pd.to_datetime("2000-01-01 09:04:30"), "relevant msg 2"],
    [pd.to_datetime("2000-01-01 10:00:15"), "relevant again 1"],
    [pd.to_datetime("2000-01-01 10:03:15"), "relevant again 2"],
], columns=["created_at", "text"])

Upvotes: 0

Views: 51

Answers (3)

Erfan

Reputation: 42916

"I am hoping for something more panda-ish and maybe more efficient". Yes there's a more efficient way of getting your expected result by using numpy and pandas functionality's.

Partly inspired by this answer.

import numpy as np

a = messages['created_at'].to_numpy()
bh = (events + n).to_numpy()  # upper bound: event time + n
bl = events.to_numpy()        # lower bound: event time

# compare every message against every event window and keep the matching row indices
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

messages.loc[i].reset_index(drop=True)

           created_at              text
0 2000-01-01 09:03:30    relevant msg 1
1 2000-01-01 09:04:30    relevant msg 2
2 2000-01-01 10:00:15  relevant again 1
3 2000-01-01 10:03:15  relevant again 2

Explanation

First we convert our created_at column to a numpy array and create our high and low date thresholds: Low = events and High = events + n.

Then we use np.where to conditionally go over the rows of our messages dataframe and store the indices of the rows which match our condition, i.e. where the datetime is between our thresholds. We store these row indices in i (j holds the index of the matching event).

Since we have our indices, we can simply use .loc to get the rows we want.
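
For reference, the same condition can also be kept as a boolean mask instead of extracting indices. This is just a small equivalent sketch under the same setup (a, bl, bh defined as above), not part of the original answer:

# one boolean per message: True if it falls inside any event window
mask = ((a[:, None] >= bl) & (a[:, None] <= bh)).any(axis=1)
messages[mask].reset_index(drop=True)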


Note, if your pandas version is lower than 0.24.0, use .values instead of to_numpy.

Upvotes: 1

Jim Eisenberg

Reputation: 1500

If I understand correctly, there should be a few ways to solve your problem - finding the efficient one is really the issue here.

I would probably use apply with a for-loop within, using a function like:

def follows_event(time, events=events, gap=pd.Timedelta('5min')):
    # return True if `time` falls within `gap` after any of the events
    follows = False
    for event in list(events):
        if event < time and event + gap > time:
            follows = True
            break
    return follows

Once that's set up, you can simply use it to create a column that tells you whether there's an event in the 5 minutes preceding each message, and do with that as you will.

messages['follows_event'] = messages.created_at.apply(follows_event)

If you want to remove those during that gap, use:

df_filtered = messages[messages.follows_event != True]

Upvotes: 1

Francesco Zambolin

Reputation: 601

This is what I understand of your question, but it would be clearer if you posted what the answer should look like.

filtered_dfs = []
for event in events:
    # keep messages whose timestamp falls within [event, event + n]
    condition = messages.created_at.between(event, event + n)
    filtered_dfs.append(messages.loc[condition])

This is what the two filtered dfs look like:

#Output
           created_at            text
2 2000-01-01 09:03:30  relevant msg 1
3 2000-01-01 09:04:30  relevant msg 2 


           created_at              text
5 2000-01-01 10:00:15  relevant again 1
6 2000-01-01 10:03:15  relevant again 2 
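
To get a single frame like the expected output in the question, the pieces could be concatenated afterwards; a minimal sketch (the drop_duplicates call is only a precaution for overlapping event windows, which the sample data does not have):

output = pd.concat(filtered_dfs).drop_duplicates().reset_index(drop=True)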

Upvotes: 1
