Reputation: 796
I often run into performance problems when doing queries with lots of conditions. How can I speed the process up? Basically, I very often use the apply
function to get a result, but quite often the computation takes a long time.
Is there a good practice for finding ways to optimize Pandas code?
Here is an example. I have a DataFrame representing the exchanges of a chat, containing 3 columns:
timestamp: the timestamp of the message
sender_id: the id of the sender
receiver_id: the id of the receiver
The goal is to find the fraction of messages that had a response in less than 5 minutes. Here is my code:
import pandas as pd
import numpy as np
import datetime
size_df = 30000
np.random.seed(42)
data = {
    ## .astype(int) stores the timestamps as integer nanoseconds since epoch
    'timestamp': pd.date_range('2019-03-01', periods=size_df, freq='30S').astype(int),
    'sender_id': np.random.randint(5, size=size_df),
    'receiver_id': np.random.randint(5, size=size_df)
}
dataframe = pd.DataFrame(data)
This is what the DataFrame looks like:
print(dataframe.head().to_string())
timestamp sender_id receiver_id
0 1551398400000000000 4 2
1 1551398430000000000 3 2
2 1551398460000000000 1 1
3 1551398490000000000 4 3
4 1551398520000000000 4 3
The function used by apply:
def apply_find_next_answer_within_5_min(row):
    """
    Find the index of the next response within a 5 minute window
    """
    [timestamp, sender, receiver] = row
    ## find the next responses from receiver to sender in the next 5 minutes
    next_responses = df_groups.get_group((receiver, sender))["timestamp"]\
        .loc[lambda x: (x > timestamp) & (x < timestamp + 5 * 60 * 1000 * 1000 * 1000)]
    ## if there is no next response, just return NaN
    if not next_responses.size:
        return np.nan
    ## find the next messages from sender to receiver in the next 5 minutes
    next_messages = df_groups.get_group((sender, receiver))["timestamp"]\
        .loc[lambda x: (x > timestamp) & (x < timestamp + 5 * 60 * 1000 * 1000 * 1000)]
    ## if the first next message comes before the next response, return NaN;
    ## otherwise return the index of the next response
    return np.nan if next_messages.size and next_messages.iloc[0] < next_responses.iloc[0] \
        else next_responses.index[0]
%%timeit
df_messages = dataframe.copy()
## group messages by (sender_id, receiver_id) so a specific sender/receiver pair can be looked up quickly inside the apply function
df_groups = df_messages.groupby(["sender_id", "receiver_id"])
df_messages["next_message"] = df_messages.apply(lambda row: apply_find_next_answer_within_5_min(row), axis=1)
Output of timeit:
42 s ± 2.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it takes 42 seconds to apply the function to a 30,000-row DataFrame. That seems very long, but I can't find a way to make it more efficient. I already saved 40 seconds by using the intermediate groupby on sender and receiver instead of querying the big dataframe inside the apply function.
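For reference, the slower version I started from filtered the full dataframe on every call. It looked roughly like this (a sketch, same logic as the function above):
def apply_find_next_answer_slow(row):
    """
    Same logic as above, but filtering the whole DataFrame on each call
    """
    [timestamp, sender, receiver] = row
    within_5_min = (df_messages["timestamp"] > timestamp) \
        & (df_messages["timestamp"] < timestamp + 5 * 60 * 1000 * 1000 * 1000)
    ## responses from receiver to sender in the next 5 minutes
    next_responses = df_messages.loc[
        within_5_min
        & (df_messages["sender_id"] == receiver)
        & (df_messages["receiver_id"] == sender), "timestamp"]
    ## if there is no next response, just return NaN
    if not next_responses.size:
        return np.nan
    ## messages from sender to receiver in the next 5 minutes
    next_messages = df_messages.loc[
        within_5_min
        & (df_messages["sender_id"] == sender)
        & (df_messages["receiver_id"] == receiver), "timestamp"]
    return np.nan if next_messages.size and next_messages.iloc[0] < next_responses.iloc[0] \
        else next_responses.index[0]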
This would be the answer to this specific problem:
1 - df_messages.next_message[lambda x: pd.isnull(x)].size / df_messages.next_message.size
0.2753
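An equivalent, shorter way to compute the same fraction, if I'm not mistaken (notna().mean() gives the share of answered messages directly):
df_messages.next_message.notna().mean()
0.2753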
So in scenarios like this, how do you find a way to compute more efficiently? Are there tricks to look for?
In this example I don't believe it is possible to vectorize all the way, but maybe by using more groups it is possible to go quicker?
Upvotes: 3
Views: 85
Reputation: 29635
You can try to group your dataframe:
# reset_index so the original index values can be retrieved later;
# frozenset is needed to make the groupby treat (se, re) and (re, se) as the same pair
groups = dataframe.reset_index().groupby(
    [frozenset([se, re])
     for se, re in dataframe[['sender_id', 'receiver_id']].values])
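The frozenset is what makes a message and its reply land in the same group: unlike a tuple, it is unordered (and still hashable), so (sender, receiver) and (receiver, sender) map to the same key:
print(frozenset([1, 3]) == frozenset([3, 1]))
True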
Now you can create a boolean mask meeting your condition:
mask_1 = ( # within a group, check if the following message is sent by the other person
    (groups.sender_id.diff(-1).ne(0)
     # or if the person is talking to themselves
     | dataframe.sender_id.eq(dataframe.receiver_id))
    # and check if the following message is within 5 minutes
    & groups.timestamp.diff(-1).gt(-5 * 60 * 1000 * 1000 * 1000))
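As a small illustration of what diff(-1) does: it subtracts the next row's value, so on timestamp it is negative when the following message comes later, and greater than -5 minutes exactly when the gap is under 5 minutes:
print(pd.Series([10, 40, 400]).diff(-1))
0    -30.0
1   -360.0
2      NaN
dtype: float64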
Now create the column with the index you are looking for, using the mask and a shift on the index column:
df_messages.loc[mask_1, 'next_message'] = groups['index'].shift(-1)[mask_1]
and you get the same result as with your method, and it should be faster:
print (df_messages.head(20))
timestamp sender_id receiver_id next_message
0 1551398400000000000 3 1 NaN
1 1551398430000000000 4 1 NaN
2 1551398460000000000 2 3 NaN
3 1551398490000000000 4 1 NaN
4 1551398520000000000 4 3 NaN
5 1551398550000000000 1 1 NaN
6 1551398580000000000 2 3 10.0
7 1551398610000000000 2 4 NaN
8 1551398640000000000 2 4 NaN
9 1551398670000000000 4 1 NaN
10 1551398700000000000 3 2 NaN
11 1551398730000000000 2 4 NaN
12 1551398760000000000 4 0 18.0
13 1551398790000000000 1 0 NaN
14 1551398820000000000 3 3 16.0
15 1551398850000000000 1 2 NaN
16 1551398880000000000 3 3 NaN
17 1551398910000000000 4 1 NaN
18 1551398940000000000 0 4 NaN
19 1551398970000000000 3 2 NaN
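To double-check that both methods agree, you can keep the result of the apply version in its own column and compare; next_message_apply below is a hypothetical name for wherever you stored it (Series.equals treats NaN values in the same positions as equal):
# hypothetical: next_message_apply holds the apply-based result
print(df_messages["next_message"].equals(df_messages["next_message_apply"]))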
Upvotes: 2