SmileyProd

Reputation: 796

What are good practices for optimizing Pandas computations when a query has several conditions?

I often run into this problem when doing queries with lots of conditions: how can I speed up the process?

Basically, I very often use the apply function to get a result, but quite often the computation takes a long time.

Is there a good practice for figuring out how to optimize Pandas code?

Here is an example: I have a DataFrame representing the messages of a chat, with 3 columns.

The goal is to find the fraction of messages that had a response in less than 5 minutes. Here is my code:

import pandas as pd
import numpy as np

size_df = 30000
np.random.seed(42)

data = {
    'timestamp': pd.date_range('2019-03-01', periods=size_df, freq='30S').astype(int),
    'sender_id': np.random.randint(5, size=size_df),
    'receiver_id': np.random.randint(5, size=size_df)
}

dataframe = pd.DataFrame(data)

This is what the DataFrame looks like:

print(dataframe.head().to_string())
              timestamp  sender_id  receiver_id
0   1551398400000000000          4            2
1   1551398430000000000          3            2
2   1551398460000000000          1            1
3   1551398490000000000          4            3
4   1551398520000000000          4            3
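
Note that .astype(int) converts the datetimes to integer nanoseconds since the epoch, which is why the 5-minute window in the function below is written as 5 * 60 * 1000 * 1000 * 1000. A quick check of that correspondence:

import pandas as pd

print(pd.Timestamp('2019-03-01').value)  ## 1551398400000000000, nanoseconds since the epoch
print(5 * 60 * 1000 * 1000 * 1000)       ## 300000000000, i.e. 5 minutes in nanoseconds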

The function used by apply:

def apply_find_next_answer_within_5_min(row):
    """
        Find the index of the next response in a range of 5 minutes
    """
    [timestamp, sender, receiver] = row
    ## find the next responses from receiver to sender in the next 5 minutes 
    next_responses = df_groups.get_group((receiver, sender))["timestamp"]\
                        .loc[lambda x: (x > timestamp) & (x < timestamp + 5 * 60 * 1000 * 1000 * 1000)]
    ## if there is no next response, just return NaN
    if not next_responses.size:
        return np.nan
    ## find the next messages from sender to receiver in the next 5 minutes 
    next_messages = df_groups.get_group((sender, receiver))["timestamp"]\
            .loc[lambda x: (x > timestamp) & (x < timestamp + 5 * 60 * 1000 * 1000 * 1000)]

    ## if the next message comes before the next response, return NaN, else return the index of the next response
    return np.nan if next_messages.size and next_messages.iloc[0] < next_responses.iloc[0] else next_responses.index[0]

%%timeit
df_messages = dataframe.copy()
## create a dataframe to easily find messages from a specific sender and receiver, speed up the querying process for these messages.
df_groups = df_messages.groupby(["sender_id", "receiver_id"])
df_messages["next_message"] = df_messages.apply(lambda row: apply_find_next_answer_within_5_min(row), axis=1)

Output timeit:

42 s ± 2.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

So it takes 42 seconds to apply the function to a 30,000-row DataFrame. I think that is very slow, but I can't find a way to make it more efficient. I already saved 40 seconds by using the intermediate groupby on sender and receiver instead of querying the big DataFrame inside the apply function.
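
For reference, this is a sketch of the two lookup styles being compared here (using the example DataFrame above; exact timings will of course vary):

## slow: boolean-filter the full DataFrame on every call
subset_slow = df_messages[(df_messages.sender_id == 2) & (df_messages.receiver_id == 3)]

## faster: groupby computes the partition once, get_group is then a cheap lookup
df_groups = df_messages.groupby(["sender_id", "receiver_id"])
subset_fast = df_groups.get_group((2, 3))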

This would be the answer to this specific problem:

1 - df_messages.next_message[lambda x: pd.isnull(x)].size / df_messages.next_message.size
0.2753

So in such scenarios, how do you find a way to compute more efficiently? Are there some tricks to think about?

In this example, I don't believe it is possible to vectorize all the way, but maybe by using more groups it is possible to go faster?

Upvotes: 3

Views: 85

Answers (1)

Ben.T

Reputation: 29635

You can try to group your dataframe:

# reset_index so the original index is available as a column later
# frozenset is needed so that messages a->b and b->a fall into the same group
groups = dataframe.reset_index()\
                  .groupby([frozenset([se, re])
                            for se, re in dataframe[['sender_id', 'receiver_id']].values])
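
The frozenset is what merges the two directions of a conversation into one group, since frozensets compare equal regardless of element order:

print(frozenset([2, 3]) == frozenset([3, 2]))  # True: messages 2->3 and 3->2 share one group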

Now you can create a boolean mask meeting your condition:

mask_1 = (  # within a group, check whether the following message is sent by the other person
            (groups.sender_id.diff(-1).ne(0)
            # or whether the person is talking to themselves
            | dataframe.sender_id.eq(dataframe.receiver_id))
            # and check whether the following message is within 5 minutes
            & groups.timestamp.diff(-1).gt(-5*60*1000*1000*1000))
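
As a reminder of how diff(-1) behaves here, a small standalone example (not part of the answer's data):

import pandas as pd

s = pd.Series([10, 10, 40])
print(s.diff(-1).tolist())  # [0.0, -30.0, nan]: each value minus the next one

So a sender diff of 0 means the next message in the group has the same sender, and a timestamp diff greater than -300 seconds (expressed in nanoseconds) means the next message arrives less than 5 minutes later.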

Now create the column with the index you are looking for, using the mask and a shift on the index column:

df_messages.loc[mask_1, 'next_message'] = groups['index'].shift(-1)[mask_1]

and you get the same result as with your method, but it should be faster:

print (df_messages.head(20))
              timestamp  sender_id  receiver_id  next_message
0   1551398400000000000          3            1           NaN
1   1551398430000000000          4            1           NaN
2   1551398460000000000          2            3           NaN
3   1551398490000000000          4            1           NaN
4   1551398520000000000          4            3           NaN
5   1551398550000000000          1            1           NaN
6   1551398580000000000          2            3          10.0
7   1551398610000000000          2            4           NaN
8   1551398640000000000          2            4           NaN
9   1551398670000000000          4            1           NaN
10  1551398700000000000          3            2           NaN
11  1551398730000000000          2            4           NaN
12  1551398760000000000          4            0          18.0
13  1551398790000000000          1            0           NaN
14  1551398820000000000          3            3          16.0
15  1551398850000000000          1            2           NaN
16  1551398880000000000          3            3           NaN
17  1551398910000000000          4            1           NaN
18  1551398940000000000          0            4           NaN
19  1551398970000000000          3            2           NaN
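
To sanity-check this against the apply version and recover the fraction from the question, one option is the sketch below (next_message_apply is a hypothetical column assumed to hold the output of the original apply):

# fraction of messages answered within 5 minutes, as in the question
print(1 - df_messages.next_message.isnull().mean())

# element-wise comparison with the apply result, treating NaN == NaN as a match
# next_message_apply is assumed to hold the original apply output
matches = (df_messages.next_message.eq(df_messages.next_message_apply)
           | (df_messages.next_message.isna() & df_messages.next_message_apply.isna()))
print(matches.all())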

Upvotes: 2
