Ayam Chan

Reputation: 85

Faster way to query pandas dataframe for a value based on values in other column

I have a dataset from Kaggle which contains about 4.6 million rows of data after some filtering.

I want to query this dataframe for the close_adjusted value where the symbol and date are equal to some given values. This is my implementation so far:

def price_diff(date_one, date_two, symbol):
    # Filter to the rows for this symbol, then to each of the two dates
    data = df.loc[df['symbol'] == symbol]
    date_one_data = data.loc[data['date'] == date_one]
    date_two_data = data.loc[data['date'] == date_two]

    # Only if data from both dates exists
    if len(date_one_data) > 0 and len(date_two_data) > 0:
        close_price_one = date_one_data.squeeze()['close_adjusted']
        close_price_two = date_two_data.squeeze()['close_adjusted']
        return close_price_one - close_price_two

    return dummy

A single call takes about 523 ms to run (523 ms ± 1.54 ms per loop).

Is there a faster way to achieve what I want? Right now, applying the above function to a dataset with 25,000 rows of dates and symbols takes more than an hour on a Kaggle notebook.

Thanks!

Upvotes: 0

Views: 883

Answers (1)

Akilan Manivannan

Reputation: 956

After doing some profiling, I found that the majority of your compute time is spent filtering on the symbol. Instead of filtering through the dataframe directly, you can speed things up by comparing against the underlying NumPy array that backs the column, which is much faster. The array can be accessed with the ravel() method.

def price_diff_ME(date_one, date_two, symbol):
    # Original version:
    # data = df.loc[df['symbol'] == symbol]

    # Build the boolean mask against the underlying NumPy array instead
    data = df[df["symbol"].ravel() == symbol]

    date_one_data = data.loc[data['date'] == date_one]
    date_two_data = data.loc[data['date'] == date_two]

    # Only if data from both dates exists
    if len(date_one_data) > 0 and len(date_two_data) > 0:
        close_price_one = date_one_data.squeeze()['close_adjusted']
        close_price_two = date_two_data.squeeze()['close_adjusted']
        return close_price_one - close_price_two

    return False

After using ravel() to get a NumPy array, I simply ran the same comparison as before to produce a boolean mask, then applied the mask to the dataframe to reproduce the filter you had earlier.

Just by changing that one line, I was able to see a roughly 70% increase in speed. You can use the same method to filter on the dates too if you would like; a sketch of that is below.
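For illustration, here is a minimal sketch of that extension, assuming the same global df with symbol, date, and close_adjusted columns as above; the function name price_diff_fast is just a placeholder. On newer pandas versions, Series.to_numpy() returns the same data as a NumPy array and can be used in place of ravel().

    def price_diff_fast(date_one, date_two, symbol):
        # Filter on the symbol via the raw NumPy array, as above
        data = df[df['symbol'].ravel() == symbol]

        # Apply the same trick to the dates: build both masks from one
        # NumPy array instead of two .loc lookups
        dates = data['date'].ravel()
        date_one_data = data[dates == date_one]
        date_two_data = data[dates == date_two]

        # Only if data from both dates exists
        if len(date_one_data) > 0 and len(date_two_data) > 0:
            close_price_one = date_one_data.squeeze()['close_adjusted']
            close_price_two = date_two_data.squeeze()['close_adjusted']
            return close_price_one - close_price_two

        return False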

Upvotes: 1
