Reputation: 85
I have this dataset from Kaggle which contains about 4.6 million rows of data after some filtering
I want to query this dataframe for the close_adjusted
value where the symbol
and date
is equal to some value. This is my implementation so far
def price_diff(date_one, date_two, symbol):
data = df.loc[df['symbol'] == symbol]
date_one_data = data.loc[data['date'] == date_one]
date_two_data = data.loc[data['date'] == date_two]
# Only if data from both dates exist
if (len(date_one_data) > 0 and len(date_two_data) > 0):
close_price_one = date_one_data.squeeze()['close_adjusted']
close_price_two = date_two_data.squeeze()['close_adjusted']
return close_price_one - close_price_two
return dummy
This code took about 523 ms to run: 523 ms ± 1.54 ms per loop
Is there a faster way I can achieve what I want? Right now applying the above function to a dataset with 25000 rows of dates and symbols take more than an hour on a Kaggle notebook.
Thanks!
Upvotes: 0
Views: 883
Reputation: 956
After doing some profiling, I found that the majority of your compute time is spent filtering out the symbol
. Instead of using the dataframe to do your filtering, you can increase the speed by accessing the underlying numpy array that powers the dataframe since it is much faster. The array can be accessed using the ravel()
function.
def price_diff_ME(date_one, date_two, symbol):
#data = df.loc[df['symbol'] == symbol]
data = df[df["symbol"].ravel() == symbol]
date_one_data = data.loc[data['date'] == date_one]
date_two_data = data.loc[data['date'] == date_two]
# Only if data from both dates exist
if (len(date_one_data) > 0 and len(date_two_data) > 0):
close_price_one = date_one_data.squeeze()['close_adjusted']
close_price_two = date_two_data.squeeze()['close_adjusted']
return close_price_one - close_price_two
return False
After using ravel()
to get a numpy array, I simply ran the same filter as before to produce a binary mask. I then apply the mask to the dataframe to reproduce the same filter that you had earlier.
Just by changing that one line, I was able to see a 70% increase in speed. You can use the same method to filter out the other dates too if you would like.
Upvotes: 1