Reputation: 45
I am calculating 48 derived pandas columns by iterating and calculating one column at a time, but I need to speed the process up. What is the best way to make it faster and more efficient? Each column calculates the closing price as a percentage of the period's (T, T-1, T-2, etc.) high-low range.
The code I am currently using is:
# get last x closes as percentage of period high and low
for i in range(1, 49, 1):
    df.loc[:, 'Close_T_period_' + str(i)] = (
        (df['BidClose'].shift(i).values - df['BidLow'].shift(i).values)
        / (df['BidHigh'].shift(i).values - df['BidLow'].shift(i).values)
    )
Input dataframe sample:
BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose Volume
Date
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 1.22850 1.22927 1.22777 1.22900 12075.0
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 1.22900 1.23110 1.22870 1.23068 16291.0
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 1.23068 1.23119 1.22979 1.23087 10979.0
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 1.23087 1.23314 1.23062 1.23241 16528.0
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 1.23241 1.23256 1.23172 1.23228 14106.0
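For example, from this sample the 10:00 row's Close_T_period_1 value should be (1.22893 - 1.22768) / (1.22919 - 1.22768) ≈ 0.83, i.e. the previous bar's close expressed as a fraction of that bar's high-low range.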
Output dataframe sample:
BidOpen BidHigh BidLow BidClose ... Close_T_period_45 Close_T_period_46 Close_T_period_47 Close_T_period_48
Date ...
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 ... 0.682635 0.070796 0.128940 0.794521
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 ... 0.506024 0.682635 0.070796 0.128940
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 ... 0.774920 0.506024 0.682635 0.070796
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 ... 0.212500 0.774920 0.506024 0.682635
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 ... 0.378882 0.212500 0.774920 0.506024
Upvotes: 1
Views: 110
Reputation: 999
The following code is about 6x faster:
import numpy as np

def my_shift(x, i):
    # prepend i NaNs and drop the last i values (equivalent to pandas shift(i))
    first = np.array([np.nan] * i)
    return np.append(first, x[:-i])

# calculate the close-within-range percentage only once
result = ((df2['BidClose'].values - df2['BidLow'].values)
          / (df2['BidHigh'].values - df2['BidLow'].values))

for i in range(1, 49, 1):
    df2.loc[:, 'Close_T_period_' + str(i)] = my_shift(result, i)
The two main bottleneck issues in your code are:

1. the percentage expression (close - low) / (high - low) is recomputed on every iteration, even though only the shift changes;
2. pandas' shift is relatively slow compared with a plain NumPy operation.

My code simply manages these two issues: I calculate result only once and use the loop only for shifting (issue #1), and I implement my own shift function that appends i NaN values in front of the original array and cuts off the last i values (issue #2).
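As a quick sanity check (just a sketch, assuming df still holds your original columns and df2 holds the ones built with my_shift), each new column can be compared against the pandas shift version:

import numpy as np

# my_shift(result, i) should reproduce the original shift-based formula
for i in range(1, 49):
    pandas_version = ((df['BidClose'].shift(i) - df['BidLow'].shift(i))
                      / (df['BidHigh'].shift(i) - df['BidLow'].shift(i)))
    assert np.allclose(pandas_version.values,
                       df2['Close_T_period_' + str(i)].values,
                       equal_nan=True)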
With a dataframe of 5000 rows, the time benchmark of your code gives:
42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With my solution I obtained:
7.62 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
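A possible further variation on the same idea (only a sketch, not benchmarked here, and starting again from df2 before the loop above): build all 48 shifted columns in one NumPy array and attach them with a single concat, instead of inserting columns into the DataFrame one at a time:

import numpy as np
import pandas as pd

# build all 48 shifted copies of result in one array
n = len(result)
shifted = np.full((n, 48), np.nan)
for i in range(1, 49):
    # column i-1 holds result shifted down by i rows
    shifted[i:, i - 1] = result[:n - i]

new_cols = pd.DataFrame(shifted,
                        index=df2.index,
                        columns=['Close_T_period_' + str(i) for i in range(1, 49)])
df2 = pd.concat([df2, new_cols], axis=1)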
I also tried to implement a solution with apply:
result = ((df2['BidClose'].values - df2['BidLow'].values)
          / (df2['BidHigh'].values - df2['BidLow'].values))

df3 = df2.reindex(df2.columns.tolist() + [f'Close_T_period_{i}' for i in range(1, 2000)], axis=1)
# apply works column by column here; each column name encodes its shift index
df3.iloc[:, 9:] = df3.iloc[:, 9:].apply(lambda col: my_shift(result, int(col.name.split('_')[-1])))
In my tests this solution seems slightly slower than the first one.
Upvotes: 2