Reputation: 71
So I have this piece of code I wrote that works well. However, it's way too slow, as I run this code multiple tens of thousands of time frequently. I would like help to optimize it with vectorized operations, but I'm having trouble finding ways to do so, as I'm not an absolute expert in pandas yet.
def slowFunctionToOptimize():
# Variables definition
minVolume = 2000
exchange1 = 'binance'
exchange2 = 'bitmart'
volEx1Str = 'volume_' + exchange1
volEx2Str = 'volume_' + exchange2
threshold = 15.0
minDuration = 10.0
# See below for an example dataset
dataset = pd.read_csv('example.csv', sep='|')
indicesLst = dataset.index.values
minIndexLst = indicesLst[0]
# Get all indices that exceed or are equal to the specified threshold,
# and normalize with the first index value to work with "iloc" later on
indicesThresh = dataset.index[dataset.diffprice >= threshold].values - minIndexLst
pv = None
prevEndIndex = len(dataset)
# Get the largest possible amount of rows (sequential order) based on the volume mean
# of the two exchanges, where the first value exceed or is equal to the threshold
for startInd in indicesThresh:
for endInd in range(prevEndIndex, 0, -1):
if endInd - startInd < minDuration:
break
dfTmp = dataset.iloc[startInd:endInd, :]
avgVolume1 = dfTmp[volEx1Str].mean()
avgVolume2 = dfTmp[volEx2Str].mean()
if avgVolume1 > minVolume and avgVolume2 > minVolume:
# Get the final result.
pv = dfTmp.copy()
break
# Largest amount of rows found, exiting
if pv is not None:
break
prevEndIndex = startInd
if pv is None:
print('No combination could be found for this iteration.')
return
return pv
Here's the "example.csv" dataset:
(index) | volume_binance | volume_bitmart | diffprice |
---|---|---|---|
17039 | 69971.80688 | 4327.175492 | 47.48390871854886 |
17040 | 153370.133088 | 3528.694959 | 48.53994165659899 |
17041 | 76187.307674 | 2392.877003 | 47.29860035648541 |
17042 | 104401.51896 | 3502.658194 | 47.531388100324854 |
17043 | 61927.974127999994 | 2955.0013229999995 | 48.09375550725489 |
17044 | 66800.204432 | 2997.791962 | 48.53118830843571 |
17045 | 67560.20299500001 | 1552.265414 | 49.42801256594155 |
17046 | 134373.92478899998 | 3411.825602 | 50.17573122058736 |
17047 | 57268.227960000004 | 408.52263999999997 | 48.42945038122994 |
17048 | 81061.533276 | 2841.843586 | 49.37624085103861 |
17049 | 71169.214896 | 2681.685852 | 49.881164587046925 |
17050 | 52260.95781099999 | 2159.977762 | 49.72388813015853 |
17051 | 29749.781696 | 2769.8278600000003 | 50.014876524843785 |
17052 | 117725.19930000001 | 3218.053416 | 50.38916887841825 |
17053 | 102968.103323 | 394.730464 | 51.604463023668465 |
17054 | 49060.105344 | 0.0 | 51.27689583770968 |
17055 | 195587.92349000002 | 0.0 | 52.2316844839041 |
17056 | 198279.83096000002 | 362.99232800000004 | 53.64218226150563 |
17057 | 114223.564344 | 1913.151606 | 55.15065710504449 |
17058 | 130622.16151599998 | 0.0 | 55.1193699237017 |
17059 | 74873.30932900001 | 0.0 | 53.25227963525835 |
17060 | 69200.632056 | 0.0 | 53.53227771010961 |
17061 | 53494.92473600001 | 0.0 | 54.232051635006584 |
17062 | 56407.084068000004 | 0.0 | 52.72771332505377 |
17063 | 69346.222104 | 0.0 | 52.55673222390317 |
17064 | 47778.53641 | 0.0 | 52.23429951690821 |
17065 | 78987.33048 | 0.0 | 51.71210206415115 |
17066 | 49213.878500000006 | 0.0 | 51.116439382586535 |
17067 | 77530.21365599999 | 188.51300000000003 | 50.44907946170143 |
Here's the expected output (return variable "pv" inside the function):
(index) | volume_binance | volume_bitmart | diffprice |
---|---|---|---|
17039 | 69971.80688 | 4327.175492 | 47.48390871854886 |
17040 | 153370.133088 | 3528.694959 | 48.53994165659899 |
17041 | 76187.307674 | 2392.877003 | 47.29860035648541 |
17042 | 104401.51896 | 3502.658194 | 47.53138810032485 |
17043 | 61927.974128 | 2955.0013229999995 | 48.09375550725489 |
17044 | 66800.204432 | 2997.791962 | 48.53118830843571 |
17045 | 67560.20299500001 | 1552.265414 | 49.42801256594155 |
17046 | 134373.92478899998 | 3411.825602 | 50.17573122058736 |
17047 | 57268.22796 | 408.52264 | 48.42945038122994 |
17048 | 81061.533276 | 2841.843586 | 49.37624085103861 |
17049 | 71169.214896 | 2681.685852 | 49.881164587046925 |
17050 | 52260.95781099999 | 2159.977762 | 49.72388813015853 |
17051 | 29749.781696 | 2769.8278600000003 | 50.014876524843785 |
17052 | 117725.1993 | 3218.053416 | 50.38916887841825 |
17053 | 102968.103323 | 394.730464 | 51.60446302366847 |
17054 | 49060.105344 | 0.0 | 51.27689583770968 |
17055 | 195587.92349 | 0.0 | 52.2316844839041 |
17056 | 198279.83096 | 362.992328 | 53.64218226150563 |
17057 | 114223.564344 | 1913.151606 | 55.15065710504449 |
17058 | 130622.16151599998 | 0.0 | 55.1193699237017 |
Upvotes: 0
Views: 80
Reputation: 262
There are several things you can do to optimize this code:
volume1 | cumulative volume1 | volume 2 | cumulative volume2 |
---|---|---|---|
5 | 5 | 3 | 3 |
4 | 9 | 2 | 5 |
4 | 13 | 4 | 9 |
The average volume is then simply dataset['cumulative volume'][startInd] - dataset['cumulative volume'][endInd]
dfTmp
all the time. Simply keep track of startInd
and endInd
and use the previous trick to calculate the average volumesThere are probably some other tricks you can use, but without knowing the exact type of data you are using I don't think I can help you more than that
Upvotes: 1