FanaticPythoner

Reputation: 71

Pandas - Optimizing dataframe multiple conditions reverse nested loop

So I have this piece of code I wrote that works well. However, it's way too slow, as I run it tens of thousands of times. I would like help optimizing it with vectorized operations, but I'm having trouble finding a way to do so, as I'm not an expert in pandas yet.

import pandas as pd


def slowFunctionToOptimize():
    # Variables definition
    minVolume = 2000
    exchange1 = 'binance'
    exchange2 = 'bitmart'
    volEx1Str = 'volume_' + exchange1
    volEx2Str = 'volume_' + exchange2
    threshold = 15.0
    minDuration = 10.0

    # See below for an example dataset
    dataset = pd.read_csv('example.csv', sep='|')
    
    indicesLst = dataset.index.values
    minIndexLst = indicesLst[0]
    
    # Get all indices where diffprice meets or exceeds the specified threshold,
    # and subtract the first index value so they can be used with "iloc" later on
    indicesThresh = dataset.index[dataset.diffprice >= threshold].values - minIndexLst
    
    pv = None
    prevEndIndex = len(dataset)
    
    # Get the largest possible window of rows (sequential order) based on the mean volume
    #  of the two exchanges, where the first value meets or exceeds the threshold
    for startInd in indicesThresh:
        for endInd in range(prevEndIndex, 0, -1):
            if endInd - startInd < minDuration:
                break
    
            dfTmp = dataset.iloc[startInd:endInd, :]
            avgVolume1 = dfTmp[volEx1Str].mean()
            avgVolume2 = dfTmp[volEx2Str].mean()
            if avgVolume1 > minVolume and avgVolume2 > minVolume:
                # Get the final result.
                pv = dfTmp.copy()
                break
    
        # Largest amount of rows found, exiting
        if pv is not None:
            break
    
        prevEndIndex = startInd
    
    if pv is None:
        print('No combination could be found for this iteration.')
        return
    
    return pv

Here's the "example.csv" dataset:

(index) volume_binance volume_bitmart diffprice
17039 69971.80688 4327.175492 47.48390871854886
17040 153370.133088 3528.694959 48.53994165659899
17041 76187.307674 2392.877003 47.29860035648541
17042 104401.51896 3502.658194 47.531388100324854
17043 61927.974127999994 2955.0013229999995 48.09375550725489
17044 66800.204432 2997.791962 48.53118830843571
17045 67560.20299500001 1552.265414 49.42801256594155
17046 134373.92478899998 3411.825602 50.17573122058736
17047 57268.227960000004 408.52263999999997 48.42945038122994
17048 81061.533276 2841.843586 49.37624085103861
17049 71169.214896 2681.685852 49.881164587046925
17050 52260.95781099999 2159.977762 49.72388813015853
17051 29749.781696 2769.8278600000003 50.014876524843785
17052 117725.19930000001 3218.053416 50.38916887841825
17053 102968.103323 394.730464 51.604463023668465
17054 49060.105344 0.0 51.27689583770968
17055 195587.92349000002 0.0 52.2316844839041
17056 198279.83096000002 362.99232800000004 53.64218226150563
17057 114223.564344 1913.151606 55.15065710504449
17058 130622.16151599998 0.0 55.1193699237017
17059 74873.30932900001 0.0 53.25227963525835
17060 69200.632056 0.0 53.53227771010961
17061 53494.92473600001 0.0 54.232051635006584
17062 56407.084068000004 0.0 52.72771332505377
17063 69346.222104 0.0 52.55673222390317
17064 47778.53641 0.0 52.23429951690821
17065 78987.33048 0.0 51.71210206415115
17066 49213.878500000006 0.0 51.116439382586535
17067 77530.21365599999 188.51300000000003 50.44907946170143

Here's the expected output (return variable "pv" inside the function):

(index) volume_binance volume_bitmart diffprice
17039 69971.80688 4327.175492 47.48390871854886
17040 153370.133088 3528.694959 48.53994165659899
17041 76187.307674 2392.877003 47.29860035648541
17042 104401.51896 3502.658194 47.53138810032485
17043 61927.974128 2955.0013229999995 48.09375550725489
17044 66800.204432 2997.791962 48.53118830843571
17045 67560.20299500001 1552.265414 49.42801256594155
17046 134373.92478899998 3411.825602 50.17573122058736
17047 57268.22796 408.52264 48.42945038122994
17048 81061.533276 2841.843586 49.37624085103861
17049 71169.214896 2681.685852 49.881164587046925
17050 52260.95781099999 2159.977762 49.72388813015853
17051 29749.781696 2769.8278600000003 50.014876524843785
17052 117725.1993 3218.053416 50.38916887841825
17053 102968.103323 394.730464 51.60446302366847
17054 49060.105344 0.0 51.27689583770968
17055 195587.92349 0.0 52.2316844839041
17056 198279.83096 362.992328 53.64218226150563
17057 114223.564344 1913.151606 55.15065710504449
17058 130622.16151599998 0.0 55.1193699237017

Upvotes: 0

Views: 80

Answers (1)

Avandale

Reputation: 262

There are several things you can do to optimize this code:

  1. Calculate the means more efficiently by adding a "cumulative volume" column for each of binance and bitmart:

volume1  cumulative volume1  volume2  cumulative volume2
5        5                   3        3
4        9                   2        5
4        13                  4        9

The window's total volume is then simply the difference of the cumulative column at the window's two edges, and dividing by the window length endInd - startInd gives the average — an O(1) lookup instead of an O(n) mean over a slice.

  2. Update only the data you need: copying dataframes is quite inefficient, so you should avoid building dfTmp on every iteration. Simply keep track of startInd and endInd and use the previous trick to calculate the average volumes; slice and copy only once, for the final result.
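Concretely, the cumulative-sum trick from point 1 can be sketched like this (`window_mean` is a hypothetical helper name, and the toy values mirror the small table above):

```python
import pandas as pd

# Toy frame mirroring the illustration table above.
df = pd.DataFrame({'volume1': [5, 4, 4], 'volume2': [3, 2, 4]})
df['cum_volume1'] = df['volume1'].cumsum()
df['cum_volume2'] = df['volume2'].cumsum()

def window_mean(df, col, startInd, endInd):
    """Mean of df[col] over positional rows startInd..endInd-1,
    computed in O(1) from the cumulative column."""
    cum = df['cum_' + col].to_numpy()
    total = cum[endInd - 1] - (cum[startInd - 1] if startInd > 0 else 0.0)
    return total / (endInd - startInd)

# Matches the ordinary slice-then-mean result, without materializing a slice.
print(window_mean(df, 'volume1', 0, 3))  # same as df['volume1'].iloc[0:3].mean()
```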

There are probably some other tricks you can use, but without knowing the exact type of data you are working with, I don't think I can help you more than that.
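Putting the two points above together, the question's function might be restructured along these lines (a sketch, not tested against your full data: `fast_window_search` is a hypothetical name, it takes the dataframe as a parameter instead of reading the CSV, and it works with positional indices throughout):

```python
import numpy as np
import pandas as pd

def fast_window_search(dataset, minVolume=2000, threshold=15.0, minDuration=10.0,
                       vol1='volume_binance', vol2='volume_bitmart'):
    # Cumulative sums with a leading zero: any window's total volume is
    # cum[endInd] - cum[startInd], so its mean costs O(1).
    cum1 = np.concatenate(([0.0], dataset[vol1].to_numpy().cumsum()))
    cum2 = np.concatenate(([0.0], dataset[vol2].to_numpy().cumsum()))

    # Positional indices where diffprice meets the threshold.
    indicesThresh = np.flatnonzero(dataset['diffprice'].to_numpy() >= threshold)

    prevEndIndex = len(dataset)
    for startInd in indicesThresh:
        for endInd in range(prevEndIndex, 0, -1):
            n = endInd - startInd
            if n < minDuration:
                break
            # Window means from the cumulative sums: no slicing, no copying.
            if (cum1[endInd] - cum1[startInd]) / n > minVolume and \
               (cum2[endInd] - cum2[startInd]) / n > minVolume:
                # Copy exactly once, for the final result.
                return dataset.iloc[startInd:endInd].copy()
        prevEndIndex = startInd

    print('No combination could be found for this iteration.')
    return None
```

The search order is the same as in the question; only the per-window work changes, from an O(n) slice-and-mean to two O(1) subtractions.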

Upvotes: 1
