Sylvain Nulli

Reputation: 21

Speed correlation calculation - code too slow

I'm trying to calculate a speed correlation from an existing data file. It works fine but is very slow. The speed data file contains about 1000 points.

import time
import pandas as pd

def Correlation(data_vit, max_dt):
    data_corre = pd.DataFrame(columns=['dt','Vx','Vy','Vxy'])
    data_corre_tmp = pd.DataFrame(columns=['Vx','Vy','Vxy'])

    for dt in range(max_dt-1):
        for t in range(max_dt-1-dt):  # vx(t)vx(t+deltaT) + vy(t)vy(t+deltaT)
            Vx, Vy = data_vit.at[t+dt,'Vx']*data_vit.at[t,'Vx'], data_vit.at[t+dt,'Vy']*data_vit.at[t,'Vy']

            data_corre_tmp = data_corre_tmp.append({'Vx':  Vx,
                                                    'Vy':  Vy,
                                                    'Vxy': Vx+Vy
                                                    }, ignore_index=True)

        # average over all possible t for a given deltaT
        dmean = {'dt':  [dt],
                 'Vx':  [sum(data_corre_tmp['Vx'])/len(data_corre_tmp['Vx'])],
                 'Vy':  [sum(data_corre_tmp['Vy'])/len(data_corre_tmp['Vy'])],
                 'Vxy': [sum(data_corre_tmp['Vxy'])/len(data_corre_tmp['Vxy'])]
                 }

        dfmean = pd.DataFrame(data=dmean)
        data_corre = data_corre.append(dfmean, ignore_index=True)
        data_corre_tmp = data_corre_tmp.iloc[0:0]  # reset the temporary frame

        time.sleep(0.1)
        printProgressBar(dt + 1, max_dt, prefix='Progress:', length=50)  # separate helper, not shown

    return data_corre

I know it's pretty rough; I don't have much experience, so I went for the "simple route". Did I do anything that could take a lot of computing power for no reason? Because besides calculating Vx and Vy, I'm really just appending stuff.

Upvotes: 1

Views: 208

Answers (1)

Jérôme Richard

Reputation: 50383

This approach is very inefficient because:

  • data_corre_tmp.append creates a new DataFrame and copies all the elements
  • DataFrame indexing is slow
  • CPython loops are usually quite slow

Thus, you can speed up the computation by multiple orders of magnitude just by appending data to plain Python lists (not DataFrames) and using NumPy vectorized operations. The small timing sketch below illustrates the first point about append.
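To make the cost of row-by-row append concrete, here is a minimal timing sketch; the row count is arbitrary, and it assumes a pandas version older than 2.0, since DataFrame.append was deprecated in 1.4 and removed in 2.0:

import time
import pandas as pd

n = 10_000  # arbitrary row count, roughly the number of appends in the question

# Row-by-row DataFrame.append: every call builds a new frame and copies all rows.
start = time.perf_counter()
df = pd.DataFrame(columns=['Vx'])
for i in range(n):
    df = df.append({'Vx': float(i)}, ignore_index=True)
print('append per row :', time.perf_counter() - start)

# Accumulate in a plain list, build the DataFrame once at the end.
start = time.perf_counter()
rows = []
for i in range(n):
    rows.append(float(i))
df = pd.DataFrame({'Vx': rows})
print('list then frame:', time.perf_counter() - start)

On typical hardware the list version finishes in milliseconds while the append version takes orders of magnitude longer, and the original Correlation hits this pattern in every iteration of its inner loop.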

Here is an improved implementation:

import numpy as np
import pandas as pd

def Correlation_v2(data_vit, max_dt):
    allVx = np.array(data_vit['Vx'])
    allVy = np.array(data_vit['Vy'])
    data_corre = {'dt': list(range(max_dt-1)), 'Vx': [], 'Vy': [], 'Vxy': []}

    for dt in range(max_dt-1):
        # vx(t)vx(t+deltaT) + vy(t)vy(t+deltaT), vectorized over all t at once
        tmpVx = allVx[dt:max_dt-1] * allVx[0:max_dt-1-dt]
        tmpVy = allVy[dt:max_dt-1] * allVy[0:max_dt-1-dt]
        tmpVxy = tmpVx + tmpVy

        # Average over all possible t for a given deltaT
        data_corre['Vx'].append(tmpVx.mean())
        data_corre['Vy'].append(tmpVy.mean())
        data_corre['Vxy'].append(tmpVxy.mean())

    return pd.DataFrame(data_corre)
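For completeness, a minimal usage sketch with synthetic data; the random velocities and the seed are purely illustrative, the only real requirement is that data_vit has 'Vx' and 'Vy' columns:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data_vit = pd.DataFrame({'Vx': rng.normal(size=1000),
                         'Vy': rng.normal(size=1000)})

corr = Correlation_v2(data_vit, max_dt=100)
print(corr.head())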

With data_vit containing 1000 points and max_dt set to 100, this code is roughly 500 times faster than the reference implementation on my desktop computer.

Note that it can still be improved by computing the means incrementally.
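That incremental scheme is not spelled out here; as a smaller, related tweak (my own suggestion, not necessarily the scheme alluded to above), the mean is linear, so the Vxy mean can be derived from the Vx and Vy means without allocating the tmpVxy array at all:

import numpy as np
import pandas as pd

def Correlation_v3(data_vit, max_dt):
    allVx = np.array(data_vit['Vx'])
    allVy = np.array(data_vit['Vy'])
    data_corre = {'dt': list(range(max_dt-1)), 'Vx': [], 'Vy': [], 'Vxy': []}

    for dt in range(max_dt-1):
        mx = (allVx[dt:max_dt-1] * allVx[0:max_dt-1-dt]).mean()
        my = (allVy[dt:max_dt-1] * allVy[0:max_dt-1-dt]).mean()
        data_corre['Vx'].append(mx)
        data_corre['Vy'].append(my)
        data_corre['Vxy'].append(mx + my)  # equals (tmpVx + tmpVy).mean() by linearity of the mean

    return pd.DataFrame(data_corre)

The gain is modest compared to the jump from the original version, since the dominant cost is already the vectorized products.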

Upvotes: 1
