Reputation: 21
I'm trying to calculate a speed correlation from an existing data file. It works fine but is very slow. The speed data file contains about 1000 points.
import pandas as pd
import time

def Correlation(data_vit, max_dt):
    data_corre = pd.DataFrame(columns=['dt', 'Vx', 'Vy', 'Vxy'])
    data_corre_tmp = pd.DataFrame(columns=['Vx', 'Vy', 'Vxy'])
    for dt in range(max_dt-1):
        for t in range(max_dt-1-dt):  # vx(t)vx(t+deltaT) + vy(t)vy(t+deltaT)
            Vx, Vy = data_vit.at[t+dt, 'Vx']*data_vit.at[t, 'Vx'], data_vit.at[t+dt, 'Vy']*data_vit.at[t, 'Vy']
            data_corre_tmp = data_corre_tmp.append({'Vx': Vx,
                                                    'Vy': Vy,
                                                    'Vxy': Vx+Vy
                                                    }, ignore_index=True)
        # average over all possible t for a given deltaT
        dmean = {'dt': [dt],
                 'Vx': [sum(data_corre_tmp['Vx'])/len(data_corre_tmp['Vx'])],
                 'Vy': [sum(data_corre_tmp['Vy'])/len(data_corre_tmp['Vy'])],
                 'Vxy': [sum(data_corre_tmp['Vxy'])/len(data_corre_tmp['Vxy'])]
                 }
        dfmean = pd.DataFrame(data=dmean)
        data_corre = data_corre.append(dfmean, ignore_index=True)
        data_corre_tmp = data_corre_tmp.iloc[0:0]  # reset the temporary frame for the next dt
        time.sleep(0.1)
        # printProgressBar: external console progress-bar helper (defined elsewhere)
        printProgressBar(dt + 1, max_dt, prefix='Progress:', length=50)
    return data_corre
I know it's pretty rough; I don't have much experience, so I went for the "simple route". Did I do anything that takes a lot of computing power for no reason? Because besides calculating Vx and Vy, I'm really just appending stuff.
Upvotes: 1
Views: 208
Reputation: 50383
This approach is very inefficient because data_corre_tmp.append creates a new dataframe and copies all the elements on every call. Thus, you can speed up the computation by multiple orders of magnitude by just appending data to lists (not dataframes) and using NumPy vectorized operations.
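For illustration, the general pattern looks like this (a minimal sketch with made-up values): accumulate rows in a plain Python list, which appends in amortized constant time, and build the dataframe once at the end.

import pandas as pd

rows = []
for i in range(1000):
    rows.append({'Vx': float(i), 'Vy': 2.0*i, 'Vxy': 3.0*i})  # cheap list append
df = pd.DataFrame(rows)  # single dataframe construction at the end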
Here is an improved implementation:
import numpy as np
import pandas as pd

def Correlation_v2(data_vit, max_dt):
    allVx = np.array(data_vit['Vx'])
    allVy = np.array(data_vit['Vy'])
    data_corre = {'dt': list(range(max_dt-1)), 'Vx': [], 'Vy': [], 'Vxy': []}
    for dt in range(max_dt-1):
        # vx(t)vx(t+deltaT) + vy(t)vy(t+deltaT)
        tmpVx = allVx[dt:max_dt-1] * allVx[0:max_dt-1-dt]
        tmpVy = allVy[dt:max_dt-1] * allVy[0:max_dt-1-dt]
        tmpVxy = tmpVx + tmpVy
        # average over all possible t for a given deltaT
        data_corre['Vx'].append(tmpVx.mean())
        data_corre['Vy'].append(tmpVy.mean())
        data_corre['Vxy'].append(tmpVxy.mean())
    return pd.DataFrame(data_corre)
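For example, here is one way to exercise it on synthetic data (the random values and the seed are just for illustration; the column layout matches the question):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data_vit = pd.DataFrame({'Vx': rng.normal(size=1000),
                         'Vy': rng.normal(size=1000)})
result = Correlation_v2(data_vit, 100)
print(result.head())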
With data_vit containing 1000 points and max_dt set to 100, this code is roughly 500 times faster than the reference implementation on my desktop computer.
Note that it can still be improved by computing the means incrementally.
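Another option is to remove the Python-level loop over dt entirely with np.correlate, which computes the lag products for every dt in a single call. Here is a sketch of that alternative (a different route than the incremental means mentioned above; Correlation_v3 is just an illustrative name, and it keeps the same data window as Correlation_v2):

import numpy as np
import pandas as pd

def Correlation_v3(data_vit, max_dt):
    # Same data window as Correlation_v2: points 0 .. max_dt-2
    n = max_dt - 1
    vx = np.asarray(data_vit['Vx'], dtype=float)[:n]
    vy = np.asarray(data_vit['Vy'], dtype=float)[:n]
    # np.correlate(v, v, 'full')[n-1+dt] == sum over t of v[t]*v[t+dt]
    cx = np.correlate(vx, vx, mode='full')[n-1:]
    cy = np.correlate(vy, vy, mode='full')[n-1:]
    counts = np.arange(n, 0, -1)  # number of valid t values for each dt
    mean_vx = cx / counts
    mean_vy = cy / counts
    return pd.DataFrame({'dt': np.arange(n), 'Vx': mean_vx,
                         'Vy': mean_vy, 'Vxy': mean_vx + mean_vy})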
Upvotes: 1