Reputation: 1339
I have the following loops, which take more than 9 seconds for 10 000 loops. My program has to call this function more than 1000 times, so I need help optimizing the "simu" function: at the moment the code is too slow to be usable. Note that the daterange values below are only an example; they can vary widely from one call to another.
What takes the most time:
Does anyone have an idea how to optimize this code?
def simu(nbprod, df, daterange):
    timer = time.time()
    mat = np.zeros((len(df), nbprod))
    iterator = ((i,j) for j in xrange(len(daterange)) for i in df.itertuples(['DATES']))
    for (i,j) in iterator:
        thedate = i[0]
        if (thedate >= daterange[j][0]) and (thedate <= daterange[j][1]):
            mat[df.index.get_loc(i[0])][j] = 1
    print time.time() - timer
    return mat
new_index = pd.date_range(start=pd.datetime(2014,1,1), periods=24*10000, freq='H')
df = pd.DataFrame(np.random.randn(len(new_index)), new_index)
df.index.name = 'DATES'
daterange = [[pd.datetime(2014,1,3), pd.datetime(2014,1,7)],
             [pd.datetime(2015,6,3), pd.datetime(2017,1,7)],
             [pd.datetime(2017,1,3), pd.datetime(2020,1,7)]]
### with the 3 date ranges above
>>> simu(len(daterange), df, daterange)
9.43400001526
### with 3 times as many date ranges
>>> simu(len(daterange)*3, df, daterange*3)
30.6919999123
>>> simu(len(daterange)*10, df, daterange*10)
92.2009999752
Upvotes: 0
Views: 172
Reputation: 129018
This returns a frame, which is IMHO more useful anyhow (if you want the underlying data, just call .values on the returned frame). This will scale linearly with the length of daterange.
def simu2(df, daterange):
    mat = pd.DataFrame(0,index=df.index,columns=range(len(daterange)))
    for j, (d1,d2) in enumerate(daterange):
        result = df[(df.index>=d1)&(df.index<=d2)]
        mat.loc[result.index,j] = 1
    return mat
In [7]: result1 = simu2(df, daterange)
In [10]: result2 = simu(len(daterange), df, daterange)
5.7844748497
In [11]: (result1.values==result2).all()
Out[11]: True
In [12]: %timeit simu2(df, daterange)
10 loops, best of 3: 162 ms per loop
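If the remaining loop over daterange is still a concern, the same inclusive comparison can also be written with NumPy broadcasting. This is only a sketch, not something timed here: the name simu_broadcast and the small daily example are made up for illustration, it assumes a timezone-naive DatetimeIndex, and it builds a full len(df) x len(daterange) boolean array in memory, so it may not suit very long lists of ranges.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def simu_broadcast(df, daterange):
    """Hypothetical variant of simu2: 1 where the index date lies in [d1, d2]."""
    # Column vector of index dates, shape (n_rows, 1).
    dates = df.index.values[:, None]
    # Row vectors of range bounds, shape (n_ranges,).
    starts = pd.to_datetime([d1 for d1, _ in daterange]).values
    ends = pd.to_datetime([d2 for _, d2 in daterange]).values
    # Broadcasting compares every date with every range at once,
    # giving a boolean array of shape (n_rows, n_ranges).
    inside = (dates >= starts) & (dates <= ends)
    return pd.DataFrame(inside.astype(int), index=df.index,
                        columns=range(len(daterange)))


# Small illustrative data set (assumption: daily index instead of hourly).
idx = pd.date_range("2014-01-01", periods=10, freq="D")
df_small = pd.DataFrame(np.arange(10.0), index=idx)
ranges = [[datetime(2014, 1, 3), datetime(2014, 1, 7)],
          [datetime(2014, 1, 9), datetime(2014, 2, 1)]]
out = simu_broadcast(df_small, ranges)
```

Like simu2, this returns a frame indexed like df with one column per date range, so the two results can be compared with (a.values == b.values).all().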
Upvotes: 1