Fastest way for a loop and condition code (Python + Dataframes)

Question

I have the following loop, which takes more than 9 seconds for 10 000 loops. My program has to call this function more than 1000 times. I need some help optimizing the "simu" function, because right now the code is too slow to be usable. Note that the daterange values are only an example and can vary a lot from one call to another.

What takes the most time:

df.itertuples(['DATES'])
the loop itself, even using an iterator
the if condition
df.index.get_loc to get the position of the date

Does anyone have an idea how to optimize this code?

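import time

import numpy as np
import pandas as pd


def simu(nbprod, df, daterange):
    timer = time.time()
    mat = np.zeros((len(df), nbprod))

    # Pair every row of df with every date range: len(df) * len(daterange) iterations
    iterator = ((i, j) for j in xrange(len(daterange)) for i in df.itertuples(['DATES']))

    for (i, j) in iterator:
        thedate = i[0]  # i[0] is the index value, i.e. the date of the row
        if (thedate >= daterange[j][0]) and (thedate <= daterange[j][1]):
            mat[df.index.get_loc(i[0])][j] = 1

    print time.time() - timer

    return mat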


new_index = pd.date_range(start=pd.datetime(2014,1,1), periods=24*10000, freq='H')
df = pd.DataFrame(np.random.randn(len(new_index)), new_index)
df.index.name = 'DATES'

daterange = [[pd.datetime(2014,1,3), pd.datetime(2014,1,7)], [pd.datetime(2015,6,3), pd.datetime(2017,1,7)], [pd.datetime(2017,1,3), pd.datetime(2020,1,7)]]

### with the 3 date ranges above
>>> simu(len(daterange), df, daterange)
9.43400001526

### with 3 times as many date ranges
>>> simu(len(daterange)*3, df, daterange*3)
30.6919999123

### with 10 times as many date ranges
>>> simu(len(daterange)*10, df, daterange*10)
92.2009999752

Jeff · Accepted Answer

This returns a frame, which is IMHO more useful anyhow (if you want the underlying data, just use .values on the result). The running time will scale linearly with the length of daterange.

def simu2(df, daterange):

    mat = pd.DataFrame(0,index=df.index,columns=range(len(daterange)))
    for j, (d1,d2) in enumerate(daterange):
        result = df[(df.index>=d1)&(df.index<=d2)]
        mat.loc[result.index,j] = 1

    return mat
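
The key step inside the loop is the boolean mask on the index: one vectorized comparison per date range instead of one Python-level test per row. A minimal sketch of a single iteration, assuming a hypothetical 5-day daily index (the names idx, mask and column are only for illustration and are not part of the answer):

```python
import pandas as pd

# Hypothetical 5-day daily index, just for illustration
idx = pd.date_range("2014-01-01", periods=5, freq="D")
d1, d2 = pd.Timestamp("2014-01-02"), pd.Timestamp("2014-01-04")

# One vectorized comparison over the whole index, no Python-level row loop
mask = (idx >= d1) & (idx <= d2)

# Cast the boolean mask to the 0/1 column that simu2 fills for this range
column = mask.astype(int)
print(column.tolist())  # [0, 1, 1, 1, 0]
```

Both bounds are inclusive, matching the >= and <= tests in the original simu.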


In [7]: result1 = simu2(df, daterange)

In [10]: result2 = simu(len(daterange), df, daterange)
5.7844748497

In [11]: (result1.values==result2).all()
Out[11]: True

In [12]: %timeit simu2(df, daterange)
10 loops, best of 3: 162 ms per loop
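
If the remaining loop over daterange ever becomes the bottleneck, the same 0/1 matrix can probably be built in a single step with NumPy broadcasting. This is a different technique from simu2, not something the answer above uses; simu_broadcast is a hypothetical name, and the sketch assumes a timezone-naive DatetimeIndex and a daterange made of [start, end] pairs as in the question.

```python
import numpy as np
import pandas as pd


def simu_broadcast(df, daterange):
    """Hypothetical variant of simu2: one broadcasted comparison for all ranges."""
    # Row dates as a column vector, shape (len(df), 1)
    dates = df.index.values[:, None]
    # Range bounds as row vectors, shape (1, len(daterange))
    starts = pd.to_datetime([d1 for d1, _ in daterange]).values[None, :]
    ends = pd.to_datetime([d2 for _, d2 in daterange]).values[None, :]
    # Broadcasting gives a (len(df), len(daterange)) boolean matrix
    inside = (dates >= starts) & (dates <= ends)
    return pd.DataFrame(inside.astype(int), index=df.index,
                        columns=range(len(daterange)))


# Small check on a hypothetical 5-day frame
idx = pd.date_range("2014-01-01", periods=5, freq="D")
small = pd.DataFrame({"x": range(5)}, index=idx)
ranges = [[pd.Timestamp("2014-01-02"), pd.Timestamp("2014-01-04")],
          [pd.Timestamp("2014-01-05"), pd.Timestamp("2014-01-09")]]
print(simu_broadcast(small, ranges).values.tolist())
# [[0, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
```

The trade-off is memory: the intermediate boolean matrix holds len(df) * len(daterange) entries at once, so for many ranges the per-range loop in simu2 may be the safer choice. Whether this variant is actually faster on your data is worth checking with %timeit.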
