Alexis G

Reputation: 1339

Fastest way for a loop and condition code (Python + Dataframes)

I have the following loop, which takes more than 9 seconds for 10,000 loops. My program has to call this function more than 1000 times, so I need help optimizing the "simu" function; as it stands, the code is too slow to be usable. Note that the daterange values below are only examples and can differ widely from one call to another.

What takes the most time:

Does anyone have an idea how to optimize this code?

import time

import numpy as np
import pandas as pd


def simu(nbprod, df, daterange):

    timer = time.time()
    mat = np.zeros((len(df), nbprod))

    iterator = ((i,j) for j in xrange(len(daterange)) for i in df.itertuples(['DATES']))

    for (i,j) in iterator:
        thedate = i[0]
        if (thedate >= daterange[j][0]) and (thedate <= daterange[j][1]):
            mat[df.index.get_loc(i[0])][j] = 1

    print time.time() - timer

    return mat


new_index = pd.date_range(start=pd.datetime(2014,1,1), periods=24*10000, freq='H')
df = pd.DataFrame(np.random.randn(len(new_index)), new_index)
df.index.name = 'DATES'

daterange = [[pd.datetime(2014,1,3), pd.datetime(2014,1,7)], [pd.datetime(2015,6,3), pd.datetime(2017,1,7)], [pd.datetime(2017,1,3), pd.datetime(2020,1,7)]]

### with one set of date ranges
>>> simu(len(daterange), df, daterange)
9.43400001526

### with 3 times as many date ranges
>>> simu(len(daterange)*3, df, daterange*3)
30.6919999123

>>> simu(len(daterange)*10, df, daterange*10)
92.2009999752

Upvotes: 0

Views: 172

Answers (1)

Jeff

Reputation: 129018

This returns a frame, which is IMHO more useful anyhow (if you want the underlying data, just use .values on the result). It will scale linearly with the length of daterange.

def simu2(df, daterange):

    mat = pd.DataFrame(0,index=df.index,columns=range(len(daterange)))
    for j, (d1,d2) in enumerate(daterange):
        result = df[(df.index>=d1)&(df.index<=d2)]
        mat.loc[result.index,j] = 1

    return mat


In [7]: result1 = simu2(df, daterange)

In [10]: result2 = simu(len(daterange), df, daterange)
5.7844748497

In [11]: (result1.values==result2).all()
Out[11]: True

In [12]: %timeit simu2(df, daterange)
10 loops, best of 3: 162 ms per loop
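
For comparison, the same 0/1 matrix can also be built without the Python-level loop by broadcasting the index against the range bounds in NumPy. This is only a sketch and was not part of the timings above: the name simu_broadcast and the small daily example are made up for illustration, and it assumes the intermediate boolean array of shape (len(df), len(daterange)) fits in memory.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def simu_broadcast(df, daterange):
    # Hypothetical helper (not from the original post): same result as
    # simu2, computed with one broadcast comparison.
    dates = df.index.values[:, None]                          # shape (n_rows, 1)
    starts = pd.to_datetime([d[0] for d in daterange]).values  # shape (n_ranges,)
    ends = pd.to_datetime([d[1] for d in daterange]).values    # shape (n_ranges,)

    # (n_rows, 1) against (n_ranges,) broadcasts to (n_rows, n_ranges);
    # both bounds are inclusive, as in the original condition.
    mask = (dates >= starts) & (dates <= ends)

    return pd.DataFrame(mask.astype(int), index=df.index,
                        columns=range(len(daterange)))


# Small illustrative data set: 10 daily rows and two date ranges.
idx = pd.date_range("2014-01-01", periods=10, freq="D")
small_df = pd.DataFrame(np.arange(10), index=idx)
small_ranges = [[datetime(2014, 1, 3), datetime(2014, 1, 7)],
                [datetime(2014, 1, 9), datetime(2015, 1, 1)]]
small_mat = simu_broadcast(small_df, small_ranges)
```

Whether this beats the loop over daterange depends on how many ranges there are: the loop touches one column at a time, while the broadcast version trades memory for a single vectorized comparison.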

Upvotes: 1
