dan577

Fast way to search backward in DataFrame columns while expanding

This is my DataFrame of prices:

   stock1  stock2
0     2.3    10.1
1     1.9    11.2
2     3.5    10.5
3     2.8    10.8
4     3.1    10.3
5     2.7     9.8
6     3.3    10.2

And this is the DataFrame of price supports I want to obtain:

   stock1  stock2
0     NaN     NaN
1     1.9    10.1
2     1.9    10.1
3     1.9    10.5
4     2.8    10.1
5     1.9     NaN
6     2.7     9.8

Let's focus on the first column of prices. The idea is to find the local minima (down peaks) this way:

    prices1 = prices['stock1']
    mask = (prices1.shift(1) > prices1) & (prices1.shift(-1) > prices1)
    supports1 = prices1.where(mask, nan)
    supports1.iloc[0] = min(prices1.iloc[0], prices1.iloc[1])
    supports1 = supports1.shift(1).ffill()

and we obtain:

   stock1
0  NaN
1  1.9
2  1.9
3  1.9
4  2.8
5  2.8
6  2.7

Another rule is that, for every price, the support must not be greater than the price. This fails in row 5, because the support 2.8 is greater than the price 2.7. To correct it, we have to look backward in the support column for the first value lower than the current price (NaN if none exists). In this case the right value is 1.9.
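
A minimal standalone sketch of this backward correction on the stock1 column (not part of the original code; the names are illustrative): wherever the carried-forward support is greater than the price, walk back through the earlier support values and take the most recent one below the current price.

```python
from math import isnan, nan

# Prices and the uncorrected supports column produced by the snippet above.
prices1   = [2.3, 1.9, 3.5, 2.8, 3.1, 2.7, 3.3]
supports1 = [nan, 1.9, 1.9, 1.9, 2.8, 2.8, 2.7]

corrected = supports1[:]
for i, (p, s) in enumerate(zip(prices1, supports1)):
    if not isnan(s) and s > p:
        # Search backward for the last support value lower than the price.
        lower = [v for v in supports1[:i] if not isnan(v) and v < p]
        corrected[i] = lower[-1] if lower else nan

print(corrected)  # [nan, 1.9, 1.9, 1.9, 2.8, 1.9, 2.7]
```

Only row 5 is touched here; its support drops from 2.8 to 1.9, matching the desired output above.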

I found two ways to solve the problem, but both need to iterate, and as the DataFrame grows they become incredibly slow. I'd like it 10 times faster, hopefully 100 times. This is my code:

from pandas import DataFrame
from numpy import nan
from numpy.random import uniform
from timeit import timeit

##rows = 5000
##cols = 10
##d = {}
##for i in range(cols):
##    d['stock_{}'.format(i)] = 100*uniform(0.95, 1.05, rows).cumprod()
##prices = DataFrame(d)

prices = DataFrame({'stock1': [2.3, 1.9, 3.5, 2.8, 3.1, 2.7, 3.3],
                    'stock2': [10.1, 11.2, 10.5, 10.8, 10.3, 9.8, 10.2]})



#----------------------------------------------------------------
def calc_supports1(prices):
    supports = DataFrame().reindex_like(prices)
    for stock in prices:
        prices1 = prices[stock]
        mask = (prices1.shift(1) > prices1) & (prices1.shift(-1) > prices1)
        supports1 = prices1.where(mask, nan)
        supports1.iloc[0] = min(prices1.iloc[0], prices1.iloc[1])
        supports1 = supports1.shift(1).ffill()

        sup = supports1.drop_duplicates()
        for i, v in prices1.loc[prices1 < supports1].items():
            mask = (sup.index < i) & (sup < v)
            sup2 = sup.values[mask.values]
            supports1.at[i] = sup2[-1] if len(sup2) > 0 else nan

        supports[stock] = supports1
    return supports

#----------------------------------------------------------------
def calc_supports2(prices):
    supports = DataFrame().reindex_like(prices)
    for stock in prices:
        prices1 = prices[stock]
        sup = [min(prices1[0], prices1[1])]
        supports1 = [nan, sup[0]]

        for i in range(2, len(prices1)):
            while len(sup) > 0 and prices1[i] < sup[0]:
                sup.pop(0)

            if prices1[i-1] < prices1[i] and prices1[i-1] < prices1[i-2]:
                sup.insert(0, prices1[i-1])

            supports1.append(sup[0] if len(sup) > 0 else nan)

        supports[stock] = supports1
    return supports
#----------------------------------------------------------------

print('fun1', timeit('calc_supports1(prices)',
                     setup='from __main__ import calc_supports1, prices', number=1))
print('fun2', timeit('calc_supports2(prices)',
                     setup='from __main__ import calc_supports2, prices', number=1))

How can I speed this up?


Answers (1)

Jérôme Richard

There are multiple performance problems in this code.

  • Python loops are slow (with the default interpreter, CPython). It is probably better to use something like Cython or Numba for this kind of computation.
  • Pandas manual indexing is slow. It is better here to use plain Python lists, and even to avoid indexing altogether in hot sections of the code.
  • Inserting/removing elements at the beginning of a list is much slower than at the end, since every other element must be shifted. The Python documentation advises adding/removing elements at the end to implement a stack.
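
As a quick standalone illustration of the last point (not from the original code): the same stack discipline costs O(1) per operation at the end of a list, but O(n) at the front, because `insert(0, ...)` and `pop(0)` shift every element.

```python
from timeit import timeit

def front_stack(n):
    s = []
    for i in range(n):
        s.insert(0, i)   # O(n): shifts all existing elements right
    for _ in range(n):
        s.pop(0)         # O(n): shifts all remaining elements left
    return s

def end_stack(n):
    s = []
    for i in range(n):
        s.append(i)      # amortized O(1)
    for _ in range(n):
        s.pop()          # O(1)
    return s

t_front = timeit(lambda: front_stack(20_000), number=1)
t_end = timeit(lambda: end_stack(20_000), number=1)
print(t_front / t_end)  # typically a large factor on CPython
```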

Here is the corrected code:

def calc_supports3(prices):
    supports = DataFrame().reindex_like(prices)
    for stock in prices:
        prices1 = list(prices[stock])
        sup = [min(prices1[0], prices1[1])]
        supports1 = [nan, sup[-1]]

        # Sliding window over the last three prices
        prices1_im2 = nan
        prices1_im1 = prices1[0]
        prices1_im0 = prices1[1]

        for i in range(2, len(prices1)):
            prices1_im2, prices1_im1, prices1_im0 = prices1_im1, prices1_im0, prices1[i]

            while len(sup) > 0 and prices1_im0 < sup[-1]:
                sup.pop()

            if prices1_im1 < prices1_im0 and prices1_im1 < prices1_im2:
                sup.append(prices1_im1)

            supports1.append(sup[-1] if len(sup) > 0 else nan)

        supports[stock] = supports1
    return supports
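
Going a step further (a sketch beyond the original answer; `supports_one_column` and `calc_supports4` are illustrative names): the same stack algorithm can run on raw NumPy arrays with a preallocated stack, which removes per-element pandas/list overhead entirely and is written in the loop-and-scalar style that Numba's `@njit` (mentioned above) can compile directly if it is available.

```python
import numpy as np
import pandas as pd

def supports_one_column(p):
    """Stack-based supports for a 1-D float array; written njit-compatible."""
    n = len(p)
    out = np.full(n, np.nan)
    sup = np.empty(n)           # preallocated stack, top points past the last item
    top = 0
    sup[top] = min(p[0], p[1]); top += 1
    out[1] = sup[top - 1]
    for i in range(2, n):
        while top > 0 and p[i] < sup[top - 1]:
            top -= 1                        # pop supports above the current price
        if p[i-1] < p[i] and p[i-1] < p[i-2]:
            sup[top] = p[i-1]; top += 1     # push a new down peak
        out[i] = sup[top - 1] if top > 0 else np.nan
    return out

def calc_supports4(prices):
    return pd.DataFrame({c: supports_one_column(prices[c].to_numpy())
                         for c in prices}, index=prices.index)
```

On the sample data this reproduces the desired supports table; with Numba, decorating `supports_one_column` should give a further speedup on large inputs.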

Here are performance results on your small dataset on my machine:

fun1 0.006461 s
fun2 0.000901 s
fun3 0.000648 s (40% faster than fun2)

Here are performance results on a randomly-generated dataset with 50 000 lines:

fun1 3.916947 s
fun2 2.064891 s
fun3 0.034465 s (60 times faster than fun2)
