nipy
nipy

Reputation: 5488

Vectorize for loop and return x day high and low

Overview

For each row of a dataframe I want to calculate the x day high and low.

An x day high is higher than previous x days. An x day low is lower than previous x days.

The for loop is explained in further detail in this post

Update:

Answer by @mozway below completes in around 20 seconds with dataset containing 18k rows. Can this be improved with numpy with broadcasting etc?

Example

2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.

2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.

See desired output and test code below which is calculated with a for loop in the findHighLow function. How would I vectorize findHighLow as the actual dataframe is somewhat larger.

Test data

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 
   
    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

#rows set to 15 for minimal example but actual dataframe contains around 18000 rows.
number_of_rows = 15    

df = genMockDataFrame(number_of_rows,1.1904,'tttmmm','19/3/2020',seed=157)

def findHighLow (df):

    df['x_day_high'] = 0
    df['x_day_low'] = 0

    for n in reversed(range(len(df['High']))):
        for i in reversed(range(n)):
                if df['High'][n] > df['High'][i]:
                    df['x_day_high'][n] = n - i
                else: break

    for n in reversed(range(len(df['Low']))):
        for i in reversed(range(n)):
                if df['Low'][n] < df['Low'][i]:
                    df['x_day_low'][n] = n - i
                else: break
    return df

df = findHighLow (df)

Desired output should match this:

df[["High","Low","x_day_high","x_day_low"]]

             High   Low x_day_high  x_day_low
date                
2020-03-19  1.1937  1.1832  0       0
2020-03-20  1.1879  1.1769  0       1
2020-03-21  1.1767  1.1662  0       2
2020-03-22  1.1721  1.1611  0       3
2020-03-23  1.1819  1.1690  2       0
2020-03-24  1.1928  1.1807  4       0
2020-03-25  1.1939  1.1864  6       0
2020-03-26  1.2141  1.1964  7       0
2020-03-27  1.2144  1.2039  8       0
2020-03-28  1.2099  1.2018  0       1
2020-03-29  1.2033  1.1853  0       4
2020-03-30  1.1887  1.1806  0       6
2020-03-31  1.1972  1.1873  1       0 
2020-04-01  1.1997  1.1914  2       0
2020-04-02  1.1924  1.1781  0       9

Upvotes: 4

Views: 225

Answers (2)

Anton
Anton

Reputation: 606

Here are two so solutions. Both produce the desired output, as posted in the question.

The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.

Numba

@numba.njit
def count_smaller(arr):
    current = arr[-1]
    count = 0
    
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] > current:
            break
        
        count += 1
        
    return count


@numba.njit
def count_greater(arr):
    current = arr[-1]
    count = 0
    
    for i in range(arr.shape[0]-2, -1, -1):
        if arr[i] < current:
            break
        
        count += 1
        
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)

Pandas/Numpy

def count_consecutive_true(bool_arr):
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)

This last solution is similar to mozway's. However it runs faster because it doesn't need to perform a join and uses numpy as much as possible. It also looks arbitrarily far back.

Upvotes: 2

mozway
mozway

Reputation: 260590

You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply on each column using agg and join the output after adding a prefix.

# number of days
N = 8

df.join(df.rolling(f'{N+1}d', min_periods=1)
          .agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum()-1,
                'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum()-1,
               })
          .add_prefix(f'{N}_days_')
        )

Output:

              Open    High     Low   Close  8_days_High  8_days_Low
date                                                               
2020-03-19  1.1904  1.1937  1.1832  1.1832          0.0         0.0
2020-03-20  1.1843  1.1879  1.1769  1.1772          0.0         1.0
2020-03-21  1.1755  1.1767  1.1662  1.1672          0.0         2.0
2020-03-22  1.1686  1.1721  1.1611  1.1721          0.0         3.0
2020-03-23  1.1732  1.1819  1.1690  1.1819          2.0         0.0
2020-03-24  1.1836  1.1928  1.1807  1.1922          4.0         0.0
2020-03-25  1.1939  1.1939  1.1864  1.1936          6.0         0.0
2020-03-26  1.1967  1.2141  1.1964  1.2114          7.0         0.0
2020-03-27  1.2118  1.2144  1.2039  1.2089          7.0         0.0
2020-03-28  1.2080  1.2099  1.2018  1.2041          0.0         1.0
2020-03-29  1.2033  1.2033  1.1853  1.1880          0.0         4.0
2020-03-30  1.1876  1.1887  1.1806  1.1879          0.0         6.0
2020-03-31  1.1921  1.1972  1.1873  1.1939          1.0         0.0
2020-04-01  1.1932  1.1997  1.1914  1.1914          2.0         0.0
2020-04-02  1.1902  1.1924  1.1781  1.1862          0.0         7.0

Upvotes: 1

Related Questions