Reputation: 33
I am in the process of moving from Excel to Python. I need to calculate multiple, mostly interdependent data columns. I can do this in Python with a for-loop, but it is incredibly inefficient and takes unreasonably long. From the reading I've done, it is clear I need to vectorise the calculation. As a simplified example, please see the table below, taken from Excel. The grey columns hold the static data, and the two yellow columns (Value_1 and Value_2) hold the formulas that are simply dragged down.
The for-loop I've used:
n = len(df)  # number of rows
for i in range(1, n):
    df.loc[i, 'Value_1'] = df.loc[i-1, 'Value_2'] + df.loc[i, 'Day_Chg']
    df.loc[i, 'Value_2'] = df.loc[i, 'Value_1'] + df.loc[i-1, 'Value_1'] + df.loc[i-1, 'Value_2']
Date | Ticker | Close | Day_Chg | Value_1 | Value_2 |
---|---|---|---|---|---|
2019/12/02 | AA | 43.79 | | | |
2019/12/03 | AA | 43.32 | -1.07% | -1.07% | -1.07% |
2019/12/04 | AA | 43.49 | 0.39% | -0.68% | -2.83% |
2019/12/05 | AA | 43.48 | -0.02% | -2.85% | -6.36% |
2019/12/06 | AA | 43.88 | 0.92% | -5.44% | -14.65% |
2019/12/09 | AA | 43.73 | -0.34% | -14.99% | -35.08% |
2019/12/02 | BB | 30.07 | | | |
2019/12/03 | BB | 25.63 | -14.77% | -14.77% | -14.77% |
2019/12/04 | BB | 27.68 | 8.00% | -6.77% | -36.30% |
2019/12/05 | BB | 23.58 | -14.81% | -51.11% | -94.18% |
2019/12/06 | BB | 22.63 | -4.03% | -98.20% | -243.49% |
2019/12/09 | BB | 19.69 | -12.99% | -256.48% | -598.18% |
The formula in column "Value_1" adds the previous day value from column "Value_2" to the current day value from column "Day_Chg".
The formula in column "Value_2" adds the previous day values from both columns "Value_1" and "Value_2" to the current day value from column "Value_1".
I would highly value any advice. Thank you in advance.
Upvotes: 1
Views: 158
Reputation: 4407
EDIT after discussion in the comments.
Unfortunately, because your columns are interdependent and also depend on previous rows, it is not possible to vectorise the computation of the two columns. Hence, iterating over the rows one by one, much as you proposed in your question, is the only way. Any alternative that avoids writing an explicit loop would still use a loop under the hood.
Since you do not use indices or other features of Pandas, you can try using numpy arrays instead of dataframes to speed things up.
To make the code significantly faster, you could optimise it with Numba, which translates Python code into machine code just-in-time. An example is this answer to a question similar to yours. I would love to help you with this now, but at the moment I don't have access to a machine with Numba. I will edit my answer in the next few days; maybe, in the meantime, someone else will also answer in this regard.
EDIT2: As promised, here is the solution with Numba, compared with a solution using numpy and one using pandas:
import numpy as np
import numba

@numba.jit(nopython=True)
def numba_calculate(day_chg):
    def fixna(x):
        return 0 if np.isnan(x) else x
    # The result array will have Value_1 and Value_2 in columns 0 and 1, respectively.
    result = np.zeros((len(day_chg), 2))
    for i in range(1, len(day_chg)):
        result[i, 0] = fixna(result[i-1, 1]) + day_chg[i]
        result[i, 1] = fixna(result[i-1, 0]) + fixna(result[i-1, 1]) + result[i, 0]
    return result
# Exact same function as numba_calculate, but without the Numba decorator.
def numpy_calculate(day_chg):
    def fixna(x):
        return 0 if np.isnan(x) else x
    # The result array will have Value_1 and Value_2 in columns 0 and 1, respectively.
    result = np.zeros((len(day_chg), 2))
    for i in range(1, len(day_chg)):
        result[i, 0] = fixna(result[i-1, 1]) + day_chg[i]
        result[i, 1] = fixna(result[i-1, 0]) + fixna(result[i-1, 1]) + result[i, 0]
    return result
def pandas_calculate(df2):
    def fixna(x):
        return 0 if np.isnan(x) else x
    for prev, curr in zip(df2.index, df2.index[1:]):
        df2.loc[curr, 'Value_1'] = fixna(df2.loc[prev, 'Value_2']) + df2.loc[curr, 'Day_Chg']
        df2.loc[curr, 'Value_2'] = df2.loc[curr, 'Value_1'] + fixna(df2.loc[prev, 'Value_1']) + fixna(df2.loc[prev, 'Value_2'])
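The timings below run on the data from your question. For completeness, here is a minimal sketch of how that frame could be built (assuming Day_Chg is the per-ticker percentage change of Close; the two Value columns are pre-created as NaN because pandas_calculate reads them from the previous row):

import pandas as pd

# Data from the question; AA rows first, then BB, matching the printed output further below.
df = pd.DataFrame({
    'Date': ['2019/12/02', '2019/12/03', '2019/12/04', '2019/12/05', '2019/12/06', '2019/12/09'] * 2,
    'Ticker': ['AA'] * 6 + ['BB'] * 6,
    'Close': [43.79, 43.32, 43.49, 43.48, 43.88, 43.73,
              30.07, 25.63, 27.68, 23.58, 22.63, 19.69],
})
df['Day_Chg'] = df.groupby('Ticker')['Close'].pct_change()  # assumed: day-over-day change of Close
df['Value_1'] = np.nan  # pandas_calculate expects these columns to exist already
df['Value_2'] = np.nan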
Here is the timing of the solutions:
# Prints: 2.47 µs ± 8.17 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit numba_calculate(df['Day_Chg'].to_numpy())
# Prints: 42.6 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit numpy_calculate(df['Day_Chg'].to_numpy())
# Prints: 1.05 ms ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pandas_calculate(df)
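One caveat not reflected in the numbers above: Numba compiles numba_calculate on its first call with a given argument type, so that first call is much slower than the figures shown. A single warm-up call before benchmarking keeps the one-off compilation cost out of the measurement, for example:

day_chg = df['Day_Chg'].to_numpy()
numba_calculate(day_chg)           # warm-up: triggers the JIT compilation once
%timeit numba_calculate(day_chg)   # now times only the compiled machine code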
As you can see, the solution using Numba is around 20 times faster than numpy and around 400 times faster than looping with pandas.
Then, to assign the result back to the dataframe, you can simply do:
df[['Value_1', 'Value_2']] = numba_calculate(df['Day_Chg'].to_numpy())
print(df)
Date Ticker Close Day_Chg Value_1 Value_2
0 2019/12/02 AA 43.79 NaN 0.0000 0.0000
1 2019/12/03 AA 43.32 -0.0107 -0.0107 -0.0107
2 2019/12/04 AA 43.49 0.0039 -0.0068 -0.0282
3 2019/12/05 AA 43.48 -0.0002 -0.0284 -0.0634
4 2019/12/06 AA 43.88 0.0092 -0.0542 -0.1460
5 2019/12/09 AA 43.73 -0.0034 -0.1494 -0.3496
6 2019/12/02 BB 30.07 NaN NaN NaN
7 2019/12/03 BB 25.63 -0.1477 -0.1477 -0.1477
8 2019/12/04 BB 27.68 0.0800 -0.0677 -0.3631
9 2019/12/05 BB 23.58 -0.1481 -0.5112 -0.9420
10 2019/12/06 BB 22.63 -0.0403 -0.9823 -2.4355
11 2019/12/09 BB 19.69 -0.1299 -2.5654 -5.9832
Upvotes: 2