Reputation: 33
I am in the process of moving from Excel to Python. I need to calculate multiple, mostly interdependent data columns. I can do this in Python with a for-loop, but it is incredibly inefficient and takes unreasonably long. From the reading I've done, it is clear I need to vectorise the calculation. As a simplified example, please see the table below, taken from Excel. The grey columns hold the static data, and the two yellow columns (Value_1 and Value_2) hold the formulas that are simply dragged down.
The for-loop I've used:
n = len(df)  # number of rows
for i in range(1, n):
    df.loc[i, 'Value_1'] = df.loc[i-1, 'Value_2'] + df.loc[i, 'Day_Chg']
    df.loc[i, 'Value_2'] = df.loc[i, 'Value_1'] + df.loc[i-1, 'Value_1'] + df.loc[i-1, 'Value_2']
Date | Ticker | Close | Day_Chg | Value_1 | Value_2 |
---|---|---|---|---|---|
2019/12/02 | AA | 43.79 | | | |
2019/12/03 | AA | 43.32 | -1.07% | -1.07% | -1.07% |
2019/12/04 | AA | 43.49 | 0.39% | -0.68% | -2.83% |
2019/12/05 | AA | 43.48 | -0.02% | -2.85% | -6.36% |
2019/12/06 | AA | 43.88 | 0.92% | -5.44% | -14.65% |
2019/12/09 | AA | 43.73 | -0.34% | -14.99% | -35.08% |
2019/12/02 | BB | 30.07 | | | |
2019/12/03 | BB | 25.63 | -14.77% | -14.77% | -14.77% |
2019/12/04 | BB | 27.68 | 8.00% | -6.77% | -36.30% |
2019/12/05 | BB | 23.58 | -14.81% | -51.11% | -94.18% |
2019/12/06 | BB | 22.63 | -4.03% | -98.20% | -243.49% |
2019/12/09 | BB | 19.69 | -12.99% | -256.48% | -598.18% |
The formula in column "Value_1" adds the previous day value from column "Value_2" to the current day value from column "Day_Chg".
The formula in column "Value_2" adds the previous day values from both columns "Value_1" and "Value_2" to the current day value from column "Value_1".
I would highly value any advice. Thank you in advance.
Upvotes: 1
Views: 158
Reputation: 4407
EDIT after discussion in the comments.
Unfortunately, because your columns are interdependent and also depend on previous rows, it is not possible to vectorise the computation of the two columns. Hence, iterating over the rows one by one, much as you proposed in your question, is the only way. Any alternative that avoids writing an explicit loop would still use a loop under the hood.
Since you do not use indices or other features of Pandas, you can try using numpy arrays instead of dataframes to speed things up.
To make the code significantly faster, you could optimise it with Numba, which translates Python code into machine code just-in-time. An example is this answer to a question similar to yours. I would love to help you with this now, but at the moment I don't have access to a machine with Numba. I will edit my answer in the next few days; maybe, in the meantime, someone else will also answer in this regard.
EDIT2: As promised, here is the solution with Numba, compared with a solution using numpy and one using pandas:
import numpy as np
import numba

@numba.jit(nopython=True)
def numba_calculate(day_chg):
    def fixna(x):
        return 0 if np.isnan(x) else x
    # The result array will have Value_1 and Value_2 in columns 0 and 1, respectively.
    result = np.zeros((len(day_chg), 2))
    for i in range(1, len(day_chg)):
        result[i, 0] = fixna(result[i-1, 1]) + day_chg[i]
        result[i, 1] = fixna(result[i-1, 0]) + fixna(result[i-1, 1]) + result[i, 0]
    return result
# Exact same function as numba_calculate, but without the Numba decorator.
def numpy_calculate(day_chg):
    def fixna(x):
        return 0 if np.isnan(x) else x
    # The result array will have Value_1 and Value_2 in columns 0 and 1, respectively.
    result = np.zeros((len(day_chg), 2))
    for i in range(1, len(day_chg)):
        result[i, 0] = fixna(result[i-1, 1]) + day_chg[i]
        result[i, 1] = fixna(result[i-1, 0]) + fixna(result[i-1, 1]) + result[i, 0]
    return result
def pandas_calculate(df2):
    def fixna(x):
        return 0 if np.isnan(x) else x
    for prev, curr in zip(df2.index, df2.index[1:]):
        df2.loc[curr, 'Value_1'] = fixna(df2.loc[prev, 'Value_2']) + df2.loc[curr, 'Day_Chg']
        df2.loc[curr, 'Value_2'] = df2.loc[curr, 'Value_1'] + fixna(df2.loc[prev, 'Value_1']) + fixna(df2.loc[prev, 'Value_2'])
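The timings below run on the data from your question. For completeness, here is a minimal sketch of how that frame could be built (assuming Day_Chg is the per-ticker percentage change of Close; the two Value columns are pre-created as NaN because pandas_calculate reads them from the previous row):

import pandas as pd

# Data from the question; AA rows first, then BB, matching the printed output further below.
df = pd.DataFrame({
    'Date': ['2019/12/02', '2019/12/03', '2019/12/04', '2019/12/05', '2019/12/06', '2019/12/09'] * 2,
    'Ticker': ['AA'] * 6 + ['BB'] * 6,
    'Close': [43.79, 43.32, 43.49, 43.48, 43.88, 43.73,
              30.07, 25.63, 27.68, 23.58, 22.63, 19.69],
})
df['Day_Chg'] = df.groupby('Ticker')['Close'].pct_change()  # assumed: day-over-day change of Close
df['Value_1'] = np.nan  # pandas_calculate expects these columns to exist already
df['Value_2'] = np.nan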
Here is the timing of the solutions:
# Prints: 2.47 µs ± 8.17 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit numba_calculate(df['Day_Chg'].to_numpy())
# Prints: 42.6 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit numpy_calculate(df['Day_Chg'].to_numpy())
# Prints: 1.05 ms ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pandas_calculate(df)
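One caveat not reflected in the numbers above: Numba compiles numba_calculate on its first call with a given argument type, so that first call is much slower than the figures shown. A single warm-up call before benchmarking keeps the one-off compilation cost out of the measurement, for example:

day_chg = df['Day_Chg'].to_numpy()
numba_calculate(day_chg)           # warm-up: triggers the JIT compilation once
%timeit numba_calculate(day_chg)   # now times only the compiled machine code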
As you can see, the solution using Numba is around 20 times faster than numpy and around 400 times faster than looping with pandas.
Then, to assign the result back to the dataframe, you can simply do:
df[['Value_1', 'Value_2']] = numba_calculate(df['Day_Chg'].to_numpy())
print(df)
Date Ticker Close Day_Chg Value_1 Value_2
0 2019/12/02 AA 43.79 NaN 0.0000 0.0000
1 2019/12/03 AA 43.32 -0.0107 -0.0107 -0.0107
2 2019/12/04 AA 43.49 0.0039 -0.0068 -0.0282
3 2019/12/05 AA 43.48 -0.0002 -0.0284 -0.0634
4 2019/12/06 AA 43.88 0.0092 -0.0542 -0.1460
5 2019/12/09 AA 43.73 -0.0034 -0.1494 -0.3496
6 2019/12/02 BB 30.07 NaN NaN NaN
7 2019/12/03 BB 25.63 -0.1477 -0.1477 -0.1477
8 2019/12/04 BB 27.68 0.0800 -0.0677 -0.3631
9 2019/12/05 BB 23.58 -0.1481 -0.5112 -0.9420
10 2019/12/06 BB 22.63 -0.0403 -0.9823 -2.4355
11 2019/12/09 BB 19.69 -0.1299 -2.5654 -5.9832
Upvotes: 2