Reputation: 43
I have the following dataframe:
                         A    B
2021-05-19 07:00:00    NaN  NaN
2021-05-19 07:30:00   0.00  NaN
2021-05-19 08:00:00   0.00  NaN
2021-05-19 08:30:00   0.00  NaN
2021-05-19 09:00:00  19.91  NaN
2021-05-19 09:30:00   0.11  NaN
2021-05-19 10:00:00   0.00  NaN
2021-05-19 10:30:00  22.99  NaN
2021-05-19 11:00:00   0.00  NaN
Require:
                         A     B
2021-05-19 07:00:00    NaN  0.00
2021-05-19 07:30:00   0.00  0.00
2021-05-19 08:00:00   0.00  0.00
2021-05-19 08:30:00   0.00  0.00
2021-05-19 09:00:00  19.91  3.32
2021-05-19 09:30:00   0.11  2.78
2021-05-19 10:00:00   0.00  2.32
2021-05-19 10:30:00  22.99  5.76
2021-05-19 11:00:00   0.00  4.80
Calculation for column B:
B1 = A1
B2 = ((B1*5)+A2)/6
B3 = ((B2*5)+A3)/6
B4 = ((B3*5)+A4)/6
etc.
I have already attempted this calculation in Python with the shift function, but it doesn't work. It would be great if someone could point me in the right direction.
Upvotes: 4
Views: 258
Reputation: 18306
With some math, we can turn this recursive formula into a geometric-series-like one:
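import numpy as np

df["B"] = (df.A
             .fillna(0)      # treat NaN as 0 so it does not propagate
             .expanding()    # s is the window A_1..A_N for each row N
             .apply(lambda s: (1/6)*(s * ((5/6) ** np.arange(len(s))[::-1])).sum()
                              + (5/6)**s.size*s.iloc[0]))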
It amounts to

N := window.size
B_N = (5/6)^(N-1) A_1 + (1/6) \sum_{j=2}^{N} (5/6)^(N-j) A_j

where window is the expanding window and corresponds to s in the code. In the code, however, we sum A_1 together with the other terms and thereby take only 1/6 of it, so we add back the remaining 5/6 of it; hence the (5/6)^N (rather than (5/6)^(N-1)) in front of it. The output is equivalent. We also turn the NaNs in A into 0 to prevent them from propagating.

The result is:
                         A         B
2021-05-19 07:00:00    NaN  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
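If you want to convince yourself that the closed form really matches the recursion from the question, a quick numerical check helps. The snippet below is only a sketch: the helper names recursive and closed_form are made up for illustration, and the sample column is typed in by hand with NaN replaced by 0, as in the answer.

```python
import numpy as np
import pandas as pd

# Column A from the question; NaN is treated as 0 (same assumption as the answer)
a = pd.Series([np.nan, 0.0, 0.0, 0.0, 19.91, 0.11, 0.0, 22.99, 0.0]).fillna(0)
vals = a.to_numpy()

def recursive(values):
    """Direct recursion: B_1 = A_1, B_k = (5 * B_(k-1) + A_k) / 6."""
    out = [values[0]]
    for x in values[1:]:
        out.append((out[-1] * 5 + x) / 6)
    return np.array(out)

def closed_form(values):
    """B_N = (5/6)^(N-1) A_1 + (1/6) * sum_{j=2..N} (5/6)^(N-j) A_j."""
    out = []
    for n in range(1, len(values) + 1):
        window = values[:n]
        # weights[j-1] = (5/6)^(N-j) for j = 1..N
        weights = (5 / 6) ** np.arange(n - 1, -1, -1)
        tail = (weights[1:] * window[1:]).sum() / 6
        out.append(weights[0] * window[0] + tail)
    return np.array(out)

rec = recursive(vals)
cf = closed_form(vals)
print(np.allclose(rec, cf))  # do both formulations agree?
```

The closed form recomputes a weighted sum for every row, so it does quadratic work overall; the recursion touches each element once.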
Upvotes: 2
Reputation: 71689
We can define a function fast_sum to perform the required calculation, and then use just-in-time compilation to compile it to machine code so that it runs at C-like speed:
import numba
import numpy as np

@numba.jit(nopython=True)
def fast_sum(a):
    b = np.zeros_like(a)
    b[0] = a[0]  # B1 = A1
    for i in range(1, len(a)):
        # each value depends on the previously computed one
        b[i] = (b[i - 1] * 5 + a[i]) / 6
    return b

df['B'] = fast_sum(df['A'].fillna(0).to_numpy())
                         A         B
2021-05-19 07:00:00   0.00  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
Performance test on a sample dataframe with 90000 rows:
df = pd.concat([df] * 10000, ignore_index=True)
%%timeit
df['B'] = fast_sum(df['A'].fillna(0).to_numpy())
# 1.62 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 2
Reputation: 23146
You can loop through the DataFrame and set column B row by row, as each value of B depends on its own previous value.
for i, date in enumerate(df.index):
    if i == 0:
        # first row has no previous B value, so start from 0
        df.at[date, "B"] = 0
    else:
        # B[i] = (5 * B[i-1] + A[i]) / 6
        df.at[date, "B"] = (df["B"].iat[i-1]*5 + df.at[date, "A"]) / 6
df
>>
                         A         B
2021-05-19 07:00:00   0.00  0.000000
2021-05-19 07:30:00   0.00  0.000000
2021-05-19 08:00:00   0.00  0.000000
2021-05-19 08:30:00   0.00  0.000000
2021-05-19 09:00:00  19.91  3.318333
2021-05-19 09:30:00   0.11  2.783611
2021-05-19 10:00:00   0.00  2.319676
2021-05-19 10:30:00  22.99  5.764730
2021-05-19 11:00:00   0.00  4.803942
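A variation on the same row-by-row idea, in case writing into the DataFrame one cell at a time feels slow: run the recursion over a plain Python list with itertools.accumulate and assign the whole column once. This is only a sketch. The frame is rebuilt by hand from the question's sample, and NaN in A is assumed to mean 0.

```python
from itertools import accumulate

import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (half-hourly index)
idx = pd.date_range("2021-05-19 07:00", periods=9, freq="30min")
df = pd.DataFrame(
    {"A": [np.nan, 0.0, 0.0, 0.0, 19.91, 0.11, 0.0, 22.99, 0.0]},
    index=idx,
)

# Assumption: NaN in A counts as 0
a = df["A"].fillna(0).tolist()

# accumulate yields a[0] unchanged (B1 = A1), then feeds each
# previous result back in: B[i] = (5 * B[i-1] + A[i]) / 6
df["B"] = list(accumulate(a, lambda prev, x: (prev * 5 + x) / 6))
print(df.round(2))
```

Since accumulate passes its first element through untouched, the B1 = A1 starting condition comes for free and no special case for the first row is needed.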
Upvotes: 2
Reputation: 4461
If you want to express the non-recursive function
B[i] = (A[i-1] * 5 + A[i])/6
then you are on the right track with shift:
B = ((A.shift(1)*5)+A)/6
B.iat[0] = A.iat[0]
However, if you want to express the recursive function
B[i] = (B[i-1] * 5 + A[i])/6
then shift alone cannot do it, because each value depends on the previously computed one. You can calculate it with a normal Python loop, as another answer points out.
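For this particular recursion, though, there may be a vectorized route after all. It has the same shape as an exponentially weighted mean with smoothing factor alpha = 1/6, and with adjust=False pandas documents ewm as computing y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]. The sketch below is an addition to the answers here, not something any of them proposes; it retypes the sample column by hand and assumes NaN should be treated as 0.

```python
import numpy as np
import pandas as pd

# Column A from the question
a = pd.Series([np.nan, 0.0, 0.0, 0.0, 19.91, 0.11, 0.0, 22.99, 0.0])

# alpha = 1/6 gives y[t] = (5/6) * y[t-1] + (1/6) * x[t],
# i.e. B[i] = (5 * B[i-1] + A[i]) / 6 with B[0] = A[0]
b = a.fillna(0).ewm(alpha=1/6, adjust=False).mean()

# Cross-check against the explicit loop
expected = [0.0]
for x in a.fillna(0).iloc[1:]:
    expected.append((expected[-1] * 5 + x) / 6)

print(np.allclose(b.to_numpy(), expected))
```

It is worth keeping the loop comparison around when you adapt this, since the equivalence depends on adjust=False and on the NaNs being filled before ewm is applied.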
Upvotes: 0