Reputation: 1349
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": [7, 2, 3], "b": [4, 5, 6], "c": [100, np.nan, np.NaN]})
df
Out[11]:
a b c
0 7 4 100.0
1 2 5 NaN
2 3 6 NaN
For the above Python pandas DataFrame, I would like to do for row number zero:
calculate the difference between column a
and b
. The result of this should be added to the contents of column c
and stored in a (new) column d
.
For row number one, in the first step, the contents of row zero / column d
should be stored in column c
.
Afterwards the same algorithm like before for row zero should be applied.
After performing this, the resulting DataFrame would look like:
a b c d
0 7 4 100.0 103.
1 2 5 103.0 100.
2 3 6 100.0 97.
In reality, the dataframe has much more rows than this small example. Therefore a fast computational speed is quite important.
How would a solution for computing this new dataframe look like?
Upvotes: 2
Views: 93
Reputation: 51345
You can get the cumulative sum of the difference between a
and b
, add that to your column c
initial value, and populate the rest of c
with your newly calculated d
, shifted down by 1:
df['d'] = df.a.sub(df.b).cumsum().add(df.c.iloc[0])
df.loc[1:,'c'] = df.d.shift()
>>> df
a b c d
0 7 4 100.0 103.0
1 2 5 103.0 100.0
2 3 6 100.0 97.0
Upvotes: 6