Reputation: 103
Say I have a dataframe with a column of numbers. I am using df.apply() to modify this column, and the function I am applying takes the number in this column as an argument, so its output depends on the "state" of the column at the time it is applied. The function has to know the value in row (n-1) in order to produce the number for row n.
How can I make the function aware of its most recent output, given that this output is one of the arguments it needs to generate the number for the next row of the dataframe? My idea was to have the output of the function set as the value not just of the row being iterated on, but also of all the rows below it. How can I do this? Is there an easier way that I am not seeing?
Upvotes: 1
Views: 188
Reputation: 12503
I can think of (at least) two ways to achieve what you're looking for. The first one is using apply with a stateful operator, like this:
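import pandas as pd

df = pd.DataFrame({"a": range(0, 10), "b": range(10, 20)})

class StatefulOp:
    """Callable that remembers its previous output between calls."""
    def __init__(self):
        self._last = 0

    def __call__(self, num):
        # New output = previous output + current value
        res = self._last + num
        self._last = res
        return res

op = StatefulOp()
df.a.apply(op)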
The result is:
0 0
1 1
2 3
3 6
4 10
5 15
6 21
7 28
8 36
9 45
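For this particular operation, a running sum, the stateful operator can be cross-checked against pandas' built-in Series.cumsum(). The sketch below is only a sanity check under that assumption. It repeats the class so that it runs on its own, and it relies on apply calling the operator once per element, in row order. A recurrence that is not a plain sum may not have a built-in equivalent.

```python
import pandas as pd

df = pd.DataFrame({"a": range(0, 10), "b": range(10, 20)})

class StatefulOp:
    """Callable that remembers its previous output between calls."""
    def __init__(self):
        self._last = 0

    def __call__(self, num):
        # New output = previous output + current value
        self._last = self._last + num
        return self._last

# Use a fresh operator instance so no state leaks in from earlier calls
via_apply = df["a"].apply(StatefulOp())

# Built-in running sum, for comparison
via_cumsum = df["a"].cumsum()

# True if both approaches agree on every row
print((via_apply == via_cumsum).all())
```

Note that a new StatefulOp() is created for the apply call. Reusing an instance that has already been applied would carry its old state over into the next run.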
The second way is to avoid using apply in the first place and instead use iterrows (or some other way to iterate over the rows in the data frame). For example:
last_val = 0
res_array = []
# iterrows() yields (index, row) pairs; only the row is needed here
for _, row in df.iterrows():
    res = last_val + row["a"]
    last_val = res
    res_array.append(res)

df["new_a"] = res_array
print(df)
The result is:
   a   b  new_a
0  0  10      0
1  1  11      1
2  2  12      3
3  3  13      6
4  4  14     10
5  5  15     15
6  6  16     21
7  7  17     28
8  8  18     36
9  9  19     45
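As one more "other way to iterate", the standard library's itertools.accumulate can carry the previous output forward without an explicit loop. This is a minimal sketch, not part of the approaches above: the function name step is hypothetical, and it uses the same running-sum rule as the examples so the output matches. Any function of (previous output, current value) could be substituted. The initial keyword requires Python 3.8 or later.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({"a": range(0, 10), "b": range(10, 20)})

def step(prev, x):
    # Hypothetical recurrence: next output from previous output and current value
    return prev + x

# initial=0 seeds the "previous output" for the first row;
# accumulate emits that seed first, so drop it with [1:]
values = list(accumulate(df["a"], step, initial=0))[1:]

df["new_a"] = values
print(df)
```

The seed plays the same role as last_val = 0 in the loop version, so changing it changes the starting state of the recurrence.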
Upvotes: 1