snatchysquid

Reputation: 1352

How do you apply a function on a dataframe column using data from previous rows?

I have a DataFrame with three columns: nums, which holds the values to work with; b, which is always either 1 or 0; and result, which is currently zero everywhere except in the first row (because we need an initial value to work with). The DataFrame looks like this:

   nums   b    result
0  20.0  1    20.0
1  22.0  0    0
2  30.0  1    0
3  29.1  1    0
4  20.0  0    0
...

The Problem

I'd like to go over each row in the DataFrame starting with the second row, do some calculation, and store the output in the result column. Since I'm working with large files, I need this operation to be fast, which is why I want something like apply.

The calculation takes the values of nums and result from the previous row. If b in the current row is 0, I want to (for example) add the previous row's nums and result; if b is 1, I'd like to subtract them instead.

What have I tried?

I tried using apply, but I couldn't access the previous row, and it seems that even if I do manage to access it, the DataFrame won't update the result column until apply has finished.

I also tried using a loop like this, but it's too slow for the large files I'm working with:

for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                           prev_num=df.loc[new_row, 'nums'],
                                           current_b=df.loc[row, 'b'])

some_calc_func looks like this (just a general example):

def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

Please answer with respect to some_calc_func

Upvotes: 5

Views: 791

Answers (5)

Tom

Reputation: 8790

I realize this is what @Prodipta's answer was getting at, but this approach uses the global keyword instead to remember the previous result each iteration of apply:

prev_result = 20

def my_calc(row):
    global prev_result
    i = int(row.name)   #the index of the current row
    if i==0:
        return prev_result   
    elif row['b'] == 1:
        out = prev_result * df.loc[i-1,'nums']/2   #loc to get prev_num
    else:
        out = df.loc[i-1,'nums'] + 17
    prev_result = out
    return out

df['result'] = df.apply(my_calc, axis=1)

Result for your example data:

   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1

And here's a speed test a la @Ben T's answer - not the best but not the worst?

In[0]
df = pd.DataFrame({'nums':np.random.randint(0,100,5000),'b':np.random.choice([0,1],5000)})

prev_result = 20

%%timeit
df['result'] = df.apply(my_calc, axis=1)

Out[0]
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

Prodipta Ghosh

Reputation: 529

You have a function f(...) to apply, but you cannot apply it directly because it needs to remember the previous row. You can do this with either a closure or a class. Below is a class implementation:

import pandas as pd

class Func():

    def __init__(self, value):
        self._prev = value
        self._init = True

    def __call__(self, x):
        if self._init:
            res = self._prev
            self._init = False
        elif x.b == 0:
            res = x.nums - self._prev
        else:
            res = x.nums + self._prev

        self._prev = res
        return res

#df = pd.read_clipboard()
f = Func(20)
df['result'] = df.apply(f, axis=1)

You can replace the body of __call__ with whatever logic you need from some_calc_func.
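For comparison, the closure variant mentioned above might look like the sketch below. It calls the question's some_calc_func and also remembers the previous row's nums. The names make_stepper and state are hypothetical, not part of the original answer. It assumes apply calls the function exactly once per row, in order; some older pandas releases could evaluate the first row twice, which would break any stateful function.

```python
import pandas as pd

def some_calc_func(prev_result, prev_num, current_b):
    # The example function from the question
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17

def make_stepper(first_result):
    """Build a row function for df.apply(axis=1) that remembers the previous row.

    Hypothetical helper for illustration only.
    """
    state = {"result": first_result, "num": None}

    def step(row):
        if state["num"] is None:
            # First row: keep the initial value as-is
            res = state["result"]
        else:
            res = some_calc_func(state["result"], state["num"], row["b"])
        # Remember this row's values for the next call
        state["result"] = res
        state["num"] = row["nums"]
        return res

    return step

df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0]})
df["result"] = df.apply(make_stepper(20.0), axis=1)
```

Because the state lives inside the closure, a fresh make_stepper(...) call is needed for every apply, which avoids the stale-global problem of module-level variables.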

Upvotes: 1

Ben.T

Reputation: 29635

If you want to keep some_calc_func and not use another library, avoid accessing individual DataFrame elements on every iteration. Instead, zip the columns nums and b with a one-row shift between them (since you need nums from the previous row), and keep prev_res in a variable from one iteration to the next. Also, append to a list instead of writing to the DataFrame, and assign the list to the column after the loop.

prev_res = df.loc[0, 'result']  # get the first result
l_res = [prev_res]  # initialize the list of results
# loop with zip to get both values at the same time;
# use loc to start b at the second row, but nums at the first
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    # use your function to calculate the new prev_res
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    # add to the list of results
    l_res.append(prev_res)
# assign to the column
df['result'] = l_res
print(df)  # same result as with your method
   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1

Now with a dataframe df of 5000 rows, I got:

%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, curren_b in zip(df['nums'], df.loc[1:, 'b']):
    prev_res = some_calc_func (prev_res, prev_num, curren_b)
    l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With your original solution, it was ~750x slower:

%%timeit 
for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'], prev_num=df.loc[new_row, 'nums'], \
                             current_b=df.loc[row, 'b'])
#3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
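As a side note, the zip loop can also be phrased with itertools.accumulate from the standard library, which carries the running result for you. This is only a sketch of an alternative spelling, not something I timed; the helper name step is hypothetical, and the initial keyword requires Python 3.8 or later.

```python
from itertools import accumulate

import pandas as pd

def some_calc_func(prev_result, prev_num, current_b):
    # The example function from the question
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17

df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0],
                   "result": [20.0, 0.0, 0.0, 0.0, 0.0]})

def step(prev_res, pair):
    # pair is (nums from the previous row, b from the current row)
    prev_num, current_b = pair
    return some_calc_func(prev_res, prev_num, current_b)

# zip stops at the shorter input, so the last nums value is never used
pairs = zip(df["nums"], df["b"].iloc[1:])
# initial= seeds the running value and is emitted as the first output
df["result"] = list(accumulate(pairs, step, initial=df.loc[0, "result"]))
```

The work is still a Python-level loop, so I would expect performance in the same range as the explicit zip version rather than a further speedup.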

EDIT: another option is the Numba library, provided some_calc_func can easily be compiled with the Numba decorator.

import numpy as np
from numba import jit
# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# create a function to do your job
# numba likes numpy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for results and initialize
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop on the length of arr_b
    for i in range(len(arr_b)):
        #do the calculation and set the value in result array
        prev_res = some_calc_func (prev_res, arr_nums[i], arr_b[i])
        arr_res[i+1] = prev_res
    return arr_res

Finally, call it like this:

df['result'] = with_numba(df.loc[0, 'result'], 
                          df['nums'].to_numpy(),  
                          df.loc[1:, 'b'].to_numpy())

With timeit, this is another ~9x faster than my zip method, and the speedup could grow with the size of the data:

%timeit df['result'] = with_numba(df.loc[0, 'result'], 
                                  df['nums'].to_numpy(),  
                                  df.loc[1:, 'b'].to_numpy()) 
# 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that using Numba might be problematic depending on what your actual some_calc_func does, since Numba only compiles a subset of Python and NumPy.

Upvotes: 8

Mak2006

Reputation: 171

Re-using your loop and some_calc_func

I am using your loop, reduced to the bare minimum:

for i in range(1, len(df)):
    df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])

and some_calc_func is implemented as below:

def some_calc_func(bval, prev_result, curr_num):
    if bval == 0:
        return prev_result + curr_num
    else:
        return prev_result - curr_num

The result is as below

   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9

Upvotes: 0

Sayandip Dutta

Reputation: 15872

IIUC:

>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
                    ).fillna(df.result).cumsum()

>>> df
   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9

Explanation:

# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1    1
2   -1
3   -1
4    1
Name: b, dtype: int64

# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0     NaN
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64

# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0    20.0
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64

# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0    20.0
1    42.0
2    12.0
3   -17.1
4     2.9
dtype: float64
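The same steps can be sketched with plain NumPy arrays. This variant is an assumption on my part rather than the one-liner above: it picks the sign with np.where on b for every row and then overwrites the first term with the initial result, instead of masking on result == 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0],
                   "result": [20.0, 0.0, 0.0, 0.0, 0.0]})

# +1 where b == 0 (add), -1 where b == 1 (subtract)
signs = np.where(df["b"].to_numpy() == 0, 1.0, -1.0)
# signed contribution of each row; the product is a new, writable array
terms = signs * df["nums"].to_numpy()
# the first row contributes the initial result instead of a signed nums
terms[0] = df.loc[0, "result"]
df["result"] = terms.cumsum()
```

This only works because the add/subtract rule is a running sum; it does not apply to a rule such as the question's some_calc_func, where the previous result is multiplied.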

Given the requirement in your comments, I cannot think of a way to do this without a loop:

c1, c2 = 2, 1
l = [df.loc[0, 'result']]            # store the first result in a list

# then loop over the series (df.b * df.nums)

for i, val in (df.b * df.nums).items():   # Series.iteritems() was removed in pandas 2.0
    if i:                            # except for 0th index
        if val == 0:                 # (df.b * df.nums) == 0 if df.b == 0
            l.append(l[-1])          # append the last result
        else:                        # otherwise apply the rule
            t = l[-1] *c2 + val * c1
            l.append(t)

>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]

>>> df['result'] = l

   nums  b  result
0  20.0  1    20.0
1  22.0  0    20.0
2  30.0  1    80.0   # [ 20 * 1 +   30 * 2]
3  29.1  1   138.2   # [ 80 * 1 + 29.1 * 2]
4  20.0  0   138.2

This seems fast enough, though I did not test it on a large sample.
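One special case is worth noting: with c2 == 1, as in the values used here, every step just adds b * nums * c1 to the previous result (adding zero when b is 0), so the loop collapses to a cumulative sum. The sketch below relies on that assumption and does not generalize to other values of c2.

```python
import pandas as pd

df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0],
                   "b": [1, 0, 1, 1, 0],
                   "result": [20.0, 0.0, 0.0, 0.0, 0.0]})

c1 = 2  # this shortcut assumes c2 == 1

# per-row increment: nums * c1 where b == 1, zero where b == 0
increments = (df["b"] * df["nums"] * c1).to_numpy(dtype=float, copy=True)
# the first row holds the initial result rather than an increment
increments[0] = df.loc[0, "result"]
df["result"] = increments.cumsum()
```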

Upvotes: 2
