Rafa

Reputation: 71

Compute column from multiple previous rows in dataframes with conditionals

I'm starting to believe that pandas DataFrames are much less intuitive to handle than Excel, but I'm not giving up yet!

So, I'm just trying to check data in the same column but in (various) previous rows, using the .shift() method. I'm using the following DataFrame as an example, since the original is too complicated to copy in here, but the principle is the same.

import pandas as pd

counter = list(range(20))

df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False, False, False,
                False, False, False, False, False, False, False, False, False, True]

This produces the following DataFrame:

    Counter  Event
0         0   True
1         1  False
2         2  False
3         3  False
4         4  False
5         5  False
6         6   True
7         7  False
8         8  False
9         9  False
10       10  False
11       11  False
12       12  False
13       13  False
14       14  False
15       15  False
16       16  False
17       17  False
18       18  False
19       19   True

I'm trying to create sums of the Counter column, but only under the following conditions:

- the sum is computed for each row where Event is True,
- it adds up the Counter values of the preceding rows, going back at most 10 rows,
- it stops early when it reaches another row where Event is True.

To clarify my goal, this is the result I had in mind:

    Counter  Event  Sum
0         0   True    0
1         1  False    0
2         2  False    0
3         3  False    0
4         4  False    0
5         5  False    0
6         6   True   15
7         7  False    0
8         8  False    0
9         9  False    0
10       10  False    0
11       11  False    0
12       12  False    0
13       13  False    0
14       14  False    0
15       15  False    0
16       16  False    0
17       17  False    0
18       18  False    0
19       19   True  135

(For row 6 the sum is 1+2+3+4+5 = 15, and for row 19 it is 9+10+...+18 = 135.)

My attempt so far looks like this:

for index, row in df1.iterrows(): 
    if row["Event"] == True: 
        counter = 1 
        summ = 0 
        while counter < 10 and row["Event"].shift(counter) == False: 
            summ += row["Counter"].shift(counter)
            counter += 1
        else: 
            df1.at[index, "Sum"] = summ

I'm trying to first find Event == True and from there start iterating backwards with a counter, summing up the Counter values as I go. However, it seems to have a problem with shift:

AttributeError: 'bool' object has no attribute 'shift'

Please shatter my beliefs and show me that Excel isn't actually superior.

Upvotes: 1

Views: 80

Answers (2)

Michael Delgado

Reputation: 15432

Element-wise approach

You definitely can approach a task in pandas the way you would in Excel. Your approach just needs a tweak: pandas.Series.shift operates on a whole Series or array, not on a single value, so you can't use it to move back up the DataFrame relative to one row.
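
For instance, shift displaces an entire column at once; it isn't defined on the single boolean that iterrows hands you, which is exactly what the AttributeError is complaining about. A quick illustration on your example frame:

In [10]: df1["Event"].shift(1).head()
0      NaN
1     True
2    False
3    False
4    False
Name: Event, dtype: object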

The following loops through the indices of your DataFrame, walking back up to 10 rows for each Event:

def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''

    df["Sum"] = 0

    for index in range(df.shape[0]): 
        counter = 1 
        summ = 0 

        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):

                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break

                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break

                # increment by the counter
                summ += df.loc[index - backup, "Counter"]

            df.loc[index, "Sum"] = summ

    return df

This does the job:

In [15]: df1_sum1 = create_sum_column_loop(df1.copy())  # copy to preserve original
In [16]: df1_sum1
    Counter  Event  Sum
0         0   True    0
1         1  False    0
2         2  False    0
3         3  False    0
4         4  False    0
5         5  False    0
6         6   True   15
7         7  False    0
8         8  False    0
9         9  False    0
10       10  False    0
11       11  False    0
12       12  False    0
13       13  False    0
14       14  False    0
15       15  False    0
16       16  False    0
17       17  False    0
18       18  False    0
19       19   True  135

Better: vectorized operations

However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible and user-friendly (easy to read/write/learn), but slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code in other languages like C, C++, and Fortran. Under the hood, they're doing the same thing: df1.Counter.cumsum() does loop through the elements and build a running total, but it does it in C, making it lightning fast.
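
As a rough sketch of that gap (the function name and the timing comments here are illustrative, not measured), compare a hand-written running total with the compiled np.cumsum:

import numpy as np

arr = np.arange(1_000_000)

def running_total(values):
    """Pure-Python running total: every iteration pays interpreter overhead."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

# %timeit running_total(arr)  # interpreted loop: typically hundreds of ms
# %timeit np.cumsum(arr)      # same logic in compiled C: typically a few ms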

This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.

Borrowing the clever solution from @YOBEN_S:

def create_sum_column_vectorized(df):

    n = 10
    s = (
        df.Counter

        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())

        # compute the rolling sum within each group
        .rolling(n+1,min_periods=1).sum()

        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0,drop=True)

        # drop all non-event observations
        .where(df.Event)
    )

    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)

    return df
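
To see why the reversed cumsum makes a good group key, print it for the example frame (an illustrative check, sorted back into frame order for readability; not part of the original answer). Each Event row and the non-event rows leading up to it share a label, because reading bottom-up the running total only ticks up when it passes an Event:

In [20]: df1.Event.iloc[::-1].cumsum().sort_index()
0     3
1     2
2     2
3     2
4     2
5     2
6     2
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
Name: Event, dtype: int64

Rows 1-6 share label 2, so the rolling sum at row 6 covers rows 1-6; subtracting df.Counter afterwards removes row 6's own value, leaving 1+2+3+4+5 = 15.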

We can see that the result is the same as the one above, though the values come out as floats (the intermediate .where() fills non-event rows with NaN, forcing a float dtype) and the column is named sum rather than Sum:

In [23]: df1_sum2 = create_sum_column_vectorized(df1.copy())  # copy to preserve original
In [24]: df1_sum2
    Counter  Event    sum
0         0   True    0.0
1         1  False    0.0
2         2  False    0.0
3         3  False    0.0
4         4  False    0.0
5         5  False    0.0
6         6   True   15.0
7         7  False    0.0
8         8  False    0.0
9         9  False    0.0
10       10  False    0.0
11       11  False    0.0
12       12  False    0.0
13       13  False    0.0
14       14  False    0.0
15       15  False    0.0
16       16  False    0.0
17       17  False    0.0
18       18  False    0.0
19       19   True  135.0

The difference comes in the performance. In ipython or jupyter we can use the %timeit command to see how long a statement takes to run:

In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.

For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:

In [27]: df_big = pd.DataFrame({
    ...:     'Counter': np.arange(100000),
    ...:     'Event': np.random.random(size=100000) > 0.9,
    ...: })
    ...:

Now, you can really see the performance benefit of the vectorized approach:

In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.

Compiling your own workflows with numba

Note that for specific operations, it is possible to speed up operations further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:

import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0

        if event[index]:
            for backup in range(1, 11):

                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break

                # stop counting when we hit another event
                if event[index - backup]:
                    break

                # increment by the counter
                summ = summ + counter[index - backup]

            sum_col[index] = summ
    return sum_col

def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df

This beats both pandas and the for loop by a factor of more than 1000!

In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!

Upvotes: 1

BENY

Reputation: 323226

We need to create a subgroup key with cumsum, then do a rolling sum:

n=10
s=df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
      rolling(n+1,min_periods=1).sum().\
      reset_index(level=0,drop=True).where(df1.Event)
df1['sum']=(s-df1.Counter).fillna(0)
df1
    Counter  Event    sum
0         0   True    0.0
1         1  False    0.0
2         2  False    0.0
3         3  False    0.0
4         4  False    0.0
5         5  False    0.0
6         6   True   15.0
7         7  False    0.0
8         8  False    0.0
9         9  False    0.0
10       10  False    0.0
11       11  False    0.0
12       12  False    0.0
13       13  False    0.0
14       14  False    0.0
15       15  False    0.0
16       16  False    0.0
17       17  False    0.0
18       18  False    0.0
19       19   True  135.0

Upvotes: 4
