Imran

Reputation: 656

Cumulative count reset on condition

I have a dataframe similar to this:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c'],
                   'col2': [1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
                   'col3': [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
                   'desired': [0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]})

I want to apply a rolling sum on col3 that resets whenever col1 or col2 changes, or whenever the previous value of col3 was zero.

Note that the count is offset by one cell, so the desired value on the first row of a new (col1, col2) combination is always zero. For example, rows 0-2 above all belong to ('a', 1): desired climbs [0, 1, 2] while the preceding col3 values are 1, then resets to 0 at row 3 because col3 was 0 on row 2.

The code below demonstrates the required logic. However, it takes nearly 4 minutes on the larger dataset linked at the end of the question.

des = []
count = 0
for i in range(1, len(df)):
    des.append(count)
    # extend the count only if the previous row has the same
    # (col1, col2) and its col3 was 1; otherwise reset to zero
    if (df.iloc[i-1].col1 == df.iloc[i].col1) and \
       (df.iloc[i-1].col2 == df.iloc[i].col2) and \
       (df.iloc[i-1].col3 == 1):
        count += 1
    else:
        count = 0
des.append(count)

df['desired'] = des
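Most of that time seems to be per-row .iloc overhead rather than the loop itself. For reference, here is the identical logic over plain numpy arrays (a minimal sketch, still a Python loop, not the vectorized solution I'm after):

# same logic as above, but indexing plain numpy arrays instead of .iloc
c1 = df['col1'].values
c2 = df['col2'].values
c3 = df['col3'].values

des = [0]
count = 0
for i in range(1, len(df)):
    if c1[i-1] == c1[i] and c2[i-1] == c2[i] and c3[i-1] == 1:
        count += 1
    else:
        count = 0
    des.append(count)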

A bigger dataset to test on: https://www.dropbox.com/s/hbafcq6hdkh4r9r/test.csv?dl=0

Upvotes: 4

Views: 4140

Answers (2)

Pierre D

Reputation: 26221

Since the OP wants a rolling count, that suggests they want to handle groups that may repeat at different places in the df, without lumping them all together (as a groupby would).

From the problem statement, it sounds more like a cumulative operation with resets. That class of problems is relatively easy to vectorize, and it doesn't involve groupby at all.

Here is a way using numpy:

import numpy as np

def cum_count(df, groupcols, zerocol):
    a = df[groupcols].values
    c = df[zerocol].values
    # indices where the cumulative count must restart: the group
    # columns change, or the previous value of zerocol was 0
    r0 = np.concatenate(([True], np.any(a[1:] != a[:-1], axis=1)))
    r1 = np.concatenate(([True], c[:-1] == 0))
    reset = np.nonzero(r0 | r1)[0]
    # offset: distance since the previous reset; subtracting it
    # drops the cumsum back to its starting level at each reset
    off = np.concatenate(([0], np.diff(reset)))
    # add 1 everywhere except at reset indices, where the run is cancelled
    delt = np.ones(df.shape[0], dtype=int)
    delt[reset] -= off

    return np.cumsum(delt) - 1
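To see how the reset/offset trick works, here is a trace on the example frame from the question (the values come from running the function step by step):

# index                 0 1 2  3 4 5 6 7 8 9 10 11 12
# r0 (group changed)  = T F F  F T F T F T T  F  T  F
# r1 (prev col3 == 0) = T F F  T T T T T F T  F  F  F
# reset = [0 3 4 5 6 7 8 9 11]          restart positions
# off   = [0 3 1 1 1 1 1 1 2]           distance since previous reset
# delt  = [1 1 1 -2 0 0 0 0 0 0 1 -1 1]
# np.cumsum(delt) - 1 = [0 1 2 0 0 0 0 0 0 0 1 0 1]   == desired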

And here is a corner-case example, where group ('a', 1) reappears after group ('b', 1):

df = pd.DataFrame([
    ['a', 1, 1, 0],
    ['a', 1, 1, 1],
    ['a', 1, 1, 2],
    ['b', 1, 1, 0],
    ['b', 1, 1, 1],
    ['b', 1, 1, 2],
    ['a', 1, 1, 0],
    ['a', 1, 1, 1],
    ['a', 1, 1, 2],
], columns='col1 col2 col3 desired'.split())

c = cum_count(df, 'col1 col2'.split(), 'col3')
assert np.all(c == df['desired'])
print(c)

# out
[0 1 2 0 1 2 0 1 2]
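For completeness, running it on the original frame from the question reproduces the desired column as well:

c = cum_count(df, ['col1', 'col2'], 'col3')
# [0 1 2 0 0 0 0 0 0 0 1 0 1]  == df['desired']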

Upvotes: 1

jezrael

Reputation: 862851

Use groupby with shift first, and then count consecutive 1s:

# within each (col1, col2) group, flag rows whose previous col3 was 1
a = df.groupby(['col1','col2'])['col3'].shift().fillna(0).eq(1)
b = a.cumsum()

# subtract b as it stood just before each run of True started,
# turning the global cumsum into a per-run consecutive count
df['desired'] = b - b.where(~a).ffill().fillna(0).astype(int)

print(df.head(20))
      col1  col2  col3  desired
0   100055     1     1        0
1   100055     1     0        1
2   100055     1     0        0
3   100055     1     0        0
4   100055     1     0        0
5   100055     1     0        0
6   100055     1     0        0
7   100055     1     0        0
8   100055     1     0        0
9   100055     1     0        0
10  100055     1     1        0
11  100055     1     1        1
12  100055     1     0        2
13  100055     1     1        0
14  100055     1     1        1
15  100055     1     0        2
16  100055     1     0        0
17  100055     1     1        0
18  100055     1     0        1
19  100055     1     1        0
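To see why the subtraction yields a per-run count, here is a trace on the small frame from the question (values from running the three lines above):

# a (prev-in-group col3 == 1)   = [F T T F F F F F F F T F T]
# b = a.cumsum()                = [0 1 2 2 2 2 2 2 2 2 3 3 4]
# b.where(~a).ffill().fillna(0) = [0 0 0 2 2 2 2 2 2 2 2 3 3]   b frozen at run starts
# b - frozen                    = [0 1 2 0 0 0 0 0 0 0 1 0 1]   == desired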

Upvotes: 6
