Speeding up pandas loop counter

Question

So I'm currently updating a counter row by row in a pandas df and I don't think it's the fastest way of doing things.

Below is a simplified version of the df I'm using, I have two counters C1 counting each "S" category day and C2 counting each time it switches to "S" from "N" but not from "B" to "S" unless it is from "N" to "B" to "S".

This is done with for loops and if statements but is quite slow over 1.7 million rows. I've done a fair bit of searching and can't seem to find a more "pandas/numpy" approach.

Any ideas or pointers in the right direction would be greatly appreciated.

  Date  Category C1 C2
1/1/2015    N   0   0
1/2/2015    N   0   0
1/5/2015    S   1   1
1/5/2015    S   2   1
1/6/2015    S   3   1
1/6/2015    S   4   1
1/7/2015    N   4   1
1/7/2015    N   4   1
1/12/2015   N   4   1
1/12/2015   N   4   1
1/13/2015   N   4   1
1/13/2015   S   5   2
1/15/2015   S   6   2
1/15/2015   B   7   2
1/16/2015   S   8   2
1/16/2015   S   9   2
1/16/2015   N   8   2
1/21/2015   N   8   2
1/21/2015   S   9   3
1/22/2015   S   10  3

asongtoruin · Accepted Answer

Generally, you want to use .cumsum() to accumulate values, and .shift() when you want to check previous or next values. Knowing that True is equivalent to 1 in mathematical expressions saves us a bit of time.

For the example values given, the following will work:

df['C1'] = (df['Category'] == 'S').cumsum()
df['C2'] = ((df['Category'] == 'S') & (df['Category'].shift(1) == 'N')).cumsum()

However, the part with

not from "B" to "S" unless it is from "N" to "B" to "S".

Makes the latter a little more complicated. However, it should be possible with a slightly more complicated set-up:

df['C2'] = (((df['Category'] == 'S') & (df['Category'].shift(1) == 'N')) |
            ((df['Category'] == 'S') & (df['Category'].shift(1) == 'B') & (df['Category'].shift(2) == 'N'))).cumsum()

This assumes the only "B to S" rows you want to count are those where row n-1 is N, row n is B and row n+1 is S. All other instances will not be included.

Speeding up pandas loop counter

Answers (2)

Related Questions