ar1847
ar1847

Reputation: 37

Speeding up pandas loop counter

So I'm currently updating a counter row by row in a pandas df and I don't think it's the fastest way of doing things.

Below is a simplified version of the df I'm using, I have two counters C1 counting each "S" category day and C2 counting each time it switches to "S" from "N" but not from "B" to "S" unless it is from "N" to "B" to "S".

This is done with for loops and if statements but is quite slow over 1.7 million rows. I've done a fair bit of searching and can't seem to find a more "pandas/numpy" approach.

Any ideas or pointers in the right direction would be greatly appreciated.

  Date  Category C1 C2
1/1/2015    N   0   0
1/2/2015    N   0   0
1/5/2015    S   1   1
1/5/2015    S   2   1
1/6/2015    S   3   1
1/6/2015    S   4   1
1/7/2015    N   4   1
1/7/2015    N   4   1
1/12/2015   N   4   1
1/12/2015   N   4   1
1/13/2015   N   4   1
1/13/2015   S   5   2
1/15/2015   S   6   2
1/15/2015   B   7   2
1/16/2015   S   8   2
1/16/2015   S   9   2
1/16/2015   N   8   2
1/21/2015   N   8   2
1/21/2015   S   9   3
1/22/2015   S   10  3

Upvotes: 0

Views: 62

Answers (2)

Seraph Wedd
Seraph Wedd

Reputation: 864

You can tackle it with only a single loop. The main focus of the loop is counting the S, say counter1. Inside the for loop, check if the previous one is an N, if so, add to counter2. Else, if the previous is B and the one before that is N, add to counter2.

For example, if you have a list of the category, you do:

C1, C2 = 0, 0
for i in range(len(category)):
    if category[i] == 'S':
        C1 += 1
        if category[max(0, i-1)] =='N':
            C2 += 1
        elif (category[max(0, i-1)]=='B') and (category[max(0, i-2)]=='N'):
            C2 += 1

With this, you can minimize the number of for loops. Hope it helps.

Upvotes: 0

asongtoruin
asongtoruin

Reputation: 10359

Generally, you want to use .cumsum() to accumulate values, and .shift() when you want to check previous or next values. Knowing that True is equivalent to 1 in mathematical expressions saves us a bit of time.

For the example values given, the following will work:

df['C1'] = (df['Category'] == 'S').cumsum()
df['C2'] = ((df['Category'] == 'S') & (df['Category'].shift(1) == 'N')).cumsum()

However, the part with

not from "B" to "S" unless it is from "N" to "B" to "S".

Makes the latter a little more complicated. However, it should be possible with a slightly more complicated set-up:

df['C2'] = (((df['Category'] == 'S') & (df['Category'].shift(1) == 'N')) |
            ((df['Category'] == 'S') & (df['Category'].shift(1) == 'B') & (df['Category'].shift(2) == 'N'))).cumsum()

This assumes the only "B to S" rows you want to count are those where row n-1 is N, row n is B and row n+1 is S. All other instances will not be included.

Upvotes: 2

Related Questions