Reputation: 37
So I'm currently updating a counter row by row in a pandas df and I don't think it's the fastest way of doing things.
Below is a simplified version of the df I'm using, I have two counters C1 counting each "S" category day and C2 counting each time it switches to "S" from "N" but not from "B" to "S" unless it is from "N" to "B" to "S".
This is done with for loops and if statements but is quite slow over 1.7 million rows. I've done a fair bit of searching and can't seem to find a more "pandas/numpy" approach.
Any ideas or pointers in the right direction would be greatly appreciated.
Date Category C1 C2
1/1/2015 N 0 0
1/2/2015 N 0 0
1/5/2015 S 1 1
1/5/2015 S 2 1
1/6/2015 S 3 1
1/6/2015 S 4 1
1/7/2015 N 4 1
1/7/2015 N 4 1
1/12/2015 N 4 1
1/12/2015 N 4 1
1/13/2015 N 4 1
1/13/2015 S 5 2
1/15/2015 S 6 2
1/15/2015 B 7 2
1/16/2015 S 8 2
1/16/2015 S 9 2
1/16/2015 N 8 2
1/21/2015 N 8 2
1/21/2015 S 9 3
1/22/2015 S 10 3
Upvotes: 0
Views: 62
Reputation: 864
You can tackle it with only a single loop. The main focus of the loop is counting the S, say counter1. Inside the for loop, check if the previous one is an N, if so, add to counter2. Else, if the previous is B and the one before that is N, add to counter2.
For example, if you have a list of the category, you do:
C1, C2 = 0, 0
for i in range(len(category)):
if category[i] == 'S':
C1 += 1
if category[max(0, i-1)] =='N':
C2 += 1
elif (category[max(0, i-1)]=='B') and (category[max(0, i-2)]=='N'):
C2 += 1
With this, you can minimize the number of for loops. Hope it helps.
Upvotes: 0
Reputation: 10359
Generally, you want to use .cumsum()
to accumulate values, and .shift()
when you want to check previous or next values. Knowing that True
is equivalent to 1
in mathematical expressions saves us a bit of time.
For the example values given, the following will work:
df['C1'] = (df['Category'] == 'S').cumsum()
df['C2'] = ((df['Category'] == 'S') & (df['Category'].shift(1) == 'N')).cumsum()
However, the part with
not from "B" to "S" unless it is from "N" to "B" to "S".
Makes the latter a little more complicated. However, it should be possible with a slightly more complicated set-up:
df['C2'] = (((df['Category'] == 'S') & (df['Category'].shift(1) == 'N')) |
((df['Category'] == 'S') & (df['Category'].shift(1) == 'B') & (df['Category'].shift(2) == 'N'))).cumsum()
This assumes the only "B to S" rows you want to count are those where row n-1 is N, row n is B and row n+1 is S. All other instances will not be included.
Upvotes: 2