Vincent L
Vincent L

Reputation: 739

Calculate streak in pandas without apply

I have a DataFrame like this:

date       | type | column1
----------------------------
2019-01-01 |   A  |    1
2019-02-01 |   A  |    1
2019-03-01 |   A  |    1
2019-04-01 |   A  |    0
2019-05-01 |   A  |    1
2019-06-01 |   A  |    1
2019-07-01 |   B  |    1
2019-08-01 |   B  |    1
2019-09-01 |   B  |    0

I want to have a column called "streak" that has a streak, but grouped by column "type":

date       | type | column1 | streak
-------------------------------------
2019-01-01 |   A  |    1    |   1
2019-02-01 |   A  |    1    |   2
2019-03-01 |   A  |    1    |   3
2019-04-01 |   A  |    0    |   0
2019-05-01 |   A  |    1    |   1
2019-06-01 |   A  |    1    |   2
2019-07-01 |   B  |    1    |   1
2019-08-01 |   B  |    1    |   2
2019-09-01 |   B  |    0    |   0

I managed to do it like that:

def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)

But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.

Any ideas on how to optimize this for speed?

Upvotes: 3

Views: 122

Answers (2)

moys
moys

Reputation: 8033

IIUC, this is what you need.

m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] =df.groupby([m , 'type'])['column1'].cumsum()

Output

       date     type    column1     streak  
0   1/1/2019    A             1     1   
1   2/1/2019    A             1     2   
2   3/1/2019    A             1     3   
3   4/1/2019    A             0     0   
4   5/1/2019    A             1     1   
5   6/1/2019    A             1     2   
6   7/1/2019    B             1     1   
7   8/1/2019    B             1     2   
8   9/1/2019    B             0     0   

Upvotes: 4

ALollz
ALollz

Reputation: 59579

You want the cumsum of 'column1' grouping by 'type' + the cumsum of a Boolean Series which resets the grouping at every 0.

df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()

         date type  column1  streak
0  2019-01-01    A        1       1
1  2019-02-01    A        1       2
2  2019-03-01    A        1       3
3  2019-04-01    A        0       0
4  2019-05-01    A        1       1
5  2019-06-01    A        1       2
6  2019-07-01    B        1       1
7  2019-08-01    B        1       2
8  2019-09-01    B        0       0

Upvotes: 4

Related Questions