Calculate streak in pandas without apply

Question

I have a DataFrame like this:

date       | type | column1
----------------------------
2019-01-01 |   A  |    1
2019-02-01 |   A  |    1
2019-03-01 |   A  |    1
2019-04-01 |   A  |    0
2019-05-01 |   A  |    1
2019-06-01 |   A  |    1
2019-07-01 |   B  |    1
2019-08-01 |   B  |    1
2019-09-01 |   B  |    0

I want to add a column called "streak" that counts consecutive 1s in column1, resetting to 0 at every 0 and starting over for each "type":

date       | type | column1 | streak
-------------------------------------
2019-01-01 |   A  |    1    |   1
2019-02-01 |   A  |    1    |   2
2019-03-01 |   A  |    1    |   3
2019-04-01 |   A  |    0    |   0
2019-05-01 |   A  |    1    |   1
2019-06-01 |   A  |    1    |   2
2019-07-01 |   B  |    1    |   1
2019-08-01 |   B  |    1    |   2
2019-09-01 |   B  |    0    |   0

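I managed to do it like this:

def streak(df):
    # Start a new run id whenever column1 changes value.
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    # Running total within each run: counts up over 1s, stays 0 over 0s.
    df['streak'] = df.groupby(grouper)['column1'].cumsum()
    return df

# Apply the helper separately to each type.
df = df.groupby(['type']).apply(streak)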

But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.

Any ideas on how to optimize this for speed?

ALollz · Accepted Answer

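You want the cumsum of 'column1', grouped by two keys: 'type', and the cumsum of a Boolean Series that is True wherever 'column1' is 0. The second key increases by one at every 0, so each 0 starts a new group and the running total resets there.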
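
To see how that second key behaves, here is a minimal sketch on the sample data from the question. The name reset_id is only illustrative and does not appear in the original code.

```python
import pandas as pd

# Sample data from the question (the date column is not needed here).
df = pd.DataFrame({
    "type": list("AAAAAABBB"),
    "column1": [1, 1, 1, 0, 1, 1, 1, 1, 0],
})

# True at every 0; the running sum of that flag increases by one at each 0,
# so every 0 opens a new block that also holds the 1s that follow it.
reset_id = df["column1"].eq(0).cumsum().rename("reset_id")
print(reset_id.tolist())

# Pairing the counter with 'type' also splits a block that spans two types.
n_groups = df.groupby(["type", reset_id]).ngroups
print(n_groups)
```

On this data the counter is 0, 0, 0, 1, 1, 1, 1, 1, 2, and combining it with 'type' gives four groups, one per streak segment.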

df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()

         date type  column1  streak
0  2019-01-01    A        1       1
1  2019-02-01    A        1       2
2  2019-03-01    A        1       3
3  2019-04-01    A        0       0
4  2019-05-01    A        1       1
5  2019-06-01    A        1       2
6  2019-07-01    B        1       1
7  2019-08-01    B        1       2
8  2019-09-01    B        0       0
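
If even the single groupby is too slow, a variant with no groupby at all seems possible, but only under assumptions the question does not state: column1 contains only 0 and 1, and the rows of each type are contiguous and already in date order. This is a different technique from the line above (a global cumsum minus a forward-filled baseline), offered as an unbenchmarked sketch; the names total, restart and baseline are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=9, freq="MS"),
    "type": list("AAAAAABBB"),
    "column1": [1, 1, 1, 0, 1, 1, 1, 1, 0],
})

# Reference result: the grouped cumsum from the answer.
reset_id = df["column1"].eq(0).cumsum().rename("reset_id")
expected = df.groupby(["type", reset_id])["column1"].cumsum()

# Groupby-free variant (assumes 0/1 values and contiguous, ordered types).
total = df["column1"].cumsum()

# A streak restarts at every 0 and at the first row of each type.
restart = df["column1"].eq(0) | df["type"].ne(df["type"].shift())

# Baseline = running total just before the restart row, carried forward.
# The first row is always a restart, so the forward fill leaves no NaN.
baseline = (total - df["column1"]).where(restart).ffill()

df["streak"] = (total - baseline).astype("int64")
print(df)
```

On the sample data this gives the same streak column as the grouped version. Whether it is actually faster on 100M rows would need to be measured, and it gives wrong results if types are interleaved or column1 holds values other than 0 and 1.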
