beta
beta

Reputation: 5686

Properly filter streak data in pandas

I have written a python script that calculates winning streaks of sports match outcomes.

For example, the dataframe for player A looks like this:

time  winner  loser  streak
1     A       B      1
2     A       C      2
3     A       D      3
4     B       A      0
5     A       F      1
6     A       G      2
7     H       A      0
8     A       X      1
9     A       Y      2
10    A       Z      3

The streak-column essentially contains a cumulative count of the wins, but it is reset to 0 when the respective player loses, because this ends the streak.

I now would like to output only streaks that are greater than 2, but obviously I want all matches that contributed to this streak.

In other words, the query would be like this: Provide all matches that contributed to a streak greater than 2

The result would look like this:

time  winner  loser  streak
1     A       B      1
2     A       C      2
3     A       D      3
8     A       X      1
9     A       Y      2
10    A       Z      3

How can this be achieved with Pandas?

Upvotes: 0

Views: 136

Answers (1)

Matti Lyra
Matti Lyra

Reputation: 13088

One solution would be to first detect the change points, the points where the streak goes to 0 and has been longer than $n$. You can do this with pct_change and a threshold on the streak column

streak_ends = np.where((df['streak'].pct_change() > 0) & (df['streak'] > 2))[0]

Then you just need to get the start point of the streak which is the end point minus the length of the streak

streaks = [slice(idx - df.loc[idx, 'streak'] + 1, idx + 1) for idx in streak_ends]
streaks
Out[86]: [slice(0, 3, None), slice(7, 10, None)]


df[streaks[0], ['winner', 'streak']]
Out[87]: 
  winner streak
7      A      1
8      A      2
9      A      3


UPDATE Turns out itertools.grouper does it this much better

import itertools
df['A wins'] = df.winner == 'A'
# rolling groupby using itertools
groups = [list(s) for i, s in itertools.groupby(df['A wins'])]

# filter out streaks that are shorter than the desired period
# itertools.chain is needed to unpack the nested groups
streaks = list(itertools.chain(*[g if len(g) > 2 else [False] * len(g)
                                 for g in groups]))
df.loc[streaks, ['winner', 'streak']]
Out[83]: 
   winner streak
0       A      1
1       A      2
2       A      3
7       A      1
8       A      2
9       A      3
10      A      4

Upvotes: 1

Related Questions