Reputation: 5686
I have written a python script that calculates winning streaks of sports match outcomes.
For example, the dataframe for player A looks like this:
time winner loser streak
1 A B 1
2 A C 2
3 A D 3
4 B A 0
5 A F 1
6 A G 2
7 H A 0
8 A X 1
9 A Y 2
10 A Z 3
The streak-column essentially contains a cumulative count of the wins, but it is reset to 0 when the respective player loses, because this ends the streak.
I now would like to output only streaks that are greater than 2, but obviously I want all matches that contributed to this streak.
In other words, the query would be like this: Provide all matches that contributed to a streak greater than 2
The result would look like this:
time winner loser streak
1 A B 1
2 A C 2
3 A D 3
8 A X 1
9 A Y 2
10 A Z 3
How can this be achieved with Pandas?
Upvotes: 0
Views: 136
Reputation: 13088
One solution would be to first detect the change points, the points where the streak goes to 0 and has been longer than $n$. You can do this with pct_change
and a threshold on the streak
column
streak_ends = np.where((df['streak'].pct_change() > 0) & (df['streak'] > 2))[0]
Then you just need to get the start point of the streak which is the end point minus the length of the streak
streaks = [slice(idx - df.loc[idx, 'streak'] + 1, idx + 1) for idx in streak_ends]
streaks
Out[86]: [slice(0, 3, None), slice(7, 10, None)]
df[streaks[0], ['winner', 'streak']]
Out[87]:
winner streak
7 A 1
8 A 2
9 A 3
UPDATE
Turns out itertools.grouper
does it this much better
import itertools
df['A wins'] = df.winner == 'A'
# rolling groupby using itertools
groups = [list(s) for i, s in itertools.groupby(df['A wins'])]
# filter out streaks that are shorter than the desired period
# itertools.chain is needed to unpack the nested groups
streaks = list(itertools.chain(*[g if len(g) > 2 else [False] * len(g)
for g in groups]))
df.loc[streaks, ['winner', 'streak']]
Out[83]:
winner streak
0 A 1
1 A 2
2 A 3
7 A 1
8 A 2
9 A 3
10 A 4
Upvotes: 1