Reputation: 53
I have a DataFrame that consists of information about every NFL play that has occurred since 2009. My goal is to find out which teams had the most "big plays" in each season. To do this, I found all plays which gained over 20 yards, grouped them by year and team, and got the size of each of those group.
big_plays = (df[df['yards_gained'] >= 20]
.groupby([df['game_date'].dt.year, 'posteam'])
.size())
This results in the following Series:
game_date posteam
2009 ARI 55
ATL 51
BAL 55
BUF 37
CAR 52
CHI 58
CIN 51
CLE 31
DAL 68
DEN 42
DET 42
GB 65
HOU 63
IND 67
JAC 51
KC 44
MIA 34
MIN 64
NE 48
NO 72
NYG 69
NYJ 54
OAK 38
PHI 68
PIT 72
SD 71
SEA 45
SF 51
STL 42
TB 51
..
2018 BAL 44
BUF 55
CAR 64
CHI 66
CIN 69
CLE 70
DAL 51
DEN 59
DET 51
GB 63
HOU 53
IND 57
JAX 51
KC 88
LA 80
LAC 77
MIA 47
MIN 56
NE 64
NO 66
NYG 70
NYJ 49
OAK 63
PHI 54
PIT 66
SEA 62
SF 69
TB 73
TEN 51
WAS 46
Length: 323, dtype: int64
So far, this is exactly what I want. However, I am stuck on the next step. I want the n-largest values for each group in the MultiIndex, or the n-teams with the most number of "big plays" per season.
I have semi-successfully solved this task in a cumbersome way. If I groupby
the 0th level of the MultiIndex, then run the nlargest
function on that groupby, I get the following (truncated to the first two years for brevity):
big_plays.groupby(level=0).nlargest(5)
returns
game_date game_date posteam
2009 2009 NO 72
PIT 72
SD 71
NYG 69
DAL 68
2010 2010 PHI 81
NYG 78
PIT 78
SD 75
DEN 73
This (rather inelegantly) solves the problem, but I'm wondering how I can better achieve more or less the same results.
Upvotes: 5
Views: 1161
Reputation: 862691
In my opinion your code is nice, only a bit changed by group_keys=False
in Series.groupby
for avoid duplicated MultiIndex levels:
s = big_plays.groupby(level=0, group_keys=False).nlargest(5)
print (s)
game_date posteam
2009 NO 72
PIT 72
SD 71
NYG 69
DAL 68
2018 KC 88
LA 80
LAC 77
TB 73
CLE 70
Name: a, dtype: int64
df = big_plays.groupby(level=0, group_keys=False).nlargest(5).reset_index(name='count')
print (df)
game_date posteam count
0 2009 NO 72
1 2009 PIT 72
2 2009 SD 71
3 2009 NYG 69
4 2009 DAL 68
5 2018 KC 88
6 2018 LA 80
7 2018 LAC 77
8 2018 TB 73
9 2018 CLE 70
Alternative is more complicated:
df = (big_plays.reset_index(name='count')
.sort_values(['game_date','count'], ascending=[True, False])
.groupby('game_date')
.head(5))
print (df)
game_date posteam count
19 2009 NO 72
24 2009 PIT 72
25 2009 SD 71
20 2009 NYG 69
8 2009 DAL 68
43 2018 KC 88
44 2018 LA 80
45 2018 LAC 77
57 2018 TB 73
35 2018 CLE 70
Upvotes: 4