Reputation: 470
Here is a sport df where,
ID
is match ID(3 mathes here)
HG
and AG
Home/Away goals scored
Period
is the stage of the game, first quarter, second quarter,...
d = {'ID':[121,121,121,121,121,121,343,343,343,343,343,343,343,343,678,678,678,678,678,678,678], 'Period':['1Q','1Q','2Q','2Q','3Q','3Q','1Q','1Q','2Q','2Q',
'3Q','3Q','4Q','4Q','1Q','1Q','2Q','2Q','3Q','3Q','2Q'],'HG':[0,1,2,3,3,3,0,0,1,2,3,4,4,5,0,1,2,2,3,4,5],
'AG':[0,0,1,2,2,3,0,1,1,2,3,3,4,4,1,2,2,2,3,3,4]}
df = pd.DataFrame(d)
I want to study the trajectory of goals scored within a sliced portion of df by simply taking the averages of goal score by either team during the selected time portion.
So I split df by row index and apply some function on it.
Let,
a = df.groupby("ID", group_keys=False).apply(lambda g: g.iloc[:2]).reset_index(drop=True)
b = df.groupby("ID", group_keys=False).apply(lambda g: g.iloc[2:5]).reset_index(drop=True)
c = df.groupby("ID", group_keys=False).apply(lambda g: g.iloc[5:]).reset_index(drop=True)
To compute the average goal per each row (rows are some equal and fixed time interval) within each matches, where the length of matches are different, I have to sum the rows and then divide it by all df.ID.nunique()
.
def Goal_Avg(data):
for i, row in data.iterrows():
return data.loc[:, ['HG','AG']].sum()
Then append the new average goal per row columns(for Home/Away) to df. Here is how I tried,
First, I created a new row count for ID
in df
and a, b and c
as follow,
a['idx'] = a.groupby(['ID']).cumcount()
and
df['IDX'] = a.groupby(['ID']).cumcount()
Then, create new columns with row sum for each ID and eventually obatin the desire HG_Avg
and AG_Avg
columns for each match. Which will be the same for same size matches.
a_sum = a[['HG','AG']].groupby(a['idx']).apply(g_per)
a_sum.rename(columns={'HG':'HG_sum','AG':'AG_sum'}, inplace=True)
a_sum['HG_Avg'] = a_sum['HG_Sum'] / df.ID.nunique()
a_sum['AG_Avg'] = a_sum['AG_Sum'] / df.ID.nunique()
My problems are
1- Even after creating the Avg columns for a, b and c, I still couldnt concat it to original df.
2- The above way looks extremely laborious. Mind you, my data consistes much more than 3 matches.
Here is the final output
AG HG ID Period HG_Avg AG_Avg IDX
0 0 0 121 1Q 0.000 0.333 0
1 0 1 121 1Q 0.667 1.000 1
2 1 2 121 2Q 1.667 1.333 2
3 2 3 121 2Q 2.333 2.000 3
4 2 3 121 3Q 3.000 2.667 4
5 3 3 121 3Q 3.667 3.000 5
6 0 0 343 1Q 0.000 0.333 0
7 1 0 343 1Q 0.667 1.000 1
8 1 1 343 2Q 1.667 1.333 2
9 2 2 343 2Q 2.333 2.000 3
10 3 3 343 3Q 3.000 2.667 4
11 3 4 343 3Q 3.667 3.000 5
12 4 4 343 4Q 3.000 2.667 6
13 4 5 343 4Q 1.667 1.333 7
14 1 0 678 1Q 0.000 0.333 0
15 2 1 678 1Q 0.667 1.000 1
16 2 2 678 2Q 1.667 1.333 2
17 2 2 678 2Q 2.333 2.000 3
18 3 3 678 3Q 3.000 2.667 4
19 3 4 678 3Q 3.667 3.000 5
20 4 5 678 2Q 3.000 2.667 6
Upvotes: 1
Views: 71
Reputation: 323316
First create the additional key by using cumcount
and cut
, then you have two key for groupby
df['NewKey']=pd.cut(df.groupby('ID').cumcount(),[-1,2,5,np.inf])
df.groupby(['ID','NewKey']).apply(yourfunc here)
Upvotes: 1